Question

Asked: 2023-10-30 21:33:12 +0800 CST

Limit the results of a complex query to the most common values of a given attribute

Say you have the following tables for a multi-store platform:

CREATE TABLE orders (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  store_id BIGINT NOT NULL,
  ordered_at TIMESTAMPTZ NOT NULL
);

CREATE INDEX ON orders (store_id);

-- items must exist before order_lines can reference it
CREATE TABLE items (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  store_id BIGINT NOT NULL,
  color VARCHAR(255) NOT NULL,
  size VARCHAR(255) NOT NULL,
  category VARCHAR(255)
);

CREATE TABLE order_lines (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  order_id BIGINT NOT NULL REFERENCES orders (id) ON DELETE CASCADE,
  item_id BIGINT NOT NULL REFERENCES items (id),
  quantity INT
);

CREATE INDEX ON order_lines (order_id);
CREATE INDEX ON order_lines (item_id);

CREATE TABLE returns (
  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  order_line_id BIGINT NOT NULL REFERENCES order_lines (id) ON DELETE CASCADE,
  returned_at TIMESTAMPTZ NOT NULL,
  quantity INT NOT NULL,
  reason VARCHAR(255)
);

CREATE INDEX ON returns (order_line_id);

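From this, I want to get a list of all orders within a date range and calculate some metrics, such as how many items were bought, how many were returned, and so on. I also want to do this for a subset of items based on color or size, and I want those subsets to be "ranked". For example, I want to show these metrics for items whose color is among the most returned colors overall.

What I have come up with so far is to do this with two queries. The first goes through all the returns, groups them by color, and sums the ordered and returned quantities, like this: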

select
  i.color,
  trunc(sum(r.quantity)::numeric / sum(ol.quantity)::numeric, 2) as return_rate 
from orders o
inner join
  order_lines ol on ol.order_id = o.id
inner join
  items i on i.id = ol.item_id
left outer join
  returns r on r.order_line_id = ol.id
group by i.color
order by return_rate desc nulls last 
limit 4;

    color     | return_rate 
--------------+-------------
 Black        |       0.43
 Blue         |       0.41
 White        |       0.40
 Yellow       |       0.39

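Then, based on this query, I would run a second query that groups all orders by day (date) and computes the overall return rate, plus the return rate for the most returned colors, within a time range. I also want to be able to filter by the items' size, category, and so on. It is going to be used for a dynamic report where people can see total orders, returns, and rate, as well as a line chart of the average return rate and of the most returned colors in the selected time range.

Is there a better way to do this? Going through all the joins twice feels kind of wrong. I have read a bit about window functions, but could not work out whether they apply here.

I started a fiddle with my naive approach as well as a CTE approach that tries to reuse the bulk of the data: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/10700

Obviously, it performs differently locally with about 100k rows. Here is the EXPLAIN ANALYZE plan for the CTE query, run locally.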

QUERY PLAN                                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=50730.74..50809.91 rows=2262 width=68) (actual time=350.787..351.912 rows=722 loops=1)
   Group Key: ((order_return_items.date)::date), order_return_items.color
   CTE order_return_items
     ->  HashAggregate  (cost=33884.69..39240.19 rows=226239 width=33) (actual time=211.390..258.041 rows=194309 loops=1)
           Group Key: o.ordered_at, i.color
           Planned Partitions: 4  Batches: 5  Memory Usage: 9265kB  Disk Usage: 7736kB
           ->  Hash Left Join  (cost=5740.52..14725.07 rows=226239 width=25) (actual time=44.080..158.610 rows=226902 loops=1)
                 Hash Cond: (ol.id = r.order_line_id)
                 ->  Hash Join  (cost=3754.69..9321.56 rows=226239 width=29) (actual time=31.592..107.385 rows=226239 loops=1)
                       Hash Cond: (ol.item_id = i.id)
                       ->  Hash Join  (cost=3572.53..8544.81 rows=226239 width=28) (actual time=29.966..79.287 rows=226239 loops=1)
                             Hash Cond: (ol.order_id = o.id)
                             ->  Seq Scan on order_lines ol  (cost=0.00..4378.39 rows=226239 width=28) (actual time=0.012..10.556 rows=226239 loops=1)
                             ->  Hash  (cost=2363.90..2363.90 rows=96690 width=16) (actual time=29.931..29.931 rows=96690 loops=1)
                                   Buckets: 131072  Batches: 1  Memory Usage: 5557kB
                                   ->  Seq Scan on orders o  (cost=0.00..2363.90 rows=96690 width=16) (actual time=0.005..13.289 rows=96690 loops=1)
                       ->  Hash  (cost=137.63..137.63 rows=3563 width=17) (actual time=1.620..1.621 rows=4462 loops=1)
                             Buckets: 8192 (originally 4096)  Batches: 1 (originally 1)  Memory Usage: 300kB
                             ->  Seq Scan on items i  (cost=0.00..137.63 rows=3563 width=17) (actual time=0.014..0.754 rows=4462 loops=1)
                 ->  Hash  (cost=1248.70..1248.70 rows=58970 width=12) (actual time=12.385..12.385 rows=61098 loops=1)
                       Buckets: 65536  Batches: 1  Memory Usage: 3376kB
                       ->  Seq Scan on returns r  (cost=0.00..1248.70 rows=58970 width=12) (actual time=0.007..5.392 rows=61098 loops=1)
   ->  Sort  (cost=11490.55..11496.20 rows=2262 width=52) (actual time=350.765..351.008 rows=6567 loops=1)
         Sort Key: ((order_return_items.date)::date), order_return_items.color
         Sort Method: quicksort  Memory: 626kB
         ->  Hash Join  (cost=6227.62..11364.52 rows=2262 width=52) (actual time=323.014..349.528 rows=6567 loops=1)
               Hash Cond: (order_return_items.color = most_returned_colors.color)
               ->  CTE Scan on order_return_items  (cost=0.00..4524.78 rows=226239 width=56) (actual time=211.393..223.913 rows=194309 loops=1)
               ->  Hash  (cost=6227.60..6227.60 rows=2 width=32) (actual time=111.569..111.571 rows=2 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                     ->  Subquery Scan on most_returned_colors  (cost=6227.57..6227.60 rows=2 width=32) (actual time=111.564..111.566 rows=2 loops=1)
                           ->  Limit  (cost=6227.57..6227.58 rows=2 width=64) (actual time=111.564..111.565 rows=2 loops=1)
                                 ->  Sort  (cost=6227.57..6228.07 rows=200 width=64) (actual time=111.562..111.563 rows=2 loops=1)
                                       Sort Key: (trunc((sum(order_return_items_1.r_q) / sum(order_return_items_1.ol_q)), 3)) DESC NULLS LAST
                                       Sort Method: top-N heapsort  Memory: 25kB
                                       ->  HashAggregate  (cost=6221.57..6225.57 rows=200 width=64) (actual time=111.538..111.550 rows=44 loops=1)
                                             Group Key: order_return_items_1.color
                                             Batches: 1  Memory Usage: 48kB
                                             ->  CTE Scan on order_return_items order_return_items_1  (cost=0.00..4524.78 rows=226239 width=48) (actual time=0.000..85.098 rows=194309 loops=1)

Feedback on the overall structure and the indexes would also be much appreciated!

1 Answer

Voted

Erwin Brandstetter · Answer 1 · 2023-11-01T09:36:41+08:00

Query

I think your CTE approach is fine. I have a couple of suggestions:

WITH order_return_items AS (
   SELECT (o.ordered_at AT TIME ZONE 'Europe/Vienna')::date AS the_day  -- ① deterministic date 
        , i.color
        , sum(r.quantity)::int AS r_q    -- ② cast to int to avoid escalation to numeric in next step
        , sum(ol.quantity)::int AS ol_q  -- cannot be 0 (right?!)
   FROM   orders       o
   JOIN   order_lines  ol ON ol.order_id = o.id
   JOIN   items        i  ON i.id = ol.item_id
   LEFT   JOIN returns r  ON r.order_line_id = ol.id
   GROUP  BY 1, 2
   )
, most_returned_colors AS (
   SELECT color  -- don't include total rate while not using it down the line
   FROM   order_return_items
   GROUP  BY 1
   ORDER  BY sum(r_q) * 10000 / sum(ol_q) DESC NULLS LAST   -- ③
           , color  -- ④ tiebreaker!
   LIMIT  5
   )
SELECT the_day, color
     , COALESCE(round((sum(r_q)::numeric / sum(ol_q)), 3), 0) AS rate  -- ⑤
FROM   most_returned_colors m
JOIN   order_return_items   o USING (color)  -- ⑥
GROUP  BY 1, 2
ORDER  BY 1, 2;

fiddle

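① The simple cast from timestamptz to date is flawed because it depends on the timezone setting of the session. That can lead to very confusing effects. Define the time zone for your dates with an explicit time zone name. Slightly more expensive, but reliable.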
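To illustrate the effect outside the database, here is a minimal Python sketch. It is not Postgres code: the fixed UTC offsets are assumptions standing in for session time zones (real zones such as Europe/Vienna also have daylight saving rules), and the timestamp is made up.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for session time zones (assumption, no DST handling).
UTC = timezone.utc
VIENNA_WINTER = timezone(timedelta(hours=1))     # CET, Vienna in January
NEW_YORK_WINTER = timezone(timedelta(hours=-5))  # EST, New York in January

# One absolute instant, comparable to a single timestamptz value (made up).
ordered_at = datetime(2023, 1, 15, 23, 30, tzinfo=UTC)

def local_date(instant, zone):
    """Calendar date of the instant as seen from the given zone."""
    return instant.astimezone(zone).date()

dates = {
    "UTC": local_date(ordered_at, UTC),
    "Vienna": local_date(ordered_at, VIENNA_WINTER),
    "New York": local_date(ordered_at, NEW_YORK_WINTER),
}
for name, day in dates.items():
    print(name, day)
```

The same order lands on January 15 or January 16 depending on the zone, which is why grouping by day should name the zone explicitly.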

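② In Postgres, to guard against possible overflow errors, sum(int) results in bigint, and sum(bigint) results in numeric. But computing with numeric is more expensive. My cast to integer is only OK if an integer overflow can never happen (which seems safe to assume for orders per day and color). This way, we avoid escalation to numeric in the next aggregation step. A minor, optional optimization.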

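③ My expression should be cheaper and more precise than computing in numeric and then applying trunc(n, 3). But the more important question is: why round or truncate at all? At this point, it is for ranking, not for display ...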
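A small Python sketch of why truncating before ranking can hurt. The totals are hypothetical, and the helper names are mine; `//` matches Postgres integer division only for non-negative operands, which is the case here.

```python
from decimal import Decimal, ROUND_DOWN

def truncated_rate(returned, ordered, digits=3):
    """Roughly trunc(returned::numeric / ordered, digits)."""
    step = Decimal(1).scaleb(-digits)  # 0.001 for digits=3
    return (Decimal(returned) / Decimal(ordered)).quantize(step, rounding=ROUND_DOWN)

def scaled_int_rate(returned, ordered, scale=10000):
    """Roughly returned * 10000 / ordered with integer operands."""
    return returned * scale // ordered

# Hypothetical per-color totals: (returned quantity, ordered quantity).
totals = {"Black": (4312, 10000), "Blue": (4318, 10000)}
for color, (returned, ordered) in totals.items():
    print(color, truncated_rate(returned, ordered), scaled_int_rate(returned, ordered))
```

Truncated to three digits, both colors show the same rate and their order becomes arbitrary, while the scaled integer expression still tells them apart.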

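④ I added color as a tiebreaker. Use whatever you see fit, but make sure the sort order is deterministic. Otherwise, you may get different results on the next call for the same query and the same data, and you will have a hard time figuring out why.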
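As a rough illustration, this Python sketch uses an in-memory SQLite database as a stand-in for Postgres. The table `color_rates` and its rows are hypothetical. With the tiebreaker in place, the top-N result does not depend on the order in which rows were stored.

```python
import sqlite3

def top_colors(rows, limit):
    """Return the top colors by rate, with color as a deterministic tiebreaker."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE color_rates (color TEXT, rate INTEGER)")
    conn.executemany("INSERT INTO color_rates VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT color FROM color_rates ORDER BY rate DESC, color LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return [row[0] for row in result]

# Two pairs of tied rates (hypothetical data).
rows = [("White", 40), ("Blue", 41), ("Black", 41), ("Yellow", 40)]

print(top_colors(rows, 3))
print(top_colors(list(reversed(rows)), 3))
```

Without the second sort key, which of White or Yellow makes the cut would be left to the executor.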

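⑤ For display, round() seems to make more sense than trunc(). Smaller error. Also, I threw in COALESCE() to get a return rate of zero instead of null where no returns were registered.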
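The difference can be sketched in Python with the decimal module. This only mimics the SQL expressions; the function name and sample numbers are assumptions for illustration, and `None` plays the role of a null sum.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def display_rate(returned, ordered, digits=3, rounding=ROUND_HALF_UP):
    """Mimic COALESCE(round(returned::numeric / ordered, digits), 0)."""
    if returned is None:  # no returns registered: the sum is null
        return Decimal(0)
    step = Decimal(1).scaleb(-digits)  # 0.001 for digits=3
    return (Decimal(returned) / Decimal(ordered)).quantize(step, rounding=rounding)

rounded = display_rate(2, 3)                         # like round()
truncated = display_rate(2, 3, rounding=ROUND_DOWN)  # like trunc()
no_returns = display_rate(None, 5)                   # like COALESCE(null, 0)

print(rounded, truncated, no_returns)
```

For a true rate of 0.6666..., the rounded value is off by about 0.0003 and the truncated value by about 0.0007.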

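⑥ A join is shorter and cheaper. IN (SELECT ...) may behave differently (and be more expensive) because it also folds duplicates on the right side. Both are exactly equivalent here since m.color is unique by definition.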
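To show where the two forms diverge, here is a small Python sketch against an in-memory SQLite database (a stand-in, not Postgres). The tables `facts` and `picks` are hypothetical, and `picks` deliberately contains a duplicate, which `most_returned_colors` never does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (color TEXT, qty INTEGER);
    CREATE TABLE picks (color TEXT);
    INSERT INTO facts VALUES ('Black', 1), ('Blue', 2), ('Red', 3);
    INSERT INTO picks VALUES ('Black'), ('Black'), ('Blue');  -- 'Black' twice
""")

# A join returns one row per matching pair, so the duplicate multiplies rows.
join_rows = conn.execute(
    "SELECT f.color, f.qty FROM facts f JOIN picks p ON p.color = f.color "
    "ORDER BY 1, 2"
).fetchall()

# IN only tests membership, so duplicates on the right side are folded.
in_rows = conn.execute(
    "SELECT color, qty FROM facts WHERE color IN (SELECT color FROM picks) "
    "ORDER BY 1, 2"
).fetchall()
conn.close()

print(join_rows)
print(in_rows)
```

When the right side is unique, as with a GROUP BY on color, both queries return the same rows.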

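Indexes

While processing all rows, you do not need any indexes other than the PK indexes - and possibly not even those. Indeed, it is all Seq Scan in your query plan. Once you add WHERE conditions, matching indexes may help (a lot).

Overall structure

returns is a reserved word in standard SQL. Postgres allows it, but I would avoid reserved words as identifiers. They lead to confusing errors and error messages. For similar reasons, I prefer "the_day" over "date" as a column name.

VARCHAR(255) typically indicates a misunderstanding of Postgres string types. See:

Any downsides of using data type "text" for storing strings?

Related:

How to implement a many-to-many relationship in PostgreSQL?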
