Question

tuskiomi

Asked: 2023-12-23 11:05:33 +0800 CST

Optimizing SELECT MAX() with GROUP BY in Postgres?

I'm trying to speed up the following query in Postgres:

select MAX(msg."timestamp") AS latestDate, msg.channel_id from message msg group by msg.channel_id

The EXPLAIN output looks like this:

Finalize GroupAggregate  (cost=1000.63..2442779.42 rows=305 width=24)
  Group Key: channel_id
  ->  Gather Merge  (cost=1000.63..2442770.27 rows=1220 width=24)
        Workers Planned: 4
        ->  Partial GroupAggregate  (cost=0.57..2441624.90 rows=305 width=24)
              Group Key: channel_id
              ->  Parallel Index Only Scan using message_channel_id_timestamp on message msg  (cost=0.57..2243767.89 rows=39570792 width=24)
JIT:
  Functions: 6
  Options: Inlining true, Optimization true, Expressions true, Deforming true

The DDL for the table is as follows:

CREATE TABLE public.message (
    message_pgid bigserial NOT NULL,
    id uuid NOT NULL,
    "timestamp" timestamptz NOT NULL,
    "content" text NOT NULL,
    channel_id uuid NOT NULL,
    CONSTRAINT message_pk PRIMARY KEY (message_pgid),
    CONSTRAINT message_un UNIQUE (channel_id, id)
);
CREATE INDEX message_channel_id_idx ON public.message USING btree (channel_id);
CREATE INDEX message_channel_id_timestamp ON public.message USING btree (channel_id, "timestamp");
CREATE INDEX message_id ON public.message USING btree (id);
CREATE INDEX message_timestamp_idx ON public.message USING btree ("timestamp");


-- public.message foreign keys
ALTER TABLE public.message ADD CONSTRAINT channel_fk FOREIGN KEY (channel_id) REFERENCES public.channel(id) DEFERRABLE;
ALTER TABLE public.message ADD CONSTRAINT message_fk FOREIGN KEY (user_id) REFERENCES public."user"(id);

And finally, EXPLAIN ANALYZE:

Finalize GroupAggregate  (cost=1000.63..2442779.42 rows=305 width=24) (actual time=7631.501..7673.692 rows=597 loops=1)
  Group Key: channel_id
  ->  Gather Merge  (cost=1000.63..2442770.27 rows=1220 width=24) (actual time=7631.383..7673.511 rows=1667 loops=1)
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial GroupAggregate  (cost=0.57..2441624.90 rows=305 width=24) (actual time=305.736..6125.479 rows=333 loops=5)
              Group Key: channel_id
              ->  Parallel Index Only Scan using message_channel_id_timestamp on message msg  (cost=0.57..2243767.89 rows=39570792 width=24) (actual time=0.557..4938.221 rows=31656633 loops=5)
                    Heap Fetches: 32082
Planning Time: 4.032 ms
JIT:
  Functions: 18
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 12.315 ms, Inlining 193.685 ms, Optimization 122.739 ms, Emission 100.570 ms, Total 429.309 ms
Execution Time: 7684.655 ms

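As you can see, even with the btree index the operation still takes 7.6 seconds, most of it spent in the parallel index scan. I'm somewhat at a loss as to how to speed this up further. The index is 5.7G in size, and I've given my instance 6GB of RAM, which should be more than enough for a btree max search. I've configured my settings according to pgtune ( https://pgtune.leopard.in.ua/ ).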

Is there anything obvious I'm missing?

1 Answer

bobflux · Answer 1 · 2023-12-23T19:33:14+08:00

Unfortunately, Postgres does not yet implement the kind of index scan needed to optimize this query automatically, so it scans the entire index.

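It can use an index on (a,b) to optimize "max(b) WHERE a=..." as well as "WHERE a=... ORDER BY b DESC LIMIT 1", which returns the whole row with the highest value of b (this may be more useful than max() if you actually want the other columns). But that only works for a single value of a, or for multiple values inside a nested loop, not for the whole table as in your query.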
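As a minimal sketch of the two single-value forms, written against the message table from the question. The UUID literal is a hypothetical placeholder, not a value from the post:

```sql
-- Hypothetical channel id; substitute a real one from your data.

-- Form 1: only the latest timestamp for one channel.
SELECT max("timestamp") AS latestDate
FROM message
WHERE channel_id = '00000000-0000-0000-0000-000000000001';

-- Form 2: the whole latest row for one channel.
SELECT *
FROM message
WHERE channel_id = '00000000-0000-0000-0000-000000000001'
ORDER BY "timestamp" DESC
LIMIT 1;
```

Both forms should be able to use the (channel_id, "timestamp") index to jump straight to the end of that channel's range instead of scanning it.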

Assuming you have a separate table "channels" whose primary key "channel_id" is referenced by the messages table, it is easy to emulate this by hand.

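If you ask for only one channel_id value, Postgres knows how to use the index to find the row you want. So the trick is to do this for every value of channel_id, using either a dependent subquery (if you only need the max() column) or a LATERAL join (if you also need other columns, such as the content of the latest message).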

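This results in one index scan on messages per channel_id value. So the time is O(number of channels * log(number of messages)), which should be much faster than scanning the whole messages table. Also, it only touches the pages that contain the latest messages, so it won't trash your cache.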
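If no separate channels table were available, one commonly used workaround is to emulate a loose index scan with a recursive CTE that walks the distinct channel_id values through the index. This is a different technique from the one described in this answer, sketched here against the message table from the question under the assumption that an index leading with channel_id exists (as message_channel_id_timestamp does):

```sql
WITH RECURSIVE ch AS (
    -- Anchor: the smallest channel_id in the index.
    (SELECT channel_id FROM message ORDER BY channel_id LIMIT 1)
    UNION ALL
    -- Step: the next channel_id greater than the previous one.
    -- Yields NULL once there is none left, which ends the recursion.
    SELECT (SELECT m.channel_id
            FROM message m
            WHERE m.channel_id > ch.channel_id
            ORDER BY m.channel_id
            LIMIT 1)
    FROM ch
    WHERE ch.channel_id IS NOT NULL
)
SELECT ch.channel_id,
       (SELECT max(m."timestamp")
        FROM message m
        WHERE m.channel_id = ch.channel_id) AS latestDate
FROM ch
WHERE ch.channel_id IS NOT NULL;  -- drop the terminating NULL row
```

Each recursion step is a single index probe, so the cost should again scale with the number of channels rather than the number of messages.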

Create the test data:

CREATE UNLOGGED TABLE messages( ts INT NOT NULL, channel_id INT NOT NULL );
INSERT INTO messages SELECT n,n%1000 FROM generate_series(1,1000000) n;
CREATE INDEX ON messages( channel_id, ts );

CREATE UNLOGGED TABLE channels ( channel_id INT PRIMARY KEY );
INSERT INTO channels SELECT DISTINCT channel_id FROM messages;
VACUUM ANALYZE;

Slow queries that read the entire table (or index):

-- SLOW
EXPLAIN ANALYZE SELECT channel_id, max(ts) FROM messages GROUP BY channel_id;

 Finalize GroupAggregate  (cost=11734.85..11988.20 rows=1000 width=8) (actual time=58.321..60.128 rows=1000 loops=1)
   Group Key: channel_id
   ->  Gather Merge  (cost=11734.85..11968.20 rows=2000 width=8) (actual time=58.315..59.871 rows=3000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=10734.83..10737.33 rows=1000 width=8) (actual time=54.080..54.119 rows=1000 loops=3)
               Sort Key: channel_id
               Sort Method: quicksort  Memory: 56kB
               Worker 0:  Sort Method: quicksort  Memory: 56kB
               Worker 1:  Sort Method: quicksort  Memory: 56kB
               ->  Partial HashAggregate  (cost=10675.00..10685.00 rows=1000 width=8) (actual time=53.872..53.958 rows=1000 loops=3)
                     Group Key: channel_id
                     Batches: 1  Memory Usage: 129kB
                     Worker 0:  Batches: 1  Memory Usage: 129kB
                     Worker 1:  Batches: 1  Memory Usage: 129kB
                     ->  Parallel Seq Scan on messages  (cost=0.00..8591.67 rows=416667 width=8) (actual time=0.009..16.164 rows=333333 loops=3)
 Planning Time: 0.167 ms
 Execution Time: 60.236 ms

-- SLOW
EXPLAIN ANALYZE SELECT channel_id, max(ts) 
FROM channels JOIN messages USING (channel_id)
GROUP BY channel_id;

 Finalize GroupAggregate  (cost=12860.74..13114.09 rows=1000 width=8) (actual time=94.136..96.019 rows=1000 loops=1)
   Group Key: channels.channel_id
   ->  Gather Merge  (cost=12860.74..13094.09 rows=2000 width=8) (actual time=94.131..95.758 rows=3000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=11860.72..11863.22 rows=1000 width=8) (actual time=92.507..92.542 rows=1000 loops=3)
               Sort Key: channels.channel_id
               Sort Method: quicksort  Memory: 56kB
               Worker 0:  Sort Method: quicksort  Memory: 56kB
               Worker 1:  Sort Method: quicksort  Memory: 56kB
               ->  Partial HashAggregate  (cost=11800.89..11810.89 rows=1000 width=8) (actual time=92.299..92.386 rows=1000 loops=3)
                     Group Key: channels.channel_id
                     Batches: 1  Memory Usage: 129kB
                     Worker 0:  Batches: 1  Memory Usage: 129kB
                     Worker 1:  Batches: 1  Memory Usage: 129kB
                     ->  Hash Join  (cost=27.50..9717.56 rows=416667 width=8) (actual time=0.154..58.426 rows=333333 loops=3)
                           Hash Cond: (messages.channel_id = channels.channel_id)
                           ->  Parallel Seq Scan on messages  (cost=0.00..8591.67 rows=416667 width=8) (actual time=0.004..16.329 rows=333333 loops=3)
                           ->  Hash  (cost=15.00..15.00 rows=1000 width=4) (actual time=0.143..0.143 rows=1000 loops=3)
                                 Buckets: 1024  Batches: 1  Memory Usage: 44kB
                                 ->  Seq Scan on channels  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.006..0.060 rows=1000 loops=3)
 Planning Time: 0.127 ms
 Execution Time: 96.066 ms

The faster query, which uses the index to find the latest row for each channel_id right away, with a dependent subquery that selects max(ts):

-- FAST
EXPLAIN ANALYZE SELECT channel_id, 
(SELECT max(ts) FROM messages m WHERE m.channel_id=c.channel_id) 
FROM channels c;

 Seq Scan on channels c  (cost=0.00..482.00 rows=1000 width=8) (actual time=0.023..7.308 rows=1000 loops=1)
   SubPlan 2
     ->  Result  (cost=0.46..0.47 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1000)
           InitPlan 1 (returns $1)
             ->  Limit  (cost=0.42..0.46 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1000)
                   ->  Index Only Scan using messages_channel_id_ts_idx on messages m  (cost=0.42..32.42 rows=1000 width=4) (actual time=0.007..0.007 rows=1 loops=1000)
                         Index Cond: ((channel_id = c.channel_id) AND (ts IS NOT NULL))
                         Heap Fetches: 0
 Planning Time: 0.072 ms
 Execution Time: 7.349 ms
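A quick way to convince yourself that the rewrite returns the same rows as the GROUP BY version on the test tables is to compare the two result sets with EXCEPT. This is only a sanity-check sketch: it tests one direction, so swap the two halves to test the other.

```sql
-- Rows produced by the GROUP BY query that the subquery version does not produce.
-- An empty result means the subquery version covers every GROUP BY row.
(SELECT channel_id, max(ts) FROM messages GROUP BY channel_id)
EXCEPT
(SELECT c.channel_id,
        (SELECT max(ts) FROM messages m WHERE m.channel_id = c.channel_id)
 FROM channels c);
```

One difference to keep in mind: a channel with no messages shows up in the subquery version with a NULL max, whereas the GROUP BY version omits it entirely. The test data above has messages for every channel, so this does not come up there.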

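The variant using LATERAL has the advantage that it can return more columns from messages if needed, and it can return the latest message or the N latest messages per channel (just change the LIMIT).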

EXPLAIN ANALYZE SELECT * FROM channels c
LEFT JOIN LATERAL (
 SELECT * FROM messages m WHERE m.channel_id=c.channel_id
 ORDER BY ts DESC LIMIT 1) USING(channel_id);


 Nested Loop Left Join  (cost=0.42..492.00 rows=1000 width=8) (actual time=0.085..10.320 rows=1000 loops=1)
   ->  Seq Scan on channels c  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.019..0.109 rows=1000 loops=1)
   ->  Subquery Scan on unnamed_subquery  (cost=0.42..0.47 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=1000)
         Filter: (c.channel_id = unnamed_subquery.channel_id)
         ->  Limit  (cost=0.42..0.45 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=1000)
               ->  Index Only Scan using messages_channel_id_ts_idx on messages m  (cost=0.42..29.93 rows=1000 width=8) (actual time=0.009..0.009 rows=1 loops=1000)
                     Index Cond: (channel_id = c.channel_id)
                     Heap Fetches: 0
 Planning Time: 0.313 ms
 Execution Time: 10.411 ms

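The LATERAL JOIN syntax is a bit odd. If you want a row for each channel_id that has no messages, you need to use LEFT JOIN, and that requires a join condition (USING(channel_id)). But because it is a LATERAL JOIN, the right-hand table of the join depends on the left-hand one, so this condition is already specified inside it. So there is a bit of duplication.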
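A common way to avoid repeating the condition is to write the join condition as ON true, since the correlation already lives inside the subquery. The sketch below applies that to the schema from the question. It assumes the public.channel(id) table named in the foreign key there, and the aliases c, msg and m are illustrative only:

```sql
SELECT c.id          AS channel_id,
       m."timestamp" AS latestDate,
       m.content
FROM public.channel c
LEFT JOIN LATERAL (
    -- Latest message for this channel; raise the LIMIT for the N latest.
    SELECT msg."timestamp", msg.content
    FROM public.message msg
    WHERE msg.channel_id = c.id
    ORDER BY msg."timestamp" DESC
    LIMIT 1
) m ON true;  -- the real join condition is the WHERE inside the subquery
```

Because content is not part of the (channel_id, "timestamp") index, this version has to fetch one heap row per channel rather than running as an index-only scan. Drop the content column if only the timestamp is needed.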
