Running the same query on two tables that differ only in row count (~7.8M vs ~1.4M rows) yields two different plans, which sounds reasonable. But execution is 4 to 5 times slower on the smaller table, and I would like to understand why.
The table definition is as follows:
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
image_id | bigint | | not null |
h3_cell | h3index | | not null |
created_at | timestamp with time zone | | not null |
location | geometry(PointZ,4326) | | not null |
Indexes:
"images_a_pkey" PRIMARY KEY, btree (image_id)
"images_a_created_at_idx" btree (created_at)
"images_a_h3_cell_idx" btree (h3_cell)
The query is as follows:
WITH h3_cells AS (
    SELECT UNNEST(h3_linestring_to_cells(:line_string, 13, 1)) AS cell
)
SELECT COUNT(*)
FROM images
JOIN h3_cells hc ON images.h3_cell = hc.cell
The h3_linestring_to_cells() function returns an h3index[] array whose size can in some cases reach tens of thousands of values. In the example below, it returns roughly 50,000 values.
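A quick way to confirm the array size (a sketch; :line_string is the bound parameter from the query above):

SELECT cardinality(h3_linestring_to_cells(:line_string, 13, 1));
-- 51377 in the example below, matching the ProjectSet row counts in the plans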
On the table with 7.8M rows, the plan and execution are as follows (array values removed for brevity):
Aggregate  (cost=347404.47..347404.48 rows=1 width=8) (actual time=74.311..74.312 rows=1 loops=1)
  Buffers: shared hit=154681 read=328
  I/O Timings: shared read=1.362
  ->  Nested Loop  (cost=0.43..346724.23 rows=272093 width=0) (actual time=0.051..74.246 rows=833 loops=1)
        Buffers: shared hit=154681 read=328
        I/O Timings: shared read=1.362
        ->  ProjectSet  (cost=0.00..256.90 rows=51377 width=8) (actual time=0.002..4.113 rows=51377 loops=1)
              ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
        ->  Index Only Scan using images_a_h3_cell_idx on images_a  (cost=0.43..6.68 rows=5 width=8) (actual time=0.001..0.001 rows=0 loops=51377)
              Index Cond: (h3_cell = (unnest('{...}'::h3index[])))
              Heap Fetches: 354
              Buffers: shared hit=154681 read=328
              I/O Timings: shared read=1.362
Planning Time: 139.421 ms
Execution Time: 74.345 ms
While on the smaller, 1.4M-row table, the plan and execution are as follows:
Aggregate  (cost=105040.78..105040.79 rows=1 width=8) (actual time=327.586..327.587 rows=1 loops=1)
  Buffers: shared hit=148358 read=6315 written=41
  I/O Timings: shared read=26.521 write=0.327
  ->  Merge Join  (cost=4791.05..104802.14 rows=95455 width=0) (actual time=321.174..327.575 rows=118 loops=1)
        Merge Cond: (ptilmi.h3_cell = (unnest('{...}'::h3index[])))
        Buffers: shared hit=148358 read=6315 written=41
        I/O Timings: shared read=26.521 write=0.327
        ->  Index Only Scan using images_b_h3_cell_idx on images_b ptilmi  (cost=0.43..95041.10 rows=1415438 width=8) (actual time=0.026..245.897 rows=964987 loops=1)
              Heap Fetches: 469832
              Buffers: shared hit=148358 read=6315 written=41
              I/O Timings: shared read=26.521 write=0.327
        ->  Sort  (cost=4790.62..4919.07 rows=51377 width=8) (actual time=11.181..13.551 rows=51390 loops=1)
              Sort Key: (unnest('{...}'::h3index[]))
              Sort Method: quicksort  Memory: 1537kB
              ->  ProjectSet  (cost=0.00..256.90 rows=51377 width=8) (actual time=0.002..3.716 rows=51377 loops=1)
                    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
Planning Time: 146.617 ms
Execution Time: 327.626 ms
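In both plans the join row estimate is far off (272,093 estimated vs 833 actual on the larger table, 95,455 vs 118 on the smaller one). To see what the planner knows about the h3_cell column, its statistics can be inspected, for example (a sketch):

SELECT n_distinct, null_frac, correlation
FROM pg_stats
WHERE tablename = 'images_b'
  AND attname = 'h3_cell';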
With a smaller source array (e.g. 25,000 entries), the plan on the smaller table switches to the first shape (nested loop), and its execution time becomes what I would expect (faster than on the larger table).
I don't understand what drives the planner to switch to the less efficient plan.
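As a diagnostic (not a fix), merge joins can be disabled for a single transaction to confirm that the nested-loop shape really is cheaper on the smaller table; a sketch, with images_b standing for the 1.4M-row table:

BEGIN;
SET LOCAL enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS)
WITH h3_cells AS (
    SELECT UNNEST(h3_linestring_to_cells(:line_string, 13, 1)) AS cell
)
SELECT COUNT(*)
FROM images_b
JOIN h3_cells hc ON images_b.h3_cell = hc.cell;
ROLLBACK;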
Note that I'm using a CTE + JOIN rather than e.g. WHERE h3_cell = ANY(h3_linestring_to_cells(:line_string, 13, 1)) because the generated arrays are typically large, and I've found the former to generally be more efficient in that case. Interestingly, with the 50,000-entry array the = ANY() approach is faster on the smaller table, while with the 25,000-entry one it is slower.
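For completeness, the = ANY() variant I compared against looks like this:

SELECT COUNT(*)
FROM images
WHERE h3_cell = ANY(h3_linestring_to_cells(:line_string, 13, 1));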