I have a large table (~2.2 billion rows) with a fairly basic structure:
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------------+---------+-----------+----------+----------------------------------+---------+-------------+--------------+-------------
Id | bigint | | not null | generated by default as identity | plain | | |
ItemId | uuid | | not null | | plain | | |
StartUtc | integer | | not null | | plain | | |
EndUtc | integer | | not null | | plain | | |
Price | integer | | not null | | plain | | |
Indexes:
"PK_PriceHistoryEntry" PRIMARY KEY, btree ("Id")
"IX_PriceHistoryEntry_ItemId" btree ("ItemId") CLUSTER
All queries are done against the ItemId column, in this form:
SELECT "ItemId", "StartUtc", "EndUtc", "Price" FROM "PriceHistoryEntry" WHERE "ItemId" IN (...array of guids)
The table is not partitioned, because every select query would have to hit every partition: queries are never restricted to a date range, it is always the full history.
The general insert pattern is a weekly bulk load of 10-20 million price history entries, after which the table is re-clustered on the ItemId index. Selects are then performant enough for our purposes (an average search takes ~100 ms, with roughly 250-400 GUIDs in the IN clause, returning ~50k rows).
During development I tested with up to 5 billion rows and up to 1000 GUIDs in the IN clause, and it was always fast. It has worked flawlessly for over a year, but today PostgreSQL stopped using the index once the IN list contains more than 400 GUIDs, and I don't understand why. The index plan for 400 GUIDs:
Index Scan using "IX_PriceHistoryEntry_ItemId" on public."PriceHistoryEntry" (cost=0.58..32570020.41 rows=23433784 width=28) (actual time=11.355..110.863 rows=160503 loops=1)
Output: "ItemId", "StartUtc", "EndUtc", "Price"
Index Cond: ("PriceHistoryEntry"."ItemId" = ANY ('{ac5aa227-8787-46fc-b34d-47017edc7d1f,*cut 398 guids*,16923b11-30b7-4311-bc54-3b2b1da314d0}'::uuid[]))
Buffers: shared hit=828 read=2550
I/O Timings: shared read=79.971
Planning Time: 0.314 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.234 ms, Inlining 0.064 ms, Optimization 6.114 ms, Emission 4.823 ms, Total 11.235 ms
Execution Time: 115.573 ms
(11 rows)
Time: 116.586 ms
But with 401 GUIDs:
Gather (cost=1000.99..32292650.29 rows=23734217 width=28) (actual time=232.393..55599.947 rows=161872 loops=1)
Output: "ItemId", "StartUtc", "EndUtc", "Price"
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=7543 read=17088587
I/O Timings: shared read=48381.022
-> Parallel Seq Scan on public."PriceHistoryEntry" (cost=0.99..29918228.59 rows=9889257 width=28) (actual time=227.890..51672.274 rows=53957 loops=3)
Output: "ItemId", "StartUtc", "EndUtc", "Price"
Filter: ("PriceHistoryEntry"."ItemId" = ANY ('{bff9de7e-7f35-4f5d-88c5-2c342806d69b,*cut 399 guids*,618ce691-c8f0-46b2-8fd1-96a404bdda71}'::uuid[]))
Rows Removed by Filter: 683791228
Buffers: shared hit=7543 read=17088587
I/O Timings: shared read=48381.022
Worker 0: actual time=329.642..50237.693 rows=54262 loops=1
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.238 ms, Inlining 45.205 ms, Optimization 4.210 ms, Emission 3.867 ms, Total 53.520 ms
Buffers: shared hit=2368 read=5516111
I/O Timings: shared read=15615.749
Worker 1: actual time=122.393..49195.649 rows=52913 loops=1
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.240 ms, Inlining 45.364 ms, Optimization 4.256 ms, Emission 3.936 ms, Total 53.796 ms
Buffers: shared hit=2447 read=5377318
I/O Timings: shared read=15380.362
Planning Time: 0.396 ms
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.767 ms, Inlining 90.645 ms, Optimization 16.254 ms, Emission 13.825 ms, Total 121.492 ms
Execution Time: 55604.845 ms
(32 rows)
Time: 55606.114 ms (00:55.606)
As I said, while developing this feature I tested extensively with up to 5 billion rows and more than 1000 GUIDs in the IN clause (far more than the database currently holds), and I never ran into this problem.
I have tried:
- re-clustering the table
- REINDEX on IX_PriceHistoryEntry_ItemId
- VACUUM ANALYZE on the table
- VACUUM FULL on the table
- rebuilding the whole table with pg_repack
and it still refuses to use the index. Any idea what is going on, how to fix it, and how to make sure it never happens again?
If I disable sequential scans (set enable_seqscan = off;), it uses the index correctly and returns results in ~100 ms, instead of the 55 seconds PostgreSQL needs with the plan it considers best. Disabling seqscan doesn't seem recommended, though. If I lower random_page_cost it will use the index for somewhat larger queries, but for especially large IN lists it still falls back to a sequential scan. I need to make sure it never uses a sequential scan, because on this table a seq scan will never be faster.
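As a stopgap, the enable_seqscan override can be scoped to a single transaction with SET LOCAL instead of being disabled globally, so it reverts automatically at commit. A sketch of that workaround (the single GUID in the IN list is a placeholder; it does not fix the underlying estimate):

```sql
BEGIN;
-- Applies only inside this transaction; reverts at COMMIT/ROLLBACK.
SET LOCAL enable_seqscan = off;
SELECT "ItemId", "StartUtc", "EndUtc", "Price"
FROM "PriceHistoryEntry"
WHERE "ItemId" IN ('ac5aa227-8787-46fc-b34d-47017edc7d1f');
COMMIT;
```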
My guess is that ndistinct for "ItemId" has gone haywire; you can check it in the pg_stats view. This happens quite often on the column a table is clustered on, because of a flaw in the sampling method used to estimate ndistinct. If that is the problem, you can override ndistinct with a more accurate manual value:
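A sketch of the check and the override (the 500000 below is a placeholder; substitute your actual count of distinct ItemId values):

```sql
-- Check what the planner currently believes. Negative values are
-- fractions of the row count: -0.05 means 5% of rows are distinct.
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'PriceHistoryEntry' AND attname = 'ItemId';

-- Override with a manual estimate, then re-analyze so new plans pick it up.
ALTER TABLE "PriceHistoryEntry"
  ALTER COLUMN "ItemId" SET (n_distinct = 500000);
ANALYZE "PriceHistoryEntry";
```

A manually set n_distinct survives future ANALYZE runs until you reset it with `ALTER COLUMN "ItemId" RESET (n_distinct)`.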