I have a large table (~2.2 billion rows) with a fairly basic structure:
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------------+---------+-----------+----------+----------------------------------+---------+-------------+--------------+-------------
Id | bigint | | not null | generated by default as identity | plain | | |
ItemId | uuid | | not null | | plain | | |
StartUtc | integer | | not null | | plain | | |
EndUtc | integer | | not null | | plain | | |
Price | integer | | not null | | plain | | |
Indexes:
"PK_PriceHistoryEntry" PRIMARY KEY, btree ("Id")
"IX_PriceHistoryEntry_ItemId" btree ("ItemId") CLUSTER
All queries are done against the ItemId column, in this form:
SELECT "ItemId", "StartUtc", "EndUtc", "Price" FROM "PriceHistoryEntry" WHERE "ItemId" IN (...array of guids)
The table is not partitioned, because every select query would have to hit every partition: queries are never restricted to a date range, it is always the full history.
The general insert pattern is a weekly bulk load of 10-20 million price history entries, after which the table is re-clustered on the ItemId index. Selects are then performant enough for our purposes (an average search takes ~100 ms, with roughly 250-400 GUIDs in the IN clause, returning ~50k rows).
During development I tested with up to 5 billion rows and up to 1000 GUIDs in the IN clause, and it was always fast. It has worked flawlessly for over a year, but today PostgreSQL stopped using the index once the IN list contains more than 400 GUIDs, and I don't understand why. The index plan for 400 GUIDs:
Index Scan using "IX_PriceHistoryEntry_ItemId" on public."PriceHistoryEntry" (cost=0.58..32570020.41 rows=23433784 width=28) (actual time=11.355..110.863 rows=160503 loops=1)
Output: "ItemId", "StartUtc", "EndUtc", "Price"
Index Cond: ("PriceHistoryEntry"."ItemId" = ANY ('{ac5aa227-8787-46fc-b34d-47017edc7d1f,*cut 398 guids*,16923b11-30b7-4311-bc54-3b2b1da314d0}'::uuid[]))
Buffers: shared hit=828 read=2550
I/O Timings: shared read=79.971
Planning Time: 0.314 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.234 ms, Inlining 0.064 ms, Optimization 6.114 ms, Emission 4.823 ms, Total 11.235 ms
Execution Time: 115.573 ms
(11 rows)
Time: 116.586 ms
But with 401 GUIDs:
Gather (cost=1000.99..32292650.29 rows=23734217 width=28) (actual time=232.393..55599.947 rows=161872 loops=1)
Output: "ItemId", "StartUtc", "EndUtc", "Price"
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=7543 read=17088587
I/O Timings: shared read=48381.022
-> Parallel Seq Scan on public."PriceHistoryEntry" (cost=0.99..29918228.59 rows=9889257 width=28) (actual time=227.890..51672.274 rows=53957 loops=3)
Output: "ItemId", "StartUtc", "EndUtc", "Price"
Filter: ("PriceHistoryEntry"."ItemId" = ANY ('{bff9de7e-7f35-4f5d-88c5-2c342806d69b,*cut 399 guids*,618ce691-c8f0-46b2-8fd1-96a404bdda71}'::uuid[]))
Rows Removed by Filter: 683791228
Buffers: shared hit=7543 read=17088587
I/O Timings: shared read=48381.022
Worker 0: actual time=329.642..50237.693 rows=54262 loops=1
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.238 ms, Inlining 45.205 ms, Optimization 4.210 ms, Emission 3.867 ms, Total 53.520 ms
Buffers: shared hit=2368 read=5516111
I/O Timings: shared read=15615.749
Worker 1: actual time=122.393..49195.649 rows=52913 loops=1
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.240 ms, Inlining 45.364 ms, Optimization 4.256 ms, Emission 3.936 ms, Total 53.796 ms
Buffers: shared hit=2447 read=5377318
I/O Timings: shared read=15380.362
Planning Time: 0.396 ms
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.767 ms, Inlining 90.645 ms, Optimization 16.254 ms, Emission 13.825 ms, Total 121.492 ms
Execution Time: 55604.845 ms
(32 rows)
Time: 55606.114 ms (00:55.606)
As I said, while developing this feature I tested extensively with up to 5 billion rows and more than 1000 GUIDs in the IN clause (far more than the database currently holds), and I never ran into this problem.
I have tried:
- re-clustering the table
- REINDEX on IX_PriceHistoryEntry_ItemId
- VACUUM ANALYZE on the table
- VACUUM FULL on the table
- rebuilding the whole table with pg_repack
and it still refuses to use the index. Any idea what is going on, how to fix it, and how to make sure it never happens again?
If I disable sequential scans (set enable_seqscan = off;), it uses the index correctly and returns results in ~100 ms, instead of the 55 seconds PostgreSQL needs with the plan it considers best. Disabling seqscan doesn't seem recommended, though. If I lower random_page_cost it will use the index for somewhat larger queries, but for especially large IN lists it still falls back to a sequential scan. I need to make sure it never uses a sequential scan, because on this table a seq scan will never be faster.
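As a stopgap, the enable_seqscan override can be scoped to a single transaction with SET LOCAL instead of being disabled globally, so it reverts automatically at commit. A sketch of that workaround (the single GUID in the IN list is a placeholder; it does not fix the underlying estimate):

```sql
BEGIN;
-- Applies only inside this transaction; reverts at COMMIT/ROLLBACK.
SET LOCAL enable_seqscan = off;
SELECT "ItemId", "StartUtc", "EndUtc", "Price"
FROM "PriceHistoryEntry"
WHERE "ItemId" IN ('ac5aa227-8787-46fc-b34d-47017edc7d1f');
COMMIT;
```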
My guess is that ndistinct for "ItemId" has gone haywire; you can check it in the pg_stats view. This happens quite often on the column a table is clustered on, because of a flaw in the sampling method used to estimate ndistinct. If that is the problem, you can override ndistinct with a more accurate manual value:
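A sketch of the check and the override (the 500000 below is a placeholder; substitute your actual count of distinct ItemId values):

```sql
-- Check what the planner currently believes. Negative values are
-- fractions of the row count: -0.05 means 5% of rows are distinct.
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'PriceHistoryEntry' AND attname = 'ItemId';

-- Override with a manual estimate, then re-analyze so new plans pick it up.
ALTER TABLE "PriceHistoryEntry"
  ALTER COLUMN "ItemId" SET (n_distinct = 500000);
ANALYZE "PriceHistoryEntry";
```

A manually set n_distinct survives future ANALYZE runs until you reset it with `ALTER COLUMN "ItemId" RESET (n_distinct)`.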