I have a table defined as follows:
CREATE TABLE details_search (
id int4 NOT NULL PRIMARY KEY,
"search" tsvector NULL
);
CREATE INDEX details_search_idx ON details_search USING gin (search);
I ran this to get an idea of its size:
SELECT pg_size_pretty(pg_relation_size('details_search')) relation_size,
pg_size_pretty(pg_total_relation_size('details_search')) total_relation_size,
pg_size_pretty(pg_table_size('details_search')) table_size,
pg_size_pretty(pg_indexes_size('details_search')) indexes_size;
These are the results:
relation_size|total_relation_size|table_size|indexes_size|
-------------+-------------------+----------+------------+
800 MB |64 GB |57 GB |6830 MB |
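I am only interested in phrase searches, and those searches are used in aggregates. When I run a phrase search with uncommon terms, everything works fine. When I use a phrase made of common terms, performance suffers badly.
This query took 192 seconds: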
SELECT COUNT(id)
FROM details_search
WHERE search @@ phraseto_tsquery('simple', 'data management')
This is the query plan:
Output: count(id)
Buffers: shared hit=25942383 read=6354221 written=4588
I/O Timings: shared/local read=512605.708 write=122.864
-> Gather (cost=178176.43..178176.64 rows=2 width=8) (actual time=192857.512..192861.652 rows=3 loops=1)
Output: (PARTIAL count(id))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=25942383 read=6354221 written=4588
I/O Timings: shared/local read=512605.708 write=122.864
-> Partial Aggregate (cost=177176.43..177176.44 rows=1 width=8) (actual time=192852.434..192852.435 rows=1 loops=3)
Output: PARTIAL count(id)
Buffers: shared hit=25942383 read=6354221 written=4588
I/O Timings: shared/local read=512605.708 write=122.864
Worker 0: actual time=192851.530..192851.531 rows=1 loops=1
Buffers: shared hit=8650807 read=2115877 written=1469
I/O Timings: shared/local read=170775.853 write=38.985
Worker 1: actual time=192848.579..192848.581 rows=1 loops=1
Buffers: shared hit=8623424 read=2115864 written=1551
I/O Timings: shared/local read=170720.335 write=41.527
-> Parallel Bitmap Heap Scan on details_search (cost=33664.19..173376.94 rows=1519795 width=4) (actual time=1231.216..192758.374 rows=121050 loops=3)
Output: id, search
Recheck Cond: (search @@ '''data'' <-> ''management'''::tsquery)
Rows Removed by Index Recheck: 2268868
Heap Blocks: exact=12114 lossy=22061
Buffers: shared hit=25942383 read=6354221 written=4588
I/O Timings: shared/local read=512605.708 write=122.864
Worker 0: actual time=1230.572..192759.521 rows=121482 loops=1
Buffers: shared hit=8650807 read=2115877 written=1469
I/O Timings: shared/local read=170775.853 write=38.985
Worker 1: actual time=1227.317..192754.854 rows=120483 loops=1
Buffers: shared hit=8623424 read=2115864 written=1551
I/O Timings: shared/local read=170720.335 write=41.527
-> Bitmap Index Scan on job_posts_details_search_idx (cost=0.00..32752.32 rows=3647509 width=0) (actual time=1226.674..1226.675 rows=3956386 loops=1)
Index Cond: (search @@ '''data'' <-> ''management'''::tsquery)
Buffers: shared hit=832 read=2242
I/O Timings: shared/local read=424.365
Settings: effective_cache_size = '13153520kB', search_path = 'public, public, "$user"'
Query Identifier: 1461135140272243366
Planning:
Buffers: shared hit=194
Planning Time: 7.346 ms
Execution Time: 192861.763 ms
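Most of the time is spent on reads in the Parallel Bitmap Heap Scan. The read speed is also quite slow at 97 MB/s, considering that the server has SSDs (including one SSD dedicated to data caching). Loading the table with pg_prewarm before running the query does not improve this.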
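I see that the plan has Recheck Cond: (search @@ '''data'' <-> ''management'''::tsquery), so I guess it is fetching all the row data from disk to check the condition against the actual search column, as if checking the index alone were not enough to verify a phrase match. That would explain why the problem only shows up with common terms.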
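What can I do to optimize these phrase searches? I am happy to consider restrictions on what can be searched (for example, "only queries of up to 3 words") or changes to server settings (to speed up those pesky reads), if that makes query times consistent.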
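There is not much you can do about that. A GIN index indexes the individual lexemes, not phrases. So the "Bitmap Index Scan" gives you all rows that contain both 'data' and 'management', and the recheck in the "Bitmap Heap Scan" weeds out the 95% that are false positives.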
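The difference between what the index can answer and what the recheck has to verify can be seen on a tiny example. This is only an illustrative sketch: the sample text 'management of data' is made up, and it runs against literals rather than the details_search table.

```sql
-- The phrase query is an ordered "followed by" match between two lexemes.
SELECT phraseto_tsquery('simple', 'data management');
-- 'data' <-> 'management'

-- A GIN lookup can only tell which rows contain each lexeme,
-- which corresponds to the plain AND query.
SELECT plainto_tsquery('simple', 'data management');
-- 'data' & 'management'

-- A document that contains both lexemes, but not next to each other in
-- that order, passes the AND test and fails the phrase test. Rows like
-- this are the false positives that the heap recheck has to remove.
SELECT to_tsvector('simple', 'management of data')
           @@ plainto_tsquery('simple', 'data management')  AS and_match,
       to_tsvector('simple', 'management of data')
           @@ phraseto_tsquery('simple', 'data management') AS phrase_match;
-- and_match = t, phrase_match = f
```

The more common the two lexemes are, the more rows pass the AND test without containing the phrase, which matches the observation that only phrases built from common terms are slow.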
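Reading over 2 million 8kB blocks in only 170 seconds suggests that most of the data is cached in the kernel page cache anyway. You can improve performance a little by increasing work_mem so that you no longer get "lossy" blocks. If the problem were to speed up searches for a few specific phrases, you could use a synonym dictionary (for multi-word phrases, PostgreSQL's thesaurus dictionary type) to replace each of them with a single token such as 'datamanagement', which would make the index scan much more effective. But I guess you want to speed up searches for arbitrary phrases.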
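The work_mem suggestion can be tried per session before changing any server-wide setting. The following is a minimal sketch: the value '256MB' is an assumption chosen for illustration, not a recommendation, and the right figure depends on the available memory and on how many queries run concurrently.

```sql
-- Raise work_mem for the current session only (the value is hypothetical).
SET work_mem = '256MB';

-- Re-run the slow query and inspect the "Heap Blocks" line of the
-- Parallel Bitmap Heap Scan node: the goal is to see only "exact" blocks
-- and no "lossy" ones, so fewer rows need to be rechecked.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(id)
FROM details_search
WHERE search @@ phraseto_tsquery('simple', 'data management');

-- Return to the configured default afterwards.
RESET work_mem;
```

This only removes the extra rechecks caused by lossy bitmap pages. The rows that contain both lexemes without forming the phrase still have to be fetched and rechecked, so the gain is expected to be modest.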