
Question

Asked: 2024-02-15 19:15:00 +0800 CST

Improving FTS phrase search for common terms on Postgres


I have a table defined as follows:

CREATE TABLE details_search (
    id int4 NOT NULL PRIMARY KEY,
    "search" tsvector NULL
);
CREATE INDEX details_search_idx ON details_search USING gin (search);

I ran this to get an idea of its size:

SELECT pg_size_pretty(pg_relation_size('details_search')) relation_size,
       pg_size_pretty(pg_total_relation_size('details_search')) total_relation_size,
       pg_size_pretty(pg_table_size('details_search')) table_size,
       pg_size_pretty(pg_indexes_size('details_search')) indexes_size;

These are the results:

relation_size|total_relation_size|table_size|indexes_size|
-------------+-------------------+----------+------------+
800 MB       |64 GB              |57 GB     |6830 MB     |

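I'm only interested in performing phrase searches, and these searches are used in aggregates. When I run a phrase search with uncommon terms, everything works fine. However, when I use a phrase made of common terms, performance suffers badly.

This query took 192 seconds: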

SELECT COUNT(id)
FROM   details_search
WHERE  search @@ phraseto_tsquery('simple', 'data management')

This is the query plan (the original post also linked to a nicer rendering of it):

  Output: count(id)
  Buffers: shared hit=25942383 read=6354221 written=4588
  I/O Timings: shared/local read=512605.708 write=122.864
  ->  Gather  (cost=178176.43..178176.64 rows=2 width=8) (actual time=192857.512..192861.652 rows=3 loops=1)
        Output: (PARTIAL count(id))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=25942383 read=6354221 written=4588
        I/O Timings: shared/local read=512605.708 write=122.864
        ->  Partial Aggregate  (cost=177176.43..177176.44 rows=1 width=8) (actual time=192852.434..192852.435 rows=1 loops=3)
              Output: PARTIAL count(id)
              Buffers: shared hit=25942383 read=6354221 written=4588
              I/O Timings: shared/local read=512605.708 write=122.864
              Worker 0:  actual time=192851.530..192851.531 rows=1 loops=1
                Buffers: shared hit=8650807 read=2115877 written=1469
                I/O Timings: shared/local read=170775.853 write=38.985
              Worker 1:  actual time=192848.579..192848.581 rows=1 loops=1
                Buffers: shared hit=8623424 read=2115864 written=1551
                I/O Timings: shared/local read=170720.335 write=41.527
              ->  Parallel Bitmap Heap Scan on details_search  (cost=33664.19..173376.94 rows=1519795 width=4) (actual time=1231.216..192758.374 rows=121050 loops=3)
                    Output: id, search
                    Recheck Cond: (search @@ '''data'' <-> ''management'''::tsquery)
                    Rows Removed by Index Recheck: 2268868
                    Heap Blocks: exact=12114 lossy=22061
                    Buffers: shared hit=25942383 read=6354221 written=4588
                    I/O Timings: shared/local read=512605.708 write=122.864
                    Worker 0:  actual time=1230.572..192759.521 rows=121482 loops=1
                      Buffers: shared hit=8650807 read=2115877 written=1469
                      I/O Timings: shared/local read=170775.853 write=38.985
                    Worker 1:  actual time=1227.317..192754.854 rows=120483 loops=1
                      Buffers: shared hit=8623424 read=2115864 written=1551
                      I/O Timings: shared/local read=170720.335 write=41.527
                    ->  Bitmap Index Scan on job_posts_details_search_idx  (cost=0.00..32752.32 rows=3647509 width=0) (actual time=1226.674..1226.675 rows=3956386 loops=1)
                          Index Cond: (search @@ '''data'' <-> ''management'''::tsquery)
                          Buffers: shared hit=832 read=2242
                          I/O Timings: shared/local read=424.365
Settings: effective_cache_size = '13153520kB', search_path = 'public, public, "$user"'
Query Identifier: 1461135140272243366
Planning:
  Buffers: shared hit=194
Planning Time: 7.346 ms
Execution Time: 192861.763 ms

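Most of the time is spent on reads in the Parallel Bitmap Heap Scan. The read speed is also fairly slow, at 97 MB/s, considering that the server has SSDs (and there is an SSD dedicated to data caching). Loading the table with pg_prewarm before the query does not improve this.

I see that it has Recheck Cond: (search @@ '''data'' <-> ''management'''::tsquery), so I guess it is fetching all the job data from disk to check the condition on the actual search column, as if checking the index alone were not enough to verify whether there is a phrase match. That would explain why the problem only occurs with common terms.

What can I do to optimize these phrase searches? I'm happy to consider restrictions on what can be searched (for example, "only query up to 3 words") or server setting changes (to speed up those pesky reads), if that brings consistency to query times.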

Laurenz Albe · Answer 1 · 2024-02-15T20:16:04+08:00

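There is not much you can do about that. A GIN index indexes the individual components, not the phrase. So the "Bitmap Index Scan" gives you all rows that contain both "data" and "management", and the recheck in the "Bitmap Heap Scan" weeds out the 95% of false positives.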
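To see why the index alone cannot confirm a phrase match, here is a small sketch that uses literal strings instead of the table from the question (the sample sentences are made up for illustration). The AND query corresponds to what a lookup on the individual terms can establish; the phrase query corresponds to what the recheck has to verify against the stored tsvector and its positions.

```sql
-- Sample text is hypothetical. The 'simple' configuration keeps every word
-- and records its position.
SELECT to_tsvector('simple', 'management of data');
-- 'data':3 'management':1 'of':2

-- Both lexemes are present, so a lookup on the individual terms matches.
SELECT to_tsvector('simple', 'management of data')
       @@ to_tsquery('simple', 'data & management');       -- true

-- The phrase operator additionally requires 'management' directly after 'data'.
SELECT to_tsvector('simple', 'management of data')
       @@ phraseto_tsquery('simple', 'data management');   -- false

SELECT to_tsvector('simple', 'data management system')
       @@ phraseto_tsquery('simple', 'data management');   -- true
```

Rows like the first sample are the false positives that the heap recheck has to discard.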

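Reading more than 2 million 8kB blocks took only 170 seconds, which suggests that much of the data was cached in the kernel page cache anyway. You may be able to improve performance somewhat by increasing work_mem, so that you no longer get "lossy" blocks.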
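A minimal sketch of how the work_mem suggestion could be tried for a single session, reusing the query from the question. The value 256MB is only an assumption for illustration; the right setting depends on available memory and on how large the bitmap gets.

```sql
-- Raise work_mem for this session only (256MB is an illustrative guess).
SET work_mem = '256MB';

-- Re-run the query and compare the "Heap Blocks" line of the plan:
-- the goal is to see only "exact" blocks and no "lossy" ones.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(id)
FROM   details_search
WHERE  search @@ phraseto_tsquery('simple', 'data management');

-- Return to the configured default afterwards.
RESET work_mem;
```

With lossy blocks, every row on the affected pages has to be rechecked, so removing them reduces the recheck work, although the false positives that come from the index itself remain.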

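If the problem were to speed up the search for a few specific phrases, you could use a synonym dictionary to replace each of them with a single word standing for "data management", which would make the index scan much more efficient. But I guess you want to speed up searches for arbitrary phrases.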
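As a rough sketch of the dictionary idea, the following uses PostgreSQL's thesaurus dictionary template, which is an assumption on my part: it is the built-in template that matches multi-word phrases, whereas the answer only says "synonym dictionary". The names phrases_sample, phrase_thesaurus, simple_phrases and the replacement token datamanagement are all hypothetical, and the thesaurus file has to be placed on the server by someone with filesystem access.

```sql
-- Assumed file: $SHAREDIR/tsearch_data/phrases_sample.ths, containing the line
--   data management : datamanagement

CREATE TEXT SEARCH DICTIONARY phrase_thesaurus (
    TEMPLATE   = thesaurus,
    DictFile   = phrases_sample,       -- file name without the .ths extension
    Dictionary = pg_catalog.simple     -- subdictionary used to normalize input
);

-- A copy of the 'simple' configuration that consults the thesaurus first.
CREATE TEXT SEARCH CONFIGURATION simple_phrases (COPY = pg_catalog.simple);

ALTER TEXT SEARCH CONFIGURATION simple_phrases
    ALTER MAPPING FOR asciiword
    WITH phrase_thesaurus, simple;

-- Inspect how the phrase is now represented in a document and in a query.
SELECT to_tsvector('simple_phrases', 'a data management system');
SELECT phraseto_tsquery('simple_phrases', 'data management');
```

For this to help, the search column would have to be rebuilt from the original text with the new configuration, and queries would have to use the same configuration. It only covers phrases listed in the file, which is why it does not address arbitrary phrases.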
