I have an alerts table like this:
my_db=> \d alerts_alert
Table "public.alerts_alert"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+----------------------------------
received_at | timestamp with time zone | | not null |
id | bigint | | not null | generated by default as identity
data | jsonb | | not null |
updated_at | timestamp with time zone | | not null |
status | character varying(8) | | not null |
owner_id | uuid | | |
resolved_by_id | uuid | | |
Indexes:
"alerts_alert_pkey" PRIMARY KEY, btree (id)
"alerts_aler_data_eae7f5_gin" gin (data)
"alerts_alert_data_gin" gin (to_tsvector('english'::regconfig, COALESCE(data::text, ''::text)))
"alerts_alert_owner_id_0c00548a" btree (owner_id)
"alerts_alert_resolved_by_id_b59cbeaf" btree (resolved_by_id)
Foreign-key constraints:
"alerts_alert_owner_id_0c00548a_fk_accounts_user_id" FOREIGN KEY (owner_id) REFERENCES accounts_user(id) DEFERRABLE INITIALLY DEFERRED
"alerts_alert_resolved_by_id_b59cbeaf_fk_accounts_user_id" FOREIGN KEY (resolved_by_id) REFERENCES accounts_user(id) DEFERRABLE INITIALLY DEFERRED
I want to perform full-text search on the data column.
I came up with this query, but its performance is poor:
my_db=> explain analyze WITH cte AS (
SELECT id, received_at, data, updated_at, owner_id, resolved_by_id, status,
to_tsvector('english'::regconfig, COALESCE(data::text, '')) AS search_vector
FROM alerts_alert
)
SELECT id, received_at, data, updated_at, owner_id, resolved_by_id, status, search_vector,
ts_rank(search_vector, websearch_to_tsquery('english'::regconfig, 'haykd')) AS rank
FROM cte
WHERE search_vector @@ websearch_to_tsquery('english'::regconfig, 'haykd')
ORDER BY rank DESC LIMIT 21;
This is the plan it gives me:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1518.59..1529.61 rows=21 width=194) (actual time=3891.969..3952.546 rows=21 loops=1)
-> Result (cost=1518.59..2195.31 rows=1289 width=194) (actual time=3891.962..3952.522 rows=21 loops=1)
-> Sort (cost=1518.59..1521.81 rows=1289 width=162) (actual time=3868.134..3868.145 rows=21 loops=1)
Sort Key: (ts_rank(to_tsvector('english'::regconfig, COALESCE((alerts_alert.data)::text, ''::text)), '''haykd'''::tsquery)) DESC
Sort Method: top-N heapsort Memory: 28kB
-> Bitmap Heap Scan on alerts_alert (cost=538.11..1483.83 rows=1289 width=162) (actual time=19.143..3862.893 rows=1327 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, COALESCE((data)::text, ''::text)) @@ '''haykd'''::tsquery)
Heap Blocks: exact=202
-> Bitmap Index Scan on alerts_alert_data_gin (cost=0.00..537.79 rows=1289 width=0) (actual time=12.832..12.832 rows=1432 loops=1)
Index Cond: (to_tsvector('english'::regconfig, COALESCE((data)::text, ''::text)) @@ '''haykd'''::tsquery)
Planning Time: 35.525 ms
Execution Time: 3953.748 ms
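The table is not large, only about 12k rows. Any suggestions for making this faster? I am using PostgreSQL 16.
Update:
As suggested in the comments, I ran explain (analyze, buffers) after setting track_io_timing = on. This is what I get: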
my_db=> explain (analyze, buffers) WITH cte AS (
SELECT id, received_at, data, updated_at, owner_id, resolved_by_id, status,
to_tsvector('english'::regconfig, COALESCE(data::text, '')) AS search_vector
FROM alerts_alert
)
SELECT id, received_at, data, updated_at, owner_id, resolved_by_id, status, search_vector,
ts_rank(search_vector, websearch_to_tsquery('english'::regconfig, 'haykd')) AS rank
FROM cte
WHERE search_vector @@ websearch_to_tsquery('english'::regconfig, 'haykd')
ORDER BY rank DESC LIMIT 21;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=3064.58..3075.61 rows=21 width=194) (actual time=3029.762..3089.928 rows=21 loops=1)
Buffers: shared hit=5518
-> Result (cost=3064.58..3766.51 rows=1337 width=194) (actual time=3029.761..3089.919 rows=21 loops=1)
Buffers: shared hit=5518
-> Sort (cost=3064.58..3067.93 rows=1337 width=162) (actual time=3007.831..3007.840 rows=21 loops=1)
Sort Key: (ts_rank(to_tsvector('english'::regconfig, COALESCE((alerts_alert.data)::text, ''::text)), '''haykd'''::tsquery)) DESC
Sort Method: top-N heapsort Memory: 28kB
Buffers: shared hit=5417
-> Bitmap Heap Scan on alerts_alert (cost=2055.61..3028.54 rows=1337 width=162) (actual time=4.503..3005.258 rows=1337 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, COALESCE((data)::text, ''::text)) @@ '''haykd'''::tsquery)
Heap Blocks: exact=204
Buffers: shared hit=5414
-> Bitmap Index Scan on alerts_alert_data_gin (cost=0.00..2055.28 rows=1337 width=0) (actual time=2.097..2.097 rows=1465 loops=1)
Index Cond: (to_tsvector('english'::regconfig, COALESCE((data)::text, ''::text)) @@ '''haykd'''::tsquery)
Buffers: shared hit=482
Planning:
Buffers: shared hit=276 read=2
I/O Timings: shared/local read=1.171
Planning Time: 35.782 ms
Execution Time: 3090.356 ms
I am running this on AWS RDS, on a db.t4g.micro instance.
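The full-text index is used for the Bitmap Index Scan, but any time the engine needs the tsvector for something else in the query, data::text is run through the full-text parser again. That part is CPU-intensive and seriously degrades search performance unless the text contents are small. In particular, the ts_rank() call causes to_tsvector('english'::regconfig, COALESCE(data::text, '')) to be evaluated for all 1337 matching rows in the example, which matches the plan: the index scan finishes in about 2 ms, while the Bitmap Heap Scan above it takes about 3 seconds with every buffer already in cache. This can be avoided by materializing the expression as a column in the table (and, of course, creating the GIN index on that column).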
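A minimal sketch of that materialization, assuming a stored generated column is acceptable here (available since PostgreSQL 12). The column name search_vector and the index name alerts_alert_search_vector_gin are hypothetical, chosen only for illustration. Adding a stored generated column rewrites the table, which should be quick at 12k rows. If the schema is managed by an ORM's migrations, the change would need to be mirrored there.

```sql
-- Store the tsvector once per row instead of recomputing it in every query.
-- "search_vector" is a hypothetical column name.
ALTER TABLE alerts_alert
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english'::regconfig, COALESCE(data::text, ''))
    ) STORED;

-- Index the stored column directly (hypothetical index name).
CREATE INDEX alerts_alert_search_vector_gin
    ON alerts_alert USING gin (search_vector);

-- Optional: once nothing relies on the old expression index, it can go.
-- DROP INDEX alerts_alert_data_gin;

-- The query then filters and ranks on the stored column,
-- so no row has to be re-parsed at search time.
SELECT id, received_at, data, updated_at, owner_id, resolved_by_id, status,
       ts_rank(search_vector, q) AS rank
FROM alerts_alert,
     websearch_to_tsquery('english'::regconfig, 'haykd') AS q
WHERE search_vector @@ q
ORDER BY rank DESC
LIMIT 21;
```

Note that the sketch leaves search_vector out of the select list; returning it is possible, but it is usually not needed by the client. Re-running explain (analyze, buffers) on the rewritten query should show whether the time spent in the Bitmap Heap Scan has dropped.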