Vejo algum comportamento estranho ao usar índices gin_trgm_ops
ou gist_trgm_ops
. Parece haver uma grande diferença nos planos ao usar say ILIKE
ou ~
e pesquisar frases ascii versus frases char multibyte. Como se houvesse um custo maior quando o operando é um operando não ASCII.
O que estou vendo é esperado? Qual é a razão para isso?
Eu tentei no Postgreql 12 e 13 mais recente de cada.
Aqui está um cenário:
CREATE DATABASE postgres
WITH
OWNER = postgres
ENCODING = 'UTF8'
LC_COLLATE = 'en_US.utf8'
LC_CTYPE = 'en_US.utf8'
TABLESPACE = pg_default
CONNECTION LIMIT = -1
IS_TEMPLATE = False;
-- snip
CREATE TABLE test_table (
id uuid PRIMARY KEY,
label varchar
);
-- insert 1m rows
VACUUM ANALYZE test_table;
No conjunto de dados, tenho 10 rótulos contendo 'acl'
e 10 contendo '定す'
.
Ao usar o índice GIN
CREATE INDEX test_table_label_gin_idx
ON test_table USING gin
(label gin_trgm_ops);
Eu vejo o seguinte.
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;
Limit (cost=1000.00..16573.18 rows=100 width=52) (actual time=392.153..395.095 rows=10 loops=1)
-> Gather (cost=1000.00..16728.91 rows=101 width=52) (actual time=392.135..394.830 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on test_table (cost=0.00..15718.81 rows=42 width=52) (actual time=382.922..388.082 rows=3 loops=3)
Filter: ((label)::text ~~* '%定す%'::text)
Rows Removed by Filter: 338417
Planning Time: 0.656 ms
Execution Time: 395.233 ms
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;
Limit (cost=28.78..400.51 rows=100 width=52) (actual time=0.072..0.406 rows=10 loops=1)
-> Bitmap Heap Scan on test_table (cost=28.78..404.23 rows=101 width=52) (actual time=0.053..0.197 rows=10 loops=1)
Recheck Cond: ((label)::text ~~* '%acl%'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on test_table_label_gin_idx (cost=0.00..28.76 rows=101 width=0) (actual time=0.025..0.034 rows=10 loops=1)
Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.231 ms
Execution Time: 0.542 ms
Com GIST
DROP INDEX test_table_label_gin_idx;
CREATE INDEX test_table_label_gist_idx
ON test_table USING gist
(label gist_trgm_ops);
Eu vejo
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;
Limit (cost=13.19..384.92 rows=100 width=52) (actual time=303.772..1557.498 rows=10 loops=1)
-> Bitmap Heap Scan on test_table (cost=13.19..388.64 rows=101 width=52) (actual time=303.752..1557.286 rows=10 loops=1)
Recheck Cond: ((label)::text ~~* '%定す%'::text)
Rows Removed by Index Recheck: 1015250
Heap Blocks: exact=10431
-> Bitmap Index Scan on test_table_label_gist_idx (cost=0.00..13.17 rows=101 width=0) (actual time=301.046..301.053 rows=1015260 loops=1)
Index Cond: ((label)::text ~~* '%定す%'::text)
Planning Time: 0.215 ms
Execution Time: 1557.643 ms
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;
Limit (cost=13.19..384.92 rows=100 width=52) (actual time=257.385..257.751 rows=10 loops=1)
-> Bitmap Heap Scan on test_table (cost=13.19..388.64 rows=101 width=52) (actual time=257.366..257.551 rows=10 loops=1)
Recheck Cond: ((label)::text ~~* '%acl%'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on test_table_label_gist_idx (cost=0.00..13.17 rows=101 width=0) (actual time=257.319..257.328 rows=10 loops=1)
Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.377 ms
Execution Time: 257.948 ms
Apenas mudar os caracteres do operando muda bastante o plano.
Editar
SELECT show_trgm('定す');
"{0x145ed8,0x6628fa,0x6cb12d}"
SELECT encode('定す', 'escape')
\345\256\232\343\201\231
Este problema parece semelhante ao Postgresql não usar o índice trigrama GIN ao executar uma consulta LIKE não ASCII?