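I am new to SQL and PostgreSQL. I am trying to figure out how to get this type of query to use an index for its trigram operations. I am using PostgreSQL 12.7 on x86_64-pc-linux-gnu.
The basic idea is that we take a search phrase, split it into distinct words, and then see how many "similarity" hits we get in a database column of search-ready names. The more words that are similar to a word found in the search-ready name, the higher the score. We also compare the overall search phrase against the original name and use that as a multiplier to "boost" the weight.
The dpl_base table has 71,000 rows and looks like this:
The dpl_codes table has 100 rows and looks like this:
So far I have tried both: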
create index trgm_idx_gist_dpl_base on dpl_base using gist (denied_name_searchable, denied_name_original gist_trgm_ops);
create index trgm_idx_gin_dpl_base on dpl_base using gin (denied_name_searchable, denied_name_original gin_trgm_ops);
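along with various other "standard" indexes. EXPLAIN ANALYZE gives exactly the same plan for the query with or without the indexes, so the indexes do not seem to make any difference. The query runs pretty fast, usually in under 3 seconds. Maybe I am chasing something I do not need... I am just trying to learn how to index properly for a query of this design: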
SET pg_trgm.similarity_threshold = 0.35;
SELECT
/* create weighting value for the distinct-word hits within the SEARCHABLE column */
/* multiply by the similarity value for the original search phrase, against the ORIGINAL column */
(
('BAD' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int +
('ACTOR' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int
)
* (-(DPLB.DENIED_NAME_ORIGINAL <-> 'Bad Actor') + 1) AS WEIGHT,
/* add in the remaining columns from our two tables */
DPLB.DENIED_NAME_ORIGINAL, DPLC.DENIAL_REASON
FROM DPL_BASE DPLB
INNER JOIN DPL_CODES DPLC ON DPLB.DENIAL_CODE = DPLC.DENIAL_CODE
WHERE
/* must have at least one hit from our distinct words, in the SEARCHABLE column */
(
('Bad' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int +
('Actor' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int
) > 0
ORDER BY WEIGHT DESC, DPLB.DENIED_NAME_ORIGINAL ASC;
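Here is an example of the query plan. Any tips or suggestions on (a) the proper indexing approach and/or (b) better query design or optimization would be greatly appreciated.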
|QUERY PLAN |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|Gather Merge (cost=9448.31..11733.27 rows=19584 width=62) (actual time=525.228..529.633 rows=204 loops=1) |
| Workers Planned: 2 |
| Workers Launched: 2 |
| -> Sort (cost=8448.29..8472.77 rows=9792 width=62) (actual time=519.954..520.104 rows=68 loops=3) |
| Sort Key: (((((('YOUTH'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer))::double precision * ((- ((dplb.denied_name_original)::text <-> 'Youth Society'::text)) + '1'::double precision))) DESC, dplb.denied_name_original|
| Sort Method: quicksort Memory: 34kB |
| Worker 0: Sort Method: quicksort Memory: 34kB |
| Worker 1: Sort Method: quicksort Memory: 34kB |
| -> Hash Join (cost=4.25..7799.21 rows=9792 width=62) (actual time=23.524..519.630 rows=68 loops=3) |
| Hash Cond: (dplb.denial_code = dplc.denial_code) |
| -> Parallel Seq Scan on dpl_base dplb (cost=0.00..7229.60 rows=9792 width=70) (actual time=22.937..516.516 rows=68 loops=3) |
| Filter: (((('YOUTH'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer) > 0) |
| Rows Removed by Filter: 23432 |
| -> Hash (cost=3.00..3.00 rows=100 width=38) (actual time=0.401..0.407 rows=100 loops=3) |
| Buckets: 1024 Batches: 1 Memory Usage: 15kB |
| -> Seq Scan on dpl_codes dplc (cost=0.00..3.00 rows=100 width=38) (actual time=0.044..0.216 rows=100 loops=3) |
|Planning Time: 0.399 ms |
|Execution Time: 530.078 ms
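Casting booleans to integers and then doing arithmetic on them is certainly going to screw up the indexing.
The (...)::int + (...)::int > 0 test in your WHERE clause should be the same thing as the two % tests joined with OR.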
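As a sketch of that rewrite, using the search words and column names from the question (I have not run it against your data):

```sql
-- Keeps the same rows as the (...)::int + (...)::int > 0 filter,
-- but leaves the % comparisons as plain boolean conditions.
SELECT DPLB.DENIED_NAME_ORIGINAL
FROM DPL_BASE DPLB
WHERE 'Bad'   % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE), ' '))
   OR 'Actor' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE), ' '));
```

The integer arithmetic can stay in the SELECT list for computing WEIGHT; only the WHERE clause matters for index use.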
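Only the latter has a better chance of being indexed. Also,
each % ANY(STRING_TO_ARRAY(...)) test should be similar to, but not exactly the same as, testing the word against the whole column with a word-similarity operator.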
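One candidate for that whole-column form is pg_trgm's word-similarity operator <%. Picking that operator is an assumption on my part about what fits here. It is controlled by pg_trgm.word_similarity_threshold rather than pg_trgm.similarity_threshold, and it compares against runs of trigrams inside the column value rather than against each space-delimited word, which is why the results will not be identical. A minimal sketch:

```sql
-- Assumes the pg_trgm extension is installed, as in the question.
-- The threshold value is illustrative; it simply mirrors the question's setting.
SET pg_trgm.word_similarity_threshold = 0.35;

-- UPPER() is left out so the predicate refers to the bare column;
-- pg_trgm ignores case when it extracts trigrams.
SELECT DPLB.DENIED_NAME_ORIGINAL
FROM DPL_BASE DPLB
WHERE 'Bad'   <% DPLB.DENIED_NAME_SEARCHABLE
   OR 'Actor' <% DPLB.DENIED_NAME_SEARCHABLE;
```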
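That form, again, has at least some chance of using the index. Alternatively, break your table out into a separate table that has one row for each element of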
string_to_array(upper((denied_name_searchable)::text), ' '::text)
so that you don't need to break it up on the fly.
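A rough sketch of that one-row-per-word layout follows. The table name dpl_base_words, the column name word, and the index name are hypothetical, and since I can't see a key column for dpl_base, the sketch carries the original name and denial code along instead:

```sql
-- Hypothetical helper table: one row per word of each searchable name.
CREATE TABLE dpl_base_words AS
SELECT DPLB.DENIED_NAME_ORIGINAL,
       DPLB.DENIAL_CODE,
       W.WORD
FROM DPL_BASE DPLB
CROSS JOIN LATERAL
     UNNEST(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE), ' ')) AS W(WORD);

-- Trigram index directly on the single-word column.
CREATE INDEX trgm_idx_gin_dpl_base_words
    ON dpl_base_words USING gin (word gin_trgm_ops);

-- The % test now applies to a plain column, not to an array built per row.
SELECT DISTINCT DENIED_NAME_ORIGINAL
FROM dpl_base_words
WHERE WORD % 'BAD'
   OR WORD % 'ACTOR';
```

The helper table would have to be kept in sync with dpl_base, for example by rebuilding it whenever the list is reloaded.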
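The operator class in your index definitions also does not distribute over the comma in the column list. You need to specify it for each column. So that index cannot be used for trigram searches on "denied_name_searchable" at all. Also, it doesn't seem to make any sense to include "denied_name_original" in the index in the first place.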
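For illustration, this is how the operator class would be attached per column. The index names are made up, and either index can only help a predicate that applies a trigram operator to the column itself, not one that goes through STRING_TO_ARRAY:

```sql
-- Single-column trigram index on the column the WHERE clause searches.
CREATE INDEX trgm_idx_gin_dpl_base_searchable
    ON dpl_base USING gin (denied_name_searchable gin_trgm_ops);

-- If both columns really were wanted, each one needs its own operator class.
CREATE INDEX trgm_idx_gin_dpl_base_both
    ON dpl_base USING gin (denied_name_searchable gin_trgm_ops,
                           denied_name_original   gin_trgm_ops);
```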