I am running PostgreSQL 11 on a Mac with shared_buffers set to 3 GB. I have a job table containing 5 million rows. The table structure is:
Table "public.job"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
id | uuid | | not null |
name | text | | |
created_on | timestamp with time zone | | |
updated_on | timestamp with time zone | | |
Indexes:
"job_pkey" PRIMARY KEY, btree (id)
"job_created_on_idx" btree (created_on)
"job_name_idx" btree (name)
"job_updated_on_idx" btree (updated_on)
"job_updated_on_name_compound_asc_idx" btree (updated_on, upper(name))
"job_updated_on_name_compound_desc_idx" btree (updated_on DESC, upper(name))
Note that I have created compound indexes on the updated_on and name columns.
When I run the query
select name, created_on from job where created_on >= '2023-10-08 00:00:00+08'::timestamp with time zone AND created_on < '2023-10-16 00:00:00+08' ORDER BY updated_on ASC, UPPER(name::text) ASC limit 25
PostgreSQL uses the compound index job_updated_on_name_compound_asc_idx, and the query takes more than 4 seconds.
Execution plan:
Limit (cost=0.43..102.29 rows=25 width=61) (actual time=4549.668..4550.235 rows=25 loops=1)
Buffers: shared hit=4859940
-> Index Scan using job_updated_on_name_compound_asc_idx on job (cost=0.43..416764.16 rows=102293 width=61) (actual time=4549.667..4550.230 rows=25 loops=1)
Filter: ((created_on >= '2023-10-08 00:00:00+08'::timestamp with time zone) AND (created_on < '2023-10-16 00:00:00+08'::timestamp with time zone))
Rows Removed by Filter: 4828894
Buffers: shared hit=4859940
Planning Time: 0.218 ms
Execution Time: 4550.260 ms
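As a side check on where the Limit node's estimate of 102.29 comes from, the short sketch below recomputes it from the Index Scan figures in the plan above. It assumes the planner charges the startup cost plus the fraction of the run cost needed to return 25 of the 102293 rows it expects to pass the filter, which in turn assumes those rows are spread evenly through the index order. The variable names are my own.

```python
# Estimates copied from the Index Scan node in the plan above.
startup_cost = 0.43
total_cost = 416764.16
estimated_rows = 102293  # rows the planner expects to pass the created_on filter
limit = 25

# Assumption: the Limit node pays the startup cost plus a proportional
# share of the run cost, as if matching rows were evenly distributed
# along the (updated_on, upper(name)) index order.
limit_cost = startup_cost + (total_cost - startup_cost) * limit / estimated_rows

# Rows the scan actually had to read before it found 25 matches.
rows_read = 4828894 + 25

print(f"estimated Limit cost: {limit_cost:.2f}")
print(f"rows actually read:   {rows_read}")
```

The recomputed value agrees with the 102.29 shown on the Limit node, while the scan in fact had to read almost the whole table before finding 25 rows in the requested date range.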
The created_on column has an index, but it is not used.
I can force PostgreSQL to use the created_on index by appending id to the ORDER BY clause. The query is
select name, created_on from job where created_on >= '2023-10-08 00:00:00+08'::timestamp with time zone AND created_on < '2023-10-16 00:00:00+08' ORDER BY updated_on ASC, UPPER(name::text) ASC, id limit 25;
This time, PostgreSQL uses the index on created_on and returns the result very quickly.
Execution plan:
Limit (cost=52190.61..52193.52 rows=25 width=77) (actual time=125.192..138.055 rows=25 loops=1)
Buffers: shared hit=42788
-> Gather Merge (cost=52190.61..62136.44 rows=85244 width=77) (actual time=125.191..138.049 rows=25 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=42788
-> Sort (cost=51190.58..51297.14 rows=42622 width=77) (actual time=119.359..119.362 rows=20 loops=3)
Sort Key: updated_on, (upper(name)), id
Sort Method: top-N heapsort Memory: 30kB
Worker 0: Sort Method: top-N heapsort Memory: 31kB
Worker 1: Sort Method: top-N heapsort Memory: 31kB
Buffers: shared hit=42788
-> Parallel Bitmap Heap Scan on job (cost=2512.94..49987.82 rows=42622 width=77) (actual time=19.915..109.984 rows=36562 loops=3)
Recheck Cond: ((created_on >= '2023-10-08 00:00:00+08'::timestamp with time zone) AND (created_on < '2023-10-16 00:00:00+08'::timestamp with time zone))
Heap Blocks: exact=24557
Buffers: shared hit=42738
-> Bitmap Index Scan on job_created_on_idx (cost=0.00..2487.36 rows=102293 width=0) (actual time=16.909..16.909 rows=109685 loops=1)
Index Cond: ((created_on >= '2023-10-08 00:00:00+08'::timestamp with time zone) AND (created_on < '2023-10-16 00:00:00+08'::timestamp with time zone))
Buffers: shared hit=395
Planning Time: 0.168 ms
Execution Time: 138.115 ms
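To put the two plans side by side, here is a small sketch that compares the estimated cost, execution time, and buffer hits reported on the two Limit nodes above. The two queries are not identical (the second one adds id to the ORDER BY), so this is only an approximate comparison, and the labels are my own.

```python
# Figures copied from the top-level nodes of the two plans above.
plans = {
    "compound index scan": {"est_cost": 102.29, "actual_ms": 4550.260, "buffers": 4859940},
    "bitmap scan + sort": {"est_cost": 52193.52, "actual_ms": 138.115, "buffers": 42788},
}

# The planner ranks plans by estimated cost, not by measured time.
cheapest_by_estimate = min(plans, key=lambda name: plans[name]["est_cost"])
fastest_in_practice = min(plans, key=lambda name: plans[name]["actual_ms"])

slow, fast = plans["compound index scan"], plans["bitmap scan + sort"]
time_ratio = slow["actual_ms"] / fast["actual_ms"]
buffer_ratio = slow["buffers"] / fast["buffers"]

print("cheapest by estimate:", cheapest_by_estimate)
print("fastest in practice: ", fastest_in_practice)
print(f"time ratio: {time_ratio:.1f}x, buffer ratio: {buffer_ratio:.1f}x")
```

The plan with the far lower estimated cost is the one that ran roughly 33 times slower and touched over 100 times as many buffers, which suggests the estimate for the compound index scan is what is off.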
If the database is busy updating rows with large columns, the difference in execution time becomes even larger.
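The compound indexes were created to improve sort performance, and they are very useful in some cases. Because my system generates SQL dynamically based on user selections, the query conditions and sort order can vary. In this particular case, appending id to the ORDER BY clause to avoid the compound index improves performance, but the compound index may well be the better choice in other cases, so I cannot simply drop it.
I also checked the pg_stats view; the result is as follows: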
attname | inherited | n_distinct | most_common_vals
------------+-----------+------------+------------------
id | f | -1 |
name | f | -1 |
created_on | f | -0.908167 |
updated_on | f | -1 |
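For reading the n_distinct column: a negative value is the negative of the number of distinct values divided by the row count, so -1 means every value is estimated to be unique. The sketch below converts the figures above into absolute counts, assuming the table has exactly 5 million rows (the post only gives an approximate figure). The helper name is my own.

```python
# Assumption: the "5 million rows" figure is taken as an exact row count.
approx_rows = 5_000_000

# n_distinct values copied from the pg_stats output above.
n_distinct = {"id": -1, "name": -1, "created_on": -0.908167, "updated_on": -1}

def estimated_distinct(value, rows):
    """Negative n_distinct is a fraction of the row count; positive is a count."""
    return round(-value * rows) if value < 0 else round(value)

distinct_counts = {col: estimated_distinct(v, approx_rows) for col, v in n_distinct.items()}

for col, count in distinct_counts.items():
    print(f"{col}: about {count} distinct values")
```

Under that assumption, created_on is estimated to have about 4.54 million distinct values, and the other three columns are treated as fully unique.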
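I have two questions:
- For the query above, using the index on created_on is clearly better. Why does PostgreSQL choose the compound index that matches the ORDER BY clause? Is there anything I can configure in PostgreSQL to make it use the right index?
- It looks like PostgreSQL does not use indexes for both the query condition and the ORDER BY in the same query. Although the column used in the Filter is indexed, the condition is applied only as a Filter under the compound index scan. Can PostgreSQL use both the compound index for the ORDER BY and the index on the condition column in a single query?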