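I have a query that joins two tables but refuses to use an index unless I add a LIMIT. The case is a bit complicated, so it is best described by showing it.
I have two tables, users and user_latest_locations, which look like this. The users table has about 30 columns for various things.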
CREATE TABLE users (
id serial PRIMARY KEY,
name character varying(50) NOT NULL,
description character varying(1000) NOT NULL,
gender gender NOT NULL,
looking_for_gender gender NOT NULL,
latest_location geometry,
search_radius integer NOT NULL,
search_min_age integer NOT NULL,
search_max_age integer NOT NULL,
email character varying(50),
password character varying(60) NOT NULL,
birthdate date NOT NULL,
last_activity timestamp without time zone NOT NULL,
updated_time timestamp without time zone NOT NULL,
created_time timestamp without time zone NOT NULL,
status status NOT NULL DEFAULT 'pending'::status,
country_code integer,
mobile_number character varying(32),
hide_contacts boolean NOT NULL DEFAULT true,
occupation character varying(100) NOT NULL DEFAULT ''::character varying,
school character varying(100) NOT NULL DEFAULT ''::character varying,
hometown character varying(100) NOT NULL DEFAULT ''::character varying,
hangouts character varying(1000) NOT NULL DEFAULT ''::character varying,
popularity double precision NOT NULL DEFAULT 0,
job character varying(100) NOT NULL DEFAULT ''::character varying,
preview_push_message boolean NOT NULL DEFAULT true,
region_ids integer[],
intent intent NOT NULL DEFAULT 'fate'::intent,
hide_mutual_contacts boolean NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX users_mobile_idx ON users (mobile_number, country_code);
CREATE TABLE user_latest_locations (
user_id integer NOT NULL PRIMARY KEY,
last_activity timestamp without time zone NOT NULL,
latest_location geometry
);
Both tables contain one row per user, around 1 million rows in total.
The query looks like this:
SELECT * FROM users u
LEFT JOIN user_latest_locations ull
ON ull.user_id = u.id
WHERE mobile_number IN (
SELECT unnest('{1}'::text[]) as number
UNION
SELECT unnest('{1}'::text[]) as number)
The resulting query plan looks like this:
Hash Right Join (cost=24084.22..82388.58 rows=453489 width=306) (actual time=153.343..1123.307 rows=10 loops=1)
Hash Cond: (ull.user_id = u.id)
-> Seq Scan on user_latest_locations ull (cost=0.00..17549.25 rows=950725 width=44) (actual time=0.010..471.724 rows=949813 loops=1)
-> Hash (cost=2472.61..2472.61 rows=453489 width=262) (actual time=0.527..0.527 rows=10 loops=1)
Buckets: 16384 Batches: 64 Memory Usage: 129kB
-> Nested Loop (cost=3.94..2472.61 rows=453489 width=262) (actual time=0.111..0.218 rows=10 loops=1)
-> HashAggregate (cost=3.51..5.51 rows=200 width=0) (actual time=0.057..0.059 rows=1 loops=1)
Group Key: (unnest('{1}'::text[]))
-> Append (cost=0.00..3.01 rows=200 width=0) (actual time=0.023..0.040 rows=2 loops=1)
-> Result (cost=0.00..0.51 rows=100 width=0) (actual time=0.019..0.022 rows=1 loops=1)
-> Result (cost=0.00..0.51 rows=100 width=0) (actual time=0.004..0.006 rows=1 loops=1)
-> Index Scan using users_mobile_idx on users u (cost=0.42..12.31 rows=2 width=262) (actual time=0.042..0.118 rows=10 loops=1)
Index Cond: ((mobile_number)::text = (unnest('{1}'::text[])))
Planning time: 1.310 ms
Execution time: 1123.851 ms
Why does the query do a Seq Scan on user_latest_locations? Wouldn't it be faster to use its primary key? If I add a LIMIT to the union, as below, it starts using the index as expected.
SELECT * FROM users u
LEFT JOIN user_latest_locations ull
ON ull.user_id = u.id
WHERE mobile_number IN (
SELECT unnest('{1}'::text[]) as number
UNION
SELECT unnest('{1}'::text[]) as number limit 100)
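The problem is that this query lives inside a function that takes an array of mobile numbers of variable length, and an arbitrary limit inside the function would be ugly.
I should also add that the statistics for this table should indicate that most of its values are unique: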
n_distinct | correlation
------------+-------------
-1 | 0.0183006
The behaviour is the same on PostgreSQL 9.3 and 9.5. Running ANALYZE on the tables did not help.
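The planner does not look into your unnest operations, so it does not see that the array holds only a single value of 1. Its stock assumption is that an array of unknown size has 100 elements, which gives 200 rows across the two branches of the union (the rows=100 and rows=200 estimates in the plan). It therefore expects a large number of users (453489) to qualify, and concludes that reading all of user_latest_locations and hash-joining it will be faster than doing 453489 individual index lookups. One possible solution is to run the unnest...union...unnest query separately, store its result in an array variable, and then pass that array to the main query. That way the planner is more likely to see the true size of the array.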
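A minimal PL/pgSQL sketch of that two-step approach follows, assuming the query sits inside a function as the question describes. The function name find_users_by_mobile, its two parameters, and the reduced set of output columns are hypothetical. Only the tables and columns come from the question. The main query uses = ANY with a parameter instead of the IN subquery, and runs through EXECUTE ... USING so that it is planned with the actual array value on each call. Whether that changes the plan on a given installation should be checked with EXPLAIN ANALYZE.

```sql
-- Hypothetical function name, parameters, and output column list;
-- the tables and columns themselves are taken from the question.
CREATE OR REPLACE FUNCTION find_users_by_mobile(numbers_a text[], numbers_b text[])
RETURNS TABLE (
    user_id         integer,
    mobile_number   text,
    last_activity   timestamp without time zone,
    latest_location geometry
)
LANGUAGE plpgsql STABLE
AS $$
DECLARE
    numbers text[];
BEGIN
    -- Step 1: run the unnest ... UNION ... unnest part on its own and
    -- collapse the deduplicated result into a single array variable.
    SELECT array_agg(s.n) INTO numbers
    FROM (
        SELECT unnest(numbers_a) AS n
        UNION
        SELECT unnest(numbers_b)
    ) AS s;

    -- Step 2: pass the array to the main query as a parameter value.
    -- A statement run through EXECUTE is planned on each call, so the
    -- planner can estimate from the real array instead of a default.
    RETURN QUERY EXECUTE
        'SELECT u.id, u.mobile_number::text, ull.last_activity, ull.latest_location
           FROM users u
           LEFT JOIN user_latest_locations ull ON ull.user_id = u.id
          WHERE u.mobile_number = ANY ($1)'
    USING numbers;
END;
$$;

-- Example call with the same single-element arrays as in the question:
SELECT * FROM find_users_by_mobile('{1}', '{1}');
```

If both input arrays are empty, array_agg returns NULL and the main query returns no rows, which matches what the original IN subquery would produce.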