
Question

Asked: 2016-03-10 02:15:05 +0800 CST

Why is the index not used for an IN expression with an unnested array?


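I have a query that joins two tables but refuses to use an index unless I add a LIMIT. The case is a bit involved, so it is easiest to describe by showing it.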

I have two tables, users and user_latest_locations. They look like this; the users table has about 30 columns used for various things.

CREATE TABLE users (
    id serial PRIMARY KEY,
    name character varying(50) NOT NULL,
    description character varying(1000) NOT NULL,
    gender gender NOT NULL,
    looking_for_gender gender NOT NULL,
    latest_location geometry,
    search_radius integer NOT NULL,
    search_min_age integer NOT NULL,
    search_max_age integer NOT NULL,
    email character varying(50),
    password character varying(60) NOT NULL,
    birthdate date NOT NULL,
    last_activity timestamp without time zone NOT NULL,
    updated_time timestamp without time zone NOT NULL,
    created_time timestamp without time zone NOT NULL,
    status status NOT NULL DEFAULT 'pending'::status,
    country_code integer,
    mobile_number character varying(32),
    hide_contacts boolean NOT NULL DEFAULT true,
    occupation character varying(100) NOT NULL DEFAULT ''::character varying,
    school character varying(100) NOT NULL DEFAULT ''::character varying,
    hometown character varying(100) NOT NULL DEFAULT ''::character varying,
    hangouts character varying(1000) NOT NULL DEFAULT ''::character varying,
    popularity double precision NOT NULL DEFAULT 0,
    job character varying(100) NOT NULL DEFAULT ''::character varying,
    preview_push_message boolean NOT NULL DEFAULT true,
    region_ids integer[],
    intent intent NOT NULL DEFAULT 'fate'::intent,
    hide_mutual_contacts boolean NOT NULL DEFAULT false
);

CREATE UNIQUE INDEX users_mobile_idx ON users (mobile_number, country_code);

CREATE TABLE user_latest_locations (
  user_id integer NOT NULL PRIMARY KEY,
  last_activity timestamp without time zone NOT NULL,
  latest_location geometry
);

Both tables contain one row per user, about 1 million rows in total.
The query looks like this:

SELECT * FROM users u
  LEFT JOIN user_latest_locations ull 
  ON ull.user_id = u.id
WHERE mobile_number IN (
  SELECT unnest('{1}'::text[]) as number 
  UNION 
  SELECT unnest('{1}'::text[]) as number)

The resulting query plan looks like this:

Hash Right Join  (cost=24084.22..82388.58 rows=453489 width=306) (actual time=153.343..1123.307 rows=10 loops=1)
  Hash Cond: (ull.user_id = u.id)
  ->  Seq Scan on user_latest_locations ull  (cost=0.00..17549.25 rows=950725 width=44) (actual time=0.010..471.724 rows=949813 loops=1)
  ->  Hash  (cost=2472.61..2472.61 rows=453489 width=262) (actual time=0.527..0.527 rows=10 loops=1)
        Buckets: 16384  Batches: 64  Memory Usage: 129kB
        ->  Nested Loop  (cost=3.94..2472.61 rows=453489 width=262) (actual time=0.111..0.218 rows=10 loops=1)
              ->  HashAggregate  (cost=3.51..5.51 rows=200 width=0) (actual time=0.057..0.059 rows=1 loops=1)
                    Group Key: (unnest('{1}'::text[]))
                    ->  Append  (cost=0.00..3.01 rows=200 width=0) (actual time=0.023..0.040 rows=2 loops=1)
                          ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.019..0.022 rows=1 loops=1)
                          ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.004..0.006 rows=1 loops=1)
              ->  Index Scan using users_mobile_idx on users u  (cost=0.42..12.31 rows=2 width=262) (actual time=0.042..0.118 rows=10 loops=1)
                    Index Cond: ((mobile_number)::text = (unnest('{1}'::text[])))
Planning time: 1.310 ms
Execution time: 1123.851 ms

Why does the query do a Seq Scan on user_latest_locations? Wouldn't using its primary key be faster? If I add a LIMIT to the union, as shown below, it starts using the index as expected.

SELECT * FROM users u
  LEFT JOIN user_latest_locations ull 
  ON ull.user_id = u.id
WHERE mobile_number IN (
  SELECT unnest('{1}'::text[]) as number 
  UNION 
  SELECT unnest('{1}'::text[]) as number limit 100)

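The problem is that this lives in a function that takes a variable-length array of mobile numbers, and having an arbitrary limit inside the function would be ugly.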

I should also add that the statistics for this table indicate that most of its values are unique:

 n_distinct | correlation 
------------+-------------
         -1 |   0.0183006

The behavior is the same on PostgreSQL 9.3 and 9.5. Running ANALYZE on the tables did not help.

jjanes · Answer 1 · 2016-03-10T13:32:35+08:00

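The planner does not look into your unnest operations, so it does not see that each array holds only a single value, 1. Its standard assumption is that an unknown array has 100 elements, which gives 200 elements for the two branches of the union. It therefore expects a large number of users (453489) to qualify, and concludes that reading all of user_latest_locations and hash joining it will be faster than doing 453489 individual index lookups.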
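One way to see this guess in isolation is to run EXPLAIN on just the subquery, without the join. This is only a sketch that reuses the literals from the question. In the plan posted above, each unnest branch is estimated at rows=100 while actually returning 1 row; the estimates may differ on other PostgreSQL versions.

```sql
-- Estimated row count for one unnest in the select list.
EXPLAIN
SELECT unnest('{1}'::text[]) AS number;

-- Estimated versus actual row counts for the whole IN-list subquery.
EXPLAIN ANALYZE
SELECT unnest('{1}'::text[]) AS number
UNION
SELECT unnest('{1}'::text[]) AS number;
```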

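One possible solution is to run the unnest...union...unnest query separately and store its result in an array variable, then pass that array to the main query. That way the planner is more likely to see the true size of the array.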
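A minimal PL/pgSQL sketch of that two-step approach follows. The function name, its two array parameters, and the choice of returned columns are assumptions made for illustration; the question does not show the real function. The main query is run through EXECUTE ... USING, which plans the statement on each call with the parameter value available, so the planner has a chance to use the actual array length instead of a default guess.

```sql
-- Sketch only: find_users_by_mobile, numbers_a, numbers_b and the returned
-- columns are hypothetical. The tables are the ones defined in the question.
CREATE OR REPLACE FUNCTION find_users_by_mobile(numbers_a text[], numbers_b text[])
RETURNS TABLE (
    id integer,
    name character varying,
    mobile_number character varying,
    last_activity timestamp without time zone
)
LANGUAGE plpgsql STABLE
AS $$
DECLARE
    wanted text[];  -- de-duplicated mobile numbers
BEGIN
    -- Step 1: run the unnest ... UNION ... unnest on its own and collect
    -- the result into an array variable.
    SELECT array_agg(n.number)
      INTO wanted
      FROM (
          SELECT unnest(numbers_a) AS number
          UNION
          SELECT unnest(numbers_b) AS number
      ) AS n;

    -- Step 2: pass the finished array to the main query as a parameter.
    -- The query text is a string, so its column names cannot clash with
    -- the output column names declared in RETURNS TABLE.
    RETURN QUERY EXECUTE
        'SELECT u.id, u.name, u.mobile_number, ull.last_activity
           FROM users u
           LEFT JOIN user_latest_locations ull ON ull.user_id = u.id
          WHERE u.mobile_number = ANY ($1)'
        USING wanted;
END;
$$;
```

It could then be called as, for example, SELECT * FROM find_users_by_mobile('{1}', '{1}'); comparing the EXPLAIN ANALYZE output of the inner query with and without the array parameter is the way to confirm whether the index on user_latest_locations is actually chosen.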
