Question

rubik

Asked: 2018-09-12 03:19:56 +0800 CST

How to get rid of a Bitmap Heap Scan if the table already has the correct indexes

I am running PostgreSQL 9.6. These are the relevant definitions:

CREATE TABLE IF NOT EXISTS instagram.profiles_1000 (
    id                          SERIAL PRIMARY KEY,
    username                    VARCHAR(255) NOT NULL UNIQUE,
    followers                   BIGINT,
    tsv                         TSVECTOR
);

CREATE UNIQUE INDEX IF NOT EXISTS instagram_username_index
    ON instagram.profiles_1000(username);
CREATE INDEX IF NOT EXISTS instagram_followers_index
    ON instagram.profiles_1000(followers);
CREATE INDEX IF NOT EXISTS instagram_textsearch_index
    ON instagram.profiles_1000 USING GIN(tsv);

The text vector is updated by a trigger:

CREATE FUNCTION instagram_documents_search_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
        setweight(to_tsvector(COALESCE(new.username, '')), 'D') || ' ' ||
        setweight(to_tsvector(COALESCE(new.full_name, '')), 'C') || ' ' ||
        setweight(to_tsvector(COALESCE(new.location_country, '')), 'B') || ' ' ||
        setweight(to_tsvector(COALESCE(new.location_region, '')), 'B') || ' ' ||
        setweight(to_tsvector(COALESCE(new.biography, '')), 'A') || ' ' ||
        setweight(to_tsvector(COALESCE(new.location_city, '')), 'A');
  return new;
end
$$ LANGUAGE plpgsql;


CREATE TRIGGER instagram_tsvectorupdate BEFORE INSERT OR UPDATE
    ON instagram.profiles_1000 FOR EACH ROW
    EXECUTE PROCEDURE instagram_documents_search_trigger();

This is the query:

select instagram.profiles_1000.*, categories, followers as rank                                                                                            
from instagram.profiles_1000
join plainto_tsquery('arts') as q on q @@ tsv
left outer join instagram.profile_categories_agg on instagram.profiles_1000.username = instagram.profile_categories_agg.username
where followers is not null and followers > 0
order by (followers, -id) desc
limit 50;

This is the output of EXPLAIN (ANALYZE, BUFFERS):

https://explain.depesz.com/s/ceCd

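The culprit is the Bitmap Heap Scan, which accounts for most of the total execution time. Frankly, I don't understand why it is needed, especially since the Bitmap Index Scan on instagram_textsearch_index has already filtered the rows by the search term.

Can someone shed some light on this?

Edit: It was pointed out that I misread the EXPLAIN output. In fact, the left outer join takes most of the time. I tried removing it as follows: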

select instagram.profiles_1000.*, followers as rank
from instagram.profiles_1000
join plainto_tsquery('arts') as q on q @@ tsv                                              
where followers is not null and followers > 0
order by (followers, -id) desc
limit 50;

But the query still takes 13 seconds! This is the EXPLAIN (ANALYZE, BUFFERS) output:

https://explain.depesz.com/s/awfH

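Now the bottleneck appears to be the full-text search. Is it really that slow? The table has only 5 million rows, and tsv (of type TSVECTOR) is indexed by the following index: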

CREATE INDEX IF NOT EXISTS instagram_textsearch_index_1000
    ON instagram.profiles_1000 USING GIN(tsv);

Edit 2: I realized that I can write a leaner query if I only process the profiles that match the search (always at most 50). With this query:

select p.*, categories
from
    (select id
    from instagram.profiles_1000, plainto_tsquery('arts') as q
    where q @@ tsv and followers is not null and followers > 0
    order by (followers, -id) desc
    limit 50) as ids
inner join instagram.profiles_1000 as p on
    p.id = ids.id
left outer join instagram.profile_categories_agg as c on
    c.username = p.username;

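I was able to get this result: https://explain.depesz.com/s/OvG

This brings the search time down to about 3 seconds. It would be better to get it down to at least 1 second.

3 Answers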

jjanes · Answer 1 · 2018-09-12T04:53:09+08:00

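The "culprit" here takes up less than 1/4 of the total time. The real bottleneck is the Index Only Scan using instagram_categories_username_category_agg, which takes 0.200 * 118453 = 23690.6 ms, almost 24 seconds, and that is most of the time.

It looks like every user has exactly one category, unless that is just an amazing coincidence. So why is there a separate table of user categories, rather than the category being an attribute of profiles_1000? That design seems to be the real culprit.

In any case, the reason it has to do the Bitmap Heap Scan is that this is how it arrives at the correct answer. If it did only the Bitmap Index Scan, it would not have any data about the matching rows, only their addresses. And because of visibility, rechecks, and lossy bitmap compression, it would not even know whether an address really points to a matching row.

Regular index scans also visit both the index and the table; they just don't split those operations into two separate entries in the EXPLAIN plan the way bitmap scans do.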
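To see this difference in a plan, one can temporarily discourage bitmap scans for a query that a B-tree index can serve. This is only a sketch: the threshold 1000000 is an arbitrary assumption, and which plan the planner actually picks depends on the data and statistics. It uses the B-tree index on followers, because GIN indexes support bitmap scans only.

```sql
BEGIN;

-- Default settings: with many matching rows the planner may choose a
-- Bitmap Index Scan feeding a Bitmap Heap Scan (two separate plan nodes).
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, username
FROM   instagram.profiles_1000
WHERE  followers > 1000000;   -- arbitrary threshold, for illustration only

-- Discourage bitmap scans for the rest of this transaction only.
SET LOCAL enable_bitmapscan = off;

-- The planner may now show a single Index Scan node. That node still
-- visits the table for every row; the heap access is just not listed separately.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, username
FROM   instagram.profiles_1000
WHERE  followers > 1000000;

ROLLBACK;
```

In the bitmap variant, the "Heap Blocks: exact=... lossy=..." line of the Bitmap Heap Scan node shows how much of the bitmap had to fall back to lossy page-level entries.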

Erwin Brandstetter · Answer 2 · 2018-09-12T09:32:24+08:00

Whatever else you do, you can simplify your query considerably:

SELECT p.*, c.categories
FROM  (
   SELECT *
   FROM   instagram.profiles_1000
   WHERE  tsv @@ plainto_tsquery('arts')
   AND    followers > 0
   ORDER  by followers DESC, id  -- untangled
   LIMIT  50
   ) p
LEFT  JOIN instagram.profile_categories_agg c USING (username);

I removed the gratuitous self-join (which was just noise; id is the PK).

I also removed the clause:

AND    followers is not null  -- redundant

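which is redundant as long as you have AND followers > 0: for a NULL value that comparison yields NULL, so the row is filtered out anyway.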
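A quick way to convince yourself: comparing NULL with > yields NULL rather than true, and WHERE keeps only rows for which the condition is true. A minimal illustration with literal values, not tied to the table:

```sql
SELECT (NULL::bigint > 0)         AS cmp,           -- NULL (unknown), not false
       (NULL::bigint > 0) IS TRUE AS passes_where;  -- false: such a row is filtered out
```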

This will probably bring only a minor performance improvement, though.

A partial index may help:

CREATE INDEX IF NOT EXISTS instagram_textsearch_index_1000_partial
ON instagram.profiles_1000 USING GIN(tsv)
WHERE followers > 0;
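Note that the planner only considers a partial index when the query's WHERE clause implies the index predicate, so the query has to keep its followers > 0 condition. A sketch for checking eligibility, assuming the partial index above has been created; whether it is actually chosen still depends on the planner's cost estimates:

```sql
EXPLAIN
SELECT id
FROM   instagram.profiles_1000
WHERE  tsv @@ plainto_tsquery('arts')
AND    followers > 0;   -- implies the index predicate, so the partial index is eligible
```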

You could gain more overall performance by migrating to this table definition:

CREATE TABLE instagram.profiles_1000 (
    id          SERIAL PRIMARY KEY,
    followers   INT,  -- nobody has > 2^31 followers
    username    VARCHAR(255) NOT NULL UNIQUE, -- why VARCHAR(255)?
    tsv         TSVECTOR
);

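This assumes a plain integer easily covers the maximum number of followers.

I then moved the followers column to second position, to avoid wasting space on alignment padding. See:

Calculating and saving space in PostgreSQL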
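As a rough illustration of the padding effect, pg_column_size() can be applied to row values that hold the same kind of fields in different orders. The literal values are arbitrary assumptions, the second row also switches followers from bigint to int, and a row value only approximates the on-disk tuple layout, so treat the numbers as indicative.

```sql
SELECT pg_column_size(ROW(1::int, 'rubik'::varchar, 1::bigint)) AS original_order,  -- id, username, followers BIGINT
       pg_column_size(ROW(1::int, 1::int, 'rubik'::varchar))    AS reordered;       -- id, followers INT, username
```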

And you should use id instead of username in table instagram.profile_categories_agg, which makes that table smaller and the join faster.
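A sketch of what that could look like. The definition of instagram.profile_categories_agg is not shown in the question, so the table name, the column names, and the text[] type below are all assumptions made for illustration.

```sql
-- Hypothetical replacement for profile_categories_agg, keyed by the integer PK.
CREATE TABLE instagram.profile_categories_by_id (
    profile_id  int PRIMARY KEY REFERENCES instagram.profiles_1000 (id),
    categories  text[]   -- assumed type of the aggregated categories
);

-- Same top-50 query, now joining on the 4-byte integer key instead of username.
SELECT p.*, c.categories
FROM  (
   SELECT *
   FROM   instagram.profiles_1000
   WHERE  tsv @@ plainto_tsquery('arts')
   AND    followers > 0
   ORDER  BY followers DESC, id
   LIMIT  50
   ) p
LEFT  JOIN instagram.profile_categories_by_id c ON c.profile_id = p.id;
```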

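Finally, you are not making use of the weights. For the purpose of this query, you can use a simpler tsvector function, which results in a smaller table and index: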

CREATE FUNCTION instagram_documents_search_trigger()
  RETURNS trigger AS
$func$
BEGIN
   NEW.tsv := to_tsvector(COALESCE(NEW.username        , '')
                || ' ' || COALESCE(NEW.full_name       , '')
                || ' ' || COALESCE(NEW.location_country, '')
                || ' ' || COALESCE(NEW.location_region , '')
                || ' ' || COALESCE(NEW.biography       , '')
                || ' ' || COALESCE(NEW.location_city   , ''));
  RETURN NEW;
END
$func$  LANGUAGE plpgsql;

jjanes · Answer 3 · 2018-09-12T16:32:15+08:00

Best Answer

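If you want to improve the timing further, your best option is probably to give up on using the FTS index, at least in cases where the @@ match condition returns a large number of results.

First, you have to change your ORDER BY from order by (followers, -id) desc to order by followers desc, id. This version is semantically equivalent (except possibly for how it handles NULL values), but it skips the step of packing the two columns into a pseudo-row and then sorting those row values. It sorts the column values directly. This direct sort is much faster, but more importantly, it opens up the possibility of using an index rather than a sort to fulfill the ORDER BY.

Then, if you create an index on (followers desc, id), your query can step through that index looking for rows that satisfy the @@ condition, and stop as soon as it has found 50 of them. Doing this is likely to be much faster than pulling out the more than 100,000 rows that match @@ and sorting them to pull out the top 50.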
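Put together, the two steps could look like the sketch below. The index name is hypothetical, and the planner is still free to prefer the GIN index, so it is worth confirming the plan with EXPLAIN (ANALYZE, BUFFERS).

```sql
-- Index whose order matches ORDER BY followers DESC, id (hypothetical name).
CREATE INDEX IF NOT EXISTS instagram_followers_id_index
    ON instagram.profiles_1000 (followers DESC, id);

-- The query can walk the index in sorted order, test each row against the
-- @@ condition, and stop after the first 50 matches. No sort step is needed.
SELECT p.*
FROM   instagram.profiles_1000 p
WHERE  p.tsv @@ plainto_tsquery('arts')
AND    p.followers > 0
ORDER  BY p.followers DESC, p.id
LIMIT  50;
```

This approach favors search terms that match many rows. For a rare term, the scan may have to walk a large part of the index before it finds 50 matches, which is the case where the FTS index remains the better choice.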
