
Question

Charlie Clark

Asked: 2015-07-01 04:44:34 +0800 CST

Postgres is performing a sequential scan instead of an index scan

772

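I have a table with about 10 million rows and an index on a date field. When I try to extract the unique values of the indexed field, Postgres runs a sequential scan even though the result set has only 26 items. Why does the optimiser pick this plan? And what can I do to avoid it?

From other answers I suspect this is as much related to the query as to the index.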

explain select "labelDate" from pages group by "labelDate";
                              QUERY PLAN
-----------------------------------------------------------------------
 HashAggregate  (cost=524616.78..524617.04 rows=26 width=4)
   Group Key: "labelDate"
   ->  Seq Scan on pages  (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)

Table structure:

http=# \d pages
                                       Table "public.pages"
     Column      |          Type          |        Modifiers
-----------------+------------------------+----------------------------------
 pageid          | integer                | not null default nextval('...
 createDate      | integer                | not null
 archive         | character varying(16)  | not null
 label           | character varying(32)  | not null
 wptid           | character varying(64)  | not null
 wptrun          | integer                | not null
 url             | text                   |
 urlShort        | character varying(255) |
 startedDateTime | integer                |
 renderStart     | integer                |
 onContentLoaded | integer                |
 onLoad          | integer                |
 PageSpeed       | integer                |
 rank            | integer                |
 reqTotal        | integer                | not null
 reqHTML         | integer                | not null
 reqJS           | integer                | not null
 reqCSS          | integer                | not null
 reqImg          | integer                | not null
 reqFlash        | integer                | not null
 reqJSON         | integer                | not null
 reqOther        | integer                | not null
 bytesTotal      | integer                | not null
 bytesHTML       | integer                | not null
 bytesJS         | integer                | not null
 bytesCSS        | integer                | not null
 bytesImg        | integer                | not null
 bytesFlash      | integer                | not null
 bytesJSON       | integer                | not null
 bytesOther      | integer                | not null
 numDomains      | integer                | not null
 labelDate       | date                   |
 TTFB            | integer                |
 reqGIF          | smallint               | not null
 reqJPG          | smallint               | not null
 reqPNG          | smallint               | not null
 reqFont         | smallint               | not null
 bytesGIF        | integer                | not null
 bytesJPG        | integer                | not null
 bytesPNG        | integer                | not null
 bytesFont       | integer                | not null
 maxageMore      | smallint               | not null
 maxage365       | smallint               | not null
 maxage30        | smallint               | not null
 maxage1         | smallint               | not null
 maxage0         | smallint               | not null
 maxageNull      | smallint               | not null
 numDomElements  | integer                | not null
 numCompressed   | smallint               | not null
 numHTTPS        | smallint               | not null
 numGlibs        | smallint               | not null
 numErrors       | smallint               | not null
 numRedirects    | smallint               | not null
 maxDomainReqs   | smallint               | not null
 bytesHTMLDoc    | integer                | not null
 fullyLoaded     | integer                |
 cdn             | character varying(64)  |
 SpeedIndex      | integer                |
 visualComplete  | integer                |
 gzipTotal       | integer                | not null
 gzipSavings     | integer                | not null
 siteid          | numeric                |
Indexes:
    "pages_pkey" PRIMARY KEY, btree (pageid)
    "pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
    "idx_pages_cdn" btree (cdn)
    "idx_pages_labeldate" btree ("labelDate") CLUSTER
    "idx_pages_urlshort" btree ("urlShort")
Triggers:
    pages_label_date BEFORE INSERT OR UPDATE ON pages
      FOR EACH ROW EXECUTE PROCEDURE fix_label_date()

3 Answers

ypercubeᵀᴹ · Answer 1 · 2015-07-01T05:42:51+08:00

Best Answer

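This is a known issue regarding Postgres optimization. If the distinct values are few, like in your case, and you are on version 8.4+, a very fast workaround using a recursive query is described here: Loose Indexscan.

Your query could be rewritten (the LATERAL needs version 9.3+):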

WITH RECURSIVE pa AS 
( ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 ) 
  UNION ALL
    SELECT n."labelDate" 
    FROM pa AS p
         , LATERAL 
              ( SELECT "labelDate" 
                FROM pages 
                WHERE "labelDate" > p."labelDate" 
                ORDER BY "labelDate" 
                LIMIT 1
              ) AS n
) 
SELECT "labelDate" 
FROM pa ;

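Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user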
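
To confirm that the rewritten query really walks the index, one option is to look at its actual plan. The sketch below assumes the table and column names from the question and simply wraps the recursive query in EXPLAIN (ANALYZE, BUFFERS). Plan nodes and timings depend on your version and data, so no output is shown.

```sql
-- Note: EXPLAIN ANALYZE actually executes the query.
EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE pa AS (
   (SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1)
   UNION ALL
   SELECT n."labelDate"
   FROM   pa AS p
        , LATERAL (
             -- next distinct value after the one found in the previous step
             SELECT "labelDate"
             FROM   pages
             WHERE  "labelDate" > p."labelDate"
             ORDER  BY "labelDate"
             LIMIT  1
          ) AS n
)
SELECT "labelDate" FROM pa;
```

If the plan shows an index (or index-only) scan inside the recursive term, executed roughly once per distinct date, the workaround is doing its job.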

8

Erwin Brandstetter · Answer 2 · 2015-07-02T15:56:22+08:00

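The best query very much depends on data distribution.

You have many rows per date, that much has been established. Since your case boils down to only 26 values in the result, all of the following solutions will be very fast as soon as the index is used.
(With more distinct values the case would get more interesting.)

There is no need to involve pageid at all (like you commented).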

Index

All you need is a simple btree index on "labelDate".
With more than a few NULL values in the column, a partial index helps some more (and is smaller):

CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
WHERE  "labelDate" IS NOT NULL;

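You later clarified:

0% NULL but only after fixing things up when importing.

The partial index may still make sense to rule out intermediary states of rows with NULL values. It would avoid needless updates to the index (and the resulting bloat).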
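
Whether the partial index pays off depends on how many NULL values the column holds at any given time. A rough way to check, assuming the table lives in the public schema as shown in the question and both indexes exist, is to look at the planner statistics and compare index sizes:

```sql
-- Planner statistics for the column (refresh them first with: ANALYZE pages;)
SELECT null_frac   -- estimated fraction of NULL entries
     , n_distinct  -- estimated distinct values (a negative number is a fraction of the row count)
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'pages'
AND    attname    = 'labelDate';

-- Compare the on-disk size of the full index and the partial index
SELECT pg_size_pretty(pg_relation_size('idx_pages_labeldate'))        AS full_index
     , pg_size_pretty(pg_relation_size('pages_labeldate_nonull_idx')) AS partial_index;
```

The statistics are estimates from the last ANALYZE, so they will not reflect short-lived NULL states during an import.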

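Queries

Based on a provisional range

If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date to our advantage. There is only a finite, countable number of values between two given values. If the gaps are few, this will be fastest: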

SELECT d."labelDate"
FROM  (
   SELECT generate_series(min("labelDate")::timestamp
                        , max("labelDate")::timestamp
                        , interval '1 day')::date AS "labelDate"
   FROM   pages
   ) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

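Why the cast to timestamp in generate_series()? See:

Generating time series between two dates in PostgreSQL

Min and max can be picked from the index cheaply. If you know the minimum and/or maximum possible date, it gets a bit cheaper still. Example: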
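
As far as I know, there is no variant of generate_series() that takes date arguments, so date input is coerced to timestamp with time zone, which makes the result depend on the session's time zone setting. A small sketch to inspect this yourself (the literal dates are arbitrary illustration values):

```sql
-- Which type does generate_series() return when fed date values directly?
SELECT pg_typeof(g) AS result_type
FROM   generate_series(date '2015-01-01', date '2015-01-03', interval '1 day') g
LIMIT  1;

-- With timestamp input the series does not depend on the time zone setting
SELECT g::date AS day
FROM   generate_series(timestamp '2015-01-01', timestamp '2015-01-03', interval '1 day') g;
```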

SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

Or, for an immutable interval:

SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM generate_series(0, 363) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

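Loose index scan

This performs very well with any distribution of dates (as long as there are many rows per date). It is basically what @ypercube already provided, but there are some fine points, and we need to make sure our favorite index can be used everywhere.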

WITH RECURSIVE p AS (
   ( -- parentheses required for LIMIT
   SELECT "labelDate"
   FROM   pages
   WHERE  "labelDate" IS NOT NULL
   ORDER  BY "labelDate"
   LIMIT  1
   ) 
   UNION ALL
   SELECT (SELECT "labelDate" 
           FROM   pages 
           WHERE  "labelDate" > p."labelDate" 
           ORDER  BY "labelDate" 
           LIMIT  1)
   FROM   p
   WHERE  "labelDate" IS NOT NULL
   ) 
SELECT "labelDate" 
FROM   p
WHERE  "labelDate" IS NOT NULL;

The first (non-recursive) term of the CTE p is effectively the same as
```
SELECT min("labelDate") FROM pages
```
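But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, a correlated subquery in the recursive term of the rCTE should be a bit faster. This requires excluding rows that result in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user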
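
To see which index each form picks on your installation, the two variants can be compared side by side. This is only a sketch using the names from the question; the chosen index and plan shape may differ between Postgres versions.

```sql
-- Short form: the planner typically turns min() on an indexed column into an index lookup
EXPLAIN
SELECT min("labelDate") FROM pages;

-- Verbose form: the WHERE clause matches the predicate of the partial index
EXPLAIN
SELECT "labelDate"
FROM   pages
WHERE  "labelDate" IS NOT NULL
ORDER  BY "labelDate"
LIMIT  1;
```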

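Asides

Unquoted, legal, lower-case identifiers make your life easier.
Order the columns in your table definition favorably to save some disk space:

Calculating and saving space in PostgreSQL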
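
As an illustration of the first aside, a mixed-case column can be renamed to a plain lower-case identifier. The new name label_date is hypothetical, and the trigger function fix_label_date() from the question presumably refers to the old name and would need adjusting too, so this sketch rolls back by default:

```sql
BEGIN;

-- "labelDate" must be double-quoted because it contains an upper-case letter
ALTER TABLE pages RENAME COLUMN "labelDate" TO label_date;

-- afterwards the column can be referenced without double quotes
SELECT label_date FROM pages LIMIT 1;

ROLLBACK;  -- use COMMIT instead once dependent code has been updated
```

Indexes and constraints follow the renamed column automatically; function bodies and application queries do not.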

Fabrizio Mazzoni · Answer 3 · 2015-07-01T04:55:59+08:00

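From the PostgreSQL documentation:

CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.

Your index on labelDate is a btree.

Reference: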

http://www.postgresql.org/docs/9.1/static/sql-cluster.html

-2
