
Question

Kokizzu

Asked: 2017-06-25 00:29:51 +0800 CST

How to make DISTINCT ON faster in PostgreSQL?


I have a table station_logs in a PostgreSQL 9.6 database:

    Column     |            Type             |    
---------------+-----------------------------+
 id            | bigint                      | bigserial
 station_id    | integer                     | not null
 submitted_at  | timestamp without time zone | 
 level_sensor  | double precision            | 
Indexes:
    "station_logs_pkey" PRIMARY KEY, btree (id)
    "uniq_sid_sat" UNIQUE CONSTRAINT, btree (station_id, submitted_at)

I'm trying to get the last level_sensor value, based on submitted_at, for each station_id. There are around 400 unique station_id values, and around 20k rows per day per station_id.

Before creating the index:

EXPLAIN ANALYZE
SELECT DISTINCT ON(station_id) station_id, submitted_at, level_sensor
FROM station_logs ORDER BY station_id, submitted_at DESC;

 Unique  (cost=4347852.14..4450301.72 rows=89 width=20) (actual time=22202.080..27619.167 rows=98 loops=1)
   ->  Sort  (cost=4347852.14..4399076.93 rows=20489916 width=20) (actual time=22202.077..26540.827 rows=20489812 loops=1)
         Sort Key: station_id, submitted_at DESC
         Sort Method: external merge  Disk: 681040kB
         ->  Seq Scan on station_logs  (cost=0.00..598895.16 rows=20489916 width=20) (actual time=0.023..3443.587 rows=20489812 loops=$
 Planning time: 0.072 ms
 Execution time: 27690.644 ms

Creating the index:

CREATE INDEX station_id__submitted_at ON station_logs(station_id, submitted_at DESC);

After creating the index, for the same query:

 Unique  (cost=0.56..2156367.51 rows=89 width=20) (actual time=0.184..16263.413 rows=98 loops=1)
   ->  Index Scan using station_id__submitted_at on station_logs  (cost=0.56..2105142.98 rows=20489812 width=20) (actual time=0.181..1$
 Planning time: 0.206 ms
 Execution time: 16263.490 ms

Is there any way to make this query faster? Around 1 second, for example; 16 seconds is still too much.

2 Answers


Erwin Brandstetter · Answer 1 · 2017-06-25T06:42:02+08:00

For only 400 stations, this query will be massively faster:

SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC NULLS LAST
   LIMIT  1
   ) l;

dbfiddle here (comparing plans for this query, Abelisto's alternative and your original)

Resulting EXPLAIN ANALYZE as provided by the OP:

Nested Loop  (cost=0.56..356.65 rows=102 width=20) (actual time=0.034..0.979 rows=98 loops=1)
   ->  Seq Scan on stations s  (cost=0.00..3.02 rows=102 width=4) (actual time=0.009..0.016 rows=102 loops=1)
   ->  Limit  (cost=0.56..3.45 rows=1 width=16) (actual time=0.009..0.009 rows=1 loops=102)
         ->  Index Scan using station_id__submitted_at on station_logs  (cost=0.56..664062.38 rows=230223 width=16) (actual time=0.009$
               Index Cond: (station_id = s.id)
 Planning time: 0.542 ms
 Execution time: 1.013 ms  -- !!

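The only index you need is the one you created: station_id__submitted_at. The UNIQUE constraint uniq_sid_sat basically does the job, too. Maintaining both seems like a waste of disk space and write performance.

I added NULLS LAST to ORDER BY in the query because submitted_at is not defined NOT NULL. Ideally, if applicable, add a NOT NULL constraint to the column submitted_at, drop the additional index and remove NULLS LAST from the query.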
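A minimal sketch of that cleanup, assuming no existing row has a NULL submitted_at and that the "additional index" is the station_id__submitted_at index created in the question:

```sql
-- Assumption: no existing row has submitted_at IS NULL; otherwise this statement fails.
-- SET NOT NULL scans the whole table while holding an exclusive lock.
ALTER TABLE station_logs ALTER COLUMN submitted_at SET NOT NULL;

-- The index behind the UNIQUE constraint uniq_sid_sat can be scanned backward,
-- so the additional index is no longer needed.
DROP INDEX station_id__submitted_at;

-- Same query as above, without NULLS LAST.
SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC
   LIMIT  1
   ) l;
```

It is worth re-checking the plan with EXPLAIN ANALYZE afterwards to confirm that the remaining index is scanned backward.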

If submitted_at can be NULL, create this UNIQUE index to replace both your current index and the unique constraint:

CREATE UNIQUE INDEX station_logs_uni ON station_logs(station_id, submitted_at DESC NULLS LAST);
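Once that index exists, the two objects it replaces could be dropped roughly as follows. This is only a sketch; it assumes that station_logs_uni was built successfully and that no foreign key references the uniq_sid_sat constraint:

```sql
-- Assumes station_logs_uni (created above) exists and is valid.
-- Dropping the constraint also drops its underlying index.
ALTER TABLE station_logs DROP CONSTRAINT uniq_sid_sat;

-- The plain index from the question is now redundant as well.
DROP INDEX station_id__submitted_at;
```

On a busy table, building the new index with CREATE UNIQUE INDEX CONCURRENTLY (outside a transaction block) avoids blocking writes during the build.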


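This assumes a separate table station with one row per relevant station_id (typically the PK), which you should have either way. If you don't have it, create it. Again, this is very fast with this rCTE technique: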

CREATE TABLE station AS
WITH RECURSIVE cte AS (
   (
   SELECT station_id
   FROM   station_logs
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT l.station_id
   FROM   cte c
   ,      LATERAL (   
      SELECT station_id
      FROM   station_logs
      WHERE  station_id > c.station_id
      ORDER  BY station_id
      LIMIT  1
      ) l
   )
TABLE cte;

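I use that in the fiddle as well. You could use a similar query to solve your task directly, without a station table - if you can't be convinced to create it.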
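Such a direct query might look like the following sketch. It uses the same recursive pattern, but carries the latest row per station instead of just the station_id. It assumes an index on (station_id, submitted_at DESC NULLS LAST), such as station_logs_uni above, so that each step is a single index probe:

```sql
WITH RECURSIVE cte AS (
   (  -- latest row of the smallest station_id
   SELECT station_id, submitted_at, level_sensor
   FROM   station_logs
   ORDER  BY station_id, submitted_at DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   ,      LATERAL (  -- latest row of the next greater station_id
      SELECT station_id, submitted_at, level_sensor
      FROM   station_logs
      WHERE  station_id > c.station_id
      ORDER  BY station_id, submitted_at DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE cte;
```

The recursion stops when the LATERAL subquery finds no greater station_id.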


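Optimize index

Your query should be very fast now. Only if you still need to optimize read performance ...

It might make sense to add level_sensor as the last column of the index to allow index-only scans, like joanolo commented.
Con: It makes the index bigger, which adds a little cost to all queries using it.
Pro: If you actually get index-only scans out of it, the query at hand does not have to visit heap pages at all, which makes it about twice as fast. But that may be an insubstantial gain for a query that is already very fast.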
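For illustration, such a covering index could be sketched like this (the index name is hypothetical):

```sql
-- Hypothetical index name.
-- PostgreSQL 9.6 has no INCLUDE clause (added in version 11),
-- so the payload column is appended as a regular trailing key column.
CREATE INDEX station_id__submitted_at__level_sensor
ON station_logs (station_id, submitted_at DESC NULLS LAST, level_sensor);
```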

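However, I don't expect that to work for your case. You mentioned:

... around 20k rows per day per station_id.

Typically, that would indicate unceasing write load (1 row per station_id every 5 seconds). And you are interested in the latest row. Index-only scans only work for heap pages that are visible to all transactions (the bit in the visibility map is set). You would have to run extremely aggressive VACUUM settings for the table to keep up with the write load, and it would still not work most of the time. If my assumptions are correct, index-only scans are out; don't add level_sensor to the index.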
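One rough way to test that assumption is to look at how much of the table is currently marked all-visible. The sketch below reads the planner statistics in pg_class; the numbers are estimates that are only refreshed by VACUUM, ANALYZE and a few DDL commands:

```sql
SELECT relpages                                   -- estimated number of heap pages
     , relallvisible                              -- pages marked all-visible in the visibility map
     , round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM   pg_class
WHERE  oid = 'station_logs'::regclass;
```

Even with a high overall percentage, the most recently written pages are the ones least likely to be all-visible, and those are exactly the pages this query reads. The "Heap Fetches" line that EXPLAIN ANALYZE prints for an Index Only Scan node shows how often the heap still had to be visited.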

OTOH, if my assumptions hold and your table is growing very big, a BRIN index might help. Related:

Speed up creation of Postgres partial index
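A BRIN index on the timestamp could be sketched as follows. The index name is hypothetical, and the sketch assumes rows are inserted roughly in submitted_at order, so that physical row order correlates with the timestamp:

```sql
-- Hypothetical index name.
-- BRIN stores only a small summary per range of heap pages, so it stays tiny
-- even for very large tables, but it is only useful for range filters on a
-- column that correlates with physical row order.
CREATE INDEX station_logs_submitted_at_brin
ON station_logs USING brin (submitted_at);
```

Such an index serves range conditions like submitted_at > '2017-06-24 00:00'; it does not provide sorted output for ORDER BY ... LIMIT 1 the way the B-tree index does.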

Or, even more specialized and more efficient: a partial index for only the latest additions, to cut off the bulk of irrelevant rows:

CREATE INDEX station_id__submitted_at_recent_idx ON station_logs(station_id, submitted_at DESC NULLS LAST)
WHERE submitted_at > '2017-06-24 00:00';

Choose a timestamp for which you know that newer rows must exist. You have to add a matching WHERE condition to all queries, like:

...
WHERE  station_id = s.station_id
AND    submitted_at > '2017-06-24 00:00'
...

You have to adapt index and query from time to time.
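Moving the cutoff forward could be done roughly like this. The new index name and the new cutoff timestamp are hypothetical:

```sql
-- 1. Build the replacement index without blocking writes
--    (CONCURRENTLY cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY station_id__submitted_at_recent_idx2
ON station_logs (station_id, submitted_at DESC NULLS LAST)
WHERE submitted_at > '2017-07-24 00:00';  -- hypothetical new cutoff

-- 2. Switch all queries to the new cutoff:
--    ... AND submitted_at > '2017-07-24 00:00'

-- 3. Drop the old partial index.
DROP INDEX CONCURRENTLY station_id__submitted_at_recent_idx;
```

A query using the later cutoff can still use the older index while both exist, because its condition implies the older index predicate, so the switch in step 2 does not need to be simultaneous with the index change.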

Abelisto · Answer 2 · 2017-06-25T01:30:38+08:00

Try the classic way:

create index idx_station_logs__station_id on station_logs(station_id);
create index idx_station_logs__submitted_at on station_logs(submitted_at);

analyse station_logs;

with t as (
  select station_id, max(submitted_at) submitted_at 
  from station_logs 
  group by station_id)
select * 
from t join station_logs l on (
  l.station_id = t.station_id and l.submitted_at = t.submitted_at);

dbfiddle

EXPLAIN ANALYZE by ThreadStarter

 Nested Loop  (cost=701344.63..702110.58 rows=4 width=155) (actual time=6253.062..6253.544 rows=98 loops=1)
   CTE t
     ->  HashAggregate  (cost=701343.18..701344.07 rows=89 width=12) (actual time=6253.042..6253.069 rows=98 loops=1)
           Group Key: station_logs.station_id
           ->  Seq Scan on station_logs  (cost=0.00..598894.12 rows=20489812 width=12) (actual time=0.034..1841.848 rows=20489812 loop$
   ->  CTE Scan on t  (cost=0.00..1.78 rows=89 width=12) (actual time=6253.047..6253.085 rows=98 loops=1)
   ->  Index Scan using station_id__submitted_at on station_logs l  (cost=0.56..8.58 rows=1 width=143) (actual time=0.004..0.004 rows=$
         Index Cond: ((station_id = t.station_id) AND (submitted_at = t.submitted_at))
 Planning time: 0.542 ms
 Execution time: 6253.701 ms
