Question

Asked: 2016-01-04 17:40:08 +0800 CST

Improve sort performance in GROUP BY clause

I have two tables in Postgres 9.4.1, events and event_refs, with the following schemas:

events table

CREATE TABLE events (
  id serial NOT NULL PRIMARY KEY,
  event_type text NOT NULL,
  event_path jsonb,
  event_data jsonb,
  created_at timestamp with time zone NOT NULL
);

-- Index on type and created time

CREATE INDEX events_event_type_created_at_idx
  ON events (event_type, created_at);

event_refs table

CREATE TABLE event_refs (
  event_id integer NOT NULL,
  reference_key text NOT NULL,
  reference_value text NOT NULL,
  CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
  CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
      REFERENCES events (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

Both tables contain 2M rows. This is the query I am running:

SELECT
  EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) as funnel_time
FROM
  events
INNER JOIN
  event_refs
ON
  event_refs.event_id = events.id AND
  event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1

I am aware of the operator precedence in the WHERE clause. It is only supposed to filter events of type 'event2' by date.

This is the EXPLAIN ANALYZE output:

GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
   Group Key: event_refs.reference_value
   Filter: (count(*) > 1)
   Rows Removed by Filter: 41315
   ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
         Sort Key: event_refs.reference_value
         Sort Method: external merge  Disk: 3904kB
         ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
               Hash Cond: (event_refs.event_id = events.id)
               ->  Seq Scan on event_refs  (cost=0.00..37739.00 rows=2000000 width=10) (actual time=0.007..368.661 rows=2000000 loops=1)
                     Filter: (reference_key = 'project'::text)
               ->  Hash  (cost=21730.79..21730.79 rows=147878 width=12) (actual time=101.524..101.524 rows=153766 loops=1)
                     Buckets: 16384  Batches: 2  Memory Usage: 3315kB
                     ->  Bitmap Heap Scan on events  (cost=3761.23..21730.79 rows=147878 width=12) (actual time=23.139..75.814 rows=153766 loops=1)
                           Recheck Cond: ((event_type = 'event1'::text) OR ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone)))
                           Heap Blocks: exact=14911
                           ->  BitmapOr  (cost=3761.23..3761.23 rows=150328 width=0) (actual time=21.210..21.210 rows=0 loops=1)
                                 ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..2349.42 rows=102533 width=0) (actual time=12.234..12.234 rows=99864 loops=1)
                                       Index Cond: (event_type = 'event1'::text)
                                 ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..1337.87 rows=47795 width=0) (actual time=8.975..8.975 rows=53902 loops=1)
                                       Index Cond: ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone))
 Planning time: 0.493 ms
 Execution time: 4178.517 ms

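I am aware that the filter on the event_refs table scan is not filtering anything. That is a result of my test data; different types will be added later.

Everything up to and including the Hash Join seems reasonable given my test data, but I am wondering whether it is possible to speed up the Sort caused by the GROUP BY clause?

I have tried adding a b-tree index on the reference_value column, but it does not seem to be used. If I am not mistaken (and I very well may be, so please tell me), it is sorting 153766 rows. Wouldn't an index benefit this sorting process?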

1 Answer

Erwin Brandstetter · Answer 1 · 2016-01-05T08:23:24+08:00

`work_mem`

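This is what makes your sort expensive:

Sort Method: external merge  Disk: 3904kB

The sort spills to disk, which kills performance. You need more RAM. In particular, you need to increase work_mem. The manual:

work_mem (integer)

Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files.

In this particular case, raising the setting by 4 MB should do the trick. Generally, since you will need a lot more than that in your full deployment with 60M rows, and since a general work_mem setting that is too high can backfire (read the manual entry quoted above!), consider setting it high enough locally for your query, like: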

BEGIN;
SET LOCAL work_mem = '500MB';  -- adapt to your needs
SELECT ...;
COMMIT;

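Be aware that even SET LOCAL stays in effect until the end of the transaction. If you run more statements in the same transaction, you may want to reset: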

RESET work_mem;
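
To confirm how far the setting reaches, here is a minimal sketch, assuming an interactive psql session. The values reported before and after the transaction depend on your own configuration; only the 500MB inside the transaction comes from the example above.

```sql
SHOW work_mem;                    -- session value before the transaction
BEGIN;
SET LOCAL work_mem = '500MB';     -- applies only inside this transaction
SHOW work_mem;                    -- reports 500MB
COMMIT;
SHOW work_mem;                    -- back to the session value
```

After COMMIT (or ROLLBACK) the setting reverts on its own, so RESET is only needed when further statements in the same transaction should run with the previous value.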

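Or encapsulate the query in a function with a function-local setting. Related answer with an example for a function:

Postgres not using index when index scan is much better option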
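
A function-local setting could look roughly like the following sketch. The function name f_funnel_times is hypothetical, and the body is the query from the question with the WHERE clause parenthesized the way the asker describes the intent (only 'event2' filtered by date); adapt both to your needs.

```sql
-- Hypothetical function name; the body is the query from the question.
CREATE OR REPLACE FUNCTION f_funnel_times()
  RETURNS TABLE (funnel_time double precision)
  LANGUAGE sql STABLE
  SET work_mem = '500MB'            -- in effect only while the function runs
AS
$func$
SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at)))::double precision
FROM   events e
JOIN   event_refs r ON r.event_id = e.id
                   AND r.reference_key = 'project'
WHERE  e.event_type = 'event1'
   OR (e.event_type = 'event2'
       AND e.created_at >= '2015-07-01 00:00:00+08:00'
       AND e.created_at <  '2015-12-01 00:00:00+08:00')
GROUP  BY r.reference_value
HAVING COUNT(*) > 1
$func$;

SELECT * FROM f_funnel_times();
```

The SET clause attached to the function restores the previous work_mem value when the function exits, so no explicit RESET is needed.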

Indexes

I would also try these indexes:

CREATE INDEX events_event_type_created_at_idx ON events (event_type, created_at, id);

Adding id as the last column only makes sense if you get index-only scans out of it.

And a partial index on event_refs:

CREATE INDEX event_refs_foo_idx ON event_refs (event_id, reference_value)
WHERE  reference_key = 'project';

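The predicate WHERE reference_key = 'project' does not help much in your test case (except for query planning), but it should help a lot in your full deployment, where different types will be added later.

This should also allow index-only scans.

Possible alternative query

Since you are selecting large parts of events, this alternative query might be faster (heavily depending on data distribution):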

SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at))) as funnel_time
FROM   events e
JOIN  (
   SELECT event_id, reference_value, count(*) AS ct
   FROM   event_refs
   WHERE  reference_key = 'project'                   
   GROUP  BY event_id, reference_value
   ) r ON r.event_id = e.id
WHERE (e.event_type = 'event1' OR
       e.event_type = 'event2')        -- see below !
AND    e.created_at >= '2015-07-01 00:00:00+08:00'
AND    e.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY r.reference_value
HAVING sum(r.ct) > 1;

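I suspect a bug in your query, and that you want parentheses in the WHERE clause like I added. According to operator precedence, AND binds before OR.

The alternative query only makes sense if there are many rows per (event_id, reference_value) in event_refs. Again, the above index would help.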
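
The precedence remark can be checked with a throwaway query on boolean constants. This is only an illustration; the column aliases are made up for the example.

```sql
SELECT true OR false AND false   AS default_precedence  -- parsed as true OR (false AND false)
     , (true OR false) AND false AS with_parentheses;
-- default_precedence is true, with_parentheses is false
```

Applied to the original query, this means that without parentheses the date range restricts only the 'event2' branch, while the parenthesized version restricts both event types.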
