AskOverflow.Dev

Julien Bourdon
Asked: 2015-04-06 07:38:15 +0800 CST

Scalable query for running counts of events within the previous x days


I have already posted this question on stackoverflow, but I think I might get a better answer here.
I have a table storing millions of events that happen to users:

                                       Table "public.events"
   Column   |           Type           |                         Modifiers                         
------------+--------------------------+-----------------------------------------------------------
 event_id   | integer                  | not null default nextval('events_event_id_seq'::regclass)
 user_id    | bigint                   | 
 event_type | integer                  | 
 ts         | timestamp with time zone | 

event_type has 5 distinct values. There are several million users, and the number of events per user and event_type varies, typically between 1 and 50.

Data sample:

+-----------+----------+-------------+----------------------------+
| event_id  | user_id  | event_type  |         timestamp          |
+-----------+----------+-------------+----------------------------+
|        1  |       1  |          1  | January, 01 2015 00:00:00  |
|        2  |       1  |          1  | January, 10 2015 00:00:00  |
|        3  |       1  |          1  | January, 20 2015 00:00:00  |
|        4  |       1  |          1  | January, 30 2015 00:00:00  |
|        5  |       1  |          1  | February, 10 2015 00:00:00 |
|        6  |       1  |          1  | February, 21 2015 00:00:00 |
|        7  |       1  |          1  | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+

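For each event, I would like to get the number of events with the same user_id and event_type that occurred within the 30 days before the event.

It should look like this: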

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     3 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     3 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     4 |
+-----------+----------+-------------+-----------------------------+-------+

So far, I have managed to do it with two different queries (tested on PostgreSQL 9.4.1 with a generated sample of 1000 rows):

SELECT 
  event_id, user_id,event_type,"timestamp", 
  (
    SELECT count(*) 
    FROM events 
    WHERE timestamp >= e.timestamp - interval '30 days'
    AND timestamp <= e.timestamp
    AND user_id = e.user_id 
    AND event_type = e.event_type
    GROUP BY event_type, user_id
  ) as "count"
FROM events e;

SQL Fiddle for the first query

It works fairly well, especially since I have an index on timestamp:

Index Scan using pk_event_id on events e  (cost=0.28..12018.74 rows=1000 width=24)
SubPlan 1
  ->  GroupAggregate  (cost=4.33..11.97 rows=1 width=20)
        Group Key: events.event_type, events.user_id
        ->  Bitmap Heap Scan on events  (cost=4.33..11.95 rows=1 width=20)
              Recheck Cond: (("timestamp" >= (e."timestamp" - '30 days'::interval)) AND ("timestamp" <= e."timestamp"))
              Filter: ((user_id = e.user_id) AND (event_type = e.event_type))
              ->  Bitmap Index Scan on idx_events_timestamp  (cost=0.00..4.33 rows=5 width=0)
                    Index Cond: (("timestamp" >= (e."timestamp" - '30 days'::interval)) AND ("timestamp" <= e."timestamp"))

Nevertheless, it does not scale well, and I thought that using window functions might improve performance:

SELECT toto.event_id,toto.user_id,toto.event_type,toto.lv as time,COUNT(*)
FROM(
    SELECT e.event_id, e.user_id,e.event_type,"timestamp",
    last_value("timestamp") OVER w as lv,
    unnest(array_agg(e."timestamp") OVER w) as agg
    FROM events e
    WINDOW w AS (PARTITION BY e.user_id,e.event_type ORDER BY e."timestamp"
    ROWS UNBOUNDED PRECEDING)) AS toto
WHERE toto.agg >= toto.lv - interval '30 days'
GROUP by event_id,user_id,event_type,lv;

SQL Fiddle for the second query

Since I have to use unnest and a subquery, performance actually gets worse:

Sort  (cost=5344.41..5427.74 rows=33333 width=24)
  Sort Key: toto.event_id
  ->  HashAggregate  (cost=2506.99..2840.32 rows=33333 width=24)
        Group Key: toto.event_id, toto.user_id, toto.event_type, toto.lv
        ->  Subquery Scan on toto  (cost=67.83..2090.33 rows=33333 width=24)
              Filter: (toto.agg >= (toto.lv - '30 days'::interval))
              ->  WindowAgg  (cost=67.83..590.33 rows=100000 width=24)
                    ->  Sort  (cost=67.83..70.33 rows=1000 width=24)
                          Sort Key: e.user_id, e.event_type, e."timestamp"
                          ->  Seq Scan on events e  (cost=0.00..18.00 rows=1000 width=24)

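I was wondering whether it is possible to keep only the subquery and somehow modify the window frame so that it keeps only timestamps that are at most 30 days older than the row's timestamp. Do you think this query can be scaled to very large tables without switching to a MapReduce framework?

As a second step, I would like to exclude duplicate events, i.e. events with the same event_type and the same timestamp.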

postgresql scalability
  • 3 Answers
  • 2813 Views

  1. Best Answer
    Erwin Brandstetter
    2015-04-06T17:24:42+08:00

    Assuming this cleaned-up table definition:

    CREATE TABLE events (
      event_id   serial PRIMARY KEY
    , user_id    int
    , event_type int
    , ts         timestamp  -- don't use reserved word as identifier
    );
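
    To try the queries below against the question's sample, the seven rows can be loaded into this table. This is only a sketch: it assumes the table above exists and is empty, and that the sample timestamps are meant as midnight values.

    ```sql
    -- Seven sample rows from the question (user 1, event_type 1).
    -- event_id is left to the serial default, so it runs 1..7 on a fresh table.
    INSERT INTO events (user_id, event_type, ts)
    VALUES (1, 1, '2015-01-01')
         , (1, 1, '2015-01-10')
         , (1, 1, '2015-01-20')
         , (1, 1, '2015-01-30')
         , (1, 1, '2015-02-10')
         , (1, 1, '2015-02-21')
         , (1, 1, '2015-02-22');
    ```

    With these rows, a correct 30-day running count should come out as 1, 2, 3, 4, 3, 3, 4, matching the expected output in the question.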

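    Your comparison seems unfair: the first query has ORDER BY event_id, but the second does not. The EXPLAIN output does not fit the first query (no sort step). Make sure to run all tests with the same ORDER BY clause to get valid results. It is best to run each query several times and compare the best of 5 runs to eliminate caching effects.

    Index

    The key to performance is this multicolumn index: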

    CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);
    

    The order of columns matters! Why?

    • Multicolumn index and performance
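
    As a rough illustration of why the order matters (a sketch, not taken from the linked answer): with the two equality columns first and the range column last, all index entries for one (user_id, event_type) pair sit together and are sorted by ts, so the 30-day window of a single event is one contiguous slice of the index. The literal values below are placeholders.

    ```sql
    -- Equality on the two leading index columns, range on the trailing one.
    -- With events_fast_idx in place, the planner can answer this from a single
    -- index range; EXPLAIN shows which plan is actually chosen.
    -- On a very small table a sequential scan may still win.
    EXPLAIN
    SELECT count(*)
    FROM   events
    WHERE  user_id    = 1
    AND    event_type = 1
    AND    ts >= timestamp '2015-02-22' - interval '30 days'
    AND    ts <= timestamp '2015-02-22';
    ```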

    Query

    Each of your queries can be improved:

    Query 1

    Remove GROUP BY event_type, user_id without replacement:

    SELECT event_id, user_id, event_type, ts
        , (SELECT count(*) 
           FROM   events 
           WHERE  user_id    = e.user_id 
           AND    event_type = e.event_type
           AND    ts >= e.ts - interval '30 days'
           AND    ts <= e.ts
          ) AS  ct
    FROM   events e
    ORDER  BY event_id;
    

    Equivalent, with a more modern LATERAL join (Postgres 9.3+):

    SELECT *
    FROM   events e
        ,  LATERAL (
       SELECT count(*) AS ct
       FROM   events 
       WHERE  user_id    = e.user_id 
       AND    event_type = e.event_type
       AND    ts >= e.ts - interval '30 days'
       AND    ts <= e.ts
       ) ct
    ORDER  BY event_id;
    

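    Combined with the index above, this is probably also the fastest query.
    Related answer with more explanation:

    • Optimize GROUP BY query to retrieve latest record per user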
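
    The question's follow-up (not counting duplicate events, i.e. rows with the same event_type and timestamp) is not covered by the queries above. One possible reading, sketched here as an assumption about what is wanted, is to count distinct timestamps inside the same subquery:

    ```sql
    -- Same LATERAL form as above, but rows of the same user and event_type
    -- that share an identical ts are counted only once.
    SELECT *
    FROM   events e
        ,  LATERAL (
       SELECT count(DISTINCT ts) AS ct
       FROM   events
       WHERE  user_id    = e.user_id
       AND    event_type = e.event_type
       AND    ts >= e.ts - interval '30 days'
       AND    ts <= e.ts
       ) ct
    ORDER  BY event_id;
    ```

    Duplicate rows still appear in the result; only the count ignores them.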

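    Query 2

    • last_value(ts) OVER w AS lv is just an expensive copy of ts.
    • ROWS UNBOUNDED PRECEDING is nearly the default frame and therefore mostly noise. (Strictly, the default is RANGE UNBOUNDED PRECEDING, which differs only when several rows share the same ts.)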

    SELECT event_id, user_id, event_type, ts, count(*) AS ct
    FROM  (
       SELECT event_id, user_id, event_type, ts
            , unnest(array_agg(ts) OVER (PARTITION BY user_id, event_type
                                         ORDER BY ts)) AS agg
       FROM   events   
       ) e
    WHERE  agg >= ts - interval '30 days'
    GROUP  BY event_id, user_id, event_type, ts
    ORDER  BY event_id;
    

    But this is needlessly complicated. The same logic is much cheaper with a join than with a subquery using window functions:

    SELECT e.*, count(*) AS ct
    FROM   events e
    JOIN   events x USING (user_id, event_type)
    WHERE  x.ts >= e.ts - interval '30 days'
    AND    x.ts <= e.ts
    GROUP  BY e.event_id
    ORDER  BY e.event_id;
    

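    This is my other favorite for top performance. Again, it relies on the index above.

    Other query

    Here is another idea, but I doubt it can compete. Give it a try, though: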

    WITH cte AS (
       SELECT event_id, user_id, event_type, ts
            , row_number() OVER (PARTITION BY user_id, event_type
                                 ORDER BY ts) AS rn
       FROM   events
       )
    SELECT e.event_id, e.user_id, e.event_type, e.ts, e.rn - min(x.rn) + 1 AS ct
    FROM   cte e
    JOIN   cte x USING (user_id, event_type)
    WHERE  x.ts >= e.ts - interval '30 days'
    AND    x.ts <= e.ts
    GROUP  BY e.event_id, e.user_id, e.event_type, e.ts, e.rn
    ORDER  BY e.event_id;
    

    SQL Fiddle demonstrating all of the above in Postgres 9.3.
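
    A note for readers on newer versions: PostgreSQL 11 added RANGE frames with an offset, which allows the kind of window frame the question asks about. This was not available in 9.3 or 9.4 and is not part of the benchmark in this thread, so treat it as an untested sketch:

    ```sql
    -- Requires PostgreSQL 11 or later.
    -- The frame covers all rows of the partition whose ts lies between
    -- (current ts - 30 days) and the current ts, including rows with equal ts.
    SELECT event_id, user_id, event_type, ts
         , count(*) OVER (PARTITION BY user_id, event_type
                          ORDER BY ts
                          RANGE BETWEEN interval '30 days' PRECEDING
                                    AND CURRENT ROW) AS ct
    FROM   events
    ORDER  BY event_id;
    ```

    This reads the table once and needs no self-join. The same multicolumn index can supply the sort order for the window.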

    • 5
  2. Julien Bourdon
    2015-04-07T09:34:14+08:00

    I accepted @Erwin's answer, but here is a benchmark of the corrected queries on generated data (10000 rows, best of 5 executions). I ran it with the multicolumn index.

    As expected, Query 1 (26.324 ms) and Query 2 (23.264 ms) perform very similarly, while Query 3 is the slowest (32.775 ms).

    CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);
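
    The generator used for this data is not shown. A hypothetical way to produce a comparable 10000-row sample and time a query is sketched below. The value ranges (1000 users, 5 event types, a 90-day span) are assumptions and will not reproduce the exact row counts in the plans that follow.

    ```sql
    -- Hypothetical test data: the ranges are assumptions, not the original generator.
    INSERT INTO events (user_id, event_type, ts)
    SELECT 1 + floor(random() * 1000)::int      -- user_id in 1..1000
         , 1 + floor(random() * 5)::int         -- event_type in 1..5
         , timestamp '2015-01-01' + random() * interval '90 days'
    FROM   generate_series(1, 10000);

    ANALYZE events;  -- refresh planner statistics after the bulk insert

    -- Run each candidate several times and keep the best execution time.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT e.*, count(*) AS ct
    FROM   events e
    JOIN   events x USING (user_id, event_type)
    WHERE  x.ts >= e.ts - interval '30 days'
    AND    x.ts <= e.ts
    GROUP  BY e.event_id
    ORDER  BY e.event_id;
    ```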
    

    Query 1

    SELECT *
    FROM   events e
        ,  LATERAL (
       SELECT count(*) AS ct
       FROM   events 
       WHERE  user_id    = e.user_id 
       AND    event_type = e.event_type
       AND    ts >= e.ts - interval '30 days'
       AND    ts <= e.ts
       ) ct
    ORDER  BY event_id;
    

    EXPLAIN (BUFFERS, ANALYZE)

    Nested Loop  (cost=8.60..83797.29 rows=10000 width=32) (actual time=0.036..25.775 rows=10000 loops=1)
      Buffers: shared hit=31964
      ->  Index Scan using pk_event_id on events e  (cost=0.29..347.29 rows=10000 width=24) (actual time=0.016..1.786 rows=10000 loops=1)
            Buffers: shared hit=103
      ->  Aggregate  (cost=8.31..8.32 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=10000)
            Buffers: shared hit=31861
            ->  Index Only Scan using events_fast_idx on events  (cost=0.29..8.31 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=10000)
                  Index Cond: ((user_id = e.user_id) AND (event_type = e.event_type) AND (ts >= (e.ts - '30 days'::interval)) AND (ts <= e.ts))
                  Heap Fetches: 11780
                  Buffers: shared hit=31861
    Planning time: 0.136 ms
    Execution time: 26.324 ms
    

    Query 2

    SELECT e.*, count(*) AS ct
    FROM   events e
    JOIN   events x USING (user_id, event_type)
    WHERE  x.ts >= e.ts - interval '30 days'
    AND    x.ts <= e.ts
    GROUP  BY e.event_id
    ORDER  BY e.event_id;
    

    EXPLAIN (BUFFERS, ANALYZE)

    GroupAggregate  (cost=1597.56..1613.57 rows=915 width=24) (actual time=18.638..22.797 rows=10000 loops=1)
      Group Key: e.event_id
      Buffers: shared hit=26236
      ->  Sort  (cost=1597.56..1599.85 rows=915 width=24) (actual time=18.631..19.974 rows=11780 loops=1)
            Sort Key: e.event_id
            Sort Method: quicksort  Memory: 1305kB
            Buffers: shared hit=26236
            ->  Merge Join  (cost=0.57..1552.55 rows=915 width=24) (actual time=0.018..15.403 rows=11780 loops=1)
                  Merge Cond: ((e.user_id = x.user_id) AND (e.event_type = x.event_type))
                  Join Filter: ((x.ts <= e.ts) AND (x.ts >= (e.ts - '30 days'::interval)))
                  Rows Removed by Join Filter: 4710
                  Buffers: shared hit=26236
                  ->  Index Scan using events_fast_idx on events e  (cost=0.29..654.26 rows=10000 width=24) (actual time=0.003..2.503 rows=10000 loops=1)
                        Buffers: shared hit=9909
                  ->  Index Only Scan using events_fast_idx on events x  (cost=0.29..654.26 rows=10000 width=20) (actual time=0.005..5.111 rows=16490 loops=1)
                        Heap Fetches: 16490
                        Buffers: shared hit=16327
    Planning time: 0.216 ms
    Execution time: 23.264 ms
    

    Query 3

    WITH cte AS (
       SELECT event_id, user_id, event_type, ts
            , row_number() OVER (PARTITION BY user_id, event_type
                                 ORDER BY ts) AS rn
       FROM   events
       )
    SELECT e.event_id, e.user_id, e.event_type, e.ts, e.rn - min(x.rn) + 1 AS ct
    FROM   cte e
    JOIN   cte x USING (user_id, event_type)
    WHERE  x.ts >= e.ts - interval '30 days'
    AND    x.ts <= e.ts
    GROUP  BY e.event_id, e.user_id, e.event_type, e.ts, e.rn
    ORDER  BY e.event_id;
    

    EXPLAIN (BUFFERS, ANALYZE)

    GroupAggregate  (cost=2788.06..2797.10 rows=278 width=40) (actual time=27.711..32.004 rows=10000 loops=1)
      Group Key: e.event_id, e.user_id, e.event_type, e.ts, e.rn
      Buffers: shared hit=9909
      CTE cte
        ->  WindowAgg  (cost=0.29..854.26 rows=10000 width=24) (actual time=0.015..8.340 rows=10000 loops=1)
              Buffers: shared hit=9909
              ->  Index Scan using events_fast_idx on events  (cost=0.29..654.26 rows=10000 width=24) (actual time=0.012..3.743 rows=10000 loops=1)
                    Buffers: shared hit=9909
      ->  Sort  (cost=1933.81..1934.50 rows=278 width=40) (actual time=27.696..28.470 rows=11780 loops=1)
            Sort Key: e.event_id, e.user_id, e.event_type, e.ts, e.rn
            Sort Method: quicksort  Memory: 1305kB
            Buffers: shared hit=9909
            ->  Merge Join  (cost=1728.77..1922.52 rows=278 width=40) (actual time=14.463..23.720 rows=11780 loops=1)
                  Merge Cond: ((e.user_id = x.user_id) AND (e.event_type = x.event_type))
                  Join Filter: ((x.ts <= e.ts) AND (x.ts >= (e.ts - '30 days'::interval)))
                  Rows Removed by Join Filter: 4710
                  Buffers: shared hit=9909
                  ->  Sort  (cost=864.39..889.39 rows=10000 width=32) (actual time=11.840..12.296 rows=10000 loops=1)
                        Sort Key: e.user_id, e.event_type
                        Sort Method: quicksort  Memory: 1166kB
                        Buffers: shared hit=9909
                        ->  CTE Scan on cte e  (cost=0.00..200.00 rows=10000 width=32) (actual time=0.017..10.536 rows=10000 loops=1)
                              Buffers: shared hit=9909
                  ->  Sort  (cost=864.39..889.39 rows=10000 width=28) (actual time=2.610..3.299 rows=16490 loops=1)
                        Sort Key: x.user_id, x.event_type
                        Sort Method: quicksort  Memory: 1166kB
                        ->  CTE Scan on cte x  (cost=0.00..200.00 rows=10000 width=28) (actual time=0.001..1.183 rows=10000 loops=1)
    Planning time: 0.151 ms
    Execution time: 32.775 ms
    
    • 3
  3. ypercubeᵀᴹ
    2015-04-07T12:06:27+08:00

    I don't think this will be better than the alternatives already offered, but it may be worth testing and adding to the options:

    with t as
      ( select 
            event_id, user_id, event_type, ts,
            row_number() over w as rn
        from events
        window w as (partition by user_id, event_type 
                     order by ts)
      ) 
    select t.event_id, t.user_id, t.event_type, t.ts,
           1 + t.rn - c.rn as cnt
    from t, lateral
         ( select tt.rn 
           from t as tt
           where tt.user_id = t.user_id
             and tt.event_type = t.event_type
             and tt.ts >= t.ts - interval '30 days'
           order by tt.ts                        -- option b:  order by tt.rn
           limit 1
         ) c
    order by t.user_id, t.event_type, t.ts ;
    
    • 1
