我可以在使用数据库后激活 PITR 吗？

Question

Asked: 2024-08-01 01:53:31 +0800 CST2024-08-01 01:53:31 +0800 CST 2024-08-01 01:53:31 +0800 CST

如何在 GROUP BY 中使用类似 date_trunc() 的表达式？

772

我有几个包含工业数据的连接表：

create table v2.tag (
  tag_id integer generated always as identity,
  tag text not null,

  primary key (tag_id),
  unique (tag)
);

create table v2.state (
  tag_id integer not null,
  "timestamp" timestamp without time zone not null,
  value float not null,

  primary key (tag_id, timestamp),
  foreign key (tag_id) references v2.tag (tag_id)
) partition by range (timestamp);

该state表包含过去 6 个月的约 5000 万行时间序列数据，我需要使用各种查询对其进行基准测试。该表按月分区。

我尝试的查询只是获取每天的数据点数和标签，任何实际的 TSDB 都可以轻松地在如此小的数据集上完成此操作：

SELECT 
    count(*) as points,date_trunc('day', timestamp) as timestamp,tag.tag 
FROM 
    v2.state 
JOIN 
    v2.tag USING (tag_id) 
GROUP BY 
    timestamp, tag 
ORDER BY 
    timestamp ASC;

问题是，出于某种原因，此查询使数据库占用了近 3GB 的 RAM 并返回了一堆重复项。如下所示：

 2024-02-01 00:00:00 | /Heave         |      1
 2024-02-01 00:00:00 | /Pitch         |      1
 2024-02-01 00:00:00 | /Roll          |      1
 2024-02-01 00:00:00 | /Velocity      |      1
 2024-02-01 00:00:00 | /Heave         |      1
 ...

等等。在同一天，我无法滚动到下一天，它只是在结果中不断重复这些行，而不是像我预期的那样按标签计数。因此，它似乎并没有计算每天/标签的数据点数量，而是为数据库中约 5000 万行中的每一行生成一个重复项。

因此，聚合中有些地方出了问题。我预计此查询将返回大约 12K 行 ( 65*30*6)，但它却返回了数百万行，导致我尝试将其加载到的 Jupyter 笔记本发生 OOM 故障。

我尝试用来运行它EXPLAIN ANALYZE，但是由于我是 Postgres 的新手，所以它并没有真正......解释任何事情：

 Sort  (cost=700769.72..700798.22 rows=11400 width=78) (actual time=80503.260..83825.211 rows=47499969 loops=1)
   Sort Key: (date_trunc('day'::text, state."timestamp"))
   Sort Method: external merge  Disk: 4703296kB
   ->  Finalize GroupAggregate  (cost=697027.86..700001.55 rows=11400 width=78) (actual time=35609.801..64328.719 rows=47
499969 loops=1)
         Group Key: state."timestamp", tag.tag
         ->  Gather Merge  (cost=697027.86..699688.05 rows=22800 width=70) (actual time=35609.453..55143.276 rows=4749996
9 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Sort  (cost=696027.84..696056.34 rows=11400 width=70) (actual time=34526.070..42018.956 rows=15833323 
loops=3)
                     Sort Key: state."timestamp", tag.tag
                     Sort Method: external merge  Disk: 1414088kB
                     Worker 0:  Sort Method: external merge  Disk: 1446832kB
                     Worker 1:  Sort Method: external merge  Disk: 1470664kB
                     ->  Partial HashAggregate  (cost=695145.67..695259.67 rows=11400 width=70) (actual time=8690.289..20
138.661 rows=15833323 loops=3)
                           Group Key: state."timestamp", tag.tag
                           Batches: 1029  Memory Usage: 8241kB  Disk Usage: 1694608kB
                           Worker 0:  Batches: 901  Memory Usage: 8241kB  Disk Usage: 1727928kB
                           Worker 1:  Batches: 773  Memory Usage: 8241kB  Disk Usage: 1748528kB
                           ->  Hash Join  (cost=2.28..652834.40 rows=5641502 width=62) (actual time=138.598..4142.702 row
s=15833323 loops=3)
                                 Hash Cond: (state.tag_id = tag.tag_id)
                                 ->  Parallel Append  (cost=0.00..599769.83 rows=19794743 width=12) (actual time=138.383.
.2665.699 rows=15833323 loops=3)
                                       ->  Parallel Seq Scan on state_y2024m04 state_4  (cost=0.00..221214.31 rows=874583
1 width=12) (actual time=39.308..827.302 rows=6996457 loops=3)
                                       ->  Parallel Seq Scan on state_y2024m02 state_2  (cost=0.00..172317.34 rows=680943
4 width=12) (actual time=58.866..1102.604 rows=8171318 loops=2)
                                       ->  Parallel Seq Scan on state_y2024m03 state_3  (cost=0.00..78305.04 rows=3095204
 width=12) (actual time=0.766..694.493 rows=7428501 loops=1)
                                       ->  Parallel Seq Scan on state_y2024m05 state_5  (cost=0.00..28879.42 rows=1141442
 width=12) (actual time=180.418..416.467 rows=2739461 loops=1)
                                       ->  Parallel Seq Scan on state_y2024m01 state_1  (cost=0.00..20.00 rows=1000 width
=12) (actual time=0.000..0.001 rows=0 loops=1)
                                       ->  Parallel Seq Scan on state_y2024m06 state_6  (cost=0.00..20.00 rows=1000 width
=12) (actual time=0.000..0.001 rows=0 loops=1)
                                       ->  Parallel Seq Scan on state_y2024m07 state_7  (cost=0.00..20.00 rows=1000 width
=12) (actual time=0.000..0.001 rows=0 loops=1)
                                       ->  Parallel Seq Scan on state_y2024m08 state_8  (cost=0.00..20.00 rows=1000 width
=12) (actual time=0.002..0.002 rows=0 loops=1)
                                 ->  Hash  (cost=1.57..1.57 rows=57 width=58) (actual time=0.149..0.268 rows=65 loops=3)
                                       Buckets: 1024  Batches: 1  Memory Usage: 14kB
                                       ->  Seq Scan on tag  (cost=0.00..1.57 rows=57 width=58) (actual time=0.031..0.036 
rows=65 loops=3)
 Planning Time: 2.447 ms
 JIT:
   Functions: 96
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 14.487 ms, Inlining 105.515 ms, Optimization 203.723 ms, Emission 143.355 ms, Total 467.081 ms
 Execution Time: 86164.911 ms

那么我的查询有什么问题？为什么它没有聚合？

2 个回答

Voted

Erwin Brandstetter · Answer 1 · 2024-08-01T10:29:21+08:00

紧急问题

您的答案没有指出实际问题。您可以使用“timestamp”作为别名。以及“day”。（这并不意味着两者都是好主意。）“timestamp”和“day”是标准 SQL 中的保留字。在 Postgres 中，两者的限制较少，但无论如何，最好避免使用保留字作为标识符。不过，这不是眼前的问题。

当前问题是这样的：

如果中的简单名称GROUP BY与输出列名称和输入列名称都匹配，它将把它解释为输入列名称。在您的例子中，GROUP BY timestamp解析为GROUP BY state.timestamp。这不是您想要的。由于这几乎不会对任何行进行分组，因此您最终会得到数百万个输出行，从而导致性能下降。

ORDER BY使用相反的优先级。这一切都是在 SQL 标准中设置的，有些是由于历史原因，并反映了命令中的事件顺序SELECT。请参阅：

解决方案

避免输出列名与输入列名冲突，以排除任何混淆。如果这种冲突不可避免（真的吗？），请根据中的输入列重复表达式GROUP BY，或使用位置引用。例如：

SELECT count(*) AS points, date_trunc('day', state.timestamp) AS timestamp, tag.tag 
FROM   v2.state
JOIN   v2.tag USING (tag_id) 
GROUP  BY date_trunc('day', state.timestamp), tag.tag 
--   GROUP BY 2, 3  -- or just this for short
ORDER  BY timestamp;

保留您可疑的别名“时间戳”和一些其他缺陷。

更好的解决方案

SELECT s.points, s.the_day, t.tag
FROM  (
   SELECT count(*) AS points, s.timestamp::date AS the_day, s.tag_id
   FROM   v2.state s
   GROUP  BY 2, 3
   ) s
JOIN   v2.tag t USING (tag_id) 
ORDER  BY s.the_day, t.tag;  -- this deterministic order seems more adequate

避免使用保留字。在连接之前
进行聚合。通常更快。请参阅：

单个查询中多次调用 array_agg()

使用表别名来降低噪音水平。

对查询中可能出现冲突或混淆的所有列进行表限定。

timestamp::date相当于date_trunc('day', timestamp)实际timestamp数据类型，返回类型date而不是timestamp，并且效率更高 - 最重要的是对于以下索引。

由于这个问题（也）与性能有关：如果表主要是只读的并且这个查询的性能很重要，请考虑这个匹配的多列表达式索引：

CREATE INDEX my_idx ON ON state ((timestamp::date), tag_id);

确保表VACUUM足够大，以便为您提供仅索引扫描。我还期望+ = 8 字节
的局部最优值。参见：dateinteger

复合索引也适用于第一个字段的查询吗？

并考虑一个干净的名字来代替state.timestamp。

Megakoresh · Answer 2 · 2024-08-01T02:11:18+08:00

好吧，我是个 Postgres 菜鸟，这是根本原因。出于某种原因，我无法将其用作timestamp字段的别名。当我将查询更改为如下所示时：

SELECT 
    count(*) as points,date_trunc('day', timestamp) as day,tag.tag 
FROM 
    v2.state 
JOIN 
    v2.tag USING (tag_id) 
GROUP BY 
    day, tag 
ORDER BY 
    day ASC;

它在 3.5 秒内执行并获得了预期结果。好的，明白了，那么...

如何在 GROUP BY 中使用类似 date_trunc() 的表达式？

紧急问题

解决方案

更好的解决方案

连接到 PostgreSQL 服务器：致命：主机没有 pg_hba.conf 条目

如何让sqlplus的输出出现在一行中？

选择具有最大日期或最晚日期的日期

如何列出 PostgreSQL 中的所有模式？

列出指定表的所有列

如何在不修改我自己的 tnsnames.ora 的情况下使用 sqlplus 连接到位于另一台主机上的 Oracle 数据库

你如何mysqldump特定的表？

使用 psql 列出数据库权限

如何从 PostgreSQL 中的选择查询中将值插入表中？

如何使用 psql 列出所有数据库和表？

如何在 GROUP BY 中使用类似 date_trunc() 的表达式？

2 个回答

紧急问题

解决方案

更好的解决方案

相关问题