I have a question related to an incident that happened on our Postgres database.
We have a backend service that tracks user activity during a kind of online event.
The service uses Postgres 12 as its database, and there is one table for these events (which I will call user_event for the rest of the question) that holds the data. The table handles roughly 3500 transactions per second, most of them updates (almost all of which are HOT updates - this may be an important detail).
Here is a separate chart of inserted and deleted tuples (deleted tuples are at 0), since they are not visible on the chart above:
Usage of the table started around 11:00 and it behaved fine during the day, but at around 18:00 the table size grew from ~3 GB to >80 GB within 2 hours. This brought the disk close to full capacity, so we had to disable the ongoing event and run a VACUUM FULL, which reduced the table from >80 GB back to ~3 GB.
* The bloat shown on the second chart was calculated using this query
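The query itself is not reproduced here. As an aside for readers: if the pgstattuple contrib extension is available, dead space and free space in the table can also be measured directly; a minimal sketch (illustrative only, not the query we used):

-- Requires the pgstattuple contrib extension.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Reports live/dead tuple counts and percentages plus free space,
-- which together give a direct picture of bloat in user_event.
SELECT * FROM pgstattuple('public.user_event');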
At that moment (around 20:30 - 21:00) we suspected autovacuum was the problem and needed to be more aggressive, so we set autovacuum_naptime to 30 seconds (default is 1 minute), autovacuum_vacuum_scale_factor to 0.005 and autovacuum_vacuum_cost_delay to 0 (the last 2 parameters we set only for the critical table). Before this change, autovacuum was running every 60 seconds before 18:00 and every 2 minutes after 18:00 (due to the growth of the table).
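For reference, the changes described above are typically applied like this (the parameter values are the ones from the question; the statements themselves are just a sketch of the usual way to set them):

-- Global setting (takes effect after a config reload).
ALTER SYSTEM SET autovacuum_naptime = '30s';
SELECT pg_reload_conf();

-- Per-table overrides for the hot table only, as described above.
ALTER TABLE public.user_event SET (
    autovacuum_vacuum_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay   = 0
);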
Perhaps an important piece of information: during the whole event, the vast majority of checkpoints were requested (not timed), occurring roughly every 3-4 minutes:
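(On Postgres 12 the timed vs. requested checkpoint counters can be read from pg_stat_bgwriter; shown here only as a sketch of how the chart above can be cross-checked:)

-- Cumulative counters since the last stats reset:
-- checkpoints_timed = started by checkpoint_timeout,
-- checkpoints_req   = requested (e.g. because max_wal_size was reached).
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;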
Since then the problem has not recurred, but we still cannot explain why and how such a huge increase in table size happened. After detailed inspection, autovacuum (and autoanalyze) were indeed running the whole time (we have confirmed this in both metrics and logs).
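One way to cross-check this (an illustrative query, not necessarily the exact metrics we collected) is the per-table statistics view:

-- When autovacuum/autoanalyze last ran on the table and how often,
-- plus the live/dead tuple estimates the autovacuum launcher works from.
SELECT relname,
       last_autovacuum, autovacuum_count,
       last_autoanalyze, autoanalyze_count,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'user_event';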
To give you some more context: we noticed that when the table started growing rapidly (around 18:00), the dynamics in the buffers changed as well:
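(The chart comes from our monitoring; if you want to look at similar numbers directly, the background writer statistics are one place such data is typically taken from - this is an assumption about the chart, shown only as a sketch:)

-- Cumulative buffer counters: written at checkpoints, by the background
-- writer, and directly by backends, plus total buffer allocations.
SELECT buffers_checkpoint, buffers_clean, buffers_backend, buffers_alloc
FROM pg_stat_bgwriter;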
Does anyone have an idea what (or where) to look for to explain this? I can provide more information if needed.
Additional data: the user_event table schema looks like this:
# \d user_event;
Table "public.user_event"
Column | Type | Collation | Nullable | Default
----------------------------------+--------------------------+-----------+----------+---------
id | uuid | | not null |
event_id | text | | not null |
user_id | bigint | | not null |
server_id | integer | | not null |
state | text | | not null |
generator_type | text | | |
timestamp_when_claimable | timestamp with time zone | | not null |
claimeditems | jsonb | | not null |
timestamp_when_energy_full | timestamp with time zone | | not null |
overflow_energy | integer | | not null |
timestamp_when_energy_cost_reset | timestamp with time zone | | not null |
grid | jsonb | | not null |
refill_count | integer | | not null |
Indexes:
"user_event_pkey" PRIMARY KEY, btree (id)
"idx_event_id" btree (event_id)
"idx_event_id_state" btree (event_id, state)
"idx_login_id" btree (login_id)
"idx_login_id_event_id" btree (login_id, event_id)
The Postgres version is 12, and the most relevant parameters are:
- shared_buffers: 6GB
- maintenance_work_mem: 1GB
- max_wal_size: 10GB
- min_wal_size: 1GB
- checkpoint_timeout: 10min
- effective_cache_size: 58GB
- log_autovacuum_min_duration: 0
- autovacuum_vacuum_scale_factor: 0.02
- autovacuum_analyze_scale_factor: 0.01
The VM has these resources:
- 16 CPUs
- 64 GB of RAM
- 250 GB SSD disk
It is hard to know what your charts are actually showing. n_tup_hot_upd, for example, is a cumulative counter that only ever increases; it only goes down when the system crashes or the stats get reset. Yet your chart does not show it this way, so it must be plotting some kind of difference between consecutive snapshots. But in that case, is it also showing all the other lines on the same basis?
The important thing is not the percentage which were HOT, but the raw number which were not HOT. Your first chart doesn't really accentuate this aspect, but if you squint at it you can kind of see a gap opening up between the two lines right when the problem started occurring.
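If you want to put a number on that gap rather than reading it off a chart, the per-table counters give it to you directly (a sketch; the difference between the two counters is the non-HOT update count):

-- Cumulative since the last stats reset; sample periodically and diff
-- to get the rate of updates that could NOT use HOT.
SELECT n_tup_upd,
       n_tup_hot_upd,
       n_tup_upd - n_tup_hot_upd AS n_tup_non_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'user_event';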
So the likely answer is that something was holding open a snapshot, which prevented HOT tuples from being pruned. Once the pages are full of unpruned HOT tuples, new versions need to go on new pages, defeating HOT. Updates on tuples put into those new pages would then qualify for HOT again, until those new pages again become full. So HOT never fully shuts off, its effectiveness just decays.
The non-HOT updates will trigger more vacuums to occur, but the vacuums can't actually do anything useful, because the same snapshot which defeats HOT update also defeats vacuuming.
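One way to see this while it is happening is a manual VACUUM (VERBOSE) on the table: if an old snapshot is the culprit, the output reports a large number of dead row versions that cannot yet be removed, along with the oldest xmin holding them back (exact wording varies by version). A sketch:

-- Watch the DETAIL lines of the output: many dead row versions that
-- "cannot be removed yet" mean vacuum is blocked by an old snapshot.
VACUUM (VERBOSE) public.user_event;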
The main things that hold a snapshot open for a long time are either a very long-running query, or a transaction at an isolation level above READ COMMITTED that is sitting idle in transaction. These are easy to see in pg_stat_activity (query_start, xact_start, state) while they are happening, but hard to determine after the fact.
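A monitoring query along these lines (a sketch, not a canned solution) catches both cases while they are still running:

-- Sessions holding back the cleanup horizon (backend_xmin is set),
-- oldest transaction first; look for 'idle in transaction' states or
-- very old query_start values.
SELECT pid, state, backend_xmin,
       now() - xact_start  AS xact_age,
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;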