I have a question related to an incident that happened on our Postgres database.
We have a backend service that tracks user activity during a kind of online event.
The service uses Postgres 12 as its database, and there is one table for these events (which I will call user_event for the rest of the question) that holds the data. The table handles roughly 3500 transactions per second, most of them updates (almost all of which are HOT updates - this may be an important detail).
Here is a separate chart of inserted and deleted tuples (deleted tuples are at 0), since they are not visible on the chart above:
Usage of the table started around 11:00 and it behaved fine during the day, but at around 18:00 the table size grew from ~3 GB to >80 GB within 2 hours. This brought the disk close to full capacity, so we had to disable the ongoing event and run a VACUUM FULL, which reduced the table from >80 GB back to ~3 GB.
* The bloat shown on the second chart was calculated using this query
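The query itself is not reproduced here. As an aside for readers: if the pgstattuple contrib extension is available, dead space and free space in the table can also be measured directly; a minimal sketch (illustrative only, not the query we used):

-- Requires the pgstattuple contrib extension.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Reports live/dead tuple counts and percentages plus free space,
-- which together give a direct picture of bloat in user_event.
SELECT * FROM pgstattuple('public.user_event');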
At that moment (around 20:30 - 21:00) we suspected autovacuum was the problem and needed to be more aggressive, so we set autovacuum_naptime to 30 seconds (default is 1 minute), autovacuum_vacuum_scale_factor to 0.005 and autovacuum_vacuum_cost_delay to 0 (the last 2 parameters we set only for the critical table). Before this change, autovacuum was running every 60 seconds before 18:00 and every 2 minutes after 18:00 (due to the growth of the table).
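For reference, the changes described above are typically applied like this (the parameter values are the ones from the question; the statements themselves are just a sketch of the usual way to set them):

-- Global setting (takes effect after a config reload).
ALTER SYSTEM SET autovacuum_naptime = '30s';
SELECT pg_reload_conf();

-- Per-table overrides for the hot table only, as described above.
ALTER TABLE public.user_event SET (
    autovacuum_vacuum_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay   = 0
);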
Perhaps an important piece of information: during the whole event, the vast majority of checkpoints were requested (not timed), occurring roughly every 3-4 minutes:
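(On Postgres 12 the timed vs. requested checkpoint counters can be read from pg_stat_bgwriter; shown here only as a sketch of how the chart above can be cross-checked:)

-- Cumulative counters since the last stats reset:
-- checkpoints_timed = started by checkpoint_timeout,
-- checkpoints_req   = requested (e.g. because max_wal_size was reached).
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;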
Since then the problem has not recurred, but we still cannot explain why and how such a huge increase in table size happened. After detailed inspection, autovacuum (and autoanalyze) were indeed running the whole time (we have confirmed this in both metrics and logs).
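One way to cross-check this (an illustrative query, not necessarily the exact metrics we collected) is the per-table statistics view:

-- When autovacuum/autoanalyze last ran on the table and how often,
-- plus the live/dead tuple estimates the autovacuum launcher works from.
SELECT relname,
       last_autovacuum, autovacuum_count,
       last_autoanalyze, autoanalyze_count,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'user_event';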
To give you some more context: we noticed that when the table started growing rapidly (around 18:00), the dynamics in the buffers changed as well:
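(The chart comes from our monitoring; if you want to look at similar numbers directly, the background writer statistics are one place such data is typically taken from - this is an assumption about the chart, shown only as a sketch:)

-- Cumulative buffer counters: written at checkpoints, by the background
-- writer, and directly by backends, plus total buffer allocations.
SELECT buffers_checkpoint, buffers_clean, buffers_backend, buffers_alloc
FROM pg_stat_bgwriter;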
Does anyone have an idea what (or where) to look for to explain this? I can provide more information if needed.
Additional data: the user_event table schema looks like this:
# \d user_event;
Table "public.user_event"
Column | Type | Collation | Nullable | Default
----------------------------------+--------------------------+-----------+----------+---------
id | uuid | | not null |
event_id | text | | not null |
user_id | bigint | | not null |
server_id | integer | | not null |
state | text | | not null |
generator_type | text | | |
timestamp_when_claimable | timestamp with time zone | | not null |
claimeditems | jsonb | | not null |
timestamp_when_energy_full | timestamp with time zone | | not null |
overflow_energy | integer | | not null |
timestamp_when_energy_cost_reset | timestamp with time zone | | not null |
grid | jsonb | | not null |
refill_count | integer | | not null |
Indexes:
"user_event_pkey" PRIMARY KEY, btree (id)
"idx_event_id" btree (event_id)
"idx_event_id_state" btree (event_id, state)
"idx_login_id" btree (login_id)
"idx_login_id_event_id" btree (login_id, event_id)
The Postgres version is 12, and the most relevant parameters are:
- shared_buffers: 6GB
- maintenance_work_mem: 1GB
- max_wal_size: 10GB
- min_wal_size: 1GB
- checkpoint_timeout: 10min
- effective_cache_size: 58GB
- log_autovacuum_min_duration: 0
- autovacuum_vacuum_scale_factor: 0.02
- autovacuum_analyze_scale_factor: 0.01
The VM has these resources:
- 16 CPUs
- 64 GB of RAM
- 250 GB SSD disk
It is hard to know what your charts are actually showing. n_tup_hot_upd, for example, is a cumulative counter that only ever increases; it only goes down when the system crashes or the stats get reset. Yet your chart does not show it this way, so it must be plotting some kind of difference between consecutive snapshots. But in that case, is it also showing all the other lines on the same basis?
The important thing is not the percentage which were HOT, but the raw number which were not HOT. Your first chart doesn't really accentuate this aspect, but if you squint at it you can kind of see a gap opening up between the two lines right when the problem started occurring.
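If you want to put a number on that gap rather than reading it off a chart, the per-table counters give it to you directly (a sketch; the difference between the two counters is the non-HOT update count):

-- Cumulative since the last stats reset; sample periodically and diff
-- to get the rate of updates that could NOT use HOT.
SELECT n_tup_upd,
       n_tup_hot_upd,
       n_tup_upd - n_tup_hot_upd AS n_tup_non_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'user_event';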
So the likely answer is that something was holding open a snapshot, which prevented HOT tuples from being pruned. Once the pages are full of unpruned HOT tuples, new versions need to go on new pages, defeating HOT. Updates on tuples put into those new pages would then qualify for HOT again, until those new pages again become full. So HOT never fully shuts off, its effectiveness just decays.
The non-HOT updates will trigger more vacuums to occur, but the vacuums can't actually do anything useful, because the same snapshot which defeats HOT update also defeats vacuuming.
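One way to see this while it is happening is a manual VACUUM (VERBOSE) on the table: if an old snapshot is the culprit, the output reports a large number of dead row versions that cannot yet be removed, along with the oldest xmin holding them back (exact wording varies by version). A sketch:

-- Watch the DETAIL lines of the output: many dead row versions that
-- "cannot be removed yet" mean vacuum is blocked by an old snapshot.
VACUUM (VERBOSE) public.user_event;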
The main things that hold a snapshot open for a long time are either a very long-running query, or a transaction at an isolation level above READ COMMITTED that is sitting idle in transaction. These are easy to see in pg_stat_activity (query_start, xact_start, state) while they are happening, but hard to determine after the fact.
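A monitoring query along these lines (a sketch, not a canned solution) catches both cases while they are still running:

-- Sessions holding back the cleanup horizon (backend_xmin is set),
-- oldest transaction first; look for 'idle in transaction' states or
-- very old query_start values.
SELECT pid, state, backend_xmin,
       now() - xact_start  AS xact_age,
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;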