Question asked by A.F.N on 2023-06-28 19:43:25 +0800 CST

Postgres service fails to restart after a crash (killed by the OOM killer)

I am running the following Postgres version on Debian:

PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

Occasionally the OOM killer kills a Postgres process. Here is the kernel log of what happened:

/var/log/syslog:Jun 28 04:18:09 dashboard03 kernel: [17747145.470355] Out of memory: Killed process 3680079 (postgres) total-vm:60173160kB, anon-rss:2644kB, file-rss:0kB, shmem-rss:32631780kB, UID:107 pgtables:76056kB oom_score_adj:0

The problem is that Postgres does not start again automatically after the crash. Here is the Postgres log:

2023-06-28 04:18:09.891 UTC [3680076] LOG:  checkpointer process (PID 3680079) was terminated by signal 9: Killed
2023-06-28 04:18:09.917 UTC [3680076] LOG:  terminating any other active server processes
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard WARNING:  terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard DETAIL:  The postmaster has commanded this server process to roll back the cu>
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard HINT:  In a moment you should be able to reconnect to the database and repeat>
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard WARNING:  terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard DETAIL:  The postmaster has commanded this server process to roll back the cur>
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard HINT:  In a moment you should be able to reconnect to the database and repeat >
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard WARNING:  terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard DETAIL:  The postmaster has commanded this server process to roll back the cur>
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard HINT:  In a moment you should be able to reconnect to the database and repeat >
2023-06-28 04:18:09.917 UTC [3680082] WARNING:  terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [3680082] DETAIL:  The postmaster has commanded this server process to roll back the current transactio>
2023-06-28 04:18:09.917 UTC [3680082] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2023-06-28 04:18:10.162 UTC [1907896] ehsan@dashboard FATAL:  the database system is in recovery mode
2023-06-28 04:18:11.191 UTC [3680076] LOG:  all server processes terminated; reinitializing
2023-06-28 04:18:17.286 UTC [3680076] LOG:  received fast shutdown request
2023-06-28 04:18:17.287 UTC [1907912] LOG:  database system was interrupted; last known up at 2023-06-28 04:15:38 UTC
2023-06-28 04:18:17.364 UTC [1907913] ehsan@dashboard FATAL:  the database system is shutting down
2023-06-28 04:18:17.365 UTC [1907914] ehsan@dashboard FATAL:  the database system is shutting down
2023-06-28 04:18:17.484 UTC [1907912] LOG:  database system was not properly shut down; automatic recovery in progress
2023-06-28 04:18:17.514 UTC [1907912] LOG:  redo starts at 46C/93DCE1A8
2023-06-28 04:18:17.516 UTC [3680076] LOG:  abnormal database system shutdown
2023-06-28 04:18:17.678 UTC [3680076] LOG:  database system is shut down

I checked the restart_after_crash parameter, and its value is on.

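Why doesn't Postgres restart automatically after this crash, even though I can start it manually? What configuration am I missing to make Postgres restart after a crash?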

When I start it manually with sudo service postgresql start, this is what gets logged:

2023-06-28 04:26:22.605 UTC [1909598] LOG:  starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86>
2023-06-28 04:26:22.606 UTC [1909598] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-06-28 04:26:22.606 UTC [1909598] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.>
2023-06-28 04:26:22.649 UTC [1909600] LOG:  database system was interrupted while in recovery at 20>
2023-06-28 04:26:22.649 UTC [1909600] HINT:  This probably means that some data is corrupted and yo>
2023-06-28 04:26:22.653 UTC [1909601] ehsan@dashboard FATAL:  the database system is starting up
2023-06-28 04:26:22.653 UTC [1909602] ehsan@dashboard FATAL:  the database system is starting up
2023-06-28 04:26:22.762 UTC [1909600] LOG:  database system was not properly shut down; automatic r>
2023-06-28 04:26:22.765 UTC [1909600] LOG:  redo starts at 46C/93DCE1A8
2023-06-28 04:26:22.841 UTC [1909600] LOG:  invalid record length at 46C/94EF5338: wanted 24, got 0
2023-06-28 04:26:22.841 UTC [1909600] LOG:  redo done at 46C/94EF5300
2023-06-28 04:26:23.089 UTC [1909598] LOG:  database system is ready to accept connections

It looks like the database recovered quickly.

Here is the content of the systemd unit file /lib/systemd/system/postgresql@.service:

# systemd service template for PostgreSQL clusters. The actual instances will
# be called "postgresql@version-cluster", e.g. "[email protected]". The
# variable %i expands to "version-cluster", %I expands to "version/cluster".
# (%I breaks for cluster names containing dashes.)

[Unit]
Description=PostgreSQL Cluster %i
AssertPathExists=/etc/postgresql/%I/postgresql.conf
RequiresMountsFor=/etc/postgresql/%I /var/lib/postgresql/%I
PartOf=postgresql.service
ReloadPropagatedFrom=postgresql.service
Before=postgresql.service
# stop server before networking goes down on shutdown
After=network.target

[Service]
Type=forking
# -: ignore startup failure (recovery might take arbitrarily long)
# the actual pg_ctl timeout is configured in pg_ctl.conf
ExecStart=-/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i start
# 0 is the same as infinity, but "infinity" needs systemd 229
TimeoutStartSec=0
ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast %i stop
TimeoutStopSec=1h
ExecReload=/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i reload
PIDFile=/run/postgresql/%i.pid
SyslogIdentifier=postgresql@%i
# prevent OOM killer from choosing the postmaster (individual backends will
# reset the score to 0)
OOMScoreAdjust=-900
# restarting automatically will prevent "pg_ctlcluster ... stop" from working,
# so we disable it here. Also, the postmaster will restart by itself on most
# problems anyway, so it is questionable if one wants to enable external
# automatic restarts.
#Restart=on-failure
# (This should make pg_ctlcluster stop work, but doesn't:)
#RestartPreventExitStatus=SIGINT SIGTERM

[Install]
WantedBy=multi-user.target

1 Answer

Answered by jjanes on 2023-06-28T23:03:33+08:00

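When the OOM killer kicks in, your system is probably in poor shape, which is why it kicked in. It is not surprising that recovery is slow at that point, since anything that was swapped out has to be swapped back in. When you start the server manually later, the system is in better shape, so it is not surprising that recovery is faster. This doesn't explain who ordered the recovery to be aborted in the first place, but it does explain why recovery can be faster when you retry it manually later.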

You should focus on who ordered the fast shutdown during recovery, not on how long the second recovery attempt took.
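
One way to start that search is to pull every shutdown request out of the PostgreSQL log and compare its timestamp with other logs. The sketch below is only illustrative: the helper name find_shutdown_requests is made up here, and the log path in the usage comment is the usual Debian location, assumed rather than taken from the question.

```shell
#!/bin/sh
# Hypothetical helper: print every "received ... shutdown request" line
# from a PostgreSQL log file, so the timestamps can be matched against
# other logs (for example the systemd journal).
find_shutdown_requests() {
    # $1: path to a PostgreSQL log file
    grep -E 'received (smart|fast|immediate) shutdown request' "$1"
}

# Example (path is an assumption; adjust to your cluster):
#   find_shutdown_requests /var/log/postgresql/postgresql-13-main.log
```

With the timestamp in hand (04:18:17 in the log above), the same minute can be inspected in the journal, for instance with journalctl and its --since and --until options, to see what systemd was doing to the unit at that moment.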

Edit

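In any case, I can reproduce this easily on both Debian and Ubuntu, using PostgreSQL installed via apt together with the systemd files it ships. After provoking the OOM killer, PostgreSQL restarts automatically but is then immediately told to shut down. The only thing that could be sending that signal is systemd itself. This happens only in the OOM case; if I manually kill -9 a backend process, PostgreSQL restarts successfully.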

On a t2.micro machine, the following is enough to reproduce it:

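-- Let a single sort use far more memory than the machine has
SET work_mem TO '80GB';
-- Large in-memory sort; on a small machine this provokes the OOM killer
SELECT * FROM generate_series(1,10000000) ORDER BY random();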

I'm not a systemd lawyer. I don't know why it does this, but I'd have to guess that it does so on purpose.
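
A possible explanation, which the answer itself does not confirm: systemd has a per-service OOMPolicy= setting, and with the value stop it shuts down the whole unit when the kernel OOM killer hits one of the unit's processes. If that is what is happening here, a drop-in that sets OOMPolicy=continue might let the postmaster's own crash recovery finish. The sketch below is an illustration under that assumption; the helper name and the drop-in file name are made up, and the directory in the usage comment is the conventional override location for the postgresql@ template unit.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that sets OOMPolicy=continue.
# Whether this cures the behaviour described above is an assumption.
write_oom_dropin() {
    # $1: drop-in directory for the unit
    dir=$1
    mkdir -p "$dir" || return 1
    printf '[Service]\nOOMPolicy=continue\n' > "$dir/oom-policy.conf"
}

# Example (as root; the directory is an assumed conventional location):
#   write_oom_dropin /etc/systemd/system/postgresql@.service.d
#   systemctl daemon-reload
```

Writing a drop-in rather than editing the packaged unit file keeps the change from being overwritten by package upgrades.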

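You can keep the OOM killer away by disabling memory overcommit. On a t2.micro, however, doing so makes the machine unusable, because memory is already overcommitted out of the box thanks to snapd, packagekit, and other bloatware.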
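
As a rough sketch of what disabling overcommit involves: the kernel setting is vm.overcommit_memory, and the value 2 selects strict accounting. The helper below only reads the current value; the commands that change it need root and are left as comments. The helper name and the file name under /etc/sysctl.d/ are made-up examples.

```shell
#!/bin/sh
# Hypothetical helper: print the current overcommit mode.
#   0 = heuristic overcommit (kernel default)
#   1 = always overcommit
#   2 = strict accounting (no overcommit beyond the configured limit)
overcommit_mode() {
    # $1: optional path to read from; defaults to the live kernel setting
    cat "${1:-/proc/sys/vm/overcommit_memory}"
}

# Example:
#   echo "vm.overcommit_memory = $(overcommit_mode)"
#
# To switch to strict accounting (as root):
#   sysctl -w vm.overcommit_memory=2
# To persist it across reboots, put the same setting in a sysctl config
# file, e.g. (hypothetical file name):
#   echo 'vm.overcommit_memory = 2' > /etc/sysctl.d/90-overcommit.conf
```

As the answer warns, strict accounting can make a small machine unusable, so it is worth checking how much memory is already committed before switching.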
