I am running the following Postgres on Debian:
PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
Occasionally the OOM killer kills a Postgres process. Here is the log of what happened:
/var/log/syslog:Jun 28 04:18:09 dashboard03 kernel: [17747145.470355] Out of memory: Killed process 3680079 (postgres) total-vm:60173160kB, anon-rss:2644kB, file-rss:0kB, shmem-rss:32631780kB, UID:107 pgtables:76056kB oom_score_adj:0
The problem is that Postgres does not start again automatically after the crash. Here is the Postgres log:
2023-06-28 04:18:09.891 UTC [3680076] LOG: checkpointer process (PID 3680079) was terminated by signal 9: Killed
2023-06-28 04:18:09.917 UTC [3680076] LOG: terminating any other active server processes
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard DETAIL: The postmaster has commanded this server process to roll back the cu>
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard HINT: In a moment you should be able to reconnect to the database and repeat>
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard DETAIL: The postmaster has commanded this server process to roll back the cur>
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard HINT: In a moment you should be able to reconnect to the database and repeat >
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard DETAIL: The postmaster has commanded this server process to roll back the cur>
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard HINT: In a moment you should be able to reconnect to the database and repeat >
2023-06-28 04:18:09.917 UTC [3680082] WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [3680082] DETAIL: The postmaster has commanded this server process to roll back the current transactio>
2023-06-28 04:18:09.917 UTC [3680082] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-06-28 04:18:10.162 UTC [1907896] ehsan@dashboard FATAL: the database system is in recovery mode
2023-06-28 04:18:11.191 UTC [3680076] LOG: all server processes terminated; reinitializing
2023-06-28 04:18:17.286 UTC [3680076] LOG: received fast shutdown request
2023-06-28 04:18:17.287 UTC [1907912] LOG: database system was interrupted; last known up at 2023-06-28 04:15:38 UTC
2023-06-28 04:18:17.364 UTC [1907913] ehsan@dashboard FATAL: the database system is shutting down
2023-06-28 04:18:17.365 UTC [1907914] ehsan@dashboard FATAL: the database system is shutting down
2023-06-28 04:18:17.484 UTC [1907912] LOG: database system was not properly shut down; automatic recovery in progress
2023-06-28 04:18:17.514 UTC [1907912] LOG: redo starts at 46C/93DCE1A8
2023-06-28 04:18:17.516 UTC [3680076] LOG: abnormal database system shutdown
2023-06-28 04:18:17.678 UTC [3680076] LOG: database system is shut down
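I checked the restart_after_crash parameter, and its value is on.
Why does Postgres fail to restart automatically after this crash, even though I can start it manually? What configuration am I missing to have Postgres restart after a crash?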
When I start it manually with sudo service postgresql start, this is what gets logged:
2023-06-28 04:26:22.605 UTC [1909598] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86>
2023-06-28 04:26:22.606 UTC [1909598] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-06-28 04:26:22.606 UTC [1909598] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.>
2023-06-28 04:26:22.649 UTC [1909600] LOG: database system was interrupted while in recovery at 20>
2023-06-28 04:26:22.649 UTC [1909600] HINT: This probably means that some data is corrupted and yo>
2023-06-28 04:26:22.653 UTC [1909601] ehsan@dashboard FATAL: the database system is starting up
2023-06-28 04:26:22.653 UTC [1909602] ehsan@dashboard FATAL: the database system is starting up
2023-06-28 04:26:22.762 UTC [1909600] LOG: database system was not properly shut down; automatic r>
2023-06-28 04:26:22.765 UTC [1909600] LOG: redo starts at 46C/93DCE1A8
2023-06-28 04:26:22.841 UTC [1909600] LOG: invalid record length at 46C/94EF5338: wanted 24, got 0
2023-06-28 04:26:22.841 UTC [1909600] LOG: redo done at 46C/94EF5300
2023-06-28 04:26:23.089 UTC [1909598] LOG: database system is ready to accept connections
The database seems to recover quickly.
Here are the contents of the systemd file /lib/systemd/system/postgresql@.service:
# systemd service template for PostgreSQL clusters. The actual instances will
# be called "postgresql@version-cluster", e.g. "[email protected]". The
# variable %i expands to "version-cluster", %I expands to "version/cluster".
# (%I breaks for cluster names containing dashes.)
[Unit]
Description=PostgreSQL Cluster %i
AssertPathExists=/etc/postgresql/%I/postgresql.conf
RequiresMountsFor=/etc/postgresql/%I /var/lib/postgresql/%I
PartOf=postgresql.service
ReloadPropagatedFrom=postgresql.service
Before=postgresql.service
# stop server before networking goes down on shutdown
After=network.target
[Service]
Type=forking
# -: ignore startup failure (recovery might take arbitrarily long)
# the actual pg_ctl timeout is configured in pg_ctl.conf
ExecStart=-/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i start
# 0 is the same as infinity, but "infinity" needs systemd 229
TimeoutStartSec=0
ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast %i stop
TimeoutStopSec=1h
ExecReload=/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i reload
PIDFile=/run/postgresql/%i.pid
SyslogIdentifier=postgresql@%i
# prevent OOM killer from choosing the postmaster (individual backends will
# reset the score to 0)
OOMScoreAdjust=-900
# restarting automatically will prevent "pg_ctlcluster ... stop" from working,
# so we disable it here. Also, the postmaster will restart by itself on most
# problems anyway, so it is questionable if one wants to enable external
# automatic restarts.
#Restart=on-failure
# (This should make pg_ctlcluster stop work, but doesn't:)
#RestartPreventExitStatus=SIGINT SIGTERM
[Install]
WantedBy=multi-user.target
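By the time the OOM killer kicks in, your system is probably in poor shape, which is why it kicked in. It is not surprising that recovery is slow at that point, since anything that was swapped out has to be swapped back in. When you start it manually later, the system is in better shape, so it is not surprising that recovery is faster. This does not explain who ordered the recovery to abort in the first place, but it does explain why recovery can be faster on a later manual attempt.
You should focus on who ordered the fast shutdown during recovery, not on how long the second recovery attempt took.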
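To pin down when the shutdown was ordered, it can help to pull that moment out of the Postgres log mechanically. The following is a minimal sketch, assuming the log line prefix shown in the question (timestamp, "UTC", PID in brackets). The function name `shutdowns_during_recovery` and the `SAMPLE` text are illustrative; the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Matches the prefix seen in the question: "<timestamp> UTC [<pid>] <rest>".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC "
    r"\[(?P<pid>\d+)\] (?P<rest>.*)$"
)
SHUTDOWN_RE = re.compile(r"received (\w+) shutdown request")


def shutdowns_during_recovery(log_text):
    """Find shutdown requests that arrive after a crash reinitialization
    but before the server reports it is ready to accept connections."""
    hits = []
    reinit_at = None  # time of a crash reinitialization that has not finished yet
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not carry the expected prefix
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        rest = m.group("rest")
        if "all server processes terminated; reinitializing" in rest:
            reinit_at = ts
        elif "ready to accept connections" in rest:
            reinit_at = None  # recovery completed normally
        elif reinit_at is not None:
            req = SHUTDOWN_RE.search(rest)
            if req:
                hits.append({
                    "mode": req.group(1),
                    "requested_at": ts,
                    "postmaster_pid": int(m.group("pid")),
                    "seconds_after_reinit": (ts - reinit_at).total_seconds(),
                })
                reinit_at = None
    return hits


# Three lines copied from the log in the question.
SAMPLE = """\
2023-06-28 04:18:11.191 UTC [3680076] LOG: all server processes terminated; reinitializing
2023-06-28 04:18:17.286 UTC [3680076] LOG: received fast shutdown request
2023-06-28 04:18:17.514 UTC [1907912] LOG: redo starts at 46C/93DCE1A8
"""

for hit in shutdowns_during_recovery(SAMPLE):
    print(hit["mode"], hit["requested_at"], hit["seconds_after_reinit"])
```

The timestamp it reports can then be compared against the systemd journal for the cluster's unit around the same second, to see whether systemd logged a stop for the service at that time.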
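Edit
Anyway, I can reproduce this easily on both Debian and Ubuntu, with PostgreSQL installed via apt and using the systemd file it supplies. After provoking the OOM killer, PostgreSQL restarts itself automatically but is then immediately told to shut down. The only thing that could be sending that signal is systemd itself. This only happens in the OOM case; if I manually kill -9 a backend process, PostgreSQL restarts successfully.
On a t2.micro machine, it can be reproduced simply by doing the following: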
I am not a systemd lawyer. I don't know why it does this, but I would have to guess it is deliberate.
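One candidate worth checking, offered as an assumption rather than a confirmed diagnosis: systemd 243 and later have an `OOMPolicy=` service setting. With the policy `stop`, systemd stops the whole unit cleanly when the kernel OOM killer kills one of its processes, which would run the unit's `ExecStop` (the fast `pg_ctlcluster ... stop` in the file above). The sketch below parses `systemctl show` output to see which policy is in effect. The helper names and the sample property text are illustrative, not taken from the question.

```python
import subprocess


def parse_properties(show_output):
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def stops_unit_on_oom(props):
    """True if the reported policy takes the whole unit down after an OOM kill.
    A missing property (older systemd) is treated as 'no such behavior'."""
    return props.get("OOMPolicy") in ("stop", "kill")


def read_unit_properties(unit):
    """Query a live unit. Not called here; needs a machine running systemd."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=OOMPolicy,Restart"],
        capture_output=True, text=True, check=True,
    )
    return parse_properties(result.stdout)


# Hypothetical drop-in contents that would tell systemd to leave the unit alone.
DROP_IN = "[Service]\nOOMPolicy=continue\n"

# Illustrative output, not captured from the asker's machine.
example = parse_properties("OOMPolicy=stop\nRestart=no\n")
print(stops_unit_on_oom(example))
```

If the policy does turn out to be `stop`, a drop-in for the cluster's unit with the contents of `DROP_IN` would be the thing to try, so that the postmaster's own crash recovery is left to finish. Whether that is appropriate for a given machine is a judgment call, since the policy exists to clean up services left in a bad state.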
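You can prevent the OOM killer from firing by disabling memory overcommit. But on a t2.micro, doing so leaves the machine unusable, because memory is already overcommitted out of the box thanks to snapd, packagekit, and other bloatware.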
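To make the t2.micro problem concrete, here is a rough sketch of the arithmetic behind strict overcommit (`vm.overcommit_memory=2`). In that mode the kernel caps total committed memory at swap plus a percentage of RAM, where the percentage is `vm.overcommit_ratio` (default 50). The 1 GiB figure is the nominal t2.micro memory size, and the function names are illustrative.

```python
# Meaning of the values in /proc/sys/vm/overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the commit limit",
}


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Approximate CommitLimit under mode 2, assuming vm.overcommit_kbytes
    is 0 so that the ratio applies."""
    return swap_kb + (ram_kb - hugetlb_kb) * overcommit_ratio // 100


def current_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the active mode on a Linux host. Not called here."""
    with open(path) as f:
        return int(f.read().strip())


# Nominal t2.micro: 1 GiB of RAM, no swap configured.
limit = commit_limit_kb(ram_kb=1024 * 1024, swap_kb=0)
print(OVERCOMMIT_MODES[2], "->", limit, "kB")
```

With the default ratio and no swap, only about half of the RAM can be committed at all, which is consistent with the machine becoming unusable once the preinstalled services have taken their share. Adding swap or raising the ratio moves the limit, at the cost of getting closer to real memory exhaustion.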