I am using the following Postgres version on Debian:
PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
Occasionally the OOM killer kills a Postgres process. Here is the syslog entry from one such event:
/var/log/syslog:Jun 28 04:18:09 dashboard03 kernel: [17747145.470355] Out of memory: Killed process 3680079 (postgres) total-vm:60173160kB, anon-rss:2644kB, file-rss:0kB, shmem-rss:32631780kB, UID:107 pgtables:76056kB oom_score_adj:0
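For readers who want the numbers in that kernel line in a friendlier unit, here is a minimal Python sketch that pulls out the killed PID and the memory counters and converts them to GiB. The line is copied from the syslog entry above (prefix dropped); the function names parse_oom_line and kb_to_gib are made up for illustration.

```python
import re

# Kernel OOM line copied from the syslog excerpt (hostname/timestamp prefix dropped).
OOM_LINE = (
    "Out of memory: Killed process 3680079 (postgres) "
    "total-vm:60173160kB, anon-rss:2644kB, file-rss:0kB, "
    "shmem-rss:32631780kB, UID:107 pgtables:76056kB oom_score_adj:0"
)


def parse_oom_line(line):
    """Return (killed PID, oom_score_adj, {counter name: size in kB})."""
    pid = int(re.search(r"Killed process (\d+)", line).group(1))
    adj = int(re.search(r"oom_score_adj:(-?\d+)", line).group(1))
    # Only fields of the form name:<digits>kB are memory counters.
    sizes_kb = {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}
    return pid, adj, sizes_kb


def kb_to_gib(kb):
    """Convert kB (as printed by the kernel, 1 kB = 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


pid, adj, sizes_kb = parse_oom_line(OOM_LINE)
print(f"killed pid={pid}, oom_score_adj={adj}")
for name, kb in sizes_kb.items():
    print(f"{name:>10}: {kb_to_gib(kb):8.2f} GiB")
```

The killed PID (3680079) is the same one the Postgres log below reports as the checkpointer. Almost all of its resident memory is shmem-rss (roughly 31 GiB) rather than anon-rss, which would be consistent with shared memory the process had touched, although the shared_buffers setting is not given here.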
The problem is that Postgres does not come back up automatically after the crash. Here is the Postgres log:
2023-06-28 04:18:09.891 UTC [3680076] LOG: checkpointer process (PID 3680079) was terminated by signal 9: Killed
2023-06-28 04:18:09.917 UTC [3680076] LOG: terminating any other active server processes
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard DETAIL: The postmaster has commanded this server process to roll back the cu>
2023-06-28 04:18:09.917 UTC [4109121] ehsan@dashboard HINT: In a moment you should be able to reconnect to the database and repeat>
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard DETAIL: The postmaster has commanded this server process to roll back the cur>
2023-06-28 04:18:09.917 UTC [275401] ehsan@dashboard HINT: In a moment you should be able to reconnect to the database and repeat >
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard DETAIL: The postmaster has commanded this server process to roll back the cur>
2023-06-28 04:18:09.917 UTC [114738] ehsan@dashboard HINT: In a moment you should be able to reconnect to the database and repeat >
2023-06-28 04:18:09.917 UTC [3680082] WARNING: terminating connection because of crash of another server process
2023-06-28 04:18:09.917 UTC [3680082] DETAIL: The postmaster has commanded this server process to roll back the current transactio>
2023-06-28 04:18:09.917 UTC [3680082] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-06-28 04:18:10.162 UTC [1907896] ehsan@dashboard FATAL: the database system is in recovery mode
2023-06-28 04:18:11.191 UTC [3680076] LOG: all server processes terminated; reinitializing
2023-06-28 04:18:17.286 UTC [3680076] LOG: received fast shutdown request
2023-06-28 04:18:17.287 UTC [1907912] LOG: database system was interrupted; last known up at 2023-06-28 04:15:38 UTC
2023-06-28 04:18:17.364 UTC [1907913] ehsan@dashboard FATAL: the database system is shutting down
2023-06-28 04:18:17.365 UTC [1907914] ehsan@dashboard FATAL: the database system is shutting down
2023-06-28 04:18:17.484 UTC [1907912] LOG: database system was not properly shut down; automatic recovery in progress
2023-06-28 04:18:17.514 UTC [1907912] LOG: redo starts at 46C/93DCE1A8
2023-06-28 04:18:17.516 UTC [3680076] LOG: abnormal database system shutdown
2023-06-28 04:18:17.678 UTC [3680076] LOG: database system is shut down
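To make the timeline in this log easier to follow, here is a minimal Python sketch that parses a few of the lines and measures the gap between the postmaster's "reinitializing" message and the "received fast shutdown request" that follows it. The excerpt is copied from the log above; the helper names (parse, first) are made up for illustration. It does not show who sent the shutdown request, only that the request reached the postmaster while crash recovery was just getting started.

```python
import re
from datetime import datetime

# A few lines copied from the Postgres log above.
LOG = """\
2023-06-28 04:18:09.891 UTC [3680076] LOG: checkpointer process (PID 3680079) was terminated by signal 9: Killed
2023-06-28 04:18:11.191 UTC [3680076] LOG: all server processes terminated; reinitializing
2023-06-28 04:18:17.286 UTC [3680076] LOG: received fast shutdown request
2023-06-28 04:18:17.514 UTC [1907912] LOG: redo starts at 46C/93DCE1A8
2023-06-28 04:18:17.516 UTC [3680076] LOG: abnormal database system shutdown
2023-06-28 04:18:17.678 UTC [3680076] LOG: database system is shut down
"""

# timestamp, PID in brackets, optional user@db, severity, message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC \[(\d+)\] "
    r"(?:\S+@\S+ )?(\w+): (.*)$"
)


def parse(log_text):
    """Return a list of (timestamp, pid, severity, message) tuples."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, int(m.group(2)), m.group(3), m.group(4)))
    return events


def first(events, needle):
    """Return the first event whose message contains needle."""
    return next(e for e in events if needle in e[3])


events = parse(LOG)
reinit = first(events, "reinitializing")
shutdown = first(events, "received fast shutdown request")
gap_seconds = (shutdown[0] - reinit[0]).total_seconds()

print(f"reinitializing logged by PID {reinit[1]}")
print(f"fast shutdown request logged by PID {shutdown[1]}")
print(f"gap between the two: {gap_seconds:.3f} s")
```

Both messages come from PID 3680076, the postmaster, about six seconds apart. So the postmaster did begin restarting on its own (redo starts at 04:18:17.514) and was then told to shut down.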
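I checked the restart_after_crash parameter, and its value is on.
Why does Postgres fail to restart automatically after this crash, even though I can start it manually? What configuration am I missing to make Postgres restart after a crash?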
When I start it manually with sudo service postgresql start, this is what gets logged:
2023-06-28 04:26:22.605 UTC [1909598] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86>
2023-06-28 04:26:22.606 UTC [1909598] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-06-28 04:26:22.606 UTC [1909598] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.>
2023-06-28 04:26:22.649 UTC [1909600] LOG: database system was interrupted while in recovery at 20>
2023-06-28 04:26:22.649 UTC [1909600] HINT: This probably means that some data is corrupted and yo>
2023-06-28 04:26:22.653 UTC [1909601] ehsan@dashboard FATAL: the database system is starting up
2023-06-28 04:26:22.653 UTC [1909602] ehsan@dashboard FATAL: the database system is starting up
2023-06-28 04:26:22.762 UTC [1909600] LOG: database system was not properly shut down; automatic r>
2023-06-28 04:26:22.765 UTC [1909600] LOG: redo starts at 46C/93DCE1A8
2023-06-28 04:26:22.841 UTC [1909600] LOG: invalid record length at 46C/94EF5338: wanted 24, got 0
2023-06-28 04:26:22.841 UTC [1909600] LOG: redo done at 46C/94EF5300
2023-06-28 04:26:23.089 UTC [1909598] LOG: database system is ready to accept connections
It seems the DB recovers very quickly.
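To put a number on "quickly", here is a small Python sketch that computes the time from the "starting PostgreSQL" line to "ready to accept connections" and the amount of WAL between the "redo starts at" and "redo done at" positions in the manual-start log above. The helper names are made up for illustration; the LSN arithmetic assumes the usual hi/lo hexadecimal notation, where the part before the slash is the upper 32 bits.

```python
from datetime import datetime


def parse_ts(text):
    """Parse a timestamp in the format used by the Postgres log lines."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_lsn(lsn):
    """Convert an LSN such as '46C/93DCE1A8' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Values copied from the manual-start log.
started = parse_ts("2023-06-28 04:26:22.605")  # "starting PostgreSQL 13.8 ..."
ready = parse_ts("2023-06-28 04:26:23.089")    # "ready to accept connections"
startup_seconds = (ready - started).total_seconds()

redo_start = parse_lsn("46C/93DCE1A8")  # "redo starts at"
redo_done = parse_lsn("46C/94EF5300")   # "redo done at"
replayed_bytes = redo_done - redo_start

print(f"startup took {startup_seconds:.3f} s")
print(f"WAL between redo start and redo done: {replayed_bytes / (1024 * 1024):.1f} MiB")
```

By this measure the whole startup, including crash recovery, took under half a second and covered roughly 17 MiB of WAL.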
Here is the content of the systemd unit file /lib/systemd/system/postgresql@.service:
# systemd service template for PostgreSQL clusters. The actual instances will
# be called "postgresql@version-cluster", e.g. "[email protected]". The
# variable %i expands to "version-cluster", %I expands to "version/cluster".
# (%I breaks for cluster names containing dashes.)
[Unit]
Description=PostgreSQL Cluster %i
AssertPathExists=/etc/postgresql/%I/postgresql.conf
RequiresMountsFor=/etc/postgresql/%I /var/lib/postgresql/%I
PartOf=postgresql.service
ReloadPropagatedFrom=postgresql.service
Before=postgresql.service
# stop server before networking goes down on shutdown
After=network.target
[Service]
Type=forking
# -: ignore startup failure (recovery might take arbitrarily long)
# the actual pg_ctl timeout is configured in pg_ctl.conf
ExecStart=-/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i start
# 0 is the same as infinity, but "infinity" needs systemd 229
TimeoutStartSec=0
ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast %i stop
TimeoutStopSec=1h
ExecReload=/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i reload
PIDFile=/run/postgresql/%i.pid
SyslogIdentifier=postgresql@%i
# prevent OOM killer from choosing the postmaster (individual backends will
# reset the score to 0)
OOMScoreAdjust=-900
# restarting automatically will prevent "pg_ctlcluster ... stop" from working,
# so we disable it here. Also, the postmaster will restart by itself on most
# problems anyway, so it is questionable if one wants to enable external
# automatic restarts.
#Restart=on-failure
# (This should make pg_ctlcluster stop work, but doesn't:)
#RestartPreventExitStatus=SIGINT SIGTERM
[Install]
WantedBy=multi-user.target