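Environment: Postgres version 9.6; a 3-server cluster with Patroni and etcd.
Scenario: When table indexing was started with 16 parallel requests (on a 16-CPU machine), postgres on the master crashed because of the Linux OOM killer. The machine has 124 GB of memory. We understand that spawning that many parallel requests needs more memory, and we have already addressed that.
Problem: What is worrying, however, is that when the master crashed due to OOM, all the replicas crashed as well. This was unexpected and calls the high availability of the cluster into question. We can reproduce this easily, and the replicas behave exactly the same way every time.
Logs on the master at the time of the crash: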
2020-12-16 09:54:44 UTC [11619]: [9-1] user=,db=LOG: checkpointer process (PID 30834) was terminated by signal 9: Killed
2020-12-16 09:54:44 UTC [11619]: [10-1] user=,db=LOG: terminating any other active server processes
2020-12-16 09:54:44 UTC [30838]: [1-1] user=,db=FATAL: archive command was terminated by signal 3: Quit
2020-12-16 09:54:44 UTC [16870]: [1-1] user=postgres,db=mydbWARNING: terminating connection because of crash of another server process
2020-12-16 09:54:44 UTC [16870]: [2-1] user=postgres,db=mydbDETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-12-16 09:54:44 UTC [16870]: [3-1] user=postgres,d=mydbHINT: In a moment you should be able to reconnect to the database and repeat your command.
...
2020-12-16 09:54:59 UTC [24609]: [1-1] user=postgres,db=mydbFATAL: the database system is in recovery mode
...
2020-12-16 09:55:04 UTC [22780]: [4-1] user=,db=LOG: redo done at 52712/BAFFD3C0
2020-12-16 09:55:07 UTC [11619]: [13-1] user=,db=LOG: database system is ready to accept connections
Logs on the replica at the time of the crash:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
...
2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG: restored log file "00000007000526C70000005D" from archive
...
2020-12-16 11:19:18 UTC [12843]: [38972-1] user=,db=LOG: restored log file "0000000700052712000000C1" from archive
2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG: started streaming WAL from primary at 52712/C2000000 on timeline 7
Question: Does a crash of postgres on the master (due to OOM / corrupted shared memory) necessarily cause a similar crash on the replicas? Is there a way around this?
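That is harmless.

That message is sent by the server to the client when a process has crashed and the backend is about to die (quickdie in postgres.c), so that crash recovery can start. What you see is only the message that the primary server sent to the WAL receiver.

Note that it is a WARNING: that is not even an error. It only gets logged because log_min_messages on the standby is set to warning or lower.

The standby server keeps running. As you can see, it catches up from the archive while the primary is recovering. As soon as it has read all the archived files and the primary is up again, it reconnects and continues streaming.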
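
To double-check this on your own logs, here is a minimal Python sketch that separates the messages a postmaster logs when one of its own child processes dies from messages that were merely relayed over a connection. The function name local_crash_lines and the choice of patterns are my own assumptions for illustration; the patterns are taken from the master log above plus the usual "all server processes terminated; reinitializing" line, and they are not an exhaustive list.

```python
import re

# Messages the postmaster logs at LOG level when one of ITS OWN children dies.
# Assumption: these three patterns are enough for a rough check; they are not
# a complete catalogue of PostgreSQL crash messages.
CRASH_PATTERNS = [
    re.compile(r"LOG:\s+.*\(PID \d+\) was terminated by signal \d+"),
    re.compile(r"LOG:\s+terminating any other active server processes"),
    re.compile(r"LOG:\s+all server processes terminated; reinitializing"),
]


def local_crash_lines(log_lines):
    """Return the log lines that point to a crash of the local server."""
    return [
        line
        for line in log_lines
        if any(pattern.search(line) for pattern in CRASH_PATTERNS)
    ]


if __name__ == "__main__":
    # A few lines copied from the replica log in the question.
    replica_sample = [
        "WARNING: terminating connection because of crash of another server process",
        "2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL: could not receive "
        "data from WAL stream: server closed the connection unexpectedly",
    ]
    # An empty result means no sign of a local crash in these lines.
    print(local_crash_lines(replica_sample))
```

Run against the master log, this picks up the checkpointer line and the "terminating any other active server processes" line. Run against the replica log, it finds nothing, which matches the explanation above: the WARNING there came from the primary through the replication connection.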