
Question

rsmoorthy

Asked: 2020-12-17 04:05:55 +0800 CST

Postgres cluster: a crash of the master also crashes the replicas


Environment: Postgres version 9.6, in a 3-server cluster with Patroni and etcd.

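Scenario: When table indexing was started with 16 parallel requests (on a 16-CPU machine), postgres on the master crashed because of the Linux OOM killer. This is a 124 GB machine. We understand that spawning so many parallel requests needs more memory, and we have resolved that.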

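Problem: The worrying part, however, is that when the master crashed because of the OOM, all the replicas crashed as well. This was unexpected and calls the high availability of the cluster into question. We can easily reproduce this, and the replicas behave exactly the same way every time.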

Master's log at the time of the crash:

2020-12-16 09:54:44 UTC [11619]: [9-1] user=,db=LOG:  checkpointer process (PID 30834) was terminated by signal 9: Killed
2020-12-16 09:54:44 UTC [11619]: [10-1] user=,db=LOG:  terminating any other active server processes
2020-12-16 09:54:44 UTC [30838]: [1-1] user=,db=FATAL:  archive command was terminated by signal 3: Quit
2020-12-16 09:54:44 UTC [16870]: [1-1] user=postgres,db=mydbWARNING:  terminating connection because of crash of another server process
2020-12-16 09:54:44 UTC [16870]: [2-1] user=postgres,db=mydbDETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-12-16 09:54:44 UTC [16870]: [3-1] user=postgres,db=mydbHINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
2020-12-16 09:54:59 UTC [24609]: [1-1] user=postgres,db=mydbFATAL:  the database system is in recovery mode
...
2020-12-16 09:55:04 UTC [22780]: [4-1] user=,db=LOG:  redo done at 52712/BAFFD3C0
2020-12-16 09:55:07 UTC [11619]: [13-1] user=,db=LOG:  database system is ready to accept connections

Replica's log at the time of the crash:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
...
2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG:  restored log file "00000007000526C70000005D" from archive
...
2020-12-16 11:19:18 UTC [12843]: [38972-1] user=,db=LOG:  restored log file "0000000700052712000000C1" from archive
2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG:  started streaming WAL from primary at 52712/C2000000 on timeline 7

Question: Does a crash of postgres on the master (due to OOM / corrupted shared memory) necessarily cause a similar crash on the replicas too? Is there a way around this?

Laurenz Albe · Answer 1 · 2020-12-17T04:36:36+08:00

This is harmless.

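That message is sent by the server to the client when a process has crashed and the backend is about to die (quickdie in postgres.c), so that crash recovery can begin.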

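You are only seeing the message that the primary server sent to the WAL receiver. Note the WARNING: this isn't even an error. It only gets logged because log_min_messages on the standby is set to warning or lower.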
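To make the threshold explicit, here is a minimal Python sketch of how log_min_messages decides what reaches the server log. The severity order is the one given in the PostgreSQL documentation for this setting; the function name is_logged is made up for illustration and is not part of PostgreSQL.

```python
# Severity order used by log_min_messages, from least to most severe.
# Note that for this setting LOG ranks above ERROR.
LEVELS = [
    "DEBUG5", "DEBUG4", "DEBUG3", "DEBUG2", "DEBUG1",
    "INFO", "NOTICE", "WARNING", "ERROR", "LOG", "FATAL", "PANIC",
]


def is_logged(message_level: str, log_min_messages: str) -> bool:
    """Return True if a message of message_level passes the threshold.

    is_logged is a hypothetical helper for illustration only.
    """
    rank = LEVELS.index(message_level.upper())
    threshold = LEVELS.index(log_min_messages.upper())
    return rank >= threshold


if __name__ == "__main__":
    # Compare the relayed WARNING with the WAL receiver's FATAL message
    # under a few possible settings on the standby.
    for setting in ("warning", "error", "fatal"):
        print(setting, is_logged("WARNING", setting), is_logged("FATAL", setting))
```

Following the reasoning above, a standby with log_min_messages = error would no longer record the relayed WARNING, while the FATAL line about the lost WAL stream would still appear.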

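The standby server keeps running. As you can see, it catches up from the archive while the primary is recovering. Once it has read all the archived WAL and the primary is up again, it reconnects and resumes streaming.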
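The replica's log lines can be checked against this description with a little arithmetic on the WAL file names. The sketch below is only an illustration: it assumes the default 16 MB WAL segment size, and the helper names parse_wal_name and lsn_to_segment are invented here. Because the log excerpt contains ellipses, the first file shown is not necessarily the first one restored.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments
SEGMENTS_PER_LOGID = 0x100000000 // SEGMENT_SIZE  # 256 for 16 MB segments


def parse_wal_name(name: str) -> tuple[int, int]:
    """Split a 24-hex-digit WAL file name into (timeline, segment number)."""
    if len(name) != 24:
        raise ValueError(f"unexpected WAL file name: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log_id * SEGMENTS_PER_LOGID + segment


def lsn_to_segment(lsn: str) -> int:
    """Map an LSN such as '52712/C2000000' to its segment number."""
    high, low = lsn.split("/")
    return ((int(high, 16) << 32) | int(low, 16)) // SEGMENT_SIZE


if __name__ == "__main__":
    # File names and LSN copied from the replica log in the question.
    _, first = parse_wal_name("00000007000526C70000005D")
    timeline, last = parse_wal_name("0000000700052712000000C1")
    streaming_from = lsn_to_segment("52712/C2000000")

    print("timeline:", timeline)
    print("segments between the two restored files:", last - first)
    print("streaming resumed at the next segment:", streaming_from == last + 1)
```

Under the 16 MB assumption, streaming starts at exactly the segment after the last one restored from the archive, which is consistent with a standby that kept replaying WAL throughout rather than one that crashed and restarted.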
