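One of my PostgreSQL 11 servers is throwing a strange error and refuses to start streaming replication properly after a base backup:
FATAL: terminating walreceiver process due to administrator command
I tried enabling the highest log verbosity (DEBUG5), but that gave no further insight into why the process keeps dying: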
2021-07-29 15:33:30.862 UTC [39315] DEBUG: postgres: PostmasterMain: initial environment dump:
2021-07-29 15:33:30.862 UTC [39315] DEBUG: -----------------------------------------
2021-07-29 15:33:30.862 UTC [39315] DEBUG: PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
2021-07-29 15:33:30.862 UTC [39315] DEBUG: PG_GRANDPARENT_PID=39310
2021-07-29 15:33:30.862 UTC [39315] DEBUG: PGLOCALEDIR=/usr/share/locale
2021-07-29 15:33:30.862 UTC [39315] DEBUG: PGSYSCONFDIR=/etc/postgresql-common
2021-07-29 15:33:30.862 UTC [39315] DEBUG: LANG=en_US.UTF-8
2021-07-29 15:33:30.862 UTC [39315] DEBUG: PWD=/
2021-07-29 15:33:30.863 UTC [39315] DEBUG: PGDATA=/var/lib/postgresql/11/replica
2021-07-29 15:33:30.863 UTC [39315] DEBUG: LC_COLLATE=en_US.UTF-8
2021-07-29 15:33:30.863 UTC [39315] DEBUG: LC_CTYPE=en_US.UTF-8
2021-07-29 15:33:30.863 UTC [39315] DEBUG: LC_MESSAGES=en_US.UTF-8
2021-07-29 15:33:30.863 UTC [39315] DEBUG: LC_MONETARY=C
2021-07-29 15:33:30.863 UTC [39315] DEBUG: LC_NUMERIC=C
2021-07-29 15:33:30.863 UTC [39315] DEBUG: LC_TIME=C
2021-07-29 15:33:30.863 UTC [39315] DEBUG: -----------------------------------------
2021-07-29 15:33:30.867 UTC [39315] DEBUG: registering background worker "logical replication launcher"
2021-07-29 15:33:30.868 UTC [39315] LOG: listening on IPv6 address "::1", port 5433
2021-07-29 15:33:30.868 UTC [39315] LOG: listening on IPv4 address "127.0.0.1", port 5433
2021-07-29 15:33:30.868 UTC [39315] LOG: listening on IPv4 address "*snip*", port 5433
2021-07-29 15:33:30.868 UTC [39315] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2021-07-29 15:33:30.869 UTC [39315] DEBUG: invoking IpcMemoryCreate(size=156459008)
2021-07-29 15:33:30.869 UTC [39315] DEBUG: mmap(157286400) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2021-07-29 15:33:30.885 UTC [39315] DEBUG: SlruScanDirectory invoking callback on pg_notify/0000
2021-07-29 15:33:30.885 UTC [39315] DEBUG: removing file "pg_notify/0000"
2021-07-29 15:33:30.885 UTC [39315] DEBUG: dynamic shared memory system will support 588 segments
2021-07-29 15:33:30.885 UTC [39315] DEBUG: created dynamic shared memory control segment 1597797971 (14128 bytes)
2021-07-29 15:33:30.887 UTC [39315] DEBUG: max_safe_fds = 982, usable_fds = 1000, already_open = 8
2021-07-29 15:33:30.889 UTC [39316] LOG: database system was shut down in recovery at 2021-07-29 15:31:23 UTC
2021-07-29 15:33:30.890 UTC [39316] DEBUG: standby_mode = 'on'
2021-07-29 15:33:30.890 UTC [39316] DEBUG: primary_conninfo = '*snip*'
2021-07-29 15:33:30.890 UTC [39316] DEBUG: recovery_target_timeline = latest
2021-07-29 15:33:30.890 UTC [39316] LOG: entering standby mode
2021-07-29 15:33:30.890 UTC [39316] LOG: invalid resource manager ID 172 at 235/450000D0
2021-07-29 15:33:30.890 UTC [39316] DEBUG: switched WAL source from archive to stream after failure
2021-07-29 15:33:30.891 UTC [39317] DEBUG: find_in_dynamic_libpath: trying "/usr/lib/postgresql/11/lib/libpqwalreceiver"
2021-07-29 15:33:30.905 UTC [39317] DEBUG: find_in_dynamic_libpath: trying "/usr/lib/postgresql/11/lib/libpqwalreceiver.so"
2021-07-29 15:33:30.918 UTC [39317] LOG: started streaming WAL from primary at 235/45000000 on timeline 1
2021-07-29 15:33:30.920 UTC [39317] DEBUG: sendtime 2021-07-29 15:33:30.917704+00 receipttime 2021-07-29 15:33:30.920718+00 replication apply delay (N/A) transfer latency 3 ms
2021-07-29 15:33:30.920 UTC [39317] DEBUG: sending write 235/45020000 flush 0/0 apply 0/0
2021-07-29 15:33:30.921 UTC [39317] DEBUG: sending write 235/45020000 flush 235/45020000 apply 0/0
2021-07-29 15:33:30.921 UTC [39316] LOG: invalid resource manager ID 172 at 235/450000D0
2021-07-29 15:33:30.921 UTC [39317] FATAL: terminating walreceiver process due to administrator command
2021-07-29 15:33:30.921 UTC [39317] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
2021-07-29 15:33:30.921 UTC [39317] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2021-07-29 15:33:30.921 UTC [39317] DEBUG: proc_exit(1): 2 callbacks to make
2021-07-29 15:33:30.921 UTC [39317] DEBUG: exit(1)
2021-07-29 15:33:30.921 UTC [39316] DEBUG: switched WAL source from stream to archive after failure
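The only entries that stand out are the LOG-level messages, such as
LOG: invalid resource manager ID 172 at 235/450000D0
However, it turns out these are just Postgres's way of saying "reached the end of the valid WAL structure", and as LOG-level messages they can be safely ignored.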
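I tried deleting the existing WAL files (from datadir/pg_wal/), thinking a file might be corrupted, but the server still would not start replication. The only solution was to take a brand-new base backup.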
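My question: is "terminating xyz process due to administrator command" the default message Postgres gives when one of its processes dies in a non-standard way?
Are there any further debugging options in a case like this? Even the highest log verbosity did not provide much useful information.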
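What are the full PostgreSQL versions on the primary and on the replica?
This looks like it is probably corruption, and about all you can do is take a new base backup (and investigate your hardware to see whether you can find the cause of the corruption).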
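I assume you removed the WAL from the replica, not from the primary. So it would re-fetch the same WAL files and find that they were still corrupted. It doesn't have to be that way, but presumably it was, or you wouldn't be asking the question. If the corruption had been a network glitch that hit the file in transit, the new copy might have been good, but apparently the corruption is present in the original WAL file on the primary, so no matter how many times you read it, it is still corrupt.
Also, if the corruption happened only on disk, when the original WAL file was written out, then the archive might not be corrupted. If the archive command that copies the WAL file to the archive took the data from the file system cache in RAM, the archived copy would not be corrupted. So either the corruption happened in RAM itself, or the archive command had to read the (by then corrupted) data from disk in order to copy it to the archive. Or it happened even earlier, as with a software bug.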
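One way to narrow down which of these scenarios applies is to compare the copies of the affected segment byte for byte. The following is only a minimal sketch: it assumes you have already collected the copies from the primary's pg_wal, the archive, and the replica onto one machine, that the archived copy is stored uncompressed, and that the segment is complete (a segment still being written will legitimately differ). The function names are made up for illustration.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(paths):
    """Map each distinct digest to the list of files that share it."""
    groups = {}
    for path in paths:
        groups.setdefault(sha256_of(path), []).append(str(path))
    return groups


if __name__ == "__main__":
    # Usage (hypothetical paths): python compare_wal.py primary/SEG archive/SEG replica/SEG
    for digest, files in group_by_digest(sys.argv[1:]).items():
        print(digest, files)
```

If all copies land in the same group, the damage was already present when the file was first read on the primary. If the archive copy differs from the one in the primary's pg_wal, that points toward the on-disk scenario described above.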
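There is no additional information to be had. The system cannot work out what it is looking at, so all it can do is spit out the garbage it found and leave us to speculate. But yes, it did order the WAL receiver to exit: there is no point in letting it keep running, because it cannot replay past the point of corruption. So it switches back to the WAL archive to see whether the file there is corrupted as well.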
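If you want to look at the record yourself (for example with pg_waldump), you first need to know which segment file holds 235/450000D0. On the primary, the built-in pg_walfile_name() function gives you that; the sketch below does the same arithmetic offline. It assumes the default 16 MB segment size and timeline 1 (as shown in your log), and the helper names are my own.

```python
DEFAULT_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default WAL segment size


def parse_lsn(lsn):
    """Convert an LSN such as '235/450000D0' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(lsn, timeline=1, segment_size=DEFAULT_SEGMENT_SIZE):
    """Return the 24-character WAL segment file name containing the LSN."""
    segno = parse_lsn(lsn) // segment_size
    segments_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


def offset_in_segment(lsn, segment_size=DEFAULT_SEGMENT_SIZE):
    """Return the byte offset of the LSN inside its segment file."""
    return parse_lsn(lsn) % segment_size


if __name__ == "__main__":
    print(wal_segment_name("235/450000D0"))   # 000000010000023500000045
    print(offset_in_segment("235/450000D0"))  # 208
```

So the record the startup process chokes on sits 208 bytes into segment 000000010000023500000045, which is the file worth comparing between the primary, the archive, and the replica.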