Until a few days ago, logical replication from PG 15.3 to PG 15.3 had been running without problems.
Now the subscriber logs repeated messages such as:
2023-07-29 08:25:04.523 UTC [26] LOG: checkpoint complete: wrote 8692 buffers (53.1%); 0 WAL file(s) added, 1 removed, 14 recycled; write=269.921 s, sync=0.485 s, total=270.438 s; sync files=37, longest=0.224 s, average=0.014 s; distance=230568 kB, estimate=436766 kB
2023-07-29 08:25:34.550 UTC [26] LOG: checkpoint starting: time
2023-07-29 08:27:55.699 UTC [142] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2023-07-29 08:27:55.702 UTC [159] LOG: logical replication apply worker for subscription "<SUB_NAME>" has started
2023-07-29 08:27:55.706 UTC [1] LOG: background worker "logical replication worker" (PID 142) exited with exit code 1
The publisher logs repeated messages such as:
2023-07-29 08:24:50.341 UTC [530982] STATEMENT: START_REPLICATION SLOT "<SUB NAME>" LOGICAL 37D1/1E0DD9A0 (proto_version '3', publication_names '"<PUB NAME>"')
2023-07-29 08:27:36.956 UTC [530982] LOG: terminating walsender process due to replication timeout
2023-07-29 08:27:36.956 UTC [530982] CONTEXT: slot "<SUB NAME>", output plugin "pgoutput", in the change callback, associated LSN 37D0/F9E8C2E8
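As a side note, the two LSNs in the publisher log can be compared directly. Below is a minimal Python sketch, assuming the usual `HI/LO` hexadecimal LSN notation (the high 32 bits before the slash, the low 32 bits after it); `parse_lsn` is a hypothetical helper, not part of PostgreSQL. It shows that the LSN reported in the change callback lies behind the LSN requested in `START_REPLICATION`. That might mean the walsender was still decoding an older, possibly large, transaction when the timeout fired, but that is only an interpretation and the logs do not confirm it.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'HI/LO' (both hex) into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# LSN requested by the subscriber in START_REPLICATION (from the publisher log).
start_lsn = parse_lsn("37D1/1E0DD9A0")
# LSN associated with the change callback when the walsender timed out.
callback_lsn = parse_lsn("37D0/F9E8C2E8")

gap_bytes = start_lsn - callback_lsn
print(f"callback LSN is {gap_bytes} bytes (~{gap_bytes / 1024**2:.0f} MiB) behind the start LSN")
```

On the server itself the same difference can be obtained with the built-in `pg_wal_lsn_diff()` function.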
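I can connect with psql from either node back to the other. As far as I know, no changes have been made to routing or firewalls.
Other similar reports suggest that dropping and recreating the subscription would fix the problem, but I would like to understand it and, if possible, avoid doing that.
Any suggestions on how to track this down would be greatly appreciated.
Edit: Further investigation shows that the "logical replication worker" is pegged at 100% CPU, but pg_stat_activity shows no current command for it.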