I have a 3-node Patroni 2.1.1 + PostgreSQL 13.4 cluster, and one of the nodes has been acting strangely ever since it finished its pg_basebackup bootstrap. I have another identical node that works fine when put through the same process.
Basically, as soon as Patroni tries to connect to localhost:5432, the server seems to die and re-fork (as far as I understand it). In the log below you can see the "loop" starting at connection received: host=::1 port=XXXXX
2021-11-01 12:01:43.395 EDT [242925] LOG: starting PostgreSQL 13.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2021-11-01 12:01:43.396 EDT [242925] LOG: listening on IPv6 address "::1", port 5432
2021-11-01 12:01:43.396 EDT [242925] LOG: listening on IPv4 address "127.0.0.1", port 5432
2021-11-01 12:01:43.397 EDT [242925] LOG: listening on IPv4 address "192.168.23.13", port 5432
2021-11-01 12:01:43.398 EDT [242925] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-11-01 12:01:43.401 EDT [242925] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-11-01 12:01:43.405 EDT [242928] LOG: database system was interrupted; last known up at 2021-11-01 12:01:25 EDT
2021-11-01 12:01:43.405 EDT [242928] DEBUG: removing all temporary WAL segments
2021-11-01 12:01:43.436 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/00000077.history pg_wal/RECOVERYHISTORY"
cp: cannot stat '/var/lib/pgsql/13/archive/00000077.history': No such file or directory
2021-11-01 12:01:43.440 EDT [242928] DEBUG: could not restore file "00000077.history" from archive: child process exited with exit code 1
2021-11-01 12:01:43.441 EDT [242928] LOG: entering standby mode
2021-11-01 12:01:43.441 EDT [242928] DEBUG: backup time 2021-11-01 12:01:25 EDT in file "backup_label"
2021-11-01 12:01:43.441 EDT [242928] DEBUG: backup label pg_basebackup base backup in file "backup_label"
2021-11-01 12:01:43.441 EDT [242928] DEBUG: backup timeline 118 in file "backup_label"
2021-11-01 12:01:43.441 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/00000076.history pg_wal/RECOVERYHISTORY"
2021-11-01 12:01:43.538 EDT [242928] LOG: restored log file "00000076.history" from archive
2021-11-01 12:01:43.540 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/000000760000000100000000 pg_wal/RECOVERYXLOG"
2021-11-01 12:01:43.689 EDT [242928] LOG: restored log file "000000760000000100000000" from archive
2021-11-01 12:01:43.692 EDT [242928] DEBUG: got WAL segment from archive
2021-11-01 12:01:43.693 EDT [242928] DEBUG: checkpoint record is at 1/60
2021-11-01 12:01:43.693 EDT [242928] DEBUG: redo record is at 1/28; shutdown false
2021-11-01 12:01:43.693 EDT [242928] DEBUG: next transaction ID: 531; next OID: 24606
2021-11-01 12:01:43.693 EDT [242928] DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
2021-11-01 12:01:43.693 EDT [242928] DEBUG: oldest unfrozen transaction ID: 478, in database 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: oldest MultiXactId: 1, in database 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: commit timestamp Xid oldest/newest: 0/0
2021-11-01 12:01:43.693 EDT [242928] DEBUG: transaction ID wrap limit is 2147484125, limited by database with OID 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: starting up replication slots
2021-11-01 12:01:43.693 EDT [242928] DEBUG: starting up replication origin progress state
2021-11-01 12:01:43.697 EDT [242928] DEBUG: resetting unlogged relations: cleanup 1 init 0
2021-11-01 12:01:43.698 EDT [242928] DEBUG: initializing for hot standby
2021-11-01 12:01:43.698 EDT [242928] DEBUG: my backend ID is 1
2021-11-01 12:01:43.698 EDT [242928] LOG: redo starts at 1/28
2021-11-01 12:01:43.698 EDT [242928] DEBUG: prune KnownAssignedXids to 531
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/28 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: 0 KnownAssignedXids (num=0 tail=0 head=0)
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/28 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: recovery snapshots are now enabled
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/28 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: prune KnownAssignedXids to 531
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/D8 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: end of backup reached
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/110 for XLOG/BACKUP_END: 1/28
2021-11-01 12:01:43.699 EDT [242932] DEBUG: checkpointer updated shared memory configuration values
2021-11-01 12:01:43.700 EDT [242928] LOG: consistent recovery state reached at 1/138
2021-11-01 12:01:43.700 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/000000760000000100000001 pg_wal/RECOVERYXLOG"
2021-11-01 12:01:43.700 EDT [242925] LOG: database system is ready to accept read only connections
cp: cannot stat '/var/lib/pgsql/13/archive/000000760000000100000001': No such file or directory
2021-11-01 12:01:43.704 EDT [242928] DEBUG: could not restore file "000000760000000100000001" from archive: child process exited with exit code 1
2021-11-01 12:01:43.704 EDT [242928] DEBUG: could not open file "pg_wal/000000760000000100000001": No such file or directory
2021-11-01 12:01:43.704 EDT [242928] DEBUG: switched WAL source from archive to stream after failure
2021-11-01 12:01:43.705 EDT [242936] DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-13/lib/libpqwalreceiver"
2021-11-01 12:01:43.706 EDT [242936] DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-13/lib/libpqwalreceiver.so"
2021-11-01 12:01:44.370 EDT [242925] DEBUG: forked new backend, pid=242938 socket=12
2021-11-01 12:01:44.371 EDT [242938] LOG: connection received: host=::1 port=36714
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: exit(0)
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:01:51.236 EDT [242925] DEBUG: forked new backend, pid=242941 socket=12
2021-11-01 12:01:51.237 EDT [242941] LOG: connection received: host=127.0.0.1 port=43428
2021-11-01 12:01:51.237 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:01:51.237 EDT [242925] DEBUG: server process (PID 242938) exited with exit code 0
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: exit(0)
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:01:58.285 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:01:58.285 EDT [242925] DEBUG: server process (PID 242941) exited with exit code 0
2021-11-01 12:01:59.295 EDT [242925] DEBUG: forked new backend, pid=242944 socket=12
2021-11-01 12:01:59.295 EDT [242944] LOG: connection received: host=::1 port=36758
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: exit(0)
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:06.301 EDT [242925] DEBUG: forked new backend, pid=242948 socket=12
2021-11-01 12:02:06.302 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:06.302 EDT [242925] DEBUG: server process (PID 242944) exited with exit code 0
2021-11-01 12:02:06.302 EDT [242948] LOG: connection received: host=127.0.0.1 port=43470
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: exit(0)
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:13.441 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:13.441 EDT [242925] DEBUG: server process (PID 242948) exited with exit code 0
2021-11-01 12:02:14.148 EDT [242925] DEBUG: postmaster received signal 2
2021-11-01 12:02:14.148 EDT [242925] LOG: received fast shutdown request
2021-11-01 12:02:14.150 EDT [242925] LOG: aborting any active transactions
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(0): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(0): 5 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: proc_exit(0): 2 callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: exit(0)
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(1): 7 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: removing all KnownAssignedXids
2021-11-01 12:02:14.150 EDT [242928] DEBUG: release all standby locks
2021-11-01 12:02:14.150 EDT [242928] DEBUG: proc_exit(1): 2 callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: exit(1)
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:14.151 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:14.151 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:14.449 EDT [242925] DEBUG: forked new backend, pid=242954 socket=12
2021-11-01 12:02:14.450 EDT [242954] LOG: connection received: host=::1 port=36798
2021-11-01 12:02:15.915 EDT [242936] FATAL: terminating walreceiver process due to administrator command
2021-11-01 12:02:15.915 EDT [242936] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:15.915 EDT [242936] DEBUG: shmem_exit(1): 6 on_shmem_exit callbacks to make
2021-11-01 12:02:15.915 EDT [242936] DEBUG: proc_exit(1): 2 callbacks to make
2021-11-01 12:02:15.915 EDT [242936] DEBUG: exit(1)
2021-11-01 12:02:15.916 EDT [242936] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:15.916 EDT [242936] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:15.916 EDT [242936] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:15.917 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:15.917 EDT [242932] LOG: shutting down
2021-11-01 12:02:15.917 EDT [242932] DEBUG: skipping restartpoint, already performed at 1/28
2021-11-01 12:02:15.919 EDT [242932] DEBUG: updated min recovery point to 1/1000000 on timeline 118
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(0): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(0): 5 on_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: proc_exit(0): 2 callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: exit(0)
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:15.921 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:15.921 EDT [242934] DEBUG: writing stats file "pg_stat/global.stat"
2021-11-01 12:02:15.921 EDT [242934] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:15.921 EDT [242934] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:15.921 EDT [242934] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:15.922 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:17.890 EDT [242925] DEBUG: postmaster received signal 2
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: exit(0)
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:18.618 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:18.618 EDT [242925] DEBUG: server process (PID 242954) exited with exit code 0
2021-11-01 12:02:18.618 EDT [242925] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.618 EDT [242925] DEBUG: shmem_exit(0): 5 on_shmem_exit callbacks to make
2021-11-01 12:02:18.618 EDT [242925] DEBUG: cleaning up dynamic shared memory control segment with ID 2104592565
2021-11-01 12:02:18.631 EDT [242925] DEBUG: proc_exit(0): 2 callbacks to make
2021-11-01 12:02:18.631 EDT [242925] LOG: database system is shut down
2021-11-01 12:02:18.631 EDT [242925] DEBUG: exit(0)
2021-11-01 12:02:18.631 EDT [242925] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.631 EDT [242925] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.631 EDT [242925] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: logger shutting down
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: proc_exit(0): 0 callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: exit(0)
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: proc_exit(-1): 0 callbacks to make
If I stop Patroni, the server starts working fine and joins replication as expected. The postgres user can connect locally or remotely with psql, or even with pgAdmin.
Could someone explain what is going on here? I am still learning PostgreSQL and am not yet comfortable with these logs... Thanks!
For more context, you can also see Patroni issue #2104, which I filed.
OK, I finally found the cause! Running strace on Patroni, I found the following loop:
It turns out a stale Kerberos principal for the postgres user was still cached, and after a kdestroy -A we were back in business.
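For anyone who hits the same symptom, this is roughly the check-and-clear sequence, run as the OS user Patroni runs under. It assumes MIT Kerberos client tools are installed and that Patroni runs as a systemd unit named patroni (adjust the unit name to your setup); it is a sketch of the fix, not an exact transcript of what I ran.

```shell
# Inspect the credential cache: look for a principal that does not
# match what pg_hba.conf / your GSSAPI setup expects for postgres
klist

# Destroy ALL cached credentials, including the stale principal
kdestroy -A

# Restart Patroni so it re-authenticates with fresh credentials
# (unit name "patroni" is an assumption; adjust for your install)
sudo systemctl restart patroni
```

After this, the local connection attempts from Patroni to port 5432 should stop dying, and the node should join replication normally.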