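We are running Postgres 9.6 with a database of 10+ TB. Backups are taken with a home-made tool, "pgrsync", which uses S3 as its repository. Both the backup files and the WAL archives are stored on S3.
Issue: when trying to restore, WAL archive recovery fails for some backups, seemingly at random.
In this case the backup start location is 00000002000544C60000006B and the stop location is 00000002000545210000008D (based on the pg_stop_backup() output), but recovery stops in between, at 005493D, and terminates the restore. If I redo the restore, it stops at exactly the same point. A few more backups show similar results, while others restore successfully.
This likely indicates that some specific WAL files were corrupted during backup or restore. Is that the correct interpretation?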
Questions:
- Is there any way to identify whether a WAL file is corrupted?
- Is there a way to move forward without data loss? (I am wary of using pg_resetxlog.)
First, the backup history file (from 00000002000544C60000006B.0001E9A0.backup):
START WAL LOCATION: 544C6/6B01E9A0 (file 00000002000544C60000006B)
STOP WAL LOCATION: 54521/8D235490 (file 00000002000545210000008D)
CHECKPOINT LOCATION: 544C6/BA84E0A8
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2020-11-04 03:24:14 UTC
LABEL: inc04nov
STOP TIME: 2020-11-04 09:21:38 UTC
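The file names in the backup history file follow directly from the LSNs. As a rough sketch, assuming the default 16 MB WAL segment size (256 segments per "log" number), the helpers below map an LSN to its segment file name and count how many segments recovery has to replay between start and stop. The function names are illustrative and not part of any PostgreSQL tool.

```python
SEGMENT_SIZE = 16 * 1024 * 1024                  # assumption: default 16 MB segments
SEGMENTS_PER_LOG = 0x100000000 // SEGMENT_SIZE   # 256 with the default size


def wal_file_name(timeline: int, lsn: str) -> str:
    """Return the name of the WAL segment file that contains the given LSN."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    return "%08X%08X%08X" % (timeline, high, low // SEGMENT_SIZE)


def segment_number(file_name: str) -> int:
    """Return the absolute segment number encoded in a 24-character WAL file name."""
    log = int(file_name[8:16], 16)
    seg = int(file_name[16:24], 16)
    return log * SEGMENTS_PER_LOG + seg


def segments_between(start_file: str, stop_file: str) -> int:
    """Count the segments from start_file to stop_file, both inclusive."""
    return segment_number(stop_file) - segment_number(start_file) + 1


# Values taken from the backup history file above (timeline 2).
start = wal_file_name(2, "544C6/6B01E9A0")
stop = wal_file_name(2, "54521/8D235490")
print(start, stop, segments_between(start, stop))
```

With the values above this yields the same two file names that the history file reports, and a span of 23331 segments that all have to be available in the archive.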
Then the log file:
2020-11-11 06:38:08 UTC [21731]: [22299-1] user=,db=LOG: restored log file "000000020005451D00000080" from archive
2020-11-11 06:38:08 UTC [21731]: [22300-1] user=,db=LOG: restored log file "000000020005451D00000081" from archive
2020-11-11 06:38:08 UTC [21731]: [22301-1] user=,db=LOG: restored log file "000000020005451D00000082" from archive
2020-11-11 06:38:08 UTC [21731]: [22302-1] user=,db=LOG: restored log file "000000020005451D00000083" from archive
2020-11-11 06:38:08 UTC [21731]: [22303-1] user=,db=LOG: restored log file "000000020005451D00000084" from archive
2020-11-11 06:38:08 UTC [21731]: [22304-1] user=,db=LOG: restored log file "000000020005451D00000085" from archive
2020-11-11 06:38:08 UTC [21731]: [22305-1] user=,db=LOG: restored log file "000000020005451D00000086" from archive
2020-11-11 06:38:08 UTC [21731]: [22306-1] user=,db=LOG: restored log file "000000020005451D00000087" from archive
2020-11-11 06:38:08 UTC [21731]: [22307-1] user=,db=LOG: restored log file "000000020005451D00000088" from archive
2020-11-11 06:38:08 UTC [21731]: [22308-1] user=,db=LOG: restored log file "000000020005451D00000089" from archive
2020-11-11 06:38:08 UTC [21731]: [22309-1] user=,db=LOG: restored log file "000000020005451D0000008A" from archive
2020-11-11 06:38:08 UTC [21731]: [22310-1] user=,db=LOG: restored log file "000000020005451D0000008B" from archive
2020-11-11 06:38:08 UTC [21731]: [22311-1] user=,db=LOG: redo done at 5451D/8AFFE500
2020-11-11 06:38:08 UTC [21731]: [22312-1] user=,db=LOG: last completed transaction was at log time 2020-11-04 09:10:42.219935+00
2020-11-11 06:38:11 UTC [21731]: [22314-1] user=,db=FATAL: WAL ends before end of online backup
2020-11-11 06:38:11 UTC [21731]: [22315-1] user=,db=HINT: All WAL generated while online backup was taken must be available at recovery.
2020-11-11 06:38:13 UTC [21728]: [3-1] user=,db=LOG: startup process (PID 21731) exited with exit code 1
2020-11-11 06:38:13 UTC [21728]: [4-1] user=,db=LOG: terminating any other active server processes
2020-11-11 06:38:16 UTC [4559]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode
2020-11-11 06:38:16 UTC [4561]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode
2020-11-11 06:38:16 UTC [4576]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode
2020-11-11 06:38:25 UTC [21728]: [5-1] user=,db=LOG: database system is shut down
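To see how far recovery got relative to the stop segment, the server log can be scanned for "restored log file" lines. This is only a sketch: it assumes 16 MB segments, ignores the timeline part of the file name when looking for gaps, and its function names are made up for illustration.

```python
import re

RESTORED = re.compile(r'restored log file "([0-9A-F]{24})" from archive')


def restored_segments(log_text: str) -> list:
    """Return WAL segment names in the order the server log reports restoring them."""
    return RESTORED.findall(log_text)


def seg_no(name: str) -> int:
    """Absolute segment number, assuming 16 MB segments (256 per log number)."""
    return int(name[8:16], 16) * 256 + int(name[16:24], 16)


def recovery_summary(log_text: str, stop_file: str) -> dict:
    """Summarise how far recovery got relative to the backup's stop segment."""
    names = restored_segments(log_text)
    # A gap is any place where two consecutive restored segments are not adjacent.
    gaps = [(a, b) for a, b in zip(names, names[1:]) if seg_no(b) != seg_no(a) + 1]
    last = names[-1] if names else None
    return {
        "last_restored": last,
        "gaps": gaps,
        "segments_short_of_stop": seg_no(stop_file) - seg_no(last) if last else None,
        "reached_stop": bool(last) and seg_no(last) >= seg_no(stop_file),
    }


# Three lines copied from the log above.
sample = '''
2020-11-11 06:38:08 UTC [21731]: [22309-1] user=,db=LOG: restored log file "000000020005451D0000008A" from archive
2020-11-11 06:38:08 UTC [21731]: [22310-1] user=,db=LOG: restored log file "000000020005451D0000008B" from archive
2020-11-11 06:38:11 UTC [21731]: [22314-1] user=,db=FATAL: WAL ends before end of online backup
'''
print(recovery_summary(sample, "00000002000545210000008D"))
```

For these lines the last restored segment is 000000020005451D0000008B, which is 1026 segments short of the stop segment 00000002000545210000008D.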
It looks like the WAL file may be corrupted.
This is a valid WAL file that was restored normally:
-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008A | head -2
rmgr: Heap len (rec/tot): 151/ 151, tx: 3501354263, lsn: 5451D/8A0001D8, prev 5451D/89FFE1E8, desc: INSERT off 15, blkref #0: rel 3435996123/765803221/4171942326 blk 15806513
rmgr: Btree len (rec/tot): 72/ 72, tx: 3501354263, lsn: 5451D/8A000270, prev 5451D/8A0001D8, desc: INSERT_LEAF off 2, blkref #0: rel 3435996123/765803221/4171944289 blk 3881149
And this is the WAL file at which archive recovery actually stopped:
-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008B
pg_xlogdump: FATAL: could not find a valid record after 5451D/8B000000
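On the question of how to spot a damaged segment: pg_xlogdump, as used above, remains the authoritative check. A cheaper pre-check might look like the sketch below. It assumes 16 MB segments and a little-endian server, and it relies on the first 16 bytes of each WAL page holding the magic number, info flags, timeline, and page address (XLogPageHeaderData). It does not validate record CRCs, and check_segment and sha256_of are hypothetical helper names.

```python
import hashlib
import struct

SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def check_segment(path: str, file_name: str, segment_size: int = SEGMENT_SIZE) -> list:
    """Return a list of problems found in a WAL segment file (empty if none found)."""
    problems = []
    with open(path, "rb") as f:
        data = f.read()
    if len(data) != segment_size:
        problems.append("unexpected size: %d bytes" % len(data))
    if len(data) < 16:
        return problems + ["too short to contain a page header"]
    if data.count(0) == len(data):
        return problems + ["file contains only zero bytes"]
    # First 16 bytes of the page header: magic, info, timeline, page address.
    # '<' assumes the file was written on a little-endian machine (e.g. x86-64).
    # The header timeline is not compared: it may legitimately be lower than
    # the one in the file name right after a timeline switch.
    _magic, _info, _tli, pageaddr = struct.unpack_from("<HHIQ", data, 0)
    log = int(file_name[8:16], 16)
    seg = int(file_name[16:24], 16)
    expected = (log << 32) | (seg * segment_size)
    if pageaddr != expected:
        problems.append(
            "first page address %X/%08X does not match the file name"
            % (pageaddr >> 32, pageaddr & 0xFFFFFFFF)
        )
    return problems


def sha256_of(path: str) -> str:
    """Checksum for comparing two copies of the same segment (e.g. archive vs. restored)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A call such as check_segment("000000020005451D0000008B", "000000020005451D0000008B") returns an empty list when size and first page address look plausible; anything else is a reason to fetch the segment from S3 again and compare checksums of the two copies.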
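The only explanation for that is that the recovery process never saw the BACKUP_END WAL record, that is, it never read the WAL segment that contains the effects of the pg_stop_backup call. You argue convincingly that you did run that function, because otherwise you would not have the backup_label file, which that function generates in a non-exclusive backup. Archive recovery does not let you skip WAL segments, so it is impossible that recovery skipped over that segment.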
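That leaves a few explanations:
- You used the backup_label file from a different backup because something got mixed up.
- You restored WAL segments with the same names from a different cluster, and those segments do not contain the BACKUP_END record.
- You got the timelines confused: there was a timeline switch during the backup, so that BACKUP_END is actually in 00000003000545210000008D or similar. (I am not sure whether that is possible or whether a timeline switch would break an online backup; I did not test it.)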
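The timeline possibility can be checked against a listing of the archive. The sketch below groups WAL file names by segment position and reports any position that exists on more than one timeline, together with any .history files. The listing shown is hypothetical; in practice it would come from the S3 archive location, and the helper names are made up for illustration.

```python
import re
from collections import defaultdict

WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{16})$")
HISTORY_NAME = re.compile(r"^[0-9A-F]{8}\.history$")


def timelines_per_segment(archive_names):
    """Map each segment position (last 16 hex digits) to the sorted timelines it exists on."""
    found = defaultdict(set)
    for name in archive_names:
        m = WAL_NAME.match(name)
        if m:
            found[m.group(2)].add(int(m.group(1), 16))
    return {pos: sorted(tls) for pos, tls in found.items()}


def timeline_findings(archive_names):
    """Return the history files present and the segments archived on several timelines."""
    history = sorted(n for n in archive_names if HISTORY_NAME.match(n))
    per_segment = timelines_per_segment(archive_names)
    shared = {pos: tls for pos, tls in per_segment.items() if len(tls) > 1}
    return history, shared


# Hypothetical archive listing, for illustration only.
listing = [
    "00000002000545210000008C",
    "00000002000545210000008D",
    "00000003.history",
    "00000003000545210000008D",
]
print(timeline_findings(listing))
```

If the real archive contains no timeline 3 history file and no segment on more than one timeline, the timeline explanation can probably be ruled out.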
If everything is as you expect, then 00000002000545210000008D must contain a BACKUP_END record. Verify that. After processing this record, PostgreSQL will emit the log line