在http://www.postgresql.org/docs/9.1/static/continuous-archiving.html之后,我将 WAL 档案复制到另一台机器上并定期应用它们。最近由于我的主服务器上的 pg 服务器崩溃,我尝试通过推送触发器(一个空文件,期待智能故障转移)来恢复数据库,日志如下所示。
prodrestore_error.log
WAL file not present yet. Checking for trigger file...
trigger file found: smart failover
Trigger file: /app/recovery-prod/trigger/pgsql.trigger.5432
Waiting for WAL file: 0000000100000001000000CC
WAL file path: /backup/prod/db_backup/archive/0000000100000001000000CC
Restoring to: pg_xlog/RECOVERYXLOG
Sleep interval: 60 seconds
Max wait interval: 0 forever
Command for restore: cp "/backup/prod/db_backup/archive/0000000100000001000000CC" "pg_xlog/RECOVERYXLOG"
Keep archive history: 000000000000000000000000 and later
trigger file found: smart failover
running restore: OK
Trigger file: /app/recovery-prod/trigger/pgsql.trigger.5432
Waiting for WAL file: 00000002.history
WAL file path: /backup/prod/db_backup/archive/00000002.history
Restoring to: pg_xlog/RECOVERYHISTORY
Sleep interval: 60 seconds
Max wait interval: 0 forever
Command for restore: cp "/backup/prod/db_backup/archive/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history: 000000000000000000000000 and later
running restore: cp: cannot stat `/backup/prod/db_backup/archive/00000002.history': No such file or directory
cp: cannot stat `/backup/prod/db_backup/archive/00000002.history': No such file or directory
cp: cannot stat `/backup/prod/db_backup/archive/00000002.history': No such file or directory
cp: cannot stat `/backup/prod/db_backup/archive/00000002.history': No such file or directory
not restored
history file not found
Trigger file: /app/recovery-prod/trigger/pgsql.trigger.5432
Waiting for WAL file: 00000001.history
WAL file path: /backup/prod/db_backup/archive/00000001.history
Restoring to: pg_xlog/RECOVERYHISTORY
Sleep interval: 60 seconds
Max wait interval: 0 forever
Command for restore: cp "/backup/prod/db_backup/archive/00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history: 000000000000000000000000 and later
running restore: cp: cannot stat `/backup/prod/db_backup/archive/00000001.history': No such file or directory
cp: cannot stat `/backup/prod/db_backup/archive/00000001.history': No such file or directory
cp: cannot stat `/backup/prod/db_backup/archive/00000001.history': No such file or directory
cp: cannot stat `/backup/prod/db_backup/archive/00000001.history': No such file or directory
not restored
history file not found
postgresql-9.1-main.log
2015-08-03 08:08:44 IST::@:[5542]:LOG: restored log file "0000000100000001000000CC" from archive
2015-08-03 08:11:44 IST::@:[5542]:LOG: could not open file "pg_xlog/0000000100000001000000CD" (log file 1, segment 205): No such file or directory
2015-08-03 08:11:44 IST::@:[5542]:LOG: redo done at 1/CC001E10
2015-08-03 08:11:44 IST::@:[5542]:LOG: last completed transaction was at log time 2015-08-03 07:59:58.121908+05:30
2015-08-03 08:11:44 IST::@:[5542]:LOG: restored log file "0000000100000001000000CC" from archive
2015-08-03 08:17:44 IST::@:[5542]:LOG: selected new timeline ID: 2
2015-08-03 08:23:45 IST::@:[5542]:LOG: archive recovery complete
2015-08-03 08:23:51 IST::@:[6088]:LOG: autovacuum launcher started
2015-08-03 08:23:51 IST::@:[5541]:LOG: database system is ready to accept connections
整个查找两个历史文件,4次,最后启动数据库的过程00000002.history
大约00000001.history
用了12分钟。然后,通过恢复新的基本备份进行的每次重复迭代都需要更多时间,因为它开始寻找更多 .history 文件。
如何使查找历史文件的速度尽可能快,如果不存在,则快速继续以读/写模式启动数据库?否则,根本不查找历史文件。
这不是历史文件的错。
引用您引用的文档链接:
由于看起来您正在使用 pg_standby,因此您有几个选择。
调整
-r maxretries
和-s sleeptime
。默认情况下
-r maxretries
是 3,因为看起来你的-s sleeptime
设置设置为 60,所以它会在第一次重试时等待 60 秒,然后在第二次重试时等待 120 秒,然后在第三次也是最后一次重试复制命令时等待 180 秒它放弃并将副本服务器带入独立模式。因此,除了重播所有日志之外,在命令尝试复制 .history 文件结束时,您总共要等待 360 秒或 6 分钟,加上恢复所有 WAL 所需的时间智能模式下的细分市场。
执行快速故障转移而不是智能故障转移。
如果您迫切希望快速进行故障转移,并且可以忍受丢失任何未应用的 WAL 段,您可以执行快速故障转移,而不是触发的智能故障转移,如 pg_standby 文档中所述。
智能故障转移与快速故障转移的优点取决于您希望副本服务器运行的速度以及您对丢失事务的容忍度。
快速故障转移会忽略未应用的 WAL 段,因此事务会丢失。在以独立模式启动副本之前,智能故障转移会重播所有可能的 WAL 段。这可能需要很长时间,具体取决于您需要重播多少 WAL 段。
最后,由于您使用的是 9.1,因此您可以使用流复制和
restore_command
您的 arecovery.conf
来保持服务器比标准日志传送更最新,方法是在生成单个 WAL 记录时对其进行流式传输。此外,如果副本无法跟上流负载(由于网络问题或任何数量的事情),它可以交替使用 a
restore_command
从您已经设置的存档中获取 WAL 段。但是,您不能
pg_standby
用于 restore_command。从理论上讲,如果一切顺利,恢复 WAL 记录几乎没有延迟,而不是恢复完整的 WAL 段,您的副本应该更快地以独立模式出现。
可以在此处找到流复制和存档的示例:
http://evol-monkey.blogspot.com/2014/09/offsite-replication-problems-and-how-to.html
监控您的复制延迟也很有用,以确保一切都按照您的要求保持最新状态。示例如下:
http://www.keithf4.com/monitoring_streaming_slave_lag/
希望有帮助。=)