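I'm having trouble getting PostgreSQL back up and running on a CentOS 7 server.
Some context: after noticing that I was quickly running out of storage on the server's /root partition, I did some quick digging and found that the culprit was that none of the files in pg_xlog were being deleted. Since /home had plenty of free space, I chose to move the pg_xlog directory to a new directory on /home, to buy enough time to fix the original problem of the log files not being purged. Then, following some advice I found here, I made the original pg_xlog a symlink pointing to the new directory on the /home partition.
But this led to some strange errors from SELinux, which would not allow Postgres to read from the new directory even after chown-ing both the symlink and the new directory to postgres:postgres.
Since I'm not well versed in SELinux, I decided to copy the log files back to their original directory and look into physically extending the partition instead.
However, after copying all the files back, PostgreSQL would not start. The error messages don't seem very helpful (at least not to me). journalctl -xe reports: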
Apr 19 00:12:29 localhost.localdomain systemd[1]: Unit postgresql.service entered failed state.
Apr 19 00:12:29 localhost.localdomain systemd[1]: postgresql.service failed.
Apr 19 00:12:29 localhost.localdomain polkitd[1105]: Unregistered Authentication Agent for unix-process:17236:1124444 (system bus name :1.588, object path /org/freedesk
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be us
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Sending an e-mail...
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Sending a notification email to: root@localhost
Apr 19 00:12:31 localhost.localdomain abrt-server[17258]: Email was sent to: root@localhost
Apr 19 00:12:31 localhost.localdomain postfix/pickup[10305]: AA0676622C: uid=0 from=<user@localhost>
Apr 19 00:12:31 localhost.localdomain postfix/cleanup[17283]: AA0676622C: message-id=<5ad7b4bf.TazSMT+2CbgyTlql%user@localhost>
Apr 19 00:12:31 localhost.localdomain postfix/qmgr[1963]: AA0676622C: from=<[email protected]>, size=45585, nrcpt=1 (queue active)
Apr 19 00:12:31 localhost.localdomain postfix/local[17285]: AA0676622C: to=<[email protected]>, orig_to=<root@localhost>, relay=local, delay=0.05, delays=0.04/
Apr 19 00:12:31 localhost.localdomain postfix/qmgr[1963]: AA0676622C: removed
systemctl status postgresql says:
● postgresql.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2018-04-19 00:12:29 EAT; 1min 34s ago
Process: 17252 ExecStart=/usr/bin/pg_ctl start -D ${PGDATA} -s -o -p ${PGPORT} -w -t 300 (code=exited, status=1/FAILURE)
Process: 17243 ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)

Apr 19 00:12:28 localhost.localdomain systemd[1]: Starting PostgreSQL database server...
Apr 19 00:12:29 localhost.localdomain systemd[1]: postgresql.service: control process exited, code=exited status=1
Apr 19 00:12:29 localhost.localdomain systemd[1]: Failed to start PostgreSQL database server.
Apr 19 00:12:29 localhost.localdomain systemd[1]: Unit postgresql.service entered failed state.
Apr 19 00:12:29 localhost.localdomain systemd[1]: postgresql.service failed.
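I must admit I'm at my wits' end. Any help, even if it's just a way to generate more useful error messages, would be greatly appreciated.
EDIT:
It seems that while the pg_xlog files were being copied, their ownership changed to root. I have since re-copied the same files with rsync in order to preserve permissions. As suggested in the comments below, I am now including the PostgreSQL log. The error messages in the PostgreSQL log are: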
WARNING: transaction log file "00000001000000470000008D" could not be archived: too many failures
LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: false
LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: false
LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: false
WARNING: transaction log file "00000001000000470000008D" could not be archived: too many failures
LOG: database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
PANIC: could not open file "pg_xlog/000000010000006900000017" (log file 105, segment 23): Permission denied
LOG: startup process (PID 17256) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
LOG: database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
PANIC: could not open file "pg_xlog/000000010000006900000017" (log file 105, segment 23): Permission denied
LOG: startup process (PID 20020) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
LOG: database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
FATAL: the database system is starting up
PANIC: could not open file "pg_xlog/000000010000006900000017" (log file 105, segment 23): Permission denied
LOG: startup process (PID 25607) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
LOG: database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: invalid magic number 0000 in log file 105, segment 23, offset 9617408
LOG: invalid primary checkpoint record
LOG: invalid magic number 0000 in log file 105, segment 23, offset 9601024
LOG: invalid secondary checkpoint record
PANIC: could not locate a valid checkpoint record
FATAL: the database system is starting up
LOG: startup process (PID 28108) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
LOG: database system was interrupted; last known up at 2018-04-18 19:21:53 EAT
LOG: invalid magic number 0000 in log file 105, segment 23, offset 9617408
LOG: invalid primary checkpoint record
LOG: invalid magic number 0000 in log file 105, segment 23, offset 9601024
LOG: invalid secondary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 28529) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
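The "Permission denied" PANIC is consistent with the ownership change described in the edit, although file modes or SELinux labels could produce the same message. As a minimal sketch for checking the ownership part only, the following Python lists entries in a directory whose owner differs from an expected uid. The function name `files_not_owned_by` is made up for illustration, and the path shown in the comment is an assumption about a default CentOS 7 layout, not something taken from the logs.

```python
from pathlib import Path


def files_not_owned_by(directory, expected_uid):
    """Return the sorted names of entries in `directory` whose owner uid
    differs from `expected_uid`."""
    mismatched = []
    for entry in Path(directory).iterdir():
        # lstat() reports the owner of a symlink itself, not of its target.
        if entry.lstat().st_uid != expected_uid:
            mismatched.append(entry.name)
    return sorted(mismatched)


# Hypothetical usage (the path and the "postgres" account are assumptions):
#   import pwd
#   uid = pwd.getpwnam("postgres").pw_uid
#   print(files_not_owned_by("/var/lib/pgsql/data/pg_xlog", uid))
```

An empty list only rules out an ownership mismatch; it says nothing about permission bits or SELinux contexts.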
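This looks bad. It seems the xlog files were somehow zeroed out as you copied them back and forth. You may need to hire a professional services company for PostgreSQL data recovery. You should back up all the data you can still find and put it somewhere it cannot be changed or deleted. That includes both what is on the /root partition and whatever is left on /home, in case something went wrong when copying from /home back to /root.
I believe the problematic xlog file will be 000000xx0000006900000017 (where xx can be any hexadecimal digits). It would be interesting to know whether that file is all zeros or mostly zeros.
Also, do you have the server log from the first time you tried to start the server? Or the first PANIC that appears in the log file?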
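The file name follows from the log: "log file 105, segment 23" is 0x69 and 0x17 in the 24-hex-digit segment name, with the first eight digits being the timeline. A rough sketch of that mapping, plus a way to measure how much of a file is zero bytes, might look like this in Python. Both function names are invented for illustration, and the script only reads the file, so it should be safe to run on a backup copy.

```python
def parse_wal_segment_name(name):
    """Split a 24-hex-digit WAL segment file name into
    (timeline, log, segment) as integers."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log, segment


def zero_fraction(path, chunk_size=1 << 20):
    """Return the fraction of bytes in the file that are 0x00
    (0.0 for an empty file). Reads in chunks to keep memory use flat."""
    total = 0
    zeros = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            zeros += chunk.count(b"\x00")
    return zeros / total if total else 0.0


# The name from the PANIC line maps back to the numbers in the log message.
print(parse_wal_segment_name("000000010000006900000017"))  # (1, 105, 23)
```

A high zero fraction is only a hint: a segment that was not fully written may legitimately contain a long run of zeros, so the result needs to be read alongside the offsets reported in the "invalid magic number" lines.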
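I ended up resorting to pg_resetxlog, as suggested in this StackOverflow post. I should stress that I only did this as a last resort, after spending hours backing everything up to an external disk. I seem to have been lucky, because pg_resetxlog worked flawlessly and the database server is back up and running fine. According to the comments on that SO post, using pg_resetxlog can make things worse. Looking back, I can't help feeling that if I had used rsync rather than plain old cp and mv to move the xlog files, the xlog errors might have been prevented.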
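Whichever tool does the copying, one extra safeguard is to compare the source and destination before deleting anything. The sketch below is only an illustration of that idea, not a claim about what went wrong here: it hashes every regular file in a source directory and reports names that are missing or differ in the destination. The function names are hypothetical, and it assumes the server is stopped so the files are not changing underneath it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_copies(src_dir, dst_dir):
    """Return names of regular files in `src_dir` that are missing from
    `dst_dir` or whose content differs there."""
    problems = []
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file():
            continue  # skip subdirectories such as archive_status
        dst = Path(dst_dir) / src.name
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(src.name)
    return problems
```

An empty result means the file contents match; ownership, permission bits, and SELinux labels would still need to be checked separately.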