We are seeing outages in our postgres environment related to the following error:
Jul 1 00:36:04 test1 postgres[219259]: [770-1] user=,db=,app=client= PANIC: could not write to file "pg_xlog/xlogtemp.219259": No space left on device
Jul 1 00:36:04 test1 postgres[219259]: [770-2] user=,db=,app=client= CONTEXT: writing block 199237 of relation pg_tblspc/16402/PG_9.6_201608131/7358881/41721132
Jul 1 00:36:05 test1 postgres[219252]: [3-1] user=,db=,app=client= LOG: checkpointer process (PID 219259) was terminated by signal 6: Aborted
Jul 1 00:36:05 test1 postgres[219252]: [4-1] user=,db=,app=client= LOG: terminating any other active server processes
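Postgres crashes and then recovers. Looking at the volume that holds pg_xlog, we can see that the drive filled up. It is an 80GB drive that normally sits at about 10% usage, and this happens every night at around the same time. I have tried to find the cause, but nothing in the postgres log files points to the culprit.
We have Datadog monitoring the database server and can see the error when it is raised, but nothing indicates what might be causing it.
Any help is appreciated.
One millisecond later we can see: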
Jul 1 00:36:04 test1 postgres[219259]: [770-1] user=,db=,app=client= PANIC: could not write to file "pg_xlog/xlogtemp.219259": No space left on device
Jul 1 00:36:04 test1 postgres[219259]: [770-2] user=,db=,app=client= CONTEXT: writing block 199237 of relation pg_tblspc/16402/PG_9.6_201608131/7358881/41721132
Jul 1 00:36:05 test1 postgres[219252]: [3-1] user=,db=,app=client= LOG: checkpointer process (PID 219259) was terminated by signal 6: Aborted
Jul 1 00:36:05 test1 postgres[219252]: [4-1] user=,db=,app=client= LOG: terminating any other active server processes
Jul 1 00:36:05 test1 postgres[110539]: [5-1] user=postgres,db=product,app=psqlclient=[local] WARNING: terminating connection because of crash of another server process
Jul 1 00:36:05 test1 postgres[110539]: [5-2] user=postgres,db=product,app=psqlclient=[local] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Jul 1 00:36:05 test1 postgres[110539]: [5-3] user=postgres,db=product,app=psqlclient=[local] HINT: In a moment you should be able to reconnect to the database and repeat your command.
Jul 1 00:36:05 test1 postgres[110539]: [5-4] user=postgres,db=product,app=psqlclient=[local] CONTEXT: SQL statement "INSERT INTO AGGREGATES.agg_item_part_count
Jul 1 00:36:05 test1 postgres[110539]: [5-5] #011#011#011SELECT b.item_id,
Jul 1 00:36:05 test1 postgres[110539]: [5-6] #011#011 count(bp.item_id) as item_parts
Jul 1 00:36:05 test1 postgres[110539]: [5-7] #011 #011FROM item.item b
Jul 1 00:36:05 test1 postgres[110539]: [5-8] #011#011LEFT JOIN item.item_part bp USING (item_id)
Jul 1 00:36:05 test1 postgres[110539]: [5-9] #011 #011WHERE b.item_id >= starting_item
Jul 1 00:36:05 test1 postgres[110539]: [5-10] #011#011 GROUP by b.item_id
Jul 1 00:36:05 test1 postgres[110539]: [5-11] #011#011ON CONFLICT (item_id) DO
Jul 1 00:36:05 test1 postgres[110539]: [5-12] #011#011#011UPDATE SET
Jul 1 00:36:05 test1 postgres[110539]: [5-13] #011#011#011item_parts = EXCLUDED.item_parts"
Jul 1 00:36:05 test1 postgres[110539]: [5-14] #011PL/pgSQL function etl.update_bpart_agg() line 39 at SQL statement
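You only have 1-second resolution in your logs (you should change %t to %m in log_line_prefix), so these are from the next second. But we don't know how close to the second boundary you were, and these log messages are what you would expect to see.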
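As a rough sketch of that change: the prefix below is only inferred from the log lines in the question (user=, db=, app=, client=), so treat it as an assumption and check the real value first. The timestamps shown in the question appear to come from syslog, so adding %m puts a millisecond timestamp inside the message itself.

```sql
-- Check the prefix currently in effect before changing anything.
SHOW log_line_prefix;

-- ASSUMPTION: the existing prefix is roughly 'user=%u,db=%d,app=%aclient=%h '
-- (guessed from the log lines in the question). %m is the timestamp with
-- milliseconds; %t is the timestamp without them.
ALTER SYSTEM SET log_line_prefix = '%m user=%u,db=%d,app=%aclient=%h ';

-- log_line_prefix only needs a configuration reload, not a restart.
SELECT pg_reload_conf();
```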
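It looks like the INSERT...SELECT statement may be the culprit filling the WAL directory. If the SELECT returns a lot of rows, it makes sense that the INSERT would generate a lot of WAL quickly. We don't know why it filled up completely: maybe you are archiving and the archive command cannot keep up, or you have a replication slot and the replica cannot keep up, or maybe you have none of those and it is just the checkpoints that cannot keep up.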
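To narrow down which of these three cases applies, queries along the following lines may help. This is a sketch using the 9.6 function names (pg_current_xlog_location and pg_xlog_location_diff were renamed in version 10), run on the primary as a superuser. Which settings matter on your system is an assumption on my part.

```sql
-- 1. Archiving: a growing failed_count, or a last_archived_wal that lags
--    far behind the current segment, suggests the archive command is not
--    keeping up.
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- 2. Replication slots: WAL from restart_lsn onward cannot be removed,
--    so an inactive or lagging slot pins WAL on disk.
SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(
           pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;

-- 3. Settings that control how much WAL is kept between checkpoints.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('archive_mode', 'archive_command',
               'max_wal_size', 'min_wal_size',
               'checkpoint_timeout', 'checkpoint_completion_target',
               'wal_keep_segments');
```

Running these while the nightly job is active, rather than after the crash, is more likely to show what is holding WAL back.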
If your temporary files are written to the same partition as the WAL files, they may also be contributing to filling up the partition.
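One way to check whether temporary files are a factor, sketched under the assumption that you can run these as a superuser. Whether the temporary file location actually shares the 80GB drive with pg_xlog has to be verified at the filesystem level, which these queries cannot tell you.

```sql
-- An empty value means temporary files go under the database's default
-- tablespace (a pgsql_tmp directory) instead of a dedicated tablespace.
SHOW temp_tablespaces;

-- Cumulative temporary file usage per database since the last stats reset.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
ORDER BY temp_bytes DESC;

-- Log every temporary file with its size (0 = no minimum size), so the
-- nightly job's temp usage shows up in the server log.
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();
```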