我在具有 512 MB RAM 和 8 GB 磁盘的设备上使用 Debian 9。我已经安装了带有扩展 TimescaleDB 1.7.1 的 PostgreSQL 9.6.22。PostgreSQL内存相关参数如下:
shared_buffers = 128MB # min 128kB
#temp_buffers = 8MB # min 800kB
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
dynamic_shared_memory_type = posix # the default is the first option
# use none to disable dynamic shared memory
max_connections = 13 # (change requires restart)
#superuser_reserved_connections = 3 # (change requires restart)
最近OOM杀手出现了,看来PostgreSQL是占内存最多的:
postgres invoked oom-killer: gfp_mask=0x26084c0, order=0, oom_score_adj=0
postgres cpuset=/ mems_allowed=0
CPU: 0 PID: 11632 Comm: postgres Not tainted 4.4.0-cip #1
Hardware name: Generic AM33XX (Flattened Device Tree)
[<c00133d0>] (unwind_backtrace) from [<c0011cdc>] (show_stack+0x10/0x14)
[<c0011cdc>] (show_stack) from [<c00dca94>] (dump_header+0x4c/0x180)
[<c00dca94>] (dump_header) from [<c00a38e4>] (oom_kill_process+0x6c/0x39c)
[<c00a38e4>] (oom_kill_process) from [<c00a3e54>] (out_of_memory+0x1d8/0x2fc)
[<c00a3e54>] (out_of_memory) from [<c00a7ecc>] (__alloc_pages_nodemask+0x830/0x89c)
[<c00a7ecc>] (__alloc_pages_nodemask) from [<c00c012c>] (__pte_alloc+0x20/0x1b0)
[<c00c012c>] (__pte_alloc) from [<c00c209c>] (handle_mm_fault+0x224/0xc80)
[<c00c209c>] (handle_mm_fault) from [<c04be4b0>] (do_page_fault+0x20c/0x35c)
[<c04be4b0>] (do_page_fault) from [<c000923c>] (do_DataAbort+0x38/0xb8)
[<c000923c>] (do_DataAbort) from [<c04bdebc>] (__dabt_usr+0x3c/0x40)
Exception stack(0xdb95bfb0 to 0xdb95bff8)
bfa0: aaf9a000 00000000 00101002 aaf9a008
bfc0: b6a5d7a4 00101000 00098dd6 b6a5d7d4 00100008 b6a5d000 000001ff b6a4185c
bfe0: 000000c0 bed9f978 b6a052d7 b69c444c 40070030 ffffffff
Mem-Info:
active_anon:113550 inactive_anon:651 isolated_anon:0
active_file:69 inactive_file:126 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:1250 slab_unreclaimable:2180
mapped:34529 shmem:35279 pagetables: free:8083 free_pcp:0 free_cma:2861
Normal free:32332kB min:16384kB low:20480kB high:24576kB active_anon:454200kB inactive_anon:2604kB active_file:276kB inactive_file:504kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:511816kB mlocked:0kB dirty:0kB writeback:0kB mapped:138116kB shmem:141116kB slab_reclaimable:5000kB slab_unreclaimable:8720kB kernel_stack:1136kB pagetables:4776kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:11444kB writeback_tmp:0kB pages_scanned:4960 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Normal: 989*4kB (UMEHC) 397*8kB (UMEH) 215*16kB (UMEHC) 138*32kB (UMEHC) 67*64kB (UEH) 12*128kB (UEC) 1*256kB (U) 0*512kB 1*1024kB (C) 1*2048kB (C) 2*4096kB (C) 0*8192kB = 32332kB
35474 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
131072 pages RAM
0 pages HighMem/MovableOnly
3118 pages reserved
4096 pages cma reserved
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 220] 0 220 36222 113 64 0 0 0 systemd-journal
[ 247] 0 247 3154 103 7 0 0 -1000 systemd-udevd
[ 352] 1000 352 4778 31 6 0 0 0 custom-process
[ 353] 1000 353 6572 35 5 0 0 0 custom-process2
[ 356] 1000 356 4778 46 5 0 0 0 custom-process3
[ 358] 0 358 1163 55 5 0 0 0 cron
[ 368] 1000 368 4778 32 7 0 0 0 custom-process4
[ 377] 107 377 1332 104 6 0 0 -900 dbus-daemon
[ 385] 0 385 1795 212 7 0 0 0 openvpn
[ 389] 1000 389 5477 55 7 0 0 0 custom-process5
[ 405] 0 405 1524 105 5 0 0 0 systemd-logind
[ 408] 1000 408 7084 40 8 0 0 0 custom-process6
[ 412] 0 412 1100 54 6 0 0 0 cgmanager
[ 416] 1000 416 6572 41 6 0 0 0 custom-process7
[ 422] 0 422 2488 33 3 0 0 0 custom-process8
[ 526] 109 526 47120 2045 22 0 0 -900 postgres
[ 530] 109 530 47145 32270 84 0 0 0 postgres
[ 531] 109 531 47120 538 18 0 0 0 postgres
[ 532] 109 532 47120 1343 19 0 0 0 postgres
[ 533] 109 533 47206 542 21 0 0 0 postgres
[ 534] 109 534 12158 325 16 0 0 0 postgres
[ 535] 109 535 47187 500 20 0 0 0 postgres
[ 537] 109 537 96668 51046 124 0 0 0 postgres
[ 564] 0 564 1315 174 6 0 0 0 dhclient
[ 641] 0 641 2125 134 7 0 0 -1000 sshd
[ 643] 0 643 923 29 5 0 0 0 agetty
[ 646] 0 646 868 29 5 0 0 0 agetty
[ 675] 105 675 1993 408 7 0 0 0 ntpd
[29365] 0 29365 10566 333 15 0 0 0 packagekitd
[29371] 0 29371 9412 435 14 0 0 0 polkitd
[29655] 0 29655 10825 268 13 0 0 0 nginx
[29656] 33 29656 10876 319 13 0 0 0 nginx
[19701] 1000 19701 42241 7379 114 0 0 0 node
[11625] 109 11625 47909 17171 85 0 0 0 postgres
[11629] 109 11629 47775 15420 84 0 0 0 postgres
[11631] 109 11631 50664 6973 58 0 0 0 postgres
[11632] 109 11632 50987 6972 58 0 0 0 postgres
[11633] 109 11633 50986 6982 58 0 0 0 postgres
[11634] 109 11634 50985 6986 58 0 0 0 postgres
[11635] 109 11635 50987 6981 58 0 0 0 postgres
[11636] 109 11636 50987 6981 58 0 0 0 postgres
Out of memory: Kill process 537 (postgres) score 399 or sacrifice child
Killed process 537 (postgres) total-vm:386672kB, anon-rss:19679
这是当时的 PostgreSQL 日志:
[526] LOG: worker process: TimescaleDB Background Worker Scheduler (PID 537) was terminated by signal 9: Killed
[526] LOG: terminating any other active server processes
[11629] postgres@db WARNING: terminating connection because of crash of another server process
[11629] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11629] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[11625] postgres@db WARNING: terminating connection because of crash of another server process
[11625] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11625] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[533] WARNING: terminating connection because of crash of another server process
[533] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
[533] HINT: In a moment you should be able to reconnect to the database and repeat your command.
[11632] postgres@db WARNING: terminating connection because of crash of another server process
[11632] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11632] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[11635] postgres@db WARNING: terminating connection because of crash of another server process
[11635] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11635] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[11631] postgres@db WARNING: terminating connection because of crash of another server process
[11631] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11631] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[11634] postgres@db WARNING: terminating connection because of crash of another server process
[11634] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11634] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[11633] postgres@db WARNING: terminating connection because of crash of another server process
[11633] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11633] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[11636] postgres@db WARNING: terminating connection because of crash of another server process
[11636] postgres@db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
[11636] postgres@db HINT: In a moment you should be able to reconnect to the database and repeat your command.
[526] LOG: all server processes terminated; reinitializing
[11637] LOG: database system was interrupted; last known up at 2022-09-12 21:47:45 CEST
[11637] LOG: database system was not properly shut down; automatic recovery in progress
[11637] LOG: redo starts at 1C/32C6E978
[11637] LOG: invalid record length at 1C/32D041F8: wanted 24, got 0
[11637] LOG: redo done at 1C/32D041D0
[11637] LOG: last completed transaction was at log time 2022-09-12 21:50:23.799706+02
[11637] LOG: MultiXact member wraparound protections are now enabled
[11641] LOG: autovacuum launcher started
[526] LOG: database system is ready to accept connections
[11643] LOG: TimescaleDB background worker launcher connected to shared catalogs
我知道有过度使用调整可用,并且我的应用程序可能存在调用 PostgreSQL 的潜在问题,但我想了解 PostgreSQL 对内存的使用。我试图了解 OOM Killer 输出和 PostgreSQL 内存相关参数,但我有几个疑问:
- PostgreSQl 的最大进程数是多少?我认为它对应于最大连接数(
max_connections + superuser_reserved_connections
)。但是,在这种情况下,这个限制加起来是 13 个,而有 16 个 postgres 进程。 - 根据此处和此处描述的 OOM 杀手输出的解释,实际使用的 RAM 是
rss
值的总和。但是,在这种情况下,总和为 187604 页≈730 MB,这超过了设备中的总 RAM。 - PostgreSQL 可以使用的最大内存是多少?根据这个答案,
max RAM = shared_buffers + (temp_buffers + work_mem) * max_connections
。在这种情况下,128 + (8+4)*10 = 248MB。但是,OOM Killer 输出与此假设不匹配。postgres 进程的总 rss 约为 690MB,比它应该的要多得多,甚至比设备中的总 RAM 还要多。 - 我应该考虑其他任何 PostgreSQL 参数吗?
除了“用户”后端之外,PostgreSQL 还使用多个后台进程。诸如 wal writer、background writer、checkpointer、archiver、WAL senders、autovacuum workers 等。这些(可能)不是问题,它们只是使报告复杂化。如果进程名称包含该信息会很好,它在 top 和 ps 中包含该信息(或可以,取决于配置),但显然不在 OOM 报告中。查看 db 服务器日志文件以查看被杀死的进程实际上是什么。它不会是一份完整的报告,但比您目前所知道的要好。
大部分 RSS 可能是每个进程都接触过的 shared_buffers 部分,这与所有其他进程也接触过的部分相同。所以它是一遍又一遍地计算的相同内存。仅仅总结那一栏并不能给你真正的答案。
如果您关闭了 overcommit,您将在 db 服务器的日志文件中获得一份关于进程是什么、内存不足时它在做什么以及内存去向的很好的报告。但是由于您将内核配置为只是随机核对事物,因此您不会明白这一点。不再存在的进程无法生成错误报告。
如果您安装了窗口系统,则无法关闭过度提交,因为它们会疯狂地过度分配他们永远不会尝试使用的 RAM。但它看起来不像你正在运行其中之一,所以可能没有理由不按照文档所说的去做。