After running for about 18 hours, the system is using roughly 10GB of memory, which triggers the OOM killer when we run our daily task:
# free -h
total used free shared buffers cached
Mem: 14G 9.4G 5.3G 400K 27M 59M
-/+ buffers/cache: 9.3G 5.4G
Swap: 0B 0B 0B
# cat /proc/meminfo
MemTotal: 15400928 kB
MemFree: 5567028 kB
Buffers: 28464 kB
Cached: 60816 kB
SwapCached: 0 kB
Active: 321464 kB
Inactive: 59156 kB
Active(anon): 291464 kB
Inactive(anon): 316 kB
Active(file): 30000 kB
Inactive(file): 58840 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 40 kB
Writeback: 0 kB
AnonPages: 291380 kB
Mapped: 14356 kB
Shmem: 400 kB
Slab: 364596 kB
SReclaimable: 18856 kB
SUnreclaim: 345740 kB
KernelStack: 1832 kB
PageTables: 3720 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7700464 kB
Committed_AS: 313224 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 35976 kB
VmallocChunk: 34359678732 kB
HardwareCorrupted: 0 kB
AnonHugePages: 231424 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 9598976 kB
DirectMap2M: 6260736 kB
However, the processes don't appear to be using much memory:
# top -o %MEM -n 1
top - 15:07:00 up 18:28, 1 user, load average: 0.00, 0.01, 0.05
Tasks: 155 total, 1 running, 154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 23.7 us, 4.8 sy, 0.0 ni, 71.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 15400928 total, 9838560 used, 5562368 free, 29764 buffers
KiB Swap: 0 total, 0 used, 0 free. 62760 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1333 root 20 0 5763204 274132 5352 S 0.0 1.8 7:00.19 java
1466 newrelic 20 0 251484 4884 2056 S 0.0 0.0 0:56.41 nrsysmond
16804 root 20 0 105636 4212 3224 S 0.0 0.0 0:00.00 sshd
16876 root 20 0 21420 3908 1764 S 0.0 0.0 0:00.03 bash
16858 ubuntu 20 0 21456 3828 1684 S 0.0 0.0 0:00.05 bash
770 root 20 0 10216 2868 576 S 0.0 0.0 0:00.02 dhclient
1 root 20 0 33700 2216 624 S 0.0 0.0 0:35.50 init
16875 root 20 0 63664 2084 1612 S 0.0 0.0 0:00.00 sudo
16857 ubuntu 20 0 105636 1860 880 S 0.0 0.0 0:00.01 sshd
16920 root 20 0 23688 1528 1064 R 0.0 0.0 0:00.00 top
16803 postfix 20 0 27400 1492 1216 S 0.0 0.0 0:00.00 pickup
976 root 20 0 43444 1100 748 S 0.0 0.0 0:00.00 systemd-logind
572 root 20 0 51480 1048 308 S 0.0 0.0 0:00.53 systemd-udevd
1840 ntp 20 0 31448 1044 448 S 0.0 0.0 0:02.94 ntpd
990 syslog 20 0 255836 924 76 S 0.0 0.0 0:00.13 rsyslogd
1167 root 20 0 61372 828 148 S 0.0 0.0 0:00.00 sshd
945 message+ 20 0 39212 788 416 S 0.0 0.0 0:00.12 dbus-daemon
1323 root 20 0 20692 676 0 S 0.0 0.0 0:40.92 wrapper
1230 root 20 0 19320 588 244 S 0.0 0.0 0:04.57 irqbalance
1538 root 20 0 25336 500 188 S 0.0 0.0 0:00.18 master
567 root 20 0 19604 480 96 S 0.0 0.0 0:00.34 upstart-udev-br
1175 root 20 0 23648 404 156 S 0.0 0.0 0:00.08 cron
1005 root 20 0 15272 348 88 S 0.0 0.0 0:00.08 upstart-file-br
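As a cross-check on the top listing above, the resident set sizes of all processes can be summed directly from /proc. This is only a rough sketch: the helper names rss_kb and total_rss_kb are made up for illustration, and summing VmRSS counts shared pages more than once, so the result is an upper bound on userspace usage rather than an exact figure.

```python
import os
import re

# Matches a line such as "VmRSS:      4884 kB" in /proc/<pid>/status.
VMRSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def rss_kb(status_text):
    """Return VmRSS in kB from /proc/<pid>/status text, or 0 if the field is absent."""
    match = VMRSS_RE.search(status_text)
    return int(match.group(1)) if match else 0


def total_rss_kb(proc_root="/proc"):
    """Sum VmRSS over every numeric directory under proc_root."""
    if not os.path.isdir(proc_root):
        return 0
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                total += rss_kb(fh.read())
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return total


if __name__ == "__main__":
    print(f"sum of VmRSS: {total_rss_kb()} kB")
```

If this sum is in the hundreds of megabytes while "used" is in the gigabytes, as in the output above, the missing memory is unlikely to belong to any process.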
The tmpfs and shared-memory file systems are essentially empty:
# df -h
Filesystem Size Used Avail Use% Mounted on
udev 7.4G 12K 7.4G 1% /dev
tmpfs 1.5G 384K 1.5G 1% /run
/dev/xvda1 9.8G 6.7G 2.7G 72% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 7.4G 0 7.4G 0% /run/shm
none 100M 0 100M 0% /run/user
/dev/xvda15 104M 4.7M 99M 5% /boot/efi
/dev/xvdb 64G 1.1G 60G 2% /mnt
smem says the memory is being used by the kernel:
# smem -tw
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 9525544 92468 9433076
userspace memory 311064 15648 295416
free memory 5564320 5564320 0
----------------------------------------------------------
15400928 5672436 9728492
But slabtop doesn't help:
# slabtop -o -s c
Active / Total Objects (% used) : 2915263 / 2937006 (99.3%)
Active / Total Slabs (% used) : 60745 / 60745 (100.0%)
Active / Total Caches (% used) : 68 / 103 (66.0%)
Active / Total Size (% used) : 356086.71K / 360884.30K (98.7%)
Minimum / Average / Maximum Object : 0.01K / 0.12K / 14.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
2226784 2226784 100% 0.07K 39764 56 159056K Acpi-ParseExt
273408 272598 99% 0.25K 8544 32 68352K kmalloc-256
8568 8560 99% 4.00K 1071 8 34272K kmalloc-4096
52320 52320 100% 0.50K 1635 32 26160K kmalloc-512
1988 1975 99% 8.00K 497 4 15904K kmalloc-8192
58044 53370 91% 0.19K 2764 21 11056K kmalloc-192
150016 141356 94% 0.06K 2344 64 9376K kmalloc-64
5016 3504 69% 0.96K 152 33 4864K ext4_inode_cache
7280 6834 93% 0.57K 260 28 4160K inode_cache
20265 20067 99% 0.19K 965 21 3860K dentry
1760 1721 97% 2.00K 110 16 3520K kmalloc-2048
19800 19800 100% 0.11K 550 36 2200K sysfs_dir_cache
2112 1966 93% 1.00K 66 32 2112K kmalloc-1024
305 260 85% 6.00K 61 5 1952K task_struct
14616 14242 97% 0.09K 348 42 1392K kmalloc-96
2125 2092 98% 0.63K 85 25 1360K proc_inode_cache
2324 2324 100% 0.55K 83 28 1328K radix_tree_node
9828 9828 100% 0.10K 252 39 1008K buffer_head
1400 1400 100% 0.62K 56 25 896K sock_inode_cache
54 39 72% 12.00K 27 2 864K nvidia_stack_cache
975 975 100% 0.81K 25 39 800K task_xstate
690 515 74% 1.06K 23 30 736K signal_cache
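To make the gap concrete, the figures from the /proc/meminfo output above can be tallied. This is a back-of-the-envelope sketch, not an exact accounting: the choice of fields in ACCOUNTED_FIELDS is an assumption about which counters cover the usual consumers, and the function name unaccounted_kb is hypothetical.

```python
# Values in kB, copied from the /proc/meminfo output posted above.
MEMINFO_KB = {
    "MemTotal": 15400928,
    "MemFree": 5567028,
    "Buffers": 28464,
    "Cached": 60816,
    "AnonPages": 291380,
    "Slab": 364596,
    "KernelStack": 1832,
    "PageTables": 3720,
}

# Assumption: these counters are the main visible consumers and barely overlap.
ACCOUNTED_FIELDS = (
    "MemFree", "Buffers", "Cached", "AnonPages",
    "Slab", "KernelStack", "PageTables",
)


def unaccounted_kb(meminfo):
    """Return MemTotal minus the sum of the counters in ACCOUNTED_FIELDS."""
    accounted = sum(meminfo[name] for name in ACCOUNTED_FIELDS)
    return meminfo["MemTotal"] - accounted


if __name__ == "__main__":
    gap = unaccounted_kb(MEMINFO_KB)
    print(f"unaccounted: {gap} kB ({gap / 1024 ** 2:.2f} GiB)")
```

With the numbers above this leaves 9083092 kB, about 8.7 GiB, that none of these counters (slab included) explains, which would be consistent with memory allocated directly by the kernel or a driver.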
So far, the only way I've been able to fix this is to reboot. Where is the 10GB of memory hiding?
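I'm running a machine with 32GB of RAM, and the notable difference is the DirectMap4k value compared with yours; that might be a starting point. Googling suggests this value can be affected by how the host allocates memory to a VPS... are you running this machine on a virtual server?
It may be that the host server doesn't have enough RAM and is skewing /proc/meminfo. Also, I would paste the output of smem -tw, since that could determine whether the memory leak is in the kernel or in an application.
smem helped me trace the problem to the kernel, and I believe the NVIDIA driver is the culprit. Things look fine after upgrading to 367.35. Reference: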