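Note:
The original post has been truncated to fit the spirit of a StackExchange post (a question with answers). However, the journal I kept while describing my process is still valuable:
I have archived the last "journal state" of this post on my blog: https://erotemic.wordpress.com/2021/10/01/debugging-unexpected-system-shutdown-initial-archive/ and I will post future updates there. I will edit this Super User question down to the core question and a description of the experiments done to address it.
In the meantime, I have removed everything except my latest update. I expect this whole post will be thoroughly restructured.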
Symptoms
When running a custom PyTorch script, my machine experiences a hard shutdown.
I recorded three videos demonstrating the issue:
https://www.youtube.com/watch?v=Ue4XHcusqto
https://www.youtube.com/watch?v=LPwaI1SRlXk
https://www.youtube.com/watch?v=yQ7i-8Kp6xg
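Summary of debugging steps and results
- Measured wattage when the shutdown occurs; it is within limits. The likelihood that wattage is the problem is greatly reduced.
- Measured thermals when the shutdown occurs; they are within the CPU / GPU limits with no serious anomalies. The likelihood that thermals are the problem is greatly reduced.
- Ran MemTest86+ overnight: all tests passed. Bad RAM is effectively ruled out.
- Swapped the 1600W PSU for a 1000W PSU of the same model. The shutdown still occurs. A bad PSU is effectively ruled out.
- Ran with only the 1080ti, in PCIe slot #1 and in slot #3; the shutdown still occurs in both cases. The likelihood that the 3090 is bad is greatly reduced.
- Ran with only the 3090 connected in PCIe slot #1; the shutdown still occurs. The likelihood that the 1080ti is bad is greatly reduced.
- Ran different ML scripts; no shutdown occurred. The likelihood that my custom ML script contains the problem increased.
- Ran heavy stress tests on the CPU / GPU; the shutdown seems to occur only with my custom ML script.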
Culprits effectively ruled out
- Thermals
- Wattage
- PSU
- GPU
- RAM
Potential culprits and TODO:
- Bisect the custom ML script to find an MWE that causes the shutdown
- Motherboard issue?
- CPU issue?
- Storage issue? (unlikely)
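Bisecting a script whose failure mode is a hard power-off is awkward, because nothing gets a chance to report where it died. One possible aid, sketched below, is to write a "breadcrumb" to disk and fsync it before each stage, then read back the last breadcrumb after the next boot. The function names and the log file name are hypothetical and not part of the original script, and fsync only asks the OS to flush, so a drive with a volatile write cache could in principle still lose the last line.

```python
import os
import time


def breadcrumb(path, message):
    """Append a timestamped marker and push it to disk before returning."""
    with open(path, "a") as f:
        f.write(f"{time.time():.3f} {message}\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to write the data to the device


def last_breadcrumb(path):
    """Return the message of the last marker, or None if none is readable."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        lines = [ln for ln in f.read().splitlines() if ln.strip()]
    if not lines:
        return None
    parts = lines[-1].split(" ", 1)
    # A torn final line (power lost mid-write) may have no message part.
    return parts[1] if len(parts) == 2 else None


if __name__ == "__main__":
    log = "breadcrumbs.log"  # hypothetical file name
    print("last stage before previous exit:", last_breadcrumb(log))
    breadcrumb(log, "loading data")
    breadcrumb(log, "forward pass")
```

Because CUDA work is queued asynchronously, a breadcrumb written just before a GPU call records what was submitted, not necessarily what was executing at the moment of the power-off; calling torch.cuda.synchronize() before each breadcrumb would tighten that, at some cost in speed.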
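Potential solution!?
I have updated my blog with more information. The gist is that I found the BIOS setting "ASUS MultiCore Enhancement: Auto", and setting it to "Disabled" seems to fix the problem. I ran experiments for over 14 hours without a power-off.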
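Parts of the original post: to be reorganized
I am trying to debug frequent unexpected system shutdowns. They sometimes happen while the machine is under load, but I cannot make them happen reliably. My current hypotheses are:
- Drawing too much power from the wall
- Thermal issues
- An undiscovered hardware problem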
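Hardware + Software + Workload
A list of the hardware in my machine can be found here: https://pcpartpicker.com/user/erotemic/saved/#view=WKpmD3
The relevant bits are:
- CPU: Intel i9-11900K with a Noctua NH-D15 air cooler
- GPU0: RTX 3090 (connected to the monitors)
- GPU1: GTX 1080ti
- PSU: EVGA T2 1600 W 80+ Titanium
I am running stock Ubuntu 21.04.
I stress the machine with a few different workloads:
- ethermine - uses both GPUs.
- BOINC - running climateprediction.net and World Community Grid (set to use 90% of the CPU whenever the machine is not in use)
- A custom machine learning workflow using PyTorch.
I have not used ethermine recently; I have been running my ML workload.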
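Power hypothesis
I measured the system's wattage: it draws about 700-800 watts as measured by a P3 Kill A Watt meter (this includes the monitors and everything else plugged into the surge protector). I live in an old US building that was converted into apartments, so I cannot be 100% sure of the circuit's capacity, but assuming everything is up to code (which I do not believe it is), the circuit should be able to handle 1800 watts. The other electronics in the room are a 10-watt lamp and a 989-watt AC unit, so the total is right up against the 1800-watt limit. At first I was sure this had to be the culprit, but one night when the weather was cool I started a job and unplugged the AC, and in the morning the machine was powered off, so this hypothesis no longer explains the symptoms well.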
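The arithmetic behind that estimate can be written out as a small sketch. The wattages are the figures quoted above; treating the 1800-watt limit as a 15 A breaker at 120 V is an assumption about the building, not something that was measured.

```python
# Assumed circuit: 15 A breaker on a 120 V line (15 * 120 = 1800 W).
CIRCUIT_AMPS = 15
LINE_VOLTS = 120

# Loads quoted in the post; the computer uses the top of the 700-800 W range.
loads_watts = {
    "computer + monitors": 800,
    "lamp": 10,
    "air conditioner": 989,
}


def power_budget(loads, amps=CIRCUIT_AMPS, volts=LINE_VOLTS):
    """Return (circuit capacity, total load, remaining headroom) in watts."""
    capacity = amps * volts
    total = sum(loads.values())
    return capacity, total, capacity - total


capacity, total, headroom = power_budget(loads_watts)
print(f"with AC:    total={total} W, headroom={headroom} W")

without_ac = {k: v for k, v in loads_watts.items() if k != "air conditioner"}
capacity, total, headroom = power_budget(without_ac)
print(f"without AC: total={total} W, headroom={headroom} W")
```

With the AC unplugged the load is less than half of the assumed capacity, which is why a shutdown on that cool night counts against the wall-power hypothesis.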
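In addition, I thought my cheap "Quirky Pivot Power" surge protector might be a problem, so I ordered a Tripp Lite ISOBAR6ULTRA in the hope that it is of higher quality. It has not arrived yet, and I do not think the surge protector is the problem.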
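Thermal hypothesis
I am currently leaning toward a thermal issue, but when I search the logs I do not see anything related to a thermal shutdown.
I have been monitoring temperatures with psensor and dumping the log to disk every 300 seconds (so the recorded temperatures may not include the peak that caused the shutdown).
I plotted the temperatures recorded around the most recent shutdown, which occurred at about 3:00 AM on August 18, 2021.
Note that I was deliberately not using the RTX 3090 here in order to prevent this kind of problem, but it seems that even running only the 1080ti triggers whatever causes the shutdown.
The highest CPU temperature recorded here is 93C, but I have seen logged temperatures as high as 99C, and "sensors" reports a critical temperature of 100C. So, given that the CPU temperature was rising before the shutdown and the logging interval is 5 minutes, it is quite possible that the system reached the critical temperature and shut down before the next log entry was written.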
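One way to close that 5-minute blind spot would be a logger that samples every second and syncs each line to disk. The sketch below is not psensor; it swaps in a direct read of the Linux sysfs thermal zones (files under /sys/class/thermal/thermal_zone*/temp, which report millidegrees Celsius). Which zone corresponds to the CPU package varies by machine, and the output path and function names here are hypothetical.

```python
import glob
import os
import time


def read_thermal_zones(root="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every readable thermal zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(root, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # zone missing or unreadable; skip it
        readings[os.path.basename(zone)] = millideg / 1000.0
    return readings


def log_temperatures(out_path, interval_s=1.0, samples=None,
                     root="/sys/class/thermal"):
    """Append one line per sample, fsyncing so it survives a power-off.

    Runs forever when samples is None.
    """
    count = 0
    with open(out_path, "a") as out:
        while samples is None or count < samples:
            temps = read_thermal_zones(root)
            fields = " ".join(f"{k}={v:.1f}" for k, v in temps.items())
            out.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}\n")
            out.flush()
            os.fsync(out.fileno())
            count += 1
            if samples is None or count < samples:
                time.sleep(interval_s)


if __name__ == "__main__":
    log_temperatures("temps.log", interval_s=1.0)  # hypothetical log path
```

If the last line written before a shutdown still shows temperatures well below the critical value, that would be much stronger evidence against the thermal hypothesis than a 300-second log can provide.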
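But I am still not satisfied with this explanation. First, running journalctl -g 'temperature|critical' -b -2 as suggested in https://unix.stackexchange.com/questions/502226/how-do-you-find-out-if-a-linux-machine-overheated-before-the-previous-boot-and-w gave no indication that the system logged a temperature problem.
The output of journalctl -b -1 is: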
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 71
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 72
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 19
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 74
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 23
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 77
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 27 to 23
Aug 18 02:47:00 toothbrush boinc[3170]: 18-Aug-2021 02:47:00 [---] Suspending computation - CPU is busy
Aug 18 02:47:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r344465819 t8087795, 64bit:1), syncing.
Aug 18 02:47:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r398433847 t7744494, 64bit:1), syncing.
Aug 18 02:47:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r349747452 t8371229, 64bit:1), syncing.
Aug 18 02:48:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r390229100 t7980049, 64bit:1), syncing.
Aug 18 02:48:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r352409333 t7226854, 64bit:1), syncing.
Aug 18 02:48:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r508920330 t10538384, 64bit:1), syncing.
Aug 18 02:48:50 toothbrush boinc[3170]: 18-Aug-2021 02:48:50 [---] Resuming computation
Aug 18 02:49:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r261199946 t4894398, 64bit:1), syncing.
Aug 18 02:49:01 toothbrush boinc[3170]: 18-Aug-2021 02:49:01 [---] Suspending computation - CPU is busy
Aug 18 02:49:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r251680223 t6509690, 64bit:1), syncing.
Aug 18 02:49:21 toothbrush boinc[3170]: 18-Aug-2021 02:49:21 [---] Resuming computation
Aug 18 02:49:31 toothbrush boinc[3170]: 18-Aug-2021 02:49:31 [---] Suspending computation - CPU is busy
Aug 18 02:49:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r346528983 t5840449, 64bit:1), syncing.
Aug 18 02:50:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r560923145 t12173867, 64bit:1), syncing.
Aug 18 02:50:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r567474866 t11497897, 64bit:1), syncing.
Aug 18 02:50:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r519892497 t10585216, 64bit:1), syncing.
Aug 18 02:51:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r553040012 t11503711, 64bit:1), syncing.
Aug 18 02:51:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r489967052 t11999909, 64bit:1), syncing.
Aug 18 02:51:31 toothbrush boinc[3170]: 18-Aug-2021 02:51:31 [---] Resuming computation
Aug 18 02:51:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r279491189 t4690385, 64bit:1), syncing.
Aug 18 02:51:41 toothbrush boinc[3170]: 18-Aug-2021 02:51:41 [---] Suspending computation - CPU is busy
Aug 18 02:52:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r233899151 t4847426, 64bit:1), syncing.
Aug 18 02:52:01 toothbrush boinc[3170]: 18-Aug-2021 02:52:01 [---] Resuming computation
Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session closed for user root
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation
The output of cat /var/log/syslog near the shutdown is:
Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'lp'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'ppdev'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'parport_pc'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'msr'
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush lvm[461]: 2 logical volume(s) in volume group "vgubuntu" monitored
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] KERNEL supported cpus:
Aug 18 09:25:52 toothbrush systemd[1]: Starting Flush Journal to Persistent Storage...
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] Intel GenuineIntel
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] AMD AuthenticAMD
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] Hygon HygonGenuine
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] Centaur CentaurHauls
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] zhaoxin Shanghai
Aug 18 09:25:52 toothbrush systemd[1]: Finished Load Kernel Modules.
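The interesting thing here is that the last log entry before the shutdown is "Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation", which indicates that BOINC was about to start running a CPU-intensive process.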
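Finding that last entry by eye gets tedious when the shutdown repeats. A small sketch that scans a syslog-style file for the largest silence between consecutive timestamps is shown below; it assumes the classic "Aug 18 02:59:24" prefix, and because that format carries no year, the year is a parameter (2021 here). The function names are illustrative.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the 'Aug 18 02:59:24' prefix of a classic syslog line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year=2021):
    """Return (gap, last_line_before, first_line_after) for the longest silence.

    Returns None if fewer than two timestamped lines are found.
    """
    best = None
    prev_time, prev_line = None, None
    for line in lines:
        try:
            t = parse_syslog_time(line, year)
        except ValueError:
            continue  # not a timestamped line
        if prev_time is not None:
            gap = t - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = t, line
    return best


# Three lines taken from the syslog excerpt above.
sample = [
    "Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate ...",
    "Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation",
    "Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'lp'",
]
gap, before, after = largest_gap(sample)
print(gap)     # 6:26:28
print(before)  # the BOINC "Resuming computation" line
```

On a real machine the same function could be fed an open file, for example largest_gap(open("/var/log/syslog")), subject to read permissions on that file.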
Running cat /var/log/kern.log and looking at the nearby times gives even less information:
Aug 17 23:47:21 toothbrush kernel: [100858.782842] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
Aug 17 23:47:21 toothbrush kernel: [100858.782850] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 17 23:47:21 toothbrush kernel: [100858.782851] pcieport 0000:00:01.0: device [8086:4c01] error status/mask=00000001/00002000
Aug 17 23:47:21 toothbrush kernel: [100858.782852] pcieport 0000:00:01.0: [ 0] RxErr (First)
Aug 18 00:00:01 toothbrush kernel: [101618.605604] audit: type=1400 audit(1629259201.304:83): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=3302495 comm="cupsd" capability=12 capname="net_admin"
Aug 18 00:00:05 toothbrush kernel: [101622.407042] audit: type=1400 audit(1629259205.104:84): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=3302502 comm="cups-browsed" capability=23 capname="sys_nice"
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [ 0.000000] KERNEL supported cpus:
Running: last -x | head | tac
joncrall :0 :0 Mon Aug 16 19:46 - crash (1+13:38)
runlevel (to lvl 5) 5.11.0-25-generi Mon Aug 16 19:47 - 09:26 (1+13:39)
joncrall pts/3 tmux(11727).%0 Mon Aug 16 20:41 - 21:49 (01:07)
joncrall pts/23 tmux(3215922).%0 Tue Aug 17 23:46 - crash (09:39)
reboot system boot 5.11.0-25-generi Wed Aug 18 09:25 still running
joncrall :0 :0 Wed Aug 18 09:25 still logged in
runlevel (to lvl 5) 5.11.0-25-generi Wed Aug 18 09:26 still running
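I have completely forgotten what "crash" versus "still running" means in the columns of last reboot, so I am not sure how to interpret this, or whether there is any diagnostic information here.
So, if it is heat, I do not think the system is logging it.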
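My question + summary:
So, my machine is shutting down, and I am not sure whether the cause is thermals, power, or something else. To mitigate thermal issues I installed 4 extra fans in the case: there are 2 intake fans at the lower front, 1 intake fan at the front bottom, 2 exhaust fans at the rear top, and 1 exhaust fan at the rear. The NH-D15 has two fans mounted (I double-checked their orientation).
Are there other logs I can check to debug a thermal problem?
Am I a fool for using air cooling, and is this just a temperature fluctuation that an AIO CPU water cooler could mitigate?
Are there other hypotheses I have not considered?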
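Update 2021-10-01
Happy October. My machine is still driving me crazy. But at least I have a few angles I can bisect on to try to locate the problem.
I reconfigured the hardware to try to determine whether the 3090 is part of the problem, but I do not think it is.
I completely removed the 3090, so the 1080ti is now the only graphics card in the machine. I did not change the PCIe slot the 1080ti is attached to. Previously the 3090 was in slot 1/3 (the slot closest to the CPU) and the 1080ti was in slot 3/3 (the farthest slot). I just removed the 3090 and left the 1080ti in slot 3. I connected a DVI cable, booted, and ran the PyTorch training code. I started training at 10:18 PM on September 30, 2021. It was still running when I went to bed, but I woke up to find the machine powered off. Looking at the logs, it appears to have shut down at about 2:14 AM on October 1, 2021, so it was able to chug along for nearly 4 hours before hitting the problem.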
So, even without the 3090, the issue persists (yay, the overpriced GPU isn't the problem). Using the 3090 does seem to induce the issue faster, but it is not the root cause.
I'm wondering if I might have uncovered a vulnerability in my hardware that is triggered by the type of training I'm doing. Hopefully I can find an MWE so I can point to a particular instruction that causes this (recall that running standard ConvNet training with stock torch / tensorflow scripts does not trigger the issue; the code I'm running right now trains a transformer network with pytorch-lightning).
Before I do this I'm going to try a few more hardware configurations while I have the machine open.
Later: reproduced the error with the 1080ti in slot 0. I suppose the next test is to try swapping out the power supply. It's going to suck undoing my cable management, but it should either rule out the PSU or home in on it.
Later: It's not the PSU. I swapped it out for a 1000W PSU and started an experiment at 7:43. Poweroff at 7:57. So: torch-exploit, the CPU, the MOBO, others? The initial build had bad RAM, but I got it replaced. I'll rerun.