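I have had munin-node running successfully on my machine for a while, but recently it stopped starting. There is no munin-node log for me to check, and systemctl status munin-node does not provide much useful information either: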
[root@host /]# systemctl status munin-node
● munin-node.service - Munin Node
Loaded: loaded (/usr/lib/systemd/system/munin-node.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Tue 2021-05-18 23:35:16 CEST; 1h 8min ago
Docs: man:munin-node(1)
http://guide.munin-monitoring.org/en/latest/node/index.html
Process: 7710 ExecStart=/usr/sbin/munin-node --foreground (code=exited, status=0/SUCCESS)
Main PID: 7710 (code=exited, status=0/SUCCESS)
May 18 23:33:44 host systemd[1]: Starting Munin Node...
May 18 23:35:14 host systemd[1]: munin-node.service start operation timed out. Terminating.
May 18 23:35:16 host systemd[1]: Failed to start Munin Node.
May 18 23:35:16 host systemd[1]: Unit munin-node.service entered failed state.
May 18 23:35:16 host systemd[1]: munin-node.service failed.
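Since the status output gives no hint about what is slow, one thing worth checking in a situation like this is how long each plugin takes when run by hand. This is a minimal sketch, not something from the original post: time_plugins is a hypothetical helper, /etc/munin/plugins is assumed to be the plugin directory, and munin-run is assumed to be installed alongside munin-node.

```shell
# Hypothetical helper: run every plugin in a directory once and print how long each took.
# Assumptions: plugins live in /etc/munin/plugins and are run through munin-run.
time_plugins() {
    local plugin_dir="${1:-/etc/munin/plugins}"
    local runner="${2:-munin-run}"
    local plugin name start end
    for plugin in "$plugin_dir"/*; do
        # Skip the unexpanded glob when the directory is empty.
        [ -e "$plugin" ] || continue
        name=$(basename "$plugin")
        start=$(date +%s)
        # Plugin output is discarded; only the elapsed time matters here.
        "$runner" "$name" >/dev/null 2>&1
        end=$(date +%s)
        printf '%s\t%ss\n' "$name" "$((end - start))"
    done
}
```

Calling time_plugins as root prints one line per plugin with its run time in whole seconds, which makes it easier to see whether a single plugin accounts for most of the 90 seconds between "Starting" and "timed out" in the log.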
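The problem turned out to be that the plugins took too long, in particular the nvidia_gpu_* plugins, since this is a multi-GPU machine. Nothing clearly indicated that the plugins were causing the timeout. To speed up the nvidia_gpu_* plugins, I used a command based on https://forums.developer.nvidia.com/t/nvidia-smi-is-slow-on-ubuntu-16-04/50416. To test its effect, just run nvidia-smi: it should load faster because the GPUs no longer have to be woken up first.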
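The exact command is not reproduced in the post. A commonly suggested remedy for slow nvidia-smi start-up is enabling persistence mode, so the sketch below assumes that is the kind of fix meant here. enable_persistence and the NVIDIA_SMI override are hypothetical names used only for illustration; -pm 1 is the nvidia-smi option that turns persistence mode on, and it needs root.

```shell
# Hypothetical helper: enable persistence mode, then time a plain nvidia-smi call.
# Assumption: persistence mode is the fix in question; the post does not show the command.
enable_persistence() {
    # NVIDIA_SMI can point at a different binary; it defaults to nvidia-smi on PATH.
    local smi="${NVIDIA_SMI:-nvidia-smi}"
    local start end
    # Keep the driver initialised between calls so the GPUs need not be woken up each time.
    "$smi" -pm 1 || return 1
    start=$(date +%s)
    "$smi" >/dev/null || return 1
    end=$(date +%s)
    echo "nvidia-smi took $((end - start))s with persistence mode on"
}
```

Persistence mode set this way does not survive a reboot, so it would have to be reapplied at boot for munin-node to keep starting within the systemd timeout.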