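I have configured systemd to use the hardware watchdog. My kernel version is 5.10. This is the configuration in /etc/systemd/system.conf:
RuntimeWatchdogSec=120
WatchdogDevice=/dev/watchdog1
I can see that systemd starts the hardware watchdog, and the system runs fine. I need to test whether this hardware watchdog actually resets the hardware, so I need to make systemd stop kicking it at runtime. Is that possible?
I cannot kill the systemd process.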
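I have a Supermicro X9DR3-F motherboard on which pins 1 and 2 of the JWD jumper are shorted and the watchdog function is enabled in UEFI.
This means that the system resets after about 5 minutes if the hardware watchdog timer is not reset. I installed the watchdog daemon and configured it to use the iTCO_wdt driver: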
$ cat /etc/default/watchdog
# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="iTCO_wdt"
# Specify additional watchdog options here (see manpage).
$
When the watchdog daemon starts, the driver loads without problems:
$ sudo dmesg | grep iTCO_wdt
[ 17.435620] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[ 17.435667] iTCO_wdt: Found a Patsburg TCO device (Version=2, TCOBASE=0x0460)
[ 17.435761] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
$
In addition, the /dev/watchdog file exists:
$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec 8 22:36 /dev/watchdog
$
The watchdog-device option in the watchdog daemon configuration points to this file:
$ grep -v ^# /etc/watchdog.conf
watchdog-device = /dev/watchdog
watchdog-timeout = 60
interval = 5
log-dir = /var/log/watchdog
verbose = yes
realtime = yes
priority = 1
heartbeat-file = /var/log/watchdog/heartbeat
heartbeat-stamps = 1000
$
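As a quick sanity check on the settings above, the following is a minimal sketch that parses the `key = value` lines and reports how much slack there is between the keepalive interval and the configured timeout. The helper names `parse_watchdog_conf` and `keepalive_margin` are made up for illustration, and the check covers only the daemon's own settings; it says nothing about which hardware timer is actually being reset.

```python
def parse_watchdog_conf(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def keepalive_margin(settings):
    """Seconds of slack between the keepalive interval and the timeout."""
    return int(settings["watchdog-timeout"]) - int(settings["interval"])


# Excerpt of the configuration shown above.
conf = """\
watchdog-device = /dev/watchdog
watchdog-timeout = 60
interval = 5
"""
settings = parse_watchdog_conf(conf)
print(settings["watchdog-device"], keepalive_margin(settings))
```

With the values shown, the daemon writes far more often than the 60-second timeout requires, so the daemon-side configuration by itself does not explain the resets.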
To debug writes to the watchdog device, I enabled the heartbeat-file option, and it looks like keepalive messages are being sent to /dev/watchdog:
$ tail /var/log/watchdog/heartbeat
1575830728
1575830728
1575830728
1575830733
1575830733
1575830733
1575830733
1575830733
1575830733
1575830733
$
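The spacing of the keepalives can be read off the heartbeat file directly. Below is a small sketch, assuming each line of the file is the Unix timestamp of one keepalive write; `heartbeat_gaps` is a hypothetical helper name, and the sample list reproduces the ten timestamps from the tail output above.

```python
def heartbeat_gaps(timestamps):
    """Return the gaps in seconds between consecutive distinct timestamps."""
    distinct = sorted(set(timestamps))
    return [later - earlier for earlier, later in zip(distinct, distinct[1:])]


# The ten lines from the tail output above.
sample = [1575830728] * 3 + [1575830733] * 7
gaps = heartbeat_gaps(sample)
print(gaps)  # [5]
```

A single 5-second gap matches `interval = 5` in /etc/watchdog.conf, which suggests the daemon itself is writing on schedule.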
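Despite this, the server resets itself at intervals of roughly five minutes.
My next thought was that the iTCO_wdt driver controls the watchdog in the C606 chipset, while the watchdog that resets the server is part of IPMI. So I made sure that the iTCO_wdt driver was not loaded during boot and rebooted the server. Sure enough, /dev/watchdog no longer existed. I then loaded the ipmi_watchdog module: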
$ ls -l /dev/watchdog
ls: cannot access '/dev/watchdog': No such file or directory
$ sudo modprobe ipmi_watchdog
$ sudo dmesg -T | tail -1
[Tue Dec 10 21:12:48 2019] IPMI Watchdog: driver initialized
$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec 10 21:12 /dev/watchdog
$
...and finally started the watchdog daemon, which, judging by the /var/log/watchdog/heartbeat file, writes to /dev/watchdog at 5 s intervals. This can also be confirmed with strace:
$ ps -p 2296 -f
UID PID PPID C STIME TTY TIME CMD
root 2296 1 0 01:28 ? 00:00:00 /usr/sbin/watchdog
$ sudo strace -y -p 2296
strace: Process 2296 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
open("/proc/uptime", O_RDONLY) = 2</proc/uptime>
close(2</proc/uptime>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
nanosleep({5, 0}, NULL) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
open("/proc/uptime", O_RDONLY) = 2</proc/uptime>
close(2</proc/uptime>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
nanosleep({5, 0}, NULL) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
open("/proc/uptime", O_RDONLY) = 2</proc/uptime>
close(2</proc/uptime>) = 0
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
write(1</dev/watchdog>, "\0", 1) = 1
nanosleep({5, 0}, ^Cstrace: Process 2296 detached
<detached ...>
$
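To summarize a trace like the one above without reading it line by line, here is a rough sketch that counts writes to the watchdog device between successive nanosleep calls. The function name `writes_per_cycle` is hypothetical, the matching assumes the `strace -y` line format shown above, and the sample is an abbreviated excerpt rather than the full trace.

```python
def writes_per_cycle(strace_lines, device="/dev/watchdog"):
    """Count write() calls to the device before each nanosleep() call."""
    counts = []
    current = 0
    for line in strace_lines:
        if line.startswith("write(") and f"<{device}>" in line:
            current += 1
        elif line.startswith("nanosleep("):
            counts.append(current)
            current = 0
    return counts


# Abbreviated excerpt in the same format as the strace output above.
sample = [
    r'write(1</dev/watchdog>, "\0", 1) = 1',
    r'write(1</dev/watchdog>, "\0", 1) = 1',
    r'open("/proc/uptime", O_RDONLY) = 2</proc/uptime>',
    r'close(2</proc/uptime>) = 0',
    r'write(1</dev/watchdog>, "\0", 1) = 1',
    r'nanosleep({5, 0}, NULL) = 0',
    r'write(1</dev/watchdog>, "\0", 1) = 1',
    r'nanosleep({5, 0}, NULL) = 0',
]
print(writes_per_cycle(sample))  # [3, 1]
```

A nonzero count in every cycle confirms that the daemon writes to the device node on each wake-up; it cannot show whether the driver behind that node resets the timer that is rebooting the board.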
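The watchdog daemon with PID 2296 shown above was started with the heartbeat-file option in /etc/watchdog.conf commented out, in order to reduce the number of write system calls in the strace output.
However, the server still reboots at intervals of roughly 300 seconds.
Why can't the watchdog daemon reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard?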