Shortly after copying a large file on my main SSD, the IO PSI values rise above 90% and the system becomes very sluggish. It takes about a minute for things to return to normal.
I had also noticed freezes during operations such as docker pull, but I can now reproduce what I believe is the same problem with rsync.
When the test file is on the second SSD, the problem does not seem to occur.
Reproducible example:
Inspired by this article, I log the load average and the IO PSI values to a CSV file once per second.
In the first terminal session I run:
rm monitor.csv; t=0; while true; do t=$(($t+1)); echo $t,$(cat /proc/loadavg | cut -d ' ' -f1),$(grep some /proc/pressure/cpu | cut -d' ' -f2 | cut -d= -f2),$(grep some /proc/pressure/io | cut -d' ' -f2 | cut -d= -f2),$(grep full /proc/pressure/io | cut -d' ' -f2 | cut -d= -f2) >> monitor.csv; sleep 1; done
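For reference, the script grabs the avg10 field (the 10-second average) from the pressure files, which look like this (the values here are illustrative):
$ cat /proc/pressure/io
some avg10=6.15 avg60=2.10 avg300=0.50 total=12345678
full avg10=5.79 avg60=1.95 avg300=0.45 total=9876543
"some" means at least one task was stalled on IO during the window; "full" means all non-idle tasks were stalled at once.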
Then I copy a 3.3 GB test file on the main SSD:
rm test.zip ; echo "start rsync" >> monitor.csv ; rsync original.zip test.zip ; echo "end rsync" >> monitor.csv
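Since rsync appears to return long before the data is actually on disk, it can help to watch how much dirty data is still queued for writeback while the test runs (a diagnostic sketch using the standard /proc/meminfo counters):
$ watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
If Dirty stays in the gigabyte range after rsync exits, the PSI spike that follows would be the page cache draining to the device rather than the copy itself.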
monitor.csv now looks like this:
1,0.64,0.18,0.00,0.00
2,0.67,0.14,0.00,0.00
3,0.67,0.14,0.00,0.00
4,0.67,0.12,0.00,0.00
start rsync
5,0.67,0.12,0.00,0.00
6,0.67,0.09,0.00,0.00
7,0.70,0.09,0.00,0.00
8,0.70,0.08,0.00,0.00
end rsync
9,0.70,0.08,0.00,0.00
10,0.70,0.42,6.15,5.79
11,0.70,0.42,6.15,5.79
12,0.80,0.35,16.99,15.97
13,0.80,0.35,16.99,15.97
14,0.80,0.28,30.94,29.38
15,0.80,0.28,30.94,29.38
16,1.14,0.23,43.09,41.27
17,1.14,0.23,51.59,49.73
18,1.14,0.19,51.59,49.73
19,1.14,0.19,51.59,49.73
20,1.14,0.33,58.00,55.40
21,2.57,0.27,62.89,60.22
22,2.57,0.27,62.89,60.22
23,2.57,0.22,67.80,65.25
24,2.57,0.22,67.80,65.25
25,2.57,0.18,72.37,69.73
26,3.57,0.18,72.37,69.73
27,3.57,0.15,75.38,72.68
28,3.57,0.15,75.38,72.68
29,3.57,0.12,79.48,76.90
30,3.57,0.12,79.48,76.90
31,4.56,0.10,83.01,80.36
32,4.56,0.10,83.01,80.36
33,4.56,0.08,84.46,81.38
34,4.56,0.08,84.46,81.38
35,4.56,0.06,86.37,83.12
36,5.24,0.06,86.37,83.12
37,5.24,0.05,86.66,83.10
38,5.24,0.05,86.66,83.10
39,5.24,0.04,87.09,82.90
40,5.24,0.04,87.09,82.90
41,5.94,0.03,84.71,80.02
42,5.94,0.03,84.71,80.02
43,5.94,0.02,84.77,79.65
44,5.94,0.02,84.77,79.65
45,5.94,0.02,86.26,80.08
46,6.91,0.02,86.26,80.08
47,6.91,0.01,86.93,80.42
48,6.91,0.01,86.93,80.42
49,6.91,0.01,86.04,79.80
50,6.91,0.01,86.04,79.80
51,7.95,0.01,85.67,79.29
52,7.95,0.01,85.67,79.29
53,7.95,0.01,86.45,79.78
54,7.95,0.01,86.45,79.78
55,7.95,0.00,86.73,79.82
56,8.12,0.00,86.73,79.82
57,8.12,0.18,80.08,73.69
58,8.12,0.18,80.08,73.69
59,8.12,0.33,65.57,60.34
60,7.55,0.33,65.57,60.34
61,7.55,0.27,54.05,49.77
62,7.55,0.27,54.05,49.77
63,7.55,0.22,44.26,40.76
64,7.55,0.18,36.24,33.37
65,7.02,0.18,36.24,33.37
66,7.02,0.18,36.24,33.37
67,7.02,0.14,29.67,27.32
68,7.02,0.12,24.30,22.37
69,7.02,0.12,24.30,22.37
70,6.46,0.10,19.89,18.32
71,6.46,0.10,19.89,18.32
72,6.46,0.08,16.29,15.00
73,6.46,0.08,16.29,15.00
74,6.46,0.06,13.34,12.28
75,5.94,0.06,13.34,12.28
76,5.94,0.05,10.92,10.06
77,5.94,0.05,10.92,10.06
78,5.94,0.04,8.94,8.23
79,5.94,0.04,8.94,8.23
80,5.55,0.03,7.32,6.74
81,5.55,0.03,7.32,6.74
82,5.55,0.02,5.99,5.52
83,5.55,0.02,5.99,5.52
84,5.55,0.02,4.91,4.52
85,5.10,0.02,4.91,4.52
86,5.10,0.01,4.02,3.70
87,5.10,0.01,4.02,3.70
88,5.10,0.01,3.29,3.03
89,5.10,0.01,3.29,3.03
The graph shows:
- copying the file takes only about 5 seconds
- after that, the PSI values climb very high and only recover after about a minute
- during this time, the system GUI becomes very sluggish
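To quantify the stall from monitor.csv, a quick summary along these lines works (a sketch; it assumes the five-column CSV produced above and skips the start/end marker lines):
$ awk -F, 'NF==5 && $5+0>50 {n++; if ($5+0>max) max=$5+0} END {printf "seconds with full IO PSI above 50: %d, peak: %.2f\n", n, max}' monitor.csv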
After the copy process has finished, a warning also flashes briefly at the top of the screen.
The same operation on a file on the second SSD takes longer, but the system does not freeze and the IO PSI values do not spike nearly as badly:
1,2.42,0.00,0.01,0.01
2,2.42,0.00,0.01,0.01
3,2.42,0.00,0.01,0.00
4,2.23,0.00,0.01,0.00
start rsync
5,2.23,0.00,0.00,0.00
6,2.23,0.00,0.00,0.00
7,2.23,0.00,1.27,1.09
8,2.23,0.00,1.27,1.09
9,2.13,0.00,2.85,2.70
10,2.13,0.00,2.85,2.70
11,2.13,0.18,3.42,3.30
12,2.13,0.18,3.42,3.30
13,2.13,0.14,4.25,4.15
14,2.04,0.14,4.25,4.15
15,2.04,0.12,5.11,5.03
16,2.04,0.12,5.11,5.03
17,2.04,3.72,9.44,7.02
end rsync
18,2.04,3.72,9.44,7.02
19,1.88,3.04,8.63,6.65
20,1.88,3.04,8.63,6.65
21,1.88,2.49,7.07,5.44
22,1.88,2.49,7.07,5.44
23,1.73,2.04,5.79,4.46
24,1.73,2.04,5.79,4.46
25,1.73,1.67,4.74,3.65
26,1.73,1.67,4.74,3.65
27,1.73,1.36,3.88,2.99
28,1.59,1.36,3.88,2.99
29,1.59,1.12,3.17,2.44
30,1.59,1.12,3.17,2.44
System details
Lenovo E14
Ubuntu 22.04
GNOME on Wayland
AMD Ryzen 7 4700U CPU with Radeon graphics
2 built-in NVMe SSDs, both with encrypted filesystems
- Main SSD (nvme0):
  - Model: KBG40ZNT512G TOSHIBA MEMORY
  - Firmware version: 0109AELA (appears to be the latest)
- Secondary SSD (nvme1):
  - Model: CT500P3SSD8
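Since both drives carry encrypted filesystems, the exact block stack may matter for writeback behaviour; lsblk (standard util-linux) shows it at a glance:
$ lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINT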
The scheduler is the same on both devices:
$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline
$ cat /sys/block/nvme1n1/queue/scheduler
[none] mq-deadline
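Besides the scheduler, a few other per-queue knobs could differ between the two drives; these sysfs files exist on stock Ubuntu kernels (wbt_lat_usec may be absent if writeback throttling is disabled in the kernel):
$ for f in nr_requests wbt_lat_usec write_cache; do grep . /sys/block/nvme?n1/queue/$f; done
Writeback throttling (wbt_lat_usec) in particular changes how aggressively background writeback is allowed to load the device.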
Theories / questions:
- Is the main SSD failing?
- Is this related to the limited number of PCIe lanes in the system? (Checked in the sketch below.)
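The negotiated PCIe link of each controller can be read from sysfs (a sketch; /sys/class/nvme/nvmeN/device points at the PCI device, and current_link_speed/current_link_width are standard PCI attributes):
for d in /sys/class/nvme/nvme*/device; do
  echo "$d: $(cat $d/current_link_speed), x$(cat $d/current_link_width)"
done
A link trained at a lower speed or width than expected would lend weight to the lane theory.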
Things tried
- I tried kernel 6.8.0-40-generic (the standard on my system) as well as 6.5.0-45, but the behaviour is the same in this respect.
EDIT1 (following the comment from @ubfan1)
- Trim seems to run more on the second device (the one without problems). On the main device it runs less often; is that a good or a bad thing?
$ journalctl |grep "fstrim\["|grep nvme0
Jul 01 08:32:31 example fstrim[27758]: /boot/efi: 504.9 MiB (529436672 bytes) trimmed on /dev/nvme0n1p1
Jul 01 08:32:31 example fstrim[27758]: /boot: 1.2 GiB (1269686272 bytes) trimmed on /dev/nvme0n1p2
Jul 08 09:53:27 example fstrim[50828]: /boot/efi: 504.9 MiB (529436672 bytes) trimmed on /dev/nvme0n1p1
Jul 08 09:53:27 example fstrim[50828]: /boot: 1.2 GiB (1269567488 bytes) trimmed on /dev/nvme0n1p2
Jul 15 09:43:17 example fstrim[46889]: /boot/efi: 504.9 MiB (529436672 bytes) trimmed on /dev/nvme0n1p1
Jul 15 09:43:17 example fstrim[46889]: /boot: 1.2 GiB (1269567488 bytes) trimmed on /dev/nvme0n1p2
Jul 22 12:33:55 example fstrim[14093]: /boot/efi: 504.9 MiB (529436672 bytes) trimmed on /dev/nvme0n1p1
Jul 22 12:33:55 example fstrim[14093]: /boot: 1.2 GiB (1267404800 bytes) trimmed on /dev/nvme0n1p2
Jul 29 10:15:54 example fstrim[63916]: /boot/efi: 504.9 MiB (529436672 bytes) trimmed on /dev/nvme0n1p1
Jul 29 10:15:54 example fstrim[63916]: /boot: 1.2 GiB (1267404800 bytes) trimmed on /dev/nvme0n1p2
Aug 05 09:37:39 example fstrim[46612]: /boot/efi: 504.9 MiB (529436672 bytes) trimmed on /dev/nvme0n1p1
Aug 05 09:37:39 example fstrim[46612]: /boot: 1.2 GiB (1269469184 bytes) trimmed on /dev/nvme0n1p2
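To check whether the encrypted root is being trimmed at all, fstrim can be run by hand (util-linux; note that dm-crypt blocks discards unless it was set up with allow-discards):
$ sudo fstrim -v /
If this fails with "the discard operation is not supported", discards are being stopped at the encryption layer, which would explain why only /boot and /boot/efi show up in the journal.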
Both devices are built in, not connected via USB.
smartctl output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.8.0-40-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: KBG40ZNT512G TOSHIBA MEMORY
Serial Number: 30TPxxxxxxx
Firmware Version: 0109AELA
PCI Vendor/Subsystem ID: 0x1e0f
IEEE OUI Identifier: 0x8ce38e
Total NVM Capacity: 512.110.190.592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 0
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512.110.190.592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 8ce38e 0400911b0e
Local Time is: Tue Aug 20 08:58:37 2024 CEST
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 86 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 3.50W - - 0 0 0 0 1 1
1 + 2.60W - - 1 1 1 1 1 1
2 + 1.80W - - 2 2 2 2 1 1
3 - 0.0500W - - 4 4 4 4 800 1200
4 - 0.0050W - - 4 4 4 4 3000 32000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 3
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 29 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 12%
Data Units Read: 41.196.433 [21,0 TB]
Data Units Written: 54.357.811 [27,8 TB]
Host Read Commands: 792.033.014
Host Write Commands: 1.093.950.301
Controller Busy Time: 10.506
Power Cycles: 3.308
Power On Hours: 6.391
Unsafe Shutdowns: 4
Media and Data Integrity Errors: 0
Error Information Log Entries: 44
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 29 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
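The health log above counts 44 error-information log entries while the 16 displayed slots are empty; if nvme-cli is installed, the controller's full error log can be dumped directly (standard nvme-cli subcommand; /dev/nvme0 assumed to be the main controller):
$ sudo nvme error-log /dev/nvme0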