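I'm not sure whether this belongs here or on Server Fault, so I'm trying here for now.
I have a machine that I use as a personal server, with the following specs:
- 2x Xeon E5-2690 v3
- ASUS Z10PA-D8 Series motherboard
- Nvidia Quadro P4000
- 1x Samsung SSD 870 EVO 500GB (system drive)
- 3x WDC WD40EFAX-68JH4N1 (in an mdadm RAID5 configuration)
- 1000W Gold Cooler Master PSU
- Running Ubuntu 24.04 LTS
It seems to be somewhat intermittent, but I often find my dmesg log flooded with the same kind of READ FPDMA QUEUED errors across multiple drives.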
[ +0.003702] ata10.00: status: { DRDY }
[ +0.001747] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001794] ata10.00: cmd 60/40:38:80:89:33/05:00:a9:01:00/40 tag 7 ncq dma 688128 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
[ +0.003511] ata10.00: status: { DRDY }
[ +0.001673] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001828] ata10.00: cmd 60/40:40:c0:8e:33/05:00:a9:01:00/40 tag 8 ncq dma 688128 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
[ +0.003079] ata10.00: status: { DRDY }
[ +0.000513] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001324] ata10.00: cmd 60/40:48:40:99:33/05:00:a9:01:00/40 tag 9 ncq dma 688128 in
res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
[ +0.003468] ata10.00: status: { DRDY }
[ +0.001671] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001707] ata10.00: cmd 60/40:50:80:9e:33/05:00:a9:01:00/40 tag 10 ncq dma 688128 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
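For anyone trying to read the "cmd" lines above, here is a minimal Python sketch that decodes the taskfile bytes libata prints for a failed NCQ command. The function name `decode_ncq_cmd` and the regular expression are my own for illustration, and the field order (command/feature:count:LBA low bytes/high-order bytes/device) is my reading of the kernel's log format, so treat it as an assumption rather than a reference decoder.

```python
import re

# Matches the "cmd ..." line libata logs for a failed NCQ command.
CMD_LINE = re.compile(r"(ata\d+\.\d+): cmd (\S+) tag (\d+) ncq dma (\d+) (in|out)")


def decode_ncq_cmd(line):
    """Decode one 'cmd' line into a dict; return None if the line does not match."""
    m = CMD_LINE.search(line)
    if m is None:
        return None
    port, taskfile, tag, dma_bytes, direction = m.groups()
    first, low, high, device = taskfile.split("/")
    command = int(first, 16)
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    # For FPDMA (NCQ) commands the sector count is carried in the feature
    # fields; a value of 0 stands for 65536 sectors.
    sectors = ((hob_feature << 8) | feature) or 65536
    # 48-bit LBA assembled from the six LBA bytes.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "port": port,
        "command": command,          # 0x60 is READ FPDMA QUEUED
        "tag": int(tag),
        "tag_from_count": nsect >> 3,  # NCQ keeps the tag in bits 7:3 of the count field
        "sectors": sectors,
        "lba": lba,
        "dma_bytes": int(dma_bytes),
        "direction": direction,
    }


sample = ("[ +0.001794] ata10.00: cmd 60/40:38:80:89:33/05:00:a9:01:00/40 "
          "tag 7 ncq dma 688128 in")
decoded = decode_ncq_cmd(sample)
print(decoded)
```

On the first failed command in the log this yields 1344 sectors, and 1344 x 512 bytes = 688128, which matches the "dma 688128" size printed on the same line. The decoded LBAs of successive failures are 1344 sectors apart, so these look like large sequential reads rather than scattered ones.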
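What I've tried:
- Bought new SATA cables for every drive and replaced the old ones
- Reseated the power cable on each drive
- Rearranged which motherboard ports are used
- Checked the BIOS version (it is the latest)
- Checked SMART health (see pastebins)
The system still seems usable, but I don't think these errors are a good sign, and I'm not sure how to diagnose this further. The SMART reports suggest the drives are all in good health, and the SATA cables are brand new.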
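Truncated dmesg log (copied from my terminal scrollback before I rebooted; I will update with the full version the next time the errors occur): https://pastebin.com/j0Jnhkgt
skdump for each disk: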
Device: sat16:/dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
Size: 476940 MiB
Model: [Samsung SSD 870 EVO 500GB]
Serial: [S6PXNU0X400255M]
Firmware: [SVT02B6Q]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 0 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: no
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 85 min
Conveyance Self-Test Polling Time: 0 min
Bad Sectors: 0 sectors
Powered On: 6.7 months
Power Cycles: 30
Average Powered On Per Power Cycle: 6.7 days
Temperature: 25.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
5 reallocated-sector-count 100 100 10 0 sectors 0x000000000000 prefail online yes yes
9 power-on-hours 99 99 0 6.7 months 0xdb1200000000 old-age online n/a n/a
12 power-cycle-count 99 99 0 30 0x1e0000000000 old-age online n/a n/a
177 wear-leveling-count 99 99 0 3 0x030000000000 prefail online n/a n/a
179 used-reserved-blocks-total 100 100 10 0 0x000000000000 prefail online yes yes
181 program-fail-count-total 100 100 10 0 0x000000000000 old-age online yes yes
182 erase-fail-count-total 100 100 10 0 0x000000000000 old-age online yes yes
183 runtime-bad-block-total 100 100 10 0 0x000000000000 prefail online yes yes
187 reported-uncorrect 100 100 0 0 sectors 0x000000000000 old-age online n/a n/a
190 airflow-temperature-celsius 75 63 0 25.0 C 0x190000000000 old-age online n/a n/a
195 hardware-ecc-recovered 200 200 0 0 0x000000000000 old-age online n/a n/a
199 udma-crc-error-count 99 99 0 475 0xdb0100000000 old-age online n/a n/a
235 good-block-rate 99 99 0 n/a 0x130000000000 old-age online n/a n/a
241 total-lbas-written 99 99 0 34642.150 TB 0x106d893d0000 old-age online n/a n/a
252 attribute-252 100 100 0 n/a 0x020000000000 old-age online n/a n/a
Device: sat16:/dev/sdb
Type: 16 Byte SCSI ATA SAT Passthru
Size: 3815447 MiB
Model: [WDC WD40EFAX-68JH4N1]
Serial: [WD-WX22D11RCA6L]
Firmware: [83.00A83]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 404 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 120 min
Conveyance Self-Test Polling Time: 2 min
Bad Sectors: 0 sectors
Powered On: 3.6 years
Power Cycles: 43
Average Powered On Per Power Cycle: 1.0 months
Temperature: 27.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 0 0x000000000000 prefail online yes yes
3 spin-up-time 201 200 21 2.9 s 0x7d0b00000000 prefail online yes yes
4 start-stop-count 100 100 0 45 0x2d0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 0 sectors 0x000000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 57 57 0 3.6 years 0x027b00000000 old-age online n/a n/a
10 spin-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 43 0x2b0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 28 0x1c0000000000 old-age online n/a n/a
193 load-cycle-count 194 194 0 20445 0xdd4f00000000 old-age online n/a n/a
194 temperature-celsius-2 120 109 0 27.0 C 0x1b0000000000 old-age online n/a n/a
196 reallocated-event-count 200 200 0 0 0x000000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 100 253 0 0 sectors 0x000000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 200 0 1825 0x210700000000 old-age online n/a n/a
200 multi-zone-error-rate 100 253 0 0 0x000000000000 old-age offline n/a n/a
Device: sat16:/dev/sdc
Type: 16 Byte SCSI ATA SAT Passthru
Size: 3815447 MiB
Model: [WDC WD40EFAX-68JH4N1]
Serial: [WD-WX22D11RCFDP]
Firmware: [83.00A83]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 54720 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 121 min
Conveyance Self-Test Polling Time: 2 min
Bad Sectors: 0 sectors
Powered On: 3.6 years
Power Cycles: 43
Average Powered On Per Power Cycle: 1.0 months
Temperature: 28.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 0 0x000000000000 prefail online yes yes
3 spin-up-time 201 200 21 2.9 s 0x640b00000000 prefail online yes yes
4 start-stop-count 100 100 0 45 0x2d0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 0 sectors 0x000000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 57 57 0 3.6 years 0x2d7b00000000 old-age online n/a n/a
10 spin-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 43 0x2b0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 28 0x1c0000000000 old-age online n/a n/a
193 load-cycle-count 194 194 0 20720 0xf05000000000 old-age online n/a n/a
194 temperature-celsius-2 119 107 0 28.0 C 0x1c0000000000 old-age online n/a n/a
196 reallocated-event-count 200 200 0 0 0x000000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 100 253 0 0 sectors 0x000000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 199 0 1806 0x0e0700000000 old-age online n/a n/a
200 multi-zone-error-rate 100 253 0 0 0x000000000000 old-age offline n/a n/a
Device: sat16:/dev/sdd
Type: 16 Byte SCSI ATA SAT Passthru
Size: 3815447 MiB
Model: [WDC WD40EFAX-68JH4N1]
Serial: [WD-WX32D11ED8N5]
Firmware: [83.00A83]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 46184 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 502 min
Conveyance Self-Test Polling Time: 3 min
Bad Sectors: 0 sectors
Powered On: 3.6 years
Power Cycles: 43
Average Powered On Per Power Cycle: 1.0 months
Temperature: 28.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 0 0x000000000000 prefail online yes yes
3 spin-up-time 202 199 21 2.9 s 0x430b00000000 prefail online yes yes
4 start-stop-count 100 100 0 45 0x2d0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 0 sectors 0x000000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 57 57 0 3.6 years 0x447b00000000 old-age online n/a n/a
10 spin-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 43 0x2b0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 28 0x1c0000000000 old-age online n/a n/a
193 load-cycle-count 200 200 0 2402 0x620900000000 old-age online n/a n/a
194 temperature-celsius-2 119 106 0 28.0 C 0x1c0000000000 old-age online n/a n/a
196 reallocated-event-count 200 200 0 0 0x000000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 100 253 0 0 sectors 0x000000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 200 0 1783 0xf70600000000 old-age online n/a n/a
200 multi-zone-error-rate 100 253 0 0 0x000000000000 old-age offline n/a n/a
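The Raw column in the skdump output above is easier to compare across drives once it is converted to a number. A small Python sketch, assuming the six raw bytes are printed in little-endian order (which agrees with the Pretty column for every attribute I spot-checked here, for example 0x210700000000 and 1825); `decode_raw` and `CRC_RAW` are illustrative names, not part of any tool:

```python
def decode_raw(raw_hex):
    """Interpret skdump's 6-byte raw field as a little-endian integer."""
    digits = raw_hex[2:] if raw_hex.startswith("0x") else raw_hex
    return int.from_bytes(bytes.fromhex(digits), "little")


# Raw values of attribute 199 (udma-crc-error-count), copied from the dumps above.
CRC_RAW = {
    "/dev/sda": "0xdb0100000000",
    "/dev/sdb": "0x210700000000",
    "/dev/sdc": "0x0e0700000000",
    "/dev/sdd": "0xf70600000000",
}

crc_counts = {dev: decode_raw(raw) for dev, raw in CRC_RAW.items()}
for dev, count in crc_counts.items():
    print(dev, count)
```

Attribute 199 counts CRC errors on the SATA link, not on the platters or flash. It is nonzero on all four drives, including the SSD, even though each reports "Overall Status: GOOD". That may be worth keeping in mind next to the "ATA bus error" messages in dmesg; recording these counts before and after a test run would show whether they are still increasing.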
mdadm status:
/dev/md0:
Version : 1.2
Creation Time : Sun Jul 11 18:12:41 2021
Raid Level : raid5
Array Size : 7813772288 (7.28 TiB 8.00 TB)
Used Dev Size : 3906886144 (3.64 TiB 4.00 TB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Mar 5 12:58:11 2025
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : crab:0 (local to host crab)
UUID : b9f769fb:49026686:78737cc2:90e8e63c
Events : 23954
Number Major Minor RaidDevice State
0 8 32 0 active sync /dev/sdc
1 8 16 1 active sync /dev/sdb
3 8 48 2 active sync /dev/sdd
Update: if I run an mdadm check, I can consistently reproduce the issue.
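For reference, one way to start such a check is through the md sysfs interface by writing "check" to the array's sync_action file. The sketch below is only an illustration under that assumption: the helper names and the `sysfs_root` parameter are hypothetical (the parameter exists so the code can be tried against a scratch directory), and on a real system it needs root and will put sustained read load on all member disks.

```python
from pathlib import Path


def request_md_check(array="md0", sysfs_root="/sys"):
    """Ask the md driver to start a read-only consistency check of the array."""
    action = Path(sysfs_root) / "block" / array / "md" / "sync_action"
    action.write_text("check\n")
    return action


def read_md_state(array="md0", sysfs_root="/sys"):
    """Return the current sync action and the mismatch counter as strings."""
    md = Path(sysfs_root) / "block" / array / "md"
    return {
        name: (md / name).read_text().strip()
        for name in ("sync_action", "mismatch_cnt")
    }
```

Calling `request_md_check("md0")` as root and then watching `dmesg -w` in another terminal should show whether the errors appear as soon as the sequential reads begin; writing "idle" to the same file stops the check.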
dmesg log without the noncq flag: https://pastebin.com/sz1sXNQ1
dmesg log with the noncq flag enabled in /etc/default/grub: https://pastebin.com/Aib0B8wz
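To make the noncq test reproducible, here is a sketch of how the flag could be added to the kernel command line. It assumes the flag in question is the `libata.force=noncq` kernel parameter and that the file has a double-quoted `GRUB_CMDLINE_LINUX_DEFAULT` line; `add_kernel_param` is a hypothetical helper that only transforms text and does not touch the real file.

```python
import re

# The default kernel command line entry in /etc/default/grub.
CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param="libata.force=noncq"):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT (idempotent)."""

    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = CMDLINE.subn(repl, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text


example = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(example))
```

After writing the result back to /etc/default/grub, `sudo update-grub` and a reboot are needed for it to take effect. With NCQ disabled, /sys/block/sdX/device/queue_depth should read 1 for each affected disk, which is a quick way to confirm the flag was actually applied.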
I also noticed that although my drives are WD Reds, they are actually SMR drives. I'm starting to suspect this might be the problem, but I'm not sure.