I wasn't sure whether this is a better fit here or on Server Fault. I'll try here for now.
I have a machine that I'm using as a personal server, with the following specs:
- 2x Xeon E5-2690 v3
- ASUS Z10PA-D8 series motherboard
- Nvidia Quadro P4000
- 1x Samsung SSD 870 EVO 500 GB (system drive)
- 3x WDC WD40EFAX-68JH4N1 (in an mdadm RAID5 configuration)
- Cooler Master Gold 1000W power supply
- Running Ubuntu 24.04 LTS
It seems to be somewhat intermittent, but my dmesg log is frequently flooded with the same type of READ FPDMA QUEUED error across multiple drives.
[ +0.003702] ata10.00: status: { DRDY }
[ +0.001747] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001794] ata10.00: cmd 60/40:38:80:89:33/05:00:a9:01:00/40 tag 7 ncq dma 688128 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
[ +0.003511] ata10.00: status: { DRDY }
[ +0.001673] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001828] ata10.00: cmd 60/40:40:c0:8e:33/05:00:a9:01:00/40 tag 8 ncq dma 688128 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
[ +0.003079] ata10.00: status: { DRDY }
[ +0.000513] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001324] ata10.00: cmd 60/40:48:40:99:33/05:00:a9:01:00/40 tag 9 ncq dma 688128 in
res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
[ +0.003468] ata10.00: status: { DRDY }
[ +0.001671] ata10.00: failed command: READ FPDMA QUEUED
[ +0.001707] ata10.00: cmd 60/40:50:80:9e:33/05:00:a9:01:00/40 tag 10 ncq dma 688128 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error)
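To see how the errors are spread across ports, a small script can tally the "failed command" lines per ATA device. This is only a minimal sketch: the function name `count_failed_commands` is my own, and it assumes the dmesg lines look like the excerpt above.

```python
import re
from collections import Counter

# Matches lines such as:
#   [ +0.001747] ata10.00: failed command: READ FPDMA QUEUED
FAILED_CMD = re.compile(r"(ata\d+\.\d+): failed command: (.+?)\s*$")


def count_failed_commands(dmesg_text):
    """Count failed ATA commands, keyed by (device, command)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = FAILED_CMD.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Feeding it the saved output of dmesg (for example the text from the pastebin) and printing the resulting counter should show whether one port dominates or whether the errors are spread across all drives.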
What I've tried:
- Bought new SATA cables for each of the drives, replacing the old ones
- Reseated the power cables on each drive
- Shuffled around which ports are being used on the motherboard
- Checked the BIOS version (it is up to date with the latest release)
- Checked SMART health (see the skdump output below)
The system still seems usable, but I can't imagine these errors are a good thing. I'm not sure what to try next to diagnose it. The drives' SMART reports seem to indicate that they are all in good condition, and the SATA cables are all new.
dmesg logs, truncated (copied from my terminal scrollback before rebooting; I'll update with a full version the next time the error shows up): https://pastebin.com/j0Jnhkgt
skdump output for each disk:
Device: sat16:/dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
Size: 476940 MiB
Model: [Samsung SSD 870 EVO 500GB]
Serial: [S6PXNU0X400255M]
Firmware: [SVT02B6Q]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 0 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: no
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 85 min
Conveyance Self-Test Polling Time: 0 min
Bad Sectors: 0 sectors
Powered On: 6.7 months
Power Cycles: 30
Average Powered On Per Power Cycle: 6.7 days
Temperature: 25.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
5 reallocated-sector-count 100 100 10 0 sectors 0x000000000000 prefail online yes yes
9 power-on-hours 99 99 0 6.7 months 0xdb1200000000 old-age online n/a n/a
12 power-cycle-count 99 99 0 30 0x1e0000000000 old-age online n/a n/a
177 wear-leveling-count 99 99 0 3 0x030000000000 prefail online n/a n/a
179 used-reserved-blocks-total 100 100 10 0 0x000000000000 prefail online yes yes
181 program-fail-count-total 100 100 10 0 0x000000000000 old-age online yes yes
182 erase-fail-count-total 100 100 10 0 0x000000000000 old-age online yes yes
183 runtime-bad-block-total 100 100 10 0 0x000000000000 prefail online yes yes
187 reported-uncorrect 100 100 0 0 sectors 0x000000000000 old-age online n/a n/a
190 airflow-temperature-celsius 75 63 0 25.0 C 0x190000000000 old-age online n/a n/a
195 hardware-ecc-recovered 200 200 0 0 0x000000000000 old-age online n/a n/a
199 udma-crc-error-count 99 99 0 475 0xdb0100000000 old-age online n/a n/a
235 good-block-rate 99 99 0 n/a 0x130000000000 old-age online n/a n/a
241 total-lbas-written 99 99 0 34642.150 TB 0x106d893d0000 old-age online n/a n/a
252 attribute-252 100 100 0 n/a 0x020000000000 old-age online n/a n/a
Device: sat16:/dev/sdb
Type: 16 Byte SCSI ATA SAT Passthru
Size: 3815447 MiB
Model: [WDC WD40EFAX-68JH4N1]
Serial: [WD-WX22D11RCA6L]
Firmware: [83.00A83]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 404 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 120 min
Conveyance Self-Test Polling Time: 2 min
Bad Sectors: 0 sectors
Powered On: 3.6 years
Power Cycles: 43
Average Powered On Per Power Cycle: 1.0 months
Temperature: 27.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 0 0x000000000000 prefail online yes yes
3 spin-up-time 201 200 21 2.9 s 0x7d0b00000000 prefail online yes yes
4 start-stop-count 100 100 0 45 0x2d0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 0 sectors 0x000000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 57 57 0 3.6 years 0x027b00000000 old-age online n/a n/a
10 spin-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 43 0x2b0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 28 0x1c0000000000 old-age online n/a n/a
193 load-cycle-count 194 194 0 20445 0xdd4f00000000 old-age online n/a n/a
194 temperature-celsius-2 120 109 0 27.0 C 0x1b0000000000 old-age online n/a n/a
196 reallocated-event-count 200 200 0 0 0x000000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 100 253 0 0 sectors 0x000000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 200 0 1825 0x210700000000 old-age online n/a n/a
200 multi-zone-error-rate 100 253 0 0 0x000000000000 old-age offline n/a n/a
Device: sat16:/dev/sdc
Type: 16 Byte SCSI ATA SAT Passthru
Size: 3815447 MiB
Model: [WDC WD40EFAX-68JH4N1]
Serial: [WD-WX22D11RCFDP]
Firmware: [83.00A83]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 54720 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 121 min
Conveyance Self-Test Polling Time: 2 min
Bad Sectors: 0 sectors
Powered On: 3.6 years
Power Cycles: 43
Average Powered On Per Power Cycle: 1.0 months
Temperature: 28.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 0 0x000000000000 prefail online yes yes
3 spin-up-time 201 200 21 2.9 s 0x640b00000000 prefail online yes yes
4 start-stop-count 100 100 0 45 0x2d0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 0 sectors 0x000000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 57 57 0 3.6 years 0x2d7b00000000 old-age online n/a n/a
10 spin-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 43 0x2b0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 28 0x1c0000000000 old-age online n/a n/a
193 load-cycle-count 194 194 0 20720 0xf05000000000 old-age online n/a n/a
194 temperature-celsius-2 119 107 0 28.0 C 0x1c0000000000 old-age online n/a n/a
196 reallocated-event-count 200 200 0 0 0x000000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 100 253 0 0 sectors 0x000000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 199 0 1806 0x0e0700000000 old-age online n/a n/a
200 multi-zone-error-rate 100 253 0 0 0x000000000000 old-age offline n/a n/a
Device: sat16:/dev/sdd
Type: 16 Byte SCSI ATA SAT Passthru
Size: 3815447 MiB
Model: [WDC WD40EFAX-68JH4N1]
Serial: [WD-WX32D11ED8N5]
Firmware: [83.00A83]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 46184 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 502 min
Conveyance Self-Test Polling Time: 3 min
Bad Sectors: 0 sectors
Powered On: 3.6 years
Power Cycles: 43
Average Powered On Per Power Cycle: 1.0 months
Temperature: 28.0 C
Attribute Parsing Verification: Good
Overall Status: GOOD
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 0 0x000000000000 prefail online yes yes
3 spin-up-time 202 199 21 2.9 s 0x430b00000000 prefail online yes yes
4 start-stop-count 100 100 0 45 0x2d0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 0 sectors 0x000000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 57 57 0 3.6 years 0x447b00000000 old-age online n/a n/a
10 spin-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 253 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 43 0x2b0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 28 0x1c0000000000 old-age online n/a n/a
193 load-cycle-count 200 200 0 2402 0x620900000000 old-age online n/a n/a
194 temperature-celsius-2 119 106 0 28.0 C 0x1c0000000000 old-age online n/a n/a
196 reallocated-event-count 200 200 0 0 0x000000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 100 253 0 0 sectors 0x000000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 200 0 1783 0xf70600000000 old-age online n/a n/a
200 multi-zone-error-rate 100 253 0 0 0x000000000000 old-age offline n/a n/a
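One thing that stands out in the dumps above is that attribute 199 (udma-crc-error-count) is nonzero on all four drives even though the overall status is GOOD. To track whether it keeps climbing, the value can be pulled out of saved skdump output with a short script. This is a rough sketch, not a definitive parser: the helper names are hypothetical, and it assumes the raw column holds six bytes with the low byte first, which matches the dumps here (0xdb0100000000 is shown as 475).

```python
import re

DEVICE = re.compile(r"^Device:\s*(\S+)")
# Attribute 199 line; the raw value is the 12-hex-digit field after "0x".
CRC_ATTR = re.compile(r"^\s*199\s+udma-crc-error-count\b.*?\b0x([0-9a-fA-F]{12})\b")


def decode_raw(hex_digits):
    """Decode a skdump raw field, assuming little-endian byte order."""
    return int.from_bytes(bytes.fromhex(hex_digits), "little")


def crc_error_counts(skdump_text):
    """Map each 'Device:' entry to its UDMA CRC error count."""
    counts = {}
    device = None
    for line in skdump_text.splitlines():
        match = DEVICE.match(line)
        if match:
            device = match.group(1)
            continue
        match = CRC_ATTR.match(line)
        if match and device is not None:
            counts[device] = decode_raw(match.group(1))
    return counts
```

Running this against dumps taken before and after a period of errors would show whether the CRC counter increases together with the dmesg errors.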
mdadm status:
/dev/md0:
Version : 1.2
Creation Time : Sun Jul 11 18:12:41 2021
Raid Level : raid5
Array Size : 7813772288 (7.28 TiB 8.00 TB)
Used Dev Size : 3906886144 (3.64 TiB 4.00 TB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Mar 5 12:58:11 2025
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : crab:0 (local to host crab)
UUID : b9f769fb:49026686:78737cc2:90e8e63c
Events : 23954
Number Major Minor RaidDevice State
0 8 32 0 active sync /dev/sdc
1 8 16 1 active sync /dev/sdb
3 8 48 2 active sync /dev/sdd
Update: I can reproduce the problem consistently if I run an mdadm check.
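For reference, one way to start such a check is through the md sysfs interface. The sketch below is an assumption about the method rather than exactly what I ran: it writes "check" to the array's sync_action file, must run as root, and the function name and its `sysfs_root` parameter are hypothetical (the parameter only exists so the function can be tried against a scratch directory).

```python
from pathlib import Path


def request_md_check(array="md0", sysfs_root="/sys/block"):
    """Ask the md driver to start a consistency check on the array."""
    sync_action = Path(sysfs_root) / array / "md" / "sync_action"
    current = sync_action.read_text().strip()
    if current != "idle":
        # Another sync operation (check, resync, recover, ...) is running.
        raise RuntimeError(f"{array} is busy: sync_action is '{current}'")
    sync_action.write_text("check\n")
    return sync_action
```

Progress can then be followed in /proc/mdstat while watching dmesg for new errors.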
dmesg logs without the noncq flag: https://pastebin.com/sz1sXNQ1
dmesg logs with the noncq flag enabled in /etc/default/grub: https://pastebin.com/Aib0B8wz
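To confirm that the flag actually reached the kernel after updating GRUB and rebooting, the boot command line can be inspected. This is a minimal sketch assuming the flag was passed as the kernel parameter libata.force=noncq (optionally prefixed with a port ID such as 10.00:noncq); the function name is my own.

```python
def ncq_disabled_entries(cmdline):
    """Return the libata.force items on a kernel command line that disable NCQ."""
    entries = []
    for token in cmdline.split():
        if not token.startswith("libata.force="):
            continue
        # The value is a comma-separated list of "[ID:]VAL" items.
        for item in token.split("=", 1)[1].split(","):
            if item.split(":")[-1] == "noncq":
                entries.append(item)
    return entries
```

On the running system this would be called with the contents of /proc/cmdline; an empty list means NCQ was not disabled through this parameter.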
I also noticed that my drives, despite being WD Reds, are actually SMR drives. I'm starting to suspect that this might be the problem, though I'm not sure.