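I am running Debian 12 with an MD RAID1 array (2 drives) that stores my personal data (there are no system files on the array).
Today I received an e-mail from mdadm about a DegradedArray event, at a time when my drives are normally not in use: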
This is an automatically generated mail message from mdadm
running on hostname
A DegradedArray event had been detected on md device /dev/md0.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc[0]
976630464 blocks super 1.2 [2/1] [U_]
bitmap: 4/8 pages [16KB], 65536KB chunk
unused devices: <none>
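In the status line above, `[2/1] [U_]` is what marks the array as degraded: two members are expected, one is active, and `_` stands for the missing one. As a minimal sketch of how that could be checked automatically, the following parses mdstat-style text. The function name `degraded_arrays` and the sample variable are made up for illustration; on a real system you would pass in the contents of /proc/mdstat.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header, e.g. "md0 : active raid1 sdc[0]"
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line, e.g. "... [2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded[current] = status.group(3)
    return degraded


MDSTAT_SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc[0]
      976630464 blocks super 1.2 [2/1] [U_]
      bitmap: 4/8 pages [16KB], 65536KB chunk
unused devices: <none>
"""

print(degraded_arrays(MDSTAT_SAMPLE))
```

For the mail quoted above this reports md0 with the member map `U_`; a healthy two-disk mirror (`[2/2] [UU]`) yields an empty result.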
/var/log/syslog does not contain anything relevant, but dmesg shows:
[652897.364496] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
[652897.364512] ata2: SError: { CommWake }
[652897.364520] ata2.00: failed command: FLUSH CACHE EXT
[652897.364525] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 15
res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[652897.364541] ata2.00: status: { DRDY }
[652897.364549] ata2: hard resetting link
[652902.720479] ata2: found unknown device (class 0)
[652907.364975] ata2: softreset failed (1st FIS failed)
[652907.364988] ata2: hard resetting link
[652912.716814] ata2: found unknown device (class 0)
[652917.365220] ata2: softreset failed (1st FIS failed)
[652917.365233] ata2: hard resetting link
[652922.724814] ata2: found unknown device (class 0)
[652952.365391] ata2: softreset failed (1st FIS failed)
[652952.365406] ata2: limiting SATA link speed to 3.0 Gbps
[652952.365409] ata2: hard resetting link
[652957.420814] ata2: found unknown device (class 0)
[652957.580941] ata2: found unknown device (class 0)
[652957.580966] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[652962.596788] ata2.00: qc timeout after 5000 msecs (cmd 0xec)
[652962.596807] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[652962.596811] ata2.00: revalidation failed (errno=-5)
[652962.596824] ata2: hard resetting link
[652967.956818] ata2: found unknown device (class 0)
[652972.597225] ata2: softreset failed (1st FIS failed)
[652972.597239] ata2: hard resetting link
[652977.188682] INFO: task md0_raid1:242 blocked for more than 120 seconds.
[652977.188696] Not tainted 6.1.0-12-amd64 #1 Debian 6.1.52-1
[652977.188703] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[652977.188708] task:md0_raid1 state:D stack:0 pid:242 ppid:2 flags:0x00004000
[652977.188716] Call Trace:
[652977.188719] <TASK>
[652977.188724] __schedule+0x351/0xa20
[652977.188736] schedule+0x5d/0xe0
[652977.188745] md_super_wait+0x9e/0xd0 [md_mod]
[652977.188770] ? cpuusage_read+0x10/0x10
[652977.188777] write_page+0x2b7/0x3c0 [md_mod]
[652977.188801] ? md_super_wait+0x23/0xd0 [md_mod]
[652977.188824] md_update_sb.part.0+0x300/0x7e0 [md_mod]
[652977.188847] ? unregister_md_personality+0x70/0x70 [md_mod]
[652977.188868] md_check_recovery+0x15a/0x5b0 [md_mod]
[652977.188892] raid1d+0x8a/0x1990 [raid1]
[652977.188903] ? update_load_avg+0x7e/0x780
[652977.188910] ? psi_group_change+0x145/0x360
[652977.188915] ? sched_clock_local+0xe/0x80
[652977.188920] ? _raw_spin_unlock+0x15/0x30
[652977.188925] ? finish_task_switch.isra.0+0x9b/0x300
[652977.188929] ? __switch_to+0x106/0x410
[652977.188936] ? __schedule+0x359/0xa20
[652977.188943] ? unregister_md_personality+0x70/0x70 [md_mod]
[652977.188963] ? preempt_count_add+0x6a/0xa0
[652977.188966] ? _raw_spin_lock_irqsave+0x23/0x50
[652977.188970] ? preempt_count_add+0x6a/0xa0
[652977.188975] ? unregister_md_personality+0x70/0x70 [md_mod]
[652977.188993] md_thread+0xaa/0x180 [md_mod]
[652977.189012] ? cpuusage_read+0x10/0x10
[652977.189017] kthread+0xe9/0x110
[652977.189023] ? kthread_complete_and_exit+0x20/0x20
[652977.189028] ret_from_fork+0x22/0x30
[652977.189036] </TASK>
[652977.189041] INFO: task jbd2/md0-8:1065 blocked for more than 120 seconds.
[652977.189046] Not tainted 6.1.0-12-amd64 #1 Debian 6.1.52-1
[652977.189051] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[652977.189056] task:jbd2/md0-8 state:D stack:0 pid:1065 ppid:2 flags:0x00004000
[652977.189061] Call Trace:
[652977.189062] <TASK>
[652977.189064] __schedule+0x351/0xa20
[652977.189071] schedule+0x5d/0xe0
[652977.189076] md_write_start+0x198/0x2a0 [md_mod]
[652977.189095] ? cpuusage_read+0x10/0x10
[652977.189100] raid1_make_request+0xac/0xbaf [raid1]
[652977.189109] ? iomap_iter+0x78/0x310
[652977.189116] ? mempool_alloc+0x85/0x1b0
[652977.189122] ? kmem_cache_alloc+0x148/0x2e0
[652977.189129] md_handle_request+0x131/0x1e0 [md_mod]
[652977.189149] __submit_bio+0x89/0x130
[652977.189154] submit_bio_noacct_nocheck+0x163/0x370
[652977.189159] ? submit_bio_noacct+0x79/0x4a0
[652977.189163] jbd2_journal_commit_transaction+0xdb3/0x1a70 [jbd2]
[652977.189187] ? _raw_spin_unlock+0x15/0x30
[652977.189191] ? finish_task_switch.isra.0+0x9b/0x300
[652977.189194] ? __switch_to+0x106/0x410
[652977.189202] kjournald2+0xa9/0x280 [jbd2]
[652977.189222] ? cpuusage_read+0x10/0x10
[652977.189227] ? jbd2_fc_wait_bufs+0xa0/0xa0 [jbd2]
[652977.189246] kthread+0xe9/0x110
[652977.189250] ? kthread_complete_and_exit+0x20/0x20
[652977.189256] ret_from_fork+0x22/0x30
[652977.189263] </TASK>
[652977.948973] ata2: found unknown device (class 0)
[652982.597296] ata2: softreset failed (1st FIS failed)
[652982.597309] ata2: hard resetting link
[652987.948814] ata2: found unknown device (class 0)
[653017.597291] ata2: softreset failed (1st FIS failed)
[653017.597306] ata2: limiting SATA link speed to 1.5 Gbps
[653017.597309] ata2: hard resetting link
[653022.648471] ata2: found unknown device (class 0)
[653022.808780] ata2: found unknown device (class 0)
[653022.808797] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[653032.996794] ata2.00: qc timeout after 10000 msecs (cmd 0xec)
[653032.996813] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[653032.996817] ata2.00: revalidation failed (errno=-5)
[653032.996832] ata2: hard resetting link
[653038.348823] ata2: found unknown device (class 0)
[653042.996967] ata2: softreset failed (1st FIS failed)
[653042.996982] ata2: hard resetting link
[653048.348821] ata2: found unknown device (class 0)
[653052.996756] ata2: softreset failed (1st FIS failed)
[653052.996770] ata2: hard resetting link
[653058.348811] ata2: found unknown device (class 0)
[653087.997129] ata2: softreset failed (1st FIS failed)
[653087.997145] ata2: hard resetting link
[653093.057056] ata2: found unknown device (class 0)
[653093.216467] ata2: found unknown device (class 0)
[653093.216484] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[653098.020894] INFO: task kcompactd0:71 blocked for more than 120 seconds.
[653098.020910] Not tainted 6.1.0-12-amd64 #1 Debian 6.1.52-1
[653098.020918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[653098.020923] task:kcompactd0 state:D stack:0 pid:71 ppid:2 flags:0x00004000
[653098.020932] Call Trace:
[653098.020935] <TASK>
[653098.020940] __schedule+0x351/0xa20
[653098.020952] ? bit_wait+0x60/0x60
[653098.020959] schedule+0x5d/0xe0
[653098.020965] io_schedule+0x42/0x70
[653098.020971] bit_wait_io+0xd/0x60
[653098.020977] __wait_on_bit_lock+0x5f/0xa0
[653098.020984] out_of_line_wait_on_bit_lock+0x91/0xb0
[653098.020991] ? sugov_init+0x350/0x350
[653098.020997] __buffer_migrate_folio+0xb8/0x270
[653098.021006] move_to_new_folio+0x56/0x150
[653098.021013] migrate_pages+0xc51/0x1480
[653098.021019] ? isolate_freepages_block+0x410/0x410
[653098.021027] ? release_freepages+0xc0/0xc0
[653098.021034] ? do_pages_stat+0x360/0x360
[653098.021041] compact_zone+0x97e/0xdb0
[653098.021048] ? sched_clock_local+0xe/0x80
[653098.021053] ? finish_task_switch.isra.0+0x9b/0x300
[653098.021059] proactive_compact_node+0x87/0xc0
[653098.021069] kcompactd+0x34c/0x420
[653098.021075] ? cpuusage_read+0x10/0x10
[653098.021080] ? kcompactd_do_work+0x2a0/0x2a0
[653098.021086] kthread+0xe9/0x110
[653098.021092] ? kthread_complete_and_exit+0x20/0x20
[653098.021098] ret_from_fork+0x22/0x30
[653098.021107] </TASK>
[653098.021117] INFO: task md0_raid1:242 blocked for more than 241 seconds.
[653098.021124] Not tainted 6.1.0-12-amd64 #1 Debian 6.1.52-1
[653098.021129] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[653098.021134] task:md0_raid1 state:D stack:0 pid:242 ppid:2 flags:0x00004000
[653098.021139] Call Trace:
[653098.021141] <TASK>
[653098.021143] __schedule+0x351/0xa20
[653098.021150] schedule+0x5d/0xe0
[653098.021157] md_super_wait+0x9e/0xd0 [md_mod]
[653098.021179] ? cpuusage_read+0x10/0x10
[653098.021184] write_page+0x2b7/0x3c0 [md_mod]
[653098.021205] ? md_super_wait+0x23/0xd0 [md_mod]
[653098.021225] md_update_sb.part.0+0x300/0x7e0 [md_mod]
[653098.021246] ? unregister_md_personality+0x70/0x70 [md_mod]
[653098.021265] md_check_recovery+0x15a/0x5b0 [md_mod]
[653098.021286] raid1d+0x8a/0x1990 [raid1]
[653098.021296] ? update_load_avg+0x7e/0x780
[653098.021301] ? psi_group_change+0x145/0x360
[653098.021305] ? sched_clock_local+0xe/0x80
[653098.021310] ? _raw_spin_unlock+0x15/0x30
[653098.021314] ? finish_task_switch.isra.0+0x9b/0x300
[653098.021318] ? __switch_to+0x106/0x410
[653098.021324] ? __schedule+0x359/0xa20
[653098.021330] ? unregister_md_personality+0x70/0x70 [md_mod]
[653098.021349] ? preempt_count_add+0x6a/0xa0
[653098.021352] ? _raw_spin_lock_irqsave+0x23/0x50
[653098.021356] ? preempt_count_add+0x6a/0xa0
[653098.021360] ? unregister_md_personality+0x70/0x70 [md_mod]
[653098.021379] md_thread+0xaa/0x180 [md_mod]
[653098.021398] ? cpuusage_read+0x10/0x10
[653098.021403] kthread+0xe9/0x110
[653098.021407] ? kthread_complete_and_exit+0x20/0x20
[653098.021412] ret_from_fork+0x22/0x30
[653098.021420] </TASK>
[653098.021424] INFO: task jbd2/md0-8:1065 blocked for more than 241 seconds.
[653098.021430] Not tainted 6.1.0-12-amd64 #1 Debian 6.1.52-1
[653098.021435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[653098.021440] task:jbd2/md0-8 state:D stack:0 pid:1065 ppid:2 flags:0x00004000
[653098.021445] Call Trace:
[653098.021446] <TASK>
[653098.021449] __schedule+0x351/0xa20
[653098.021455] schedule+0x5d/0xe0
[653098.021460] md_write_start+0x198/0x2a0 [md_mod]
[653098.021479] ? cpuusage_read+0x10/0x10
[653098.021484] raid1_make_request+0xac/0xbaf [raid1]
[653098.021493] ? iomap_iter+0x78/0x310
[653098.021500] ? mempool_alloc+0x85/0x1b0
[653098.021505] ? kmem_cache_alloc+0x148/0x2e0
[653098.021511] md_handle_request+0x131/0x1e0 [md_mod]
[653098.021531] __submit_bio+0x89/0x130
[653098.021536] submit_bio_noacct_nocheck+0x163/0x370
[653098.021541] ? submit_bio_noacct+0x79/0x4a0
[653098.021545] jbd2_journal_commit_transaction+0xdb3/0x1a70 [jbd2]
[653098.021569] ? _raw_spin_unlock+0x15/0x30
[653098.021573] ? finish_task_switch.isra.0+0x9b/0x300
[653098.021577] ? __switch_to+0x106/0x410
[653098.021584] kjournald2+0xa9/0x280 [jbd2]
[653098.021604] ? cpuusage_read+0x10/0x10
[653098.021609] ? jbd2_fc_wait_bufs+0xa0/0xa0 [jbd2]
[653098.021628] kthread+0xe9/0x110
[653098.021633] ? kthread_complete_and_exit+0x20/0x20
[653098.021638] ret_from_fork+0x22/0x30
[653098.021645] </TASK>
[653124.644468] ata2.00: qc timeout after 30000 msecs (cmd 0xec)
[653124.644485] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[653124.644493] ata2.00: revalidation failed (errno=-5)
[653124.644498] ata2.00: disable device
[653130.000802] ata2: found unknown device (class 0)
[653134.645444] ata2: softreset failed (1st FIS failed)
[653140.000802] ata2: found unknown device (class 0)
[653144.644791] ata2: softreset failed (1st FIS failed)
[653149.996821] ata2: found unknown device (class 0)
[653179.644808] ata2: softreset failed (1st FIS failed)
[653184.696807] ata2: found unknown device (class 0)
[653184.856889] ata2: found unknown device (class 0)
[653184.856909] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[653184.856943] ata2: EH complete
[653184.856995] sd 1:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=348s
[653184.857003] sd 1:0:0:0: [sdb] tag#11 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[653184.857013] I/O error, dev sdb, sector 16 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[653184.857028] md: super_written gets error=-5
[653184.857036] md/raid1:md0: Disk failure on sdb, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[653185.608789] ata2: found unknown device (class 0)
[653194.896744] ata2: softreset failed (1st FIS failed)
[653200.648480] ata2: found unknown device (class 0)
[653204.896652] ata2: softreset failed (1st FIS failed)
[653210.648810] ata2: found unknown device (class 0)
[653239.896810] ata2: softreset failed (1st FIS failed)
[653239.896826] ata2: limiting SATA link speed to 3.0 Gbps
[653244.928818] ata2: found unknown device (class 0)
[653245.088464] ata2: found unknown device (class 0)
[653245.088484] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[653245.088491] ata2: link online but 1 devices misclassified, device detection might fail
[653245.088528] ata2.00: detaching (SCSI 1:0:0:0)
[653245.152505] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[653245.832890] ata2: found unknown device (class 0)
[653255.116619] ata2: softreset failed (1st FIS failed)
[653260.868780] ata2: found unknown device (class 0)
[653265.116809] ata2: softreset failed (1st FIS failed)
[653270.868804] ata2: found unknown device (class 0)
[653300.116654] ata2: softreset failed (1st FIS failed)
[653300.116670] ata2: limiting SATA link speed to 3.0 Gbps
[653305.148810] ata2: found unknown device (class 0)
[653305.308456] ata2: found unknown device (class 0)
[653305.308473] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[653305.308480] ata2: link online but 1 devices misclassified, device detection might fail
[653305.312951] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[653305.312958] sd 1:0:0:0: [sdb] Stopping disk
[653305.312979] sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[867105.099901] md: data-check of RAID array md0
[867105.101757] md: md0: data-check done.
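I noticed that the drive had suddenly disappeared, i.e. neither lsblk nor blkid listed it any more, as if it were not physically installed.
I made a backup, and after a reboot the drive was listed again and could be re-added with sudo mdadm --re-add /dev/md0 /dev/sdd. After that, everything looked normal again. For example, dmesg shows: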
[ 3519.982027] md: recovery of RAID array md0
[ 3575.888503] md: md0: recovery done.
sudo mdadm --detail /dev/md0:
/dev/md0:
Version : 1.2
Creation Time : Sat Aug 6 17:58:09 2022
Raid Level : raid1
Array Size : 976630464 (931.39 GiB 1000.07 GB)
Used Dev Size : 976630464 (931.39 GiB 1000.07 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Oct 1 12:02:14 2023
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : nas:0 (local to host nas)
UUID : 6d6bb2a5:d42475de:ce618a52:28bd98bb
Events : 3719
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 48 1 active sync /dev/sdd
sudo smartctl -a /dev/sdd:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-12-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Skyhawk
Device Model: ST1000VX005-2EZ102
Serial Number: Z9CB4EPK
LU WWN Device Id: 5 000c50 0c45a18ab
Firmware Version: CV11
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5900 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5319
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Oct 1 11:54:51 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 130) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bb) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 063 006 Pre-fail Always - 153886026
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 494
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 069 060 045 Pre-fail Always - 9994348
9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 9604h+00m+00.000s
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 37
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 065 062 040 Old_age Always - 35 (Min/Max 28/35)
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 880
194 Temperature_Celsius 0x0022 035 020 000 Old_age Always - 35 (0 20 0 0 0)
195 Hardware_ECC_Recovered 0x001a 003 001 000 Old_age Always - 153886026
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
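To make the attribute table above easier to judge, here is a rough sketch that picks out a handful of counters and reports any with a non-zero raw value. The choice of IDs in `WATCHED_IDS` (5, 187, 188, 197, 198, 199) is an assumption about which counters are worth watching, not a vendor rule, and the function name is made up for illustration. The large raw values of Raw_Read_Error_Rate, Seek_Error_Rate and Hardware_ECC_Recovered are deliberately left out, since on Seagate drives they are commonly reported to be composite numbers rather than plain error counts.

```python
# Assumption: counters that often accompany media or cabling problems.
WATCHED_IDS = {5, 187, 188, 197, 198, 199}


def nonzero_watched(smart_table):
    """Return {attribute_name: raw_value} for watched attributes whose raw value is not 0."""
    found = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found


SMART_SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 063 006 Pre-fail Always - 153886026
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

print(nonzero_watched(SMART_SAMPLE))
```

For the rows copied from the output above, none of the watched counters is non-zero, which matches the PASSED health result and the empty SMART error log.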
I find this behavior strange. My question now is: should I replace the drive, or wait until SMART actually reports errors?
I think the problem is powertop --autotune, which is run at server boot. This can cause the drive to disappear, although unfortunately I only have anecdotal evidence for that and this German forum post as a hint. By the way, this was the first time the drive had been in standby for 2 days straight.
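One way to look into this hypothesis, as a minimal sketch: as far as I know, one of the tunables powertop can change is SATA link power management, which Linux exposes per host in sysfs as `link_power_management_policy`. The snippet below only reads the current policy of each host so it can be compared with and without powertop running at boot. The function name and the `sysfs_root` parameter are made up for illustration, and whether this setting is actually responsible for the drive dropping out is an assumption, not something the logs above prove.

```python
from pathlib import Path


def link_power_policies(sysfs_root="/sys/class/scsi_host"):
    """Return {host_name: policy} for every host exposing link_power_management_policy."""
    policies = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return policies
    for host in sorted(root.iterdir()):
        policy_file = host / "link_power_management_policy"
        if not policy_file.is_file():
            continue  # not an AHCI/SATA host, or the attribute is unavailable
        try:
            policies[host.name] = policy_file.read_text().strip()
        except OSError:
            continue  # unreadable attribute: skip instead of failing
    return policies


if __name__ == "__main__":
    for host, policy in link_power_policies().items():
        print(f"{host}: {policy}")
```

If the host the affected drive is attached to (ata2 in the dmesg output) shows a power-saving policy after boot, setting it back to `max_performance` for that host would be one experiment to try before replacing the drive.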