我想知道发生了什么,ZFS 是如何完全恢复的,或者我的数据是否仍然完好无损。
当我昨晚进来时,我感到沮丧,然后感到困惑。
zpool status
pool: san
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 392K in 0h0m with 0 errors on Tue Jan 21 16:36:41 2020
config:
NAME STATE READ WRITE CKSUM
san DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346 ONLINE 0 0 0
ata-ST2000DM001-9YN164_W1E07E0G DEGRADED 0 0 38 too many errors
ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332 DEGRADED 0 0 63 too many errors
ata-ST2000NM0011_Z1P07NVZ ONLINE 0 0 0
ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344 ONLINE 0 0 0
wwn-0x50014ee20949b6f9 DEGRADED 0 0 75 too many errors
errors: No known data errors
怎么可能没有数据错误,并且整个池都没有故障?
一个驱动器sdf
对 SMART 的 smartctl 测试失败read fail
,其他驱动器的问题稍小;不可纠正/未决扇区或 UDMA CRC 错误。
我尝试将每个发生故障的驱动器切换到离线状态,然后一次切换到一个在线状态,但这没有帮助。
$ zpool status
pool: san
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 392K in 0h0m with 0 errors on Tue Jan 21 16:36:41 2020
config:
NAME STATE READ WRITE CKSUM
san DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346 ONLINE 0 0 0
ata-ST2000DM001-9YN164_W1E07E0G DEGRADED 0 0 38 too many errors
ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332 OFFLINE 0 0 63
ata-ST2000NM0011_Z1P07NVZ ONLINE 0 0 0
ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344 ONLINE 0 0 0
wwn-0x50014ee20949b6f9 DEGRADED 0 0 75 too many errors
因此,如果我的数据实际上仍然全部存在,我感到非常幸运,或者有点困惑,在检查了最差的驱动器之后,我用我唯一的备用驱动器进行了更换。
$ zpool status
pool: san
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Jan 21 17:33:15 2020
467G scanned out of 8.91T at 174M/s, 14h10m to go
77.6G resilvered, 5.12% done
config:
NAME STATE READ WRITE CKSUM
san DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346 ONLINE 0 0 0
replacing-1 DEGRADED 0 0 0
ata-ST2000DM001-9YN164_W1E07E0G OFFLINE 0 0 38
ata-WDC_WD2000FYYZ-01UL1B1_WD-WCC1P1171516 ONLINE 0 0 0 (resilvering)
ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332 DEGRADED 0 0 63 too many errors
ata-ST2000NM0011_Z1P07NVZ ONLINE 0 0 0
ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344 ONLINE 0 0 0
wwn-0x50014ee20949b6f9 DEGRADED 0 0 75 too many errors
resilver 确实成功完成。
$ zpool status
pool: san
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 1.48T in 12h5m with 0 errors on Wed Jan 22 05:38:48 2020
config:
NAME STATE READ WRITE CKSUM
san DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346 ONLINE 0 0 0
ata-WDC_WD2000FYYZ-01UL1B1_WD-WCC1P1171516 ONLINE 0 0 0
ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332 DEGRADED 0 0 63 too many errors
ata-ST2000NM0011_Z1P07NVZ ONLINE 0 0 0
ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344 ONLINE 0 0 0
wwn-0x50014ee20949b6f9 DEGRADED 0 0 75 too many errors
我现在正处于十字路口。我通常dd
将故障驱动器的前 2MB 归零,然后用它自己替换,我可以这样做,但是如果确实有数据丢失,我可能需要最后两个卷来恢复。
我现在桌子上有这个sdf
,已删除。我觉得我可以,在最坏的情况下,用这个来帮助恢复。
同时,我想我现在要对降级驱动器的前几 MB 进行开发/归零,并自行更换,我认为事情应该会解决,冲洗并重复第二个故障驱动器,直到我能得到一些替换手上。
问题 发生了什么,池如何能够挂起,或者我可能丢失了一些数据(考虑到 zfs 及其报告的完整性,值得怀疑)
可能是由于幸运的失败顺序,例如失败的堆栈的顶部驱动器?
问题 这只是仅供参考,与主题无关。是什么导致所有 3 个同时失败?我认为这是一种磨砂膏,它是催化剂。我前一天晚上检查了所有驱动器都在线。
请注意,最近布线一直是个问题,办公室晚上很冷,但这些问题只是drive unavailable
,而不是校验和错误。我认为那不是布线,而是老化的驱动器,它们已经 5 年了。但是一天3次失败?来吧,这足以吓到我们很多人!