We decided to replace our aging primary NAS, built from three 48-drive SAS expander enclosures full of 4TB drives, with a similar system based on 12TB drives, reusing some of the newer hardware added about a year ago (one expander and a SAS card). The decision was made to keep things as simple and cheap as possible, while ultimately not taking up any additional rack space.
The new hardware arrived (the server and two expanders) and was set up with Debian Buster and the ZFS available from the buster-backports repository. The ZFS pool was created with a mirror of two U.2 SSDs for the log, two more U.2 SSDs for cache, 4 HDD spares (2 per expander), and 12 RAID-Z2 vdevs of 7 drives each (6 vdevs per expander). Everything looked good, and I started copying data from the old NAS to this one using a script that makes use of incremental snapshots, zfs send, and zfs receive.
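The script itself isn't reproduced here; as a rough illustration, one incremental replication step might look like the sketch below. The destination pool name comes from the zpool status output; the source host, dataset, and snapshot names are all hypothetical, not the actual script's.

```shell
#!/bin/sh
# Hedged sketch of one incremental zfs send/receive step; everything except
# the destination pool name (bigvol) is an assumption for illustration.
set -eu

SRC_HOST=oldnas          # hypothetical hostname of the old NAS
SRC_DS=tank/data         # hypothetical source dataset
DST_DS=bigvol/data       # destination dataset on the new pool
PREV=migrate-2           # snapshot the previous run finished with
CUR=migrate-3            # new snapshot for this run

# Build the commands as strings and print them instead of executing them,
# so the sketch can be inspected (a dry run) before being trusted.
SNAP_CMD="ssh $SRC_HOST zfs snapshot -r $SRC_DS@$CUR"
SEND_CMD="ssh $SRC_HOST zfs send -R -I $SRC_DS@$PREV $SRC_DS@$CUR | zfs receive -Fu $DST_DS"
echo "$SNAP_CMD"
echo "$SEND_CMD"
```

`zfs send -R -I` ships all snapshots between the previous and current one as a replication stream, which is what keeps the second and later runs fast compared to the initial full copy.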
The first run of the script took many days but eventually finished with no problems on either end. The second run also went fine. Then, during the third, the ZFS pool developed numerous problems. In 4 of the RAID-Z2 vdevs a large number of disks changed state to UNAVAIL or FAULTED, and all 4 spare disks were automatically pulled into service. The output of zpool status follows.
# zpool status
  pool: bigvol
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 28 09:55:20 2021
        160T scanned at 11.5G/s, 151T issued at 10.8G/s, 160T total
        4.99T resilvered, 94.53% done, 0 days 00:13:46 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        bigvol                                          DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            scsi-35000c500cacd481b                      ONLINE       0     0     0
            scsi-35000c500cacceddb                      ONLINE       0     0     0
            scsi-35000c500cacd5c4b                      ONLINE       0     0     0
            scsi-35000c500cacd19cb                      ONLINE       0     0     0
            scsi-35000c500cacd0f4f                      ONLINE       0     0     0
            scsi-35000c500cacd5efb                      ONLINE       0     0     0
            scsi-35000c500cacd133f                      ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            scsi-35000c500cab6617f                      ONLINE       0     0     0
            scsi-35000c500cacd131b                      ONLINE       0     0     0
            scsi-35000c500cacd1637                      ONLINE       0     0     0
            scsi-35000c500cacd0dd3                      ONLINE       0     0     0
            scsi-35000c500cab64247                      ONLINE       0     0     0
            scsi-35000c500cacd5f4b                      ONLINE       0     0     0
            scsi-35000c500cacd206b                      ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            scsi-35000c500cacd251f                      ONLINE       0     0     0
            scsi-35000c500cacf60a7                      ONLINE       0     0     0
            scsi-35000c500cacd55cb                      ONLINE       0     0     0
            scsi-35000c500cacd3a5f                      ONLINE       0     0     0
            scsi-35000c500cacd0fa7                      ONLINE       0     0     0
            scsi-35000c500cacd4cb3                      ONLINE       0     0     0
            scsi-35000c500cacd2edf                      ONLINE       0     0     0
          raidz2-3                                      DEGRADED     0     0     0
            scsi-35000c500cacd1627                      ONLINE       0     0     0
            scsi-35000c500cacd049f                      ONLINE       0     0     0
            scsi-35000c500cacdf9d3                      ONLINE       0     0     0
            scsi-35000c500cab51563                      DEGRADED     0     0     1  too many errors  (resilvering)
            scsi-35000c500cacd1c9b                      DEGRADED     0     0     0  too many errors
            scsi-35000c500cacdf757                      FAULTED      0    10    48  too many errors  (resilvering)
            scsi-35000c500cacd291b                      FAULTED      0    11    31  too many errors  (resilvering)
          raidz2-4                                      DEGRADED     0     0     0
            spare-0                                     DEGRADED     0     0    11
              scsi-35000c500cacdb54f                    FAULTED      0    18     0  too many errors  (resilvering)
              scsi-35000c500cacdc907                    DEGRADED     0     0     0  too many errors  (resilvering)
            scsi-35000c500cacd2c77                      DEGRADED     0     0     4  too many errors
            scsi-35000c500cacdbdd3                      DEGRADED     0     0    12  too many errors  (resilvering)
            scsi-35000c500cacd0a47                      DEGRADED     0     0     7  too many errors  (resilvering)
            scsi-35000c500cacdf107                      DEGRADED     0     0     4  too many errors  (resilvering)
            scsi-35000c500cacd59fb                      DEGRADED     0   195    79  too many errors  (resilvering)
            scsi-35000c500cacd5307                      DEGRADED     0   177    30  too many errors  (resilvering)
          raidz2-5                                      DEGRADED     0     0     0
            spare-0                                     DEGRADED     0     0    15
              scsi-35000c500cacd03a3                    FAULTED      0    12     0  too many errors  (resilvering)
              scsi-35000c500cacd340b                    ONLINE       0     0     0
            scsi-35000c500cacd29d7                      FAULTED      0    21    24  too many errors  (resilvering)
            scsi-35000c500cacd23d7                      DEGRADED     0     0    11  too many errors  (resilvering)
            scsi-35000c500cacd1c27                      DEGRADED     0     0    29  too many errors  (resilvering)
            spare-4                                     DEGRADED     0     0    32
              scsi-35000c500cacd26bb                    FAULTED      0    31     0  too many errors  (resilvering)
              scsi-35000c500cacd299f                    DEGRADED     0     0     0  too many errors  (resilvering)
            scsi-35000c500cacd258b                      DEGRADED     0   207    63  too many errors  (resilvering)
            spare-6                                     DEGRADED     0     0    24
              scsi-35000c500cacdf867                    FAULTED      0    15     0  too many errors  (resilvering)
              scsi-35000c500cacd60ef                    ONLINE       0     0     0
          raidz2-6                                      DEGRADED     0     0     0
            scsi-35000c500cacd2e37                      ONLINE       0     0     0
            scsi-35000c500cacd0ecf                      ONLINE       0     0     0
            11839096008852004814                        UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-35000c500cacd1f8f-part1
            scsi-35000c500cacd088b                      ONLINE       0     0     0
            scsi-35000c500cacd28df                      ONLINE       0     0     0
            scsi-35000c500cacd068b                      ONLINE       0     0     0
            scsi-35000c500cacdbd77                      ONLINE       0     0     0
          raidz2-7                                      ONLINE       0     0     0
            scsi-35000c500cacd040b                      ONLINE       0     0     0
            scsi-35000c500cacd16bb                      ONLINE       0     0     0
            scsi-35000c500cacd4d37                      ONLINE       0     0     0
            scsi-35000c500cacd1b57                      ONLINE       0     0     0
            scsi-35000c500cacd0453                      ONLINE       0     0     0
            scsi-35000c500cacd3f6b                      ONLINE       0     0     0
            scsi-35000c500cacd0297                      ONLINE       0     0     0
          raidz2-8                                      ONLINE       0     0     0
            scsi-35000c500cacd4bcb                      ONLINE       0     0     0
            scsi-35000c500cacd36cf                      ONLINE       0     0     0
            scsi-35000c500cacd1983                      ONLINE       0     0     0
            scsi-35000c500cacd3aaf                      ONLINE       0     0     0
            scsi-35000c500cacda90b                      ONLINE       0     0     0
            scsi-35000c500cacd0d53                      ONLINE       0     0     0
            scsi-35000c500cacdaa1f                      ONLINE       0     0     0
          raidz2-9                                      ONLINE       0     0     0
            scsi-35000c500cacd3f13                      ONLINE       0     0     0
            scsi-35000c500cacd3187                      ONLINE       0     0     0
            scsi-35000c500cacd59a3                      ONLINE       0     0     0
            scsi-35000c500cacd0913                      ONLINE       0     0     0
            scsi-35000c500cacdf663                      ONLINE       0     0     0
            scsi-35000c500cacd156b                      ONLINE       0     0     0
            scsi-35000c500cacd203f                      ONLINE       0     0     0
          raidz2-10                                     ONLINE       0     0     0
            scsi-35000c500cacd4c97                      ONLINE       0     0     0
            scsi-35000c500cacd58a3                      ONLINE       0     0     0
            scsi-35000c500cacd2353                      ONLINE       0     0     0
            scsi-35000c500cacd3f67                      ONLINE       0     0     0
            scsi-35000c500cacd235f                      ONLINE       0     0     0
            scsi-35000c500cacdf14f                      ONLINE       0     0     0
            scsi-35000c500cacd2583                      ONLINE       0     0     0
          raidz2-11                                     ONLINE       0     0     0
            scsi-35000c500cacd2f87                      ONLINE       0     0     0
            scsi-35000c500cacdb557                      ONLINE       0     0     0
            scsi-35000c500cacd00f3                      ONLINE       0     0     0
            scsi-35000c500cacd3ea7                      ONLINE       0     0     0
            scsi-35000c500cacd23ff                      ONLINE       0     0     0
            scsi-35000c500cacd09d3                      ONLINE       0     0     0
            scsi-35000c500cacd3adb                      ONLINE       0     0     0
        logs
          mirror-12                                     ONLINE       0     0     0
            nvme-eui.343842304db011100025384700000001   ONLINE       0     0     0
            nvme-eui.343842304db011060025384700000001   ONLINE       0     0     0
        cache
          nvme-eui.343842304db010920025384700000001     ONLINE       0     0     0
          nvme-eui.343842304db011080025384700000001     ONLINE       0     0     0
        spares
          scsi-35000c500cacdc907                        INUSE     currently in use
          scsi-35000c500cacd299f                        INUSE     currently in use
          scsi-35000c500cacd340b                        INUSE     currently in use
          scsi-35000c500cacd60ef                        INUSE     currently in use

errors: No known data errors
For obvious reasons I have stopped the transfer and am waiting for the resilver to finish before replacing the FAULTED and UNAVAIL drives. But I am wondering: should I replace the DEGRADED drives as well? And does anyone know why this could have happened in the first place (other than the possibility of a bad batch of drives)? Or do I just need to destroy the pool and replace the drives? Either way, I assume the data will have to be copied over again.
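For reference, the replacement step I had in mind after the resilver completes would be along these lines. The pool name and the GUID of the UNAVAIL member of raidz2-6 come from the zpool status above; the by-id name of the new disk is a made-up placeholder, and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch of replacing the UNAVAIL drive once the resilver is done.
# OLD_GUID is real (from the zpool status above); NEW_DISK is hypothetical.
set -eu

POOL=bigvol
OLD_GUID=11839096008852004814                      # UNAVAIL member of raidz2-6
NEW_DISK=/dev/disk/by-id/scsi-35000c500deadbeef    # placeholder replacement drive

# Print the command instead of running it, so it can be reviewed first.
REPLACE_CMD="zpool replace $POOL $OLD_GUID $NEW_DISK"
echo "$REPLACE_CMD"
```

The same `zpool replace` form applies to each FAULTED drive in turn, substituting its device name or GUID.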
The problem turned out to be related to one or two bad internal SAS cables inside one of the two 4U JBOD enclosures. The cables in question ran from the "primary" external SAS connectors to the backplane. Swapping them for the two cables from the unused "secondary" external connectors solved the problem.
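Once the cables were swapped and the pool was stable, the usual cleanup is to clear the error counters, return the spares to the spare pool, and scrub to confirm everything is healthy. A sketch of those steps (printed rather than executed; the spare shown is one of the four from the status output, and each in-use spare gets the same detach):

```shell
#!/bin/sh
# Post-repair cleanup sketch: clear errors, detach one in-use spare, scrub.
# Pool and spare names come from the zpool status above.
set -eu

POOL=bigvol
SPARE=scsi-35000c500cacd340b   # one of the four in-use spares

CLEAR_CMD="zpool clear $POOL"
DETACH_CMD="zpool detach $POOL $SPARE"
SCRUB_CMD="zpool scrub $POOL"
for c in "$CLEAR_CMD" "$DETACH_CMD" "$SCRUB_CMD"; do
    echo "$c"
done
```

Detaching an in-use spare (rather than the original device) puts it back in the AVAIL spare list once the original member is healthy again; the scrub then verifies every block end to end.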