我正在运行 CentOS 7 机器(标准内核:)3.10.0-327.36.3.el7.x86_64
,软件 RAID-10 超过 16x 1 TB SSD(更准确地说,磁盘上有两个 RAID 阵列;其中一个阵列提供主机的交换分区)。上周,SSD出现故障:
13:18:07 kvm7 kernel: sd 1:0:2:0: attempting task abort! scmd(ffff887e57b916c0)
13:18:07 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
13:18:07 kvm7 kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
13:18:07 kvm7 kernel: scsi target1:0:2: enclosure_logical_id(0x500304801c14a001), slot(2)
13:18:10 kvm7 kernel: sd 1:0:2:0: task abort: SUCCESS scmd(ffff887e57b916c0)
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Sense Key : Not Ready [current]
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Add. Sense: Logical unit not ready, cause not reportable
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
13:18:11 kvm7 kernel: md: super_written gets error=-5, uptodate=0
13:18:11 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
13:19:27 kvm7 kernel: sd 1:0:2:0: device_blocked, handle(0x000b)
13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronizing SCSI cache
13:19:29 kvm7 kernel: md: md3 still in use.
13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
13:19:29 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
13:19:29 kvm7 kernel: md: md2 still in use.
13:19:29 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
13:19:29 kvm7 kernel: md: unbind<sdk3>
13:19:29 kvm7 kernel: md: export_rdev(sdk3)
13:19:29 kvm7 kernel: md: unbind<sdk2>
13:19:29 kvm7 kernel: md: export_rdev(sdk2)
/proc/mdstat
看起来像预期的那样(1 个有故障的成员)并且虚拟机继续运行没有任何问题。
md3 : active raid10 sdp3[15] sdb3[2] sdg3[12] sde3[8] sdn3[11] sdl3[7] sdm3[9] sdf3[10] sdi3[1] sdk3[5](F) sdc3[4] sdd3[6] sdh3[14] sdo3[13] sda3[0] sdj3[3]
7844052992 blocks super 1.2 128K chunks 2 near-copies [16/15] [UUUUU_UUUUUUUUUU]
由于没有可用的 1 TB SSD,SSD 必须暂时更换为更大的 SSD;所以我们做了,开始重建,一切都很好。今天,“正确”的 SSD 到货了,因此数据中心技术人员只需拉动装有上述 SSD 的托盘,系统就会在几秒钟内变得无响应。虽然主机在单独的 RAID 阵列上运行良好,但虚拟机无法执行 I/O。负载增加到> 800。我能够执行mdadm --detail /dev/md3
显示降级(但活动/干净)的阵列,所以从这个角度来看,系统绝对没问题。我试图从阵列中删除有故障/丢失的驱动器,这当然失败了(“没有这样的设备”),甚至突然mdadm --detail /dev/md3
不再生成任何输出,它只是卡住了,我不得不终止 SSH 会话才能摆脱这种情况。在此之后,我决定强制重新启动,因为我什至不知道如何从阵列中移除这个有故障的驱动器 - 一切都正确出现。当然,RAID 仍然降级,需要重新同步,但除此之外:没有问题。
我很确定在将托盘从机架中拉出之前,我应该通过 mdadm 卸下驱动器,尽管我无法解释 mdraid 的这种行为。在我看来,我们“模拟”了一次常规磁盘中断,所以有人知道是什么导致了这个问题,以及我如何确保下一次常规磁盘中断不会导致同样的问题?--set-faulty
内核记录了一些消息,我发现有趣的是新设备以sdq出现,而拉出的设备被称为sdk
. 所以我认为它sdk
没有从阵列中正确踢出。上周最初的 SSD 故障发生时,我没有看到这种行为;所以更换驱动器也出现了sdk
。
日志还显示旧 SSD 的故障和插入新 SSD 之间的 7 分钟,所以我不认为像它在https://superuser.com/questions/942886/fail-device下描述的问题in-md-raid-when-ata-stops-responding发生。虚拟机也立即关闭,而不是 7 分钟后。所以 - 对此有什么想法吗?将不胜感激:)
11:45:36 kvm7 kernel: sd 1:0:8:0: device_blocked, handle(0x000b)
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069640
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069648
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069656
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069664
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069672
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069680
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069688
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069696
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069704
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069712
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] CDB: Read(10) 28 00 20 af f7 08 00 00 08 00
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 548402952
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
11:45:37 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
11:45:37 kvm7 kernel: md: md2 still in use.
11:45:37 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133264
11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronizing SCSI cache
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
11:45:37 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
11:45:37 kvm7 kernel: md: unbind<sdk2>
11:45:37 kvm7 kernel: md: export_rdev(sdk2)
11:48:00 kvm7 kernel: INFO: task md3_raid10:1293 blocked for more than 120 seconds.
11:48:00 kvm7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
11:48:00 kvm7 kernel: md3_raid10 D ffff883f26e55c00 0 1293 2 0x00000000
11:48:00 kvm7 kernel: ffff887f24bd7c58 0000000000000046 ffff887f212eb980 ffff887f24bd7fd8
11:48:00 kvm7 kernel: ffff887f24bd7fd8 ffff887f24bd7fd8 ffff887f212eb980 ffff887f23514400
11:48:00 kvm7 kernel: ffff887f235144dc 0000000000000001 ffff887f23514500 ffff8807fa4c4300
11:48:00 kvm7 kernel: Call Trace:
11:48:00 kvm7 kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
11:48:00 kvm7 kernel: [<ffffffffa0104ef7>] freeze_array+0xb7/0x180 [raid10]
11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
11:48:00 kvm7 kernel: [<ffffffffa010880d>] handle_read_error+0x2bd/0x360 [raid10]
11:48:00 kvm7 kernel: [<ffffffff812c7412>] ? generic_make_request+0xe2/0x130
11:48:00 kvm7 kernel: [<ffffffffa0108a1d>] raid10d+0x16d/0x1440 [raid10]
11:48:00 kvm7 kernel: [<ffffffff814bb785>] md_thread+0x155/0x1a0
11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
11:48:00 kvm7 kernel: [<ffffffff814bb630>] ? md_safemode_timeout+0x50/0x50
11:48:00 kvm7 kernel: [<ffffffff810a5b8f>] kthread+0xcf/0xe0
11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
11:48:00 kvm7 kernel: [<ffffffff81646a98>] ret_from_fork+0x58/0x90
11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
11:48:00 kvm7 kernel: INFO: task qemu-kvm:26929 blocked for more than 120 seconds.
[serveral messages for stuck qemu-kvm processes]
11:52:42 kvm7 kernel: scsi 1:0:9:0: Direct-Access ATA KINGSTON SKC400S 001A PQ: 0 ANSI: 6
11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: handle(0x000b), sas_addr(0x4433221102000000), phy(2), device_name(0x4d6b497569a68ba2)
11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: enclosure_logical_id(0x500304801c14a001), slot(2)
11:52:42 kvm7 kernel: scsi 1:0:9:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
11:52:42 kvm7 kernel: scsi 1:0:9:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
11:52:42 kvm7 kernel: sd 1:0:9:0: Attached scsi generic sg10 type 0
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] 2000409264 512-byte logical blocks: (1.02 TB/953 GiB)
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write Protect is off
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write cache: enabled, read cache: enabled, supports DPO and FUA
11:52:42 kvm7 kernel: sdq: unknown partition table
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Attached SCSI disk
从内核堆栈跟踪来看,似乎
md
驱动程序试图冻结数组 (freeze_array+0xb7/0x180 [raid10]
) 以完全删除失败的成员,但此操作从未完成。md: unbind<sdk3>
缺失的行证实了这一点。对我来说,这似乎是一个死锁/活锁问题,所以根本原因可能是软件错误。你真的应该在Linux RAID 邮件列表上提交一份报告