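smartmontools reports a steadily growing number of unreadable sectors on a drive that is part of a RAID1 configuration. I assumed that the LSI MegaRAID controller also checks the SMART status of its disks, so shouldn't it have recognized the drive as failing and marked it offline?
Output of smartctl -d sat+megaraid,7 -a /dev/sda: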
...
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 69
...
Error 11 occurred at disk power-on lifetime: 9704 hours (404 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 11 6f cd 04 0f Error: UNC at LBA = 0x0f04cd6f = 251972975
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 69 38 17 cd 04 40 00 2d+11:27:29.750 READ FPDMA QUEUED
61 10 30 98 12 55 40 00 2d+11:27:29.750 WRITE FPDMA QUEUED
61 01 28 57 86 da 40 00 2d+11:27:29.750 WRITE FPDMA QUEUED
60 09 20 f7 d1 04 40 00 2d+11:27:29.750 READ FPDMA QUEUED
60 80 18 00 d2 04 40 00 2d+11:27:29.750 READ FPDMA QUEUED
...
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 9700 -
# 2 Short offline Completed without error 00% 9676 -
# 3 Extended offline Completed: read failure 90% 9673 251972659
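As a sanity check on the numbers above, the LBA that smartctl prints for Error 11 can be reproduced from the register dump. This is only a sketch: it assumes the 28-bit layout in which SN, CL and CH hold the low 24 bits and the low nibble of DH holds bits 24-27, and the function name lba_from_registers is made up for illustration.

```python
def lba_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA from ATA registers (assumed layout, see above)."""
    # Low nibble of DH supplies bits 24-27; CH, CL and SN supply bits 23-0.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the "Error 11" entry: SN=6f CL=cd CH=04 DH=0f
error_lba = lba_from_registers(0x6F, 0xCD, 0x04, 0x0F)
print(hex(error_lba), error_lba)

# LBA_of_first_error from the failed extended self-test in the log above
selftest_lba = 251972659
print("distance in sectors:", error_lba - selftest_lba)
```

The result matches the 251972975 shown by smartctl, and it lies only a few hundred sectors away from the LBA where the extended self-test failed, so both failures appear to point at the same small region of the disk.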
Output of MegaCli -AdpAllInfo -aAll:
Product Name : LSI MegaRAID SAS 9260-4i
...
================
Virtual Drives : 2
Degraded : 0
Offline : 0
Physical Devices : 5
Disks : 4
Critical Disks : 0
Failed Disks : 0
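For anyone who wants to check these counters from a script, here is a minimal sketch that parses the "Key : Value" lines of the excerpt above and reports any non-zero problem counter. The helper name parse_counts and the list of keys are my own choices, and the parsing is only known to fit the excerpt as pasted here; the full MegaCli output may need more careful handling.

```python
def parse_counts(text: str) -> dict[str, int]:
    """Collect 'Key : Value' lines whose value is a plain integer."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counts[key.strip()] = int(value)
    return counts


# Excerpt copied from the MegaCli -AdpAllInfo output above
sample = """\
Virtual Drives : 2
Degraded : 0
Offline : 0
Physical Devices : 5
Disks : 4
Critical Disks : 0
Failed Disks : 0
"""

# Counters that would indicate the controller has noticed a problem (assumed list)
PROBLEM_KEYS = ("Degraded", "Offline", "Critical Disks", "Failed Disks")

counts = parse_counts(sample)
flagged = {k: v for k, v in counts.items() if k in PROBLEM_KEYS and v > 0}
print(flagged or "controller reports no degraded, offline, critical or failed devices")
```

With the excerpt above nothing is flagged, which is exactly the mismatch with the SMART data that this question is about.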
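Please let me know whether this RAID controller behavior is normal or whether something is misconfigured. The controller should be in its factory state; I only configured the four physical disks as two RAID1 volumes.
The bad disk will be replaced in any case.
Update: I have learned that there actually is a way to find out about such errors (see below), but I would have expected this kind of information to show up in the more prominent status output rather than being hidden in a log file.
It seems the RAID controller has not flagged this disk because it can still recover from this error condition.
To view the RAID controller log, run the following command:
The events.log file contains entries like the following, which indicate a problem with the disk: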