I received a SMART alert for 1 disk of a RAID5 array (made of 3 disks). If possible, I would like to replace the failing disk without shutting down the server. The error reported in the mail alert is (some information redacted):
This message was generated by the smartd daemon running on:
host name: ********* DNS domain: *********
The following warning/error was logged by the smartd daemon:
Device: /dev/bus/0 [megaraid_disk_10], SMART Failure: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE
Device info: [LENOVO ST2000NM003A LKB9], lu id: 0x5000*************, S/N: WJC0*************, 2.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation. The original message about this issue was sent on Sat Mar 22 04:36:14 CET 2025. Another message will be sent in 24 hours if the problem persists.
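For context, alerts like the one above come from smartmontools' smartd, which I assume is configured with per-disk MegaRAID entries roughly like this sketch (device IDs 8-10 match my physical disks; the mail address is a placeholder):

```
# Check each physical disk behind the MegaRAID controller:
# -a = monitor all SMART attributes, -m = address for mail alerts
/dev/bus/0 -d megaraid,8 -a -m root@localhost
/dev/bus/0 -d megaraid,9 -a -m root@localhost
/dev/bus/0 -d megaraid,10 -a -m root@localhost
```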
The server currently runs Proxmox (a Debian-based distribution) and the disks are managed by a Lenovo RAID 730-8i 2GB Flash controller which, as far as I know, is an LSI / Broadcom one, managed from the OS with its utilities MegaCli64 and StorCli64; I have both installed. With lspci | grep RAID:
58:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
There are two drive groups on the controller:
- a RAID1 of 2 SSD disks, ~500GB each
- a RAID5 of 3 HDD disks, ~2TB each. This is the group where one of the devices started raising the SMART warning. I found a compatible disk with the same part number to replace the one raising the warning.
Everything on the RAID5 is backed up, so I'm not too worried about losing data; restoring it would just be extra work that I'd like to avoid if possible.
With MegaCli64 I get this RAID configuration:
# ./MegaCli64 -LDInfo -LAll -aAll
[... omissis other disk group ...]
Virtual Drive: 1 (Target Id: 1)
Name :hddstorage
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 3.635 TB
Sector Size : 512
Is VD emulated : No
Parity Size : 1.817 TB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 3
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disabled
Encryption Type : None
PI type: No PI
Is VD Cached: No
And this is the current state of the failing drive:
# ./MegaCli64 -PDList -aAll
[... omissis other disks ...]
Enclosure Device ID: 252
Slot Number: 4
Drive's position: DiskGroup: 1, Span: 0, Arm: 2
Enclosure position: N/A
Device Id: 10 # <---- ID for the SMART check
WWN: 5000C500CE7FB828
Sequence Number: 2
Media Error Count: 79
Other Error Count: 1
Predictive Failure Count: 2
Last Predictive Failure Event Seq Number: 46655
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Sector Size: 512
Logical Sector Size: 512
Physical Sector Size: 512
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: LKB9
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500ce7fb829
SAS Address(1): 0x0
Connected Port Number: 4(path0)
Inquiry Data: LENOVO ST2000NM003A LKB9WJC06CK0LKB9LKB9LKB9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 12.0Gb/s
Link Speed: 12.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :31C (87.80 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 12.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 12.0Gb/s
Drive has flagged a S.M.A.R.T alert : Yes # <--- Faulty!
So, looking at the drive's SMART readings, I get:
smartctl -a -d megaraid,10 /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.157-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: LENOVO
Product: ST2000NM003A
Revision: LKB9
Compliance: SPC-5
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500ce7fb82b
Serial number: WJC06CK00000E024CJ6U
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Mar 24 11:01:20 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=30]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 29
Power on minutes since format <not available>
Current Drive Temperature: 32 C
Drive Trip Temperature: 65 C
Accumulated power on time, hours:minutes 39425:21
Manufactured in week 02 of year 2020
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 70
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 2299
Elements in grown defect list: 29
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 1699 0 1699 2335 504611,864 386
write: 0 0 0 0 0 73712,791 0
verify: 0 1809 0 1809 2122 471546,642 237
Non-medium error count: 11
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Completed - 7 - [- - -]
# 2 Background long Aborted (by user command) - 4 - [- - -]
# 3 Background short Completed - 4 - [- - -]
# 4 Background long Aborted (by user command) - 4 - [- - -]
Long (extended) Self-test duration: 13740 seconds [229,0 minutes]
which more or less confirms that something is wrong with the drive. Checking the other disks (smartctl -a -d megaraid,8 /dev/sda and smartctl -a -d megaraid,9 /dev/sda) reports healthy readings:
[... omissis ...]
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
[... omissis ...]
The controller has not taken the disk offline yet, as confirmed with StorCli64:
# ./storcli64 /cALL show all
[... omissis ...]
Drive Groups = 2
TOPOLOGY :
========
-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type State BT Size PDC PI SED DS3 FSpace TR
-----------------------------------------------------------------------------
0 - - - - RAID1 Optl N 446.102 GB dflt N N dflt N N
0 0 - - - RAID1 Optl N 446.102 GB dflt N N dflt N N
0 0 0 252:0 11 DRIVE Onln N 446.102 GB dflt N N dflt - N
0 0 1 252:1 12 DRIVE Onln N 446.102 GB dflt N N dflt - N
1 - - - - RAID5 Optl N 3.636 TB dsbl N N dflt N N
1 0 - - - RAID5 Optl N 3.636 TB dsbl N N dflt N N
1 0 0 252:2 8 DRIVE Onln N 1.818 TB dsbl N N dflt - N
1 0 1 252:3 9 DRIVE Onln N 1.818 TB dsbl N N dflt - N
1 0 2 252:4 10 DRIVE Onln N 1.818 TB dsbl N N dflt - N # <-- Used later for a storcli command
-----------------------------------------------------------------------------
[... omissis ...]
Physical Drives = 5
PD LIST :
=======
-----------------------------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
-----------------------------------------------------------------------------------------------------
252:0 11 Onln 0 446.102 GB SATA SSD N N 512B MTFDDAK480TDS-1AW1ZA 02JG538D7A44703LEN U -
252:1 12 Onln 0 446.102 GB SATA SSD N N 512B MTFDDAK480TDS-1AW1ZA 02JG538D7A44703LEN U -
252:2 8 Onln 1 1.818 TB SAS HDD N N 512B ST2000NM003A U -
252:3 9 Onln 1 1.818 TB SAS HDD N N 512B ST2000NM003A U -
252:4 10 Onln 1 1.818 TB SAS HDD N N 512B ST2000NM003A U - # <--- THIS LINE (State: Onln)
-----------------------------------------------------------------------------------------------------
[... omissis ...]
I ordered a new ST2000NM003A disk (Seagate EXOS 7E8 SAS 12Gbit/s) and I'm preparing the replacement activity. To prepare for the swap I turned on the disk's locate LED with ./storcli64 /c0/e252/s4 start locate. Now I'm trying to work out the correct procedure to replace the failing disk. From what I understand about an actually degraded RAID5, I think I should:
- set the original disk offline (the controller has not set it offline yet)
- mark the failing disk as missing
- mark the failing disk as prepared for removal
- insert the new disk
- set the new disk online
- start the array rebuild manually
- check the rebuild status
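Mapped onto StorCli commands as I understand them, the steps above would be something like this sketch (controller /c0, enclosure 252 and slot 4 are taken from the topology output; the echo wrapper only prints the commands, so nothing touches the controller until it is removed):

```shell
#!/bin/sh
# Sketch of the replacement sequence; drop the "echo" to actually run it.
# /c0 = controller, /e252 = enclosure, /s4 = slot of the failing disk.
STORCLI="echo ./storcli64"

$STORCLI /c0/e252/s4 set offline    # take the failing disk offline
$STORCLI /c0/e252/s4 set missing    # mark it missing in the drive group
$STORCLI /c0/e252/s4 set spindown   # spin it down so it is safe to pull
# ... physically swap the disk in slot 4 here ...
$STORCLI /c0/e252/s4 set spinup     # spin up the replacement
$STORCLI /c0/e252/s4 insert dg=1 array=0 row=2   # re-insert into DG 1
$STORCLI /c0/e252/s4 start rebuild  # in case rebuild does not start on its own
$STORCLI /c0/e252/s4 show rebuild   # check progress
```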
My RAID is not reported as degraded, but maybe the same procedure applies. In terms of commands, I think I should do, with StorCli64:
./storcli64 /c0/e252/s4 set offline
./storcli64 /c0/e252/s4 set missing
./storcli64 /c0/e252/s4 set spindown
- replace the disk with the new one in the same slot
./storcli64 /c0/e252/s4 set spinup
followed by ./storcli64 /c0/e252/s4 set online and
./storcli64 /c0/e252/s4 insert dg=1 array=0 row=2
This should also start the rebuild process automatically. The parameters (dg as in drive group, plus array and row) are taken from the StorCli topology output above. The rebuild status can then be checked with:
./storcli64 /c0/e252/s4 show rebuild
This is more or less what I tried to piece together from the RAID controller's StorCli PDF guide, looking at the chapter that covers this topic (chapter 6), but I could not find confirmation that this is the right procedure.
Can anyone confirm that this is the correct procedure?
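Once the rebuild is running, rather than re-typing the status command I would poll it with a small loop like this (again only a sketch: storcli64 is stubbed with a function faking the finished state, since I'm assuming the real command reports "Not in progress" once no rebuild is running; swap the function body for ./storcli64 "$@" to use it for real):

```shell
#!/bin/sh
# Poll the rebuild status until the controller reports it finished.
# storcli64 is a stub for this dry run; replace its body with the
# real binary ( ./storcli64 "$@" ) to run it against the controller.
storcli64() {
    echo 'Rebuild Not in progress'   # faked "finished" output
}

while :; do
    status=$(storcli64 /c0/e252/s4 show rebuild)
    echo "$status"
    case "$status" in
        *"Not in progress"*) break ;;   # done, stop polling
    esac
    sleep 60
done
```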