AskOverflow.Dev


Matteo Ragni's questions

Matteo Ragni
Asked: 2025-03-24 19:04:21 +0800 CST

RAID 5 on LSI MegaRaid: replacing a failing disk (with or without hot swap)

  • 5

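I received a SMART alert for one disk in a RAID5 array (made up of 3 disks). If possible, I would like to replace the failing disk without shutting down the server. The error reported in the mail alert is (some information redacted):

This message was generated by the smartd daemon running on:

host name: ********* DNS domain: *********

The following warning/error was logged by the smartd daemon:

Device: /dev/bus/0 [megaraid_disk_10], SMART Failure: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE

Device info: [LENOVO ST2000NM003A LKB9], lu id: 0x5000*************, S/N: WJC0*************, 2.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation. The original message about this issue was sent at Sat Mar 22 04:36:14 2025 CET. Another message will be sent in 24 hours if the problem persists.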
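The number inside the `[megaraid_disk_N]` label is the ID that `smartctl -d megaraid,N` expects further down. As a minimal sketch, assuming the alert text keeps exactly this label format, the ID can be pulled out of a saved alert like this (the function name `megaraid_id_from_alert` is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: reads smartd alert text on stdin and prints the
# number N found in a "[megaraid_disk_N]" device label, one per matching line.
megaraid_id_from_alert() {
    sed -n 's/.*\[megaraid_disk_\([0-9][0-9]*\)].*/\1/p'
}
```

Piping the alert line above through the function should yield the ID to pass as `-d megaraid,<id>`.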

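The server is currently running Proxmox (a Debian-based distribution), and the disks are managed by a Lenovo RAID 730-8i 2GB Flash. As far as I know it is an LSI / Broadcom controller, managed from the OS through the MegaCli64 and StorCli64 utilities; I have both installed. Output of lspci | grep RAID: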

58:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
 

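There are two drive groups on the controller:

  • A RAID1 of 2 SSD disks, about 500GB each
  • A RAID5 of 3 HDD disks, about 2TB each. This is the group in which one of the devices started raising SMART warnings. I have found a compatible disk with the same part number to replace the one raising the warning.

Everything on the RAID5 is backed up, so I am not too worried about losing data, but restoring would take considerably more work, and I would like to avoid that if possible.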

With MegaCli64, this is the RAID configuration I get:

# ./MegaCli64 -LDInfo -LAll -aAll

[... omissis other disk group ...]

Virtual Drive: 1 (Target Id: 1)
Name                :hddstorage
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 3.635 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 1.817 TB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 3
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disabled
Encryption Type     : None
PI type: No PI

Is VD Cached: No
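As a sanity check on the reported size: a single-span RAID 5 keeps one drive's worth of parity, so usable capacity is (drives - 1) times the per-drive size. The sketch below uses the coerced size in sectors that the controller reports for each member (0xe8b6d000, visible in the PDList output further down) and assumes all three members are coerced to the same size. The function name is illustrative only.

```shell
#!/bin/sh
# Usable bytes of a single-span RAID 5:
#   (number of drives - 1) * sectors per drive * bytes per sector
# Arguments: drive count, per-drive sector count (decimal or 0x hex), sector size.
raid5_usable_bytes() {
    echo $(( ($1 - 1) * $2 * $3 ))
}
```

Calling `raid5_usable_bytes 3 0xe8b6d000 512` gives a byte count that works out to roughly 3.636 TiB, in line with the 3.635 TB / 3.636 TB that the two tools display for this virtual drive.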

And this is the current state of the failing drive:

# ./MegaCli64 -PDList -aAll

[... omissis other disks ...]

Enclosure Device ID: 252
Slot Number: 4
Drive's position: DiskGroup: 1, Span: 0, Arm: 2
Enclosure position: N/A
Device Id: 10  # <---- ID for the SMART check
WWN: 5000C500CE7FB828
Sequence Number: 2
Media Error Count: 79
Other Error Count: 1
Predictive Failure Count: 2
Last Predictive Failure Event Seq Number: 46655
PD Type: SAS

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: LKB9
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c500ce7fb829
SAS Address(1): 0x0
Connected Port Number: 4(path0) 
Inquiry Data: LENOVO  ST2000NM003A    LKB9WJC06CK0LKB9LKB9LKB9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 12.0Gb/s 
Link Speed: 12.0Gb/s 
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :31C (87.80 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 12.0Gb/s 
Port-1 :
Port status: Active
Port's Linkspeed: 12.0Gb/s 
Drive has flagged a S.M.A.R.T alert : Yes  # <--- Faulty!
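With more than a handful of drives, scanning this output by eye gets tedious. A minimal sketch, assuming the `Slot Number:`, `Device Id:` and `Drive has flagged a S.M.A.R.T alert` lines appear in that order for each drive as they do above, could filter the listing down to the flagged drives (the function name is hypothetical):

```shell
#!/bin/sh
# Hypothetical filter: reads "MegaCli64 -PDList -aAll" style text on stdin and
# prints "slot=<n> device_id=<n>" for each drive whose S.M.A.R.T alert flag is Yes.
flagged_drives() {
    awk -F': *' '
        /^Slot Number:/ { slot = $2 + 0 }   # remember the slot of the current drive
        /^Device Id:/   { id = $2 + 0 }     # remember its device ID
        /^Drive has flagged a S\.M\.A\.R\.T alert/ {
            if ($2 ~ /^Yes/) print "slot=" slot " device_id=" id
        }
    '
}
```

It would be used as `./MegaCli64 -PDList -aAll | flagged_drives`.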

So, looking at the SMART results for the drive, I get:

smartctl -a -d megaraid,10  /dev/sda

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.157-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               LENOVO
Product:              ST2000NM003A
Revision:             LKB9
Compliance:           SPC-5
User Capacity:        2.000.398.934.016 bytes [2,00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500ce7fb82b
Serial number:        WJC06CK00000E024CJ6U
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Mar 24 11:01:20 2025 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=30]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 29
Power on minutes since format <not available>
Current Drive Temperature:     32 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 39425:21
Manufactured in week 02 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  70
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2299
Elements in grown defect list: 29

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     1699         0      1699       2335     504611,864         386
write:         0        0         0         0          0      73712,791           0
verify:        0     1809         0      1809       2122     471546,642         237

Non-medium error count:       11

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -       7                 - [-   -    -]
# 2  Background long   Aborted (by user command)   -       4                 - [-   -    -]
# 3  Background short  Completed                   -       4                 - [-   -    -]
# 4  Background long   Aborted (by user command)   -       4                 - [-   -    -]

Long (extended) Self-test duration: 13740 seconds [229,0 minutes]

This more or less confirms that something is wrong with the drive. Checks on the other disks (smartctl -a -d megaraid,8 /dev/sda and smartctl -a -d megaraid,9 /dev/sda) report healthy readings:

[... omissis ...]
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
[... omissis ...]
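To compare the three members side by side, the health line and the "Total uncorrected errors" column can be reduced to a few lines per drive. This is only a sketch: it assumes the SAS-style report layout shown above, where the uncorrected count is the last field of the `read:`, `write:` and `verify:` rows, and the function name is made up.

```shell
#!/bin/sh
# Hypothetical summary: reads "smartctl -a" text for a SAS drive on stdin and
# prints the health status plus the last column of each error counter row.
smart_summary() {
    awk '
        /^SMART Health Status:/ {
            sub(/^SMART Health Status: */, "")
            print "health: " $0
        }
        /^(read|write|verify):/ {
            op = $1
            sub(/:$/, "", op)
            print op " uncorrected: " $NF
        }
    '
}
```

For example, `for id in 8 9 10; do smartctl -a -d megaraid,$id /dev/sda | smart_summary; done` would print one short summary per member.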

The controller has not yet taken the disk offline, as confirmed by StorCli64:

# ./storcli64 /cALL show all

[... omissis ...]

Drive Groups = 2

TOPOLOGY :
========

-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace TR 
-----------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  446.102 GB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1 Optl  N  446.102 GB dflt N  N   dflt N      N  
 0 0   0   252:0    11  DRIVE Onln  N  446.102 GB dflt N  N   dflt -      N  
 0 0   1   252:1    12  DRIVE Onln  N  446.102 GB dflt N  N   dflt -      N  
 1 -   -   -        -   RAID5 Optl  N    3.636 TB dsbl N  N   dflt N      N  
 1 0   -   -        -   RAID5 Optl  N    3.636 TB dsbl N  N   dflt N      N  
 1 0   0   252:2    8   DRIVE Onln  N    1.818 TB dsbl N  N   dflt -      N  
 1 0   1   252:3    9   DRIVE Onln  N    1.818 TB dsbl N  N   dflt -      N  
 1 0   2   252:4    10  DRIVE Onln  N    1.818 TB dsbl N  N   dflt -      N   # <-- Used later for a storcli command
-----------------------------------------------------------------------------

[... omissis ...]

Physical Drives = 5

PD LIST :
=======

-----------------------------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model                                   Sp Type 
-----------------------------------------------------------------------------------------------------
252:0    11 Onln   0 446.102 GB SATA SSD N   N  512B MTFDDAK480TDS-1AW1ZA 02JG538D7A44703LEN U  -    
252:1    12 Onln   0 446.102 GB SATA SSD N   N  512B MTFDDAK480TDS-1AW1ZA 02JG538D7A44703LEN U  -    
252:2     8 Onln   1   1.818 TB SAS  HDD N   N  512B ST2000NM003A                            U  -    
252:3     9 Onln   1   1.818 TB SAS  HDD N   N  512B ST2000NM003A                            U  -    
252:4    10 Onln   1   1.818 TB SAS  HDD N   N  512B ST2000NM003A                            U  -     # <--- THIS LINE (State: Onln)
-----------------------------------------------------------------------------------------------------

[... omissis ...]
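The SMART alert identifies the drive by device ID (DID 10), while StorCli commands address it by controller, enclosure and slot. A small sketch for mapping one to the other from the PD LIST table, assuming the first two columns are `EID:Slt` and `DID` as shown above (the function name and the controller index argument are illustrative):

```shell
#!/bin/sh
# Hypothetical lookup: reads "storcli64 /cALL show all" text on stdin and prints
# the /cX/eY/sZ path of the drive with the given DID.
# Arguments: controller index, device ID.
drive_path_for_did() {
    awk -v ctl="$1" -v did="$2" '
        # PD LIST rows start with "<enclosure>:<slot>" followed by the DID.
        $1 ~ /^[0-9]+:[0-9]+$/ && $2 == did {
            split($1, loc, ":")
            print "/c" ctl "/e" loc[1] "/s" loc[2]
        }
    '
}
```

With the table above, `./storcli64 /cALL show all | drive_path_for_did 0 10` should print `/c0/e252/s4`.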

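I have ordered a new ST2000NM003A disk (Seagate EXOS 7E8 SAS 12Gbit/s) and am preparing for the replacement. To identify the drive for the swap, I turned on its locate indicator with ./storcli64 /c0/e252/s4 start locate. Now I am trying to understand the correct procedure for replacing the failing disk. As far as I understand, for an actually degraded RAID5 I should: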

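  1. Set the original disk offline (the controller has not set it offline yet)
  2. Mark the failing disk as missing
  3. Mark the failing disk as prepared for removal
  4. Insert the new disk
  5. Bring the new disk online
  6. Manually start the array rebuild
  7. Check the rebuild status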

My RAID is not reported as degraded, but perhaps the same procedure applies. In terms of commands, I think this is what I should do with StorCli64:

  1. ./storcli64 /c0/e252/s4 set offline
  2. ./storcli64 /c0/e252/s4 set missing
  3. ./storcli64 /c0/e252/s4 set spindown
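  4. Replace the disk with the new one in the same slot
  5. ./storcli64 /c0/e252/s4 set spinup and ./storcli64 /c0/e252/s4 set online
  6. ./storcli64 /c0/e252/s4 insert dg=1 array=0 row=2. This should also start the rebuild automatically. The parameters (dg for drive group, array, and row) are taken from the StorCli topology output.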
  7. ./storcli64 /c0/e252/s4 show rebuild
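Since none of this has been confirmed yet, one cautious option is to print the sequence for review before running anything. The sketch below only echoes the commands exactly as listed above. It does not execute them, and it does not validate them: in particular the `set spindown` / `set spinup` spelling is copied from the list as-is and may need checking against the StorCli reference for the installed version. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical dry run: prints the proposed (unconfirmed) replacement sequence
# for one drive. Nothing is executed against the controller.
# Arguments: drive path (e.g. /c0/e252/s4), drive group, array, row.
print_replace_plan() {
    drive=$1; dg=$2; array=$3; row=$4
    cli=./storcli64
    echo "$cli $drive set offline"
    echo "$cli $drive set missing"
    echo "$cli $drive set spindown"
    echo "# swap the physical disk in the same slot"
    echo "$cli $drive set spinup"
    echo "$cli $drive set online"
    echo "$cli $drive insert dg=$dg array=$array row=$row"
    echo "$cli $drive show rebuild"
}
```

Running `print_replace_plan /c0/e252/s4 1 0 2` lists the eight steps with the values taken from the topology table, which makes it easier to check each line against the documentation first.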

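This is more or less what I managed to piece together from the StorCli PDF guide for the RAID controller, looking at the chapter that deals with this issue (chapter 6). However, I could not confirm that this is the correct procedure.

Can someone confirm that this is the correct procedure?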

hardware-raid
  • 1 answer
  • 56 Views
