AskOverflow.Dev


Questions tagged [smartctl] (server)

Nuno
Asked: 2021-11-25 01:07:56 +0800 CST

NVMe health test

  • 1

On the servers I have with HDDs or SSDs, I have a cron job that periodically runs:

/usr/sbin/smartctl --test=short/long /dev/sd1

(for each disk)

While the test runs, the job simply watches the output of /usr/sbin/smartctl -c /dev/sd1, looping until it no longer contains:

[0-9]+% of test remaining.

It then checks that the test completed without errors:

(   0)  The previous self-test routine completed
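The polling logic described above can be sketched roughly as follows. This is only an illustration, not the asker's actual cron script: the function names, the `read_output` callable and the choice to treat anything other than the `(   0)` status as a failure are all assumptions. The two patterns are the ones quoted in the question.

```python
import re
import time

# Patterns taken from the `smartctl -c` output quoted in the question.
IN_PROGRESS = re.compile(r"[0-9]+% of test remaining")
COMPLETED_OK = re.compile(r"\(\s*0\)\s+The previous self-test routine completed")


def self_test_state(capabilities_output: str) -> str:
    """Classify `smartctl -c` output as 'running', 'ok' or 'failed'.

    Assumption: any output that is neither in progress nor status 0
    is treated as a failure.
    """
    if IN_PROGRESS.search(capabilities_output):
        return "running"
    if COMPLETED_OK.search(capabilities_output):
        return "ok"
    return "failed"


def wait_for_self_test(read_output, poll_seconds=60.0, max_polls=600):
    """Poll until the self-test is no longer running.

    `read_output` is a hypothetical callable that returns the current
    text of `smartctl -c <device>`.
    """
    for _ in range(max_polls):
        state = self_test_state(read_output())
        if state != "running":
            return state
        time.sleep(poll_seconds)
    return "timeout"
```

In a real job, `read_output` could wrap `subprocess.run(["/usr/sbin/smartctl", "-c", device], capture_output=True, text=True).stdout`. Keeping it as a parameter makes the loop easy to exercise without a disk.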

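However, as of version 7.0, smartctl does not yet seem to support self-tests on NVMe, and according to https://www.smartmontools.org/wiki/NVMe_Support

it does say:

The smartd daemon tracks health (-H), error count (-l error) and temperature (-W DIFF,INFO,CRIT)

But what actually runs the tests? I am not sure whether the output of -H and -l is updated unless we run a short/long test.

I have also read about nvme-cli, but I cannot find a way to run a health test on the disk with it.

Any ideas?

Using CentOS 7 here.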

centos smart healthcheck nvme smartctl
  • 1 answer
  • 1058 Views
riska
Asked: 2020-03-30 13:13:23 +0800 CST

Testing a failing disk

  • 0

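I have a degraded RAID-1 setup and I am trying to work out whether the drive is completely dead (and needs to be sent back to the vendor) or can be recovered.

So I ran smartctl: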

root@linux:~# smartctl -l selftest /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-176-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1930         -
# 2  Short offline       Completed without error       00%      1930         -
# 3  Extended offline    Completed without error       00%      1930         -

Looks fine. Let's look at the rest:

root@linux:~# smartctl --all /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-176-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K3NU4V6D
LU WWN Device Id: 5 0014ee 211e15108
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 29 23:04:32 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (44340) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 470) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x303d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1930
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1286
194 Temperature_Celsius     0x0022   117   114   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1757 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1757 occurred at disk power-on lifetime: 1930 hours (80 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08  11d+12:18:30.077  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.077  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  11d+12:18:30.077  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08  11d+12:18:30.073  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.073  IDENTIFY DEVICE

Error 1756 occurred at disk power-on lifetime: 1930 hours (80 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08  11d+12:18:30.077  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08  11d+12:18:30.073  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.073  IDENTIFY DEVICE
  c8 00 08 08 00 00 e0 08  11d+12:18:30.051  READ DMA
  ef 10 02 00 00 00 a0 08  11d+12:18:30.041  SET FEATURES [Enable SATA feature]

Error 1755 occurred at disk power-on lifetime: 1930 hours (80 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08  11d+12:18:30.073  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.073  IDENTIFY DEVICE
  c8 00 08 08 00 00 e0 08  11d+12:18:30.051  READ DMA
  ef 10 02 00 00 00 a0 08  11d+12:18:30.041  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.040  IDENTIFY DEVICE

Error 1754 occurred at disk power-on lifetime: 1930 hours (80 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 08 08 00 00 e0  Device Fault; Error: ABRT 8 sectors at LBA = 0x00000008 = 8

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 00 00 e0 08  11d+12:18:30.051  READ DMA
  ef 10 02 00 00 00 a0 08  11d+12:18:30.041  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.040  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  11d+12:18:30.040  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08  11d+12:18:30.037  SET FEATURES [Enable SATA feature]

Error 1753 occurred at disk power-on lifetime: 1930 hours (80 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08  11d+12:18:30.041  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.040  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  11d+12:18:30.040  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08  11d+12:18:30.037  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  11d+12:18:30.037  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1930         -
# 2  Short offline       Completed without error       00%      1930         -
# 3  Extended offline    Completed without error       00%      1930         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

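I tried to read through the output, but unfortunately I am no expert at analysing smartctl output. It says the drive passed the test:

SMART overall-health self-assessment test result: PASSED

In the thresholds section everything looks fine (this is a fairly new WD Red HDD, a few months old).

At the end of the output, though, there are a lot of errors. I tried to look them up, but had no luck.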
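The mismatch described here, a PASSED health result next to a large error count, is easy to miss when reading the report by eye. The following is a minimal sketch of pulling both values out of `smartctl --all` text, assuming the line formats shown in the output above. The function names are hypothetical, and treating a missing "ATA Error Count" line as zero is an assumption.

```python
import re


def summarize_smart_report(report: str) -> dict:
    """Extract the overall health verdict and the ATA error count."""
    health = re.search(r"self-assessment test result:\s*(\S+)", report)
    errors = re.search(r"ATA Error Count:\s*(\d+)", report)
    return {
        "health": health.group(1) if health else None,
        # Assumption: no "ATA Error Count" line means no logged errors.
        "ata_error_count": int(errors.group(1)) if errors else 0,
    }


def needs_attention(report: str) -> bool:
    """Flag a drive that did not pass, or that passed but logged errors."""
    summary = summarize_smart_report(report)
    return summary["health"] != "PASSED" or summary["ata_error_count"] > 0
```

With the report quoted in this question, the summary would show `PASSED` together with 1757 logged errors, so `needs_attention` returns True even though the health check alone looks clean.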

On the other hand, I cannot use the disk at all:

root@idealib:~# fdisk /dev/sda

Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

fdisk: cannot open /dev/sda: Input/output error

dmesg looks bad too:

[ 1708.769491] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1708.771763] ata1.00: irq_stat 0x40000001
[ 1708.773710] ata1.00: failed command: READ DMA
[ 1708.775667] ata1.00: cmd c8/00:08:08:00:00/00:00:00:00:00/e0 tag 15 dma 4096 in
                        res 61/04:08:08:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
[ 1708.779619] ata1.00: status: { DRDY DF ERR }
[ 1708.781613] ata1.00: error: { ABRT }
[ 1708.784344] ata1.00: failed to enable AA (error_mask=0x1)
[ 1708.787355] ata1.00: failed to enable AA (error_mask=0x1)
[ 1708.789183] ata1.00: configured for UDMA/133 (device error ignored)
[ 1708.796930] ata1: EH complete

What other tests should I run before I can be 100% sure the disk is faulty and needs replacing? Or is it already dead and I should just accept it? :)

hard-drive linux smartctl
  • 1 answer
  • 378 Views
Alexandru
Asked: 2020-02-29 03:41:15 +0800 CST

Interpreting HDD SMART data

  • 2

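I need your opinion on whether the drive below is failing.

When I run "smartctl -a /dev/sda -d megaraid,1", two errors appear at the end of the output, stating "Error: WP at LBA". I see nothing suspicious in the SMART attributes.

Here is the full output of "smartctl -a /dev/sda -d megaraid,1".

This HDD is one of two HDDs in a RAID 1 (mirror) hardware configuration on a Dell H330 controller in a Dell PowerEdge server.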

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MG03ACAxxx(Y) Enterprise HDD
Device Model:     TOSHIBA MG03ACA300
Serial Number:    73VCK8GDF
LU WWN Device Id: 5 000039 4ebc82c58
Firmware Version: FL1A
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb 27 23:05:39 2020 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 510) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       8874
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       12964
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       42
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       31 (Min/Max 11/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   068   068   000    Old_age   Always       -       12994
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       103
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 12901 hours (537 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 10 0e fb 74 40  Error: WP at LBA = 0x0074fb0e = 7666446

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 00 48 7a e0 40 00  42d+20:47:35.187  WRITE FPDMA QUEUED
  61 08 20 58 89 8a 40 00  42d+20:47:35.187  WRITE FPDMA QUEUED
  61 10 20 48 89 8a 40 00  42d+20:47:35.187  WRITE FPDMA QUEUED
  61 08 20 48 7a e0 40 00  42d+20:47:35.183  WRITE FPDMA QUEUED
  61 08 20 40 89 8a 40 00  42d+20:47:35.183  WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 12901 hours (537 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 0e fb 74 40  Error: WP at LBA = 0x0074fb0e = 7666446

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 10 18 94 de 40 00  42d+20:47:32.312  WRITE FPDMA QUEUED
  60 00 08 00 fc 74 40 00  42d+20:47:32.311  READ FPDMA QUEUED
  60 00 00 00 fb 74 40 00  42d+20:47:32.311  READ FPDMA QUEUED
  60 00 00 00 fa 74 40 00  42d+20:47:32.284  READ FPDMA QUEUED
  60 00 00 00 f9 74 40 00  42d+20:47:32.264  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

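Later edit 1:

I also checked the iDRAC on the PowerEdge server, and under Storage menu > Summary > Recently Logged Storage Events I found events that correspond to the 2 SMART errors.

The event states: "Disk media error on Disk 1 in Backplane 1 of RAID Controller in Slot 1 was corrected during recovery". Please find the screenshot below.

[Image: iDRAC > Storage menu > Summary > Recently Logged Storage Events]

Later edit 2:

A few days later, Current_Pending_Sector rose to 1 for a few hours and then dropped back to 0.

Reallocated_Sector_Ct, Reallocated_Event_Count and Offline_Uncorrectable stayed at 0 throughout.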
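Tracking these counters over time is easier with a small parser than by rereading the table. The sketch below assumes the ten-column attribute layout shown in the smartctl output above. The function names and the `WATCHED` list are illustrative choices, not part of smartmontools.

```python
# Attributes mentioned in this question; the selection is an assumption.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(report: str) -> dict:
    """Map attribute name to integer RAW_VALUE for each table row."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows: numeric ID, name, hex FLAG, ..., RAW_VALUE in column 10.
        if (
            len(fields) >= 10
            and fields[0].isdigit()
            and fields[2].startswith("0x")
            and fields[9].isdigit()
        ):
            attrs[fields[1]] = int(fields[9])
    return attrs


def nonzero_watched(report: str) -> dict:
    """Return only the watched attributes whose raw value is above zero."""
    attrs = parse_attributes(report)
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
```

Running `nonzero_watched` on each periodic report and logging the result would have captured the brief rise of Current_Pending_Sector to 1 described above.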

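Another error also appeared in the SMART error log: "Error: UNC at LBA".

However, no further errors appeared in iDRAC.

We decided to replace the drive with a new one, since we no longer trust it.

Thanks!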

raid hardware-raid smart smartctl dell-perc
  • 2 answers
  • 1756 Views
stz184
Asked: 2016-09-08 21:59:44 +0800 CST

e2fsck finds no errors, but the SMART self-test fails

  • 2

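I have an external Freecom hard disk (with a Samsung drive inside), connected via USB and using its own power supply.

The disk disconnects itself at random intervals (from a few hours to a month). I am inclined to blame the OS, because the same drive has no problems when connected to the USB port of a TP-Link router.

Anyway, just to be sure, I ran an extended SMART self-test with smartctl, and it finished with the message Completed: read failure 30%. So I ran an additional test with e2fsck. It took a whole night to test this 1.5TB disk. The test finished without any errors.

I am confused: should I trust the SMART self-test or the e2fsck result? In addition, the SMART health status is PASSED and the short self-test is fine too. I checked the usual suspects: the USB cable was replaced with a new one, and the external power supply was checked. Ideas? Should I buy a new drive, or is it safe? Is SMART or e2fsck the more reliable source of health information?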
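To compare self-test results across runs, the self-test log can be parsed instead of read by eye. The following is a rough sketch that assumes the column layout `smartctl -l selftest` prints for ATA drives (as seen elsewhere on this page); the function names are hypothetical, and the failing row in the usage note is made up for illustration rather than taken from this drive.

```python
import re

# One log row: "# 1  Extended offline    Completed without error   00%   1930   -"
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)",
    re.MULTILINE,
)


def parse_selftest_log(log: str) -> list:
    """Return one dict per self-test log row."""
    return [
        {
            "num": int(m.group(1)),
            "test": m.group(2),
            "status": m.group(3),
            "remaining_pct": int(m.group(4)),
            "lifetime_hours": int(m.group(5)),
            "lba_of_first_error": m.group(6),
        }
        for m in ROW.finditer(log)
    ]


def unclean_tests(log: str) -> list:
    """Rows whose status is anything other than 'Completed without error'."""
    return [
        row for row in parse_selftest_log(log)
        if row["status"] != "Completed without error"
    ]
```

For a hypothetical row such as `# 1  Extended offline    Completed: read failure       30%      1234         123456`, `unclean_tests` would return that entry with the LBA of the first error. The LBA matters here because e2fsck only reads blocks the filesystem uses, so a read failure in an unused area could go unnoticed by it.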

smartctl e2fsck
  • 1 answer
  • 514 Views
FreeSoftwareServers
Asked: 2016-08-10 11:04:28 +0800 CST

Does "smartctl -H" or "-all" run anything against the disk, or only poll data?

  • 0

I am currently setting up SMART monitoring and have a question about this command:

smartctl -H /dev/sda

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

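Does this actually run anything against the disk, or does it just poll the logs/data currently available to smartmontools?

I am aware of, and am looking into, running smartd with short and long tests, but those would be managed by smartd. My script is simple: it just greps for a health status of OK and fails or passes depending on what it finds. It also displays "smartctl -all /dev/sda", and I would like to know the same about that.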
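The kind of check described above might look like the sketch below. It is not the asker's script: the function name is hypothetical, and accepting both the SCSI-style "SMART Health Status: OK" line and the ATA-style "PASSED" line is an assumption made so the same check covers both kinds of output quoted on this page.

```python
import re

# Either health line counts as a pass (assumption: one check for both formats).
HEALTH_OK = re.compile(
    r"SMART Health Status:\s*OK|self-assessment test result:\s*PASSED"
)


def health_ok(smartctl_output: str) -> bool:
    """True when the `smartctl -H` output reports a healthy drive."""
    return bool(HEALTH_OK.search(smartctl_output))
```

A monitoring wrapper could feed this the text captured from `smartctl -H /dev/sda` and map the boolean to its own pass/fail state.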

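I just want to be sure, because I think that neither smartctl -H /dev/sda nor smartctl -all /dev/sda actually performs any test when it runs; they only poll the available data. Can someone confirm?

The reason is that I poll this data quite often with my network monitoring software (currently every 15 minutes). If it does not affect the disk, I will leave it as is and use smartd to schedule the actual self-tests, which do read/write/test the disk 100%.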

hardware smart smartctl
  • 1 answer
  • 470 Views
MUY Belgium
Asked: 2016-06-02 01:24:20 +0800 CST

In smartctl, what does "Percentage used endurance indicator too short" mean?

  • 1

I use Bacula for backups. When I run smartctl:

# smartctl -H -l error /dev/st0
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-431.1.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

TapeAlert: OK
Percentage used endurance indicator too short (pl=6)

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
           ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

What does the message Percentage used endurance indicator too short actually mean?

bacula smart smartctl
  • 1 answer
  • 1959 Views
flashnik
Asked: 2010-03-08 18:10:37 +0800 CST

What is the Linux device name for a RAID of SAS drives?

  • 1

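I have a RAID1 of 2 SAS drives on a Promise FastTrack TX2650. What would their Linux device names be? For example, sda is the first SATA drive. I run Windows Server, so I cannot look it up directly, but I need this information for use with smartctl.

Update. I found how to access the RAID: smartctl -d scsi sdb (because I also have a SATA drive). But in this case I only get information about the RAID controller, whereas I want information about the drives themselves. Is that possible? Promise's control panel only reports their health status (a boolean), and I want more. Mostly I need the temperature.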

windows-server-2008 raid sas smartctl device
  • 2 answers
  • 2583 Views
Nelson
Asked: 2010-01-20 17:10:15 +0800 CST

How can I easily repair a single unreadable block on a Linux disk?

  • 23

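My Linux system has started throwing SMART errors in syslog. I tracked it down and believe the problem is a single block on the disk. How can I easily get the disk to reallocate that block? I would like to know which file is destroyed in the process. (I know that if one block on a disk fails, others are likely to follow; I have a good ongoing backup and just want to try to keep this disk working.)

Searching the web leads to the Bad block HOWTO, which describes a manual process on an unmounted disk. It seems complicated and error-prone. Is there a tool that automates this process on Linux? My only other option is the manufacturer's diagnostic tool, but I assume that would clobber the bad block without reporting what was destroyed. Worst case, it might be filesystem metadata.

The disk in question is the main system partition, using ext3fs and LVM. Here is the error from syslog and the relevant bit from smartctl.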

smartd[5226]: Device: /dev/hda, 1 Currently unreadable (pending) sectors

Error 1 occurred at disk power-on lifetime: 17449 hours (727 days + 1 hours)
... Error: UNC at LBA = 0x00d39eee = 13868782

A full smartctl dump is on pastebin.
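The first manual step in the Bad block HOWTO is converting the failing LBA into a filesystem block number, using b = (int)((L - S) * 512 / B), where L is the LBA, S the partition's starting sector and B the filesystem block size. A minimal sketch of that arithmetic follows. It only covers a filesystem sitting directly on a partition; with LVM, as in this question, an additional mapping through the logical volume is needed and is not shown. The partition start (63) and block size (4096) in the usage note are hypothetical values, not taken from this system.

```python
def lba_to_fs_block(lba: int, partition_start_sector: int,
                    fs_block_size: int, sector_size: int = 512) -> int:
    """Convert a disk LBA to a filesystem block number within a partition.

    Implements b = (int)((L - S) * 512 / B) from the Bad block HOWTO.
    Does not account for LVM or any other remapping layer.
    """
    if lba < partition_start_sector:
        raise ValueError("LBA lies before the start of the partition")
    return (lba - partition_start_sector) * sector_size // fs_block_size


# LBA from the smartctl error line quoted above.
bad_lba = int("0x00d39eee", 16)
```

With the hypothetical values above, `lba_to_fs_block(bad_lba, 63, 4096)` gives the block number to look up with filesystem tools in order to find which file, if any, owns it.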

hard-drive linux lvm smart smartctl
  • 6 answers
  • 27401 Views
