AskOverflow.Dev

Accepted
Codemonkey
Asked: 2018-12-14 05:45:46 +0800 CST

OfflineUncorrectableSector and CurrentPendingSector emails - what should I do?


Running CentOS 7:

I just did a yum update and a reboot on one of my servers, and when it came back up I received two emails:

First:

OfflineUncorrectableSector
Device: /dev/sdb [SAT], 28 Offline uncorrectable sectors

Second:

CurrentPendingSector
Device: /dev/sdb [SAT], 24 Currently unreadable (pending) sectors

Here is the output of smartctl -x /dev/sdb:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.1.3.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar 7K6000
Device Model:     HGST HUS726060ALE610
Serial Number:    K1KLXSNN
LU WWN Device Id: 5 000cca 255f2e12a
Firmware Version: APGNT907
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Dec 13 14:58:00 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  113) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 898) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   062   062   054    -    2796
  3 Spin_Up_Time            POS---   100   100   024    -    0
  4 Start_Stop_Count        -O--C-   100   100   000    -    3
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    25
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   100   100   000    -    5790
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    3
192 Power-Off_Retract_Count -O--CK   099   099   000    -    1257
193 Load_Cycle_Count        -O--C-   099   099   000    -    1257
194 Temperature_Celsius     -O----   120   120   000    -    50 (Min/Max 25/61)
196 Reallocated_Event_Count -O--CK   100   100   000    -    25
197 Current_Pending_Sector  -O---K   100   100   000    -    24
198 Offline_Uncorrectable   ---R--   100   100   000    -    28
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ NON-DATA log
0x15       GPL,SL  R/W      1  SATA Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 66 (device log contains only the most recent 4 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 66 [1] occurred at disk power-on lifetime: 5009 hours (208 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 53 00 38 00 00 14 e0 43 c8 40 00  Error: UNC 56 sectors at LBA = 0x14e043c8 = 350241736

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 00 38 00 00 14 e0 43 c8 e0 08  7d+20:36:03.771  READ DMA EXT
  25 00 00 00 70 00 00 14 e2 10 00 e0 08  7d+20:36:00.941  READ DMA EXT
  25 00 00 04 00 00 00 14 e2 0c 00 e0 08  7d+20:36:00.938  READ DMA EXT
  25 00 00 04 00 00 00 14 e2 08 00 e0 08  7d+20:36:00.936  READ DMA EXT
  25 00 00 04 00 00 00 14 e2 04 00 e0 08  7d+20:36:00.934  READ DMA EXT

Error 65 [0] occurred at disk power-on lifetime: 5009 hours (208 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 53 00 38 00 00 14 e0 43 c8 40 00  Error: UNC 56 sectors at LBA = 0x14e043c8 = 350241736

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 14 e0 40 00 e0 08  7d+20:36:00.421  READ DMA EXT
  25 00 00 04 00 00 00 14 e0 3c 00 e0 08  7d+20:35:57.616  READ DMA EXT
  25 00 00 04 00 00 00 14 e0 38 00 e0 08  7d+20:35:57.609  READ DMA EXT
  25 00 00 04 00 00 00 14 e0 34 00 e0 08  7d+20:35:57.601  READ DMA EXT
  25 00 00 04 00 00 00 14 e0 30 00 e0 08  7d+20:35:57.593  READ DMA EXT

Error 64 [3] occurred at disk power-on lifetime: 5009 hours (208 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 53 03 60 00 00 13 f9 4b 10 40 00  Error: UNC 864 sectors at LBA = 0x13f94b10 = 335104784

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 03 60 00 00 13 f9 4b 10 e0 08  7d+20:35:00.738  READ DMA EXT
  25 00 00 00 d0 00 00 13 fa f2 70 e0 08  7d+20:34:57.902  READ DMA EXT
  25 00 00 04 00 00 00 13 fa ee 70 e0 08  7d+20:34:57.900  READ DMA EXT
  25 00 00 04 00 00 00 13 fa ea 70 e0 08  7d+20:34:57.898  READ DMA EXT
  25 00 00 04 00 00 00 13 fa e6 70 e0 08  7d+20:34:57.896  READ DMA EXT

Error 63 [2] occurred at disk power-on lifetime: 5009 hours (208 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 53 03 60 00 00 13 f9 4b 10 40 00  Error: UNC 864 sectors at LBA = 0x13f94b10 = 335104784

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 13 f9 4a 70 e0 08  7d+20:34:57.413  READ DMA EXT
  25 00 00 04 00 00 00 13 f9 46 70 e0 08  7d+20:34:54.625  READ DMA EXT
  25 00 00 04 00 00 00 13 f9 42 70 e0 08  7d+20:34:54.622  READ DMA EXT
  25 00 00 04 00 00 00 13 f9 3e 70 e0 08  7d+20:34:54.619  READ DMA EXT
  25 00 00 04 00 00 00 13 f9 3a 70 e0 08  7d+20:34:54.614  READ DMA EXT

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    50 Celsius
Power Cycle Min/Max Temperature:     44/61 Celsius
Lifetime    Min/Max Temperature:     25/61 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (100)

Index    Estimated Time   Temperature Celsius
 101    2018-12-13 12:51    50  *******************************
 ...    ..(126 skipped).    ..  *******************************
 100    2018-12-13 14:58    50  *******************************

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4               3  ---  Lifetime Power-On Resets
0x01  0x018  6      6820807272  ---  Logical Sectors Written
0x01  0x020  6        20477840  ---  Number of Write Commands
0x01  0x028  6    444004265414  ---  Logical Sectors Read
0x01  0x030  6       448331816  ---  Number of Read Commands
0x01  0x038  6     20846500900  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            5586  ---  Spindle Motor Power-on Hours
0x03  0x010  4            5586  ---  Head Flying Hours
0x03  0x018  4            1257  ---  Head Load Events
0x03  0x020  4              25  ---  Number of Reallocated Logical Sectors
0x03  0x028  4         1610568  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4              66  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              18  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              50  ---  Current Temperature
0x05  0x010  1              49  N--  Average Short Term Temperature
0x05  0x018  1              50  N--  Average Long Term Temperature
0x05  0x020  1              61  ---  Highest Temperature
0x05  0x028  1              25  ---  Lowest Temperature
0x05  0x030  1              58  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              53  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4              34  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4              71  ---  Number of Hardware Resets
0x06  0x010  4              28  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           70  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           71  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

I find it odd that it says PASSED. I suppose that means the drive is still mostly OK, but a Reallocated_Sector_Ct of 25 suggests it might fail soon?
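To make the PASSED versus sector-count question concrete, here is a minimal Python sketch that reads a few attribute rows copied from the output above and separates two things: raw counters that are non-zero, and normalized values that have dropped to their threshold. The choice of watched attribute IDs (5, 196, 197, 198) and the function names are assumptions made for illustration, and the idea that the overall assessment tracks thresholds rather than raw counts is my understanding, not something stated in the output.

```python
import re

# Sample rows copied from the smartctl -x output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    25
196 Reallocated_Event_Count -O--CK   100   100   000    -    25
197 Current_Pending_Sector  -O---K   100   100   000    -    24
198 Offline_Uncorrectable   ---R--   100   100   000    -    28
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
"""

# Attribute IDs treated as sector-health indicators (an assumption of this sketch).
WATCHED = {5, 196, 197, 198}

# ID, name, flags, value, worst, thresh, fail, raw (leading digits only)
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return {id: {...}} for each attribute row in smartctl -x style output."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attr_id, name, _flags, value, _worst, thresh, _fail, raw = m.groups()
            attrs[int(attr_id)] = {
                "name": name,
                "value": int(value),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return attrs


def sector_warnings(attrs):
    """List watched attributes whose raw count is non-zero."""
    return [
        (attr_id, a["name"], a["raw"])
        for attr_id, a in sorted(attrs.items())
        if attr_id in WATCHED and a["raw"] > 0
    ]


def crossed_threshold(attrs):
    """IDs whose normalized value is at or below a non-zero threshold."""
    return [
        attr_id
        for attr_id, a in sorted(attrs.items())
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    ]


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    for attr_id, name, raw in sector_warnings(attrs):
        print(f"{attr_id:>3} {name}: raw={raw}")
    print("threshold failures:", crossed_threshold(attrs))
```

On these rows the sketch reports four non-zero sector counters and no threshold failures, which would be consistent with a PASSED result sitting alongside the two warning emails.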

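I've never had any errors like this before (in 10 years of running several dedicated boxes), and I don't want to mess this up. What should I do now?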

This is a Hetzner storage box with 10 6TB drives, which I have set up in RAID 6 for about 48TB of space.
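As a quick check of that figure: the smartctl output above reports 6.00 TB per drive, and RAID 6 spends two drives' worth of space on parity. A small Python sketch of the arithmetic (the function name is hypothetical, and filesystem and metadata overhead are ignored):

```python
def raid6_usable_tb(num_drives, drive_tb):
    """Usable capacity of a RAID 6 array: two drives' worth of space holds parity."""
    if num_drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (num_drives - 2) * drive_tb


if __name__ == "__main__":
    # 10 drives of 6 TB each, as described in the question
    print(raid6_usable_tb(10, 6))  # prints 48
```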

Is it best to just message them and let them handle this for me, or should I do something first?

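I already have a backup that runs every night to a Google cloud service. I would rather not have to use it, as I know restoring from it would be slow.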

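I have logged in to Hetzner's support system, and for drive failures they give this advice (this is shown before submitting the ticket, which I haven't done yet):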

Please give us the following information so that we can better understand your HDD/SSD defect:

  * Serial number(s) of the defective and/or intact HDD(s)/SSD(s)
  * Evidence of the defect (entire SMART log, less than one week old)

The following assistance is available:

  * [Instructions on establishing the serial numbers as well as information on defective HDDs/SSDs][1]
  * [Instructions on exchanging an HDD/SSD with software RAID][1]
  * [Instructions on creating a complete SMART log][1]

Please note that we can only exchange your defective drive for an empty drive. We do not carry out any data exchange or backups.

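The last line in particular caught my eye. What happens when they put the empty drive in? What do I need to do to rebuild it, or does that happen automatically?

Thanks!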

hard-drive
1 Answer

  1. Best Answer
    bodgit
    2018-12-14T06:06:21+08:00

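    So it looks like you probably have a failing disk. RAID6 lets you survive two disk failures, so if this disk dies your data should still be intact. The whole point of using RAID is that you can usually rebuild the contents of one failed (and subsequently replaced) disk from the data written to the remaining healthy disks in the array.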
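The idea of rebuilding a lost disk from the survivors can be illustrated with a toy example. The Python sketch below uses plain XOR parity, which is the single-parity scheme (RAID 5 style); it is a simplification and not the two-syndrome scheme that RAID 6 actually uses to survive a second failure. The block contents and function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    # Hypothetical contents of one stripe spread over three data disks
    data = [b"disk-A!!", b"disk-B!!", b"disk-C!!"]
    parity = xor_blocks(data)

    # Pretend the second disk died and was replaced with an empty one
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered == data[1])  # prints True
```

A real rebuild does the same kind of reconstruction for every stripe on the array, which is why it can take many hours on large disks.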

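    If/when the disk is replaced, you will need to initialise it and add it to the existing array so that it can be used. How you do that depends on how the array was built. If it is software RAID, you will probably need to create a single partition of the correct type and then add it to the array with mdadm, at which point the array should repair/rebuild itself. Looking at the partitioning of one of the existing disks in the array will show you what you need to replicate.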
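For the software RAID case, the steps above might look roughly like the command sequence below. This is a sketch that only builds and prints the commands and never runs them. It assumes an md array at /dev/md0, GPT-partitioned disks with one RAID partition each, a healthy member disk at /dev/sda, and the replacement appearing as /dev/sdb again. All of those names are hypothetical and need checking against cat /proc/mdstat and the real partition layout first.

```python
import shlex


def replacement_plan(array, old_member, healthy_disk, new_disk, new_member):
    """Build (but do not run) the commands for swapping one md array member."""
    return [
        # Before the physical swap: mark the failing member faulty, then remove it.
        ["mdadm", "--manage", array, "--fail", old_member],
        ["mdadm", "--manage", array, "--remove", old_member],
        # After the swap: replicate the GPT layout of the healthy disk onto the
        # new disk. Direction matters: the --replicate argument is the TARGET.
        ["sgdisk", f"--replicate={new_disk}", healthy_disk],
        # Give the copied table fresh GUIDs so the two disks do not clash.
        ["sgdisk", "--randomize-guids", new_disk],
        # Add the new partition; md should then start rebuilding onto it.
        ["mdadm", "--manage", array, "--add", new_member],
        # Rebuild progress is visible here.
        ["cat", "/proc/mdstat"],
    ]


if __name__ == "__main__":
    # All device names below are assumptions for illustration only.
    plan = replacement_plan(
        array="/dev/md0",
        old_member="/dev/sdb1",
        healthy_disk="/dev/sda",
        new_disk="/dev/sdb",
        new_member="/dev/sdb1",
    )
    for cmd in plan:
        print(shlex.join(cmd))
```

Printing the plan instead of executing it is deliberate: pointing the partition-table copy at the wrong disk would overwrite a healthy member, so each line is worth reading before it is typed.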

    This page should show you the basics of what you need to do: https://www.thegeekdiary.com/replacing-a-failed-mirror-disk-in-a-software-raid-array-mdadm/

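    Hardware RAID will usually have its own vendor-specific way of doing the same thing, but I am assuming you are using software RAID since you can still see the individual disks.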

