AskOverflow.Dev


Questions tagged [drive-failure] (server)

Lucio Crusca
Asked: 2022-01-13 12:37:38 +0800 CST

What is the normal order of messages in dmesg when a USB HDD fails?

  • 0

I have a USB hard disk attached to a Debian GNU/Linux server. I am trying to format it (NTFS) with the following command:

# mkntfs -v /dev/sdd1

This takes several hours, because it also checks the disk. While it was checking, dmesg -T showed the following:

[Wed Jan 12 15:22:53 2022] sd 9:0:0:0: [sdd] Attached SCSI disk
[Wed Jan 12 18:03:26 2022] usb 1-4: USB disconnect, device number 5
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621745808 op 0x1:(WRITE) flags 0x104000 phys_seg 240 prio class 0
[Wed Jan 12 18:03:26 2022] Buffer I/O error on dev sdd1, logical block 621743760, lost async page write
[Wed Jan 12 18:03:26 2022] Buffer I/O error on dev sdd1, logical block 621743761, lost async page write
   (...and so on for a few lines, then)
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746048 op 0x1:(WRITE) flags 0x104000 phys_seg 240 prio class 0
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746288 op 0x1:(WRITE) flags 0x100000 phys_seg 8 prio class 0
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746296 op 0x1:(WRITE) flags 0x800 phys_seg 16 prio class 0
   (...and so on for a few lines, then)
[Wed Jan 12 18:03:31 2022] buffer_io_error: 9015384 callbacks suppressed
   (...other errors...)

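Looking at the sheer number of error messages, I would say the HDD is almost dead, but it seems to work when attached to a Windows PC. Besides, the first error that appears in dmesg, before all the other errors, is usb 1-4: USB disconnect, device number 5.

However, I am not very experienced with dmesg output, so I may well be reading it wrong.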
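To check the ordering mechanically rather than by eye, the dmesg -T output can be scanned for the first USB disconnect and the first I/O error. This is a minimal Python sketch, assuming the timestamp format shown above. The function name first_events is made up for illustration. Because dmesg -T timestamps have one-second resolution, the sketch also returns the line index, which is what actually settles the order when both events share a timestamp. It only reports the ordering and does not decide whether the disk or the USB link is at fault.

```python
import re
from datetime import datetime

# Matches the "[Wed Jan 12 18:03:26 2022] message" prefix printed by `dmesg -T`.
LINE_RE = re.compile(
    r"^\[(?P<ts>[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d \d{4})\] (?P<msg>.*)$"
)


def first_events(lines):
    """Return {'disconnect': (index, time), 'io_error': (index, time)} for the first hits."""
    found = {}
    for idx, line in enumerate(lines):
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines or elided "(...and so on...)" markers
        ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        msg = m.group("msg")
        if "USB disconnect" in msg:
            found.setdefault("disconnect", (idx, ts))
        elif "I/O error" in msg:
            found.setdefault("io_error", (idx, ts))
    return found


if __name__ == "__main__":
    sample = [
        "[Wed Jan 12 15:22:53 2022] sd 9:0:0:0: [sdd] Attached SCSI disk",
        "[Wed Jan 12 18:03:26 2022] usb 1-4: USB disconnect, device number 5",
        "[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621745808 op 0x1:(WRITE) flags 0x104000 phys_seg 240 prio class 0",
    ]
    for name, (idx, ts) in first_events(sample).items():
        print(name, "line", idx, ts.isoformat())
```

On the excerpt above, the disconnect and the first I/O error carry the same timestamp, and the disconnect comes first by line position.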

Edit: as requested by NiKiZe, here is the output of smartctl -a /dev/sdd:

# smartctl -a /dev/sdd
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-3-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile
Device Model:     WDC WD10SPCX-00KHST0
Serial Number:    WD-WXF1A95F0J3X
LU WWN Device Id: 5 0014ee 65b7e0332
Firmware Version: 01.01A01
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Thu Jan 13 11:04:19 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (16080) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 184) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x7035) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   190   184   021    Pre-fail  Always       -       1500
  4 Start_Stop_Count        0x0032   081   081   000    Old_age   Always       -       19048
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       20415
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       188
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   187   187   000    Old_age   Always       -       41054
194 Temperature_Celsius     0x0022   119   095   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

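Assuming this HDD really is failing, and the error messages in dmesg are about genuine bad sectors, why does dmesg show the disconnect before the bad-sector messages rather than after them?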
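One way to test the bad-sector assumption against the smartctl output above is to pull out the raw values of the attributes that usually move when sectors go bad or the link corrupts data. This is a rough Python sketch. The choice of watched attribute IDs (5, 197, 198, 199) is an illustrative assumption, not an official list, and the function name watched_raw_values is hypothetical.

```python
import re

# Attributes often watched for media or link trouble (illustrative selection).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def watched_raw_values(smartctl_text):
    """Map attribute ID -> leading integer of RAW_VALUE for the watched attributes."""
    result = {}
    for line in smartctl_text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue
        attr_id = int(m.group(1))
        if attr_id in WATCHED:
            raw = re.match(r"\d+", m.group(9))
            if raw:
                result[attr_id] = int(raw.group())
    return result


if __name__ == "__main__":
    text = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""
    for attr_id, raw in watched_raw_values(text).items():
        print(attr_id, WATCHED[attr_id], raw)
```

All four watched attributes read 0 in the table above, so the SMART data on its own does not support the bad-sector reading of the dmesg errors.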

drive-failure
  • 0 answers
  • 260 Views
mike
Asked: 2021-08-09 01:26:03 +0800 CST

Disk problem: irq_stat 0x20000000, host bus error

  • 0

When copying large files (50+ GB) from an NVMe disk to a SATA 7200 rpm HDD, I see the following errors in the logs of a fully patched Ubuntu 20.04:

Aug 08 00:45:59 host kernel: ata6.00: exception Emask 0x20 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 08 00:45:59 host kernel: ata6.00: irq_stat 0x20000000, host bus error
Aug 08 00:45:59 host kernel: ata6.00: failed command: WRITE DMA EXT
Aug 08 00:45:59 host kernel: ata6.00: cmd 35/00:08:30:a2:e0/00:00:e8:00:00/e0 tag 23 dma 4096 out
                                    res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x20 (host bus error)
Aug 08 00:45:59 host kernel: ata6.00: status: { DRDY }
Aug 08 00:45:59 host kernel: ata6: hard resetting link
Aug 08 00:46:00 host kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 08 00:46:00 host kernel: ata6.00: configured for UDMA/133
Aug 08 00:46:00 host kernel: ata6: EH complete

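ata6.00 is the disk being written to.
The problem is intermittent. Sometimes it does not occur for 24 hours, sometimes it occurs several times an hour. Usually the disk recovers, but sometimes the file system gets corrupted and has to be unmounted, repaired (if possible), and remounted.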
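For reading the Emask field in the log above, the value is a bitmask. This is a small Python sketch that expands it into names. The bit assignments are transcribed from the AC_ERR_* flags in the Linux kernel header include/linux/libata.h as I remember them, so treat them as an assumption and check them against your kernel source. The helper name decode_emask is made up for illustration.

```python
# Bit -> meaning, after the AC_ERR_* flags in include/linux/libata.h
# (transcribed from memory; verify against the kernel version in use).
AC_ERR_FLAGS = [
    (1 << 0, "device error"),
    (1 << 1, "host state machine violation"),
    (1 << 2, "timeout"),
    (1 << 3, "media error"),
    (1 << 4, "ATA bus error"),
    (1 << 5, "host bus error"),
    (1 << 6, "system error"),
    (1 << 7, "invalid argument"),
    (1 << 8, "unknown error"),
    (1 << 9, "no-device hint"),
    (1 << 10, "NCQ command failed"),
]


def decode_emask(emask):
    """Return the names of all error bits set in a libata Emask value."""
    return [name for bit, name in AC_ERR_FLAGS if emask & bit]


if __name__ == "__main__":
    print(decode_emask(0x20))
```

For Emask 0x20 only the host-bus bit is set, which agrees with the "(host bus error)" text the kernel prints next to it. The media-error and device-error bits are clear.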

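What I have tried:

  1. I tried 3 different brands of hard drive. All of them show the same problem.
  2. I suspected a hardware problem, so I replaced the motherboard and the SATA cable. Neither helped.
  3. I have another server with the same configuration. The problem does not occur there. Same workload.
  4. I have yet another server with a completely different configuration (Intel vs. AMD). The problem does occur there. Same workload.
  5. I disabled NCQ with echo 1 > /sys/block/sda/device/queue_depth. It did not help.

I am out of ideas...
These are all data-center-grade components. Given the steps I have taken, I assume this is not a hardware manufacturing defect.
Could this be software/OS/BIOS related?
Any ideas what else I should try?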

hard-drive ubuntu sata drive-failure
  • 2 answers
  • 205 Views
Mike Texter
Asked: 2020-08-08 09:06:17 +0800 CST

Reset SMART (power-on hours) on an SSD

  • 1

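A known issue with SanDisk SSDs (Dell or HPE branded) is plaguing us: they hard-fail after a certain number of power-on hours, 32768 or 40000 depending on the specific model. Is there a reliable way to roll back this SMART attribute so that we can update the firmware on these drives and get them running again? We have many tools at our disposal, but as far as we know none of them can do this.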
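For drives that have not reached the limit yet, the remaining headroom is simple arithmetic on the power-on-hours counter. This Python sketch uses the two thresholds from the question. The sample counter reading of 30000 hours is invented for illustration, and the function names are hypothetical. It does not touch the SMART attribute itself.

```python
def hours_remaining(power_on_hours, failure_threshold):
    """Hours of operation left before the counter reaches the threshold (0 if already past)."""
    return max(failure_threshold - power_on_hours, 0)


def days_remaining(power_on_hours, failure_threshold):
    """Same headroom expressed in days of continuous operation."""
    return hours_remaining(power_on_hours, failure_threshold) / 24


if __name__ == "__main__":
    # 30000 is a made-up reading; 32768 and 40000 are the thresholds named in the question.
    for threshold in (32768, 40000):
        print(threshold, hours_remaining(30000, threshold), round(days_remaining(30000, threshold), 1))
```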

ssd smart drive-failure
  • 1 answer
  • 162 Views
stormdrain
Asked: 2017-01-22 05:42:10 +0800 CST

Dell r510 esxi raid 10 disk failure: lost configuration

  • 0

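Yesterday a server went offline, complaining about a disk failure. I reseated the disks in the unit and all the lights turned green. I rebooted the machine, only to find that it now fails to boot into esxi.

It complained about a foreign configuration, with one of the disks in the rebuild state. Following the Dell forums, I "cleared" the foreign configuration and rebooted. Now all disks are in the "ready" state, but the machine still fails to boot (vd 0 not found).

I don't believe I did anything that would lose data? I was under the impression that the raid setup was there to help with disk failures.

Any tips or advice on how to get the machine booting into esxi again would be greatly appreciated. I would also be happy to find a way to get at the virtual machines on this machine so that I can move them to another machine.

Thanks.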

raid vmware-esxi dell-poweredge dell-perc drive-failure
  • 1 answer
  • 346 Views
smartenbergen
Asked: 2016-12-17 09:56:12 +0800 CST

Linux software RAID becomes unresponsive after removing a disk from the server

  • 2

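I'm running a CentOS 7 machine (standard kernel: 3.10.0-327.36.3.el7.x86_64) with a software RAID-10 over 16x 1 TB SSDs (more precisely, there are two RAID arrays over the disks; one of the arrays provides the host's swap partition). Last week, an SSD failed: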

13:18:07 kvm7 kernel: sd 1:0:2:0: attempting task abort! scmd(ffff887e57b916c0)
13:18:07 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
13:18:07 kvm7 kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
13:18:07 kvm7 kernel: scsi target1:0:2: enclosure_logical_id(0x500304801c14a001), slot(2)
13:18:10 kvm7 kernel: sd 1:0:2:0: task abort: SUCCESS scmd(ffff887e57b916c0)
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Sense Key : Not Ready [current] 
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Add. Sense: Logical unit not ready, cause not reportable
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
13:18:11 kvm7 kernel: md: super_written gets error=-5, uptodate=0
13:18:11 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
13:19:27 kvm7 kernel: sd 1:0:2:0: device_blocked, handle(0x000b)
13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronizing SCSI cache
13:19:29 kvm7 kernel: md: md3 still in use.
13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
13:19:29 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
13:19:29 kvm7 kernel: md: md2 still in use.
13:19:29 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
13:19:29 kvm7 kernel: md: unbind<sdk3>
13:19:29 kvm7 kernel: md: export_rdev(sdk3)
13:19:29 kvm7 kernel: md: unbind<sdk2>
13:19:29 kvm7 kernel: md: export_rdev(sdk2)

/proc/mdstat looked as expected (1 faulty member) and the virtual machines kept running without any problems.

md3 : active raid10 sdp3[15] sdb3[2] sdg3[12] sde3[8] sdn3[11] sdl3[7] sdm3[9] sdf3[10] sdi3[1] sdk3[5](F) sdc3[4] sdd3[6] sdh3[14] sdo3[13] sda3[0] sdj3[3]
  7844052992 blocks super 1.2 128K chunks 2 near-copies [16/15] [UUUUU_UUUUUUUUUU]
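The faulty member can also be picked out of the mdstat line programmatically instead of by eye. This is a minimal Python sketch, assuming the device-line layout shown above, where each member is written as name[slot] and a failed one carries an (F) suffix. The function name parse_mdstat_members is made up for illustration.

```python
import re

# A member looks like "sdk3[5]" optionally followed by flags such as "(F)".
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)")


def parse_mdstat_members(device_line):
    """Split an mdstat device line into (array name, healthy members, faulty members)."""
    name, _, rest = device_line.partition(" : ")
    healthy, faulty = [], []
    for dev, _slot, flags in MEMBER_RE.findall(rest):
        (faulty if "(F)" in flags else healthy).append(dev)
    return name.strip(), healthy, faulty


if __name__ == "__main__":
    line = ("md3 : active raid10 sdp3[15] sdb3[2] sdg3[12] sde3[8] sdn3[11] sdl3[7] "
            "sdm3[9] sdf3[10] sdi3[1] sdk3[5](F) sdc3[4] sdd3[6] sdh3[14] sdo3[13] "
            "sda3[0] sdj3[3]")
    name, healthy, faulty = parse_mdstat_members(line)
    print(name, len(healthy), faulty)
```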

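Since no 1 TB SSD was available, the failed SSD had to be replaced temporarily with a larger one; so we did that, the rebuild started, and everything was fine. Today the "correct" SSD arrived, so the data center technician simply pulled the tray holding that temporary SSD, and within seconds the system became unresponsive. While the host itself ran fine on a separate RAID array, the virtual machines could not perform any I/O. The load climbed to > 800. I was able to run mdadm --detail /dev/md3, which showed a degraded (but active/clean) array, so from that point of view the system was absolutely fine. I tried to remove the faulty/missing drive from the array, which of course failed ("No such device"), and then suddenly mdadm --detail /dev/md3 no longer produced any output; it just hung, and I had to kill the SSH session to get out of it. After that I decided to force a reboot, since I had no idea how to remove the faulty drive from the array; everything came back up correctly. Of course the RAID is still degraded and needs a resync, but apart from that: no problems.

I am fairly sure I should have marked the drive as faulty through mdadm (--set-faulty) before the tray was pulled from the chassis, although I cannot explain this mdraid behaviour. As I see it, we "simulated" a regular disk outage, so does anyone know what caused this problem, and how I can make sure the next regular disk outage does not cause the same problem?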
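For a planned swap, the usual sequence is to fail the member and then remove it from the array before the tray is pulled. This Python sketch only builds and prints the two mdadm invocations as a dry run and executes nothing. The mdadm(8) manual documents --set-faulty as a synonym for --fail. The array and member names are taken from the logs in this question, and the function name is hypothetical. Whether this would have prevented the hang described here is not something the sketch can show.

```python
import shlex


def mdadm_hot_remove_commands(array, member):
    """Return the mdadm invocations (as argument lists) that fail and then remove a member."""
    return [
        ["mdadm", "--manage", array, "--fail", member],
        ["mdadm", "--manage", array, "--remove", member],
    ]


if __name__ == "__main__":
    # The disk in this question belongs to two arrays, so both partitions are listed.
    for array, member in (("/dev/md2", "/dev/sdk2"), ("/dev/md3", "/dev/sdk3")):
        for cmd in mdadm_hot_remove_commands(array, member):
            print(shlex.join(cmd))  # dry run: print only, nothing is executed
```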

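The kernel logged some messages; what I find interesting is that the new device showed up as sdq, whereas the pulled device was called sdk. So I assume sdk was not properly kicked out of the array. I did not see this behaviour when the original SSD failed last week; back then the replacement drive also came up as sdk.

The logs also show 7 minutes between the failure of the old SSD and the insertion of the new one, so I don't think the problem described at https://superuser.com/questions/942886/fail-device-in-md-raid-when-ata-stops-responding occurred here. The virtual machines also went down immediately, not 7 minutes later. So, any ideas on this? They would be much appreciated :)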

11:45:36 kvm7 kernel: sd 1:0:8:0: device_blocked, handle(0x000b)
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069640
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069648
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069656
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069664
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069672
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069680
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069688
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069696
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069704
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069712
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] CDB: Read(10) 28 00 20 af f7 08 00 00 08 00
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 548402952
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
11:45:37 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
11:45:37 kvm7 kernel: md: md2 still in use.
11:45:37 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133264
11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronizing SCSI cache
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
11:45:37 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
11:45:37 kvm7 kernel: md: unbind<sdk2>
11:45:37 kvm7 kernel: md: export_rdev(sdk2)
11:48:00 kvm7 kernel: INFO: task md3_raid10:1293 blocked for more than 120 seconds.
11:48:00 kvm7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
11:48:00 kvm7 kernel: md3_raid10      D ffff883f26e55c00     0  1293      2 0x00000000
11:48:00 kvm7 kernel: ffff887f24bd7c58 0000000000000046 ffff887f212eb980 ffff887f24bd7fd8
11:48:00 kvm7 kernel: ffff887f24bd7fd8 ffff887f24bd7fd8 ffff887f212eb980 ffff887f23514400
11:48:00 kvm7 kernel: ffff887f235144dc 0000000000000001 ffff887f23514500 ffff8807fa4c4300
11:48:00 kvm7 kernel: Call Trace:
11:48:00 kvm7 kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
11:48:00 kvm7 kernel: [<ffffffffa0104ef7>] freeze_array+0xb7/0x180 [raid10]
11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
11:48:00 kvm7 kernel: [<ffffffffa010880d>] handle_read_error+0x2bd/0x360 [raid10]
11:48:00 kvm7 kernel: [<ffffffff812c7412>] ? generic_make_request+0xe2/0x130
11:48:00 kvm7 kernel: [<ffffffffa0108a1d>] raid10d+0x16d/0x1440 [raid10]
11:48:00 kvm7 kernel: [<ffffffff814bb785>] md_thread+0x155/0x1a0
11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
11:48:00 kvm7 kernel: [<ffffffff814bb630>] ? md_safemode_timeout+0x50/0x50
11:48:00 kvm7 kernel: [<ffffffff810a5b8f>] kthread+0xcf/0xe0
11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
11:48:00 kvm7 kernel: [<ffffffff81646a98>] ret_from_fork+0x58/0x90
11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
11:48:00 kvm7 kernel: INFO: task qemu-kvm:26929 blocked for more than 120 seconds.

[several messages for stuck qemu-kvm processes]

11:52:42 kvm7 kernel: scsi 1:0:9:0: Direct-Access     ATA      KINGSTON SKC400S 001A PQ: 0 ANSI: 6
11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: handle(0x000b), sas_addr(0x4433221102000000), phy(2), device_name(0x4d6b497569a68ba2)
11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: enclosure_logical_id(0x500304801c14a001), slot(2)
11:52:42 kvm7 kernel: scsi 1:0:9:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
11:52:42 kvm7 kernel: scsi 1:0:9:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
11:52:42 kvm7 kernel: sd 1:0:9:0: Attached scsi generic sg10 type 0
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] 2000409264 512-byte logical blocks: (1.02 TB/953 GiB)
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write Protect is off
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write cache: enabled, read cache: enabled, supports DPO and FUA
11:52:42 kvm7 kernel: sdq: unknown partition table
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Attached SCSI disk
io ssd software-raid drive-failure
  • 1 answer
  • 798 Views
Chris
Asked: 2016-12-01 16:34:24 +0800 CST

SMART errors from a previous hard drive, or does it need replacing?

  • 3

Although the disk is fresh (Power_On_Minutes 427h+41m), I ran SMART and it came up with some strange errors.

I'm curious: are these errors from a previous hard drive?

Error 1 occurred at disk power-on lifetime: 13729 hours (572 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle
Error 2 occurred at disk power-on lifetime: 23300 hours (970 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

Here is the output:

# smartctl --all /dev/sda
    smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-51-generic] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Toshiba 2.5" HDD MK..76GSX
    Device Model:     TOSHIBA MK2576GSX
    Serial Number:    Y1J9S0IGS
    LU WWN Device Id: 5 000039 3a5a06b8e
    Firmware Version: GS001A
    User Capacity:    250,059,350,016 bytes [250 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    5400 rpm
    Form Factor:      2.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS (minor revision not indicated)
    SATA Version is:  SATA 2.6, 3.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Thu Dec  1 00:28:22 2016 GMT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (  120) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        (  81) minutes.
    SCT capabilities:              (0x003d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
      3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       1229
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
      5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
      9 Power_On_Minutes        0x0032   036   036   000    Old_age   Always       -       427h+41m
     10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
    193 Load_Cycle_Count        0x0032   070   070   000    Old_age   Always       -       304324
    194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       27 (Min/Max 20/31)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
    220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       109
    222 Loaded_Hours            0x0032   067   067   000    Old_age   Always       -       13230
    223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
    224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
    226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       375
    240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

    SMART Error Log Version: 1
    ATA Error Count: 2
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 2 occurred at disk power-on lifetime: 23300 hours (970 days + 20 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 01 1f 7a 05 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00057a1f = 358943

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      35 00 00 20 76 05 e0 00   6d+01:49:26.915  WRITE DMA EXT
      35 00 00 00 72 05 e0 00   6d+01:49:26.741  WRITE DMA EXT
      35 00 08 80 0f 0c e0 00   6d+01:49:26.741  WRITE DMA EXT
      35 00 08 48 8a c4 e0 00   6d+01:49:26.741  WRITE DMA EXT
      ca 00 08 00 08 14 e9 00   6d+01:49:26.741  WRITE DMA

    Error 1 occurred at disk power-on lifetime: 13729 hours (572 days + 1 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 01 3f 8c 4e e0  Error: ICRC, ABRT 1 sectors at LBA = 0x004e8c3f = 5147711

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      35 00 00 40 88 4e e0 00  12d+21:23:20.732  WRITE DMA EXT
      ca 00 08 a8 48 c8 e3 00  12d+21:23:20.731  WRITE DMA
      35 00 08 40 c1 1d e0 00  12d+21:23:20.731  WRITE DMA EXT
      35 00 08 b0 19 14 e0 00  12d+21:23:20.731  WRITE DMA EXT
      35 00 10 28 bf 13 e0 00  12d+21:23:20.731  WRITE DMA EXT

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

Is this hard drive likely to fail soon and need replacing?
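The mismatch behind the question can be made explicit by comparing the lifetime hours recorded in the error log with the current power-on counter. This is a small Python sketch using the figures from the output above. The function names are made up for illustration. It only flags the inconsistency. Whether the cause is a reset counter, a wrapped value, or a unit mismatch in how the raw value is interpreted is not something it can decide.

```python
import re


def power_on_hours(raw_value):
    """Parse a RAW_VALUE such as '427h+41m' or '20415' into whole hours."""
    m = re.match(r"(\d+)(?:h\+(\d+)m)?$", raw_value.strip())
    if not m:
        raise ValueError(f"unrecognised power-on value: {raw_value!r}")
    return int(m.group(1))


def errors_beyond_counter(error_hours, current_hours):
    """Return the logged error lifetimes that lie after the current power-on counter."""
    return [h for h in error_hours if h > current_hours]


if __name__ == "__main__":
    current = power_on_hours("427h+41m")          # attribute 9 in the table above
    suspicious = errors_beyond_counter([13729, 23300], current)
    print(current, suspicious)
```

Both logged errors sit at lifetimes far beyond the 427 hours the counter currently reports. Attribute 222 (Loaded_Hours) in the same table has a raw value of 13230, which is also larger than the reported power-on time.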

linux smart drive-failure
  • 2 answers
  • 1287 Views
nordashi
Asked: 2016-11-10 13:13:08 +0800 CST

Possible disk failure. Should I worry?

  • 0

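I think our hard drive is failing. Before I dump all the relevant logs: don't worry, everything is backed up daily and the hard drive can be replaced. I just need my suspicion confirmed. The drive is in a RAID, so there is another disk, and (as you can see) this server has been online 24/7 for the past 2+ years.

Error 348, which smartctl reports at 19677h, matches the errors in syslog. The same goes for 349-351.

This drive is dying, right?

Logs: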

Nov  9 02:50:04 XXXXX kernel: [21716010.818934] ata1.00: exception Emask 0x0 SAct 0x3ffff SErr 0x0 action 0x0
Nov  9 02:50:04 XXXXX kernel: [21716010.862259] ata1.00: irq_stat 0x40000008
Nov  9 02:50:04 XXXXX kernel: [21716010.904504] ata1.00: failed command: READ FPDMA QUEUED
Nov  9 02:50:04 XXXXX kernel: [21716010.946338] ata1.00: cmd 60/80:00:00:84:49/00:00:0b:00:00/40 tag 0 ncq 65536 in
Nov  9 02:50:04 XXXXX kernel: [21716010.946338]          res 41/40:80:78:84:49/00:00:0b:00:00/00 Emask 0x409 (media error) <F>
Nov  9 02:50:04 XXXXX kernel: [21716011.110653] ata1.00: status: { DRDY ERR }
Nov  9 02:50:04 XXXXX kernel: [21716011.151079] ata1.00: error: { UNC }
Nov  9 02:50:04 XXXXX kernel: [21716011.324910] ata1.00: configured for UDMA/133
Nov  9 02:50:04 XXXXX kernel: [21716011.324938] sd 0:0:0:0: [sda] Unhandled sense code
Nov  9 02:50:04 XXXXX kernel: [21716011.324941] sd 0:0:0:0: [sda]  
Nov  9 02:50:04 XXXXX kernel: [21716011.324943] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov  9 02:50:04 XXXXX kernel: [21716011.324944] sd 0:0:0:0: [sda]  
Nov  9 02:50:04 XXXXX kernel: [21716011.324946] Sense Key : Medium Error [current] [descriptor]
Nov  9 02:50:04 XXXXX kernel: [21716011.324949] Descriptor sense data with sense descriptors (in hex):
Nov  9 02:50:04 XXXXX kernel: [21716011.324951]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov  9 02:50:04 XXXXX kernel: [21716011.324957]         0b 49 84 78 
Nov  9 02:50:04 XXXXX kernel: [21716011.324960] sd 0:0:0:0: [sda]  
Nov  9 02:50:04 XXXXX kernel: [21716011.324963] Add. Sense: Unrecovered read error - auto reallocate failed
Nov  9 02:50:04 XXXXX kernel: [21716011.324965] sd 0:0:0:0: [sda] CDB: 
Nov  9 02:50:04 XXXXX kernel: [21716011.324967] Read(10): 28 00 0b 49 84 00 00 00 80 00
Nov  9 02:50:04 XXXXX kernel: [21716011.324973] end_request: I/O error, dev sda, sector 189367416
Nov  9 02:50:04 XXXXX kernel: [21716011.364405] ata1: EH complete
Nov  9 02:50:24 XXXXX kernel: [21716031.325428] ata1.00: exception Emask 0x0 SAct 0x1e000 SErr 0x0 action 0x0
Nov  9 02:50:24 XXXXX kernel: [21716031.367450] ata1.00: irq_stat 0x40000008
Nov  9 02:50:24 XXXXX kernel: [21716031.408335] ata1.00: failed command: READ FPDMA QUEUED
Nov  9 02:50:24 XXXXX kernel: [21716031.448762] ata1.00: cmd 60/80:68:00:86:49/00:00:0b:00:00/40 tag 13 ncq 65536 in
Nov  9 02:50:24 XXXXX kernel: [21716031.448762]          res 41/40:80:18:86:49/00:00:0b:00:00/00 Emask 0x409 (media error) <F>
Nov  9 02:50:24 XXXXX kernel: [21716031.607084] ata1.00: status: { DRDY ERR }
Nov  9 02:50:24 XXXXX kernel: [21716031.645924] ata1.00: error: { UNC }
Nov  9 02:50:25 XXXXX kernel: [21716031.967600] ata1.00: configured for UDMA/133
Nov  9 02:50:25 XXXXX kernel: [21716031.967616] sd 0:0:0:0: [sda] Unhandled sense code
Nov  9 02:50:25 XXXXX kernel: [21716031.967619] sd 0:0:0:0: [sda]  
Nov  9 02:50:25 XXXXX kernel: [21716031.967621] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov  9 02:50:25 XXXXX kernel: [21716031.967624] sd 0:0:0:0: [sda]  
Nov  9 02:50:25 XXXXX kernel: [21716031.967626] Sense Key : Medium Error [current] [descriptor]
Nov  9 02:50:25 XXXXX kernel: [21716031.967629] Descriptor sense data with sense descriptors (in hex):
Nov  9 02:50:25 XXXXX kernel: [21716031.967631]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov  9 02:50:25 XXXXX kernel: [21716031.967640]         0b 49 86 18 
Nov  9 02:50:25 XXXXX kernel: [21716031.967644] sd 0:0:0:0: [sda]  
Nov  9 02:50:25 XXXXX kernel: [21716031.967647] Add. Sense: Unrecovered read error - auto reallocate failed
Nov  9 02:50:25 XXXXX kernel: [21716031.967650] sd 0:0:0:0: [sda] CDB: 
Nov  9 02:50:25 XXXXX kernel: [21716031.967652] Read(10): 28 00 0b 49 86 00 00 00 80 00
Nov  9 02:50:25 XXXXX kernel: [21716031.967660] end_request: I/O error, dev sda, sector 189367832
Nov  9 02:50:25 XXXXX kernel: [21716032.005452] ata1: EH complete
Nov  9 02:50:41 XXXXX kernel: [21716048.237709] ata1.00: exception Emask 0x0 SAct 0x380 SErr 0x0 action 0x0
Nov  9 02:50:41 XXXXX kernel: [21716048.272985] ata1.00: irq_stat 0x40000008
Nov  9 02:50:41 XXXXX kernel: [21716048.307238] ata1.00: failed command: READ FPDMA QUEUED
Nov  9 02:50:41 XXXXX kernel: [21716048.341080] ata1.00: cmd 60/00:38:00:92:49/04:00:0b:00:00/40 tag 7 ncq 524288 in
Nov  9 02:50:41 XXXXX kernel: [21716048.341080]          res 41/40:00:08:95:49/00:04:0b:00:00/00 Emask 0x409 (media error) <F>
Nov  9 02:50:41 XXXXX kernel: [21716048.473214] ata1.00: status: { DRDY ERR }
Nov  9 02:50:41 XXXXX kernel: [21716048.505636] ata1.00: error: { UNC }
Nov  9 02:50:41 XXXXX kernel: [21716048.572423] ata1.00: configured for UDMA/133
Nov  9 02:50:41 XXXXX kernel: [21716048.572453] sd 0:0:0:0: [sda] Unhandled sense code
Nov  9 02:50:41 XXXXX kernel: [21716048.572457] sd 0:0:0:0: [sda]  
Nov  9 02:50:41 XXXXX kernel: [21716048.572459] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov  9 02:50:41 XXXXX kernel: [21716048.572462] sd 0:0:0:0: [sda]  
Nov  9 02:50:41 XXXXX kernel: [21716048.572464] Sense Key : Medium Error [current] [descriptor]
Nov  9 02:50:41 XXXXX kernel: [21716048.572467] Descriptor sense data with sense descriptors (in hex):
Nov  9 02:50:41 XXXXX kernel: [21716048.572469]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov  9 02:50:41 XXXXX kernel: [21716048.572478]         0b 49 95 08 
Nov  9 02:50:41 XXXXX kernel: [21716048.572482] sd 0:0:0:0: [sda]  
Nov  9 02:50:41 XXXXX kernel: [21716048.572485] Add. Sense: Unrecovered read error - auto reallocate failed
Nov  9 02:50:41 XXXXX kernel: [21716048.572488] sd 0:0:0:0: [sda] CDB: 
Nov  9 02:50:41 XXXXX kernel: [21716048.572489] Read(10): 28 00 0b 49 92 00 00 04 00 00
Nov  9 02:50:41 XXXXX kernel: [21716048.572498] end_request: I/O error, dev sda, sector 189371656
Nov  9 02:50:41 XXXXX kernel: [21716048.603930] ata1: EH complete
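
For anyone reading the hex dump in the log above, here is a minimal sketch of how those bytes appear to map onto the decoded messages. It assumes the standard SCSI descriptor-format sense layout (response code 0x72, sense key in byte 1, ASC/ASCQ in bytes 2 and 3, then an Information descriptor carrying the LBA). The function name and the tiny lookup tables are made up for illustration and cover only the codes seen in this log.

```python
# Sense bytes copied from the "Descriptor sense data" lines of the log above.
SENSE_HEX = "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 0b 49 95 08"

# Hypothetical lookup tables: only the codes that occur in this log.
SENSE_KEYS = {0x03: "Medium Error"}
ADDITIONAL_SENSE = {(0x11, 0x04): "Unrecovered read error - auto reallocate failed"}


def decode_descriptor_sense(hex_dump):
    """Decode the header and Information descriptor of descriptor-format sense data."""
    data = bytes.fromhex(hex_dump)
    if data[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")

    sense_key = data[1] & 0x0F          # low nibble of byte 1
    asc, ascq = data[2], data[3]        # additional sense code and qualifier
    additional_length = data[7]         # number of descriptor bytes that follow

    lba = None
    pos = 8
    end = min(len(data), 8 + additional_length)
    while pos + 2 <= end:
        desc_type, desc_len = data[pos], data[pos + 1]
        # Type 0x00 is the Information descriptor; bit 7 of its third byte is VALID.
        if desc_type == 0x00 and desc_len == 0x0A and data[pos + 2] & 0x80:
            lba = int.from_bytes(data[pos + 4:pos + 12], "big")
        pos += 2 + desc_len

    return {
        "sense_key": SENSE_KEYS.get(sense_key, hex(sense_key)),
        "additional_sense": ADDITIONAL_SENSE.get((asc, ascq), f"{asc:02x}/{ascq:02x}"),
        "lba": lba,
    }


result = decode_descriptor_sense(SENSE_HEX)
print(result["sense_key"], "/", result["additional_sense"], "/ LBA", result["lba"])
```

The decoded LBA is 189371656, which matches the sector reported by the `end_request: I/O error` line that follows the dump.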

After seeing these errors, I ran smartctl -a /dev/sda. Here is the output:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-36-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST1500DM003-9YN16G
Serial Number:    W2F0B7ZT
LU WWN Device Id: 5 000c50 052c400c7
Firmware Version: CC4C
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Nov  9 21:38:42 2016 CET

==> WARNING: A firmware update for this drive is available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  625) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 209) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   107   091   006    Pre-fail  Always       -       13530088
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       169205210
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19694
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       543
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       5 5 5
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   053   041   045    Old_age   Always   In_the_past 47 (7 6 59 27 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   047   059   000    Old_age   Always       -       47 (0 27 0 0 0)
197 Current_Pending_Sector  0x0012   100   099   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   099   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       18785h+58m+38.071s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       114597464908825
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       161059093725236

SMART Error Log Version: 1
ATA Error Count: 351 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 351 occurred at disk power-on lifetime: 19688 hours (820 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00  24d+23:45:23.550  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  24d+23:45:23.528  WRITE FPDMA QUEUED
  61 00 28 ff ff ff 4f 00  24d+23:45:23.528  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  24d+23:45:23.516  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  24d+23:45:23.516  WRITE FPDMA QUEUED

Error 350 occurred at disk power-on lifetime: 19688 hours (820 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 58 ff ff ff 4f 00  24d+23:45:18.898  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+23:45:18.299  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+23:45:18.299  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+23:45:18.299  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+23:45:18.298  READ FPDMA QUEUED

Error 349 occurred at disk power-on lifetime: 19688 hours (820 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  24d+23:45:15.382  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00  24d+23:45:14.302  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  24d+23:45:14.300  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  24d+23:45:14.297  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  24d+23:45:14.295  READ FPDMA QUEUED

Error 348 occurred at disk power-on lifetime: 19677 hours (819 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  24d+13:00:41.477  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+13:00:41.477  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+13:00:41.477  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+13:00:41.476  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  24d+13:00:41.476  READ FPDMA QUEUED

Error 347 occurred at disk power-on lifetime: 19675 hours (819 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 08 95 49 0b  Error: WP at LBA = 0x0b499508 = 189371656

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00  24d+11:11:11.868  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  24d+11:11:11.867  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  24d+11:11:11.867  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  24d+11:11:11.867  WRITE FPDMA QUEUED
  61 00 18 ff ff ff 4f 00  24d+11:11:11.867  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1291         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
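
To relate the SMART error log entries above to the kernel log, here is a rough sketch of how the register columns seem to combine into the LBA that smartctl prints. It assumes 28-bit addressing, where SN, CL and CH hold the low three bytes and the low nibble of DH holds bits 27:24. The helper names are hypothetical.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH register bytes into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba_from_register_line(line):
    """Parse a register line ("ER ST SC SN CL CH DH ...") from the error log."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return lba28_from_registers(sn, cl, ch, dh)


# Register lines copied from Error 347 and Error 351 above.
print(lba_from_register_line("40 51 00 08 95 49 0b  Error: WP at LBA = 0x0b499508 = 189371656"))
print(lba_from_register_line("40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455"))
```

Error 347 decodes to 189371656, the same sector as the last `end_request: I/O error` line in the kernel log. The other four entries decode to 0x0fffffff, which is simply all 28 bits set, so it may be a placeholder rather than a real sector address.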
hard-drive smartmontools drive-failure
  • 1 answer
  • 981 Views
Dave Evans
Asked: 2016-10-29 12:33:47 +0800 CST

HP Proliant DL380 G6 - Recovering after a second disk failure during a RAID 1 rebuild

  • 0

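** Disclaimer: I only recently became the administrator of this system and realised that the backups are unusable. The management software is likewise in a poor state. **

The system (Ubuntu 14.04) runs two 146GB 10k SAS drives in RAID 1 (A and B). The chassis is hot-swappable, so the server was, and still is, running throughout this process.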

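  • Failed drive A was replaced with drive C; a flashing green status light confirmed that the array was rebuilding
  • Came back to find C showing solid green (online) but drive B showing solid amber (offline/critical failure)

  • However, input/output errors indicate that large patches of the filesystem were evidently never synced, and the filesystem has reverted to read-only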

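My goal is to determine the root cause of drive B's failure and, if it is something minor (e.g. unreadable block errors), either restart the system with drive B or force the array to rebuild despite the errors. Mainly, I need to work out how to get the array controller to report the failure mode, and how to make it treat the failed drive as a good one.

I only want to recover a few small configuration files to make my life easier when I reinstall.

The server is currently limping along, but if rebooted it will certainly not boot from drive C, since part of /bin/ is missing. Surprisingly, it is still doing its job, as it is only regularly used for dhcp and ssh.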

hp-proliant data-recovery raid1 drive-failure ubuntu-14.04
  • 1 answer
  • 733 Views
user1996496
Asked: 2016-07-05 20:49:10 +0800 CST

Has my RAID drive failed?

  • -3

I checked the output on my server today, and it says:

root@s01 [~]# mdadm --detail /dev/md1
/dev/md1:
        Version : 1.1
  Creation Time : Sat Jul  7 18:23:24 2012
     Raid Level : raid1
     Array Size : 2929751932 (2794.03 GiB 3000.07 GB)
  Used Dev Size : 2929751932 (2794.03 GiB 3000.07 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Jul  4 21:46:45 2016
          State : active, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : server.domain.com:1
           UUID : 58600fc5:5348d92c:a7d25465:20d42940
         Events : 2250926

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       50        1      active sync   /dev/sdd2

       0       8       34        -      faulty   /dev/sdc2
root@s01 [~]#

Does this mean one of the drives has failed? Do I need to replace the drive entirely, or can I run some commands to try to fix this?
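
As a small illustration of how the device table at the bottom of the `mdadm --detail` output above can be read, here is a sketch that pulls out each member's slot, state and device. The regular expression and the helper name are assumptions for this example, written against the table exactly as pasted, and are not part of mdadm itself.

```python
import re

# Device table copied from the mdadm --detail output above.
MDADM_TABLE = """\
    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       50        1      active sync   /dev/sdd2

       0       8       34        -      faulty   /dev/sdc2
"""

# Number, Major, Minor, RaidDevice (a digit or "-"), then state words and optional device.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+|-)\s+(.*\S)\s*$")


def parse_members(text):
    """Return one dict per row of the device table (hypothetical helper)."""
    members = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header or blank line
        tokens = match.group(5).split()
        # The device path, when present, is the last token on the row.
        device = tokens.pop() if tokens[-1].startswith("/dev/") else None
        members.append({
            "raid_device": match.group(4),
            "state": " ".join(tokens),
            "device": device,
        })
    return members


members = parse_members(MDADM_TABLE)
faulty = [m["device"] for m in members if "faulty" in m["state"]]
print("faulty members:", faulty)
```

Run against the table above, this lists /dev/sdc2 as the only faulty member, with slot 0 shown as removed and /dev/sdd2 as the remaining active member.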

performance hard-drive linux raid drive-failure
  • 1 answer
  • 47 Views
