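I have several Hetzner AX and EX series servers (AMD and Intel). I installed CentOS 8 on them and then migrated to CentOS Stream, but every attempt to boot any kernel image from Stream ends in a kernel panic.
Of course, apart from claiming there are no known issues (HA!), Hetzner's suggestions about possible kernel configuration were not very helpful. Since the boot never gets far enough to write any logs, I am pretty helpless.
I have done 10 migrations to Stream on various PCs, and the only problems I have had are with the HZ servers.
Does anyone have any ideas about this?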
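We just received a brand-new dual-CPU server, and it keeps crashing with a kernel panic shortly after boot. This even happens during setup, while the OS is idle. I was able to install the OS and enable mcelog to try to understand what is going on, although I am not sure how to read the output. Reading online led me to think it could be a defective DIMM on one of the sockets (socket 1), but I ran memtest several times and it found no errors. Could this be a software problem? I have tried 2 operating systems and the same thing happens on both, although it is more frequent on Debian/Proxmox than on CentOS.
Server specs:
Dual Intel 8-core Xeon E5-2620v4
2 x DIMM 32GB DDR4 2400MHz RECC
MB Supermicro X10DRL-i
It is not CPU temperature: during memtest and OS installation the temperature never went above 35ºC. I was also able to run some short benchmarks on the CPUs before a crash, and the temperatures were normal.
How can I find out what is happening here? I can access the server for a few minutes before it crashes, and I have downloaded the vmcore dump, but I am not sure what to do with it.
Here is the mce log from a boot that crashed after about 50 seconds: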
[ 56.367615] mce: [Hardware Error]: Machine check events logged
[ 70.420914] mce: [Hardware Error]: Machine check events logged
[ 71.886789] Disabling lock debugging due to kernel taint
[ 71.886894] mce: [Hardware Error]: CPU 24: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.887009] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.887122] mce: [Hardware Error]: TSC 206cc7cd362
[ 71.887184] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 11 microcode b00001d
[ 71.887289] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.889392] mce: [Hardware Error]: CPU 30: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.889489] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.889595] mce: [Hardware Error]: TSC 206cc7cd11d
[ 71.889657] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1d microcode b00001d
[ 71.889760] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.891804] mce: [Hardware Error]: CPU 14: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.891901] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.892007] mce: [Hardware Error]: TSC 206cc7cd10e
[ 71.892068] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1c microcode b00001d
[ 71.892171] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.894217] mce: [Hardware Error]: CPU 13: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.894314] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.894420] mce: [Hardware Error]: TSC 206cc7cd23c
[ 71.894480] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1a microcode b00001d
[ 71.894585] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.896634] mce: [Hardware Error]: CPU 29: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.896730] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.896835] mce: [Hardware Error]: TSC 206cc7cd194
[ 71.896896] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1b microcode b00001d
[ 71.897000] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.899053] mce: [Hardware Error]: CPU 28: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.899150] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.899256] mce: [Hardware Error]: TSC 206cc7cd719
[ 71.899335] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 19 microcode b00001d
[ 71.899438] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.901485] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.901582] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.901687] mce: [Hardware Error]: TSC 206cc7cd720
[ 71.901748] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 18 microcode b00001d
[ 71.901851] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.903934] mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.904031] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.904136] mce: [Hardware Error]: TSC 206cc7cd851
[ 71.904197] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 14 microcode b00001d
[ 71.904300] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.906306] mce: [Hardware Error]: CPU 26: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.906403] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.906508] mce: [Hardware Error]: TSC 206cc7cd863
[ 71.906569] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 15 microcode b00001d
[ 71.909482] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.914367] mce: [Hardware Error]: CPU 11: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.917304] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.920287] mce: [Hardware Error]: TSC 206cc7cd515
[ 71.923159] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 16 microcode b00001d
[ 71.926031] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.930820] mce: [Hardware Error]: CPU 27: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.933685] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.936557] mce: [Hardware Error]: TSC 206cc7cd449
[ 71.939384] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 17 microcode b00001d
[ 71.944180] mce: [Hardware Error]: CPU 9: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.947059] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.949956] mce: [Hardware Error]: TSC 206cc7cd766
[ 71.952786] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 12 microcode b00001d
[ 71.957580] mce: [Hardware Error]: CPU 25: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.960480] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.963366] mce: [Hardware Error]: TSC 206cc7cd751
[ 71.966210] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 13 microcode b00001d
[ 71.971031] mce: [Hardware Error]: CPU 31: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.973919] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.976817] mce: [Hardware Error]: TSC 206cc7cd7f7
[ 71.979690] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1f microcode b00001d
[ 71.984474] mce: [Hardware Error]: CPU 15: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.987371] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.990290] mce: [Hardware Error]: TSC 206cc7cd803
[ 71.993151] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1e microcode b00001d
[ 71.997992] mce: [Hardware Error]: CPU 8: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 72.000918] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 72.003828] mce: [Hardware Error]: TSC 206cc7cd374
[ 72.006692] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 10 microcode b00001d
[ 72.011533] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 72.014436] Kernel panic - not syncing: Fatal machine check
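For readers who, like the poster, are unsure how to read the status word in these lines (fa00004000020e0f), here is a minimal sketch that splits it into its architectural flag bits. The bit positions are the ones given in the machine-check chapter of Intel's Software Developer's Manual; the function name `decode_mci_status` is hypothetical, and the sketch does not interpret the model-specific code, which is what `mcelog --ascii` is meant for.

```python
# Minimal sketch: split an IA32_MCi_STATUS value into architectural fields.
# Only the generic flag bits are named; the model-specific error code
# (MSCOD) is returned raw and not interpreted here.

FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds an error address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Return the flag bits plus the two 16-bit error code fields."""
    fields = {name: bool((status >> bit) & 1) for bit, name in FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF   # model-specific error code
    fields["MCACOD"] = status & 0xFFFF          # architectural MCA error code
    return fields


if __name__ == "__main__":
    # Status value copied from the "Bank 20" lines in the log above.
    for name, value in decode_mci_status(0xFA00004000020E0F).items():
        shown = value if isinstance(value, bool) else f"0x{value:04x}"
        print(f"{name:7s} {shown}")
```

For the value in this log, the PCC bit is set, which agrees with the final "Processor context corrupt" line, and ADDRV is clear, so the bank did not record an error address.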
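This question is similar to "Server won't boot, kernel panic - not syncing".
Background: I edited /etc/selinux/config and changed the line SELINUX=enforcing to SELINUX=disabled. After rebooting, the server ends up in a kernel panic every time...
I have tried all the suggestions I could find on the internet:
selinux=0 or enforcing=0 in the kernel parameters
changing SELINUX=disabled back to SELINUX=enforcing and booting again, which still ends in a kernel panic
adding selinux=0 to /mnt/sysimage/boot/grub/grub.conf in rescue mode
I added kernel.panic = 1 via /etc/sysctl.conf, but every time I hit a kernel panic the machine does not reboot by itself. I need a hard reboot (which means calling the operator at the data center...).
A few days ago, a server I manage panicked after more than 400 days of normal uptime. I rebooted it and it worked for two days or so, then it started reporting "oops: cpu#n stuck for 61s" for various values of n. I rebooted again, and today the original kernel panic came back. The trace is (retyped by hand, so the addresses are omitted):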
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G D 2.6.32-41-server #89-Ubuntu
Call Trace:
<IRQ> panic
oops_end
die
do_general_protection
? consume_skb
general_protection
? put_page
skb_release_data
__kfree_skb
consume_skb
dev_kfree_skb_any
sky2_tx_complete
sky2_status_intr
? __queue_work
sky2_poll
net_rx_action
__do_softirq
? handle_IRQ_event
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
<EOI> ? mwait_idle
? atomic_notifier_call_chain
? cpu_idle
? start_secondary
RIP put_page
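The OS is Ubuntu 10.04.4 x64. Since it had always worked and nothing changed before the panic, I am thinking of a hardware fault. Before the last reboot I ran a full memory test, which passed, and a full fsck just to be sure. Since the panic involves sky2 (the Marvell network controller), could it be a NIC problem? Is there something I am overlooking? Note that between errors everything runs fine (no errors in the logs, no packet loss, no slowdowns).
Thanks for any pointers.
We have a server that has been kernel panicking occasionally for a while, and we think it has a hardware problem. How would you troubleshoot hardware that you cannot physically access? Are there any tools I can use from within the OS itself to diagnose the different parts of the system, to try to figure out what is causing all these panics?
I recently had the hard drive swapped on my colo'd box. I reinstalled Debian Lenny on the new drive and set everything up. Recently I have been getting these messages from the kernel (seen on the terminal, and later in /var/log/messages):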
Mar 22 09:04:29 seedbox kernel: [72710.442831] Eeek! page_mapcount(page) went negative! (-1)
Mar 22 09:04:29 seedbox kernel: [72710.442831] page pfn = 6f65b
Mar 22 09:04:29 seedbox kernel: [72710.442831] page->flags = 100000000000010
Mar 22 09:04:29 seedbox kernel: [72710.442831] page->count = 0
Mar 22 09:04:29 seedbox kernel: [72710.442831] page->mapping = 0000000000000000
Mar 22 09:04:29 seedbox kernel: [72710.442831] vma->vm_ops = 0x0
Mar 22 09:04:29 seedbox kernel: [72710.442831] ------------[ cut here ]------------
Mar 22 09:04:29 seedbox kernel: [72710.442831] kernel BUG at mm/rmap.c:673!
Mar 22 09:04:29 seedbox kernel: [72710.442831] invalid opcode: 0000 [1] SMP
Mar 22 09:04:29 seedbox kernel: [72710.442831] CPU 0
Mar 22 09:04:29 seedbox kernel: [72710.442831] Modules linked in: ipv6 ext2 loop snd_pcm snd_timer snd soundcore snd_page_alloc serio_raw i2c_i801 rng_core pcspkr psmouse i2c_core video output button intel_agp evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod sd_mod ata_piix ata_generic piix libata scsi_mod dock ide_pci_generic e100 mii ide_core ehci_hcd uhci_hcd thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
Mar 22 09:04:29 seedbox kernel: [72710.442831] Pid: 6527, comm: sshd Not tainted 2.6.26-2-amd64 #1
Mar 22 09:04:29 seedbox kernel: [72710.442831] RIP: 0010:[<ffffffff802876b9>] [<ffffffff802876b9>] page_remove_rmap+0xff/0x11a
Mar 22 09:04:29 seedbox kernel: [72710.442831] RSP: 0018:ffff8100b75d1da8 EFLAGS: 00010246
Mar 22 09:04:29 seedbox kernel: [72710.442831] RAX: 0000000000000000 RBX: ffffe2000185e3e8 RCX: 0000000000008e53
Mar 22 09:04:29 seedbox kernel: [72710.442831] RDX: ffff810080a4c000 RSI: 0000000000000046 RDI: 0000000000000282
Mar 22 09:04:29 seedbox kernel: [72710.442831] RBP: ffff8100379838c8 R08: 00007f6d11ba4000 R09: ffff8100b75d1800
Mar 22 09:04:29 seedbox kernel: [72710.442831] R10: 0000000000000000 R11: 0000010000000010 R12: ffff8100bb446b00
Mar 22 09:04:29 seedbox kernel: [72710.442831] R13: 00007f6d11ba4000 R14: ffffe2000185e3e8 R15: ffff810001023b80
Mar 22 09:04:29 seedbox kernel: [72710.442831] FS: 0000000000000000(0000) GS:ffffffff8053d000(0000) knlGS:0000000000000000
Mar 22 09:04:29 seedbox kernel: [72710.442831] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 22 09:04:29 seedbox kernel: [72710.442831] CR2: 00007f6d1195b480 CR3: 0000000037904000 CR4: 00000000000006e0
Mar 22 09:04:29 seedbox kernel: [72710.442831] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 22 09:04:29 seedbox kernel: [72710.442831] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 22 09:04:29 seedbox kernel: [72710.442831] Process sshd (pid: 6527, threadinfo ffff8100b75d0000, task ffff8100bd0a0990)
Mar 22 09:04:29 seedbox kernel: [72710.442831] Stack: 800000006f65b045 800000006f65b045 ffff8100bc013d20 ffffffff8027f69a
Mar 22 09:04:29 seedbox kernel: [72710.442831] ffff810100000000 0000000000000000 ffff8100b75d1eb8 ffffffffffffffff
Mar 22 09:04:29 seedbox kernel: [72710.442831] 0000000000000000 ffff8100379838c8 ffff8100b75d1ec0 0000000000296460
Mar 22 09:04:29 seedbox kernel: [72710.442831] Call Trace:
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff8027f69a>] ? unmap_vmas+0x4c9/0x885
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff80283ac8>] ? exit_mmap+0x7c/0xf0
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff80232538>] ? mmput+0x2c/0xa2
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff802378ad>] ? do_exit+0x25a/0x6a6
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff802afa45>] ? mntput_no_expire+0x20/0x117
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff80237d66>] ? do_group_exit+0x6d/0x9d
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff80237da8>] ? sys_exit_group+0x12/0x16
Mar 22 09:04:29 seedbox kernel: [72710.442831] [<ffffffff8020beda>] ? system_call_after_swapgs+0x8a/0x8f
Mar 22 09:04:29 seedbox kernel: [72710.442831]
Mar 22 09:04:29 seedbox kernel: [72710.442831]
Mar 22 09:04:29 seedbox kernel: [72710.442831] Code: 80 e8 d7 e5 fc ff 48 8b 85 90 00 00 00 48 85 c0 74 19 48 8b 40 20 48 85 c0 74 10 48 8b 70 58 48 c7 c7 8c 49 4b 80 e8 b2 e5 fc ff <0f> 0b eb fe 8b 77 18 5a 5b 5d 83 e6 01 f7 de 83 c6 04 e9 21 54
Mar 22 09:04:29 seedbox kernel: [72710.442831] RIP [<ffffffff802876b9>] page_remove_rmap+0xff/0x11a
Mar 22 09:04:29 seedbox kernel: [72710.442831] RSP <ffff8100b75d1da8>
Mar 22 09:04:29 seedbox kernel: [72710.442831] ---[ end trace e8a2f3b263482c6e ]---
Mar 22 09:04:29 seedbox kernel: [72710.442831] Fixing recursive fault but reboot is needed!
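After this everything still works, but the messages keep appearing. I don't know what the problem is or how to debug/trace it, and I hope you can help. Let me know if you need more information.
Update: posted the full message from the syslog (with more information).
Is ext4 ready for production use on Debian 5 (with Linux kernel version 2.6.26)? Will it be stable, not crappy, and bug-free?
Recently my 9-year-old Apple G4 file server has been crashing at random. Usually it is a kernel panic, but sometimes the system just locks up. It always seems to happen when I am out of the office... but even when I am in, the system sits in a separate server room and there is almost never anyone at the console. Suspecting bad memory, I ran memtest, but after 20 passes it found no problems. (I ran 10 passes, rebooted, and ran 10 more. Both runs were in single-user mode.) Apple Hardware Test also reports no problems (after running in loop mode for more than 100 cycles).
I suspect the hardware is simply going bad... it is 9 years old, after all. But we currently have no budget to replace the server. What are my best options until our next upgrade? Is there any way to fix the crashes? Or at least, is there any way to make the system reboot automatically after a kernel panic or lockup so it can resume service?
panic.log shows: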
Mon Jun 29 12:52:23 2009
panic(cpu 1 caller 0x00040180): zalloc: "socket" (751876 elements) retry fail 3
Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x00040180 0x0026B868 0x00290E10 0x00290F1C 0x00296B40
0x002ABDB8 0x000ABD30 0x00000000
Proceeding back via exception chain:
Exception state (sv=0x32288780)
PC=0x9001B08C; MSR=0x0000F030; DAR=0x12555000; DSISR=0x42000000; LR=0x8EF88A00; R1=0xBFFFF700; XCP=0x0000003
0 (0xC00 - System call)
Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********
Fri Jul 3 10:15:24 2009
panic(cpu 1 caller 0x00040180): zalloc: "socket" (762004 elements) retry fail 3
Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x00040180 0x0026B868 0x00290E10 0x00290F1C 0x00296B40
0x002ABDB8 0x000ABD30 0x00000000
Proceeding back via exception chain:
Exception state (sv=0x2C543000)
PC=0x9001B08C; MSR=0x0000F030; DAR=0x11A41000; DSISR=0x42000000; LR=0x8EF88A00; R1=0xBFFFF700; XCP=0x0000003
0 (0xC00 - System call)
Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********
Tue Jul 21 20:44:47 2009
panic(cpu 1 caller 0x00040180): zalloc: "socket" (762004 elements) retry fail 3
Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x00040180 0x0026B868 0x00290E10 0x00290F1C 0x00296B40
0x002ABDB8 0x000ABD30 0x00000000
Proceeding back via exception chain:
Exception state (sv=0x2C543000)
PC=0x9001B08C; MSR=0x0000F030; DAR=0x11A41000; DSISR=0x42000000; LR=0x8EF88A00; R1=0xBFFFF700; XCP=0x0000003
0 (0xC00 - System call)
Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********
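The three entries above look nearly identical, so one way to compare them is to pull the CPU, caller address, zone name, and element count out of each panic line. The following is a minimal sketch under that assumption: the regular expression is written only against the `zalloc` lines shown in this log, and the function name `summarize_zalloc_panics` is hypothetical. Other kinds of panic entries would simply not match.

```python
import re

# Matches lines such as:
#   panic(cpu 1 caller 0x00040180): zalloc: "socket" (751876 elements) retry fail 3
PANIC_RE = re.compile(
    r'panic\(cpu (\d+) caller (0x[0-9A-Fa-f]+)\): '
    r'zalloc: "([^"]+)" \((\d+) elements\)'
)


def summarize_zalloc_panics(text: str) -> list:
    """Return (cpu, caller, zone, elements) for each zalloc panic line."""
    return [
        (int(cpu), caller, zone, int(count))
        for cpu, caller, zone, count in PANIC_RE.findall(text)
    ]


if __name__ == "__main__":
    # Panic lines copied from the panic.log excerpt above.
    sample = '''
panic(cpu 1 caller 0x00040180): zalloc: "socket" (751876 elements) retry fail 3
panic(cpu 1 caller 0x00040180): zalloc: "socket" (762004 elements) retry fail 3
panic(cpu 1 caller 0x00040180): zalloc: "socket" (762004 elements) retry fail 3
'''
    for entry in summarize_zalloc_panics(sample):
        print(entry)
```

Run against this log, every entry names the same caller and the same "socket" zone, with element counts of 751876, 762004 and 762004.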
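We have a number of HP DL385 G2s that kernel panic after installing RHEL 5.3. All are on the latest firmware, CD 8.50. The initial RHEL 5.3 install always works, and in most cases the first boot is fine (kernel 2.6.18-128.el5); so far about a quarter of them have panicked at this point. After a "yum update" to kernel 2.6.18-128.1.10.el5, most of the other machines I have tried fail to boot. One or two are fine.
The panic is always at the same point. The last few lines logged on the console are: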
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel@redhat.com
usb 4-2: new full speed USB device using uhci_hcd and address 3
device-mapper: dm-raid45: initialized v0.2429
usb 4-2: configuration #1 chosen from 1 choice
hub 4-2:1.0: USB hub found
hub 4-2:1.0: 7 ports detected
Then a pause, and then:
kernel panic - not syncing - attempted to kill init
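Past this point, no kernel will boot (including the 2.6.18-128.el5 kernel installed by Anaconda), and reinstalling is the only option. It appears to be very similar to this issue reported on the HP forums.
So, any ideas? We have DL385 G2s on RHEL 5.2, so something in 5.3 is not playing well with the same hardware. We have already tried resetting the BIOS to factory defaults, etc. How can I determine what the kernel is doing? (I have already removed "rhgb quiet" from the append line.) Fortunately, we don't have too many of these boxes, and I have a little time to investigate.
I am currently running Ubuntu 9.04 (Jaunty) and have run into problems that cause kernel panics to pop up. These panics make the system dump a pile of information to the terminal and hang.
Usually this happens while I am away from the system. That means the system sits idle until I get to the terminal, see that it has kernel panicked, and reboot it.
Is there a way to reboot automatically with Linux? I know that for a Windows BSOD there is an option to reboot automatically after the core dump takes place. Does Linux have a similar option?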
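The kernel.panic sysctl mentioned in an earlier question is the usual knob for this. As a minimal sketch, assuming a Linux system that exposes the setting at /proc/sys/kernel/panic, the following reads the current value and describes what it means. The helper names `panic_reboot_timeout` and `describe` are hypothetical; the interpretation of zero, positive, and negative values follows the kernel's sysctl documentation.

```python
from pathlib import Path

# Assumed location of the kernel.panic sysctl on a Linux system.
PANIC_SYSCTL = "/proc/sys/kernel/panic"


def panic_reboot_timeout(path: str = PANIC_SYSCTL) -> int:
    """Read the kernel.panic value from the given file as an integer."""
    return int(Path(path).read_text().strip())


def describe(timeout: int) -> str:
    """Explain a kernel.panic value in plain words."""
    if timeout == 0:
        return "no automatic reboot after a panic"
    if timeout > 0:
        return f"reboot {timeout}s after a panic"
    return "reboot immediately after a panic"


if __name__ == "__main__":
    if Path(PANIC_SYSCTL).exists():
        print(describe(panic_reboot_timeout()))
    else:
        print(f"{PANIC_SYSCTL} not found; is this a Linux system?")
```

A value of 0 means the machine stays halted after a panic, which matches the behaviour described above. Note that an earlier poster reports that setting kernel.panic = 1 did not make their server reboot, so this setting may not cover every kind of crash.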