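Yesterday I started playing Kenshi (a fun game, by the way), and after a while it crashed. I don't have overheating problems (far from it), nothing on my laptop is overclocked, and my not-so-young GTX 880M was still able to run the game, but at that point it seems to have stopped working. I shut the computer down and decided to look at it again the next day.
The next day (today), I noticed the following: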
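- The computer spends an unusually long time in the BIOS initialization phase (that is, before the operating system is loaded)
- After the OS boots, the dGPU light stays on even though the dGPU is not in use
- Windows Device Manager complains that it cannot load the driver
- On Linux, the dGPU can be switched off through an ACPI call (and the light then goes out), but trying to use it does not work at all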
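So I reset the BIOS to the so-called "optimized defaults", hoping it would help, but it does not look like it did.
On Linux, I get the following kernel messages when I try to use the GTX 880M (everything before and after has been removed for clarity):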
[Aug22 16:11] pci 0000:01:00.0: [10de:1198] type 00 class 0x030000
[ +0.000036] pci 0000:01:00.0: reg 0x10: [mem 0xf6000000-0xf6ffffff]
[ +0.000017] pci 0000:01:00.0: reg 0x14: [mem 0xe0000000-0xefffffff 64bit pref]
[ +0.000016] pci 0000:01:00.0: reg 0x1c: [mem 0xf0000000-0xf1ffffff 64bit pref]
[ +0.000012] pci 0000:01:00.0: reg 0x24: [io 0xe000-0xe07f]
[ +0.000012] pci 0000:01:00.0: reg 0x30: [mem 0xf7000000-0xf707ffff pref]
[ +0.000052] pci 0000:01:00.0: Enabling HDA controller
[ +0.000087] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8 GT/s x16 link)
[ +0.000486] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ +0.000011] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ +0.000160] pci 0000:01:00.1: [10de:0e0a] type 00 class 0x040300
[ +0.000029] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
[ +0.000057] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
[ +0.000382] pcieport 0000:00:01.0: ASPM: current common clock configuration is broken, reconfiguring
[ +0.009663] pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xefffffff 64bit pref]
[ +0.000037] pci 0000:01:00.0: BAR 3: assigned [mem 0xf0000000-0xf1ffffff 64bit pref]
[ +0.000018] pci 0000:01:00.0: BAR 0: assigned [mem 0xf6000000-0xf6ffffff]
[ +0.000003] pci 0000:01:00.0: BAR 6: assigned [mem 0xf7000000-0xf707ffff pref]
[ +0.000002] pci 0000:01:00.1: BAR 0: assigned [mem 0xf7080000-0xf7083fff]
[ +0.000002] pci 0000:01:00.0: BAR 5: assigned [io 0xe000-0xe07f]
[ +0.000004] pcieport 0000:00:01.0: PCI bridge to [bus 01]
[ +0.000001] pcieport 0000:00:01.0: bridge window [io 0xe000-0xefff]
[ +0.000003] pcieport 0000:00:01.0: bridge window [mem 0xf6000000-0xf70fffff]
[ +0.000002] pcieport 0000:00:01.0: bridge window [mem 0xe0000000-0xf1ffffff 64bit pref]
[ +0.000176] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
[ +0.000036] snd_hda_intel 0000:01:00.1: enabling device (0000 -> 0002)
[ +0.000057] snd_hda_intel 0000:01:00.1: Disabling MSI
[ +0.000006] snd_hda_intel 0000:01:00.1: Handle vga_switcheroo audio client
[ +0.041484] IPMI message handler: version 39.2
[ +0.016187] ipmi device interface
[ +0.704043] nvidia: module license 'NVIDIA' taints kernel.
[ +0.000001] Disabling lock debugging due to kernel taint
[ +0.012267] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ +0.000320] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[ +0.000078] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ +0.099429] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.100 Fri May 29 08:45:51 UTC 2020
[ +0.055239] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.100 Fri May 29 08:14:04 UTC 2020
[ +0.002640] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ +0.020768] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190816/nsarguments-59)
[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +0.000492] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ +0.000122] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[ +0.071852] nvidia-uvm: Loaded the UVM driver, major device number 235.
[Aug22 16:14] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[ +3.384946] rfkill: input handler enabled
[ +9.980611] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000047] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.167869] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000075] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +0.099416] gnome-shell[6474]: segfault at 20 ip 00007f4d29a83356 sp 00007ffd43c62db0 error 4 in libnvidia-glsi.so.440.100[7f4d29a21000+95000]
[ +0.000004] Code: 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 c3 0f 1f 44 00 00 48 8b 3d f1 31 24 00 e8 74 31 00 00 89 de 48 89 c7 e8 5a fe ff ff <48> 8b 78 20 e8 61 60 01 00 48 83 f8 01 48 89 45 00 19 c0 83 e0 0f
[ +4.409548] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[ +0.281982] rfkill: input handler disabled
[Aug22 16:15] rfkill: input handler enabled
[ +4.303838] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[ +0.355605] rfkill: input handler disabled
[ +8.056674] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000069] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.172351] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000030] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.171805] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000022] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.167969] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000022] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.171784] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000070] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.168094] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000023] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[Aug22 16:16] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000031] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.168161] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000025] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.171680] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000023] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ +8.171964] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[ +0.000039] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
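What I plan to do next is unplug everything, take out the battery and wait for an hour, but I have run out of ideas. Any suggestions, other than praying?
Edit: My laptop is a Clevo P170SM, although I don't think Clevo is the culprit. Maybe I am: after my game crashed, Windows froze, and instead of forcing a shutdown I waited for the OS until it bug-checked and rebooted on its own. It did, but only after a possibly fatal hour-long wait, and I suspect something burned out in the process.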
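The lesson I learned here: do not trust the operating system to protect your hardware. Most hardware protection is implemented in the BIOS, in drivers, or directly in hardware by the device manufacturer, and it can sometimes fail.
The operating system is more focused on protecting your data and system availability. For example, filesystem journaling protects data, and on Windows, GPU Timeout Detection and Recovery (TDR) provides availability. This is a personal opinion, but although journaling is done at the driver level, I consider it part of the operating system, because most operating systems (if not all) are designed for, and expected to be installed on, a very specific filesystem format. I imagine Windows could be installed on an ext4 filesystem, but some features would be missing. But I digress...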
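According to the relevant log entries, the operating system failed to initialize the display adapter. Together with the other information you provided, that is usually bad news. Similar hardware errors are probably present in the Windows Event Viewer.
The left column of the log shows the time elapsed between messages. Your PC takes a long time to load because initializing the Nvidia hardware fails repeatedly. The first failure takes about 30 seconds, and each subsequent one takes roughly 8 to 10 seconds.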
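To make the timing argument concrete, here is a minimal Python sketch that pulls the `RmInitAdapter failed!` lines out of a log in the format shown in the question and reports the `+seconds` value in front of each one. The function name `init_failures` and the regular expressions are my own for illustration; the sample lines are copied from the log above. Keep in mind that the delta is the time since the previous kernel message of any kind, so it only approximates how long each initialization attempt took.

```python
import re

# A log line looks like "[ +8.167869] message" (delta in seconds since the
# previous message) or "[Aug22 16:14] message" (absolute stamp, no delta).
LINE_RE = re.compile(r"^\[\s*(?P<stamp>[^\]]+)\]\s*(?P<msg>.*)$")
FAIL_RE = re.compile(r"RmInitAdapter failed! \((?P<code>[^)]+)\)")


def init_failures(lines):
    """Return (delay_seconds, error_code) for every RmInitAdapter failure.

    delay_seconds is None when the line carries an absolute stamp
    instead of a "+delta" value.
    """
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        fail = FAIL_RE.search(m.group("msg"))
        if not fail:
            continue
        stamp = m.group("stamp").strip()
        delay = float(stamp[1:]) if stamp.startswith("+") else None
        results.append((delay, fail.group("code")))
    return results


# Lines copied verbatim from the log in the question.
sample = [
    "[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)",
    "[ +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0",
    "[ +9.980611] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)",
    "[Aug22 16:16] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)",
]

for delay, code in init_failures(sample):
    shown = "n/a" if delay is None else f"{delay:.1f}s"
    print(f"{shown:>6}  {code}")
```

Running it over the whole excerpt would show the single long wait before the first failure and the shorter, regular waits before the later retries.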
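You did not mention the brand of your laptop, but you specifically mentioned a "dGPU", which I take to mean a discrete or dedicated Nvidia graphics chip, so you can still get video output from the CPU's integrated graphics.
The GPU may be able to receive power and respond to basic system management requests (the power on and off you mentioned), but the rest of the GPU is inaccessible, most likely because of a hardware failure.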
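One way to see that split on Linux is to check what the PCI layer reports for the device, separately from whether the driver can initialize it. The sketch below reads the standard sysfs entries for a PCI function; the helper name `describe_pci_device` is hypothetical, and the address `0000:01:00.0` is taken from the log in the question. A device that shows up here with a bound driver only proves that bus-level enumeration works; it says nothing about whether the GPU itself can be brought up.

```python
from pathlib import Path


def describe_pci_device(address, sysfs_root="/sys/bus/pci/devices"):
    """Report whether a PCI function is enumerated and which driver is bound.

    sysfs_root is a parameter only so the function can be pointed at a
    different directory tree; on a real system the default applies.
    """
    dev = Path(sysfs_root) / address
    if not dev.is_dir():
        return {"present": False, "vendor": None, "device": None, "driver": None}

    vendor = (dev / "vendor").read_text().strip()  # e.g. "0x10de" for Nvidia
    device = (dev / "device").read_text().strip()
    driver_link = dev / "driver"  # symlink that exists only when a driver is bound
    driver = driver_link.resolve().name if driver_link.exists() else None
    return {"present": True, "vendor": vendor, "device": device, "driver": driver}


# Address of the GTX 880M as it appears in the log from the question.
print(describe_pci_device("0000:01:00.0"))
```

If the device is listed and a driver is bound while `RmInitAdapter` keeps failing in the kernel log, that is consistent with a GPU that answers on the bus but cannot be initialized.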
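Based on the information provided, this is the best assessment I can offer.
A further troubleshooting step would be to boot from an unmodified Live CD, just to be absolutely sure this is not a problem with your installed operating system.
If you search for the GPU-related error messages from the log, you will find links to the Nvidia support forums. Here is their advice: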