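I've had this problem on 19.10 and now on 20.04. 18.04, which I used when I built this computer in February 2020, did not have it. I did a fresh install of 20.04. In short, after scrolling in Firefox for a while (a few minutes to an hour), the mouse becomes inactive (I can move it, but clicks don't register), and a few seconds later the system becomes completely unresponsive, usually with a blank or wrongly colored low-resolution screen, and needs a hard reboot to reset.
Usually this happens after resuming from suspend, but it also happens after a fresh boot (more rarely). It is an intermittent problem, though, and I can't pin down the preconditions. Scrolling in Firefox seems to be a more or less consistent trigger. My suspicion is that there is a race condition on resume or initialization that leaves the amdgpu driver in a bad state. I've searched for this problem using the errors in syslog and followed the leads I could gather: reinstalling the amdgpu driver from the AMD site and updating the kernel (now 5.8.1), but nothing has helped. The syslog errors always start with: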
Aug 18 21:05:26 mvlLinux-pc kernel: [28611.718399] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Aug 18 21:05:31 mvlLinux-pc kernel: [28611.718497] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Aug 18 21:05:31 mvlLinux-pc kernel: [28617.360497] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=624416, emitted seq=624418
Aug 18 21:05:31 mvlLinux-pc kernel: [28617.360584] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2328 thread gnome-shel:cs0 pid 2354
Aug 18 21:05:31 mvlLinux-pc kernel: [28617.360590] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
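For anyone who wants to pull messages like the ones above out of their own syslog, here is a minimal sketch. The pattern (`*ERROR*` or `GPU reset`) is only my assumption based on the lines quoted in this post, and the function name `gpu_error_lines` is made up for illustration; it is not part of any tool.

```python
import re

# Assumption: the interesting lines contain "*ERROR*" or "GPU reset",
# as in the excerpts quoted in this post. Adjust for your own logs.
GPU_ERROR = re.compile(r"\*ERROR\*|GPU reset")


def gpu_error_lines(lines):
    """Return the log lines that look like DRM/amdgpu errors or GPU reset messages."""
    return [line.rstrip("\n") for line in lines if GPU_ERROR.search(line)]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python3 gpu_errors.py /var/log/syslog
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            for hit in gpu_error_lines(log):
                print(hit)
```

Running it against the log file named on the command line prints only the matching lines, which makes it easier to compare one freeze with another.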
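Hardware summary:
Motherboard: Asus PRIME X470-PRO
Processor: AMD Ryzen 5 2600X Six-Core Processor
Video: Asus Strix Radeon RX570
RAM: CRUCIAL 16 GiB
I can of course provide more details. Any suggestions gladly accepted. I'm finding Linux far too crash-prone to use lately.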
@heynnema
I don't think memory is the problem, but here it is:
free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       2.7Gi        10Gi       235Mi       2.0Gi        12Gi
Swap:         2.0Gi          0B       2.0Gi
sudo dmidecode -s bios-version
5406
sudo lshw -C memory
*-firmware
description: BIOS
vendor: American Megatrends Inc.
physical id: 0
version: 5406
date: 11/13/2019
size: 64KiB
capacity: 16MiB
capabilities: pci apm upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
*-memory
description: System Memory
physical id: 2e
slot: System board or motherboard
size: 16GiB
*-bank:0
description: [empty]
product: Unknown
vendor: Unknown
physical id: 0
serial: Unknown
slot: DIMM_A1
*-bank:1
description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns)
product: BLS8G4D32AESBK.M8FE1
vendor: CRUCIAL
physical id: 1
serial: E316F686
slot: DIMM_A2
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:2
description: [empty]
product: Unknown
vendor: Unknown
physical id: 2
serial: Unknown
slot: DIMM_B1
*-bank:3
description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns)
product: BLS8G4D32AESBK.M8FE1
vendor: CRUCIAL
physical id: 3
serial: E316E264
slot: DIMM_B2
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-cache:0
description: L1 cache
physical id: 30
slot: L1 - Cache
size: 576KiB
capacity: 576KiB
clock: 1GHz (1.0ns)
capabilities: pipeline-burst internal write-back unified
configuration: level=1
*-cache:1
description: L2 cache
physical id: 31
slot: L2 - Cache
size: 3MiB
capacity: 3MiB
clock: 1GHz (1.0ns)
capabilities: pipeline-burst internal write-back unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 32
slot: L3 - Cache
size: 16MiB
capacity: 16MiB
clock: 1GHz (1.0ns)
capabilities: pipeline-burst internal write-back unified
configuration: level=3
@heynnema
Adding more error messages from a freeze after suspend/resume:
Aug 29 08:36:17 mvlLinux-pc systemd-resolved[830]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248541] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248550] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248553] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00200000/04400000
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248556] pcieport 0000:00:03.1: AER: [21] ACSViol (First)
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248559] amdgpu 0000:09:00.0: AER: can't recover (no error_detected callback)
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248561] snd_hda_intel 0000:09:00.1: AER: can't recover (no error_detected callback)
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248587] pcieport 0000:00:03.1: AER: device recovery failed
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331741] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331751] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331756] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00200000/04400000
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331759] pcieport 0000:00:03.1: AER: [21] ACSViol (First)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331763] amdgpu 0000:09:00.0: AER: can't recover (no error_detected callback)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331765] snd_hda_intel 0000:09:00.1: AER: can't recover (no error_detected callback)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331799] pcieport 0000:00:03.1: AER: device recovery failed
Aug 29 08:39:47 mvlLinux-pc kernel: [ 8040.390787] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Aug 29 08:39:47 mvlLinux-pc kernel: [ 8040.390799] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:49:crtc-1] flip_done timed out
Aug 29 08:39:49 mvlLinux-pc kernel: [ 8042.438900] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22040, emitted seq=22042
Aug 29 08:39:49 mvlLinux-pc kernel: [ 8042.438988] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Aug 29 08:39:49 mvlLinux-pc kernel: [ 8042.438995] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
Aug 29 08:39:50 mvlLinux-pc kernel: [ 8043.146715] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Aug 29 08:39:50 mvlLinux-pc kernel: [ 8043.146795] [drm:gfx_v8_0_kcq_disable.isra.0 [amdgpu]] *ERROR* KCQ disable failed
Aug 29 08:39:50 mvlLinux-pc kernel: [ 8043.423697] amdgpu: cp is busy, skip halt cp
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8043.700692] amdgpu: rlc is busy, skip halt rlc
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8043.701711] amdgpu 0000:09:00.0: amdgpu: GPU BACO reset
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.346691] amdgpu 0000:09:00.0: amdgpu: GPU reset succeeded, trying to resume
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.348500] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.348515] [drm] VRAM is lost due to GPU reset!
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.678238] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.678302] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.678328] amdgpu 0000:09:00.0: amdgpu: GPU reset(1) failed
Aug 29 08:39:52 mvlLinux-pc kernel: [ 8044.680626] amdgpu 0000:09:00.0: amdgpu: GPU reset end with ret = -110
Aug 29 08:39:54 mvlLinux-pc kernel: [ 8047.302923] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Aug 29 08:40:02 mvlLinux-pc kernel: [ 8054.727115] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22042, emitted seq=22042
Aug 29 08:40:02 mvlLinux-pc kernel: [ 8054.727203] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Aug 29 08:40:02 mvlLinux-pc kernel: [ 8054.727216] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
Aug 29 08:40:46 mvlLinux-pc systemd-modules-load[388]: Inserted module 'lp'
Aug 29 08:40:46 mvlLinux-pc systemd-modules-load[388]: Inserted module 'ppdev'
Aug 29 08:40:46 mvlLinux-pc kernel: [ 0.000000] Linux version 5.8.1-050801-generic (kernel@sita) (gcc (Ubuntu 10.2.0-5ubuntu2) 10.2.0, GNU ld (GNU Binutils for Ubuntu) 2.35) #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020
Aug 29 08:40:46 mvlLinux-pc kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.8.1-050801-generic root=UUID=566746e2-b4e2-42a6-b18a-fa84ebca61aa ro quiet splash vt.handoff=7
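I've seen similar errors in bug reports, always involving AMD graphics, but mostly integrated APUs rather than my discrete setup. The problem appeared when I moved from Ubuntu 18.04 to 19.10. Others have said that a newer kernel fixed it, but updating to 5.8.1 didn't help me. Given how intermittent the problem is, others may only think it has gone away, and I've seen a few people note that it came back. So far I haven't seen any solution in the dozens of threads I've read. I suppose I could try putting in an older video card to see if I can narrow it down. Thanks!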
@heynnema
After setting pci=noaer on the grub command line, I got the same error when resuming from suspend. Dmesg output from the resume:
[ 2456.697121] ACPI: Low-level resume complete
[ 2456.697163] ACPI: EC: EC started
[ 2456.697164] PM: Restoring platform NVS memory
[ 2456.697710] Enabling non-boot CPUs ...
[ 2456.697747] x86: Booting SMP configuration:
[ 2456.697748] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 2456.697845] microcode: CPU1: patch_level=0x0800820d
[ 2456.700139] ACPI: \_PR_.C002: Found 2 idle states
[ 2456.700328] CPU1 is up
[ 2456.700344] smpboot: Booting Node 0 Processor 2 APIC 0x4
[ 2456.700442] microcode: CPU2: patch_level=0x0800820d
[ 2456.702609] ACPI: \_PR_.C004: Found 2 idle states
[ 2456.702779] CPU2 is up
[ 2456.702793] smpboot: Booting Node 0 Processor 3 APIC 0x8
[ 2456.702921] microcode: CPU3: patch_level=0x0800820d
[ 2456.705121] ACPI: \_PR_.C006: Found 2 idle states
[ 2456.705330] CPU3 is up
[ 2456.705344] smpboot: Booting Node 0 Processor 4 APIC 0xa
[ 2456.705468] microcode: CPU4: patch_level=0x0800820d
[ 2456.707683] ACPI: \_PR_.C008: Found 2 idle states
[ 2456.707886] CPU4 is up
[ 2456.707901] smpboot: Booting Node 0 Processor 5 APIC 0xc
[ 2456.708026] microcode: CPU5: patch_level=0x0800820d
[ 2456.710215] ACPI: \_PR_.C00A: Found 2 idle states
[ 2456.710422] CPU5 is up
[ 2456.710435] smpboot: Booting Node 0 Processor 6 APIC 0x1
[ 2456.710561] microcode: CPU6: patch_level=0x0800820d
[ 2456.712760] ACPI: \_PR_.C001: Found 2 idle states
[ 2456.713055] CPU6 is up
[ 2456.713084] smpboot: Booting Node 0 Processor 7 APIC 0x3
[ 2456.713186] microcode: CPU7: patch_level=0x0800820d
[ 2456.715367] ACPI: \_PR_.C003: Found 2 idle states
[ 2456.715594] CPU7 is up
[ 2456.715609] smpboot: Booting Node 0 Processor 8 APIC 0x5
[ 2456.715709] microcode: CPU8: patch_level=0x0800820d
[ 2456.717892] ACPI: \_PR_.C005: Found 2 idle states
[ 2456.718131] CPU8 is up
[ 2456.718143] smpboot: Booting Node 0 Processor 9 APIC 0x9
[ 2456.718271] microcode: CPU9: patch_level=0x0800820d
[ 2456.720463] ACPI: \_PR_.C007: Found 2 idle states
[ 2456.720728] CPU9 is up
[ 2456.720742] smpboot: Booting Node 0 Processor 10 APIC 0xb
[ 2456.720868] microcode: CPU10: patch_level=0x0800820d
[ 2456.723067] ACPI: \_PR_.C009: Found 2 idle states
[ 2456.723342] CPU10 is up
[ 2456.723356] smpboot: Booting Node 0 Processor 11 APIC 0xd
[ 2456.723483] microcode: CPU11: patch_level=0x0800820d
[ 2456.725687] ACPI: \_PR_.C00B: Found 2 idle states
[ 2456.725971] CPU11 is up
[ 2456.727331] ACPI: Waking up from system sleep state S3
[ 2456.728144] ACPI: EC: interrupt unblocked
[ 2456.810892] ACPI: EC: event unblocked
[ 2456.810961] usb usb1: root hub lost power or was reset
[ 2456.810962] usb usb2: root hub lost power or was reset
[ 2456.811202] usb usb3: root hub lost power or was reset
[ 2456.811203] usb usb4: root hub lost power or was reset
[ 2456.811595] sd 1:0:0:0: [sda] Starting disk
[ 2456.811933] serial 00:03: activated
[ 2457.124313] ata5: SATA link down (SStatus 0 SControl 330)
[ 2457.124331] ata6: SATA link down (SStatus 0 SControl 330)
[ 2457.124375] ata7: SATA link down (SStatus 0 SControl 330)
[ 2457.124474] ata1: SATA link down (SStatus 0 SControl 300)
[ 2457.124622] ata9: SATA link down (SStatus 0 SControl 300)
[ 2457.128321] ata3: SATA link down (SStatus 0 SControl 330)
[ 2457.168893] nvme nvme0: Shutdown timeout set to 8 seconds
[ 2457.181058] ata4: SATA link down (SStatus 0 SControl 330)
[ 2457.204000] nvme nvme0: 32/0/0 default/read/poll queues
[ 2457.215120] usb 4-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
[ 2457.283762] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 2457.366979] usb 4-2: reset SuperSpeed Gen 1 USB device number 3 using xhci_hcd
[ 2457.403433] [drm] UVD and UVD ENC initialized successfully.
[ 2457.526411] [drm] VCE initialized successfully.
[ 2457.586664] usb 3-1: reset high-speed USB device number 2 using xhci_hcd
[ 2457.850542] ata8: failed to resume link (SControl 0)
[ 2457.850553] ata8: SATA link down (SStatus 0 SControl 0)
[ 2458.122724] usb 3-1.1: reset full-speed USB device number 3 using xhci_hcd
[ 2460.178827] igb 0000:07:00.0 enp7s0: igb: enp7s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 2462.202613] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2462.379171] usb 5-2.2: reset low-speed USB device number 5 using xhci_hcd
[ 2462.607145] ata2.00: configured for UDMA/133
[ 2467.726718] PM: dpm_run_callback(): usb_dev_resume+0x0/0x20 returns -5
[ 2467.726722] PM: Device 5-2.2 failed to resume async: error -5
[ 2467.727071] OOM killer enabled.
[ 2467.727072] Restarting tasks ... done.
[ 2467.821378] PM: suspend exit
[ 2467.887621] usb 5-2.2: USB disconnect, device number 5
[ 2467.994352] usb 5-2.2: new low-speed USB device number 7 using xhci_hcd
[ 2468.103947] usb 5-2.2: New USB device found, idVendor=0764, idProduct=0501, bcdDevice= 0.01
[ 2468.103949] usb 5-2.2: New USB device strings: Mfr=3, Product=1, SerialNumber=0
[ 2468.103950] usb 5-2.2: Product: ST Series
[ 2468.103951] usb 5-2.2: Manufacturer: CPS
[ 2468.161509] hid-generic 0003:0764:0501.0008: hiddev2,hidraw5: USB HID v1.10 Device [CPS ST Series] on usb-0000:0a:00.3-2.2/input0
[ 2471.910903] igb 0000:07:00.0 enp7s0: igb: enp7s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 2472.022608] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[ 2575.502700] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
[ 2575.502806] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
[ 2580.632921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=84864, emitted seq=84866
[ 2580.633010] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1874 thread Xorg:cs0 pid 1877
[ 2580.633018] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[ 2581.335993] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 2581.336073] [drm:gfx_v8_0_kcq_disable.isra.0 [amdgpu]] *ERROR* KCQ disable failed
[ 2581.613633] amdgpu: cp is busy, skip halt cp
[ 2581.890354] amdgpu: rlc is busy, skip halt rlc
[ 2581.891376] amdgpu 0000:09:00.0: amdgpu: GPU BACO reset
[ 2582.546375] amdgpu 0000:09:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 2582.548207] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 2582.548220] [drm] VRAM is lost due to GPU reset!
[ 2582.878644] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 2582.878708] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[ 2582.878764] amdgpu 0000:09:00.0: amdgpu: GPU reset(2) failed
[ 2582.881066] amdgpu 0000:09:00.0: amdgpu: GPU reset end with ret = -110
[ 2585.742804] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 2585.742817] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:49:crtc-1] flip_done timed out
[ 2588.558904] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 2592.910983] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 2603.150799] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
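At this point the screen is blank and the system is frozen. It looks the same as usual. The GPU reset is retried, times out, and fails, so it seems to me that what's happening is that the GPU can't resume/reset after a suspend. I have seen it after a fresh reboot, but that's rare, and I can usually work/play for hours, as long as I don't let it suspend. Thanks!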
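To compare the ring timeout lines across freezes, here is a rough sketch that extracts the ring name and the gap between the `signaled seq` and `emitted seq` values. My reading, which is an assumption on my part, is that these are fence sequence numbers, so the gap is the number of submitted jobs that had not completed when the timeout fired. The helper name `pending_fences` is hypothetical.

```python
import re

# Matches lines such as:
#   ... *ERROR* ring gfx timeout, signaled seq=84864, emitted seq=84866
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for a ring timeout line, else None.

    Assumption: the difference approximates how many jobs were still
    outstanding on that ring when the timeout was reported.
    """
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    gap = int(match.group("emitted")) - int(match.group("signaled"))
    return match.group("ring"), gap
```

Lines like "ring gfx timeout, but soft recovered" carry no sequence numbers, so the helper returns `None` for them rather than guessing.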