SurlyJest

Asked: 2020-08-26 09:09:19 +0800 CST

Help with intermittent freezes after resume from suspend on 20.04 with an AMD RX570 graphics card

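I ran into this problem on 19.10 and now on 20.04. I built this computer in February 2020 with 18.04, which did not have the problem. I did a fresh install of 20.04. In short, after scrolling in Firefox for a while (a few minutes to an hour), the mouse becomes inactive (I can move it, but clicks don't register), and a few seconds later the system becomes completely unresponsive, usually with a blank or wrongly colored low-resolution screen, and needs a hard reboot to recover.
Usually this happens after resuming from suspend, but it also happens after a fresh boot (more rarely). It is an intermittent problem, though, and I can't pin down what the preconditions are. Scrolling in Firefox seems to be a more or less consistent trigger. My suspicion is that there is some race condition on resume or initialization that leaves the amdgpu driver in a bad state. I searched for the problem using the errors in syslog and followed the leads I could find: reinstalling the amdgpu driver from the AMD site and updating the kernel (now at 5.8.1), but nothing helped. The syslog errors always start with: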

Aug 18 21:05:26 mvlLinux-pc kernel: [28611.718399] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Aug 18 21:05:31 mvlLinux-pc kernel: [28611.718497] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Aug 18 21:05:31 mvlLinux-pc kernel: [28617.360497] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=624416, emitted seq=624418
Aug 18 21:05:31 mvlLinux-pc kernel: [ 28617.360584] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2328 thread gnome-shel:cs0 pid 2354
Aug 18 21:05:31 mvlLinux-pc kernel: [28617.360590] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
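
For anyone triaging a long syslog for the same symptom, here is a minimal sketch of pulling out the ring timeout records. The regular expression and the function name `find_ring_timeouts` are illustrative assumptions, not part of any amdgpu tooling; the sample line is copied from the log above.

```python
import re

# Matches amdgpu_job_timedout records of the form
#   "ring gfx timeout, signaled seq=624416, emitted seq=624418"
# The "timeout, but soft recovered" variant has no seq fields and is skipped.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines):
    """Return (ring, signaled_seq, emitted_seq) for each ring timeout line."""
    hits = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            hits.append(
                (match["ring"], int(match["signaled"]), int(match["emitted"]))
            )
    return hits


sample = [
    "Aug 18 21:05:31 mvlLinux-pc kernel: [28617.360497] "
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=624416, emitted seq=624418",
]

for ring, signaled, emitted in find_ring_timeouts(sample):
    # emitted - signaled is the number of fences submitted but not yet signaled
    print(f"ring {ring}: {emitted - signaled} fence(s) outstanding at timeout")
```

To run it against a real log, pass an open file instead of `sample`, for example `find_ring_timeouts(open("/var/log/syslog", errors="replace"))`.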

Hardware summary:
Motherboard: Asus PRIME X470-PRO
CPU: AMD Ryzen 5 2600X Six-Core Processor
Video: Asus Strix Radeon RX570
RAM: CRUCIAL 16 GiB

I can of course provide more details. Any suggestions are gladly accepted. Lately I have been finding Linux far too crash-prone.

@heynnema

I don't think memory is the problem, but here it is:

free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       2.7Gi        10Gi       235Mi       2.0Gi        12Gi
Swap:         2.0Gi          0B       2.0Gi

sudo dmidecode -s bios-version
5406
sudo lshw -C memory
  *-firmware                
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: 5406
       date: 11/13/2019
       size: 64KiB
       capacity: 16MiB
       capabilities: pci apm upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 2e
       slot: System board or motherboard
       size: 16GiB
     *-bank:0
          description: [empty]
          product: Unknown
          vendor: Unknown
          physical id: 0
          serial: Unknown
          slot: DIMM_A1
     *-bank:1
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns)
          product: BLS8G4D32AESBK.M8FE1
          vendor: CRUCIAL
          physical id: 1
          serial: E316F686
          slot: DIMM_A2
          size: 8GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)
     *-bank:2
          description: [empty]
          product: Unknown
          vendor: Unknown
          physical id: 2
          serial: Unknown
          slot: DIMM_B1
     *-bank:3
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns)
          product: BLS8G4D32AESBK.M8FE1
          vendor: CRUCIAL
          physical id: 3
          serial: E316E264
          slot: DIMM_B2
          size: 8GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)
  *-cache:0
       description: L1 cache
       physical id: 30
       slot: L1 - Cache
       size: 576KiB
       capacity: 576KiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 31
       slot: L2 - Cache
       size: 3MiB
       capacity: 3MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: 32
       slot: L3 - Cache
       size: 16MiB
       capacity: 16MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=3

@heynnema
Adding more error messages from a freeze after suspend/resume:

Aug 29 08:36:17 mvlLinux-pc systemd-resolved[830]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.  
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248541] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0  
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248550] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)  
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248553] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00200000/04400000
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248556] pcieport 0000:00:03.1: AER:    [21] ACSViol                (First)
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248559] amdgpu 0000:09:00.0: AER: can't recover (no error_detected callback)
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248561] snd_hda_intel 0000:09:00.1: AER: can't recover (no error_detected callback)
Aug 29 08:39:37 mvlLinux-pc kernel: [ 8030.248587] pcieport 0000:00:03.1: AER: device recovery failed
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331741] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331751] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331756] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00200000/04400000
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331759] pcieport 0000:00:03.1: AER:    [21] ACSViol                (First)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331763] amdgpu 0000:09:00.0: AER: can't recover (no error_detected callback)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331765] snd_hda_intel 0000:09:00.1: AER: can't recover (no error_detected callback)
Aug 29 08:39:39 mvlLinux-pc kernel: [ 8032.331799] pcieport 0000:00:03.1: AER: device recovery failed
Aug 29 08:39:47 mvlLinux-pc kernel: [ 8040.390787] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Aug 29 08:39:47 mvlLinux-pc kernel: [ 8040.390799] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:49:crtc-1] flip_done timed out
Aug 29 08:39:49 mvlLinux-pc kernel: [ 8042.438900] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22040, emitted seq=22042
Aug 29 08:39:49 mvlLinux-pc kernel: [ 8042.438988] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Aug 29 08:39:49 mvlLinux-pc kernel: [ 8042.438995] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
Aug 29 08:39:50 mvlLinux-pc kernel: [ 8043.146715] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Aug 29 08:39:50 mvlLinux-pc kernel: [ 8043.146795] [drm:gfx_v8_0_kcq_disable.isra.0 [amdgpu]] *ERROR* KCQ disable failed
Aug 29 08:39:50 mvlLinux-pc kernel: [ 8043.423697] amdgpu: cp is busy, skip halt cp
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8043.700692] amdgpu: rlc is busy, skip halt rlc
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8043.701711] amdgpu 0000:09:00.0: amdgpu: GPU BACO reset
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.346691] amdgpu 0000:09:00.0: amdgpu: GPU reset succeeded, trying to resume
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.348500] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.348515] [drm] VRAM is lost due to GPU reset!
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.678238] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.678302] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
Aug 29 08:39:51 mvlLinux-pc kernel: [ 8044.678328] amdgpu 0000:09:00.0: amdgpu: GPU reset(1) failed
Aug 29 08:39:52 mvlLinux-pc kernel: [ 8044.680626] amdgpu 0000:09:00.0: amdgpu: GPU reset end with ret = -110
Aug 29 08:39:54 mvlLinux-pc kernel: [ 8047.302923] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Aug 29 08:40:02 mvlLinux-pc kernel: [ 8054.727115] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22042, emitted seq=22042
Aug 29 08:40:02 mvlLinux-pc kernel: [ 8054.727203] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Aug 29 08:40:02 mvlLinux-pc kernel: [ 8054.727216] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
Aug 29 08:40:46 mvlLinux-pc systemd-modules-load[388]: Inserted module 'lp'
Aug 29 08:40:46 mvlLinux-pc systemd-modules-load[388]: Inserted module 'ppdev'
Aug 29 08:40:46 mvlLinux-pc kernel: [    0.000000] Linux version 5.8.1-050801-generic (kernel@sita) (gcc (Ubuntu 10.2.0-5ubuntu2) 10.2.0, GNU ld (GNU Binutils for Ubuntu) 2.35) #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020
Aug 29 08:40:46 mvlLinux-pc kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.8.1-050801-generic root=UUID=566746e2-b4e2-42a6-b18a-fa84ebca61aa ro quiet splash vt.handoff=7

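I have seen similar errors in bug reports, always involving AMD graphics, but mostly integrated APUs rather than a discrete card like mine. The problem appeared when I moved from Ubuntu 18.04 to 19.10. Others reported that a newer kernel fixed it, but updating to 5.8.1 did not help me. Given how intermittent the problem is, others may only have thought it was gone, and a few people I've seen noted that it came back. So far I haven't seen a solution in any of the dozens of threads I've read. I suppose I could try putting in an older video card to see whether that narrows it down. Thanks!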
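
The `error status/mask=00200000/04400000` field in the AER lines is a pair of hexadecimal bitmasks, and the kernel prints the name of each set status bit on the following line (`[21] ACSViol`). A minimal sketch of decoding such a value is below. The name table is a partial assumption: only `ACSViol` is confirmed by the log itself, the other short names are recalled from the Linux AER driver and should be checked against the kernel source, and any bit not in the table is reported by number only.

```python
# Short names for some PCIe AER uncorrectable-error status bits.
# Assumption: only bit 21 (ACSViol) is confirmed by the log in this post;
# the remaining entries should be verified against the kernel's AER driver.
AER_UNCORRECTABLE_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}


def decode_aer_status(hex_value):
    """Return the names (or bit numbers) of all bits set in a 32-bit hex mask."""
    value = int(hex_value, 16)
    return [
        AER_UNCORRECTABLE_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


status, mask = "00200000/04400000".split("/")
print("status:", decode_aer_status(status))
print("mask:  ", decode_aer_status(mask))
```

For the values in this log the status decodes to the single ACSViol bit, which matches the `[21] ACSViol (First)` line the kernel printed.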

@heynnema
After setting pci=noaer on the grub command line, I hit the same error on resume from suspend. Dmesg output from the resume:

[ 2456.697121] ACPI: Low-level resume complete
[ 2456.697163] ACPI: EC: EC started
[ 2456.697164] PM: Restoring platform NVS memory
[ 2456.697710] Enabling non-boot CPUs ...
[ 2456.697747] x86: Booting SMP configuration:
[ 2456.697748] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 2456.697845] microcode: CPU1: patch_level=0x0800820d
[ 2456.700139] ACPI: \_PR_.C002: Found 2 idle states
[ 2456.700328] CPU1 is up
[ 2456.700344] smpboot: Booting Node 0 Processor 2 APIC 0x4
[ 2456.700442] microcode: CPU2: patch_level=0x0800820d
[ 2456.702609] ACPI: \_PR_.C004: Found 2 idle states
[ 2456.702779] CPU2 is up
[ 2456.702793] smpboot: Booting Node 0 Processor 3 APIC 0x8
[ 2456.702921] microcode: CPU3: patch_level=0x0800820d
[ 2456.705121] ACPI: \_PR_.C006: Found 2 idle states
[ 2456.705330] CPU3 is up
[ 2456.705344] smpboot: Booting Node 0 Processor 4 APIC 0xa
[ 2456.705468] microcode: CPU4: patch_level=0x0800820d
[ 2456.707683] ACPI: \_PR_.C008: Found 2 idle states
[ 2456.707886] CPU4 is up
[ 2456.707901] smpboot: Booting Node 0 Processor 5 APIC 0xc
[ 2456.708026] microcode: CPU5: patch_level=0x0800820d
[ 2456.710215] ACPI: \_PR_.C00A: Found 2 idle states
[ 2456.710422] CPU5 is up
[ 2456.710435] smpboot: Booting Node 0 Processor 6 APIC 0x1
[ 2456.710561] microcode: CPU6: patch_level=0x0800820d
[ 2456.712760] ACPI: \_PR_.C001: Found 2 idle states
[ 2456.713055] CPU6 is up
[ 2456.713084] smpboot: Booting Node 0 Processor 7 APIC 0x3
[ 2456.713186] microcode: CPU7: patch_level=0x0800820d
[ 2456.715367] ACPI: \_PR_.C003: Found 2 idle states
[ 2456.715594] CPU7 is up
[ 2456.715609] smpboot: Booting Node 0 Processor 8 APIC 0x5
[ 2456.715709] microcode: CPU8: patch_level=0x0800820d
[ 2456.717892] ACPI: \_PR_.C005: Found 2 idle states
[ 2456.718131] CPU8 is up
[ 2456.718143] smpboot: Booting Node 0 Processor 9 APIC 0x9
[ 2456.718271] microcode: CPU9: patch_level=0x0800820d
[ 2456.720463] ACPI: \_PR_.C007: Found 2 idle states
[ 2456.720728] CPU9 is up
[ 2456.720742] smpboot: Booting Node 0 Processor 10 APIC 0xb
[ 2456.720868] microcode: CPU10: patch_level=0x0800820d
[ 2456.723067] ACPI: \_PR_.C009: Found 2 idle states
[ 2456.723342] CPU10 is up
[ 2456.723356] smpboot: Booting Node 0 Processor 11 APIC 0xd
[ 2456.723483] microcode: CPU11: patch_level=0x0800820d
[ 2456.725687] ACPI: \_PR_.C00B: Found 2 idle states
[ 2456.725971] CPU11 is up
[ 2456.727331] ACPI: Waking up from system sleep state S3
[ 2456.728144] ACPI: EC: interrupt unblocked
[ 2456.810892] ACPI: EC: event unblocked
[ 2456.810961] usb usb1: root hub lost power or was reset
[ 2456.810962] usb usb2: root hub lost power or was reset
[ 2456.811202] usb usb3: root hub lost power or was reset
[ 2456.811203] usb usb4: root hub lost power or was reset
[ 2456.811595] sd 1:0:0:0: [sda] Starting disk
[ 2456.811933] serial 00:03: activated
[ 2457.124313] ata5: SATA link down (SStatus 0 SControl 330)
[ 2457.124331] ata6: SATA link down (SStatus 0 SControl 330)
[ 2457.124375] ata7: SATA link down (SStatus 0 SControl 330)
[ 2457.124474] ata1: SATA link down (SStatus 0 SControl 300)
[ 2457.124622] ata9: SATA link down (SStatus 0 SControl 300)
[ 2457.128321] ata3: SATA link down (SStatus 0 SControl 330)
[ 2457.168893] nvme nvme0: Shutdown timeout set to 8 seconds
[ 2457.181058] ata4: SATA link down (SStatus 0 SControl 330)
[ 2457.204000] nvme nvme0: 32/0/0 default/read/poll queues
[ 2457.215120] usb 4-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
[ 2457.283762] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 2457.366979] usb 4-2: reset SuperSpeed Gen 1 USB device number 3 using xhci_hcd
[ 2457.403433] [drm] UVD and UVD ENC initialized successfully.
[ 2457.526411] [drm] VCE initialized successfully.
[ 2457.586664] usb 3-1: reset high-speed USB device number 2 using xhci_hcd
[ 2457.850542] ata8: failed to resume link (SControl 0)
[ 2457.850553] ata8: SATA link down (SStatus 0 SControl 0)
[ 2458.122724] usb 3-1.1: reset full-speed USB device number 3 using xhci_hcd
[ 2460.178827] igb 0000:07:00.0 enp7s0: igb: enp7s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 2462.202613] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2462.379171] usb 5-2.2: reset low-speed USB device number 5 using xhci_hcd
[ 2462.607145] ata2.00: configured for UDMA/133
[ 2467.726718] PM: dpm_run_callback(): usb_dev_resume+0x0/0x20 returns -5
[ 2467.726722] PM: Device 5-2.2 failed to resume async: error -5
[ 2467.727071] OOM killer enabled.
[ 2467.727072] Restarting tasks ... done.
[ 2467.821378] PM: suspend exit
[ 2467.887621] usb 5-2.2: USB disconnect, device number 5
[ 2467.994352] usb 5-2.2: new low-speed USB device number 7 using xhci_hcd
[ 2468.103947] usb 5-2.2: New USB device found, idVendor=0764, idProduct=0501, bcdDevice= 0.01
[ 2468.103949] usb 5-2.2: New USB device strings: Mfr=3, Product=1, SerialNumber=0
[ 2468.103950] usb 5-2.2: Product: ST Series
[ 2468.103951] usb 5-2.2: Manufacturer: CPS
[ 2468.161509] hid-generic 0003:0764:0501.0008: hiddev2,hidraw5: USB HID v1.10 Device [CPS ST Series] on usb-0000:0a:00.3-2.2/input0
[ 2471.910903] igb 0000:07:00.0 enp7s0: igb: enp7s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 2472.022608] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[ 2575.502700] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
[ 2575.502806] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
[ 2580.632921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=84864, emitted seq=84866
[ 2580.633010] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1874 thread Xorg:cs0 pid 1877
[ 2580.633018] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[ 2581.335993] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 2581.336073] [drm:gfx_v8_0_kcq_disable.isra.0 [amdgpu]] *ERROR* KCQ disable failed
[ 2581.613633] amdgpu: cp is busy, skip halt cp
[ 2581.890354] amdgpu: rlc is busy, skip halt rlc
[ 2581.891376] amdgpu 0000:09:00.0: amdgpu: GPU BACO reset
[ 2582.546375] amdgpu 0000:09:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 2582.548207] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 2582.548220] [drm] VRAM is lost due to GPU reset!
[ 2582.878644] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 2582.878708] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[ 2582.878764] amdgpu 0000:09:00.0: amdgpu: GPU reset(2) failed
[ 2582.881066] amdgpu 0000:09:00.0: amdgpu: GPU reset end with ret = -110
[ 2585.742804] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 2585.742817] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:49:crtc-1] flip_done timed out
[ 2588.558904] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 2592.910983] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 2603.150799] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

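At this point the screen is blank and the system is frozen. It looks the same as always. The GPU reset is retried, times out, and fails, so it looks to me as though the GPU cannot recover/reset after a suspend. I have seen it after a reboot, but rarely, and I can usually work/play for hours as long as I don't let it suspend. Thanks!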
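
To check whether the hang really follows the resume, one option is to measure the gap between `PM: suspend exit` and the first ring timeout in the dmesg output. The sketch below assumes the bracketed-seconds dmesg format shown above; the helper name `seconds_from_resume_to_timeout` is made up for illustration.

```python
import re

# dmesg lines look like "[ 2467.821378] PM: suspend exit"
STAMPED = re.compile(r"^\[\s*(?P<secs>\d+\.\d+)\]\s+(?P<msg>.*)$")


def seconds_from_resume_to_timeout(lines):
    """Seconds between the last resume and the next amdgpu ring timeout.

    Returns None if no ring timeout follows a "PM: suspend exit" line.
    """
    resumed_at = None
    for line in lines:
        match = STAMPED.match(line)
        if not match:
            continue
        stamp, msg = float(match["secs"]), match["msg"]
        if "PM: suspend exit" in msg:
            resumed_at = stamp
        elif (
            resumed_at is not None
            and "amdgpu_job_timedout" in msg
            and "timeout" in msg
        ):
            return stamp - resumed_at
    return None


sample = [
    "[ 2467.821378] PM: suspend exit",
    "[ 2575.502700] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] "
    "*ERROR* Waiting for fences timed out!",
    "[ 2580.632921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx "
    "timeout, signaled seq=84864, emitted seq=84866",
]

delay = seconds_from_resume_to_timeout(sample)
print(f"first ring timeout {delay:.1f} s after resume")
```

With the timestamps from the log above, the first ring timeout arrives roughly 113 seconds after the resume completes. Collecting this number over several freezes would show whether the delay is consistent or random.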
