I am having trouble with an NVIDIA RTX A6000 whenever the machine boots or the GPU is under load.
dmesg reports "AER: buffer overflow in recovery for"
three separate PCI addresses:
41:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
AER also reports that these errors were corrected, but the messages indicate that snd_hda_intel 0000:41:00.1
is affected by the problem as well.
[ 5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301397] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301399] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301401] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301402] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301403] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301405] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301405] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301406] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301407] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301408] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301409] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301410] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301411] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301411] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301413] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301414] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301414] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301416] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301416] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301417] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301418] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301419] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301420] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301421] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301422] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301422] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301424] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301424] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301425] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301426] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301427] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301428] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301429] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301430] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301430] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301432] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301432] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301433] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301435] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301435] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301436] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 5.301438] pcieport 0000:40:01.1: [12] Timeout
[ 5.301439] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 5.301440] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 5.301441] pcieport 0000:40:01.1: [12] Timeout
[ 5.301442] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
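To get an overview of a flood like this, the aer_status lines can be tallied per device. The following is a minimal Python sketch, not part of any existing tool: the names COR_BITS, decode_status and summarize are made up for illustration, and the bit table assumes the status values are PCIe Correctable Error Status bits, labeled with the short names the Linux AER driver prints (bit 0 RxErr and bit 12 Timeout match the log above).

```python
import re
from collections import Counter

# Assumption: these are Correctable Error Status bits, named the way
# the Linux AER driver abbreviates them in dmesg.
COR_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

# Matches lines such as:
# [ 5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
STATUS_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AER: aer_status: 0x(?P<status>[0-9a-fA-F]{8})"
)


def decode_status(status):
    """Return the names of the known correctable-error bits set in status."""
    return [name for bit, name in sorted(COR_BITS.items()) if status & (1 << bit)]


def summarize(log_text):
    """Count (PCI address, error name) pairs over all aer_status lines."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        status = int(match.group("status"), 16)
        for name in decode_status(status):
            counts[(match.group("bdf"), name)] += 1
    return counts
```

Fed the excerpt above, summarize should attribute RxErr to 0000:41:00.0 and 0000:41:00.1, and Timeout to the bridge at 0000:40:01.1.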
Corrected PCI addresses
Corrected-error messages appear for all three PCI addresses listed above. An example:
[ 10.419954] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[ 10.419957] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 10.419958] {3}[Hardware Error]: event severity: corrected
[ 10.419959] {3}[Hardware Error]: Error 0, type: corrected
[ 10.419960] {3}[Hardware Error]: section_type: PCIe error
[ 10.419960] {3}[Hardware Error]: port_type: 4, root port
[ 10.419961] {3}[Hardware Error]: version: 0.2
[ 10.419961] {3}[Hardware Error]: command: 0x0407, status: 0x0010
[ 10.419962] {3}[Hardware Error]: device_id: 0000:40:01.1
[ 10.419963] {3}[Hardware Error]: slot: 0
[ 10.419964] {3}[Hardware Error]: secondary_bus: 0x41
[ 10.419964] {3}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
[ 10.419965] {3}[Hardware Error]: class_code: 060400
[ 10.419966] {3}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0012
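The command and status values in this record can be decoded by hand. A small Python sketch follows, assuming they are the standard PCI Command and Status registers of the root port; the tables list only a subset of the defined bits, and the helper name set_bits is hypothetical.

```python
# Subset of the PCI Command register bits (standard configuration header).
COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    6: "Parity Error Response",
    8: "SERR# Enable",
    10: "Interrupt Disable",
}

# Subset of the PCI Status register bits.
STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def set_bits(value, names):
    """List the names of the bits set in a 16-bit register value.

    Bits missing from the table are reported as 'bit N'.
    """
    return [names.get(bit, f"bit {bit}") for bit in range(16) if value & (1 << bit)]


print(set_bits(0x0407, COMMAND_BITS))
print(set_bits(0x0010, STATUS_BITS))
```

Under that assumption, 0x0407 decodes to I/O Space, Memory Space, Bus Master and Interrupt Disable, and 0x0010 to Capabilities List only, so none of the error bits in the status register are set.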
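Tests and attempts to resolve the messages
I worked with the vendor and tried a number of things to rule out possible causes.
- Removing the GPU makes the problem go away entirely.
- Updated the firmware of the two Western Digital SN850X NVMe SSDs in the machine.
- Installed the system on a different SSD model, on the suspicion that the WD SN850X was the problem.
- PNY confirmed that there is no BIOS update available for the A6000 GPU.
- The BIOS has been updated.
- Running Windows did not appear to reveal any specific problem.
- Kernel 6.2 was tested to make sure the requirements of all components are met.
- ASPM was turned off in the grub boot menu, in case power-state switching on the PCI lanes was causing the problem. The BIOS has no ASPM control for the GPU, only for storage.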
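To confirm that the ASPM setting from grub actually reached the kernel, the boot command line can be checked. This is only a sketch: it assumes ASPM was disabled with the pcie_aspm=off kernel parameter (the list above does not say which option was used) and that /proc/cmdline is readable. Both function names are hypothetical.

```python
def aspm_setting(cmdline):
    """Return the value of the last pcie_aspm= option on a kernel
    command line, or None if the option is absent."""
    value = None
    for token in cmdline.split():
        if token.startswith("pcie_aspm="):
            value = token.split("=", 1)[1]
    return value


def running_kernel_aspm(path="/proc/cmdline"):
    """Read the running kernel's command line and extract pcie_aspm."""
    with open(path) as f:
        return aspm_setting(f.read())
```

Calling running_kernel_aspm() on the affected machine should return "off" if the parameter was applied, and None if the grub change never reached the kernel command line.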
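A student ran some compute jobs on this machine and did not report any specific problems while using the GPU. In addition, FurMark on Windows and GPUburn on Ubuntu appear to run without problems, which seems to suggest that the errors are in fact being corrected.
I would still like to understand better what is going wrong, to make as sure as possible that these AER messages will not affect future work on the machine, since it will be used for computation. It is still hard to tell whether this is a software problem in the operating system or a hardware problem with the card.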
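One way to watch whether errors accumulate during a compute job is to snapshot the per-device AER counters before and after a run. The sketch below assumes the kernel exposes them at /sys/bus/pci/devices/<address>/aer_dev_correctable as "name count" lines, and that they are updated on this system; when firmware handles the errors first, as the APEI records here suggest, the counters may not move. The function names are hypothetical.

```python
def parse_aer_counters(text):
    """Parse 'name count' lines into a dict, skipping anything else."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: count - before.get(name, 0)
        for name, count in after.items()
        if count != before.get(name, 0)
    }


def read_counters(bdf):
    """Read the correctable-error counters for one PCI address.

    The sysfs path is an assumption about the running kernel.
    """
    path = f"/sys/bus/pci/devices/{bdf}/aer_dev_correctable"
    with open(path) as f:
        return parse_aer_counters(f.read())
```

For example, taking read_counters("0000:41:00.0") before and after a GPUburn run and passing both results to counter_delta would show which error types, if any, increased during the test.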
Thanks in advance!
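I would not call this a fix, but I managed to move the GPU to another PCI slot, and that stopped the errors from appearing. GPUburn tests appear to run without problems.
I reported the problem to the motherboard manufacturer, since it seems to be some obscure PCI address issue (ASUSTeK / Pro WS WRX80E-SAGE SE WIFI).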