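I asked this question on AskUbuntu, but I'm now asking the more hardware-specific question here.
It seems my RAM was bad: I got about 6000 errors in Memtest86+, and I had 10+ freezes and hard reboots within 1 hour. But now that I have simply unplugged the two RAM modules and plugged them back in, I can't get a single new error. The laptop is under warranty, so Dell is willing to replace the entire motherboard and both RAM sticks (8 GB each) for free next week. I think I should decline that offer, but I'm worried my hardware may still be bad. Since no errors are occurring now, I wonder whether having them replace the whole motherboard is the bigger risk, especially since they will use refurbished parts, and my general experience with refurbished hardware parts (not with Dell at all, just in general) tells me to stay away unless I really have no other choice.
What should I do? Was my RAM ever bad? Or was it merely a mechanical pin-alignment or debris problem, which my unplugging and re-plugging the RAM fixed?
Note that my computer is 1 year old. It is a high-end Dell laptop. Recently, I completely wiped Windows 10 and installed Ubuntu 20.04.
Here is the full description I sent to the Dell support team, but they never had an engineer look at it, so I'd like to see if anyone here knows what might be going on and what the solution is.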
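[MESSAGE I SENT TO DELL (START)]
I've done some troubleshooting, and it has left me perplexed.
Note that my operating system is Linux Ubuntu 20.04.
Over the past two weeks I've had occasional freezes, but they were rare and usually happened during boot or shutdown. Sometimes it would freeze during boot and I'd have to hold down the power button and try again. I didn't think much of it, but was still puzzled by it. 3 days ago, I experienced repeated total freezes where no form of soft reboot would work, not even the special Ctrl + Alt + PrScr + REISUB sequence, which talks to the Linux kernel directly to soft-reboot a Linux computer. I had to do a full hard reboot every time. This happened over and over again: more than 10 times within about an hour. The system was completely unusable.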
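I booted into the Dell Diagnostics menu and ran the diagnostics twice. Each time, they stayed frozen on the memory test screen for about 15 minutes, with the display stuck at about 4 min 20 sec, so each time I did a hard reboot to get out.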
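I then (3 days ago) upgraded the BIOS from 1.9 to 1.15.1, and the freezes continued. Next, I enabled legacy boot in the BIOS/UEFI, booted into Memtest86+ v5.01 ( https://www.memtest.org/ ), and ran the memory test. It found thousands of errors within 6 minutes, and a total of 5632 errors over 2 hours or so. Then I called you.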
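Here are screenshots of those errors. This screenshot shows an error in test 10 at address 003e295861c, for example:
This screenshot shows the memory map from address to DIMM slot. As you can see, this address maps to DIMM B, which means that memory is bad:
This screenshot shows an error in test 7 at address 0017dfdf1b8, for example, after only 5 min 35 sec of testing. This maps to DIMM A, which means that memory is bad. So, both memories are bad: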
However, I can no longer reproduce the errors (now that I have swapped the RAM sticks around during further testing). Whether I test the memories individually or together, in DIMM A or in DIMM B, they now pass. Additionally, the Dell Diagnostic test from the boot menu now runs to completion and passes. Does this make any sense!? I went from 10+ freezes per hour and 5632 errors to nothing? I wonder if it's a glitchy motherboard, but all Dell Diagnostics tests which I run from the boot menu also now pass. I need this computer to work and be reliable and not produce memory corruption. What do you think? Thanks!
[MESSAGE I SENT TO DELL (END)]
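The address-to-DIMM reasoning in the message above can be sketched in a few lines of bash. This is only an illustration under a loud assumption: a flat, non-interleaved layout in which DIMM A covers the first 8 GiB of physical addresses and DIMM B the next 8 GiB. The real map is the one in the screenshot and may differ (for example with channel interleaving or reserved regions). The function name `dimm_for_addr` and the variable `DIMM_SIZE` are hypothetical.

```bash
#!/usr/bin/env bash
# ASSUMPTION (not from the screenshot): flat layout, DIMM A = first 8 GiB,
# DIMM B = next 8 GiB. Adjust to the real address map of the machine.
DIMM_SIZE=$(( 8 * 1024 * 1024 * 1024 ))

# Print the slot that a failing physical address falls in.
# $1 is the address in hex without a 0x prefix, as Memtest86+ prints it.
dimm_for_addr() {
    local addr=$(( 16#$1 ))
    if (( addr < DIMM_SIZE )); then
        echo "DIMM A"
    elif (( addr < 2 * DIMM_SIZE )); then
        echo "DIMM B"
    else
        echo "outside assumed map"
    fi
}

dimm_for_addr 0017dfdf1b8   # address reported in test 7
dimm_for_addr 003e295861c   # address reported in test 10
```

Under that assumed layout the two reported addresses land in different modules, which agrees with the conclusion drawn from the screenshots.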
Also, I have even run a stress test with this command, for 8 hours at 100% CPU usage (all 4 cores/8 hardware threads at 100%), and at ~98% RAM usage the whole time, and it ran fine too:
stress-ng --cpu 8 --vm 8 --vm-bytes 100% --timeout 8h --metrics
And I have now run Memtest86+ for 30+ hours with both RAM sticks reinserted, and I get zero errors.
How do I go from 5632 errors to zero!?
Note: I also ran Memtest86+ v5.01 only in single-threaded mode, so none of my errors were due to its known bugs with running in multi-threaded mode.
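A possible variant of the stress-ng run above, as a sketch rather than what was actually run: adding `--verify` asks stress-ng to sanity-check the memory contents it writes and report mismatches, instead of only generating load. The helper name `stress_cmd` and the example values (8 workers, 90% of RAM, 1 hour) are assumptions for illustration. The function only prints the command, so nothing is launched by accident.

```bash
#!/usr/bin/env bash
# Build (but do not run) a stress-ng command line that exercises RAM and
# verifies what was written.
# $1 = number of vm workers, $2 = percent of RAM, $3 = duration (e.g. 1h)
stress_cmd() {
    local workers=$1 vm_pct=$2 duration=$3
    echo "stress-ng --vm ${workers} --vm-bytes ${vm_pct}%" \
         "--verify --timeout ${duration} --metrics-brief"
}

# Print the command for review; copy and run it by hand once it looks right.
stress_cmd 8 90 1h
```

Like any test running under the OS, this cannot touch memory that the kernel itself occupies, so it complements a Memtest86+ run and does not replace it.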
Related:
- Related, but definitely inconclusive and not a duplicate: Can the dust cause DDR RAM errors?
- kinda-sorta related--also not a duplicate: ram errors solved by swapping slots used by ram
Future troubleshooting notes to self (Looking back: what I wish I would have done):
- I wish I would have run the Memtest86+ test 2 or 3 more times for < 1 hr each time before unplugging any RAM modules, just to see if I was consistently getting those thousands of failures.
- Then, assuming the errors were consistent, I wish the first thing I would have done to troubleshoot them would have been to just unplug both RAM modules and then plug them exactly back in as they were! Then, run the test again, and if the test passes immediately, after having failed several times in a row just before, I would know with certainty the RAM modules were just improperly seated somehow, and unplugging them and plugging them back in fixed the problem!
References:
- How I first started learning about the stress-ng Linux stress test command-line tool: https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
Taking the RAM out and putting it back in can certainly fix these kinds of issues.
(But the problem may come back in a couple of months.)
Basically there are 3 separate issues here:
Taking the RAM out/back in scrapes that layer off and you are good to go until it forms again. Especially computers used in a relatively humid environment can be subject to this, but it usually takes several years before this becomes an issue.
The 3 effects above can appear in combination and amplify each other. And they can start popping up after you have used the computer for a long time without issues. It can happen even in a computer whose internals you have never touched since it came out of the factory.
Testing suspect RAM is tricky, especially if you don't have another, known-good system available.
The typical first step when you suspect bad RAM is to take the RAM out.
Visually inspect it for bent contacts: if there are any, throw it away immediately. It will never be 100% reliable again.
Then clean the contacts and re-seat the RAM in the same slot. Then re-test.
If it still tests bad you can try a known good RAM in that slot. (Not always possible if the motherboard needs a specific combination of slots to be used.) If that also tests bad the slot itself is usually the culprit.
And you can test with only the suspect RAM in another slot.
If the motherboard/memory controller is the problem, any RAM you test in that same slot will appear bad. But beware: when you change the memory layout/configuration (e.g. test with fewer or different-size RAM strips), the problem can move to another slot. It is also possible that the system is reliably unstable with some memory combinations and stable with others (depending on the physical layout of the RAM present).
And always test with the RAM timing in the BIOS set to standard timing. Overclocked RAM can cause its own issues and make tests unreliable.
If you have another computer that is known to be good, it is probably easiest to run that second computer with just 1 RAM strip from the problem system. Test all RAM strips one by one. Then test the motherboard of the flaky computer by running it with RAM that checked out to be good in the previous tests.
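The swap-and-retest reasoning above can be written down as a small bookkeeping sketch. It assumes you record each run as a line of `<stick> <slot> <pass|fail>`, and it flags a stick that failed in every slot it was tried in, or a slot that failed with every stick tried in it. The function name `diagnose` and the labels `stick1`, `slotA`, and so on are hypothetical, and the results in the example are made up for illustration. It needs bash 4 or newer for associative arrays.

```bash
#!/usr/bin/env bash
# Reads "<stick> <slot> <pass|fail>" lines on stdin and prints suspects.
diagnose() {
    local -A stick_runs stick_fails slot_runs slot_fails
    local stick slot result
    while read -r stick slot result; do
        [[ -z $stick ]] && continue          # skip blank lines
        stick_runs[$stick]=$(( ${stick_runs[$stick]:-0} + 1 ))
        slot_runs[$slot]=$(( ${slot_runs[$slot]:-0} + 1 ))
        if [[ $result == fail ]]; then
            stick_fails[$stick]=$(( ${stick_fails[$stick]:-0} + 1 ))
            slot_fails[$slot]=$(( ${slot_fails[$slot]:-0} + 1 ))
        fi
    done
    # A stick that failed in every slot it was tested in is suspect.
    for stick in "${!stick_runs[@]}"; do
        if [[ ${stick_fails[$stick]:-0} -eq ${stick_runs[$stick]} ]]; then
            echo "suspect stick: $stick"
        fi
    done
    # A slot that failed with every stick tested in it is suspect.
    for slot in "${!slot_runs[@]}"; do
        if [[ ${slot_fails[$slot]:-0} -eq ${slot_runs[$slot]} ]]; then
            echo "suspect slot: $slot"
        fi
    done
}

# Hypothetical results: both sticks pass in slotA and fail in slotB.
diagnose <<'EOF'
stick1 slotA pass
stick1 slotB fail
stick2 slotA pass
stick2 slotB fail
EOF
```

With a single stick tested in a single slot, both are flagged, which is the honest answer: one data point cannot separate the two. The sketch also ignores the caveat above that a controller problem can move between slots when the configuration changes, so treat its output as a hint, not a verdict.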
A few words on cleaning the contacts:
Don't try to clean the slots on the motherboard. Very easy to damage them.
The friction of a RAM strip being taken out/inserted is enough to scrape the contacts clean.
On the RAM strips themselves:
Gently rub them with a pencil eraser in the correct direction. (When you hold the RAM horizontally with the contacts pointing down you rub it from top to bottom. So along the contact in the direction of where the slot would be if it was inserted in a slot.)
Do both sides and try to avoid touching the contacts with your fingers.
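If you did touch them (or just to be on the safe side), wet a Q-tip/cotton swab with isopropyl alcohol (available at any pharmacy) and rub it over the contacts. Keep repeating until you no longer see any black smudges on the Q-tip.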