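I have a strange problem. On a fresh install of Ubuntu 18.04 the system seems to run fine. Then suddenly, for no apparent reason, it hangs for anywhere from 10 seconds to several minutes, and I can't do anything.
I tried keeping an instance of top open, and RAM/CPU usage looks fine. I'm on an i5 machine with 6 GB of RAM and 12 GB of swap. I have just tested the memory and the disk, and neither showed errors.
EDIT: Some additional information. I set the CPU frequency governor to performance, so the CPU always runs at maximum speed.
The problem shows up more often when I run CPU-intensive work (such as data analysis). Once the job finishes, the GUI becomes completely unresponsive, and it is hard or impossible to get it working again.
EDIT
Output of grep . -r /sys/firmware/acpi/interrupts: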
/sys/firmware/acpi/interrupts/gpe2F: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe23: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe1F: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe13: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe0F: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe03: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe3D: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe31: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe2D: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe21: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe1D: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/ff_pwr_btn: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe11: 0 STS invalid unmasked
/sys/firmware/acpi/interrupts/gpe0D: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe01: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe3B: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe2B: 0 invalid unmasked
/sys/firmware/acpi/interrupts/ff_rt_clk: 0 disabled unmasked
/sys/firmware/acpi/interrupts/ff_pmtimer: 0 STS invalid unmasked
/sys/firmware/acpi/interrupts/gpe1B: 0 STS invalid unmasked
/sys/firmware/acpi/interrupts/gpe38: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe0B: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe28: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe18: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe08: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe36: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe26: 0 invalid unmasked
/sys/firmware/acpi/interrupts/error: 0
/sys/firmware/acpi/interrupts/gpe16: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/sci: 4
/sys/firmware/acpi/interrupts/gpe06: 4 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe34: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe24: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe14: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe04: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe3E: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe32: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe2E: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe22: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe1E: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe12: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe0E: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe02: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe3C: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe30: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe2C: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe20: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe1C: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe10: 0 STS invalid unmasked
/sys/firmware/acpi/interrupts/gpe39: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe0C: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe00: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe3A: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe_all: 4
/sys/firmware/acpi/interrupts/gpe29: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe2A: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe19: 0 STS invalid unmasked
/sys/firmware/acpi/interrupts/gpe1A: 0 STS invalid unmasked
/sys/firmware/acpi/interrupts/gpe09: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe37: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe0A: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe27: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe17: 0 STS invalid unmasked
/sys/firmware/acpi/interrupts/ff_gbl_lock: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe07: 0 enabled unmasked
/sys/firmware/acpi/interrupts/sci_not: 0
/sys/firmware/acpi/interrupts/gpe35: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe25: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe15: 0 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe05: 0 disabled unmasked
/sys/firmware/acpi/interrupts/gpe3F: 0 invalid unmasked
/sys/firmware/acpi/interrupts/gpe33: 0 invalid unmasked
/sys/firmware/acpi/interrupts/ff_slp_btn: 0 invalid unmasked
EDIT 04/03/2019: I ran a full SMART test, and the result doesn't look good, at least to me.
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 179 176 021 Pre-fail Always - 4025
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 218
5 Reallocated_Sector_Ct 0x0033 154 154 140 Pre-fail Always - 364
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 034 034 000 Old_age Always - 48741
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 217
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 100
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 117
194 Temperature_Celsius 0x0022 089 080 000 Old_age Always - 58
196 Reallocated_Event_Count 0x0032 022 022 000 Old_age Always - 178
197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 234
198 Offline_Uncorrectable 0x0030 199 199 000 Old_age Offline - 245
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 188 188 000 Old_age Offline - 2436
240 Head_Flying_Hours 0x0032 038 038 000 Old_age Always - 45709
241 Total_LBAs_Written 0x0032 200 200 000 Old_age Always - 81196791754
242 Total_LBAs_Read 0x0032 200 200 000 Old_age Always - 75991010629
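I would also check the CPU temperature and make sure your cooling fans are working. If the fans are fine, you may want to check for malware/viruses.
Also, sometimes you need to update the BIOS to fully support the features of a newer operating system (depending on the system).
Another thing I have found that can cause system freezes is a dropped Internet connection, especially during updates, so check your Internet connection and make sure it isn't dropping.
These are a bit of a "shot in the dark", but maybe one of the suggestions will help. More information about your system, such as the motherboard make, model, and revision, might help.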
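This is only from personal experience, but if the other suggestions don't help, and given that your CPU is at a good temperature, you may want to find another similar CPU that is compatible with your motherboard and see whether swapping it in fixes the problem. I recently had a CPU die, and before it failed completely it behaved almost exactly as you describe. It could also be some kind of motherboard problem, but I would check the CPU first. I know getting hold of other parts to test may not be entirely practical, but in my experience this kind of problem tends to be a hardware issue of some sort.
If neither of those is the problem, I would run a SMART test on the hard drive using the Disks utility, as detailed here: How do I check the SMART status of an SSD or HDD on current versions of Ubuntu 14.04 through 18.10?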
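If you prefer the command line to the Disks utility, the same check can be sketched with smartctl from the smartmontools package. This is only a rough sketch: the device name /dev/sda is an assumption, and flag_smart is a hypothetical helper, not part of smartmontools. It prints the attributes most often associated with a failing drive (IDs 5, 197 and 198) when their raw value is non-zero.

```shell
# Hypothetical helper: read `smartctl -A` style output on stdin and print the
# attributes 5, 197 and 198 whose RAW_VALUE (10th column) is greater than zero.
flag_smart() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && ($10 + 0 > 0) { print $2 ": " $10 }'
}

# Example usage (assumes the disk is /dev/sda; adjust for your system):
#   sudo smartctl -t long /dev/sda           # start a long self-test
#   sudo smartctl -A /dev/sda | flag_smart   # list worrying attributes
```

Run against the table posted in the question, this would list Reallocated_Sector_Ct (364), Current_Pending_Sector (234) and Offline_Uncorrectable (245), which is consistent with the drive itself being the likely cause.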
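Try tuning your swap settings, for example by running
sudo sysctl vm.swappiness=20
This change is reset after a reboot. Even when your memory is not fully used, the kernel starts swapping parts of it out to disk to keep some space free. Choosing a fairly low value leaves less free space but also causes less swapping. The best value depends on how much memory you have and on the workload you are running. Once you have found a value you are happy with, you can make it permanent by adding a corresponding vm.swappiness line to
/etc/sysctl.conf
For more background, see: What is swappiness and how do I change it?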
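As a rough sketch of the permanent change, the helper below sets a key=value line in a sysctl-style file, replacing an existing entry or appending one if none is present. The name set_sysctl_line is hypothetical, and the value 20 is just the example from above, not a recommendation. It relies on GNU sed (the default on Ubuntu).

```shell
# Hypothetical helper: set "key=value" in a sysctl-style file.
# Replaces an existing uncommented entry for the key, or appends a new line.
set_sysctl_line() {
    file=$1 key=$2 value=$3
    if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        sed -i -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example usage on a scratch copy first (paths are illustrative):
#   cp /etc/sysctl.conf /tmp/sysctl.conf.new
#   set_sysctl_line /tmp/sysctl.conf.new vm.swappiness 20
#   sudo cp /tmp/sysctl.conf.new /etc/sysctl.conf
#   sudo sysctl -p                  # reload settings from /etc/sysctl.conf
#   cat /proc/sys/vm/swappiness     # confirm the running value
```

Working on a copy keeps a typo from damaging the real file; the final two commands apply the setting and confirm it.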
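Collect the readings from a system monitor (for example sensors; psensor is a GUI alternative, though it may be of no use to you here) and dump them to a file so that you can analyse them after a freeze. RRDTool may come in handy.
You can output the readings with a date and time stamp, choose the interval at which data is dumped, get hard drive temperatures, and so on.
See:
How to monitor and log server hardware temperatures and load
Temperature monitoring help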
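A minimal sketch of such a timestamped dump, assuming your kernel exposes thermal zones under /sys/class/thermal (not every machine does; sensors or hddtemp output could be logged in the same way). The function name log_temps and the log path in the usage comment are illustrative only.

```shell
# Print one "timestamp,zone,temp_celsius" line per thermal zone.
# $1: sysfs thermal directory (defaults to /sys/class/thermal).
log_temps() {
    root=${1:-/sys/class/thermal}
    now=$(date '+%Y-%m-%d %H:%M:%S')
    for f in "$root"/thermal_zone*/temp; do
        [ -r "$f" ] || continue                  # skip if no zone is readable
        zone=$(basename "$(dirname "$f")")
        milli=$(cat "$f")                        # kernel reports millidegrees Celsius
        awk -v t="$now" -v z="$zone" -v m="$milli" \
            'BEGIN { printf "%s,%s,%.1f\n", t, z, m / 1000 }'
    done
}

# Example usage: append a sample every 10 seconds until interrupted.
#   while true; do log_temps >> "$HOME/temps.csv"; sleep 10; done
```

After a freeze, the last lines of the log show what the temperatures were doing just before the system stopped responding.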
https://ubuntuforums.org/showthread.php?t=1998005
https://ubuntuforums.org/showthread.php?t=2364408
http://manpages.ubuntu.com/manpages/bionic/man8/turbostat.8.html
http://manpages.ubuntu.com/manpages/trusty/man8/hddtemp.8.html