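I am trying to isolate CPU cores on an ARM64 Ampere (Supermicro) system for real-time workloads. When I run a workload with RT priority, I see the error below. I am not sure what is going wrong, or whether the error is related to the CPU isolation I configured or to something else. The same configuration works fine on x86; I have tested it on both Intel and AMD CPUs. Can anyone give me a hint? Any help is appreciated!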
May 18 20:38:16 k8s-hp-ampere kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
May 18 20:38:16 k8s-hp-ampere kernel: Mem abort info:
May 18 20:38:16 k8s-hp-ampere kernel: ESR = 0x96000004
May 18 20:38:16 k8s-hp-ampere kernel: EC = 0x25: DABT (current EL), IL = 32 bits
May 18 20:38:16 k8s-hp-ampere kernel: SET = 0, FnV = 0
May 18 20:38:16 k8s-hp-ampere kernel: EA = 0, S1PTW = 0
May 18 20:38:16 k8s-hp-ampere kernel: FSC = 0x04: level 0 translation fault
May 18 20:38:16 k8s-hp-ampere kernel: Data abort info:
May 18 20:38:16 k8s-hp-ampere kernel: ISV = 0, ISS = 0x00000004
May 18 20:38:16 k8s-hp-ampere kernel: CM = 0, WnR = 0
May 18 20:38:16 k8s-hp-ampere kernel: user pgtable: 4k pages, 48-bit VAs, pgdp=000008041899c000
May 18 20:38:16 k8s-hp-ampere kernel: [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
May 18 20:38:16 k8s-hp-ampere kernel: Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP
May 18 20:38:16 k8s-hp-ampere kernel: Modules linked in: vxlan xt_multiport ipt_rpfilter ip_set_hash_net xfrm_user veth wireguard libchacha20poly1305 chacha_neon poly1305_neon libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel ip6t_REJECT nf_reject_ipv6 nf_conntrack_netlink ipt_REJECT nf_reject_ipv4 xt_addrtype xt_set ip_set_hash_ipportip ip_set_hash_ip ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport dummy ip_set xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink overlay arm_spe_pmu nls_iso8859_1 irdma i40e acpi_ipmi ipmi_ssif arm_dmc620_pmu ipmi_devintf arm_cmn ipmi_msghandler xgene_hwmon arm_dsu_pmu acpi_tad binfmt_misc sch_fq_codel vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio br_netfilter bridge stp llc ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq
May 18 20:38:16 k8s-hp-ampere kernel: async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core r8153_ecm cdc_ether usbnet r8152 ast drm_vram_helper drm_ttm_helper ttm i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec crct10dif_ce rc_core ghash_ce sha2_ce sha256_arm64 mlx5_core sha1_ce nvme mlxfw mpt3sas psample ixgbe drm raid_class xhci_pci ice xfrm_algo tls nvme_core xhci_pci_renesas scsi_transport_sas mdio aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
May 18 20:38:16 k8s-hp-ampere kernel: CPU: 32 PID: 12294 Comm: clang Not tainted 5.15.0-1032-realtime #35-Ubuntu
May 18 20:38:16 k8s-hp-ampere kernel: Hardware name: Supermicro Corporation R12SPD ........../R12SPD, BIOS 1.1a 02/10/2023
May 18 20:38:16 k8s-hp-ampere kernel: pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
May 18 20:38:16 k8s-hp-ampere kernel: pc : pipe_write+0x540/0x754
May 18 20:38:16 k8s-hp-ampere kernel: lr : pipe_write+0x50/0x754
May 18 20:38:16 k8s-hp-ampere kernel: sp : ffff800178bb3be0
May 18 20:38:16 k8s-hp-ampere kernel: x29: ffff800178bb3be0 x28: ffff07ffadd0a700 x27: ffff0802fe6c5800
May 18 20:38:16 k8s-hp-ampere kernel: x26: 0000000000000004 x25: 0000000000000000 x24: ffff0802fe6c5850
May 18 20:38:16 k8s-hp-ampere kernel: x23: 0000000000000050 x22: ffff0802f63e8a00 x21: ffff800178bb3cc0
May 18 20:38:16 k8s-hp-ampere kernel: x20: ffffffffffffffee x19: 0000000000000001 x18: 0000000000000000
May 18 20:38:16 k8s-hp-ampere kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
May 18 20:38:16 k8s-hp-ampere kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
May 18 20:38:16 k8s-hp-ampere kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffffb87036dbe188
May 18 20:38:16 k8s-hp-ampere kernel: x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000102b143c5eb0
May 18 20:38:16 k8s-hp-ampere kernel: x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000032
May 18 20:38:16 k8s-hp-ampere kernel: x2 : 0000000000000010 x1 : 0000000000000005 x0 : 0000000000000000
May 18 20:38:16 k8s-hp-ampere kernel: Call trace:
May 18 20:38:16 k8s-hp-ampere kernel: pipe_write+0x540/0x754
May 18 20:38:16 k8s-hp-ampere kernel: new_sync_write+0x17c/0x18c
May 18 20:38:16 k8s-hp-ampere kernel: vfs_write+0x278/0x2e4
May 18 20:38:16 k8s-hp-ampere kernel: ksys_write+0xe4/0x100
May 18 20:38:16 k8s-hp-ampere kernel: __arm64_sys_write+0x24/0x30
May 18 20:38:16 k8s-hp-ampere kernel: invoke_syscall+0x78/0x100
May 18 20:38:16 k8s-hp-ampere kernel: el0_svc_common.constprop.0+0x54/0x184
May 18 20:38:16 k8s-hp-ampere kernel: do_el0_svc+0x30/0x9c
May 18 20:38:16 k8s-hp-ampere kernel: el0_svc+0x30/0x150
May 18 20:38:16 k8s-hp-ampere kernel: el0t_64_sync_handler+0xa4/0x130
May 18 20:38:16 k8s-hp-ampere kernel: el0t_64_sync+0x1a4/0x1a8
May 18 20:38:16 k8s-hp-ampere kernel: Code: 8b130341 f140043f 54ffdea8 f9400b00 (f9400002)
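For anyone reading the oops above, the ESR value can be decoded by hand. The following is a minimal sketch, assuming the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0); `decode_esr` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Decode the top-level fields of an ARM64 ESR_ELx value.
# decode_esr is a hypothetical helper written for this post.
decode_esr() (
    esr=$(($1))
    ec=$(( (esr >> 26) & 0x3f ))    # exception class, bits [31:26]
    il=$(( (esr >> 25) & 0x1 ))     # instruction length bit [25], 1 = 32-bit
    iss=$(( esr & 0x1ffffff ))      # instruction-specific syndrome, bits [24:0]
    wnr=$(( (iss >> 6) & 0x1 ))     # data aborts only: 1 = write, 0 = read
    dfsc=$(( iss & 0x3f ))          # data aborts only: fault status code
    printf 'EC=0x%02x IL=%d ISS=0x%08x WnR=%d DFSC=0x%02x\n' \
        "$ec" "$il" "$iss" "$wnr" "$dfsc"
)

# ESR taken from the log above.
decode_esr 0x96000004
```

This agrees with the fields the kernel printed: a data abort taken at the current exception level (EC 0x25), a read rather than a write (WnR 0), and a level 0 translation fault (FSC 0x04). Together with the faulting address 0 and `x0` being zero in the register dump, that is consistent with the kernel reading through a NULL pointer inside `pipe_write`.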
I am using kernel 5.15.0-1032-realtime on Ubuntu 22.04.
/proc/cmdline:
BOOT_IMAGE=/vmlinuz-5.15.0-1032-realtime root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash processor.max_cstate=0 idle=poll coredump_filter=0x3b nosoftlockup selinux=0 audit=0 skew_tick=1 enforcing=0 crashkernel=auto softlockup_panic=0 sc=nowatchdog hugepagesz=1G hugepages=8 hugepagesz=2M hugepages=0 default_hugepagesz=1G iommu=on nohz=on nohz_full=4-79 kthread_cpus=0-3 irqaffinity=0-3 modprobe.blacklist=mlx5_ib modprobe.blacklist=mlx5_core rcu_nocb_poll rcu_nocbs=4-79
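To sanity-check which CPUs each isolation parameter covers, something like the sketch below may help. `expand_cpulist` and `cmdline_param` are hypothetical helper names, and the sample string is a shortened copy of the command line above.

```shell
#!/bin/sh
# Expand a kernel CPU list such as "0-3,8" into "0 1 2 3 8".
# Runs in a subshell so the IFS change does not leak to the caller.
expand_cpulist() (
    out=""
    IFS=','
    for part in $1; do
        lo=${part%-*}           # "4-79" -> 4, "8" -> 8
        hi=${part#*-}           # "4-79" -> 79, "8" -> 8
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            out="${out:+$out }$cpu"
            cpu=$((cpu + 1))
        done
    done
    echo "$out"
)

# Print the value of a key=value parameter from a command line string.
cmdline_param() (
    key=$1
    for word in $2; do
        case $word in
            "$key="*) echo "${word#*=}"; exit 0 ;;
        esac
    done
    exit 1
)

# Shortened copy of the command line above; on a live system this could
# be replaced with: cmdline=$(cat /proc/cmdline)
cmdline='nohz=on nohz_full=4-79 kthread_cpus=0-3 irqaffinity=0-3 rcu_nocbs=4-79'
for key in nohz_full rcu_nocbs irqaffinity; do
    value=$(cmdline_param "$key" "$cmdline") || continue
    set -- $(expand_cpulist "$value")
    echo "$key=$value covers $# CPUs"
done
```

On the running machine, the result for `nohz_full` can be compared with the contents of `/sys/devices/system/cpu/nohz_full` to confirm that the kernel actually applied the parameter.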
Part of the startup script:
# Set SMP affinity
for irq in $(ls /proc/irq/)
do echo 0-3 > /proc/irq/$irq/smp_affinity_list
done
# Move RCUOs to housekeeping CPUs
tuna -t rcu* -c 0-3 -m
# Disable watchdog timer
echo 0 > /proc/sys/kernel/watchdog
echo 65536 > /proc/sys/kernel/watchdog_thresh
# Offline and online CPUs to move timers to housekeeping: https://www.kernel.org/doc/Documentation/kernel-per-CPU-kthreads.txt
for cpu in $(seq 4 79)
do echo 0 > /sys/devices/system/cpu/cpu${cpu}/online
done
for cpu in $(seq 4 79)
do echo 1 > /sys/devices/system/cpu/cpu${cpu}/online
done
# Shut down services, whether loaded or unloaded
systemctl stop cpupower
systemctl stop irqbalance
systemctl stop firewalld
systemctl stop cpuspeed
systemctl stop cpufreqd
systemctl stop powerd
# Disable MCE
for cpu in $(seq 4 79)
do echo 0 > /sys/devices/system/machinecheck/machinecheck${cpu}/check_interval
done
echo N | tee /sys/module/drm_kms_helper/parameters/poll >/dev/null
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
# Limit CPUs of systemd to housekeeping group
systemctl set-property init.scope AllowedCPUs=0-3
# systemctl set-property system.slice AllowedCPUs=0-3
# systemctl set-property user.slice AllowedCPUs=0-3
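One note on the IRQ affinity loop at the top of the script: `/proc/irq/` also contains non-IRQ entries such as `default_smp_affinity`, and the kernel may reject the write for some IRQs. A slightly more defensive variant could look like the sketch below. `set_irq_affinity` is a hypothetical helper name, and the optional second argument exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Write a CPU list to smp_affinity_list for every numbered IRQ directory.
# Prints the number of IRQs for which the write failed.
set_irq_affinity() (
    cpus=$1
    root=${2:-/proc/irq}
    failed=0
    for dir in "$root"/[0-9]*/; do
        [ -d "$dir" ] || continue       # glob matched nothing
        # Some IRQs do not allow their affinity to be changed;
        # count those instead of aborting.
        if ! { echo "$cpus" > "${dir}smp_affinity_list"; } 2>/dev/null; then
            failed=$((failed + 1))
        fi
    done
    echo "$failed"
)

# Usage (as root), matching the housekeeping CPUs used in this post:
#   set_irq_affinity 0-3
```

A non-zero result is not necessarily a problem, but it shows how many IRQs kept their previous affinity and may still fire on the isolated CPUs.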
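I found that the problem was a bug in the kernel version I was using. To fix it, I had to upgrade to the latest kernel version, and after that the error appears to be gone.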
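To check whether a machine is still on the affected build, a rough comparison against the version from the log (5.15.0-1032-realtime) could look like this. It is only a sketch: `kernel_newer_than` is a hypothetical helper, it relies on `sort -V` from GNU coreutils, and a newer version string does not by itself prove that the fix is included.

```shell
#!/bin/sh
# Succeeds if the first version string sorts strictly after the second.
kernel_newer_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# 5.15.0-1032-realtime is the build that produced the oops in this post.
if kernel_newer_than "$(uname -r)" 5.15.0-1032-realtime; then
    echo "running kernel is newer than the affected build"
else
    echo "running kernel is the affected build or older"
fi
```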