I have set up a cluster with pcs, 2 nodes + a quorum device:
[root@konor2 etc]# pcs status
Cluster name: wildflycluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-06-01 09:52:35 +02:00)
Cluster Summary:
* Stack: corosync
* Current DC: konor2c (version 2.1.5-7.el9-a3f44794f94) - partition with quorum
* Last updated: Thu Jun 1 09:52:36 2023
* Last change: Thu Jun 1 06:03:53 2023 by root via cibadmin on konor2c
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ konor2c ]
* OFFLINE: [ konor1c ]
Full List of Resources:
* Resource Group: wildfly-resources-grp:
* wildfly-vip (ocf:heartbeat:IPaddr2): Started konor2c
* wildfly-server (systemd:wildfly): Started konor2c
* smb-mount-it (systemd:home-jboss-mnt-protector-IT.mount): Started konor2c
* smb-mount-transmek (systemd:home-jboss-mnt-protector-transmek.mount): Started konor2c
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
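For context, the resource group shown above was put together with pcs commands roughly along these lines (a sketch from memory; the VIP address and netmask below are placeholders, not the real values):
# sketch of the resource group definition; IP/netmask are placeholders
pcs resource create wildfly-vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 --group wildfly-resources-grp
pcs resource create wildfly-server systemd:wildfly --group wildfly-resources-grp
pcs resource create smb-mount-it systemd:home-jboss-mnt-protector-IT.mount --group wildfly-resources-grp
pcs resource create smb-mount-transmek systemd:home-jboss-mnt-protector-transmek.mount --group wildfly-resources-grp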
where /etc/fstab contains the following SMB mounts:
//protector/data /home/jboss/mnt/protector/data cifs noauto,vers=3.0,_netdev,credentials=/etc/ucr/vu-d.crd,domain=unmz,uid=wildfly,noexec,nosuid,mapchars,file_mode=0664,dir_mode=0775,nounix,nobrl 0
//protector/IT /home/jboss/mnt/protector/IT cifs noauto,vers=3.0,_netdev,credentials=/etc/ucr/vu-d.crd,domain=unmz,uid=wildfly,noexec,nosuid,mapchars,file_mode=0664,dir_mode=0775,nounix,nobrl 0 0
//protector/transmek /home/jboss/mnt/protector/transmek cifs noauto,vers=3.0,_netdev,credentials=/etc/ucr/vu-d.crd,domain=unmz,uid=wildfly,noexec,nosuid,mapchars,file_mode=0664,dir_mode=0775,nounix,nobrl 0 0
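(The mount points above map to systemd mount units via systemd's path escaping, e.g. /home/jboss/mnt/protector/IT becomes home-jboss-mnt-protector-IT.mount, which is the unit the cluster resources reference. A minimal sketch of bringing a share up by hand, using the IT share as an example:)
# mount directly from the fstab entry (noauto, so it is not mounted at boot)
mount /home/jboss/mnt/protector/IT
# or start the systemd mount unit generated from that fstab line
systemctl start home-jboss-mnt-protector-IT.mount
systemctl status home-jboss-mnt-protector-IT.mount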
The shares mount fine with the mount command, and I can also mount them with systemctl. Everything then works correctly, but after some time (somewhere between 2 and 20 hours; I have not found a trigger yet) the cifsiod process starts consuming a lot of CPU. After a while it consumes all the CPU and the VM has to be restarted manually from VMware vCenter. In /var/log/messages there are messages like these:
May 30 22:10:00 konor1 systemd[1]: Finished system activity accounting tool.
May 30 22:11:20 konor1 pacemaker-controld[419997]: notice: State transition S_IDLE -> S_POLICY_ENGINE
May 30 22:11:20 konor1 pacemaker-schedulerd[419996]: notice: Calculated transition 196, saving inputs in /var/lib/pacemaker/pengine/pe-input-104.bz2
May 30 22:11:20 konor1 pacemaker-controld[419997]: notice: Transition 196 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2): Complete
May 30 22:11:20 konor1 pacemaker-controld[419997]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
May 30 22:19:12 konor1 pacemaker-controld[419997]: notice: High CPU load detected: 3.990000
May 30 22:19:13 konor1 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:0:1456407]
May 30 22:19:13 konor1 kernel: Modules linked in: tls nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 dns_resolver nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock sunrpc vfat fat intel_rapl_msr intel_rapl_common vmw_balloon rapl pcspkr vmw_vmci i2c_piix4 joydev xfs libcrc32c sr_mod cdrom ata_generic vmwgfx drm_ttm_helper ttm drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops libahci ata_piix sd_mod drm t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel libata ghash_clmulni_intel vmxnet3 vmw_pvscsi serio_raw dm_mirror dm_region_hash dm_log dm_mod fuse
May 30 22:19:13 konor1 kernel: CPU: 0 PID: 1456407 Comm: kworker/0:0 Kdump: loaded Not tainted 5.14.0-284.11.1.el9_2.x86_64 #1
May 30 22:19:13 konor1 kernel: Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
May 30 22:19:13 konor1 kernel: Workqueue: cifsiod smb2_reconnect_server [cifs]
May 30 22:19:13 konor1 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x21/0x30
May 30 22:19:13 konor1 kernel: Code: 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 66 90 ba 01 00 00 00 8b 07 85 c0 75 0d f0 0f b1 17 85 c0 75 f2 c3 cc cc cc cc f3 90 <eb> e9 e9 38 fe ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57
May 30 22:19:13 konor1 kernel: RSP: 0018:ffffb00087187d78 EFLAGS: 00000202
May 30 22:19:13 konor1 kernel: RAX: 0000000000000001 RBX: ffff9cdc14b62800 RCX: 000000364c970000
May 30 22:19:13 konor1 kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9cdc14b60828
May 30 22:19:13 konor1 kernel: RBP: ffff9cdc14b60828 R08: ffffb00087187e38 R09: 0000000000000000
May 30 22:19:13 konor1 kernel: R10: ffffb00087187ce8 R11: ffff9cdc3594dc00 R12: 0000000000000000
May 30 22:19:13 konor1 kernel: R13: ffff9cdc14b60800 R14: 000000000000ffff R15: 000000000000ffff
May 30 22:19:13 konor1 kernel: FS: 0000000000000000(0000) GS:ffff9cdcb9c00000(0000) knlGS:0000000000000000
May 30 22:19:13 konor1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 30 22:19:13 konor1 kernel: CR2: 00007fa14a882000 CR3: 00000001ab010003 CR4: 00000000000606f0
May 30 22:19:13 konor1 kernel: Call Trace:
May 30 22:19:13 konor1 kernel: <TASK>
May 30 22:19:13 konor1 kernel: _raw_spin_lock+0x25/0x30
May 30 22:19:13 konor1 kernel: smb2_reconnect.part.0+0x3f/0x5f0 [cifs]
May 30 22:19:13 konor1 kernel: ? set_next_entity+0xda/0x150
May 30 22:19:13 konor1 kernel: smb2_reconnect_server+0x203/0x5f0 [cifs]
May 30 22:19:13 konor1 kernel: ? __tdx_hypercall+0x80/0x80
May 30 22:19:13 konor1 kernel: process_one_work+0x1e5/0x3c0
May 30 22:19:13 konor1 kernel: ? rescuer_thread+0x3a0/0x3a0
May 30 22:19:13 konor1 kernel: worker_thread+0x50/0x3b0
May 30 22:19:13 konor1 kernel: ? rescuer_thread+0x3a0/0x3a0
May 30 22:19:13 konor1 kernel: kthread+0xd6/0x100
May 30 22:19:13 konor1 kernel: ? kthread_complete_and_exit+0x20/0x20
May 30 22:19:13 konor1 kernel: ret_from_fork+0x1f/0x30
May 30 22:19:13 konor1 kernel: </TASK>
May 30 22:19:23 konor1 corosync-qdevice[933368]: Server didn't send echo reply message on time
May 30 22:19:34 konor1 corosync-qdevice[933368]: Connect timeout
May 30 22:19:41 konor1 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 52s! [kworker/0:0:1456407]
And so on, it keeps repeating... In corosync.log I can find the following messages (from another day):
I, [2023-05-22T09:57:32.101 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=wildflycluster (10.10.51.46) 3.75ms
I, [2023-05-22T10:06:42.066 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=wildflycluster (10.10.51.47) 4.13ms
I, [2023-05-22T10:06:42.271 #00012] INFO -- : Config files sync started
I, [2023-05-22T10:06:42.272 #00012] INFO -- : SRWT Node: konor2 Request: get_configs
I, [2023-05-22T10:06:42.272 #00012] INFO -- : Connecting to: https://konor2:2224/remote/get_configs?cluster_name=wildflycluster
I, [2023-05-22T10:06:42.272 #00012] INFO -- : SRWT Node: konor1 Request: get_configs
I, [2023-05-22T10:06:42.272 #00012] INFO -- : Connecting to: https://konor1:2224/remote/get_configs?cluster_name=wildflycluster
I, [2023-05-22T10:07:05.272 #00012] INFO -- : Config files sync finished
I, [2023-05-22T10:07:35.262 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=wildflycluster (10.10.51.46) 7.95ms
I, [2023-05-22T10:16:42.015 #00013] INFO -- : Config files sync started
I, [2023-05-22T10:16:42.016 #00013] INFO -- : SRWT Node: konor2 Request: get_configs
I, [2023-05-22T10:16:42.016 #00013] INFO -- : Connecting to: https://konor2:2224/remote/get_configs?cluster_name=wildflycluster
I, [2023-05-22T10:16:42.016 #00013] INFO -- : SRWT Node: konor1 Request: get_configs
I, [2023-05-22T10:16:42.016 #00013] INFO -- : Connecting to: https://konor1:2224/remote/get_configs?cluster_name=wildflycluster
I, [2023-05-22T10:16:42.016 #00013] INFO -- : No response from: konor1 request: get_configs, error: couldnt_connect
I, [2023-05-22T10:16:42.016 #00013] INFO -- : No response from: konor2 request: get_configs, error: couldnt_connect
I, [2023-05-22T10:16:42.016 #00013] INFO -- : Config files sync finished
It looks like the server lost network connectivity.
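To cross-check whether the corosync/qdevice link really drops at that moment, commands along these lines could be run on the affected node while the problem is happening (a sketch, nothing cluster-specific assumed):
# quorum and qdevice view from the node
pcs quorum status
corosync-qdevice-tool -s -v
# corosync link/ring status
corosync-cfgtool -s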
I have another single (non-clustered) VM with the same SMB mounts, and its mounts have stayed up for days without any problem. When I run the cluster without the SMB mounts, it also runs for days without issues. The problem occurs on both VMs of the cluster. I have also reinstalled them, but it is still the same.
Have you seen anything similar? Do you have any suggestions on how to troubleshoot this? Thanks for any hints.