OK! I'm really new to Pacemaker/Corosync, like one day new.
Software: Ubuntu 18.04 LTS and the versions associated with that distro.
pacemaker: 1.1.18
corosync: 2.4.3
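I accidentally removed the nodes from my entire test cluster (3 nodes).
When I tried to bring everything back using the pcsd GUI, that failed because the nodes had been "wiped". Cool.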
So. I had a copy of the last corosync.conf from my "primary" node. I copied it to the other two nodes. I fixed the bindnetaddr in the respective confs. I ran pcs cluster start on my "primary" node.
One of the nodes failed to come up. I looked at the status of pacemaker on that node and got the following exception:
Dec 18 06:33:56 region-ctrl-2 crmd[1049]: crit: Nodes 1084777441 and 2 share the same name 'region-ctrl-2': shutting down
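I tried running crm_node -R --force 1084777441 on the machine where pacemaker won't start, but of course pacemaker isn't running, so I got a crmd: connection refused (111) error. So I ran the same command on one of the healthy nodes. It showed no errors, but the node never goes away, and pacemaker on the affected machine keeps showing the same error.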
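So I decided to tear down the whole cluster and start over. I purged all the packages from the machines. I reinstalled everything fresh. I copied corosync.conf to the machines and fixed it up. I recreated the cluster. I got the exact same bloody error.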
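So this node named 1084777441 is not a machine I created. It's one the cluster created for me. Earlier in the day I realized I was using IP addresses in corosync.conf instead of names. I fixed /etc/hosts on the machines and removed the IP addresses from the corosync config, and that's how I accidentally deleted my whole cluster in the first place (I removed the nodes that were IP addresses).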
Here is my corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.225
        mcastport: 5405
        ttl: 1
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}
nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }
    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }
    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
The only thing in this conf that differs between the nodes is bindnetaddr.
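Since the copies are supposed to differ only in bindnetaddr, one quick sanity check is to confirm that the nodelist itself has no repeated nodeid or ring0_addr. The sketch below is a minimal Python illustration; parse_nodelist and find_duplicates are hypothetical helpers (not part of corosync or pcs), and the parser assumes the simple "key: value" / "section {" ... "}" layout used in the config above.

```python
from collections import Counter

# Hypothetical helper: collects the key/value pairs of every "node { ... }"
# block nested directly inside "nodelist { ... }".
def parse_nodelist(conf_text):
    nodes, stack, current = [], [], None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and indentation
        if not line:
            continue
        if line.endswith("{"):
            stack.append(line[:-1].strip())  # entering a section
            if stack == ["nodelist", "node"]:
                current = {}
        elif line == "}":
            if stack == ["nodelist", "node"]:
                nodes.append(current)
                current = None
            stack.pop()  # leaving a section
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return nodes

# Hypothetical helper: reports any nodeid or ring0_addr used by more than one node.
def find_duplicates(nodes):
    problems = {}
    for key in ("nodeid", "ring0_addr"):
        counts = Counter(node.get(key) for node in nodes)
        dups = sorted(str(v) for v, c in counts.items() if c > 1)
        if dups:
            problems[key] = dups
    return problems

NODELIST = """
nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }
    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }
    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
"""

nodes = parse_nodelist(NODELIST)
print({n["ring0_addr"]: int(n["nodeid"]) for n in nodes})
print(find_duplicates(nodes))  # {} : every name and nodeid is unique
```

For this nodelist the check comes back clean, which suggests the duplicate does not originate in the file itself.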
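There seems to be a chicken-and-egg problem here, unless there is some way I don't know of to remove a node from a flat-file or SQLite database somewhere, or some other, more authoritative way to remove nodes from a cluster.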
ADDITIONAL
I've made sure that /etc/hosts and the hostname of every machine match. I forgot to mention that.
127.0.0.1 localhost
127.0.1.1 postgres
192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2
192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
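In this hosts file every cluster name is listed against two addresses (one on 192.168.99.x and one on 192.168.7.x), while the conf shown above binds ring 0 to 192.168.99.225. It may be worth confirming which address each ring0_addr actually resolves to on every node (for example with getent hosts region-ctrl-2). The minimal Python sketch below just makes such multi-address names visible; names_with_multiple_addresses is a hypothetical helper, and whether this is the cause of the error is not established here.

```python
from collections import defaultdict

# Hypothetical helper: for hosts(5)-format text, return every hostname that is
# listed against more than one address, with the addresses in file order.
def names_with_multiple_addresses(hosts_text):
    addrs = defaultdict(list)
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # strip comments, split on whitespace
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        for name in names:
            if address not in addrs[name]:
                addrs[name].append(address)
    return {name: a for name, a in addrs.items() if len(a) > 1}

HOSTS = """
127.0.0.1 localhost
127.0.1.1 postgres
192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2
192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2
"""

for name, addresses in names_with_multiple_addresses(HOSTS).items():
    print(name, addresses)
```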
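I decided to try starting from scratch. I ran apt remove --purge on corosync*, pacemaker*, crmsh and pcs. I ran rm -rf on /etc/corosync. I kept a copy of corosync.conf on each machine.
I reinstalled everything on each machine. I copied the saved corosync.conf to /etc/corosync/ and restarted corosync on all machines.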
I still get the same error. This has to be a bug in one of the components!
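So it seems that crm_get_peer is not recognizing that the host named region-ctrl-2 is assigned nodeid 2 in corosync.conf. Node 2 then gets automatically assigned the ID 1084777441. This doesn't make sense to me. The machine's hostname is region-ctrl-2, set in /etc/hostname and /etc/hosts and confirmed with uname -n. corosync.conf explicitly assigns an ID to the machine named region-ctrl-2, but something apparently isn't picking up that assignment from corosync and is instead giving this host a non-random ID with the value 1084777441. How do I fix this?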
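The value 1084777441 does look derivable rather than arbitrary. The corosync.conf(5) man page describes clear_node_high_bit as relevant only when no nodeid is specified, in which case corosync generates one from the 32-bit IPv4 address and this option forces the top bit to zero; the conf above sets clear_node_high_bit: yes. The minimal Python sketch below (auto_nodeid is a hypothetical name) applies that arithmetic to region-ctrl-2's address, 192.168.99.225, under the assumption that this is the address corosync used.

```python
import ipaddress

# Hypothetical helper: read an IPv4 address as a 32-bit integer and, to mirror
# clear_node_high_bit: yes, optionally clear the top bit.
def auto_nodeid(ipv4, clear_high_bit=True):
    value = int(ipaddress.IPv4Address(ipv4))
    if clear_high_bit:
        value &= 0x7FFFFFFF
    return value

print(auto_nodeid("192.168.99.225"))         # 1084777441
print(auto_nodeid("192.168.99.225", False))  # 3232261089
print(hex(1084777441))                       # 0x40a863e1 (0xc0a863e1 with the high bit cleared)
```

The result matches the ID in the logs exactly, which suggests (without proving it) that corosync on region-ctrl-2 did not match the local node to the nodeid: 2 entry and fell back to an address-derived ID.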
LOGS
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
info: get_cluster_type: Detected an active 'corosync' cluster
info: qb_ipcs_us_publish: server name: pacemakerd
info: pcmk__ipc_is_authentic_process_active: Could not connect to lrmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to cib_ro IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to crmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to attrd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to pengine IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to stonith-ng IPC: Connection refused
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry ea4ec23e-e676-4798-9b8b-00af39d3bb3d/0x5555f74984d0 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: cluster_connect_quorum: Quorum acquired
info: crm_get_peer: Created entry 882c0feb-d546-44b7-955f-4c8a844a0db1/0x5555f7499fd0 for node postgres-sb/3 (2 total)
info: crm_get_peer: Node 3 is now known as postgres-sb
info: crm_get_peer: Node 3 has uuid 3
info: crm_get_peer: Created entry 4e6a6b1e-d687-4527-bffc-5d701ff60a66/0x5555f749a6f0 for node region-ctrl-2/2 (3 total)
info: crm_get_peer: Node 2 is now known as region-ctrl-2
info: crm_get_peer: Node 2 has uuid 2
info: crm_get_peer: Created entry 5532a3cc-2577-4764-b9ee-770d437ccec0/0x5555f749a0a0 for node region-ctrl-1/1 (4 total)
info: crm_get_peer: Node 1 is now known as region-ctrl-1
info: crm_get_peer: Node 1 has uuid 1
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
warning: crm_find_peer: Node 1084777441 and 2 share the same name: 'region-ctrl-2'
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: pcmk_quorum_notification: Quorum retained | membership=32 members=3
notice: crm_update_peer_state_iter: Node region-ctrl-1 state is now member | nodeid=1 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node postgres-sb state is now member | nodeid=3 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node region-ctrl-2 state is now member | nodeid=1084777441 previous=unknown source=pcmk_quorum_notification
info: crm_reap_unseen_nodes: State of node region-ctrl-2[2] is still unknown
info: pcmk_cpg_membership: Node 1084777441 joined group pacemakerd (counter=0.0, pid=32765, unchecked for rivals)
info: pcmk_cpg_membership: Node 1 still member of group pacemakerd (peer=region-ctrl-1:900, counter=0.0, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node region-ctrl-1[1] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 3 still member of group pacemakerd (peer=postgres-sb:976, counter=0.1, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node postgres-sb[3] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 1084777441 still member of group pacemakerd (peer=region-ctrl-2:3016, counter=0.2, at least once)
pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: qb_ipcs_us_publish: server name: lrmd
pengine: info: qb_ipcs_us_publish: server name: pengine
cib: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: get_cluster_type: Verifying cluster type: 'corosync'
attrd: info: get_cluster_type: Assuming an active 'corosync' cluster
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: get_cluster_type: Verifying cluster type: 'corosync'
cib: info: get_cluster_type: Assuming an active 'corosync' cluster
info: get_cluster_type: Verifying cluster type: 'corosync'
info: get_cluster_type: Assuming an active 'corosync' cluster
notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: info: validate_with_relaxng: Creating RNG parser context
crmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
crmd: info: get_cluster_type: Verifying cluster type: 'corosync'
crmd: info: get_cluster_type: Assuming an active 'corosync' cluster
crmd: info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
attrd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
attrd: info: crm_get_peer: Created entry af5c62c9-21c5-4428-9504-ea72a92de7eb/0x560870420e90 for node (null)/1084777441 (1 total)
attrd: info: crm_get_peer: Node 1084777441 has uuid 1084777441
attrd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
attrd: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry 5bcb51ae-0015-4652-b036-b92cf4f1d990/0x55f583634700 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
attrd: notice: get_node_name: Defaulting to uname -n for the local corosync node name
attrd: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
cib: info: crm_get_peer: Created entry a6ced2c1-9d51-445d-9411-2fb19deab861/0x55848365a150 for node (null)/1084777441 (1 total)
cib: info: crm_get_peer: Node 1084777441 has uuid 1084777441
cib: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
cib: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
cib: info: init_cs_connection_once: Connection to 'corosync': established
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Defaulting to uname -n for the local corosync node name
cib: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: info: qb_ipcs_us_publish: server name: cib_ro
cib: info: qb_ipcs_us_publish: server name: cib_rw
cib: info: qb_ipcs_us_publish: server name: cib_shm
cib: info: pcmk_cpg_membership: Node 1084777441 joined group cib (counter=0.0, pid=0, unchecked for rivals)
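To spot the collision in a log like this mechanically rather than by eye, one option is to collect the "Node N is now known as NAME" messages and report any name claimed by more than one nodeid. The Python sketch below is a hypothetical filter (names_with_multiple_ids is not a Pacemaker tool), and the regular expression assumes the exact message wording shown above.

```python
import re
from collections import defaultdict

# Matches crm_get_peer messages such as "Node 2 is now known as region-ctrl-2".
KNOWN_AS = re.compile(r"Node (\d+) is now known as (\S+)")

# Hypothetical helper: map each node name to the set of nodeids that claimed it,
# keeping only names claimed by more than one nodeid.
def names_with_multiple_ids(log_text):
    ids_by_name = defaultdict(set)
    for match in KNOWN_AS.finditer(log_text):
        nodeid, name = match.groups()
        ids_by_name[name].add(int(nodeid))
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

SAMPLE = """
info: crm_get_peer: Node 3 is now known as postgres-sb
info: crm_get_peer: Node 2 is now known as region-ctrl-2
info: crm_get_peer: Node 1 is now known as region-ctrl-1
warning: crm_find_peer: Node 1084777441 and 2 share the same name: 'region-ctrl-2'
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
"""

print(names_with_multiple_ids(SAMPLE))  # {'region-ctrl-2': [2, 1084777441]}
```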