I have a DRBD cluster in which one node was shut down for a few days. Running on a single node worked fine, with no issues at all. When I powered the node back on, I ended up in a situation where all resources were stopped, one DRBD volume was Secondary and the others Primary, because the cluster apparently tried to swap roles over to the node that had just been powered on (ha1 was active, then I powered on ha2 at 08:06, to make the logs easier to follow).
My questions:
- Can anyone help me figure out what happened here? (If this question is considered too much work, I'm open to paid consulting to get a correct configuration.)
- As a side question, if the situation resolves itself, is there a way to get the cluster to clean up the resources on its own? A Linux-HA cluster needed no intervention once the fault condition cleared after a failover, so either I've been spoiled or I don't know how to achieve the same thing here.
Below is all the information I can imagine being useful.
bash-5.1# cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: 60F610B702CC05315B04B50
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
The cluster status ends up as:
bash-5.1# pcs status
Cluster name: HA
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
Cluster Summary:
* Stack: corosync
* Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
* Last updated: Thu Aug 10 08:38:40 2023
* Last change: Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
* 2 nodes configured
* 14 resource instances configured
Node List:
* Online: [ ha1.local ha2.local ]
Full List of Resources:
* Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
* Promoted: [ ha2.local ]
* Unpromoted: [ ha1.local ]
* Resource Group: nsdrbd:
* LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
* LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
* LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
* ClusterIP (ocf:heartbeat:IPaddr2): Stopped
* Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
* Promoted: [ ha1.local ]
* Unpromoted: [ ha2.local ]
* postgresql (systemd:postgresql): Stopped
* Clone Set: LV_HOME-clone [LV_HOME] (promotable):
* Promoted: [ ha1.local ]
* Unpromoted: [ ha2.local ]
* ns_mhswdog (lsb:mhswdog): Stopped
* Clone Set: pingd-clone [pingd]:
* Started: [ ha1.local ha2.local ]
Failed Resource Actions:
* LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
* LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
I'm attaching the logs from both nodes:
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 0 is up
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
and
Aug 10 08:06:55 [1128] ha2.local corosync notice [MAIN ] Corosync Cluster Engine 3.1.5 starting up
Aug 10 08:06:55 [1128] ha2.local corosync info [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 10 08:06:56 [1128] ha2.local corosync notice [TOTEM ] Initializing transport (Kronosnet).
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] totemknet initialized
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cmap
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cfg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cpg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Using quorum provider corosync_votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: quorum
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 0
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 0: local addr: 192.168.51.216, port=5405
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 1
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 1: local addr: 10.0.0.2, port=5406
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [TOTEM ] A new membership (2.126) was formed. Members joined: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 0 is up
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 469
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] This node is within the primary component and will provide service.
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 8885
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
I'm also attaching the output of pcs config show; I can provide pcs cluster cib on request as well.
Cluster Name: HA
Corosync Nodes:
ha1.local ha2.local
Pacemaker Nodes:
ha1.local ha2.local
Resources:
Resource: postgresql (class=systemd type=postgresql)
Operations:
monitor: postgresql-monitor-interval-60s
interval=60s
start: postgresql-start-interval-0s
interval=0s
timeout=100
stop: postgresql-stop-interval-0s
interval=0s
timeout=100
Resource: ns_mhswdog (class=lsb type=mhswdog)
Operations:
force-reload: ns_mhswdog-force-reload-interval-0s
interval=0s
timeout=15
monitor: ns_mhswdog-monitor-interval-60s
interval=60s
timeout=10s
on-fail=standby
restart: ns_mhswdog-restart-interval-0s
interval=0s
timeout=140s
start: ns_mhswdog-start-interval-0s
interval=0s
timeout=80s
stop: ns_mhswdog-stop-interval-0s
interval=0s
timeout=80s
Group: nsdrbd
Resource: LV_BLOBFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_BLOBFS-instance_attributes
device=/dev/drbd0
directory=/data
fstype=ext4
Operations:
monitor: LV_BLOBFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_BLOBFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_BLOBFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_POSTGRESFS-instance_attributes
device=/dev/drbd1
directory=/var/lib/pgsql
fstype=ext4
Operations:
monitor: LV_POSTGRESFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_POSTGRESFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_POSTGRESFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_HOMEFS-instance_attributes
device=/dev/drbd2
directory=/home
fstype=ext4
Operations:
monitor: LV_HOMEFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_HOMEFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_HOMEFS-stop-interval-0s
interval=0s
timeout=60s
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ClusterIP-instance_attributes
cidr_netmask=32
ip=192.168.51.75
Operations:
monitor: ClusterIP-monitor-interval-60s
interval=60s
start: ClusterIP-start-interval-0s
interval=0s
timeout=20s
stop: ClusterIP-stop-interval-0s
interval=0s
timeout=20s
Clone: LV_BLOB-clone
Meta Attributes: LV_BLOB-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_BLOB (class=ocf provider=linbit type=drbd)
Attributes: LV_BLOB-instance_attributes
drbd_resource=lv_blob
Operations:
demote: LV_BLOB-demote-interval-0s
interval=0s
timeout=90
monitor: LV_BLOB-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_BLOB-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_BLOB-notify-interval-0s
interval=0s
timeout=90
promote: LV_BLOB-promote-interval-0s
interval=0s
timeout=90
reload: LV_BLOB-reload-interval-0s
interval=0s
timeout=30
start: LV_BLOB-start-interval-0s
interval=0s
timeout=240
stop: LV_BLOB-stop-interval-0s
interval=0s
timeout=100
Clone: LV_POSTGRES-clone
Meta Attributes: LV_POSTGRES-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
Attributes: LV_POSTGRES-instance_attributes
drbd_resource=lv_postgres
Operations:
demote: LV_POSTGRES-demote-interval-0s
interval=0s
timeout=90
monitor: LV_POSTGRES-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_POSTGRES-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_POSTGRES-notify-interval-0s
interval=0s
timeout=90
promote: LV_POSTGRES-promote-interval-0s
interval=0s
timeout=90
reload: LV_POSTGRES-reload-interval-0s
interval=0s
timeout=30
start: LV_POSTGRES-start-interval-0s
interval=0s
timeout=240
stop: LV_POSTGRES-stop-interval-0s
interval=0s
timeout=100
Clone: LV_HOME-clone
Meta Attributes: LV_HOME-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_HOME (class=ocf provider=linbit type=drbd)
Attributes: LV_HOME-instance_attributes
drbd_resource=lv_home
Operations:
demote: LV_HOME-demote-interval-0s
interval=0s
timeout=90
monitor: LV_HOME-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_HOME-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_HOME-notify-interval-0s
interval=0s
timeout=90
promote: LV_HOME-promote-interval-0s
interval=0s
timeout=90
reload: LV_HOME-reload-interval-0s
interval=0s
timeout=30
start: LV_HOME-start-interval-0s
interval=0s
timeout=240
stop: LV_HOME-stop-interval-0s
interval=0s
timeout=100
Clone: pingd-clone
Resource: pingd (class=ocf provider=pacemaker type=ping)
Attributes: pingd-instance_attributes
dampen=6s
host_list=192.168.51.251
multiplier=1000
Operations:
monitor: pingd-monitor-interval-10s
interval=10s
timeout=60s
reload-agent: pingd-reload-agent-interval-0s
interval=0s
timeout=20s
start: pingd-start-interval-0s
interval=0s
timeout=60s
stop: pingd-stop-interval-0s
interval=0s
timeout=20s
Stonith Devices:
Fencing Levels:
Location Constraints:
Resource: ClusterIP
Constraint: location-ClusterIP
Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
Expression: pingd lt 1 (id:location-ClusterIP-rule-expr)
Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
Ordering Constraints:
promote LV_BLOB-clone then start LV_BLOBFS (kind:Mandatory) (id:order-LV_BLOB-clone-LV_BLOBFS-mandatory)
promote LV_POSTGRES-clone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRES-clone-LV_POSTGRESFS-mandatory)
start LV_POSTGRESFS then start postgresql (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql-mandatory)
promote LV_HOME-clone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOME-clone-LV_HOMEFS-mandatory)
start LV_HOMEFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_HOMEFS-ns_mhswdog-mandatory)
start LV_BLOBFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_BLOBFS-ns_mhswdog-mandatory)
start postgresql then start ns_mhswdog (kind:Mandatory) (id:order-postgresql-ns_mhswdog-mandatory)
start ns_mhswdog then start ClusterIP (kind:Mandatory) (id:order-ns_mhswdog-ClusterIP-mandatory)
Colocation Constraints:
LV_BLOBFS with LV_BLOB-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_BLOBFS-LV_BLOB-clone-INFINITY)
LV_POSTGRESFS with LV_POSTGRES-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_POSTGRESFS-LV_POSTGRES-clone-INFINITY)
postgresql with LV_POSTGRESFS (score:INFINITY) (id:colocation-postgresql-LV_POSTGRESFS-INFINITY)
LV_HOMEFS with LV_HOME-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_HOMEFS-LV_HOME-clone-INFINITY)
ns_mhswdog with LV_HOMEFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_HOMEFS-INFINITY)
ns_mhswdog with LV_BLOBFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_BLOBFS-INFINITY)
ns_mhswdog with postgresql (score:INFINITY) (id:colocation-ns_mhswdog-postgresql-INFINITY)
ClusterIP with ns_mhswdog (score:INFINITY) (id:colocation-ClusterIP-ns_mhswdog-INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
Meta Attrs: build-resource-defaults
resource-stickiness=INFINITY
Operations Defaults:
Meta Attrs: op_defaults-meta_attributes
timeout=240s
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: HA
dc-version: 2.1.4-5.el9_1.2-dc6eb4362e
have-watchdog: false
last-lrm-refresh: 1688971748
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false
Tags:
No tags defined
Quorum:
Options:
This looks like a misconfigured fencing or quorum pick-up issue. It's very typical DRBD behavior: it works while it works, then fails miserably with a split brain or replicated volumes stuck in the wrong state at some point and for no real reason. I would recommend not trying to fix DRBD, as it's not fixable by design, but taking your best chance to scrap it and use something more reliable out of the box. I'd really recommend Ceph for its stability, outstanding performance, and huge community support.
Clusters like Pacemaker are highly configurable, and you need to be careful because they can yank block devices, i.e. your data, between nodes. Testing the behavior of your cluster in various situations is not optional. Ideally, that testing results in operational playbooks of what to do in the various scenarios.
Read the manual. The downstream distros that support Pacemaker provide documentation, like the RHEL HA guide, that explains many scenarios and serves as a reference.
You can configure resources to prefer the node they are currently running on. A non-zero cluster-wide stickiness value is probably a good idea, as in the sketch below.
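A minimal sketch (the exact value is a judgment call, and note this cluster's resource defaults already set resource-stickiness=INFINITY; the update subcommand is for recent pcs versions):

# set a cluster-wide default so resources prefer to stay where they are running
pcs resource defaults update resource-stickiness=1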
A location constraint with a higher score than this can still move resources.
The "Performing cluster maintenance" topic has a reference for moving, stopping, and starting resources. Planned maintenance on a node should probably be wrapped in pcs node standby and pcs node unstandby, as shown below.
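For example, using the node names from this cluster:

# drain resources off ha2 before planned maintenance
pcs node standby ha2.local
# ... do the maintenance, reboot, etc. ...
# allow ha2 to host resources again
pcs node unstandby ha2.local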
Try a pcs resource move to get your resource group onto one node, and watch what happens once the move constraint is gone. If resources move back and you didn't want that, troubleshoot stickiness, constraints, dependencies, and other rules. A DRBD status of UpToDate/UpToDate implies your volumes are healthy. Move things around with the cluster tools, and confirm the volumes mount successfully.
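A sketch using the group name from this configuration; depending on the pcs version, move may remove its own constraint once the resource settles, otherwise clear it manually:

# move the resource group to the node where the DRBD volumes are already Primary
pcs resource move nsdrbd ha1.local
# drop the temporary location constraint created by the move
pcs resource clear nsdrbd
# check where everything ended up
pcs status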
Two-node clusters are genuinely difficult: when they partition, there is no good way to select a primary. Consider adding another node for quorum purposes, even if it has no disks and cannot host this resource group.
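One lighter-weight option (a sketch, assuming a separate third host, here called qnetd-host, reachable from both nodes) is a corosync quorum device rather than a full Pacemaker node:

# on the third host: install and start the qnetd daemon
dnf install corosync-qnetd pcs
pcs qdevice setup model net --enable --start

# on both cluster nodes: install the qdevice client, then register it from one node
dnf install corosync-qdevice
pcs quorum device add model net host=qnetd-host algorithm=ffsplit

With a third vote available, the current no-quorum-policy: ignore setting could also be revisited.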