I have a DRBD cluster in which one node was shut down for a few days. Running on a single node worked fine, with no issues at all. When I powered the node back on, I ended up in a situation where all resources were stopped, one DRBD volume was Secondary and the others Primary, because the cluster apparently tried to swap roles over to the node that had just been powered on (ha1 was active, then I powered on ha2 at 08:06, to make the logs easier to follow).
My questions:
- Can anyone help me figure out what happened here? (If this question is considered too much work, I'm open to paid consulting to get a correct configuration.)
- As a side question, if the situation resolves itself, is there a way to get the cluster to clean up the resources on its own? A Linux-HA cluster needed no intervention once the fault condition cleared after a failover, so either I've been spoiled or I don't know how to achieve the same thing here.
Below is all the information I can imagine being useful.
bash-5.1# cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: 60F610B702CC05315B04B50
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
The cluster status ends up as:
bash-5.1# pcs status
Cluster name: HA
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
Cluster Summary:
* Stack: corosync
* Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
* Last updated: Thu Aug 10 08:38:40 2023
* Last change: Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
* 2 nodes configured
* 14 resource instances configured
Node List:
* Online: [ ha1.local ha2.local ]
Full List of Resources:
* Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
* Promoted: [ ha2.local ]
* Unpromoted: [ ha1.local ]
* Resource Group: nsdrbd:
* LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
* LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
* LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
* ClusterIP (ocf:heartbeat:IPaddr2): Stopped
* Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
* Promoted: [ ha1.local ]
* Unpromoted: [ ha2.local ]
* postgresql (systemd:postgresql): Stopped
* Clone Set: LV_HOME-clone [LV_HOME] (promotable):
* Promoted: [ ha1.local ]
* Unpromoted: [ ha2.local ]
* ns_mhswdog (lsb:mhswdog): Stopped
* Clone Set: pingd-clone [pingd]:
* Started: [ ha1.local ha2.local ]
Failed Resource Actions:
* LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
* LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
I'm attaching the logs from both nodes:
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 0 is up
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
and
Aug 10 08:06:55 [1128] ha2.local corosync notice [MAIN ] Corosync Cluster Engine 3.1.5 starting up
Aug 10 08:06:55 [1128] ha2.local corosync info [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 10 08:06:56 [1128] ha2.local corosync notice [TOTEM ] Initializing transport (Kronosnet).
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] totemknet initialized
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cmap
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cfg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cpg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Using quorum provider corosync_votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: quorum
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 0
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 0: local addr: 192.168.51.216, port=5405
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 1
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 1: local addr: 10.0.0.2, port=5406
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [TOTEM ] A new membership (2.126) was formed. Members joined: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 0 is up
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 469
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] This node is within the primary component and will provide service.
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 8885
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
I'm also attaching the output of pcs config show; I can provide pcs cluster cib on request as well.
Cluster Name: HA
Corosync Nodes:
ha1.local ha2.local
Pacemaker Nodes:
ha1.local ha2.local
Resources:
Resource: postgresql (class=systemd type=postgresql)
Operations:
monitor: postgresql-monitor-interval-60s
interval=60s
start: postgresql-start-interval-0s
interval=0s
timeout=100
stop: postgresql-stop-interval-0s
interval=0s
timeout=100
Resource: ns_mhswdog (class=lsb type=mhswdog)
Operations:
force-reload: ns_mhswdog-force-reload-interval-0s
interval=0s
timeout=15
monitor: ns_mhswdog-monitor-interval-60s
interval=60s
timeout=10s
on-fail=standby
restart: ns_mhswdog-restart-interval-0s
interval=0s
timeout=140s
start: ns_mhswdog-start-interval-0s
interval=0s
timeout=80s
stop: ns_mhswdog-stop-interval-0s
interval=0s
timeout=80s
Group: nsdrbd
Resource: LV_BLOBFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_BLOBFS-instance_attributes
device=/dev/drbd0
directory=/data
fstype=ext4
Operations:
monitor: LV_BLOBFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_BLOBFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_BLOBFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_POSTGRESFS-instance_attributes
device=/dev/drbd1
directory=/var/lib/pgsql
fstype=ext4
Operations:
monitor: LV_POSTGRESFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_POSTGRESFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_POSTGRESFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_HOMEFS-instance_attributes
device=/dev/drbd2
directory=/home
fstype=ext4
Operations:
monitor: LV_HOMEFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_HOMEFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_HOMEFS-stop-interval-0s
interval=0s
timeout=60s
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ClusterIP-instance_attributes
cidr_netmask=32
ip=192.168.51.75
Operations:
monitor: ClusterIP-monitor-interval-60s
interval=60s
start: ClusterIP-start-interval-0s
interval=0s
timeout=20s
stop: ClusterIP-stop-interval-0s
interval=0s
timeout=20s
Clone: LV_BLOB-clone
Meta Attributes: LV_BLOB-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_BLOB (class=ocf provider=linbit type=drbd)
Attributes: LV_BLOB-instance_attributes
drbd_resource=lv_blob
Operations:
demote: LV_BLOB-demote-interval-0s
interval=0s
timeout=90
monitor: LV_BLOB-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_BLOB-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_BLOB-notify-interval-0s
interval=0s
timeout=90
promote: LV_BLOB-promote-interval-0s
interval=0s
timeout=90
reload: LV_BLOB-reload-interval-0s
interval=0s
timeout=30
start: LV_BLOB-start-interval-0s
interval=0s
timeout=240
stop: LV_BLOB-stop-interval-0s
interval=0s
timeout=100
Clone: LV_POSTGRES-clone
Meta Attributes: LV_POSTGRES-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
Attributes: LV_POSTGRES-instance_attributes
drbd_resource=lv_postgres
Operations:
demote: LV_POSTGRES-demote-interval-0s
interval=0s
timeout=90
monitor: LV_POSTGRES-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_POSTGRES-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_POSTGRES-notify-interval-0s
interval=0s
timeout=90
promote: LV_POSTGRES-promote-interval-0s
interval=0s
timeout=90
reload: LV_POSTGRES-reload-interval-0s
interval=0s
timeout=30
start: LV_POSTGRES-start-interval-0s
interval=0s
timeout=240
stop: LV_POSTGRES-stop-interval-0s
interval=0s
timeout=100
Clone: LV_HOME-clone
Meta Attributes: LV_HOME-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_HOME (class=ocf provider=linbit type=drbd)
Attributes: LV_HOME-instance_attributes
drbd_resource=lv_home
Operations:
demote: LV_HOME-demote-interval-0s
interval=0s
timeout=90
monitor: LV_HOME-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_HOME-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_HOME-notify-interval-0s
interval=0s
timeout=90
promote: LV_HOME-promote-interval-0s
interval=0s
timeout=90
reload: LV_HOME-reload-interval-0s
interval=0s
timeout=30
start: LV_HOME-start-interval-0s
interval=0s
timeout=240
stop: LV_HOME-stop-interval-0s
interval=0s
timeout=100
Clone: pingd-clone
Resource: pingd (class=ocf provider=pacemaker type=ping)
Attributes: pingd-instance_attributes
dampen=6s
host_list=192.168.51.251
multiplier=1000
Operations:
monitor: pingd-monitor-interval-10s
interval=10s
timeout=60s
reload-agent: pingd-reload-agent-interval-0s
interval=0s
timeout=20s
start: pingd-start-interval-0s
interval=0s
timeout=60s
stop: pingd-stop-interval-0s
interval=0s
timeout=20s
Stonith Devices:
Fencing Levels:
Location Constraints:
Resource: ClusterIP
Constraint: location-ClusterIP
Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
Expression: pingd lt 1 (id:location-ClusterIP-rule-expr)
Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
Ordering Constraints:
promote LV_BLOB-clone then start LV_BLOBFS (kind:Mandatory) (id:order-LV_BLOB-clone-LV_BLOBFS-mandatory)
promote LV_POSTGRES-clone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRES-clone-LV_POSTGRESFS-mandatory)
start LV_POSTGRESFS then start postgresql (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql-mandatory)
promote LV_HOME-clone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOME-clone-LV_HOMEFS-mandatory)
start LV_HOMEFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_HOMEFS-ns_mhswdog-mandatory)
start LV_BLOBFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_BLOBFS-ns_mhswdog-mandatory)
start postgresql then start ns_mhswdog (kind:Mandatory) (id:order-postgresql-ns_mhswdog-mandatory)
start ns_mhswdog then start ClusterIP (kind:Mandatory) (id:order-ns_mhswdog-ClusterIP-mandatory)
Colocation Constraints:
LV_BLOBFS with LV_BLOB-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_BLOBFS-LV_BLOB-clone-INFINITY)
LV_POSTGRESFS with LV_POSTGRES-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_POSTGRESFS-LV_POSTGRES-clone-INFINITY)
postgresql with LV_POSTGRESFS (score:INFINITY) (id:colocation-postgresql-LV_POSTGRESFS-INFINITY)
LV_HOMEFS with LV_HOME-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_HOMEFS-LV_HOME-clone-INFINITY)
ns_mhswdog with LV_HOMEFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_HOMEFS-INFINITY)
ns_mhswdog with LV_BLOBFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_BLOBFS-INFINITY)
ns_mhswdog with postgresql (score:INFINITY) (id:colocation-ns_mhswdog-postgresql-INFINITY)
ClusterIP with ns_mhswdog (score:INFINITY) (id:colocation-ClusterIP-ns_mhswdog-INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
Meta Attrs: build-resource-defaults
resource-stickiness=INFINITY
Operations Defaults:
Meta Attrs: op_defaults-meta_attributes
timeout=240s
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: HA
dc-version: 2.1.4-5.el9_1.2-dc6eb4362e
have-watchdog: false
last-lrm-refresh: 1688971748
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false
Tags:
No tags defined
Quorum:
Options:
This looks like a misconfigured fencing or quorum pick-up issue. It's very typical DRBD behavior: it works while it works, then fails miserably with a split brain or replicated volumes stuck in the wrong state at some point and for no real reason. I would recommend not trying to fix DRBD, as it's not fixable by design, but taking your best chance to scrap it and use something more reliable out of the box. I'd really recommend Ceph for its stability, outstanding performance, and huge community support.
Clusters like Pacemaker are highly configurable, and you need to be careful because they can yank block devices, i.e. your data, between nodes. Testing the behavior of your cluster in various situations is not optional. Ideally, that testing results in operational playbooks of what to do in the various scenarios.
Read the manual. The downstream distros that support Pacemaker provide documentation, like the RHEL HA guide, that explains many scenarios and serves as a reference.
You can configure resources to prefer the node they are currently running on. A non-zero cluster-wide stickiness value is probably a good idea, as in the sketch below.
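A minimal sketch (the exact value is a judgment call, and note this cluster's resource defaults already set resource-stickiness=INFINITY; the update subcommand is for recent pcs versions):

# set a cluster-wide default so resources prefer to stay where they are running
pcs resource defaults update resource-stickiness=1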
A location constraint with a higher score than this can still move resources.
The "Performing cluster maintenance" topic has a reference for moving, stopping, and starting resources. Planned maintenance on a node should probably be wrapped in pcs node standby and pcs node unstandby, as shown below.
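For example, using the node names from this cluster:

# drain resources off ha2 before planned maintenance
pcs node standby ha2.local
# ... do the maintenance, reboot, etc. ...
# allow ha2 to host resources again
pcs node unstandby ha2.local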
Try a pcs resource move to get your resource group onto one node, and watch what happens once the move constraint is gone. If resources move back and you didn't want that, troubleshoot stickiness, constraints, dependencies, and other rules. A DRBD status of UpToDate/UpToDate implies your volumes are healthy. Move things around with the cluster tools, and confirm the volumes mount successfully.
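A sketch using the group name from this configuration; depending on the pcs version, move may remove its own constraint once the resource settles, otherwise clear it manually:

# move the resource group to the node where the DRBD volumes are already Primary
pcs resource move nsdrbd ha1.local
# drop the temporary location constraint created by the move
pcs resource clear nsdrbd
# check where everything ended up
pcs status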
Two-node clusters are genuinely difficult: when they partition, there is no good way to select a primary. Consider adding another node for quorum purposes, even if it has no disks and cannot host this resource group.
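One lighter-weight option (a sketch, assuming a separate third host, here called qnetd-host, reachable from both nodes) is a corosync quorum device rather than a full Pacemaker node:

# on the third host: install and start the qnetd daemon
dnf install corosync-qnetd pcs
pcs qdevice setup model net --enable --start

# on both cluster nodes: install the qdevice client, then register it from one node
dnf install corosync-qdevice
pcs quorum device add model net host=qnetd-host algorithm=ffsplit

With a third vote available, the current no-quorum-policy: ignore setting could also be revisited.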