Questions tagged [pacemaker] - Page 1

Erikas

Asked: 2022-12-15 02:50:53 +0800 CST

How can I force Pacemaker to keep restarting a systemd resource when it fails, instead of putting it in the "stopped" state?

5

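My goal is to build a 2-node HTTP load balancer using a virtual IP (VIP). For this task I chose pacemaker (for virtual IP switching) and Caddy as the HTTP load balancer. The choice of load balancer is not the point of this question. :)

My requirement is simple: I want the virtual IP to be assigned to a host that runs a healthy, working Caddy instance.

Here is how I implemented it with Pacemaker: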

# Disable stonith feature
pcs property set stonith-enabled=false

# Ignore quorum policy
pcs property set no-quorum-policy=ignore

# Setup virtual IP
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=123.123.123.123

# Set up the caddy resource using the systemd resource class. By default a resource runs on one node at a time, so clone it; a clone runs on all nodes at the same time by default.
# https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/configuring_the_red_hat_high_availability_add-on_with_pacemaker/ch-advancedresource-haar
pcs resource create caddy systemd:caddy clone

# Enable constraint, so both VirtualIP assigned and application running _on the same_ node.
pcs constraint colocation add ClusterIP with caddy-clone INFINITY

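However, if I SSH into the node that holds the virtual IP, break the Caddy configuration file, and run systemctl restart caddy, then after a while pacemaker detects that caddy cannot start and simply puts it in the stopped state.

How can I force pacemaker to keep restarting my systemd resource instead of putting it in the stopped state?

On top of that, if I fix the configuration file and run systemctl restart caddy, Caddy starts, but pacemaker still keeps the resource in the stopped state.

And on top of that, if I stop the other node, the virtual IP is not assigned anywhere, because of this constraint: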

# Enable constraint, so both VirtualIP assigned and application running _on the same_ node.
pcs constraint colocation add ClusterIP with caddy-clone INFINITY

Can someone point me in the right direction about what I am doing wrong?
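One possible direction, offered as a sketch rather than a verified fix: by default Pacemaker treats a failed start as fatal for that node (cluster property start-failure-is-fatal=true) and remembers failures until they are cleaned up or expire. The function names below are hypothetical; the resource IDs caddy and caddy-clone come from the commands above, and the 60s expiry is an arbitrary assumption.

```shell
#!/bin/sh
# Sketch only: relax Pacemaker's handling of failed starts for the caddy clone.
# Resource IDs (caddy, caddy-clone) come from the question; the function names
# and the 60s expiry are assumptions chosen for illustration.
keep_retrying_caddy() {
    # Let a failed start count as an ordinary failure instead of banning
    # the resource from the node right away.
    pcs property set start-failure-is-fatal=false || return 1

    # Forget recorded failures after 60 seconds so the node becomes
    # eligible to run the resource again.
    pcs resource meta caddy-clone failure-timeout=60s || return 1
}

# After repairing the Caddy config by hand, clear the failure history so
# Pacemaker re-probes the resource instead of leaving it stopped.
clear_caddy_failures() {
    pcs resource cleanup caddy
}
```

keep_retrying_caddy would be called once on any cluster node, and clear_caddy_failures after the configuration file has been fixed.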

KdgDev

Asked: 2021-12-01 07:50:36 +0800 CST

Pacemaker error and warning logs show up in syslog instead of pacemaker.log

1

I created a ping check in pacemaker like this:

pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=127.0.0.1 clone

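In practice I use something other than 127.0.0.1, of course.

Here is the source code: https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/ping

The ping check expects rc code 0, 1, and any other code.

Because I want to see the warnings and errors, I enabled debug: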

pcs resource update ping debug=1

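However, these messages are not logged to /var/log/pacemaker.log; they are logged to /var/log/syslog.

Articles like this one only describe the behaviour: https://support.sciencelogic.com/s/article/3961

This one is outdated: http://www.beekhof.net/blog/2013/pacemaker-logging

There does not seem to be a way to configure this. What am I missing?

Edit: most guides I found on this assume CentOS.

On Ubuntu, the pacemaker sysconfig file appears to be located here: /etc/default/pacemaker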

KdgDev

Asked: 2021-11-18 03:31:13 +0800 CST

Pacemaker - logging the result of a ping check?

0

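I read this page and the next one: https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html

It explains how to set up a ping that can be tied to resource placement.

While this works, if I have more than 1 URL or more than 1 ping check, how do I know which one failed?

If that happens, it does not seem to be logged anywhere. It just happens, and pacemaker makes a decision...

Reading this source code: https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/ping

It seems the debug environment variable needs to be enabled. I would rather not do that: I assume I would have to restart pacemaker for it and thereby disturb resource placement, and any amount of extra logging would then take up disk space.

If a ping fails, is there a way to log just one line saying so, without affecting anything else?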
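A minimal sketch of one way to get a single log line per failed target without touching the agent's debug setting: a separate script, run from cron for example, that pings each host and calls logger only on failure. The function name, the ping-check tag, and the example addresses are hypothetical, and the ping options assume the Linux iputils version. It runs alongside the ocf:pacemaker:ping resource and does not change how Pacemaker scores nodes.

```shell
#!/bin/sh
# Sketch only: log one syslog line per unreachable host.
# log_failed_pings, the "ping-check" tag and the example hosts are
# assumptions for illustration, not part of the ping resource agent.
log_failed_pings() {
    failed=0
    for host in "$@"; do
        # One probe, 2 second wait; discard ping's own output.
        if ! ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            logger -t ping-check "ping to $host failed"
            failed=$((failed + 1))
        fi
    done
    # Return non-zero when at least one host was unreachable.
    [ "$failed" -eq 0 ]
}

# Example call (hypothetical addresses):
# log_failed_pings 192.0.2.1 192.0.2.2
```

Because the function only writes when a probe fails, a healthy network produces no log output at all.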

Chris

Asked: 2021-06-03 10:55:32 +0800 CST

How do timeout and interval interact in a pacemaker resource monitor operation?

1

If I have a pacemaker resource like this:

 Resource: FoobarServer (class=ocf provider=foo type=bar)
  Operations: monitor interval=5m timeout=8m (FoobarServer-monitor-interval-5m)
              start interval=0 timeout=360s (FoobarServer-start-0)
              stop interval=0 timeout=360s (FoobarServer-stop-0)

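Since the timeout is longer than the interval, it seems there would be a conflict. However, I cannot find any documentation that specifically warns about this situation.

Is a separate monitor process spawned every 5m, each one dying after 8m? Or is a single process restarted every 5m, in which case it would miss events that occur in the 3m gap between the interval and the timeout?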
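As a sketch of one possible reading, under the assumption (not confirmed by any documentation cited here) that Pacemaker runs a recurring monitor serially, schedules the next run one interval after the previous run finishes, and uses the timeout only as a cap on a single run. The function and its arguments are hypothetical; it only prints when each run would start if that assumption holds.

```shell
#!/bin/sh
# Sketch only: start times of recurring monitor runs, ASSUMING the next run
# is scheduled one interval after the previous run completes. A run that
# reaches the timeout would be a failure, which this sketch does not model.
# All values are in minutes.
monitor_start_times() {
    interval=$1   # e.g. 5 (interval=5m)
    timeout=$2    # e.g. 8 (timeout=8m)
    duration=$3   # how long each monitor run actually takes
    runs=$4       # how many runs to list

    if [ "$duration" -ge "$timeout" ]; then
        echo "a run of ${duration}m would hit the ${timeout}m timeout" >&2
        return 1
    fi

    start=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        echo "run $((i + 1)) starts at minute $start"
        start=$((start + duration + interval))
        i=$((i + 1))
    done
}

# A monitor that takes 1 minute, with interval=5m and timeout=8m:
# monitor_start_times 5 8 1 3
```

Under this assumption there is never more than one monitor process at a time, so a timeout longer than the interval would not cause overlapping runs.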

astrowalker

Asked: 2021-04-17 03:30:54 +0800 CST

pcs cluster setup fails because the hosts are unknown

0

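I have followed several tutorials now, and they all have the same steps: authenticate the given hosts in pcs, then create the cluster. In my case, however, the first step (authentication) works, while the second fails with an error saying the hosts are unknown/not authenticated.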

sudo pcs host auth 192.168.4.201 192.168.4.202

followed by

sudo pcs cluster setup my_cluster 192.168.4.201 192.168.4.202

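For this I get the following error:

Warning: Unable to read the known-hosts file: No such file or directory: '/var/lib/pcsd/known-hosts'
Error: Hosts '192.168.4.201', '192.168.4.202' are not known to pcs, try to authenticate the hosts using 'pcs host auth 192.168.4.201 192.168.4.202' command
Error: Hosts are not known to pcs.

The step recommended in the error message is exactly what I did. What is really missing here?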

Ubuntu 20.04.2

Steve

Asked: 2020-06-16 11:48:55 +0800 CST

PCSD simple master/slave setup will not fail over the master

2

I am trying to write a simple pacemaker master/slave system. I created an agent whose metadata looks like this:

elm_meta_data() {
  cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="elm-agent">
  <version>0.1</version>
  <longdesc lang="en">
    Resource agent for ELM high availability clusters.
  </longdesc>
  <shortdesc lang="en">
    Resource agent for ELM
  </shortdesc>
  <parameters>
    <parameter name="datadir" unique="0" required="1">
      <longdesc lang="en">
      Data directory
      </longdesc>
      <shortdesc lang="en">Data directory</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start"        timeout="35" />
    <action name="stop"         timeout="35" />
    <action name="monitor"      timeout="35"
                                interval="10" depth="0" />
    <action name="monitor"      timeout="35"
                                interval="10" depth="0" role="Master" />
    <action name="monitor"      timeout="35"
                                interval="11" depth="0" role="Slave" />
    <action name="reload"       timeout="70" />
    <action name="meta-data"    timeout="5" />
    <action name="promote"      timeout="20" />
    <action name="demote"       timeout="20" />
    <action name="validate-all" timeout="20" />
    <action name="notify"       timeout="20" />
  </actions>
</resource-agent>
EOF
}

My monitor, promote, and demote functions are:

elm_monitor() {
  local elm_running
  local worker_running
  local is_master

  elm_running=0
  worker_running=0
  is_master=0

  if [ -e "${OCF_RESKEY_datadir}/master.conf" ]; then
    is_master=1
  fi

  if [ "$(docker ps -q -f name=elm_web)" ]; then
    elm_running=1
  fi
  if [ "$(docker ps -q -f name=elm_worker)" ]; then
    worker_running=1
  fi
  if [ $elm_running -ne $worker_running ]; then
    if [ $is_master -eq 1 ]; then
      exit $OCF_FAILED_MASTER
    fi
    exit $OCF_ERR_GENERIC
  fi
  if [ $elm_running -eq 0 ]; then
    return $OCF_NOT_RUNNING
  fi
  ...
  if [ $is_master -eq 1 ]; then
    exit $OCF_FAILED_MASTER
  fi
  exit $OCF_ERR_GENERIC
}
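elm_promote() {
  # Mark this node as master by creating the flag file.
  touch "${OCF_RESKEY_datadir}/master.conf"
  return $OCF_SUCCESS
}

elm_demote() {
  # Remove the flag file; -f keeps demote idempotent if it is already gone.
  rm -f "${OCF_RESKEY_datadir}/master.conf"
  return $OCF_SUCCESS
}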

If I configure the cluster with the following cib commands, I end up with three slaves and no master:

sudo pcs cluster cib cluster1.xml

sudo pcs -f cluster1.xml resource create elmd ocf:a10:elm    \
    datadir="/etc/a10/elm"                                  \
    op start timeout=90s                                     \
    op stop timeout=90s                                      \
    op promote timeout=60s                                   \
    op demote timeout=60s                                    \
    op monitor interval=15s timeout=35s role="Master"        \
    op monitor interval=16s timeout=35s role="Slave"         \
    op notify timeout=60s

sudo pcs -f cluster1.xml resource master elm-ha elmd notify=true
sudo pcs -f cluster1.xml resource create ClusterIP ocf:heartbeat:IPaddr2 ip=$vip cidr_netmask=$net_mask op monitor interval=10s

sudo pcs -f cluster1.xml constraint colocation add ClusterIP with master elm-ha INFINITY
sudo pcs -f cluster1.xml constraint order promote elm-ha then start ClusterIP symmetrical=false kind=Mandatory
sudo pcs -f cluster1.xml constraint order demote elm-ha then stop ClusterIP symmetrical=false kind=Mandatory
sudo pcs cluster cib-push cluster1.xml

ubuntu@elm1:~$ sudo pcs status
...
 elm_proxmox_fence100   (stonith:fence_pve):    Started elm1
 elm_proxmox_fence101   (stonith:fence_pve):    Started elm2
 elm_proxmox_fence103   (stonith:fence_pve):    Started elm3
 Master/Slave Set: elm-ha [elmd]
     Slaves: [ elm1 elm2 elm3 ]
 ClusterIP  (ocf::heartbeat:IPaddr2):   Stopped

Whereas if I add the following command to the cib, I get a master/slave setup:

sudo pcs -f cluster1.xml constraint location elm-ha rule role=master \#uname eq $(hostname)

   Master/Slave Set: elm-ha [elmd]
       Masters: [ elm1 ]
       Slaves: [ elm2 elm3 ]
   ClusterIP    (ocf::heartbeat:IPaddr2):   Started elm1

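But in this last version, the master seems to be pinned to elm1. When I test a failure by stopping the corosync service on the master, I end up with 2 slaves and the master in the stopped state. My guess is that the rule forces pacemaker to keep the master on elm1.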

     Master/Slave Set: elm-ha [elmd]
         Slaves: [ elm2 elm3 ]
         Stopped: [ elm1 ]
     ClusterIP  (ocf::heartbeat:IPaddr2):   Stopped

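How do I configure this so that, when I push my cib commands, the cluster picks a master and fails over when the master goes down? Does my agent need to do something differently?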
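One thing the agent code shown does not do is tell Pacemaker which instance it would prefer to promote. A hedged sketch, assuming the missing piece is a promotion score: promotable agents commonly call crm_master from their start or monitor action so that the cluster has a candidate to promote. The helper names and the score of 100 are hypothetical; crm_master is expected to pick up the resource and node from the environment Pacemaker sets when it invokes the agent.

```shell
#!/bin/sh
# Sketch only: advertise a promotion score from inside the agent.
# elm_set_promotion_score / elm_clear_promotion_score and the default score
# of 100 are hypothetical names and values chosen for illustration.
elm_set_promotion_score() {
    # -l reboot: the score is transient and kept only until the cluster
    # is restarted on this node.
    crm_master -l reboot -v "${1:-100}"
}

elm_clear_promotion_score() {
    # Remove the score so this node is no longer a promotion candidate.
    crm_master -l reboot -D
}

# Typical placement (assumption): call elm_set_promotion_score at the end of
# a successful start and monitor, and elm_clear_promotion_score in stop.
```

With a score set on every healthy node, the location rule on #uname would no longer be needed to get a master elected.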

Karias Bolster

Asked: 2020-05-06 09:18:37 +0800 CST

Pacemaker Corosync OCF resource lifecycle

0

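I have been asked to set up Pacemaker and Corosync, which are completely new to me. I currently have a 2-node cluster. What I want to do is reassign an IP to the other node if the active node fails.

It looks like the way to do this is to create a resource agent. I have read a few tutorials on creating OCF resources, and they involve things called actions. What I do not understand about actions is when they are called, and by whom.

If a resource is running on the primary node and the primary node goes down, what happens to the resource? Will it automatically run on the other node?

Also, since I need to perform some steps when a particular action is called, how can I check in my script which action was called? Is there a variable for it?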
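As a sketch of how an OCF agent usually learns which action was requested: the cluster runs the agent script with the action name as its first argument, and resource parameters arrive as OCF_RESKEY_* environment variables. The function names and echo bodies below are hypothetical placeholders; a real agent would normally source ocf-shellfuncs for the return-code constants instead of defining them by hand.

```shell
#!/bin/sh
# Sketch only: dispatch on the action name passed as the first argument.
# Return codes follow the OCF convention; a real agent would get them from
# ocf-shellfuncs. The my_* functions are hypothetical placeholders.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

my_start()   { echo "starting";    return $OCF_SUCCESS; }
my_stop()    { echo "stopping";    return $OCF_SUCCESS; }
my_monitor() { echo "not running"; return $OCF_NOT_RUNNING; }

agent_main() {
    case "$1" in
        start)   my_start ;;
        stop)    my_stop ;;
        monitor) my_monitor ;;
        *)       echo "unsupported action: $1" >&2
                 return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# In a real agent the last lines would be:
# agent_main "$@"
# exit $?
```

The cluster decides what to do next from the exit code, so each action has to return the code that matches the state it found.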

Mezja

Asked: 2020-05-01 02:21:11 +0800 CST

Pacemaker does not start the service after a node switch

1

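I am using two virtual Ubuntu servers (latest LTS) with identical configuration, both running the Squid service together with pcs, pacemaker, and corosync. I have two nodes, Squid01 and Squid02, and one virtual IP.

Problem: when I boot both servers, Squid02 usually does all the work by starting the Squid proxy service, while the Squid01 proxy service disables itself automatically and becomes inactive. So I shut down the Squid02 server to switch between the nodes, but the Squid02 proxy service stays inactive and has to be started manually.

The cluster does not see the stopped service.

What I need is for the inactive service to become active when the switch happens, or for both services to run all the time.

I created the cluster using this example.

The only difference is that I did not install Squid from apt-get; I set up the service manually with this config: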

[Unit]
Description=Squid Web Proxy.
[Service]
Type=simple
PIDFile=/usr/local/squid/var/run/squid.pid
ExecStart=/usr/local/squid/sbin/squid
[Install]
WantedBy=multi-user.target

I skipped the firewall part.
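A hedged sketch of one thing to check: Squid normally forks into the background when started without extra options, while Type=simple tells systemd that the ExecStart process itself is the service. Under that assumption, a unit declared as Type=forking may track the daemon more reliably. The function name and its target-path argument are hypothetical; the paths inside the unit are the ones from the config above.

```shell
#!/bin/sh
# Sketch only: write a unit that ASSUMES squid daemonizes itself on start.
# write_squid_unit and its target-path argument are hypothetical.
write_squid_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=Squid Web Proxy
After=network.target

[Service]
# forking: systemd waits for the launcher to exit, then follows the PID file.
Type=forking
PIDFile=/usr/local/squid/var/run/squid.pid
ExecStart=/usr/local/squid/sbin/squid

[Install]
WantedBy=multi-user.target
EOF
}

# Example (as root), followed by a daemon reload:
# write_squid_unit /etc/systemd/system/squid.service
# systemctl daemon-reload
```

If systemd can follow the real daemon process, a systemd-class resource in Pacemaker has a better chance of seeing the service as stopped when it actually is.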

Mezja

Asked: 2020-04-29 05:07:38 +0800 CST

Squid Proxy .pid file disappears a few seconds after startup

0

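I have two identical virtual Ubuntu servers, both running Squid Proxy. The problem is that Squid creates the .pid file when the server boots, but after 10-20 seconds the file disappears and I have to run this manually: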

/usr/local/squid/sbin/squid

After that it works normally and the .pid file does not disappear. I want Squid to start at boot without the .pid file disappearing. I tried creating an init.d file, /etc/init.d/run_squid:

#!/bin/sh
/usr/local/squid/sbin/squid
exit 0

Then

update-rc.d run_squid defaults
update-rc.d run_squid enable

I get:

error: run_squid Default-Start contains no runlevels, aborting

I also tried

crontab -e
@reboot /scripts/squid.sh

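Even though I granted the permissions, nothing happens when I run it.

I need the .pid file to behave properly so that my cluster works (the cluster consists of corosync and pacemaker). The current problem is that when the .pid file disappears on a node, that node just carries on (it does not proxy at all, but it thinks it does) and the cluster does not switch to the healthy node.

Summary: I want squid to start normally without losing its .pid file, and I want the cluster to switch nodes if one of them loses its .pid file.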

drkmkzs

Asked: 2020-02-05 02:50:34 +0800 CST

corosync/pacemaker/fencing - passive/active cluster with 2 nodes

4

I am configuring a 2-node cluster with pacemaker/corosync and have some questions about it (possibly about best practice: I am far from an expert).

**OS:** redhat 7.6

I configured the cluster with these properties:

 - **stonith-enabled:** true

 - **symmetric-cluster:** true (even though I think it is the default value)


and added the following to corosync.conf:

 - **wait_for_all:** 0 (I want a node to be able to start and work even if its twin is KO)

 - **two_node:** 1


Regarding fencing:

- Using the ILO of the HP blades (ILO1 for Node1, ILO2 for Node2)

I read that it is sometimes good practice to prevent a node from fencing itself (suicide), so I added constraints:

- ILO1-fence cannot be located on node1

- ILO2-fence cannot be located on node2

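The problem I have is the following; it happens when Node2 starts while Node1 is shut down:

pacemaker/corosync cannot start ILO2-fence on Node1 (of course, because Node1 is down), so it does not start the other resources either, and my cluster does not work at all >:[

I wonder whether I missed something in the configuration, or whether I have misunderstood how such a cluster is supposed to work.

I would expect that when Node2 starts, the cluster sees that Node1 is KO and simply starts the resources so that Node2 works on its own.

But it is true that since ILO2-fence can only be located on Node1 (because of the anti-suicide constraint), this resource will always fail... (When I tried without the "anti-suicide" constraints, Node2 shut itself down right after starting whenever one of its services failed, which I do not want.)

I would appreciate any feedback and enlightenment :)

Thanks :)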
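For reference, the anti-suicide constraints described above can be sketched with pcs as follows. The device IDs ILO1-fence and ILO2-fence and the node names node1 and node2 are taken from the description; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: keep each fence device off the node it is meant to fence.
# IDs and node names follow the description above; the function name is
# an assumption for illustration.
add_anti_suicide_constraints() {
    # ILO1 fences node1, so it should not run on node1.
    pcs constraint location ILO1-fence avoids node1 || return 1
    # ILO2 fences node2, so it should not run on node2.
    pcs constraint location ILO2-fence avoids node2 || return 1
}
```

Without an explicit score, avoids is as strict as it can be, so each device has exactly one node it may run on, which matches the behaviour seen when that node is down.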
