我正在尝试编写一个简单的起搏器主/从系统。我创建了一个代理,它的元数据如下:
elm_meta_data() {
cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="elm-agent">
<version>0.1</version>
<longdesc lang="en">
Resource agent for ELM high availability clusters.
</longdesc>
<shortdesc>
Resource agent for ELM
</shortdesc>
<parameters>
<parameter name="datadir" unique="0" required="1">
<longdesc lang="en">
Data directory
</longdesc>
<shortdesc lang="en">Data directory</shortdesc>
<content type="string"/>
</parameter>
</parameters>
<actions>
<action name="start" timeout="35" />
<action name="stop" timeout="35" />
<action name="monitor" timeout="35"
interval="10" depth="0" />
<action name="monitor" timeout="35"
interval="10" depth="0" role="Master" />
<action name="monitor" timeout="35"
interval="11" depth="0" role="Slave" />
<action name="reload" timeout="70" />
<action name="meta-data" timeout="5" />
<action name="promote" timeout="20" />
<action name="demote" timeout="20" />
<action name="validate-all" timeout="20" />
<action name="notify" timeout="20" />
</actions>
</resource-agent>
EOF
}
我的监控,提升,降级是:
elm_monitor() {
local elm_running
local worker_running
local is_master
elm_running=0
worker_running=0
is_master=0
if [ -e "${OCF_RESKEY_datadir}/master.conf" ]; then
is_master=1
fi
if [ "$(docker ps -q -f name=elm_web)" ]; then
elm_running=1
fi
if [ "$(docker ps -q -f name=elm_worker)" ]; then
worker_running=1
fi
if [ $elm_running -ne $worker_running ]; then
if [ $is_master -eq 1 ]; then
exit $OCF_FAILED_MASTER
fi
exit $OCF_ERR_GENERIC
fi
if [ $elm_running -eq 0 ]; then
return $OCF_NOT_RUNNING
fi
...
if [ $is_master -eq 1 ]; then
exit $OCF_FAILED_MASTER
fi
exit $OCF_ERR_GENERIC
}
elm_promote() {
touch ${OCF_RESKEY_datadir}/master.conf
return $OCF_SUCCESS
}
elm_demote() {
rm ${OCF_RESKEY_datadir}/master.conf
return $OCF_SUCCESS
}
如果我使用以下 cib 命令配置集群,它会得到三个从属服务器并且没有主服务器:
sudo pcs cluster cib cluster1.xml
sudo pcs -f cluster1.xml resource create elmd ocf:a10:elm \
datadir="/etc/a10/elm" \
op start timeout=90s \
op stop timeout=90s \
op promote timeout=60s \
op demote timeout=60s \
op monitor interval=15s timeout=35s role="Master" \
op monitor interval=16s timeout=35s role="Slave" \
op notify timeout=60s
sudo pcs -f cluster1.xml resource master elm-ha elmd notify=true
sudo pcs -f cluster1.xml resource create ClusterIP ocf:heartbeat:IPaddr2 ip=$vip cidr_netmask=$net_mask op monitor interval=10s
sudo pcs -f cluster1.xml constraint colocation add ClusterIP with master elm-ha INFINITY
sudo pcs -f cluster1.xml constraint order promote elm-ha then start ClusterIP symmetrical=false kind=Mandatory
sudo pcs -f cluster1.xml constraint order demote elm-ha then stop ClusterIP symmetrical=false kind=Mandatory
sudo pcs cluster cib-push cluster1.xml
ubuntu@elm1:~$ sudo pcs status
...
elm_proxmox_fence100 (stonith:fence_pve): Started elm1
elm_proxmox_fence101 (stonith:fence_pve): Started elm2
elm_proxmox_fence103 (stonith:fence_pve): Started elm3
Master/Slave Set: elm-ha [elmd]
Slaves: [ elm1 elm2 elm3 ]
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
而如果我将以下命令添加到 cib,我会得到一个主/从设置:
sudo pcs -f cluster1.xml constraint location elm-ha rule role=master \#uname eq $(hostname)
Master/Slave Set: elm-ha [elmd]
Masters: [ elm1 ]
Slaves: [ elm2 elm3 ]
ClusterIP (ocf::heartbeat:IPaddr2): Started elm1
但是在最后一个版本上,大师似乎坚持使用 elm1。当我测试失败时,通过停止主服务器上的 corosync 服务,我最终得到了 2 个从服务器,主服务器处于停止状态。我猜测设置规则是强制起搏器将主控保持在 elm1 上。
Master/Slave Set: elm-ha [elmd]
Slaves: [ elm2 elm3 ]
Stopped: [ elm1 ]
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
如何配置它,以便当我发送我的 cib 命令时,它会选择一个主服务器并在主服务器出现故障时进行故障转移?我的代理需要一些不同的东西吗?