I tried to upgrade Ceph from 17 to 18.2.4 as follows:
ceph orch upgrade start --ceph-version 18.2.4
Initiating upgrade to quay.io/ceph/ceph:v18.2.4
But since then, the orchestrator no longer responds:
ceph orch upgrade status
Error ENOENT: Module not found
Setting the backend back to orchestrator or cephadm fails because the module shows up as "disabled", even though ceph mgr reports the module as enabled.
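For reference, this is roughly what I ran (the standard mgr-module and orch-backend commands, paraphrased from memory):

ceph mgr module ls                # cephadm shows up as disabled here
ceph mgr module enable cephadm    # attempt to re-enable the module
ceph orch set backend cephadm     # attempt to point the orchestrator at cephadm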
As far as I can tell, I am stuck with a mgr daemon running reef while the rest of the cluster is still on quincy.
ceph versions
{
"mon": {
"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 5
},
"mgr": {
"ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 1
},
"osd": {
"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 31
},
"mds": {
"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 4
},
"overall": {
"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 40,
"ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 1
}
}
How can I get the cluster back into a healthy state?
Edit 1: Ceph health:
  cluster:
    id:     16249ca6-4060-11ef-a8a1-7509512e051b
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            mon gpu001 is low on available space
            1/5 mons down, quorum ***
            Degraded data redundancy: 92072087/608856489 objects degraded (15.122%), 97 pgs degraded, 97 pgs undersized
            7 pgs not deep-scrubbed in time

  services:
    mon: 5 daemons, quorum ***
    mgr: cpu01.fcxjpi(active, since 5m)
    mds: 4/4 daemons up
    osd: 34 osds: 31 up (since 45h), 31 in (since 46h); 31 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 121.96M objects, 32 TiB
    usage:   36 TiB used, 73 TiB / 108 TiB avail
    pgs:     92072087/608856489 objects degraded (15.122%)
             29422795/608856489 objects misplaced (4.832%)
             97 active+undersized+degraded
             65 active+clean
             31 active+clean+remapped

  io:
    client: 253 KiB/s rd, 51 KiB/s wr, 3 op/s rd, 2 op/s wr
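As a side note, the degraded PGs and the three down OSDs shown above can be inspected with the usual commands (unrelated to the orchestrator problem itself):

ceph health detail        # expands each HEALTH_WARN entry
ceph osd tree down        # show only the down OSDs
ceph pg ls degraded       # list the degraded PGs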
Note: this question was originally asked on SO [https://stackoverflow.com/posts/78949269], where I was advised to move it here. I am currently going through the MGR logs to investigate the state and, if need be, to force a downgrade.
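Pulling the mgr logs on a cephadm-managed cluster goes roughly like this (the daemon name and fsid are taken from the status output above):

cephadm logs --name mgr.cpu01.fcxjpi
# equivalently, straight from journald:
journalctl -u ceph-16249ca6-4060-11ef-a8a1-7509512e051b@mgr.cpu01.fcxjpi.service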
Thanks to @eblock for pointing me in the right direction. It was indeed related to a bug: https://tracker.ceph.com/issues/67329
I confirmed this by looking at the mgr logs, which showed the error described in that tracker issue.
I am not sure whether this was related to anything I did, but looking at the config values (ceph config), I did find the offending entry:
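The cephadm module keeps this state in the config-key store, so it can be inspected with e.g.:

ceph config-key dump | grep osd_remove_queue
ceph config-key get mgr/cephadm/osd_remove_queue   # print the stored value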
It came from an OSD on the failed machine that had never been fully removed. Running ceph config-key rm mgr/cephadm/osd_remove_queue and restarting the mgr got cephadm responding again.
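In summary, the recovery boiled down to the following (ceph mgr fail restarts the active mgr; with only one mgr deployed it simply comes back up):

ceph config-key rm mgr/cephadm/osd_remove_queue
ceph mgr fail               # bounce the active mgr so cephadm reloads cleanly
ceph orch upgrade status    # the orchestrator answers again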