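I have an instance (with 2 disks) on Ganeti, and both disks are degraded (maybe due to a connection problem?). The instance had been working fine for years until this morning.
On my master node: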
$ gnt-instance info myinstance
...
-disk/0
on primary: /dev/drbd4 (147:4) in sync, status *DEGRADED*
on secondary: /dev/drbd4 (147:4) in sync, status *DEGRADED*
child devices:
- child 0: lvm, size 20.0G
logical_id: kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
on primary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
- child 1: lvm, size 128M
logical_id: kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
on primary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)
...
On the primary node:
$ cat /proc/drbd
4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0
On the secondary node:
$ cat /proc/drbd
4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
ns:0 nr:678340009 dw:678340009 dr:0 al:0 bm:14884 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
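I cannot reboot or shut down the instance (the operation times out).
I don't think this is a split-brain problem, because neither node reports "StandAlone", and the roles are "Primary/Unknown" on the primary node and "Secondary/Unknown" on the secondary node.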
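For anyone who wants to run the same check across several devices, here is a minimal sketch in Python. It assumes the DRBD 8.3-style /proc/drbd layout shown above; the function names are made up for illustration. Treat it as a rough heuristic only: a device in StandAlone is a common sign of split brain, but its absence does not prove there is none.

```python
import re

# Matches a device status line such as:
#   4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
# The ro: and ds: fields are optional because an unconfigured device
# only prints "4: cs:Unconfigured".
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)(?:\s+ro:(\S+))?(?:\s+ds:(\S+))?")


def parse_drbd_status(text):
    """Return {minor: {"cs": ..., "ro": ..., "ds": ...}} for each device line."""
    devices = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            minor, cs, ro, ds = match.groups()
            devices[int(minor)] = {"cs": cs, "ro": ro, "ds": ds}
    return devices


def standalone_minors(text):
    """Return the sorted minor numbers whose connection state is StandAlone."""
    status = parse_drbd_status(text)
    return sorted(minor for minor, st in status.items() if st["cs"] == "StandAlone")


if __name__ == "__main__":
    # Read the live status on a node that has DRBD loaded.
    with open("/proc/drbd") as handle:
        print(standalone_minors(handle.read()))
```

Run on each node, an empty list means no device is in StandAlone, which matches what I see in the output above.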
I tried running "drbdadm connect all" on the secondary node, but it did nothing.
I tried to replace the disks, but it failed:
gnt-instance replace-disks -s myinstance
Thu Jun 2 11:32:00 2016 Replacing disk(s) 0, 1 for myinstance
Thu Jun 2 11:36:00 2016 - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=False, pass=1): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Thu Jun 2 11:38:01 2016 - WARNING: Could not prepare block device disk/0 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd4: cannot activate, unknown or unhandled reason
Thu Jun 2 11:40:02 2016 - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Failure: command execution error:
Disk consistency error
Now it looks like this:
$ gnt-instance info myinstance
...
-disk/0
on primary: /dev/drbd4 (147:4) in sync, status *DEGRADED*
(no more secondary)
child devices:
- child 0: lvm, size 20.0G
logical_id: kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
on primary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
- child 1: lvm, size 128M
logical_id: kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
on primary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)
On the primary node:
$ cat /proc/drbd
4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0
On the secondary node:
$ cat /proc/drbd
...
4: cs:Unconfigured
5: cs:Unconfigured
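Any idea how to fix this?
DRBD version: 8.3.7
Ganeti version: 2.4.5
OS: Debian 6.0
After investigating a bit, I found a kvm zombie process on the primary node:
I don't know how to get rid of it properly.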
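I tried to migrate all primary instances off that node (I only have 2), but that failed with a DRBD-related error. I then rebooted the node. The shutdown hung because DRBD was stuck. The message was something like this: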
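So I pushed the button to power off the machine. The machine rebooted (without any errors) and, a few minutes later, the Ganeti instances started automatically.
On the primary node I ran:
After waiting a few minutes, the recovery completed, and the disks are now in sync.
Conclusion: everything works now, but I would have preferred not to have to reboot the node.
Thanks to gf_ for the help.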