I recently replaced one of the HDDs that backs a brick in a GlusterFS cluster. I was able to map that HDD back into a brick and then get GlusterFS to replicate to it successfully.
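I won't walk through the exact commands I used for the swap itself, since it's the standard replace-brick flow; a rough sketch, assuming a replicated volume named nova and made-up host and brick paths:

$ # Point the volume at the newly mounted brick; the self-heal daemon
$ # then starts repopulating it from the surviving replica
$ gluster volume replace-brick nova gluster01:/bricks/old/nova gluster01:/bricks/new/nova commit force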
However, one part of the process refused to cooperate. I tried running the "heal" command on the volume with the replaced brick, but kept hitting this problem:
$ gluster volume heal nova
Locking failed on c551316f-7218-44cf-bb36-befe3d3df34b. Please check log file for details.
Locking failed on ae62c691-ae55-4c99-8364-697cb3562668. Please check log file for details.
Locking failed on cb78ba3c-256f-4413-ae7e-aa5c0e9872b5. Please check log file for details.
Locking failed on 79a6a414-3569-482c-929f-b7c5da16d05e. Please check log file for details.
Locking failed on 5f43c6a4-0ccd-424a-ae56-0492ec64feeb. Please check log file for details.
Locking failed on c7416c1f-494b-4a95-b48d-6c766c7bce14. Please check log file for details.
Locking failed on 6c0111fc-b5e7-4350-8be5-3179a1a5187e. Please check log file for details.
Locking failed on 88fcb687-47aa-4921-b3ab-d6c3b330b32a. Please check log file for details.
Locking failed on d73de03a-0f66-4619-89ef-b73c9bbd800e. Please check log file for details.
Locking failed on 4a780f57-37e4-4f1b-9c34-187a0c7e44bf. Please check log file for details.
The logs mostly just echoed the above, specifically:
$ tail etc-glusterfs-glusterd.vol.log
[2015-08-03 23:08:03.289249] E [glusterd-syncop.c:562:_gd_syncop_mgmt_lock_cbk] 0-management: Could not find peer with ID d827a48e-627f-0000-0a00-000000000000
[2015-08-03 23:08:03.289258] E [glusterd-syncop.c:111:gd_collate_errors] 0-: Locking failed on c7416c1f-494b-4a95-b48d-6c766c7bce14. Please check log file for details.
[2015-08-03 23:08:03.289279] W [rpc-clnt-ping.c:199:rpc_clnt_ping_cbk] 0-management: socket or ib related error
[2015-08-03 23:08:03.289827] E [glusterd-syncop.c:562:_gd_syncop_mgmt_lock_cbk] 0-management: Could not find peer with ID d827a48e-627f-0000-0a00-000000000000
[2015-08-03 23:08:03.289858] E [glusterd-syncop.c:111:gd_collate_errors] 0-: Locking failed on d73de03a-0f66-4619-89ef-b73c9bbd800e. Please check log file for details.
[2015-08-03 23:08:03.290509] E [glusterd-syncop.c:562:_gd_syncop_mgmt_lock_cbk] 0-management: Could not find peer with ID d827a48e-627f-0000-0a00-000000000000
[2015-08-03 23:08:03.290529] E [glusterd-syncop.c:111:gd_collate_errors] 0-: Locking failed on 4a780f57-37e4-4f1b-9c34-187a0c7e44bf. Please check log file for details.
[2015-08-03 23:08:03.290597] E [glusterd-syncop.c:1804:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-08-03 23:07:03.351603] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2015-08-03 23:07:03.351644] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
These other logs also had messages from around the time I attempted the procedure above:
$ ls -ltr
-rw------- 1 root root 41704 Aug 2 12:07 glfsheal-nova.log
-rw------- 1 root root 15986 Aug 2 12:07 cmd_history.log-20150802
-rw------- 1 root root 290359 Aug 3 19:07 var-lib-nova-instances.log
-rw------- 1 root root 221829 Aug 3 19:07 glustershd.log
-rw------- 1 root root 195472 Aug 3 19:07 nfs.log
-rw------- 1 root root 61831116 Aug 3 19:07 var-lib-nova-mnt-92ef2ec54fd18595ed18d8e6027a1b3d.log
-rw------- 1 root root 3504 Aug 3 19:08 cmd_history.log
-rw------- 1 root root 89294 Aug 3 19:08 cli.log
-rw------- 1 root root 136421 Aug 3 19:08 etc-glusterfs-glusterd.vol.log
Looking through them, it was not clear whether any of them were relevant to this particular problem.
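If you hit something similar, one quick way to narrow the search is to grep every log for the "Locking failed" message and for the peer UUID that glusterd complains about; a minimal sketch, assuming the default /var/log/glusterfs log directory:

$ cd /var/log/glusterfs
$ # Which logs mention the failed locking at all?
$ grep -l "Locking failed" *.log
$ # Which logs mention the peer ID glusterd could not find?
$ grep "d827a48e-627f-0000-0a00-000000000000" *.log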
Given the setup above, I initially assumed I could only run the heal command from the primary node of the GlusterFS cluster, but, as it turned out, my real problem was that the 11 nodes in the GlusterFS cluster were running two different versions of GlusterFS.
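A mismatch like that is easy to confirm by comparing the version each peer reports; something like the following, assuming passwordless SSH and made-up hostnames gluster01 through gluster11:

$ # Print the GlusterFS version reported by each node
$ for host in gluster{01..11}; do ssh "$host" 'echo -n "$(hostname): "; glusterfs --version | head -1'; done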
Once I realized that, I upgraded all of the nodes to the latest GlusterFS release (3.7.3), and I was able to run heals from any of the nodes, as you would expect.
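For completeness, "running heals from any node" just means the same commands now succeed regardless of which peer you issue them from; assuming the volume is still named nova:

$ # Index self-heal, now accepted from any peer
$ gluster volume heal nova
$ # List any entries that still need healing
$ gluster volume heal nova info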