Configurei um cluster MongoDB usando uma arquitetura de árbitro primário-secundário de 3 membros
Meio Ambiente:
- Contêineres LXC
- Linux Debian Stretch (9.8)
- Versão do servidor MongoDB: 4.0.6
Contêineres MongoDB:
- lxc-mongodb-01 (primário)
- lxc-mongodb-02 (secundário)
- lxc-mongodb-03 (árbitro)
Status de replicação
Tudo parece funcionar bem e a replicação está funcionando:
np:PRIMARY> rs.printSlaveReplicationInfo()
source: lxc-mongodb-02:27017
syncedTo: Wed Mar 06 2019 12:08:27 GMT+0100 (CET)
0 secs (0 hrs) behind the primary
Falha na alternância
Mas quando tento alternar primário/secundário usando rs.stepDown(), ele falha com uma mensagem de erro "Nenhum secundário elegível capturado":
np:PRIMARY> rs.stepDown(60, 30)
{
"operationTime" : Timestamp(1551870647, 1),
"ok" : 0,
"errmsg" : "No electable secondaries caught up as of 2019-03-06T12:11:19.140+0100Please use the replSetStepDown command with the argument {force: true} to force node to step down.",
"code" : 262,
"codeName" : "ExceededTimeLimit",
"$clusterTime" : {
"clusterTime" : Timestamp(1551870647, 1),
"signature" : {
"hash" : BinData(0,"+/jQR8cG+y/bPtoF7gnv2Pmn2BY="),
"keyId" : NumberLong("6653042051040411649")
}
}
}
Observe que este é um cluster de não produção, portanto, não há transação em andamento.
Logs do server01 (primário):
2019-03-06T12:08:07.709+0100 I ACCESS [conn17] Successfully authenticated as principal root on admin
2019-03-06T12:10:49.140+0100 I COMMAND [conn17] Attempting to step down in response to replSetStepDown command
2019-03-06T12:11:19.140+0100 I COMMAND [conn17] command admin.$cmd appName: "MongoDB Shell" command: replSetStepDown { replSetStepDown: 60.0, secondaryCatchUpPeriodSecs: 30.0, lsid: { id: UUID("8941645a-c582-4353-b216-6e5ee91c08b0") }, $clusterTime: { clusterTime: Timestamp(1551870507, 1), signature: { hash: BinData(0, 484DDC04A03F9CBEDA0E5FA5E4F438F414E43E8F), keyId: 6653042051040411649 } }, $db: "admin" } numYields:0 ok:0 errMsg:"No electable secondaries caught up as of 2019-03-06T12:11:19.140+0100Please use the replSetStepDown command with the argument {force: true} to force node to step down." errName:ExceededTimeLimit errCode:262 reslen:385 locks:{ Global: { acquireCount: { r: 2, W: 2 } } } protocol:op_msg 29999ms
Logs do server02 (secundário):
2019-03-06T12:10:52.278+0100 I REPL [replication-1] Restarting oplog query due to error: InterruptedDueToReplStateChange: error in fetcher batch callback :: caused by :: operation was interrupted. Last fetched optime (with hash): { ts: Timestamp(1551870647, 1), t: 8 }[-3124663669138993987]. Restarts remaining: 1
2019-03-06T12:10:52.278+0100 I REPL [replication-1] Scheduled new oplog query Fetcher source: lxc-mongodb-01:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1551870647, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 8, readConcern: { afterClusterTime: Timestamp(1551870647, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 6603 -- target:lxc-mongodb-01:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1551870647, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 8, readConcern: { afterClusterTime: Timestamp(1551870647, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2019-03-06T12:10:52.279+0100 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source cannot be behind me, and if I am up-to-date with the sync source, it must have a higher lastOpCommitted. My last fetched oplog optime: { ts: Timestamp(1551870647, 1), t: 8 }, latest oplog optime of sync source: { ts: Timestamp(1551870647, 1), t: 8 }, my lastOpCommitted: { ts: Timestamp(1551870647, 1), t: 8 }, lastOpCommitted of sync source: { ts: Timestamp(1551870647, 1), t: 8 }
2019-03-06T12:10:52.279+0100 I REPL [rsBackgroundSync] Clearing sync source lxc-mongodb-01:27017 to choose a new one.
2019-03-06T12:10:52.279+0100 I REPL [rsBackgroundSync] could not find member to sync from
2019-03-06T12:10:57.276+0100 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to lxc-mongodb-01:27017: InvalidSyncSource: Sync source was cleared. Was lxc-mongodb-01:27017
2019-03-06T12:11:27.284+0100 I REPL [rsBackgroundSync] sync source candidate: lxc-mongodb-01:27017
2019-03-06T12:11:27.286+0100 I REPL [rsBackgroundSync] Changed sync source from empty to lxc-mongodb-01:27017
2019-03-06T12:11:28.833+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
Logs do server03 (árbitro):
2019-03-06T12:11:29.428+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
2019-03-06T12:11:29.429+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
Olhando para a documentação e alguns tópicos, tentei ajustar algumas configurações, sem sucesso:
replication.enableMajorityReadConcern = false
writeConcernMajorityJournalDefault = false
Pergunta
Então, o que estou perdendo para fazer o stepDown funcionar como esperado?
EDITAR 07/03/2019
Aqui está rs.status()
a saída do primário:
np:PRIMARY> rs.status()
{
"set" : "np",
"date" : ISODate("2019-03-07T08:08:17.623Z"),
"myState" : 1,
"term" : NumberLong(8),
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"heartbeatIntervalMillis" : NumberLong(2000),
"optimes" : {
"lastCommittedOpTime" : {
"ts" : Timestamp(1551946089, 1),
"t" : NumberLong(8)
},
"readConcernMajorityOpTime" : {
"ts" : Timestamp(1551946089, 1),
"t" : NumberLong(8)
},
"appliedOpTime" : {
"ts" : Timestamp(1551946089, 1),
"t" : NumberLong(8)
},
"durableOpTime" : {
"ts" : Timestamp(1551946089, 1),
"t" : NumberLong(8)
}
},
"members" : [
{
"_id" : 0,
"name" : "lxc-mongodb-01:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 75954,
"optime" : {
"ts" : Timestamp(1551946089, 1),
"t" : NumberLong(8)
},
"optimeDate" : ISODate("2019-03-07T08:08:09Z"),
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"electionTime" : Timestamp(1551870155, 1),
"electionDate" : ISODate("2019-03-06T11:02:35Z"),
"configVersion" : 4,
"self" : true,
"lastHeartbeatMessage" : ""
},
{
"_id" : 1,
"name" : "lxc-mongodb-03:27017",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 75952,
"lastHeartbeat" : ISODate("2019-03-07T08:08:16.005Z"),
"lastHeartbeatRecv" : ISODate("2019-03-07T08:08:17.410Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "",
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"configVersion" : 4
},
{
"_id" : 2,
"name" : "lxc-mongodb-02:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 75952,
"optime" : {
"ts" : Timestamp(1551946089, 1),
"t" : NumberLong(8)
},
"optimeDurable" : {
"ts" : Timestamp(1551946089, 1),
"t" : NumberLong(8)
},
"optimeDate" : ISODate("2019-03-07T08:08:09Z"),
"optimeDurableDate" : ISODate("2019-03-07T08:08:09Z"),
"lastHeartbeat" : ISODate("2019-03-07T08:08:16.008Z"),
"lastHeartbeatRecv" : ISODate("2019-03-07T08:08:15.798Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "",
"syncingTo" : "lxc-mongodb-01:27017",
"syncSourceHost" : "lxc-mongodb-01:27017",
"syncSourceId" : 0,
"infoMessage" : "",
"configVersion" : 4
}
],
"ok" : 1,
"operationTime" : Timestamp(1551946089, 1),
"$clusterTime" : {
"clusterTime" : Timestamp(1551946089, 1),
"signature" : {
"hash" : BinData(0,"ZPnNWVwjB1K9jdaSHlnfnmRPqqM="),
"keyId" : NumberLong("6653042051040411649")
}
}
}
Aqui está rs.conf()
a saída do primário:
np:PRIMARY> rs.conf()
{
"_id" : "np",
"version" : 4,
"protocolVersion" : NumberLong(1),
"writeConcernMajorityJournalDefault" : false,
"members" : [
{
"_id" : 0,
"host" : "lxc-mongodb-01:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 1,
"host" : "lxc-mongodb-03:27017",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 0,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "lxc-mongodb-02:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 0,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 0
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"catchUpTimeoutMillis" : -1,
"catchUpTakeoverDelayMillis" : 30000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("5c545a7d4e358716c8129ac6")
}
}
Não há secundários elegíveis porque a prioridade do nó secundário está definida como 0 (veja
rs.conf()
no post original. Obrigado a Mani por sua sugestão!).Atualizei a prioridade (e votos ) para lxc-mongodb-02 (_id = 2):
lxc-mongodb-02 agora é elegível como PRIMARY.
Dito isto, acabei de perceber que uma alternância permanente será realizada alterando a prioridade em vez de usar
rs.stepDown()
o comando.Então, para promover o lxc-mongodb-02 como primário, eu corro: