I have AlwaysOn configured with two data nodes and a witness. I ran into a problem where my PRIMARY server restarted abruptly.
Here are some of the problems I faced during that period.
Because of the abrupt restart of the PRIMARY replica, the database went into RECOVERY mode.
The database took about 1 hour to recover.
During this RECOVERY phase of the PRIMARY replica, the secondary server (which is now PRIMARY because of the FAILOVER) was hitting timeouts and running slowly.
Looking at the logs, I could see entries about roll forward and rollback happening on the database. But first of all, I am wondering what could be the reason my DB recovery took so long?
Also, I would like some input: if I add one more secondary node to this setup, will that actually help?
Adding the error log:
2015-10-12 16:20:26.30 spid31s The recovery LSN (6821:15912:1) was identified for the database with ID 12. This is an informational message only. No user action is required.
2015-10-12 16:23:56.81 spid44s Recovery of database 'A' (11) is 0% complete (approximately 1168 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2015-10-12 16:23:57.28 spid44s Recovery of database 'A' (11) is 0% complete (approximately 1091 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2015-10-12 16:23:57.28 spid44s Recovery of database 'A' (11) is 0% complete (approximately 891 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 16:24:17.32 spid44s Recovery of database 'A' (11) is 31% complete (approximately 44 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 16:24:50.53 spid6s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Z:\SQLData\A\A.mdf] in database [A] (11). The OS file handle is 0x0000000000000B9. The offset of the latest long I/O is: 0x0000041135800
2015-10-12 16:24:55.46 spid44s Recovery of database 'A' (11) is 53% complete (approximately 51 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 16:25:15.50 spid44s Recovery of database 'A' (11) is 85% complete (approximately 13 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 16:25:49.69 spid44s 3033 transactions rolled forward in database 'A' (11:0). This is an informational message only. No user action is required.
2015-10-12 16:25:50.15 spid44s Recovery completed for database A (database ID 11) in 290 second(s) (analysis 480 ms, redo 87801 ms, undo 0 ms.) This is an informational message only. No user action is required.
2015-10-12 16:25:52.35 spid44s CHECKDB for database 'A' finished without errors on 2012-10-27 22:19:16.470 (local time). This is an informational message only; no user action is required.
2015-10-12 16:42:57.67 spid24s AlwaysOn Availability Groups connection with primary database established for secondary database 'A' on the availability replica with Replica ID: {}. This is an informational message only. No user action is required.
2015-10-12 16:42:57.67 spid24s The recovery LSN (216726:2384:1) was identified for the database with ID 11. This is an informational message only. No user action is required.
2015-10-12 16:42:57.91 spid24s AlwaysOn Availability Groups connection with primary database established for secondary database 'A' on the availability replica with Replica ID: {}. This is an informational message only. No user action is required.
2015-10-12 16:42:57.91 spid24s The recovery LSN (216726:2384:1) was identified for the database with ID 11. This is an informational message only. No user action is required.
2015-10-12 16:42:58.53 spid24s Error: 35278, Severity: 17, State: 1.
2015-10-12 16:42:58.53 spid24s Availability database 'A', which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.
2015-10-12 16:42:58.53 spid29s Nonqualified transactions are being rolled back in database A for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2015-10-12 16:42:58.53 spid38s AlwaysOn Availability Groups connection with primary database terminated for secondary database 'A' on the availability replica with Replica ID: {}. This is an informational message only. No user action is required.
2015-10-12 16:42:58.79 spid29s Starting up database 'A'.
2015-10-12 16:49:45.32 spid29s Recovery of database 'A' (11) is 0% complete (approximately 1168 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2015-10-12 16:49:45.78 spid29s Recovery of database 'A' (11) is 0% complete (approximately 1091 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2015-10-12 16:49:45.78 spid29s Recovery of database 'A' (11) is 0% complete (approximately 891 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 16:50:05.80 spid29s Recovery of database 'A' (11) is 35% complete (approximately 37 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 16:50:25.82 spid29s Recovery of database 'A' (11) is 68% complete (approximately 18 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 16:50:44.66 spid29s AlwaysOn Availability Groups connection with primary database established for secondary database 'A' on the availability replica with Replica ID: {}. This is an informational message only. No user action is required.
2015-10-12 16:50:44.66 spid29s The recovery LSN (216726:2386:1) was identified for the database with ID 11. This is an informational message only. No user action is required.
2015-10-12 16:50:44.66 spid29s Error: 35286, Severity: 16, State: 1.
2015-10-12 16:50:44.66 spid29s Using the recovery LSN (216726:2384:1) stored in the metadata for the database with ID 11. This is an informational message only. No user action is required.
2015-10-12 16:53:12.38 spid29s Error: 35278, Severity: 17, State: 1.
2015-10-12 16:53:12.38 spid29s Availability database 'A', which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.
2015-10-12 16:53:12.38 spid29s Nonqualified transactions are being rolled back in database A for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2015-10-12 16:53:12.40 spid31s AlwaysOn Availability Groups connection with primary database terminated for secondary database 'A' on the availability replica with Replica ID: {}. This is an informational message only. No user action is required.
2015-10-12 16:53:14.45 spid29s Starting up database 'A'.
2015-10-12 17:05:44.30 spid29s Recovery of database 'A' (11) is 0% complete (approximately 1168 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2015-10-12 17:05:44.76 spid29s Recovery of database 'A' (11) is 0% complete (approximately 1091 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2015-10-12 17:05:44.76 spid29s Recovery of database 'A' (11) is 0% complete (approximately 891 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 17:06:04.88 spid29s Recovery of database 'A' (11) is 31% complete (approximately 45 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 17:06:24.92 spid29s Recovery of database 'A' (11) is 65% complete (approximately 21 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2015-10-12 17:06:45.55 spid29s AlwaysOn Availability Groups connection with primary database established for secondary database 'A' on the availability replica with Replica ID: {}. This is an informational message only. No user action is required.
2015-10-12 17:06:45.55 spid29s The recovery LSN (216726:19027:80) was identified for the database with ID 11. This is an informational message only. No user action is required.
2015-10-12 17:06:48.97 spid29s 3034 transactions rolled forward in database 'A' (11:0). This is an informational message only. No user action is required.
2015-10-12 17:06:49.04 spid29s Recovery completed for database A (database ID 11) in 710 second(s) (analysis 470 ms, redo 60412 ms, undo 0 ms.) This is an informational message only. No user action is required.
2015-10-12 17:06:49.07 spid20s AlwaysOn Availability Groups connection with primary database established for secondary database 'A' on the availability replica with Replica ID: {}. This is an informational message only. No user action is required.
2015-10-12 17:06:49.08 spid20s The recovery LSN (216726:19027:80) was identified for the database with ID 11. This is an informational message only. No user action is required.
The problem possibly lies in the error you are seeing: Error: 35278. When this happens, you will probably notice that the database stays in a reverting state for a long period of time.
This can be caused by several things, typically a long-running transaction.
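To see how much of the reported recovery time was actual analysis/redo/undo work, it can help to pull the "Recovery completed" summary lines out of the error log and compare the total against the sum of the three phases. The following is only a rough sketch: the regular expression is written against the message format shown in the log in the question, and the function name `recovery_summaries` and the dictionary keys are my own, not anything SQL Server provides.

```python
import re

# Pattern for the "Recovery completed" summary line, based on the
# format seen in the error log posted in the question (an assumption;
# other SQL Server versions may word it differently).
SUMMARY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2}) \S+ "
    r"Recovery completed for database (?P<db>\S+) \(database ID \d+\) "
    r"in (?P<total>\d+) second\(s\) "
    r"\(analysis (?P<analysis>\d+) ms, redo (?P<redo>\d+) ms, "
    r"undo (?P<undo>\d+) ms\.\)"
)


def recovery_summaries(lines):
    """Return one dict per 'Recovery completed' line found in `lines`."""
    rows = []
    for line in lines:
        m = SUMMARY.match(line)
        if m is None:
            continue  # not a recovery summary line
        total_s = int(m["total"])
        # Sum of the three phases, converted from milliseconds to seconds.
        phases_s = (int(m["analysis"]) + int(m["redo"]) + int(m["undo"])) / 1000
        rows.append({
            "timestamp": m["ts"],
            "database": m["db"],
            "total_s": total_s,
            "phases_s": phases_s,
            # Time in the total that the message does not attribute to a phase.
            "unaccounted_s": round(total_s - phases_s, 3),
        })
    return rows


# The two summary lines copied from the log in the question.
sample = [
    "2015-10-12 16:25:50.15 spid44s Recovery completed for database A "
    "(database ID 11) in 290 second(s) (analysis 480 ms, redo 87801 ms, "
    "undo 0 ms.) This is an informational message only. No user action is required.",
    "2015-10-12 17:06:49.04 spid29s Recovery completed for database A "
    "(database ID 11) in 710 second(s) (analysis 470 ms, redo 60412 ms, "
    "undo 0 ms.) This is an informational message only. No user action is required.",
]

for row in recovery_summaries(sample):
    print(row)
```

For the two summary lines in your log, the three phases add up to roughly 88 and 61 seconds against reported totals of 290 and 710 seconds. The log line itself does not say where the remaining time went, so treat the difference as a pointer for further digging rather than a diagnosis.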
The timeouts you were experiencing may have been caused by the traffic sent between the replicas while the database recovered and came back online. However, are you sure the timeouts were not caused by some other issue?
I would like to know what your backup strategy is for this database and when the last full backup ran before this failover. I recently experienced this in a test environment where backups were not being run. Taking a full backup and a subsequent log backup allowed a failover to happen without problems: the error was not present and the recovery time was very fast.