I recently broke replication, and when I tried to skip past the bad transaction I got the following.
MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (0.05 sec)
MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter cannot be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position.
MariaDB [(none)]> select @@gtid_slave_pos;
+---------------------------------------------+
| @@gtid_slave_pos |
+---------------------------------------------+
| 0-1051-1391406,1-1050-1182069,57-1051-98897 |
+---------------------------------------------+
1 row in set (0.00 sec)
MariaDB [(none)]> show variables like '%_pos%';
+----------------------+---------------------------------------------------------+
| Variable_name | Value |
+----------------------+---------------------------------------------------------+
| gtid_binlog_pos | 0-1051-1391406,2-1051-4474,57-1051-98897 |
| gtid_current_pos | 0-1051-1391406,1-1050-1182069,2-1051-4474,57-1051-98897 |
| gtid_slave_pos | 0-1051-1391406,1-1050-1182069,57-1051-98897 |
| wsrep_start_position | 00000000-0000-0000-0000-000000000000:-1 |
+----------------------+---------------------------------------------------------+
What do I need to do to fix this?
Update 1
MariaDB [(none)]> show variables like '%gtid%';
+------------------------+------------------------------------------+
| Variable_name | Value |
+------------------------+------------------------------------------+
| gtid_binlog_pos | 1-1050-4820789,2-1051-379101,3-1010-3273 |
| gtid_binlog_state | 1-1050-4820789,2-1051-379101,3-1010-3273 |
| gtid_current_pos | 1-1050-4819948,2-1051-379101,3-1010-3273 |
| gtid_domain_id | 3 |
| gtid_ignore_duplicates | OFF |
| gtid_seq_no | 0 |
| gtid_slave_pos | 1-1050-4819948,2-1051-379101,3-1010-3273 |
| gtid_strict_mode | OFF |
| last_gtid | |
| wsrep_gtid_domain_id | 0 |
| wsrep_gtid_mode | OFF |
+------------------------+------------------------------------------+
Following the instructions in the error message, I tried setting @@gtid_slave_pos. The slave status before the change:
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: [redacted]
Master_User: [redacted]
Master_Port: 3306
Connect_Retry: 5
Master_Log_File: binary.000591
Read_Master_Log_Pos: 526511543
Relay_Log_File: tmsdb-relay-bin.001239
Relay_Log_Pos: 4
Relay_Master_Log_File: binary.000591
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
Skip_Counter: 0
Exec_Master_Log_Pos: 60724897
Relay_Log_Space: 465787660
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1050
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 1-1050-4827753,2-1051-379101,3-1010-3273
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
1 row in set (0.00 sec)
Using the gtid_slave_pos variable:
MariaDB [(none)]> select @@gtid_slave_pos\G;
*************************** 1. row ***************************
@@gtid_slave_pos: 1-1050-4819948,2-1051-379101,3-1010-3273
MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.21 sec)
MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3274';
Query OK, 0 rows affected (0.10 sec)
MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.21 sec)
When I check the status after running the above, I get: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.56.228.64
Master_User: maxscale
Master_Port: 3306
Connect_Retry: 5
Master_Log_File: binary.000591
Read_Master_Log_Pos: 60724897
Relay_Log_File: tmsdb-relay-bin.001239
Relay_Log_Pos: 4
Relay_Master_Log_File: binary.000591
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 60724897
Relay_Log_Space: 249
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1050
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 1-1050-4819948,2-1051-379101,3-1010-3274
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
1 row in set (0.00 sec)
I can get back to the previous position with:
MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.01 sec)
MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3273';
Query OK, 0 rows affected (0.09 sec)
MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.06 sec)
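I found that the following worked for me. It does not bring the slave back to exactly the same state as the master; there will be data differences, which I will fix with pt-table-sync.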
1. Restart replication without GTID
2. Stop the parallel slave threads
3. Enable GTID replication
4. Use percona-toolkit's pt-slave-restart to skip all errors
1. Restart replication without GTID, using the master binlog position
This is well documented; search for the instructions.
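For reference, step 1 might look roughly like this on the slave. The file name and offset are the Relay_Master_Log_File and Exec_Master_Log_Pos values from the status output in the question; they are used here only as an illustration and would need to be replaced with the current values.

```sql
-- Sketch only: switch the slave to old-style file/offset replication.
STOP SLAVE;

CHANGE MASTER TO
  MASTER_LOG_FILE = 'binary.000591',  -- Relay_Master_Log_File from SHOW SLAVE STATUS
  MASTER_LOG_POS  = 60724897,         -- Exec_Master_Log_Pos from SHOW SLAVE STATUS
  MASTER_USE_GTID = no;               -- turn GTID positioning off for now

START SLAVE;
```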
2. Stop the parallel slave threads
As shown in the original question, this is part of the problem.
ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter cannot be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position.
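I want to be able to skip events without having to work out or increment each GTID position.
Now, if I check the parallel slave threads, I see
Once finished, I can reverse this process to re-enable the parallel slave threads, and I know GTID is working.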
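The commands for this step are not shown above, so the following is only a sketch of one way to turn parallel replication off, using MariaDB's slave_parallel_threads variable, which can only be changed while the slave threads are stopped.

```sql
-- Sketch only: disable parallel apply on the slave.
STOP SLAVE;

-- 0 worker threads means events are applied by the single SQL thread.
SET GLOBAL slave_parallel_threads = 0;

-- Check the parallel replication settings now in effect.
SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';

START SLAVE;

-- Assumption, based on the wording of ERROR 1966: with parallel
-- replication off, SET GLOBAL sql_slave_skip_counter = 1 should be
-- accepted again (run it while the slave is stopped).
```

Reversing the step later means setting slave_parallel_threads back to its previous value in the same way.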
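3. Enable GTID replication
I can now try restarting the slave with GTID enabled.
On the master
On the slave
Now, when I check the slave, it has some events to skip before it gets back to the same state as the master.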
Last_Error: An attempt was made to binlog GTID 1-1050-5004291 which would create an out-of-order sequence number with existing GTID 1-1050-5004322, and gtid strict mode is enabled.
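The master and slave commands for step 3 are missing above, so what follows is a hedged sketch rather than what was actually run. It uses BINLOG_GTID_POS() on the master to translate a file/offset position into a GTID position, and MASTER_USE_GTID = slave_pos on the slave. Choosing slave_pos instead of the Current_Pos shown in the question is my assumption; the file name and offset are illustrative values taken from the earlier status output.

```sql
-- On the master (sketch): GTID position matching an old-style position.
-- Use the slave's current Relay_Master_Log_File / Exec_Master_Log_Pos.
SELECT BINLOG_GTID_POS('binary.000591', 60724897);

-- On the slave (sketch): compare with the position the slave has recorded,
-- then switch the connection back to GTID.
STOP SLAVE;
SELECT @@gtid_slave_pos;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;
SHOW SLAVE STATUS\G
```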
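4. Use percona-toolkit's pt-slave-restart to skip all errors
pt-slave-restart will skip all the events needed to get the slave into a working state.
Now when I check my slave status
Finally, I need to restart the server and make sure it comes back up safely.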
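In production I found that Parallel_Mode was the most likely cause of my problem.
I recommend using a mode other than optimistic if you get the following error.
In the log, I see the following: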
To restart the slave after it fails, you can do the following: stop all slave_parallel_threads and disable slave_parallel_mode.
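I now use pt-slave-restart to restart the slave, because when I just want to get the slave started I don't have to think about sequence numbers and a whole set of other things that take too long. It will keep running without errors; press ctrl-c to stop it once you are happy that your slave has caught up. This is not much different from skipping the events by hand, but it does it automatically.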
If you need parallel threads, you can re-enable them once the slave has caught up or has gotten past the event causing problems. I would try a different slave_parallel_mode, such as conservative.
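A minimal sketch of re-enabling parallel replication in conservative mode once the slave has caught up. The thread count of 4 is a hypothetical value, not one taken from this setup.

```sql
-- Sketch only: both variables can only be changed while the slave is stopped.
STOP SLAVE;

SET GLOBAL slave_parallel_mode    = 'conservative';
SET GLOBAL slave_parallel_threads = 4;  -- hypothetical; size this for your workload

START SLAVE;

-- Verify the settings now in effect.
SELECT @@slave_parallel_mode, @@slave_parallel_threads;
```

SET GLOBAL does not survive a server restart, so the same values would also need to go into the server configuration file to persist.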