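I. Primary failure and restart (restart before a standby takes over)
If the primary restarts immediately after a failure, its watcher process enters the Startup state and goes through the watcher startup flow again: it marks the archives of standbys whose data is consistent as valid, marks the archives of all other standbys as invalid, and reopens the primary. Once the Open succeeds, the instance continues to serve as the primary. When the watcher detects that a standby whose archive state is invalid is running normally, it starts the Recovery flow to resynchronize data between the primary and that standby.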
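The restart behaviour can be pictured with a minimal sketch. It assumes the caller already knows which standbys are data-consistent with the primary; how the watcher determines consistency is not covered here. The function names and the "VALID"/"INVALID" strings are illustrative only and are not part of any DM interface.

```python
from typing import Dict, List

def archive_states_after_restart(consistent: Dict[str, bool]) -> Dict[str, str]:
    """Map each standby to the archive state the restarted primary assigns it.

    `consistent` is a hypothetical input: standby name -> whether its data
    is consistent with the primary.
    """
    return {name: ("VALID" if ok else "INVALID") for name, ok in consistent.items()}

def standbys_to_recover(arch_state: Dict[str, str], standby_ok: Dict[str, bool]) -> List[str]:
    """Standbys with an invalid archive that are running normally enter Recovery."""
    return [name for name, state in arch_state.items()
            if state == "INVALID" and standby_ok.get(name, False)]

states = archive_states_after_restart({"DM02": False, "DM03": True})
print(states)
print(standbys_to_recover(states, {"DM02": True, "DM03": True}))
```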
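1. Standby failure handling
When a standby fails (a hardware fault or an internal network card fault), the primary handles the failure somewhat differently in manual switchover mode and in automatic switchover mode.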
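Manual switchover mode
In manual switchover mode, when a standby failure is detected and the Failover conditions are met, the primary's watcher switches to the Failover state immediately and performs the corresponding failure handling. If the Failover conditions are not met, it keeps its current state.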
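Conditions for the primary's watcher to switch to Failover in manual switchover mode:
1. The standby instance has failed, or the network between the primary and the standby has failed, or the standby found an LSN mismatch while replaying logs. In any of these three cases the primary's log synchronization to the standby fails and hangs, leaving the primary instance in the Suspend state
2. The archive state from the primary to this standby is Valid (read/write-splitting clusters are exempt from this restriction)
3. The primary's watcher is in the Startup, Open, or Recovery state
4. No monitor command is currently executing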
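The four conditions can be read as a single predicate. The sketch below assumes they must all hold at once; the `PrimaryView` class and its field names are hypothetical and only mirror the wording of the list, not any DM API.

```python
from dataclasses import dataclass

# Watcher states from condition 3.
WATCHER_STATES_ALLOWING_FAILOVER = {"STARTUP", "OPEN", "RECOVERY"}

@dataclass
class PrimaryView:
    # Hypothetical snapshot of what the primary's watcher sees.
    instance_status: str            # "SUSPEND" once log shipping to the standby hangs
    watcher_state: str              # e.g. "OPEN"
    standby_arch_valid: bool        # archive state towards the failed standby is Valid
    monitor_cmd_running: bool       # a monitor command is currently executing
    rw_split_cluster: bool = False  # read/write-splitting clusters skip condition 2

def manual_failover_allowed(p: PrimaryView) -> bool:
    """Return True when all four manual-mode conditions hold (assumed conjunctive)."""
    cond1 = p.instance_status == "SUSPEND"
    cond2 = p.rw_split_cluster or p.standby_arch_valid
    cond3 = p.watcher_state in WATCHER_STATES_ALLOWING_FAILOVER
    cond4 = not p.monitor_cmd_running
    return cond1 and cond2 and cond3 and cond4

print(manual_failover_allowed(PrimaryView("SUSPEND", "OPEN", True, False)))
```

Keeping each condition in its own variable makes it easy to log which one blocked the switch when the watcher stays in its current state.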
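Automatic switchover mode
In automatic switchover mode, the primary's watcher decides by itself whether to switch to the Failover state or to the Confirm state. If the conditions for neither state are met, it keeps its current state.
Conditions for the primary's watcher to switch directly to Failover in automatic switchover mode, without entering the Confirm state:
1. The four conditions listed above for manual switchover mode are met
2. The standby instance has failed while the standby's watcher is still normal
If condition 1 is met but condition 2 is not, the primary's watcher first enters the Confirm state and waits for a confirmation message from the confirm monitor. Once the watcher is in the Confirm state, one of the following cases applies: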
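1. The network connection between the primary and the confirm monitor is normal
The primary's watcher receives the confirmation message returned by the confirm monitor. If the confirm monitor decides that Failover may be performed, the watcher switches to the Failover state and performs the corresponding handling. If the confirm monitor decides that the Failover conditions are not met, the watcher stays in the Confirm state.
Conditions under which the confirm monitor decides that the primary may perform Failover:
1) The primary's watcher is in the Confirm state
2) The primary instance is normal and in the Suspend state
3) The primary has not been taken over and no other primary exists
4) No takeover/switchover command is executing
5) All standbys whose archives are currently valid can join the primary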
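2. The network connection between the primary and the confirm monitor is abnormal, or no confirm monitor has been started. The primary is allowed to switch to the Failover state and handle the failure once the following conditions are met:
1) The primary instance is normal and in the Suspend state
2) The standby's watcher is normal
3) The primary has not been taken over and no other primary exists
4) No takeover/switchover command is executing
5) The standby could join the primary before it failed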
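3. The network between the primary and the confirm monitor recovers, but the primary has already been taken over. The old primary's watcher switches to the Startup state and re-evaluates whether it can join the new primary.
After the primary's watcher enters the Failover state, it performs the following steps (the same in automatic and manual switchover mode):
1) For real-time primary/standby or MPP primary/standby, notify the primary to set the archive state of the standby to which archive sending failed to invalid
2) Notify the primary to reopen
3) Switch the primary's watcher to the Open state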
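The automatic-mode decision can be summarized as two small functions: one for the moment a standby failure is detected, one for leaving the Confirm state. This is a sketch under stated assumptions, not the watcher's actual code. All names are hypothetical. `base_conditions_met` stands for the four manual-mode conditions, `local_conditions_met` for the five conditions that apply when the confirm monitor is unreachable. Staying in Confirm when those local conditions are not met is an assumption, since the text does not describe that case.

```python
from enum import Enum

class WatcherState(Enum):
    UNCHANGED = "unchanged"
    CONFIRM = "Confirm"
    FAILOVER = "Failover"
    STARTUP = "Startup"

def on_standby_failure(base_conditions_met: bool,
                       standby_instance_failed: bool,
                       standby_watcher_ok: bool) -> WatcherState:
    """Decide the primary watcher's next state in automatic switchover mode."""
    if not base_conditions_met:
        return WatcherState.UNCHANGED
    if standby_instance_failed and standby_watcher_ok:
        # Both conditions hold: switch directly, no confirmation needed.
        return WatcherState.FAILOVER
    return WatcherState.CONFIRM

def leave_confirm(monitor_reachable: bool,
                  monitor_allows_failover: bool,
                  primary_taken_over: bool,
                  local_conditions_met: bool) -> WatcherState:
    """Decide how a watcher in the Confirm state proceeds."""
    if monitor_reachable:
        if primary_taken_over:
            # Case 3: the old primary restarts its join evaluation.
            return WatcherState.STARTUP
        # Case 1: follow the confirm monitor's verdict.
        return WatcherState.FAILOVER if monitor_allows_failover else WatcherState.CONFIRM
    # Case 2: no confirm monitor available, fall back to the local checks.
    # Remaining in Confirm on failure is an assumption of this sketch.
    return WatcherState.FAILOVER if local_conditions_met else WatcherState.CONFIRM

print(on_standby_failure(True, False, False))
print(leave_confirm(False, False, False, True))
```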
II. Recovery log
Clear all ep g_dw_status finished, Recovery finished!
switch sub_state to sub_stat_start!
设置GRP1守护进程为OPEN(SUB:STARTUP)状态
dm_connect_async connection 6 is in progress
非自动切换模式下20s没有收到远程守护进程消息
Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DM01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128083643) CLSN(1280836
Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128042324) CLSN(128042680) S
dm_connect_async connection 6 is timeout
dm_connect_async connection 6 is in progress
dm_connect_async connection 6 is timeout
dw2_send_port_set from dmmonitor vio(6) set, mid(1673602727), from name:dmmonitor, ip:::ffff:192.168.12.125, mon_confirm:FALSE
dw2_send_port_set to dmwatcher vio(8) set, mid(-1), to name:DM02, ip:192.168.12.126
ohis_inst_info_copy_low, inst(DM02) apply info changed, old info[p_db_magic:1486960128, n_apply_ep:1], new info to set[p_db_magic:0, n_apply_ep:0
远程实例的模式、状态或者归档状态发生变化,原状态是:
Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128042324) CLSN(128042680) S
远程实例的模式、状态或者归档状态发生变化,新状态是:
dw2_send_port_set from dmmonitor vio(10) set, mid(1673602730), from name:dmmonitor, ip:::ffff:192.168.12.125, mon_confirm:FALSE
远程实例的模式、状态或者归档状态发生变化,原状态是:
远程实例的模式、状态或者归档状态发生变化,新状态是:
Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(UNKNOWN) 实例状态(SHUTDOWN) 归档状态(UNKNOWN) POCNT(0) FLSN(0) CLSN(0) SLSN(0) SSL
ohis_inst_info_copy_low, inst(DM02) apply info changed, old info[p_db_magic:0, n_apply_ep:0], new info to set[p_db_magic:1486960128, n_apply_ep:1
远程实例的模式、状态或者归档状态发生变化,原状态是:
Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(UNKNOWN) 实例状态(SHUTDOWN) 归档状态(UNKNOWN) POCNT(0) FLSN(0) CLSN(0) SLSN(0) SSL
远程实例的模式、状态或者归档状态发生变化,新状态是:
Instance: 守护进程状态(UNIFY EP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(MOUNT) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(12804579
远程实例的模式、状态或者归档状态发生变化,原状态是:
Instance: 守护进程状态(UNIFY EP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(MOUNT) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(12804579
远程实例的模式、状态或者归档状态发生变化,新状态是:
Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(128045793)
远程实例的模式、状态或者归档状态发生变化,原状态是:
Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(128045793)
远程实例的模式、状态或者归档状态发生变化,新状态是:
Instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(128045793) SL
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
switch sub_state to pre_set_dw_stat!
设置GRP1守护进程为RECOVERY(SUB:STARTUP)状态
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
dw2_notify_set_dw_stat, dseq = 1671462826, from_dw_stat: NONE, to_dw_stat: DW_RECOVERY
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
设置GRP1守护进程子状态为WAIT_SET_DW_STAT状态
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=217, dseq=1671462826, code=0
dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
notify ep(DM01) set dw_stat to DW_RECOVERY success!
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
检测到实例(DM02)可恢复,执行恢复流程
开始向实例(DM02)发送归档日志
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
dw2_rarch_send to DM02[seqno: 0], dseq = 1671462827
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
设置GRP1守护进程子状态为WAIT_SEND_ARCH状态
[ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
[ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
dw2_rarch_send to DM02[seqno: 0], dseq = 1671462828
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
dw2_rarch_send to DM02[seqno: 0], dseq = 1671462829
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
dw2_rarch_send to DM02[seqno: 0], dseq = 1671462830
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
dw2_rarch_send to DM02[seqno: 0], dseq = 1671462831
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
dw2_rarch_send to DM02[seqno: 0], dseq = 1671462832
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
检测到实例(DM02)发送归档成功,设置为当前恢复实例
dw2_notify_sql_exec, dseq = 1671462833, sql: ALTER DATABASE SUSPEND
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
设置GRP1守护进程子状态为WAIT_TO_SUSPEND状态
向实例(DM02)发送归档日志成功,实例(DM01)转入suspend状态
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=1, dseq=1671462833, code=100
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=1, dseq=1671462833, code=0
dw2_clear_ep_cmd_info_with_recv_inst_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
转入suspend状态后,再次发送归档日志
dw2_rarch_send to DM02[seqno: 0], dseq = 1671462834
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
设置GRP1守护进程子状态为WAIT_SEND_ALL_ARCH状态
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=0
发送归档完毕,设置实例(DM02)归档有效
dw2_notify_chg_arch_status, dseq = 1671462835, rstat = 0
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
设置GRP1守护进程子状态为WAIT_SET_ARCH状态
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=100, dseq=1671462835, code=100
实例(DM02)归档状态发生变化:INVALID --> VALID
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=100, dseq=1671462835, code=0
dw2_clear_ep_cmd_info_with_recv_inst_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
设置实例(DM02)归档有效成功,通知实例(DM01)OPEN
dw2_notify_sql_exec, dseq = 1671462836, sql: ALTER DATABASE OPEN FORCE
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
设置GRP1守护进程子状态为WAIT_TO_OPEN状态
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=1, dseq=1671462836, code=0
dw2_clear_ep_cmd_info_with_recv_inst_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
dw2_set_recover_info, instance:DM02, recover flag:TRUE, from monitor:FALSE, last_recv_time:1673602836, recover retry time:60
本地守护进程为RECOVERY状态,本机实例为PRIMARY & OPEN,实例(DM02)故障恢复完成
将实例(DM02)从恢复列表中删除
不存在可恢复备库
dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info.
设置GRP1守护进程子状态为SUB_STATE_CLEAR状态
Clear all ep dw_stat value!
dw2_notify_set_dw_stat, dseq = 1671462837, from_dw_stat: DW_RECOVERY, to_dw_stat: NONE
Send tcp msg to local ep DM01, hpc_seqno:0, code:0
设置GRP1守护进程子状态为WAIT_CLEAR状态
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=217, dseq=1671462837, code=100
dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=217, dseq=1671462837, code=0
dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
notify ep(DM01) set dw_stat to NONE success!
dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info.
Clear all ep g_dw_status finished, Recovery finished!
switch sub_state to sub_stat_start!
设置GRP1守护进程为OPEN(SUB:STARTUP)状态
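The state changes are easy to lose among the repeated ohis_check_can_recover lines. In this log the watcher moves from OPEN to RECOVERY, steps through the sub-states WAIT_SET_DW_STAT, WAIT_SEND_ARCH, WAIT_TO_SUSPEND, WAIT_SEND_ALL_ARCH, WAIT_SET_ARCH, WAIT_TO_OPEN, SUB_STATE_CLEAR and WAIT_CLEAR, and then returns to OPEN. The sketch below pulls only those transitions out of a list of log lines. The two regular expressions are derived from the line formats shown above and may need adjusting for other watcher versions; the function name is illustrative.

```python
import re

# "设置<group>守护进程为<STATE>(SUB:<SUB>)状态": watcher state change.
STATE_RE = re.compile(
    r"设置([A-Za-z0-9_]+)守护进程为([A-Z_ ]+)[((]SUB:([A-Z_]+)[))]状态")
# "设置<group>守护进程子状态为<SUB>状态": sub-state change only.
SUBSTATE_RE = re.compile(r"设置([A-Za-z0-9_]+)守护进程子状态为([A-Z_]+)状态")

def watcher_transitions(lines):
    """Return (group, state, sub_state) tuples in the order they appear."""
    out = []
    state = None  # last watcher state seen; None until the first state line
    for line in lines:
        m = STATE_RE.search(line)
        if m:
            state = m.group(2)
            out.append((m.group(1), state, m.group(3)))
            continue
        m = SUBSTATE_RE.search(line)
        if m:
            out.append((m.group(1), state, m.group(2)))
    return out

sample = [
    "设置GRP1守护进程为RECOVERY(SUB:STARTUP)状态",
    "dw2_notify_set_dw_stat, dseq = 1671462826, from_dw_stat: NONE, to_dw_stat: DW_RECOVERY",
    "设置GRP1守护进程子状态为WAIT_SET_DW_STAT状态",
    "设置GRP1守护进程子状态为WAIT_SEND_ARCH状态",
    "设置GRP1守护进程为OPEN(SUB:STARTUP)状态",
]
for transition in watcher_transitions(sample):
    print(transition)
```

To run it against a real log file, pass the open file object (or any iterable of lines) instead of `sample`.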