数据库管理244期 2024-09-28
- 数据库管理-第244期 一次无法switchover的故障处理(20240928)
- 1 问题展现
- 2 问题排查与处理
- 2.1 问题1
- 2.2 问题2
- 3 问题分析
- 4 总结
数据库管理-第244期 一次无法switchover的故障处理(20240928)
作者:胖头鱼的鱼缸(尹海文)
Oracle ACE Pro: Database(Oracle与MySQL)
PostgreSQL ACE Partner
10年数据库行业经验,现主要从事数据库服务工作
拥有OCM 11g/12c/19c、MySQL 8.0 OCP、Exadata、CDP等认证
墨天轮MVP、年度墨力之星,ITPUB认证专家、专家百人团成员,OCM讲师,PolarDB开源社区技术顾问,HaloDB外聘技术顾问,OceanBase观察团成员,青学会MOP技术社区(青年数据库学习互助会)技术顾问
圈内拥有“总监”、“保安”、“国产数据库最大敌人”等称号,非著名社恐(社交恐怖分子)
公众号:胖头鱼的鱼缸;CSDN:胖头鱼的鱼缸(尹海文);墨天轮:胖头鱼的鱼缸;ITPUB:yhw1809。
除授权转载并标明出处外,均为“非法”抄袭
中秋前做了一次数据库的倒换演练,结果发现无法switchover,本期就来跟随总监一步一步的寻找并解决问题(同时感谢SR支持)。
1 问题展现
在DGMGRL中进行switchover的时候出现了下面的问题:
DGMGRL> switchover to dbdg;
Performing switchover NOW, please wait...
Error: ORA-16775: target standby database in broker operation has potential data loss
Failed.
Unable to switchover, primary database is still "dbaas"
2 问题排查与处理
2.1 问题1
第一个问题呢是在DGMGRL中show configuration:
DGMGRL> show configuration
Configuration - dg
Protection Mode: MaxPerformance
Members:
dbaas - Primary database
dbdg - Physical standby database
Error: ORA-16664: unable to receive the result from a member
Fast-Start Failover: Disabled
Configuration Status:
ERROR (status updated 34 seconds ago)
但是使用show configuration verbose则是显示正常,查看主备库也是正常的:
DGMGRL> show configuration verbose
Configuration - dg
Protection Mode: MaxPerformance
Members:
dbaas - Primary database
dbdg - Physical standby database
Properties:
FastStartFailoverThreshold = '30'
OperationTimeout = '30'
TraceLevel = 'SUPPORT'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
ObserverReconnect = '0'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
ObserverOverride = 'FALSE'
ExternalDestination1 = ''
ExternalDestination2 = ''
PrimaryLostWriteAction = 'CONTINUE'
ConfigurationWideServiceName = 'dbaas_CFG'
Fast-Start Failover: Disabled
Configuration Status:
SUCCESS
DGMGRL> show database dbaas
Database - dbaas
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
dbaas1
dbaas2
Database Status:
SUCCESS
DGMGRL> show database dbdg
Database - dbdg
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds (computed 0 seconds ago)
Apply Lag: 0 seconds (computed 0 seconds ago)
Average Apply Rate: 61.30 MByte/s
Real Time Query: ON
Instance(s):
dbdg1
dbdg2 (apply instance)
dbdg3
dbdg4
Database Status:
SUCCESS
最终发现是数据库本身是配置了db_domain的,而tnsname和静态监听中未配置domain(即域名后缀,如xxx.com),随机调整监听和tnsname:
监听调整(以主库实例1和listener_scan1为例):
SID_LIST_LISTENER =
(SID_LIST =
(SID_DESC =
(GLOBAL_DBNAME = dbaas.scmcc.com) #增加domain
(ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
(SID_NAME = dbaas1)
)
(SID_DESC =
(GLOBAL_DBNAME = dbaas_DGMGRL.scmcc.com) #增加domain
(ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
(SID_NAME = dbaas1)
)
)
SID_LIST_LISTENER_SCAN1 =
(SID_LIST =
(SID_DESC =
(GLOBAL_DBNAME =dbaas.scmcc.com) #增加domain
(ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
(SID_NAME = dbaas1)
)
(SID_DESC =
(GLOBAL_DBNAME = dbaas_DGMGRL.scmcc.com) #增加domain
(ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
(SID_NAME = dbaas1)
)
)
完成后reload所有监听。
tnsname调整:
DBAAS =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = primary-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = dbaas.xxx.com) #增加domain
)
)
DBDG =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = standby-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = dbdg.xxx.com) #增加domain
)
)
完成后show configuration恢复正常:
DGMGRL> show configuration
Configuration - dg
Protection Mode: MaxPerformance
Members:
dbaas - Primary database
dbdg - Physical standby database
Fast-Start Failover: Disabled
Configuration Status:
SUCCESS (status updated 59 seconds ago)
2.2 问题2
接下来尝试switchover报错依然,validate备库发现一些问题:
DGMGRL> validate database verbose dbdg
Database Role: Physical standby database
Primary Database: dbaas
Ready for Switchover: No #<---Here
Ready for Failover: Yes (Primary Running)
Flashback Database Status:
dbaas: Off
dbdg : Off
Capacity Information:
Database Instances Threads
dbaas 2 4
dbdg 4 4
Managed by Clusterware:
dbaas: YES
dbdg : YES
Temporary Tablespace File Information:
dbaas TEMP Files: 56
dbdg TEMP Files: 67
Data file Online Move in Progress:
dbaas: No
dbdg: No
Standby Apply-Related Information:
Apply State: Running
Apply Lag: 0 seconds (computed 0 seconds ago)
Apply Delay: 0 minutes
Transport-Related Information:
Transport On: Yes
Gap Status: Gap #<---Here
Transport Lag: 0 seconds (computed 0 seconds ago)
Transport Status: Success
Log Files Cleared:
dbaas Standby Redo Log Files: Cleared
dbdg Online Redo Log Files: Cleared
dbdg Standby Redo Log Files: Available
Current Log File Groups Configuration:
Thread # Online Redo Log Groups Standby Redo Log Groups Status
(dbaas) (dbdg)
2 4 7 Sufficient SRLs
1 4 7 Sufficient SRLs
3 4 7 Sufficient SRLs
4 4 7 Sufficient SRLs
Future Log File Groups Configuration:
Thread # Online Redo Log Groups Standby Redo Log Groups Status
(dbdg) (dbaas)
2 4 7 Sufficient SRLs
1 4 7 Sufficient SRLs
3 4 7 Sufficient SRLs
4 4 7 Sufficient SRLs
Current Configuration Log File Sizes:
Thread # Smallest Online Redo Smallest Standby Redo
Log File Size Log File Size
(dbaas) (dbdg)
2 2048 MBytes 2048 MBytes
1 2048 MBytes 2048 MBytes
3 2048 MBytes 2048 MBytes
4 2048 MBytes 2048 MBytes
Future Configuration Log File Sizes:
Thread # Smallest Online Redo Smallest Standby Redo
Log File Size Log File Size
(dbdg) (dbaas)
2 2048 MBytes 2048 MBytes
1 2048 MBytes 2048 MBytes
3 2048 MBytes 2048 MBytes
4 2048 MBytes 2048 MBytes
Apply-Related Property Settings:
Property dbaas Value dbdg Value
DelayMins 0 0
ApplyParallel AUTO AUTO
ApplyInstances 0 0
Transport-Related Property Settings:
Property dbaas Value dbdg Value
LogShipping ON ON
LogXptMode sync sync
Dependency <empty> <empty>
DelayMins 0 0
Binding optional optional
MaxFailure 0 0
ReopenSecs 300 300
NetTimeout 30 30
RedoCompression DISABLE DISABLE
仍然显示无法switchover且存在GAP,但是通过数据库查询发现并未出现GAP(查询语句如下,结果略):
-- primary database
set markup HTML on
spool /tmp/primary_info.html
ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS';
select thread#, max(sequence#) "Last Primary Seq Generated"
from gv$archived_log val, gv$database vdb
where val.resetlogs_change# = vdb.resetlogs_change#
group by thread# order by 1;
SELECT thread#, dest_id, gvad.status, error, fail_sequence FROM gv$archive_dest gvad, gv$instance gvi WHERE gvad.inst_id = gvi.inst_id AND destination is NOT NULL ORDER BY thread#, dest_id;
select * from gv$dataguard_stats;
SELECT a.thread#, b. last_seq, a.applied_seq, a. last_app_timestamp, b.last_seq-a.applied_seq ARC_DIFF FROM (SELECT thread#, MAX(sequence#) applied_seq, MAX(next_time) last_app_timestamp FROM gv$archived_log WHERE applied = 'YES' GROUP BY thread#) a, (SELECT thread#, MAX (sequence#) last_seq FROM gv$archived_log GROUP BY thread#) b WHERE a.thread# = b.thread#;
spool off
set markup HTML off
-- standby database
set markup HTML on
spool /tmp/standby_info.html
ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS';
select process,thread#,sequence#,status from gv$managed_standby;
select * from v$dataguard_stats;
select a.thread#
,a.sequence#
,a.group# grp
, a.bytes/1024/1024 Size_MB
,a.status
,a.archived
,a.first_change# "First SCN Number"
,to_char(FIRST_TIME,'DD-Mon-RR HH24:MI:SS') "First SCN Time"
,to_char(LAST_TIME,'DD-Mon-RR HH24:MI:SS') "Last SCN Time" from
v$standby_log a order by 1,2,3,4;
select thread#, max(sequence#) "Last Standby Seq Received"
from v$archived_log val, v$database vdb
where val.resetlogs_change# = vdb.resetlogs_change#
group by thread# order by 1;
select thread#, max(sequence#) "Last Standby Seq Applied"
from v$archived_log val, v$database vdb
where val.resetlogs_change# = vdb.resetlogs_change#
and val.applied in ('YES','IN-MEMORY')
group by thread# order by 1;
spool off
set markup HTML off
主库检查thread:
SQL> SELECT thread#, instance, status FROM v$thread;
THREAD# INSTANCE STATUS
---------- -------------------- ------
1 dbaas1 OPEN
2 dbaas2 OPEN
3 UNNAMED_INSTANCE_3 CLOSED
4 UNNAMED_INSTANCE_4 CLOSED
这里后台建议做了一个操作:
ALTER DATABASE DISABLE THREAD 3;
ALTER DATABASE DISABLE THREAD 4;
运行一段时间后再次validate备库:
DGMGRL> validate database verbose dbdg
...
Ready for Switchover: Yes
...
Gap Status: No Gap
...
显示可以切换且没有Gap了。目前还没有尝试再次switchover,但应该是没问题了。
3 问题分析
这里也是我第一次遇到这个问题,应该是一开始因为domain配置引起的元数据问题。关于thread的问题,因为主备节点数量不一致,但是其他类似配置的库并没有出现过相关问题,所以我怀疑还是和domain配置有问题带来的连锁反应。
4 总结
本期处理了一个ADG无法switchover的问题,源自于最早的错误配置。
老规矩,知道写了些啥。