一、故障原因
9月3日中午,主机运维人员误操作,对sydb数据库的ASM磁盘做了格式化,将三块共享磁盘做了分区,然后使用vgextend命令加入到本地文件系,造成sydb的两个实例异常,没有操作CRS磁盘,所以RAC 集群无影响
二、问题排查
1.确定被格式化的磁盘
根据Linux的lsbk和history命令,可以确定mpathd和mpath两块盘做了格式化。实际上,当时history只查询了RAC的一个节点,在另一个节点对mpathe磁盘做了格式化。最终确定 三块ASM盘做了破会,其中一块属于ARCDG磁盘组(此磁盘组只包含一块磁盘),另两块属于DATADG磁盘组(此磁盘组只包含两块),而且磁盘组做的外部冗余。
2.确定磁盘破坏的AU
通过kfed 查询被破坏的三快盘
发现三块盘的AU 0 和AU 1 分别有不同数量的block被损坏。
三、恢复方案研究
-
kfed 恢复
kfed 是Oracle 未公开的一种ASM工具,它可以读取和修改ASM的元数据块,kfed直接读取磁盘,不受ASM实例状态影响。
但是三块磁盘被破坏的不仅仅是元数据(AU 0),还有AU 1 的部分数据块,而且kfed 更多的是对元数据磁盘头块的修复,因此不适合本次故障恢复
-
amdu
amdu是元数据导出工具,它能输出ASM磁盘的元数据信息,并且从ASM磁盘中抽取元数据和数据文件,且不依赖ASM实例和ASM磁盘状态。
使用AMDU可以抽取出来数据文件,然后尝试不完全恢复,但是有两块磁盘的ASM元数据1号文件(即文件目录,位于aun 0 blkn 1)被损坏,无法抽取磁盘组下的数据文件列表,而且数据库数据文件过多,工作量巨大,无法明确恢复时长。故也不适合本次故障恢复
-
DUL
DUL或类DUL工具可以在实例未启动状态下直接读取磁盘组的system数据文件,将数据库做逻辑导出。
但是目前DUL已经不公开,而且类DUL工具需要付费。另外数据量巨大,逻辑导出速度慢。
-
备份恢复
所幸此系统有NBU备份,最晚全备时间为8月31日 22点,而且也有每日的归档备份,所以使用NBU备份做全量恢复,然后使用归档追回三天数据方案可行。
四、NBU恢复过程
1.恢复流程
由于无法确定NBU备份集的完整性,和归档的连续性,考虑到NBU恢复失败的可能性,采用如下恢复流程:
- NBU备份恢复到一个新的单机环境sydb_rec
- NBU恢复成功的前提下,重做ARCDG和DATADG磁盘组
- 部署sydb_rec 到 sydb的ADG,sydb_rec 为主,将数据通过rman duplicate 和ADG实现数据全量实施同步。
- 执行sydb_rec 和sydb的主备切换,sydb升级为主库,
- 将sydb注册到CRS集群
2.恢复环境sydb_rec 部署
- 恢复环境初始化
- 静默安装数据库软件
- 配置监听
- 静默建库,正常关闭数据库,启动到nomount状态
- 注意:恢复环境和正式环境 数据库小版本号要一致;操作系统oracle uid和gid和备份客户端一致;dbname 和SID一致
3.NBU恢复
(1)恢复控制文件
rman target /
set dbid 3691232446
run{
ALLOCATE CHANNEL CH00 TYPE 'SBT_TAPE' PARMS 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64' ;
send 'nb_ora_serv=nbumaster,NB_ora_client=sybzdb2';
restore controlfile from '/ctrl_dSYDB_uer3432v1_s91611_p1_t1178700769';
release channel CH00;
}
(2) 重定向redo日志
SQL>alter database rename file '+DATADG/sydb/onlinelog/group_1.272.1082888427' to '/oracle/data/oradata/sydb/redo1.log';
...
(3) 数据文件转储
rman target /
run {
allocate channel c1 type 'sbt_tape' PARMS 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
SEND 'NB_ORA_SERV=nbumaster,NB_ORA_CLIENT=sybzdb2';
allocate channel c2 type 'sbt_tape' PARMS 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
SEND 'NB_ORA_SERV=nbumaster,NB_ORA_CLIENT=sybzdb2';
set newname for datafile 1 to '/oracle/data/oradata/sydb/system.259.937960515';
...
set newname for datafile 86 to '/oracle/data/oradata/sydb/xxxxx.289.1172240931';
set newname for tempfile 1 to '/oracle/data/oradata/sydb/temp.262.937960519';
set newname for tempfile 2 to '/oracle/data/oradata/sydb/temp.290.1172566255';
set newname for tempfile 3 to '/oracle/data/oradata/sydb/temp.291.1172566617';
set newname for tempfile 4 to '/oracle/data/oradata/sydb/temp.292.1172567161';
restore database;
switch datafile all;
release channel c1;
release channel c2;
}
#如果数据量较大,服务器资源足够,可以多开几个通道,
#建议卸载脚本里面,后台执行。
(4)恢复数据库
rman target /
run {
allocate channel c1 type 'sbt_tape' PARMS 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
SEND 'NB_ORA_SERV=nbumaster,NB_ORA_CLIENT=sybzdb2';
allocate channel c2 type 'sbt_tape' PARMS 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
SEND 'NB_ORA_SERV=nbumaster,NB_ORA_CLIENT=sybzdb2';
sql 'alter session set nls_date_format="yyyy-mm-dd hh24:mi:ss"';
set until time='2024-09-03 08:00:00';
recover database;
release channel c1;
release channel c2;
}
#可以指定时间,也可以不指定时间,以控制文件为准
(5)开启数据库
alter database open resetlogs;
(6) 数据库调整
#禁用1线程redo
alter system disable thread 1;
#删除多余的redo,只保留4个
alter database drop logfile group 5;
alter database drop logfile group 6;
alter database drop logfile group 11;
alter database drop logfile group 12;
alter database drop logfile group 13;
alter database drop logfile group 14;
alter database drop logfile group 15;
alter database drop logfile group 16;
五、ADG数据同步
1.ADG环境初始化
(1)清理生产环境db 实例
srvctl status database -d sydb
srvctl stop database -d sydb
dbca -silent -deleteInstance -nodeList sybzdb1 -gdbName sydb -instanceName sydb1 -sysDBAUserName sys -sysDBAPassword xxxx
srvctl remove database -d sydb
(2) 磁盘组重建
alter diskgroup ARCDG dismount;
alter diskgroup DATADG dismount;
alter diskgroup DATADG1 dismount;
drop diskgroup ARCDG force including contents;
drop diskgroup DATADG force including contents;
drop diskgroup DATADG1 force including contents;
#11G 版本以后drop 磁盘组 加force 参数可以擦除磁盘头,不需要再执行dd
create diskgroup ARCDG external redundancy disk '/dev/asmdata1' attribute 'au_size'='1M';
create diskgroup DATADG external redundancy disk
'/dev/asmdata2',
'/dev/asmdata3',
'/dev/asmdata4',
'/dev/asmdata5',
'/dev/asmdata6'
attribute 'au_size'='1M';
# 必要的话修改asm磁盘兼容性,也可以在创建时指定
alter diskgroup arcdg set attribute 'compatible.asm'='11.2';
alter diskgroup datadg set attribute 'compatible.asm'='11.2';
#两个节点都需要要mount
alter diskgroup arcdg mount;
alter diskgroup datadg mount;
(3 )静态监听配置
为了不和 生产环境的监听冲突,新建dup监听,端口为1522
sydb_rec监听配置
DUP =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = xxx.xx.120.152)(PORT=1522))
)
SID_LIST_DUP =
(SID_LIST =
(SID_DESC =
(GLOBAL_DBNAME = sydb_rec)
(ORACLE_HOME = /oracle/app/oracle/base/product/11.2.0)
(SID_NAME = sydb2)
)
)
sydb 监听配置
DUP =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = xxx.xx.35.97)(PORT=1522))
)
SID_LIST_DUP =
(SID_LIST =
(SID_DESC =
(GLOBAL_DBNAME = sydb_dup)
(ORACLE_HOME = /oracle/app/oracle/base/product/11.2.0)
(SID_NAME = sydb2)
)
)
#启动恢复环境和生产环境监听
lsnrctl start dup
tns配置,恢复环境和生产环境
sydb_tns=
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = xxx.xx.35.97)(PORT = 1522))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = sydb_dup)
)
)
rec_tns=
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = xxx.xx.120.152)(PORT = 1522))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = sydb_rec)
)
)
(4) 恢复环境作为主库配置
##主库开启归档 和 force logging
alter database force logging;
#添加 standby log,数量比redo多1 ,大小一致
ALTER DATABASE ADD STANDBY LOGFILE '/oracle/data/oradata/sydb/standby01.log' size 1024M;
ALTER DATABASE ADD STANDBY LOGFILE '/oracle/data/oradata/sydb/standby02.log' size 1024M;
ALTER DATABASE ADD STANDBY LOGFILE '/oracle/data/oradata/sydb/standby03.log' size 1024M;
ALTER DATABASE ADD STANDBY LOGFILE '/oracle/data/oradata/sydb/standby04.log' size 1024M;
ALTER DATABASE ADD STANDBY LOGFILE '/oracle/data/oradata/sydb/standby05.log' size 1024M;
(5)参数配置
#sydb_rec 参数配置
DB_UNIQUE_NAME=sydb_rec
LOG_ARCHIVE_CONFIG='DG_CONFIG=(sydb_rec,sydb)'
LOG_ARCHIVE_DEST_1= 'LOCATION=/oracle/data/oradata/arch1 VALID_FOR=(ALL_LOGFILES,ALL_ROLES) DB_UNIQUE_NAME=sydb_rec'
LOG_ARCHIVE_DEST_2= 'SERVICE=sydb_tns ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=sydb
FAL_SERVER=sydb
DB_FILE_NAME_CONVERT='/oracle/data/oradata/sydb','+DATADG'
LOG_FILE_NAME_CONVERT='/oracle/data/oradata/sydb','+DATADG'
STANDBY_FILE_MANAGEMENT=AUTO
# sydb 参数配置
DB_UNIQUE_NAME=sydb
LOG_ARCHIVE_CONFIG='DG_CONFIG=(sydb_rec,sydb)'
LOG_ARCHIVE_DEST_1= 'LOCATION=+RECDG VALID_FOR=(ALL_LOGFILES,ALL_ROLES) DB_UNIQUE_NAME=sydb'
LOG_ARCHIVE_DEST_2= 'SERVICE=rec_tns ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=sydb_rec
FAL_SERVER=sydb
DB_FILE_NAME_CONVERT='+DATADG','/oracle/data/oradata/sydb'
LOG_FILE_NAME_CONVERT='+DATADG' ,'/oracle/data/oradata/sydb'
STANDBY_FILE_MANAGEMENT=AUTO
(6) 密码文件重建
#恢复环境和生产环境都需要重建
orapwd file=orapwsydb2 password=oracle force=y
(7) sydb 拉起实例
#nomount 状态拉起实例
starup nomount
# 测试连接
rman target sys/oracle auxiliary sys/'oracle'@sydb_tns
2.数据复制和日志应用
(1)开始duplicate
rman target sys/xxxx@rec_tns auxiliary sys/xxxxx@sydb_tns
run
{
allocate AUXILIARY channel c1 type disk;
allocate AUXILIARY channel c2 type disk;
allocate AUXILIARY channel c3 type disk;
allocate AUXILIARY channel c4 type disk;
allocate channel c5 type disk;
allocate channel c6 type disk;
allocate channel c7 type disk;
allocate channel c8 type disk;
duplicate target database for standby from active database nofilenamecheck;
release channel c1;
release channel c2;
release channel c3;
release channel c4;
release channel c5;
release channel c6;
release channel c7;
release channel c8;
}
(2) 复制完成,应用日志
alter database open;
alter database recover managed standby database using current logfile disconnect from session;
3.ADG 主备切换
(1)主库sydb_rec 降级为备库
alter system switch logfile;
alter system archive log current;
alter database commit to switchover to physical standby with session shutdown;
#重启
shutdown abort;
startup mount;
(2)备库sydb 升级为主库
alter database commit to switchover to primary with session shutdown wait;
#开启数据库
alter database open;
六、sydb注册资源到crs
1.添加线程
alter database add logfile thread 1 group 10 '+DATADG' size 1024M;
alter database add logfile thread 1 group 11 '+DATADG' size 1024M;
alter database add logfile thread 1 group 12 '+DATADG' size 1024M;
alter database add logfile thread 1 group 13 '+DATADG' size 1024M;
alter database enable thread 1;
2.配置实例为参数
alter system reset instance_number scope=spfile sid='*';
alter system reset thread scope=spfile sid='*';
alter system reset undo_tablespace scope=spfile sid='*';
alter system set instance_number=1 scope=spfile sid='sydb1';
alter system set instance_number=2 scope=spfile sid='sydb2';
alter system set thread=1 scope=spfile sid='sydb1';
alter system set thread=2 scope=spfile sid='sydb2';
alter system set undo_tablespace='UNDOTBS1' scope=spfile sid='sydb1';
alter system set undo_tablespace='UNDOTBS2' scope=spfile sid='sydb2';
alter system set cluster_database=TRUE scope=spfile sid='*';
alter system set cluster_database_instances=2 scope=spfile sid='*';
3.注册db 和instance
srvctl add database -d sydb -o /oracle/app/oracle/base/product/11.2.0 -c RAC -r PRIMARY
srvctl add instance -d sydb -i sydb2 -x sybzdb2
srvctl add instance -d sydb -i sydb1 -x sybzdb1
srvctl modify database -d sydb -p '+DATADG/sydb/parameterfile/spfile.363.1178881235'
srvctl modify database -db sydb -diskgroup DATADG,ARCDG
4.重启数据库
shutdown immeidate;
srvctl start database -d sydb
#查看启动状态
srvctl status database -d sydb
七 、参数调整
1.稳定参数调整
alter system set "_enable_shared_pool_durations" =false scope=spfile;
alter system set "_enable_space_preallocation"=0 scope=spfile;
alter system set "_gc_policy_time"=0 scope=spfile;
alter system set "_gc_undo_affinity" =FALSE scope=spfile;
alter system set "_gc_read_mostly_locking"=false scope=spfile sid='*';
alter system set "_gc_bypass_readers"=false scope=spfile sid='*';
alter system set "_memory_imm_mode_without_autosga" =false scope=spfile;
alter system set "_resource_manager_always_on" =false scope=spfile;
alter system set "_resource_manager_always_off" =true scope=spfile;
alter system set "_serial_direct_read" =false scope=spfile;
alter system set "_optimizer_use_feedback"=false scope=spfile;
alter system set "_optim_peek_user_binds"=FALSE scope=spfile;
alter system set "_partition_large_extents"=FALSE scope=spfile;
alter system set "_undo_autotune" =FALSE scope=spfile;
alter system set aq_tm_processes=2 scope=spfile;
alter system set db_files=500 scope=spfile;
alter system set db_file_multiblock_read_count=128 scope=spfile;
alter system set deferred_segment_creation=false scope=spfile;
alter system set fast_start_parallel_rollback=false scope=spfile;
alter system set filesystemio_options='SETALL' scope=spfile;
alter system set JOB_QUEUE_PROCESSES=200 scope=spfile;
alter system set parallel_force_local=TRUE scope=spfile;
alter system set parallel_max_servers=50 scope=spfile;
alter system set processes=3000 scope=spfile;
alter system set undo_retention = 3600 scope=spfile;
alter system set session_cached_cursors = 200 scope=spfile;
alter system set sec_case_sensitive_logon=false scope=spfile;
alter system set event='10949 trace name context forever,level 1','28401 trace name context forever,level 1' scope=spfile;
重启数据库生效
2.重置ADG参数
# 这里如果不需要ADG,必须要将dest_2 置空,否则归档会一直被占用,无法删除,导致归档空间满,影响业务
alter system set log_archive_dest_2='' sid='*'
#清空数据文件转换
alter system set db_file_name_convert='','' scope=spfile sid='*';
alter system set log_file_name_convert='','' scope=spfile sid='*';
alter system reset db_file_name_convert scope=spfile sid='*';
alter system reset log_file_name_convert scope=spfile sid='*';
八、总结
- 对生产环境保持敬畏之心,一些高危操作一定要谨慎
- 备份很重要!如果这次故障没有备份,只采取DUL抽取恢复,加大恢复时长,还会有额外的费用