备注:本次记录根据后期回顾整理,实际测试过程遇到问题较多,也花费了不少时间,后期整理也有些处理过程被遗漏。
一、问题现象及处理过程
研发中心开发部一同事反馈某Oracle 测试库 10.xxx.xxx.xxx 无法访问,业务无法测试,让我协助帮忙处理下。
尝试登陆该服务器失败,但可以ping通该服务器,于是让系统组同事帮忙通过控制台登陆该服务器看下什么问题,被告知无法进入服务器,报内核错误,于是强制重启。
服务器重启一段时间后,可以正常登陆服务器,数据库已被自动拉起。
告知研发同事进行测试,被告知应用连接超时。
观察数据库日志,发现大量如下报错:
由于ORA-609,导致ospid(xxxx)进程被aborting了,同时由于TNS-12537报错,最终导致应用连接被关闭。
ORA-609错误原因是因为无法与进入的连接进行联系,所以无法将此连接转入监听器,所以数据库的process中止此进程,因此服务器进程被终止关闭
根据报错信息,在网上及MOS上搜索了下相关信息,并根据相关信息,在Oracle用户$ORACLE_HOME/network/admin下创建sqlnet.ora,并配置
sqlnet.expire_time=10
SQLNET.INBOUND_CONNECT_TIMEOUT=180
DIAG_ADR_ENABLED = OFF
让研发同事连接测试,结果应用连接后,很快超时,数据库负载压力突增,就连编辑文件都会被夯住。
根据报错,又反复调整了几个参数值,应用测试还是超时。
另外观察到ASM日志里保留大量如下错误:
根据资料显示 WARNING: Waited 139 secs for write IO to PST disk 0 in group 1 这种报错大意为PST通信链路在访问磁盘组中的磁盘时候等待了XXX秒,而且触发了持续的等待。超时等待会在频率触发的基础上递增每次的等待时间。
出现这种状况的原因一般与操作系统网络通信链路,数据库主机磁盘或者超时参数的设置有关。
并根据建议将_asm_hbeatiowait值进行了调整:alter system set “_asm_hbeatiowait”=120 scope=spfile;
并重启了ASM实例,重启完后,依然应用报连接超时。
此时观察到系统日志/var/log/messages里有报如下信息:
根据 “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message 查询了相关资料,对此的解释是:
默认情况下, Linux会最多使用40%的可用内存作为文件系统缓存。当超过这个阈值后,文件系统会把将缓存中的内存全部写入磁盘, 导致后续的IO请求都是同步的。
将缓存写入磁盘时,有一个默认120秒的超时时间。出现上面的问题的原因是IO子系统的处理速度不够快,不能在120秒将缓存中的数据全部写入磁盘。IO系统响应缓慢,导致越来越多的请求堆积,最终系统内存全部被占用,导致系统失去响应。
对此的建议修改是:
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -p
如果系统永久生效,修改/etc/sysctl.conf文件。加入如下两行
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
重启了数据库和操作系统。
此时也到了周末,期待着经过周末,该问题能否自动修复。
但很遗憾,周一早上观察,依然报连接超时。
研发的同事反馈,部分SQL可以正常查询,但几个关键SQL查询很慢,超过了应用设置的超时时间,其中如下这个SQL要求在3毫米内查出。
select * from (
select q.*,rownum rnum from (
select *
from (select *
from (select inxxxx_xxxxx id,
d.*,
case
when quo_num > 0 then
'1'
when gxxxx_xxxx = '1' then
'4'
when left_day <= 0 then
'3'
else
'3'
end quxxxxx_xxxx,
row_number() over(partition by inxxxx_xxxxx order by source_flag asc,quotation_id desc) rn
from (select m.com_name coxxxxxxxx,
m.com_id,
i.info_title,
i.inxxxx_xxxxx,
(select count(ip.inqxxxx_pxxxx_xxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_num,
(select ip.purchase_unit_text from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) inquiry_prod_unit,
(select sum(ip.purxxxx_xxxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_quantity,
i.mexxxxx_xxxxx,
(select com_id from core_comxxxx_xxx_xxxxx_xxxx where xxxxx_mexxxxx_xxxxx = i.mexxxxx_xxxxx and rownum<=1) xxxxxx_com_id,
(select mm.MEMBER_TYPE
from xxxxx.CORE_MEMXXXX_XXXXX mm
where mm.mexxxxx_xxxxx = i.mexxxxx_xxxxx) member_type,
(select l.logon_xxxxx_xxxxx from xxxxx.core_xxxxxx_inxxxx l where l.mexxxxx_xxxxx=i.mexxxxx_xxxxx and l.user_role='0' and rownum<=1) logon_xxxxx_xxxxx,
i.delivery_province,
i.delivery_city,
i.xxxxxx_time,
i.added_time added_xxxx_inqxxxx,
trunc(i.xxxxxx_time) - trunc(sysdate) + 1 left_day,
f_get_big_buyer_flag(i.mexxxxx_xxxxx) big_flag,
f_get_xxxx_cert_xxxxx(i.mexxxxx_xxxxx) cert_status,
q.invite_read_flag read_flag,
q.gxxxx_xxxx,
'1' source_flag,
q.quotation_id quotation_id,
(select count(1)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'
and q.mexxxxx_xxxxx = 21484xxxxx) quo_num,
(select max(q.added_time)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1') added_time_order,
'0' selection
from inqxxxx_xxxxx i,
quoxxxx_info q,
core_comxxxx_mxxxxx m
where i.business_flag = '1'
and i.sensetive_flag = '0'
and i.confirm_status in ('0', '3')
and m.mexxxxx_xxxxx = i.mexxxxx_xxxxx
and m.com_status in ('1', '3')
and q.relation_type = '1'
and i.inxxxx_xxxxx = q.relation_id
and q.mexxxxx_xxxxx = 21484xxxxx
and q.invite_email is not null
union all
select m.com_name coxxxxxxxx,
m.com_id,
i.info_title,
i.inxxxx_xxxxx,
(select count(ip.inqxxxx_pxxxx_xxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_num,
(select ip.purchase_unit_text from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) inquiry_prod_unit,
(select sum(ip.purxxxx_xxxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_quantity,
i.mexxxxx_xxxxx,
(select com_id from core_comxxxx_xxx_xxxxx_xxxx where xxxxx_mexxxxx_xxxxx = i.mexxxxx_xxxxx and rownum<=1) xxxxxx_com_id,
(select mm.MEMBER_TYPE
from xxxxx.CORE_MEMXXXX_XXXXX mm
where mm.mexxxxx_xxxxx = i.mexxxxx_xxxxx) member_type,
(select l.logon_xxxxx_xxxxx from xxxxx.core_xxxxxx_inxxxx l where l.mexxxxx_xxxxx=i.mexxxxx_xxxxx and l.user_role='0' and rownum<=1) logon_xxxxx_xxxxx,
i.delivery_province,
i.delivery_city,
i.xxxxxx_time,
i.added_time added_xxxx_inqxxxx,
trunc(i.xxxxxx_time) - trunc(sysdate) + 1 left_day,
f_get_big_buyer_flag(i.mexxxxx_xxxxx) big_flag,
f_get_xxxx_cert_xxxxx(i.mexxxxx_xxxxx) cert_status,
q.invite_read_flag read_flag,
q.gxxxx_xxxx,
'1' source_flag,
q.quotation_id quotation_id,
(select count(1)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'
and q.mexxxxx_xxxxx = 21484xxxxx) quo_num,
(select max(q.added_time)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1') added_time_order,
'0' selection
from inqxxxx_xxxxx i,
quoxxxx_info q,
core_product_cn_search p,
core_comxxxx_mxxxxx m
where i.business_flag = '1'
and i.sensetive_flag = '0'
and i.confirm_status in ('0', '3')
and m.mexxxxx_xxxxx = i.mexxxxx_xxxxx
and m.com_status in ('1', '3')
and q.relation_type = '1'
and i.inxxxx_xxxxx = q.relation_id
and i.xxxxxx_prod_id = p.prod_id
and q.mexxxxx_xxxxx = 21484xxxxx
and q.xxxxxx_com_id is not null
union all
select i.coxxxxxxxx,
i.com_id,
i.info_title,
i.inxxxx_xxxxx,
(select count(ip.inqxxxx_pxxxx_xxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_num,
(select ip.purchase_unit_text from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) inquiry_prod_unit,
(select sum(ip.purxxxx_xxxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_quantity,
i.mexxxxx_xxxxx,
(select com_id from core_comxxxx_xxx_xxxxx_xxxx where xxxxx_mexxxxx_xxxxx = i.mexxxxx_xxxxx and rownum<=1) xxxxxx_com_id,
(select mm.MEMBER_TYPE
from xxxxx.CORE_MEMXXXX_XXXXX mm
where mm.mexxxxx_xxxxx = i.mexxxxx_xxxxx) member_type,
(select l.logon_xxxxx_xxxxx from xxxxx.core_xxxxxx_inxxxx l where l.mexxxxx_xxxxx=i.mexxxxx_xxxxx and l.user_role='0' and rownum<=1) logon_xxxxx_xxxxx,
i.delivery_province,
i.delivery_city,
i.xxxxxx_time,
i.added_time added_xxxx_inqxxxx,
trunc(i.xxxxxx_time) - trunc(sysdate) + 1 left_day,
f_get_big_buyer_flag(i.mexxxxx_xxxxx) big_flag,
f_get_xxxx_cert_xxxxx(i.mexxxxx_xxxxx) cert_status,
i.readFlag read_flag,
i.gxxxx_xxxx,
'2' source_flag,
i.quotation_id quotation_id,
(select count(1)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'
and q.mexxxxx_xxxxx = 21484xxxxx) quo_num,
nvl((select max(q.added_time)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'),
i.s_added_time) added_time_order,
i.selection
from (select row_number() over(partition by s.match_xxxxx_xxxxxx order by s.added_time asc) rn,
s.gxxxx_xxxx,
s.added_time s_added_time,
s.read_flxxx_xxxx readFlag,
s.match_id quotation_id,
m.com_id,
m.com_name coxxxxxxxx,
s.selection,
q.*
from xxxxx.systxxx_xxxx_xxxx s,
xxxxx.inqxxxx_xxxxx q,
xxxxx.core_comxxxx_mxxxxx m,
xxxxx.core_comxxxx_xxx_xxxxx_xxxx k
where s.match_xxxxx_xxxxxx = q.inxxxx_xxxxx
and s.xxxxxx_com_id = k.com_id
and s.relation_type in ('3', '4', '5')
and s.delete_flag = '1'
and q.business_flag = '1'
and q.sensetive_flag = '0'
and q.CONFIRM_STATUS in ('0', '3')
and m.mexxxxx_xxxxx = s.mexxxxx_xxxxx
and m.com_status in ('1', '3')
and k.xxxxx_mexxxxx_xxxxx = 21484xxxxx) i
union all
select i.coxxxxxxxx,
i.com_id,
i.info_title,
i.inxxxx_xxxxx,
(select count(ip.inqxxxx_pxxxx_xxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_num,
(select ip.purchase_unit_text from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) inquiry_prod_unit,
(select sum(ip.purxxxx_xxxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_quantity,
i.mexxxxx_xxxxx,
(select com_id from core_comxxxx_xxx_xxxxx_xxxx where xxxxx_mexxxxx_xxxxx = i.mexxxxx_xxxxx and rownum<=1) xxxxxx_com_id,
(select mm.MEMBER_TYPE
from xxxxx.CORE_MEMXXXX_XXXXX mm
where mm.mexxxxx_xxxxx = i.mexxxxx_xxxxx) member_type,
(select l.logon_xxxxx_xxxxx from xxxxx.core_xxxxxx_inxxxx l where l.mexxxxx_xxxxx=i.mexxxxx_xxxxx and l.user_role='0' and rownum<=1) logon_xxxxx_xxxxx,
i.delivery_province,
i.delivery_city,
i.xxxxxx_time,
i.added_time added_xxxx_inqxxxx,
trunc(i.xxxxxx_time) - trunc(sysdate) + 1 left_day,
f_get_big_buyer_flag(i.mexxxxx_xxxxx) big_flag,
f_get_xxxx_cert_xxxxx(i.mexxxxx_xxxxx) cert_status,
i.readFlag read_flag,
i.gxxxx_xxxx,
'3' source_flag,
i.quotation_id quotation_id,
(select count(1)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'
and q.mexxxxx_xxxxx = 21484xxxxx) quo_num,
nvl((select max(q.added_time)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'),
i.s_added_time) added_time_order,
'0' selection
from (select row_number() over(partition by s.match_xxxxx_xxxxxx order by s.added_time asc) rn,
s.gxxxx_xxxx,
s.added_time s_added_time,
s.read_flag readFlag,
s.match_id quotation_id,
m.com_id,
m.com_name coxxxxxxxx,
q.*
from system_match_xxxxxxcn s,
inqxxxx_xxxxx q,
core_comxxxx_mxxxxx m
where s.supp_mexxxxx_xxxxx = 21484xxxxx and s.match_xxxxx_xxxxxx = q.inxxxx_xxxxx
and s.match_method = '2'
and q.business_flag = '1'
and q.sensetive_flag = '0'
and q.CONFIRM_STATUS in ('0', '3')
and m.mexxxxx_xxxxx = q.mexxxxx_xxxxx
and m.com_status in ('1', '3')
AND (( ((TRUNC(s.ADDED_TIME) >= TRUNC(SYSDATE - 2)) OR q.inxxxx_xxxxx = 2000522250 )
AND NOT EXISTS
(SELECT *
FROM quoxxxx_info k
WHERE k.relation_id = q.inxxxx_xxxxx
AND k.relation_type = '1'
AND k.quotation_flag = '1'
AND k.mexxxxx_xxxxx = 21484xxxxx)) OR
s.gxxxx_xxxx = '1' OR EXISTS
(SELECT *
FROM quoxxxx_info k
WHERE k.relation_id = q.inxxxx_xxxxx
AND k.relation_type = '1'
AND k.quotation_flag = '1'
AND k.mexxxxx_xxxxx = 21484xxxxx))
) i
union all
select i.coxxxxxxxx,
i.com_id,
i.info_title,
i.inxxxx_xxxxx,
(select count(ip.inqxxxx_pxxxx_xxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_num,
(select ip.purchase_unit_text from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) inquiry_prod_unit,
(select sum(ip.purxxxx_xxxxx) from inxxxx_xxxxx_info ip where ip.relation_id = i.inxxxx_xxxxx and ip.relation_type = '1' and ip.delete_flag = '1') inquiry_prod_quantity,
i.mexxxxx_xxxxx,
(select com_id from core_comxxxx_xxx_xxxxx_xxxx where xxxxx_mexxxxx_xxxxx = i.mexxxxx_xxxxx and rownum<=1) xxxxxx_com_id,
(select mm.MEMBER_TYPE
from xxxxx.CORE_MEMXXXX_XXXXX mm
where mm.mexxxxx_xxxxx = i.mexxxxx_xxxxx) member_type,
(select l.logon_xxxxx_xxxxx from xxxxx.core_xxxxxx_inxxxx l where l.mexxxxx_xxxxx=i.mexxxxx_xxxxx and l.user_role='0' and rownum<=1) logon_xxxxx_xxxxx,
i.delivery_province,
i.delivery_city,
i.xxxxxx_time,
i.added_time added_xxxx_inqxxxx,
trunc(i.xxxxxx_time) - trunc(sysdate) + 1 left_day,
f_get_big_buyer_flag(i.mexxxxx_xxxxx) big_flag,
f_get_xxxx_cert_xxxxx(i.mexxxxx_xxxxx) cert_status,
i.readFlag read_flag,
i.gxxxx_xxxx,
'2' source_flag,
i.quotation_id quotation_id,
(select count(1)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'
and q.mexxxxx_xxxxx = 21484xxxxx) quo_num,
nvl((select max(q.added_time)
from quoxxxx_info q
where q.relation_id = i.inxxxx_xxxxx
and q.relation_type = '1'
and q.quotation_flag = '1'),
i.s_added_time) added_time_order,
i.selection
from (select row_number() over(partition by s.match_xxxxx_xxxxxx order by s.added_time asc) rn,
s.gxxxx_xxxx,
s.added_time s_added_time,
s.read_flag readFlag,
s.match_id quotation_id,
m.com_id,
m.com_name coxxxxxxxx,
'1' as selection,
q.*
from xxxxx.system_match_xxxxxxcn s,
xxxxx.inqxxxx_xxxxx q,
xxxxx.core_comxxxx_mxxxxx m,
xxxxx.core_comxxxx_xxx_xxxxx_xxxx k
where s.match_xxxxx_xxxxxx = q.inxxxx_xxxxx
and s.xxxxxx_com_id = k.com_id
and s.match_method ='2'
and q.business_flag = '1'
and q.sensetive_flag = '0'
and q.CONFIRM_STATUS in ('0', '3')
and q.mexxxxx_xxxxx = m.mexxxxx_xxxxx
and m.com_status in ('1', '3')
and k.xxxxx_mexxxxx_xxxxx = 21484xxxxx) i
where rn <= 1) d)
where rn <= 1) d where d.quo_num=0 and d.gxxxx_xxxx='0' and d.xxxxxx_time>=trunc(sysdate) order by xxxxxx_time DESC ) q ) p
where rnum > 1
and rnum <= 10
?尝试执行了下该SQL,发现当查到有数据的时候,会非常慢,超过一分钟,当查询无数据的时候会很快,也就一秒左右。
备注:
在处理过程中,也遇到了监听报WARNING: Subscription for node down event still pending告警,如下所示
在10g以后,涉及到Oracle监听器的消息默认会订阅到ONS(Oracle Notification Service),在非RAC环境,建议禁止订阅这些消息。
并根据MOS资料,在GRID用户 $ORACLE_HOME/network/admin下修改 listener.ora
添加 SUBSCRIBE_FOR_NODE_DOWN_EVENT_LISTENER=OFF
并对监听执行 lsnrctl reload。
根据一段时间观察,监听日志未再报 WARNING: Subscription for node down event still pending告警。
二、解决过程
该数据库服务器为64G,SGA和PGA分别设置了30G和10G。
对比了一个线上11G和此环境配置差不多的生产库,调整了下SGA和PGA以及sysctl.conf、limits.conf的参数,重启数据库和服务器依然如此。
并查询了SQL的缓存命中率。
于是将应用程序运行了一段时间,抓取问题时段的ASH报告,如下:
发现主要是direct path read等待事件。direct path read是 11G推出的一个新特性,
Direct path read的目的是让一些不常使用的大表数据(冷数据)在全表扫描时,每次都从磁盘读到用户的私有内存(PGA),而不要去挤占有限的、宝贵的、频繁使用的数据(热数据)所在的共享内存(SGA-buffer cache)。
热数据只在第一次访问时从磁盘读,读到SGA的buffer cache后,再次访问会直接从内存读,效率高、对存储压力小。
从共享内存读到直接路径读,这个变化在不频繁的全表扫描时是起到积极作用的,但某些SQL由于业务设计问题,导致SQL执行计划不合适,就会对数据库产生致命影响。
于是就尝试将 direct path read这一特性进行强制关闭。
ALTER SYSTEM SET _serial_direct_read=‘NEVER’ SCOPE=BOTH;
并通过如下语句查询执行结果:
SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
FROM SYS.x$ksppi x, SYS.x$ksppcv y
WHERE x.indx = y.indx
AND (x.ksppinm = '_small_table_threshold'
OR x.ksppinm = '_serial_direct_read');
复制
我将之前执行的SQL重新又执行了一次,耗时从原来的一分多钟一下子降到了一秒多,于是又重新找了几个变量带入进去,查询时间基本在3秒以下。
通知研发人员复测下,反馈时间确实块了很多,业务测试通过了。
三、后续
对于该环境,通过和研发的沟通,目前存在硬解析较高,因为研发未对SQL添加绑定变量,导致硬解析高。另外禁用了direct path read是否会对其它SQL产生影响,还在观察中。
也欢迎关注我微信公众号