oracle WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK问题分析

服务概述

财务系统出现业务卡顿，数据库实例2遇到>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误导致业务HANG住。

对此HANG的原因进行分析：

故障发生时，未有主机监控数据，从可以获取的数据库AWR中，可以发现在故障发生前一小时，已经出现有较多cursor: pin S wait on X Latch: row cache objects等待；而从AWR报告中也可以看到问题时段的shared pool 从4800M收缩为了2900M，查看SGA组件变动使用情况，可以发现将近2个GB是被KGH: NO ACCESS所使用，即处于shrik/grow的中间状态。

此时ALERT日志中开始出现有>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误，查看此时的System State dumped TRACE文件，核心进程在等待latch: row cache objects资源，根据故障时ALERT日志中systemstate dump prod2_ora_3191.trc信息来看，以典型的LCK0进程为例，此进程在等待和kghfrunp(释放shared pool中内存)有关的latch,此进程的历史等待事件中也大量存在'latch: shared pool'、latch: row cache objects；kghfrunp(kghfrunp相关的功能是需要从shared pool中释放空间相关)有关的latch。综合考虑,从现在得到的信息来看,很可能发生问题时shared pool的使用出现了问题。

RAC数据库实例HANG分析

4.1 数据库实例日志的分析

4.1.1故障节点数据库实例Alert日志

节点1数据库实例的Alert日志

Fri May 31 12:11:03 2019

Incremental checkpoint up to RBA [0x16b38.55e5.0], current log tail at RBA [0x16b38.7949.0]

Fri May 31 12:31:15 2019

Incremental checkpoint up to RBA [0x16b38.d76f.0], current log tail at RBA [0x16b38.d7a1.0]

Fri May 31 12:38:24 2019

GES: Potential blocker (pid=4099) on resource CI-0000001E-00000002;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 12:49:04 2019

Beginning log switch checkpoint up to RBA [0x16b39.2.10], SCN: 6163266145639

Thread 2 advanced to log sequence 92985

  Current log# 8 seq# 92985 mem# 0: +CWDATA/prod/onlinelog/group_8.525.896335299

  Current log# 8 seq# 92985 mem# 1: +CWDATA/prod/onlinelog/group_8.628.896335301

Fri May 31 12:49:09 2019

LNS: Standby redo logfile selected for thread 2 sequence 92985 for destination LOG_ARCHIVE_DEST_2

Fri May 31 12:49:09 2019

Completed checkpoint up to RBA [0x16b39.2.10], SCN: 6163266145639

Fri May 31 12:51:21 2019

Incremental checkpoint up to RBA [0x16b39.19.0], current log tail at RBA [0x16b39.19.0]

Fri May 31 12:54:11 2019

GES: Potential blocker (pid=4099) on resource CI-00000046-00000002;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 13:02:58 2019

GES: Potential blocker (pid=3991) on resource DR-00000000-00000000;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 13:11:23 2019

Incremental checkpoint up to RBA [0x16b39.e1.0], current log tail at RBA [0x16b39.e1.0]

Fri May 31 13:14:13 2019

GES: Potential blocker (pid=4099) on resource CI-0000001E-00000002;

 enqueue info in file /oracle/product/10.2.0/db/rdbms/log/prod2_lmd0_3768.trc and DIAG trace file

Fri May 31 13:25:34 2019

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=287

System State dumped to trace file /oracle/product/10.2.0/db/rdbms/log/prod2_ora_3191.trc

Fri May 31 13:31:25 2019

Incremental checkpoint up to RBA [0x16b39.1ad.0], current log tail at RBA [0x16b39.1ad.0]

Fri May 31 13:33:01 2019

WARNING: inbound connection timed out (ORA-3136)

Fri May 31 13:33:01 2019

WARNING: inbound connection timed out (ORA-3136)

Fri May 31 13:33:01 2019

WARNING: inbound connection timed out (ORA-3136)

对故障时间段的alert日志进行分析，可以发现当时节点1的ALERT日志中出现大量如下告警信息：

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=***

4.2 对dump信息的分析

节点2的ALERT日志中提到了大量System State dumped文件，对此类TRACE文件进行了分析。如下：

使用发生问题后产生的systemstate dump prod2_ora_3191.trc。

[oracle@bys1 ~]$ awk -f ass109.awk prod2_ora_3191.trc



Starting Systemstate 1

.............................................

Ass.Awk Version 1.0.9 - Processing prod2_ora_3191.trc



System State 1

~~~~~~~~~~~~~~~~

1:                                      

2:  waiting for 'pmon timer'            wait

3:  waiting for 'DIAG idle wait'        wait

4:  waiting for 'rdbms ipc message'     wait

5:  waiting for 'rdbms ipc message'     wait

6:  waiting for 'ges remote message'    wait

7:  waiting for 'gcs remote message'    wait

8:  waiting for 'gcs remote message'    wait

9:  waiting for 'gcs remote message'    wait

10: waiting for 'gcs remote message'    wait

11: waiting for 'gcs remote message'    wait

12: waiting for 'gcs remote message'    wait

13: waiting for 'gcs remote message'    wait

14: waiting for 'gcs remote message'    wait

15: waiting for 'gcs remote message'    wait

16: waiting for 'gcs remote message'    wait

17: waiting for 'gcs remote message'    wait

18: waiting for 'gcs remote message'    wait

19: waiting for 'gcs remote message'    wait

20: waiting for 'gcs remote message'    wait

21: waiting for 'gcs remote message'    wait

22: waiting for 'gcs remote message'    wait

23: waiting for 'gcs remote message'    wait

24: waiting for 'gcs remote message'    wait

25: waiting for 'gcs remote message'    wait

26: waiting for 'gcs remote message'    wait

27: waiting for 'rdbms ipc message'     wait

28: waiting for 'rdbms ipc message'     wait

29: waiting for 'rdbms ipc message'     wait

30: waiting for 'rdbms ipc message'     wait

31: waiting for 'rdbms ipc message'     wait

32: waiting for 'rdbms ipc message'     wait

33: waiting for 'rdbms ipc message'     wait

34: waiting for 'rdbms ipc message'     wait

35: waiting for 'rdbms ipc message'     wait

36: waiting for 'rdbms ipc message'     wait

37: waiting for 'rdbms ipc message'     wait

38: waiting for 'rdbms ipc message'     wait

39: waiting for 'rdbms ipc message'     wait

40: waiting for 'latch: row cache objects'[Latch 33a5d5bd0] wait

41: waiting for 'cursor: pin S wait on X' wait

42: waiting for 'latch: shared pool'   [Latch 33ba89b70] wait

43: waiting for 'rdbms ipc reply'       wait

44: waiting for 'rdbms ipc message'     wait

45: waiting for 'latch: row cache objects'[Latch 33a5d5bd0] wait

Blockers

~~~~~~~~



        Above is a list of all the processes. If they are waiting for a resource

        then it will be given in square brackets. Below is a summary of the

        waited upon resources, together with the holder of that resource.

        Notes:

        ~~~~~

         o A process id of '???' implies that the holder was not found in the

           systemstate.



                    Resource Holder State

             Latch 33a5d5bd0    ??? Blocker

             Latch 33ba89b70    ??? Blocker



Object Names

~~~~~~~~~~~~

Latch 33a5d5bd0 Child row cache objects       

Latch 33ba89b70 Child library cache           





167446 Lines Processed.

latch: row cache objects等待具体信息

LCK0进程在等待和kghfrunp(释放shared pool中内存)有关的latch,此进程的历史等待事件中也大量存在'latch: shared pool'、latch: row cache objects

PROCESS 45:

  ----------------------------------------

  SO: 0x357240660, type: 2, owner: (nil), flag: INIT/-/-/0x00

  (process) Oracle pid=45, calls cur/top: 0x33472fae8/0x33472fae8, flag: (6) SYSTEM

            int error: 0, call error: 0, sess error: 0, txn error 0

  (post info) last post received: 0 0 24

              last post received-location: ksasnd

              last process to post me: 35224fdd8 39 0

              last post sent: 0 0 87

              last post sent-location: kjmdrmchk: drm_hb ack

              last process posted by me: 355269da8 1 6

    (latch info) wait_event=0 bits=0

        Location from where call was made: kqrbip:

      waiting for 33a5d5bd0 Child row cache objects level=4 child#=16

        Location from where latch is held: kghfrunp: clatch: wait:  ====>>> 可以看到这个latch是因为kghfrunp(KGH: Ask client to free unpinned space)相关的功能发起的,而这个功能是需要从shared pool中释放空间有关

        Context saved from call: 0

        state=busy, wlstate=free

          waiters [orapid (seconds since: put on list, posted, alive check)]:

           50 (1, 1559280342, 1)

           45 (1, 1559280342, 1)

           waiter count=2

          gotten 540727585 times wait, failed first 2224239 sleeps 386931

          gotten 3781484 times nowait, failed: 419225

        possible holder pid = 223 ospid=30412

      on wait list for 33a5d5bd0

    Process Group: DEFAULT, pseudo proc: 0x3552a2308

    O/S info: user: oracle, term: UNKNOWN, ospid: 4099

    OSD pid info: Unix process pid: 4099, image: oracle@findbb (LCK0)

    Short stack dump: ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+111<-<0x311720eb10>

此进程的历史等待事件：

----------------------------------------

    SO: 0x3553c5998, type: 4, owner: 0x357240660, flag: INIT/-/-/0x00

    (session) sid: 1537 trans: (nil), creator: 0x357240660, flag: (51) USR/- BSY/-/-/-/-/-

              DID: 0000-002D-00000003, short-term DID: 0000-0000-00000000

              txn branch: (nil)

              oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS

    waiting for 'latch: row cache objects' blocking sess=0x(nil) seq=30395 wait_time=0 seconds since wait started=0

                address=33a5d5bd0, number=c7, tries=2

    Dumping Session Wait History

     for 'latch: row cache objects' count=1 wait_time=41

                address=33a5d5bd0, number=c7, tries=1

     for 'latch: row cache objects' count=1 wait_time=422982

                address=33a5d5bd0, number=c7, tries=0

     for 'latch: shared pool' count=1 wait_time=28

                address=600ebf80, number=d5, tries=0

     for 'latch: shared pool' count=1 wait_time=57

                address=600ebf80, number=d5, tries=0

     for 'latch: row cache objects' count=1 wait_time=135277

                address=33a5d5bd0, number=c7, tries=0

     for 'latch: shared pool' count=1 wait_time=29

                address=600ebf80, number=d5, tries=1

     for 'latch: shared pool' count=1 wait_time=37

                address=600ebf80, number=d5, tries=0

     for 'latch: shared pool' count=1 wait_time=78

                address=600ebf80, number=d5, tries=0

     for 'latch: row cache objects' count=1 wait_time=47585

                address=33a5d5bd0, number=c7, tries=0

     for 'latch: row cache objects' count=1 wait_time=45842

                address=33a5d5bd0, number=c7, tries=0

    temporary object counter: 0

      ----------------------------------------

      UOL used : 0 locks(used=0, free=0)

      KGX Atomic Operation Log 0x35afaaab8

       Mutex (nil)(0, 0) idn 7fff00000000 oper NONE

       Cursor Pin uid 1537 efd 3 whr 11 slp 0

      KGX Atomic Operation Log 0x35afaab00

       Mutex (nil)(0, 0) idn 7fff00000000 oper NONE

       Library Cache uid 1537 efd 0 whr 0 slp 0

      KGX Atomic Operation Log 0x35afaab48

       Mutex (nil)(0, 0) idn 7fff00000000 oper NONE

       Library Cache uid 1537 efd 0 whr 0 slp 0

      ----------------------------------------

      SO: 0x33c58c458, type: 41, owner: 0x3553c5998, flag:

4.3 数据库AWR信息分析

数据库实例在11点时已经HANG住，未生成AWR快照。因此使用9-10点的快照来查看故障发生前是否存在异常。

从AWR可以看到shared pool 从4800M收缩为了2900M，查看SGA组件变动使用情况，可以发现将近2个GB是被KGH: NO ACCESS所使用，即处于shrik/grow的中间状态。

总结与建议

针对此次RAC数据库实例问题，总结及建议如下：

5.1 问题描述与分析

此套RAC上运行的数据库实例，数据库版本为10.2.0.3.0；

其中数据库实例2遇到>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<错误导致业务HANG住。

对此HANG的原因进行分析：

5.2 后续处理方法与建议

1. 数据库SGA内存参数的优化

当前节点1主机内存大小为32GB，节点2由于更换过硬件设备，内存为128GB。

因此之前出于两节点参数统一未对节点2的内存参数进行增加，使用的是sga_target= 12884901888，即12GB。

建议将两节点内存统计设置为SGA设置为节点1内存的一半大小，即16GB。（注意此操作需要重启数据库需要安排停机维护时间窗口进行）

2.数据库buffer cache及shared pool组件优化

建议设置最小的buffer cache及shared pool的值

这是用来设置下限值，使SGA自动管理时不会调整到低于此值。具体值应结合历史AWR报告中Report Summary部分的Cache Sizes大小及Advisory Statistics 部分建议的 Buffer Pool Advisory/Shared Pool Advisory值。使用的命令如下：（注意在非业务高峰时或安排停机维护时间窗口进行）

SQL> ALTER SYSTEM SET DB_CACHE_SIZE=n SCOPE=SPFILE;

SQL> ALTER SYSTEM SET SHARED_POOL_SIZE=m SCOPE=SPFILE;

建议可以设置为BUFFER CACHE 10GB，SHARED POOL 5GB。

3.关闭DRM相关参数

10G

_gc_affinity_time=0

_gc_undo_affinity=FALSE

这2个参数是静态参数，也就是说必须要重启实例才能生效。

实际上可以设置另外2个动态的隐含参数，来达到这个目的。按下面的值设置这2个参数之后，不能完全算是禁止/关闭了DRM，而是从”事实上“关闭了DRM。

--可以设置更高的值。

_gc_affinity_limit=250

_gc_affinity_minimum=10485760