Oracle 11g RAC 节点异常重启问题分析

news2024/10/10 10:28:16

一、背景

在国庆期间巡检的时候,发现数据库alert日志中出现了异常重启的信息,当即对该报错进行分析处理。

二、处理过程

(1)数据库告警日志分析

node1 alert:
Sat Oct 05 13:05:14 2024
Thread 1 advanced to log sequence 6981 (LGWR switch)
  Current log# 11 seq# 6981 mem# 0: +DATA/ybqddb/onlinelog/group_11.302.1144593261
Sat Oct 05 13:05:15 2024
Archived Log entry 12130 added for thread 1 sequence 6980 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:48 2024
Reconfiguration started (old inc 27, new inc 29)
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Oct 05 14:50:48 2024
Sat Oct 05 14:50:48 2024
 LMS 3: 1 GCS shadows cancelled, 1 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
 LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Sat Oct 05 14:50:48 2024
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 76 KB redo, 16 data blocks need recovery
Started redo application at
 Thread 2: logseq 5168, block 86069
Sat Oct 05 14:50:53 2024
Setting Resource Manager plan SCHEDULER[0x32DE]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Recovery of Online Redo Log: Thread 2 Group 10 Seq 5168 Reading mem 0
  Mem# 0: +DATA/ybqddb/onlinelog/group_10.300.1144593255
Completed redo application of 0.01MB
Completed instance recovery at
 Thread 2: logseq 5168, block 86222, scn 245107035
 16 data blocks read, 16 data blocks written, 76 redo k-bytes read
Sat Oct 05 14:50:53 2024
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:29 new-inc#:29
Thread 2 advanced to log sequence 5169 (thread recovery)
Redo thread 2 internally disabled at seq 5169 (SMON)
Sat Oct 05 14:50:54 2024
Archived Log entry 12131 added for thread 2 sequence 5168 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:54 2024
ARC2: Archiving disabled thread 2 sequence 5169
Archived Log entry 12132 added for thread 2 sequence 5169 ID 0x8d497377 dest 1:
minact-scn: master continuing after IR
minact-scn: Master considers inst:2 dead
Sat Oct 05 14:51:49 2024
Decreasing number of real time LMS from 4 to 0
Sat Oct 05 14:52:11 2024
Reconfiguration started (old inc 29, new inc 31)
List of instances:
 1 2 (myinst: 1) 
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Oct 05 14:52:11 2024
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
Sat Oct 05 14:52:11 2024
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:31 errcnt:0
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sat Oct 05 14:53:26 2024
Increasing number of real time LMS from 0 to 4
Sat Oct 05 17:05:20 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 17:05:21 2024
Thread 1 advanced to log sequence 6982 (LGWR switch)
  Current log# 5 seq# 6982 mem# 0: +DATA/ybqddb/onlinelog/group_5.290.1144593225
Sat Oct 05 17:05:21 2024
Archived Log entry 12134 added for thread 1 sequence 6981 ID 0x8d497377 dest 1:
Sat Oct 05 21:05:22 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 21:05:22 2024
Thread 1 advanced to log sequence 6983 (LGWR switch)
  Current log# 7 seq# 6983 mem# 0: +DATA/ybqddb/onlinelog/group_7.294.1144593235
Sat Oct 05 21:05:23 2024
Archived Log entry 12135 added for thread 1 sequence 6982 ID 0x8d497377 dest 1:
Sun Oct 06 01:08:47 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 01:08:49 2024
Thread 1 advanced to log sequence 6984 (LGWR switch)
  Current log# 9 seq# 6984 mem# 0: +DATA/ybqddb/onlinelog/group_9.298.1144593249
Sun Oct 06 01:08:49 2024
Archived Log entry 12138 added for thread 1 sequence 6983 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:18 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 05:05:18 2024
Thread 1 advanced to log sequence 6985 (LGWR switch)
  Current log# 11 seq# 6985 mem# 0: +DATA/ybqddb/onlinelog/group_11.302.1144593261
Archived Log entry 12139 added for thread 2 sequence 5173 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:19 2024

node2 alert:

Sat Oct 05 13:05:14 2024
Archived Log entry 12129 added for thread 2 sequence 5167 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:47 2024
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 2109 Serial number: 3
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 2109 Serial number: 3
ASMB (ospid: 15097): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 15097
Sat Oct 05 14:51:59 2024
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
 
Total Shared Global Region in Large Pages = 0 KB (0%)
 
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
 
从节点2日志可以看出,在14:50:47出现与asm实例通信失败报错,并于14:51:59该节点实例开始启动。

(2)查看crs及ocssd日志

node1 crs日志:

2024-07-26 03:02:59.119: 
[client(240419)]CRS-10051:CVU found following errors with Clusterware setup : PRVF-4657 : Name resolution setup check for "nxybj-scan" (IP address: 172.31.122.38) failedPRVF-4664 : Found inconsistent name resolution entries for SCAN name "nxybj-scan"

2024-10-05 14:50:31.281: 
[cssd(10595)]CRS-1612:Network communication with node nxybj2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.200 seconds
2024-10-05 14:50:38.282: 
[cssd(10595)]CRS-1611:Network communication with node nxybj2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.200 seconds
2024-10-05 14:50:43.283: 
[cssd(10595)]CRS-1610:Network communication with node nxybj2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.200 seconds
2024-10-05 14:50:45.484: 
[cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: 
[cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down
2024-10-05 14:50:47.999: 
[cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 .
2024-10-05 14:50:48.000: 
[crsd(14356)]CRS-5504:Node down event reported for node 'nxybj2'.
2024-10-05 14:50:51.248: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'Generic'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.orcl'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.ybqddb'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.ybzxddb'.
2024-10-05 14:51:13.084: 
[cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:42.117: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'Generic'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.ybzxddb'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.ybqddb'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.orcl'.

node1 ocssd日志:
2024-10-05 14:50:28.553: [    CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) at 50% heartbeat fatal, removal in 14.200 seconds
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) is impending reconfig, flag 2229260, misstime 15800
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693547, LATS 2409222280, lastSeqNo 108691578, uniqueness 1708519428, timestamp 1728111030/2407806020
2024-10-05 14:50:31.282: [    CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693549, LATS 2409222280, lastSeqNo 108693378, uniqueness 1708519428, timestamp 1728111030/2407806140
2024-10-05 14:50:31.691: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222690
2024-10-05 14:50:31.781: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222780
2024-10-05 14:50:31.784: [    CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693551, LATS 2409222790, lastSeqNo 108691579, uniqueness 1708519428, timestamp 1728111031/2407807110
2024-10-05 14:50:31.811: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222810
2024-10-05 14:50:32.193: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223200
2024-10-05 14:50:32.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693553, LATS 2409223280, lastSeqNo 108693547, uniqueness 1708519428, timestamp 1728111032/2407807530
2024-10-05 14:50:32.291: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223290
2024-10-05 14:50:32.321: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223320
2024-10-05 14:50:32.701: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223700
2024-10-05 14:50:32.783: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693556, LATS 2409223790, lastSeqNo 108693553, uniqueness 1708519428, timestamp 1728111032/2407808040
2024-10-05 14:50:32.793: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223790
2024-10-05 14:50:32.823: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223820
2024-10-05 14:50:33.203: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224200
2024-10-05 14:50:33.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693559, LATS 2409224280, lastSeqNo 108693556, uniqueness 1708519428, timestamp 1728111033/2407808540
2024-10-05 14:50:33.301: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224300
2024-10-05 14:50:33.331: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224330
2024-10-05 14:50:33.554: [    CSSD][2946451200]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:33.554: [    CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:33.711: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224710
2024-10-05 14:50:33.784: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693562, LATS 2409224790, lastSeqNo 108693559, uniqueness 1708519428, timestamp 1728111033/2407809050
2024-10-05 14:50:33.803: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224800
2024-10-05 14:50:33.833: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224840
2024-10-05 14:50:34.213: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225210
2024-10-05 14:50:34.282: [    CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693563, LATS 2409225280, lastSeqNo 108693551, uniqueness 1708519428, timestamp 1728111033/2407809130
2024-10-05 14:50:34.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693565, LATS 2409225280, lastSeqNo 108693562, uniqueness 1708519428, timestamp 1728111034/2407809550
2024-10-05 14:50:34.311: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225310
2024-10-05 14:50:34.341: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225340
2024-10-05 14:50:34.721: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225720
2024-10-05 14:50:34.784: [    CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693567, LATS 2409225790, lastSeqNo 108693549, uniqueness 1708519428, timestamp 1728111034/2407809680
2024-10-05 14:50:34.784: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693568, LATS 2409225790, lastSeqNo 108693565, uniqueness 1708519428, timestamp 1728111034/2407810050
2024-10-05 14:50:34.813: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225810
 

                                                                                                                            
node2 crs日志:
[ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-03 05:57:32.243: 
[ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-05 14:50:31.036: 
[cssd(10538)]CRS-1612:Network communication with node nxybj1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.520 seconds
2024-10-05 14:50:39.038: 
[cssd(10538)]CRS-1611:Network communication with node nxybj1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.520 seconds
2024-10-05 14:50:43.039: 
[cssd(10538)]CRS-1610:Network communication with node nxybj1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.520 seconds
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.577: 
[cssd(10538)]CRS-1652:Starting clean up of CRSD resources.
2024-10-05 14:50:46.038: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:46.803: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:47.404: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:47.405: 
[cssd(10538)]CRS-1654:Clean up of CRSD resources finished successfully.
2024-10-05 14:50:47.406: 
[cssd(10538)]CRS-1655:CSSD on node nxybj2 detected a problem and started to shutdown.
2024-10-05 14:50:47.434: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14698)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:28777} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2024-10-05 14:50:47.436: 
[ohasd(10322)]CRS-2765:Resource 'ora.crsd' has failed on server 'nxybj2'.
2024-10-05 14:50:47.437: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log.
2024-10-05 14:50:47.437: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(14460)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:2522} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/orarootagent_root/orarootagent_root.log.
2024-10-05 14:50:47.691: 
[cssd(10538)]CRS-1660:The CSS daemon shutdown has completed
2024-10-05 14:50:47.846: 
[ohasd(10322)]CRS-2765:Resource 'ora.evmd' has failed on server 'nxybj2'.
2024-10-05 14:50:47.848: 
[ohasd(10322)]CRS-2765:Resource 'ora.ctssd' has failed on server 'nxybj2'.
2024-10-05 14:50:48.452: 
[crsd(345044)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /u01/app/11.2.0/grid/log/nxybj2/crsd/crsd.log.
2024-10-05 14:50:48.598: 
[ohasd(10322)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'nxybj2'.
2024-10-05 14:50:48.603: 
[ohasd(10322)]CRS-2765:Resource 'ora.crsd' has failed on server 'nxybj2'.
2024-10-05 14:50:48.861: 
[ctssd(348486)]CRS-2402:The Cluster Time Synchronization Service aborted on host nxybj2. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/nxybj2/ctssd/octssd.log.
2024-10-05 14:50:49.102: 
[ohasd(10322)]CRS-2765:Resource 'ora.cssd' has failed on server 'nxybj2'.
2024-10-05 14:50:49.105: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:49.271: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:49.272: 
[ohasd(10322)]CRS-2765:Resource 'ora.asm' has failed on server 'nxybj2'.
2024-10-05 14:50:51.030: 
[cssd(349990)]CRS-1713:CSSD daemon is started in clustered mode
2024-10-05 14:50:52.863: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.ctssd'
2024-10-05 14:50:52.863: 
[ohasd(10322)]CRS-2769:Unable to failover resource 'ora.ctssd'.
2024-10-05 14:51:06.688: 
[cssd(349990)]CRS-1707:Lease acquisition for node nxybj2 number 2 completed
2024-10-05 14:51:07.344: 
[ohasd(10322)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'nxybj2'.
2024-10-05 14:51:07.954: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr02; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.961: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr03; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.968: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr01; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:13.083: 
[cssd(349990)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:15.026: 
[ctssd(351425)]CRS-2403:The Cluster Time Synchronization Service on host nxybj2 is in observer mode.
2024-10-05 14:51:15.261: 
[ctssd(351425)]CRS-2407:The new Cluster Time Synchronization Service reference node is host nxybj1.
2024-10-05 14:51:15.261: 
[ctssd(351425)]CRS-2401:The Cluster Time Synchronization Service started on host nxybj2.
2024-10-05 14:51:17.018: 
[ohasd(10322)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2024-10-05 14:51:17.018: 
[ohasd(10322)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2024-10-05 14:51:22.920: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.cluster_interconnect.haip'
2024-10-05 14:51:22.920: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:51:23.116: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:51:30.099: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.asm'
2024-10-05 14:51:31.109: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.crsd'
2024-10-05 14:51:36.711: 
[crsd(353461)]CRS-1012:The OCR service started on node nxybj2.
2024-10-05 14:51:36.717: 
[evmd(348478)]CRS-1401:EVMD started on node nxybj2.
2024-10-05 14:51:37.724: 
[crsd(353461)]CRS-1201:CRSD started on node nxybj2.
2024-10-05 14:51:39.350: 

                                                                                                                    
node2 ocssd日志:
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) at 50% heartbeat fatal, removal in 14.520 seconds
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) is impending reconfig, flag 2491406, misstime 15480
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.037: [    CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592072, LATS 2407806370, lastSeqNo 108585015, uniqueness 1708518327, timestamp 1728111030/2409221770
2024-10-05 14:50:31.037: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592073, LATS 2407806370, lastSeqNo 108589502, uniqueness 1708518327, timestamp 1728111030/2409221810
2024-10-05 14:50:31.687: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807020
2024-10-05 14:50:31.777: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807110
2024-10-05 14:50:31.817: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807150
2024-10-05 14:50:32.037: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592076, LATS 2407807370, lastSeqNo 108592073, uniqueness 1708518327, timestamp 1728111031/2409222810
2024-10-05 14:50:32.197: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807530
2024-10-05 14:50:32.287: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807620
2024-10-05 14:50:32.327: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807660
2024-10-05 14:50:32.480: [    CSSD][1567225600]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:32.480: [    CSSD][1567225600]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:32.539: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592079, LATS 2407807880, lastSeqNo 108592076, uniqueness 1708518327, timestamp 1728111032/2409223320
2024-10-05 14:50:32.698: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808040
2024-10-05 14:50:32.789: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808130
2024-10-05 14:50:32.828: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808170
2024-10-05 14:50:33.038: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592080, LATS 2407808370, lastSeqNo 108591299, uniqueness 1708518327, timestamp 1728111032/2409223700
2024-10-05 14:50:33.038: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592082, LATS 2407808370, lastSeqNo 108592079, uniqueness 1708518327, timestamp 1728111032/2409223820
2024-10-05 14:50:33.207: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808540
2024-10-05 14:50:33.297: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808630
2024-10-05 14:50:33.337: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808670
2024-10-05 14:50:33.540: [    CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592084, LATS 2407808880, lastSeqNo 108592072, uniqueness 1708518327, timestamp 1728111033/2409224300
2024-10-05 14:50:33.546: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592085, LATS 2407808880, lastSeqNo 108592082, uniqueness 1708518327, timestamp 1728111033/2409224330
2024-10-05 14:50:33.709: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809050
2024-10-05 14:50:33.799: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809130
2024-10-05 14:50:33.838: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809170
2024-10-05 14:50:34.038: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592086, LATS 2407809370, lastSeqNo 108592080, uniqueness 1708518327, timestamp 1728111033/2409224710
2024-10-05 14:50:34.038: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592088, LATS 2407809370, lastSeqNo 108592085, uniqueness 1708518327, timestamp 1728111033/2409224840
2024-10-05 14:50:34.217: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809550
2024-10-05 14:50:34.307: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809640
2024-10-05 14:50:34.347: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809680
2024-10-05 14:50:34.540: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592089, LATS 2407809880, lastSeqNo 108592086, uniqueness 1708518327, timestamp 1728111034/2409225210
2024-10-05 14:50:34.540: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592091, LATS 2407809880, lastSeqNo 108592088, uniqueness 1708518327, timestamp 1728111034/2409225340
2024-10-05 14:50:34.718: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810050
2024-10-05 14:50:34.809: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810150
2024-10-05 14:50:34.849: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810180                                                                                                 

通过crs日志可以看出,在14:50:31开始出现节点之间通信失败报错,随后节点2发生了重启。

(3) 查看节点2系统日志

节点2系统日志:
[root@nxybj2 log]# more messages-20241006
Sep 29 03:50:01 nxybj2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="7519" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep 29 03:50:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Sep 30 03:28:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  1 04:16:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  2 03:24:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  3 03:23:03 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  4 03:27:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  5 03:34:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  5 14:50:46 nxybj2 ntpd[394054]: Deleting interface #4 bond0:1, 172.31.122.37#123, interface stats: received=0, sent=0, dropped=0, active_time=6191372 secs
Oct  5 14:51:15 nxybj2 ntpd[394054]: Deleting interface #6 bond1:1, 169.254.209.169#123, interface stats: received=0, sent=0, dropped=0, active_time=6191401 secs
Oct  5 14:51:23 nxybj2 ntpd[394054]: Listen normally on 17 bond1:1 169.254.209.169 UDP 123
Oct  5 14:51:46 nxybj2 ntpd[394054]: Listen normally on 18 bond0:1 172.31.122.37 UDP 123

 系统日志无异常报错。

三、故障原因分析

根据日志信息可以看到,是由于节点之间心跳通信异常,导致节点2发生了重启,根据日志中的关键信息:

2024-10-05 14:50:45.484: 
[cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: 
[cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down

在mos上查询该报错,通过查询到的结果分析,该报错为bug:

受影响版本为:11.2.0.3,当前生成环境版本为:11.2.0.4,由此可以确定本次节点2重启是bug导致。

进一步查询,该bug在如下版本中进行了修复:

根据查询结果,要想修复此bug,就需要打对应版本的补丁。
 

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/2195477.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

前端模块化进化史:从全局 function 到 ES Modules

目前,前端开发已经离不开由 CommonJS、ES Modules 和 Webpack 构建的模块化开发环境。无论是 JavaScript、CSS、图片还是其他资源,都可以作为一个模块来处理。那么,模块化究竟是如何发展到今天的呢? 全局函数模式 最初的前端模块…

C# 属性和方法

本课要点&#xff1a; 1、属性的使用 2、方法的使用 一 数据安全问题 示例 static void Main(string[] args) {Student stu new Student();stu.age -2; }二 属性 private int age; //年龄 public int Age {get { return age; }set{if (value > 0 && value <…

JavaEE: 深入解析HTTP协议的奥秘(2)

文章目录 HTTP认识 URLURL encode 介绍 认识 "方法"(method)GETPOST其他方法 HTTP JavaEE: 深入解析HTTP协议的奥秘(1) 书接上文~ 认识 URL 平时我们俗称的"网址"其实就是说的 URL .(唯一资源定位符) URL 不是 HTTP 专属的,很多协议都会用到. 其实除了 …

软考-程序设计语言基础

一、程序设计语言概述 (一)程序设计语言的基本概念 1. 程序设计语言的目的 为了书写计算机程序而人为设计的符号语言&#xff0c;用于对计算过程进行描述、组织和推导。 2.低级语言 机器语言&#xff1a; 机器语言是计算机最原始的语言&#xff0c;由0和1的代码构成 计算机…

You must konw JS!!(超详细的javascript套餐,适合计算机专业有基础的,包含常见前端开发面试题)

1.起源 JavaScript 起源于 1995 年&#xff0c;当时它主要是为了满足网页交互的需求而被创建。它最初的设计目的是为了让网页开发者能够在网页中添加一些简单的交互效果和动态内容。在那个时期&#xff0c;网页大多是静态的&#xff0c;而 JavaScript 的出现为网页带来了新的活…

如何将数据从 AWS S3 导入到 Elastic Cloud - 第 1 部分:Elastic Serverless Forwarder

作者&#xff1a;来自 Elastic Hemendra Singh Lodhi 这是多部分博客系列的第一部分&#xff0c;探讨了将数据从 AWS S3 导入 Elastic Cloud 的不同选项。 Elasticsearch 提供了多种从 AWS S3 存储桶导入数据的选项&#xff0c;允许客户根据其特定需求和架构策略选择最合适的方…

助动词的分类及其缩略形式

助动词的分类及其缩略形式 1. 助动词 (auxiliary verb)2. 基本助动词 (primary auxiliary)2.1. 基本助动词 be、do 和 have2.2. 实义动词 be、do 和 have 3. 情态助动词 (modal auxiliary)3.1. 情态助动词取代情态动词 4. 半助动词 (semi-auxiliary)4.1. 不能与 it ... that-cl…

MoveIt2-humble----Planning Around Objects(a)

4.Pick and Place with MoveIt Task Constructor 本节教程会教你创建一个功能包&#xff0c;使用MoveIt Task Constructor规划一个抓取和放置的操作。MoveIt Task Constructor 提供了一种方式&#xff0c;去规划由多种不同子任务&#xff08;也称为阶段&#xff09;所组成的任…

详解Redis分布式锁在SpringBoot的@Async方法中没锁住的坑

背景 Redis分布式锁很有用处&#xff0c;在秒杀、抢购、订单、限流特别是一些用到异步分布式并行处理任务时频繁的用到&#xff0c;可以说它是一个BS架构的应用中最高频使用的技术之一。 但是我们经常会碰到这样的一个问题&#xff0c;那就是我们都按照标准做了但有时运行着、…

分层解耦-05.IOCDI-DI详解

一.依赖注入的注解 在我们的项目中&#xff0c;EmpService的实现类有两个&#xff0c;分别是EmpServiceA和EmpServiceB。这两个实现类都加上Service注解。我们运行程序&#xff0c;就会报错。 这是因为我们依赖注入的注解Autowired默认是按照类型来寻找bean对象的进行依赖注入…

基于Qt的速度仪表盘控件实现

本文将详细讲解一个基于Qt的速度仪表盘控件的实现过程&#xff0c;并对代码进行详细的注释说明。该控件可以模拟汽车仪表盘的外观&#xff0c;并通过滑动条动态改变速度显示。本文将从代码结构、绘制组件到实现细节进行讲解&#xff0c;帮助您理解如何使用Qt框架自定义绘制控件…

CSRF | GET 型 CSRF 漏洞攻击

关注这个漏洞的其他相关笔记&#xff1a;CSRF 漏洞 - 学习手册-CSDN博客 0x01&#xff1a;GET 型 CSRF 漏洞攻击 —— 理论篇 GET 型 CSRF 漏洞是指攻击者通过构造恶意的 HTTP GET 请求&#xff0c;利用用户的登录状态&#xff0c;在用户不知情的情况下&#xff0c;诱使浏览器…

Cortex-M3/M4/M7 芯片 Fault 分析原理与实战

目录 一、简介1、异常类型2、异常优先级3、同步异步问题4、异常具体类型 二、Fault exception registers1、Control registers1.1 CCR1.2 SHP1.3 SHCSR 2、Status and address registers2.1 HardFault Status Register——HSFR2.2 Configurable Fault Status Register——CFSR2…

《Linux从小白到高手》进阶实操篇:用户及权限有关的实际工作场景应用

List item 本篇为《Linux从小白到高手》进阶实操篇的第一篇&#xff0c;主要介绍分享一些用户及权限有关的实际工作场景应用。 场景1&#xff1a; 实际工作中你一定会碰到如下图所示的情景&#xff1a;本部门有5个组&#xff0c;分别为&#xff1a;①Root组&#xff1a;用户…

Python中对象obj类型确定最pythonic的方式——isinstance()函数

python中确定对象obj的类型&#xff0c;isinstance函数最是优雅&#xff0c;type、issubclass等函数也可以&#xff0c;但终究“曲折”。 (笔记模板由python脚本于2024年10月07日 19:42:38创建&#xff0c;本篇笔记适合喜欢python的coder翻阅) 【学习的细节是欢悦的历程】 Pyth…

Vue2电商项目(七)、订单与支付

文章目录 一、交易业务Trade1. 获取用户地址2. 获取订单信息 二、提交订单三、支付1. 获取支付信息2. 支付页面--ElementUI(1) 引入Element UI(2) 弹框支付的业务逻辑(这个逻辑其实没那么全)(3) 支付逻辑知识点小总结 四、个人中心1. 搭建二级路由2. 展示动态数据(1). 接口(2).…

【Kubernetes】常见面试题汇总(六十)

目录 131. pod 一直处于 pending 状态&#xff1f; 132. helm 安装组件失败&#xff1f; 特别说明&#xff1a; 题目 1-68 属于【Kubernetes】的常规概念题&#xff0c;即 “ 汇总&#xff08;一&#xff09;~&#xff08;二十二&#xff09;” 。 题目 69-113 属于…

企业经营异常怎么解除

经营异常是怎么回事&#xff1f;是什么意思&#xff1f;了解异常原因&#xff1a;我们到所属工商营业执照异常的具体原因。原因可能包括未按时提交年报、未履行即时信息公示义务、公示信息隐瞒真实情况或弄xu作jia、失联等。纠正违规行为&#xff1a;查到了异常原因&#xff0c…

洛谷P5723、P5728、P1428、P1319 Python解析

P5723 完整代码 def is_prime(y):if y < 2:return Falsefor i in range(2, int(y**0.5) 1):if y % i 0:return Falsereturn Truen int(input()) sum_primes 0 x 0if n < 2:print("0") elif n 2:print("2\n1") else:for i in range(2, n 1):i…

计数原理与组合 - 离散数学系列(三)

目录 1. 计数原理的基本概念 加法原理&#xff08;Rule of Sum&#xff09; 乘法原理&#xff08;Rule of Product&#xff09; 2. 排列与组合 排列&#xff08;Permutation&#xff09; 组合&#xff08;Combination&#xff09; 日常生活中的例子 3. 二项式定理 4. 实…