Analysis of an abnormal RAC hang (sskgxpsnd2)

2024/11/25 4:21:38

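This is a RAC cluster running on Kylin OS on an x86 virtualization platform. Timeline: at around 20:24, one node was found to be hung and did not respond to shutdown commands. After the hung node was powered off, the other node hung as well, and that node was then rebooted.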

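Analysis that led nowhere

Check the GI alert log

It shows a large backward jump in time.

Next, the crsd log

It simply crashed and left no useful information.

I also looked at the ctssd log, which was not much help either.

After some checking, this was most likely caused by the virtual machine being rebooted and then synchronizing its clock from the host. It is not the cause of the hang.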
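
The backward time jump can be located mechanically instead of by eye. The following is a minimal sketch, assuming the log uses timestamp-only lines in the same format as the ASM alert log quoted in this post (for example `2023-08-29T20:22:27.651980+08:00`). The function name `find_time_regressions` and the sample lines are illustrative, not part of any Oracle tool.

```python
import re
from datetime import datetime

# Timestamp-only lines, e.g. 2023-08-29T20:22:27.651980+08:00
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}[+-]\d{2}:\d{2}$")


def find_time_regressions(lines):
    """Return (previous, current) timestamp pairs where the log clock moved backwards."""
    regressions = []
    previous = None
    for line in lines:
        text = line.strip()
        if not TS_RE.match(text):
            continue  # ordinary message line, not a timestamp
        current = datetime.fromisoformat(text)
        if previous is not None and current < previous:
            regressions.append((previous, current))
        previous = current
    return regressions


if __name__ == "__main__":
    # Hypothetical sample shaped like the alert-log excerpts in this post.
    sample = [
        "2023-08-29T20:22:27.651980+08:00",
        "Instance shutdown complete (OS id: 20496)",
        "2023-08-29T20:05:25.578362+08:00",
        "MEMORY_TARGET defaulting to 1128267776.",
    ]
    for prev, cur in find_time_regressions(sample):
        print(f"clock went back by {prev - cur}: {prev} -> {cur}")
```

A regression that coincides with an instance restart, as in the sample, points to a clock resynchronization after reboot rather than to the hang itself.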

Analysis that paid off

Check the ASM instance log

2023-08-29T14:08:32.939390+08:00
skgxpvfynet: mtype: 61 process 12032 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2023-08-29T14:08:32.995797+08:00
opiodr aborting process unknown ospid (11828) as a result of ORA-603
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_m000_12032.trc  (incident=247819):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247819/+ASM1_m000_12032_i247819.trc
2023-08-29T14:08:33.135129+08:00
skgxpvfynet: mtype: 61 process 12030 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_12030.trc  (incident=247820):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247820/+ASM1_ora_12030_i247820.trc
2023-08-29T14:08:35.419844+08:00
opidrv aborting process M000 ospid (12032) as a result of ORA-603
2023-08-29T14:08:35.699231+08:00
opiodr aborting process unknown ospid (12030) as a result of ORA-603
2023-08-29T14:08:35.868746+08:00
Process m000 died, see its trace file
2023-08-29T15:46:28.224445+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.303.1146123969'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40051182.285.1146152783'
2023-08-29T16:10:47.520135+08:00
skgxpvfynet: mtype: 61 process 8178 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8178.trc  (incident=247821):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247821/+ASM1_ora_8178_i247821.trc
2023-08-29T16:10:49.918887+08:00
opiodr aborting process unknown ospid (8178) as a result of ORA-603
2023-08-29T19:46:34.829519+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.285.1146152783'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40735354.289.1146167189'
2023-08-29T20:21:04.087366+08:00
NOTE: ASM client -MGMTDB:_mgmtdb:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb:gz11db connection state (reg:1944369513)
2023-08-29T20:21:04.693879+08:00
Dumping diagnostic data in directory=[cdmp_20230829202104], requested by (instance=0, osid=23886), summary=[trace bucket dump request (kfnclDelete0)].
2023-08-29T20:21:05.338195+08:00
NOTE: detected orphaned client id 0x10001.
2023-08-29T20:21:50.697339+08:00
NOTE: ASM client YXPTDB1:YXPTDB:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client YXPTDB1:YXPTDB:gz11db connection state (reg:1287912358)
2023-08-29T20:21:53.345392+08:00
NOTE: detected orphaned client id 0x10002.
2023-08-29T20:22:08.589123+08:00
NOTE: client exited [11015]
2023-08-29T20:22:09.901462+08:00
NOTE: client +ASM1:+ASM:gz11db no longer has group 2 (OCRVOTE) mounted
NOTE: client +ASM1:+ASM:gz11db no longer has group 1 (DATA) mounted
2023-08-29T20:22:09.972392+08:00
NOTE: ASMB0 process exiting due to ASM instance shutdown (inactive for 1 seconds)
NOTE: ASMB0 clearing idle groups before exit
2023-08-29T20:22:10.123064+08:00
NOTE: client +ASM1:+ASM:gz11db deregistered
2023-08-29T20:22:10.195414+08:00
Shutting down instance (immediate) (OS id: 20496)
Shutting down instance: further logons disabled
Stopping background process MMNL
2023-08-29T20:22:11.333414+08:00
Stopping background process MMON
2023-08-29T20:22:13.335540+08:00
License high water mark = 19
2023-08-29T20:22:14.528890+08:00
SQL> ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */ 
2023-08-29T20:22:14.641327+08:00
NOTE: cache dismounting (clean) group 1/0xA5E0179B (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.254098+08:00
NOTE: LGWR doing clean dismount of group 1 (DATA) thread 1
NOTE: LGWR closing thread 1 of diskgroup 1 (DATA) at ABA 31.3948
NOTE: LGWR released recovery enqueue for thread 1 group 1 (DATA)
2023-08-29T20:22:15.525844+08:00
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
2023-08-29T20:22:15.652953+08:00
NOTE: detached from domain 1
2023-08-29T20:22:15.656908+08:00
NOTE: cache dismounted group 1/0xA5E0179B (DATA)
2023-08-29T20:22:15.714886+08:00
GMON dismounting group 1 at 6571 for pid 33, osid 20496
2023-08-29T20:22:15.778396+08:00
NOTE: Disk DATA_0000 in mode 0x7f marked for de-assignment
2023-08-29T20:22:15.813238+08:00
SUCCESS: diskgroup DATA was dismounted
NOTE: cache deleting context for group DATA 1/0xa5e0179b
2023-08-29T20:22:15.820667+08:00
NOTE: cache dismounting (clean) group 2/0xA5F0179C (OCRVOTE)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.830477+08:00
NOTE: LGWR doing clean dismount of group 2 (OCRVOTE) thread 1
NOTE: LGWR closing thread 1 of diskgroup 2 (OCRVOTE) at ABA 33.8895
NOTE: LGWR released recovery enqueue for thread 1 group 2 (OCRVOTE)
2023-08-29T20:22:16.030466+08:00
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
2023-08-29T20:22:16.058295+08:00
NOTE: detached from domain 2
2023-08-29T20:22:16.066375+08:00
NOTE: cache dismounted group 2/0xA5F0179C (OCRVOTE)
2023-08-29T20:22:16.068034+08:00
GMON dismounting group 2 at 6572 for pid 33, osid 20496
2023-08-29T20:22:16.076227+08:00
NOTE: Disk OCRVOTE_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0002 in mode 0x7f marked for de-assignment
2023-08-29T20:22:16.086385+08:00
SUCCESS: diskgroup OCRVOTE was dismounted
NOTE: cache deleting context for group OCRVOTE 2/0xa5f0179c
2023-08-29T20:22:16.092465+08:00
SUCCESS: ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.511581+08:00
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.702392+08:00
Stopping background process VKTM
2023-08-29T20:22:24.267747+08:00
freeing rdom 2
freeing rdom 1
freeing rdom 0
2023-08-29T20:22:27.651980+08:00
Instance shutdown complete (OS id: 20496)
2023-08-29T20:05:25.578362+08:00
MEMORY_TARGET defaulting to 1128267776.
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
ksxp_exafusion_enabled_dcf: ipclw_enabled=0 
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
* instance_number obtained from CSS = 1, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 1 
Starting ORACLE instance (normal) (OS id: 10339)
2023-08-29T20:05:25.597821+08:00
CLI notifier numLatches:7 maxDescs:2103
2023-08-29T20:05:25.640338+08:00
**********************************************************************
2023-08-29T20:05:25.640902+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SG

The ASM log from before the reboot contains one important error stack:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

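The ASM instance reports that buffer space has run out (status 105 is ENOBUFS on Linux, returned here by sendmsg), so the problem probably lies in OS-level buffers.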
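
To see how often this stack appears and whether it always fails at the same place, the alert log can be summarized with a short script. This is a sketch under assumptions: it relies only on the ORA-27300 and ORA-27302 line formats shown above, the helper name `summarize_ipc_failures` is made up for illustration, and the errno table covers only the Linux value seen in this incident.

```python
import re
from collections import Counter

STATUS_RE = re.compile(
    r"ORA-27300: OS system dependent operation:(\w+) failed with status: (\d+)"
)
LOCATION_RE = re.compile(r"ORA-27302: failure occurred at: (\w+)")

# Linux errno names for statuses of interest. 105 is ENOBUFS on Linux,
# matching the "No buffer space available" text in ORA-27301.
LINUX_ERRNO_NAMES = {105: "ENOBUFS"}


def summarize_ipc_failures(lines):
    """Count (operation, errno name, failure location) triples in alert-log lines."""
    counts = Counter()
    pending = None  # (operation, status) waiting for its ORA-27302 line
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            pending = (match.group(1), int(match.group(2)))
            continue
        match = LOCATION_RE.search(line)
        if match and pending is not None:
            operation, status = pending
            name = LINUX_ERRNO_NAMES.get(status, str(status))
            counts[(operation, name, match.group(1))] += 1
            pending = None
    return counts


if __name__ == "__main__":
    excerpt = [
        "ORA-27300: OS system dependent operation:sendmsg failed with status: 105",
        "ORA-27301: OS failure message: No buffer space available",
        "ORA-27302: failure occurred at: sskgxpsnd2",
    ]
    for key, count in summarize_ipc_failures(excerpt).items():
        print(key, count)
```

Running it over both the ASM and the DB alert logs would show whether every failure is a `sendmsg` call returning ENOBUFS at `sskgxpsnd2`, which is what the excerpts in this post suggest.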

Next, the DB instance alert log

2023-08-29T20:06:53.015478+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015515+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015616+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015651+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015861+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015911+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016094+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016176+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016311+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016330+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016377+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016444+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.066125+08:00
Started redo scan
2023-08-29T20:06:53.220605+08:00
Completed redo scan
 read 7782 KB redo, 707 data blocks need recovery
2023-08-29T20:06:53.314712+08:00
Started redo application at
 Thread 2: logseq 17827, block 147240, offset 0
2023-08-29T20:06:53.393986+08:00
Recovery of Online Redo Log: Thread 2 Group 3 Seq 17827 Reading mem 0
  Mem# 0: +DATA/YXPTDB/redo03.log
2023-08-29T20:06:53.491915+08:00
Completed redo application of 4.09MB

Scrolling further back, there are also entries showing a process that failed to start:

Errors in file /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/trace/YXPTDB1_m000_14903.trc  (incident=420364):
ORA-00700: soft internal error, arguments: [kfnRConnect2], [0], [0x7F263EE44E48], [], [], [], [], [], [], [], [], []
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/incident/incdir_420364/YXPTDB1_m000_14903_i420364.trc
2023-08-25T19:25:01.488828+08:00
Dumping diagnostic data in directory=[cdmp_20230825192501], requested by (instance=1, osid=14903 (M000)), summary=[incident=420364].
2023-08-25T21:42:03.079640+08:00
Thread 1 advanced to log sequence 10126 (LGWR switch)
  Current log# 2 seq# 10126 mem# 0: +DATA/YXPTDB/redo02.log

The process that crashed here was m000. Because that process is not critical, the business was not affected at the time.

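Confirming against official documentation

A search of MOS turned up document 2484025.1.

Comparing the two, the symptoms are quite similar. The cause is attributed to an internal bug, and I could not find any further description of it.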

However, this bug is reported only on UEK4. The workaround is to change the MTU of the lo interface to 16436.
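
Before and after changing the setting, it helps to confirm the current loopback MTU. The sketch below reads the value from sysfs on Linux. The function names and the `sysfs_root` parameter are illustrative assumptions (the parameter exists only so the function can be exercised against a test directory), and 16436 is simply the value quoted from the MOS note.

```python
from pathlib import Path

TARGET_MTU = 16436  # value quoted from MOS note 2484025.1


def read_mtu(iface="lo", sysfs_root="/sys/class/net"):
    """Read an interface's MTU from sysfs (Linux only)."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def mtu_matches_target(iface="lo", sysfs_root="/sys/class/net", target=TARGET_MTU):
    """Return True when the interface MTU already equals the target value."""
    return read_mtu(iface, sysfs_root) == target


if __name__ == "__main__":
    current = read_mtu("lo")
    print(f"lo mtu = {current}, matches target {TARGET_MTU}: {current == TARGET_MTU}")
```

On a running system the change itself would typically be made as root with `ip link set dev lo mtu 16436`. How to make it persist across reboots depends on the distribution and is not covered here.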

Our environment is:

NeoKylin v7u6 with the 3.10.0-957 kernel
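
Whether a host runs the kernel family named in the note can be checked from the kernel release string. This is a rough heuristic, assuming that UEK builds carry `uek` in the release string and that UEK R4 is the 4.1.12 series. The function name `is_uek4` and the example release strings are illustrative.

```python
import platform


def is_uek4(release):
    """Heuristic: UEK releases contain 'uek'; UEK R4 is the 4.1.12 kernel series."""
    return "uek" in release and release.startswith("4.1.12")


if __name__ == "__main__":
    release = platform.release()
    print(release, "looks like UEK4" if is_uek4(release) else "is not UEK4")
```

For the kernel in this environment, `is_uek4("3.10.0-957")` returns `False`, which matches the mismatch noted here.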

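The error symptoms match, but the kernel does not, so the note can only serve as a reference. The change is still worth making, and I am waiting for follow-up feedback.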
