Official crash root-cause investigation
Official fault diagnosis and troubleshooting
Related concepts
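A DM (Dameng) database crash usually produces a core file. Reading the core file is the main way to analyze the cause of the crash. It is comparable to Oracle's diag.trc or system dump files, and records database thread states, SQL statements, and so on.
Good first directions are memory exhaustion, disk space, an expired license, and abnormal SQL (yes, business SQL can bring the database down). For example, check the operating system logs, disk space, sa files, and database logs as an initial screening.
The core file location can be found with: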
cat /proc/sys/kernel/core_pattern
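The value printed by that command is not always a plain directory, so it helps to classify it before hunting for the file. The following is a minimal sketch, assuming the standard Linux core(5) rules (a leading `|` means the dump is piped to a helper program, an absolute pattern names a fixed directory, and anything else is relative to the crashed process's working directory). The function name `describe_core_pattern` is made up for illustration.

```python
import os


def describe_core_pattern(pattern: str, cwd: str = ".") -> str:
    """Classify a kernel.core_pattern value following core(5)."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # The kernel pipes the dump to a user-space helper program.
        parts = pattern[1:].split() or ["?"]
        return "piped to helper: " + parts[0]
    if pattern.startswith("/"):
        # Absolute pattern: the core file lands in a fixed directory.
        return "written under directory: " + os.path.dirname(pattern)
    # Relative pattern: resolved against the crashing process's cwd.
    target = os.path.join(cwd, os.path.dirname(pattern))
    return "written under process cwd: " + os.path.normpath(target)
```

On the database host, the argument would be the content of /proc/sys/kernel/core_pattern, and `cwd` the directory the server process was started from.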
When told that the database is unreachable, first check whether the host has been rebooted:
[dmdba@db1 dm]$ w
15:56:36 up 165 days, 1:59, 3 users, load average: 0.02, 0.02, 0.04
USER TTY LOGIN@ IDLE JCPU PCPU WHAT
root tty1 105月23 165days 6:42m 6:42m /usr/libexec/Xorg -core -noreset :0 -seat seat0 -auth /run/lightdm/root/:0 -nolisten tcp vt1 -novtswitch
root pts/0 1210月24 16days 0.11s 0.06s -bash
dmdba pts/1 15:19 1.00s 3.22s 0.00s w
The host has been up for 165 days, so it has not been rebooted recently.
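The same check can be scripted. This is a rough sketch, assuming the Linux /proc/uptime format, where the first field is the number of seconds since boot. The helper names `uptime_days` and `rebooted_within` are illustrative only.

```python
def uptime_days(proc_uptime_text: str) -> float:
    """Convert the first field of /proc/uptime (seconds since boot) to days."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400.0


def rebooted_within(proc_uptime_text: str, hours: float) -> bool:
    """True if the host booted less than `hours` hours ago."""
    return uptime_days(proc_uptime_text) * 24.0 < hours
```

For example, `rebooted_within(open("/proc/uptime").read(), 24)` would flag a reboot in the last day, which would point to a host-level problem rather than a database-only crash.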
Check the database log:
ls -l /dm/dmdbms/log/dm_DMSERVER_202410.log
2024-10-30 15:59:43.377 [INFO] database P0000598124 T0000000000000598149 checkpoint begin, used_space[30720], free_space[536832000]...
2024-10-30 15:59:43.381 [INFO] database P0000598124 T0000000000000598149 ckpt2_log_adjust: full_status: 160, ptx_reserved: 0
2024-10-30 15:59:43.382 [INFO] database P0000598124 T0000000000000598149 ckpt2_log_adjust: ckpt_lsn(949526222), ckpt_fil(0), ckpt_off(218334208), cur_lsn(949526222), l_next_seq(45367593), g_next_seq(45367593), cur_free(218334208), total_space(536862720), used_space(0), free_space(536862720), n_ep(1)
2024-10-30 15:59:43.382 [INFO] database P0000598124 T0000000000000598149 checkpoint end, 0 pages flushed, used_space[0], free_space[536862720].
2024-10-30 16:02:43.496 [INFO] database P0000598124 T0000000000000598193 checkpoint requested by CKPT_INTERVAL, rlog free space[536832000], used space[30720]
2024-10-30 16:02:43.496 [INFO] database P0000598124 T0000000000000598193 checkpoint generate by ckpt_interval
2024-10-30 16:02:43.496 [INFO] database P0000598124 T0000000000000598149 checkpoint begin, used_space[30720], free_space[536832000]...
2024-10-30 16:02:43.502 [INFO] database P0000598124 T0000000000000598149 ckpt2_log_adjust: full_status: 160, ptx_reserved: 0
2024-10-30 16:02:43.502 [INFO] database P0000598124 T0000000000000598149 ckpt2_log_adjust: ckpt_lsn(949526282), ckpt_fil(0), ckpt_off(218364928), cur_lsn(949526282), l_next_seq(45367653), g_next_seq(45367653), cur_free(218364928), total_space(536862720), used_space(0), free_space(536862720), n_ep(1)
2024-10-30 16:02:43.502 [INFO] database P0000598124 T0000000000000598149 checkpoint end, 0 pages flushed, used_space[0], free_space[536862720].
2024-10-30 16:03:17.582 [INFO] database P0000598124 T0000000000001077553 socket_err_should_retry errno:104
2024-10-30 16:03:17.582 [INFO] database P0000598124 T0000000000001077554 socket_err_should_retry errno:104
2024-10-30 16:03:17.615 [FATAL] database P0000598124 T0000000000001077554 Server page check error! ts_id(1) file_id(0) page_no(34748) page_type(0) index_id(0)
2024-10-30 16:03:17.615 [FATAL] database P0000598124 T0000000000001077554 System Halt!
2024-10-30 16:03:17.615 [FATAL] database P0000598124 T0000000000001077554 [for dem]SYSTEM SHUTDOWN ABORT.
2024-10-30 16:03:17.615 [FATAL] database P0000598124 T0000000000001077554 Server page check error!
2024-10-30 16:03:17.615 [FATAL] database P0000598124 T0000000000001077554 code = -1, dm_sys_halt now!!!
2024-10-30 16:03:17.615 [INFO] database P0000598124 T0000000000001077554 total 2 rfil opened!
[dmdba@db1 log]$
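A fatal error did occur. To decode the line: P0000598124 means the database process ID is 598124, which is of little use here, but the T0000000000001077554 that follows means thread 1077554, and that is important.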
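Pulling the process and thread IDs out by hand gets tedious on a long log. Below is a small sketch that assumes every line follows the layout seen in the excerpt above (timestamp, level in brackets, the word `database`, then the `P` and `T` fields). The names `parse_log_line` and `fatal_threads` are illustrative, not part of any DM tool.

```python
import re

# Layout assumed from the log excerpt: "<timestamp> [LEVEL] database P<pid> T<tid> <message>"
LOG_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(\w+)\] database P(\d+)\s+T(\d+)\s+(.*)$"
)


def parse_log_line(line: str):
    """Return a dict of fields, or None if the line does not match the layout."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    ts, level, pid, tid, msg = m.groups()
    return {"time": ts, "level": level, "pid": int(pid), "tid": int(tid), "message": msg}


def fatal_threads(lines):
    """Collect the thread IDs that logged at least one FATAL line."""
    parsed = (parse_log_line(line) for line in lines)
    return {p["tid"] for p in parsed if p and p["level"] == "FATAL"}
```

Applied to the excerpt above, `fatal_threads` would return only thread 1077554, which is the ID to look for in the core file later.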
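Next, look at Server page check error! ts_id(1) file_id(0) page_no(34748) page_type(0) index_id(0), which carries a lot of information.
ts_id=1 is the tablespace ID; 1 is the rollback (ROLL) tablespace.
file_id=0 is the first file in the ROLL tablespace.
page_no=34748 refers to page 34748 in that file. Which object lives on it? Built-in functions can locate it quite easily:
Get the ID of the object (segment) that owns the page: SF_PAGE_GET_SEGID()
Get the page type: SF_PAGE_GET_PAGE_TYPE(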
ts_id int,
file_id int,
page_no int
)
SQL> select sf_page_get_page_type(1,0,0);
行号 sf_page_get_page_type(1,0,0)
---------- ----------------------------
1 FSM_PAGE_GROUP_HDR
SQL> select sf_page_get_page_type(1, 0, 34748);
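To avoid retyping the three numbers, the query can be generated from the error message itself. This is only a sketch: it assumes the message keeps the `ts_id(..) file_id(..) page_no(..)` layout shown in the log, and `page_type_query` is a hypothetical helper name. The SQL function it emits is the one used above.

```python
import re

# Field layout assumed from the "Server page check error!" line in the log.
PAGE_RE = re.compile(r"ts_id\((\d+)\) file_id\((\d+)\) page_no\((\d+)\)")


def page_type_query(message: str):
    """Build the sf_page_get_page_type query for the page named in the message."""
    m = PAGE_RE.search(message)
    if m is None:
        return None  # message does not carry page coordinates
    ts_id, file_id, page_no = (int(g) for g in m.groups())
    return f"select sf_page_get_page_type({ts_id}, {file_id}, {page_no});"
```

The returned string can then be pasted into the SQL client.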
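If the damage is traced to an index, dropping and rebuilding it is enough. If it cannot be narrowed down further, start by simply looking at the associated SQL, since stack analysis is considerably harder.
This is where dmrdc comes in (my guess at the name: DM read core).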
./dmrdc sfile=core-dm_sql_thd-598124-4
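After ten seconds or so, a file named core-dm_sql_thd-598124-4_tmp is generated in the current directory. It records the SQL that was read from the core file and can be viewed directly with cat: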
!#%&*^$@[1077554]:(case when (A.advanced_pay_amt = 0 or
A.advanced_pay_amt is null) then '2'
else '1'
end
) as FloorFlag,
(case when(A.bill_range_start
is null and A.bill_range_end is null)
then concat(A.bill_no,'1')
else concat(concat(A.bill_no,
lpad(concat(A.bill_range_start, ''), 12, 0),
lpad(concat(A.bill_range_end, ''), 12, 0)),'1')
end
) as loanAccount,
A.acpt_brch_no as OPNOD,
A.bill_money * A.bail_pcet as INTR1,
''
from ABC_XX_BB A,YYYY_DETAIL B
where A.BATCH_ID=B.ID and bill_no in
('123','345','456')
The statement looks incomplete; the tool is still fairly weak.
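When the output file holds statements from many threads, it helps to group them by thread ID and pick out the one that logged the FATAL error. The sketch below assumes each statement starts with the `!#%&*^$@[<thread id>]:` marker seen in the output above and runs until the next marker. That is inferred from this one sample, not from dmrdc documentation, and `sql_by_thread` is an illustrative name.

```python
import re

# Marker format inferred from the sample output: !#%&*^$@[<thread id>]:
MARKER = re.compile("^" + re.escape("!#%&*^$@[") + r"(\d+)\]:")


def sql_by_thread(text: str) -> dict:
    """Group the text that follows each marker by thread ID."""
    chunks = {}
    current = None
    for line in text.splitlines():
        m = MARKER.match(line)
        if m:
            current = int(m.group(1))
            chunks.setdefault(current, [])
            line = line[m.end():]  # keep the SQL that follows the marker
        if current is not None:
            chunks[current].append(line)
    return {tid: "\n".join(lines) for tid, lines in chunks.items()}
```

For the case here, `sql_by_thread(text).get(1077554)` would return whatever fragment dmrdc recovered for the crashing thread.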
Fall back to brute force on the core file:
strings core123 > /tmp/a.log
Once the complete SQL has been recovered, try rewriting or optimizing it from the business side.
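The strings output of a large core file can be huge, so filtering it for a table name known from the partial SQL narrows the search. This is a simplified Python stand-in for `strings` plus a filter, not a replacement for it: it only finds runs of printable ASCII, and it reads the whole input into memory, which may not be practical for a multi-gigabyte core. The function names are illustrative.

```python
import re


def printable_runs(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII at least min_len long, like strings(1)."""
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def find_sql(data: bytes, needle: str):
    """Keep only the runs that mention `needle`, ignoring case."""
    needle = needle.lower()
    return [s for s in printable_runs(data) if needle in s.lower()]
```

For example, searching for `ABC_XX_BB` (the table in the fragment above) would list every printable run in the dump that mentions it. Because newlines end a run, a multi-line statement shows up as several consecutive entries.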
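If you are worried that other problems remain, you can also verify the consistency of the data files with dmdbchk. The check can only be run while the database is shut down.
Once a problem is found, the first priority is still to try to start the database, but here startup fails.
In that case, skip the consistency check first by setting the parameter PSEG_RECV=0, then restore the default value of 3 after the database has started.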