Troubleshooting StarRocks BE Crashes
Check for OOM
dmesg -T | grep -i oom   # check whether the kernel OOM killer was triggered
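To narrow the dmesg output to kills of the BE itself, a filter along these lines may help. It is only a sketch: it assumes the BE process is named starrocks_be and that the kernel logs OOM kills as "Killed process <pid> (<name>)" (older kernels print "Kill process"). The function name be_oom_kills is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads kernel log text on stdin and prints only the
# lines where the OOM killer killed a process named starrocks_be.
# Assumption: the kernel message looks like "Killed process 22176 (starrocks_be)".
be_oom_kills() {
  grep -iE 'kill(ed)? process [0-9]+ \(starrocks_be\)'
}

# Usage (reading the kernel ring buffer may require root):
#   dmesg -T | be_oom_kills
```

If this prints nothing while the plain grep for "oom" does, the OOM killer probably picked another process on the machine.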
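Causes:
OOM causes in 2.x versions
- mem_limit in the BE configuration file (be.conf) is set inappropriately. Set mem_limit = total machine memory - memory used by other services - 1~2 GB reserved for the system.
For example, on a machine with 40 GB of memory that also runs a MySQL instance expected to use at most 4 GB, set mem_limit=34G (40 - 4 - 2).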
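The arithmetic above can be sketched as a small shell helper. The function name calc_mem_limit_gb and its arguments are illustrative assumptions, not part of StarRocks; the 2 GB default reserve is the upper end of the 1~2 GB range mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggests a mem_limit value in GB.
#   $1 = total machine memory (GB)
#   $2 = memory used by other services (GB)
#   $3 = memory reserved for the system (GB), defaults to 2
calc_mem_limit_gb() {
  local total_gb=$1 other_gb=$2 reserve_gb=${3:-2}
  echo $(( total_gb - other_gb - reserve_gb ))
}

# The example from the text: 40 GB machine, MySQL capped at about 4 GB.
echo "mem_limit=$(calc_mem_limit_gb 40 4 2)G"   # prints: mem_limit=34G
```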
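Check system parameters
Start by checking whether the system parameters are configured sensibly. The recommended settings are described at https://docs.starrocks.io/zh/docs/deployment/environment_configurations/.
Pay particular attention to the ulimit, overcommit, and swap settings. Check them as follows.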
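ulimit check
Look at max processes and max open files; both must be >= 65535.
ulimit -a   # show the limits of the current shell session
cat /proc/$be_pid/limits   # show the limits applied to the running BE process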
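Reading the limits file by eye is easy to get wrong, so the check can be scripted. The sketch below assumes the usual Linux /proc/<pid>/limits layout (soft limit in the first numeric column); check_limits is a hypothetical name, and the 65535 threshold comes from the requirement above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks the soft limits for "Max processes" and
# "Max open files" in a /proc/<pid>/limits style file.
#   $1 = path to the limits file
#   $2 = minimum acceptable value (defaults to 65535)
# Returns non-zero if either soft limit is below the minimum.
check_limits() {
  local limits_file=$1 min=${2:-65535}
  awk -v min="$min" '
    /^Max processes/  { name = "max processes";  soft = $3 }
    /^Max open files/ { name = "max open files"; soft = $4 }
    name != "" {
      ok = (soft == "unlimited" || soft + 0 >= min)
      printf "%s: %s %s\n", name, soft, (ok ? "OK" : "TOO LOW")
      if (!ok) bad = 1
      name = ""
    }
    END { exit (bad + 0) }
  ' "$limits_file"
}

# Usage, with be_pid set to the PID of the BE process:
#   check_limits "/proc/$be_pid/limits"
```

Checking the BE's own limits file matters because the process may have been started with limits that differ from the current shell's.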
overcommit check
The following value should be 1:
cat /proc/sys/vm/overcommit_memory
swap check
The following value should be 0 so that swap is effectively disabled:
cat /proc/sys/vm/swappiness
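The overcommit and swap checks follow the same pattern, so they can share one helper. This is a minimal sketch; check_proc_value is a hypothetical name, and the expected values (1 and 0) are the ones stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares the content of a /proc/sys file with the
# expected value and reports the result.
#   $1 = file path, $2 = expected value
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
check_proc_value() {
  local path=$1 expected=$2 actual
  actual=$(cat "$path") || return 2
  if [ "$actual" = "$expected" ]; then
    echo "$path = $actual (OK)"
  else
    echo "$path = $actual (expected $expected)"
    return 1
  fi
}

# Usage:
#   check_proc_value /proc/sys/vm/overcommit_memory 1
#   check_proc_value /proc/sys/vm/swappiness 0
```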
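Check the BE log
If the parameters above are configured correctly and the BE still crashes, look at be.out: crashes currently print their stack trace there.
First, open be.out:
# less be.out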
query_id:0862041d-07bd-11f0-9214-005056853513, fragment_instance:0862041d-07bd-11f0-9214-005056853518
..............
*** Aborted at 1742716891 (unix time) try "date -d @1742716891" if you are using GNU date ***
PC: @ 0x527d26b starrocks::SegmentIterator::_finish_late_materialization()
*** SIGSEGV (@0x0) received by PID 22176 (TID 0x7f06987b1700) from PID 0; stack trace: ***
@ 0x688b642 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f089e584630 (unknown)
@ 0x527d26b starrocks::SegmentIterator::_finish_late_materialization()
@ 0x5288648 starrocks::SegmentIterator::_do_get_next()
@ 0x528aa30 starrocks::SegmentIterator::do_get_next()
@ 0x530e573 starrocks::ProjectionIterator::do_get_next()
@ 0x5994675 starrocks::SegmentIteratorWrapper::do_get_next()
@ 0x57c62d3 starrocks::TimedChunkIterator::do_get_next()
@ 0x5341706 starrocks::TabletReader::do_get_next()
@ 0x3b0271b starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
@ 0x3b02e42 starrocks::pipeline::OlapChunkSource::_read_chunk()
@ 0x3afba17 starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
@ 0x37c0c38 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
@ 0x38d4c91 starrocks::workgroup::ScanExecutor::worker_thread()
@ 0x2ed30ec starrocks::ThreadPool::dispatch_thread()
@ 0x2ecc7ba starrocks::Thread::supervise_thread()
@ 0x7f089e57cea5 start_thread
@ 0x7f089d97d9fd __clone
@ 0x0 (unknown)
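- First search the 常见 Crash / BUG 堆栈查询 (common crash / bug stack lookup) page by keyword (in the trace above the keyword is _finish_late_materialization) to determine whether this is a known issue;
- Use the query_id to find the SQL statement in the FE audit log.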
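Both lookups start from values in be.out, which can be extracted with standard tools. The helpers below are a sketch: the function names are made up, the keyword extraction assumes a simple "PC: @ <address> <symbol>()" line like the one above, and the file names be.out and fe.audit.log in the usage comments are assumptions about the deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for pulling crash details out of be.out.

# Print the last query_id recorded in the given be.out file.
last_query_id() {
  grep -oE 'query_id:[0-9a-f-]{36}' "$1" | tail -n 1 | cut -d: -f2
}

# Print the short function name from the last "PC: @ ..." line,
# e.g. "_finish_late_materialization", for use as a search keyword.
crash_keyword() {
  grep -E '^[[:space:]]*PC: @' "$1" | tail -n 1 \
    | sed -E 's/^.*0x[0-9a-f]+[[:space:]]+//; s/\(.*$//; s/^.*:://'
}

# Usage (paths are assumptions; adjust to your deployment):
#   crash_keyword be.out
#   qid=$(last_query_id be.out)
#   grep -F "$qid" fe.audit.log
```

The crash time can be read the way the trace itself suggests, for example date -d @1742716891 with GNU date, which helps narrow down the matching audit log file.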
Reference: https://forum.mirrorship.cn/t/topic/4930