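- SR version: version 2.1.12 RELEASE (build 04f2931)
- Deployment:
FE: 10.6.xx.107~10.6.xx.109
BE: 10.6.xx.107~10.6.xx.111
- Problem: One day the BE on machine 10.6.xx.107 reported an out-of-memory error (log below). After a restart the BE did not work properly, yet the post-restart logs showed no errors at all.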
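- Restart steps: sh stop_be.sh, then sh start_be.sh --daemon
- Check the BE process:
- Port 8030 on the FE is reachable, but port 8040 on the BE is not:
- tail -n 200 be.WARNING (no warnings were logged during the restart)
- Configuration file: /opt/module/starRocks/be/conf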
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# INFO, WARNING, ERROR, FATAL
sys_log_level = INFO
# ports for admin, web, heartbeat service
be_port = 9060
webserver_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060
# Choose one if there are more than one ip except loopback address.
# Note that there should at most one ip match this list.
# If no ip match this rule, will choose one randomly.
# use CIDR format, e.g. 10.10.10.0/24
# Default value is empty.
# priority_networks = 10.10.10.0/24;192.168.0.0/16
# data root path, separate by ';'
# you can specify the storage medium of each root path, HDD or SSD, seperate by ','
# eg:
# storage_root_path = /data1,medium:HDD;/data2,medium:SSD;/data3
# /data1, HDD;
# /data2, SSD;
# /data3, HDD(default);
#
# Default value is ${STARROCKS_HOME}/storage, you should create it by hand.
# storage_root_path = ${STARROCKS_HOME}/storage
storage_root_path = /data/startrocks/storage,medium:SSD;
# Advanced configurations
# sys_log_dir = ${STARROCKS_HOME}/log
# sys_log_roll_mode = SIZE-MB-1024
# sys_log_roll_num = 10
# sys_log_verbose_modules = *
# log_buffer_level = -1
default_rowset_type = beta
streaming_load_max_mb = 26000
cumulative_compaction_num_threads_per_disk = 4
base_compaction_num_threads_per_disk = 2
cumulative_compaction_check_interval_seconds = 2
mem_limit=90%
disable_storage_page_cache=false
storage_page_cache_limit=10737418240
load_process_max_memory_limit_bytes=18106127360
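Compared with the configuration in place before the BE restart, the following settings were changed:
mem_limit=65% (previous value)
load_process_max_memory_limit_bytes=10737418240 (previous value)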
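To make the byte-valued limits above easier to compare, here is a minimal Python sketch that converts them to GiB. The helper name `to_gib` is illustrative only; the byte values are copied from the configuration shown above.

```python
GIB = 1024 ** 3  # number of bytes in one GiB


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB, rounded to two decimals."""
    return round(num_bytes / GIB, 2)


# Values copied from the BE configuration above (current and previous settings).
limits = {
    "storage_page_cache_limit": 10737418240,
    "load_process_max_memory_limit_bytes (current)": 18106127360,
    "load_process_max_memory_limit_bytes (previous)": 10737418240,
}

for name, value in limits.items():
    print(f"{name}: {to_gib(value)} GiB")
```

So the load memory limit was raised from 10 GiB to roughly 16.86 GiB, alongside the change of mem_limit from 65% to 90%.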
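- With no logs, the exact cause could not be pinpointed. A GitHub issue describes a similar problem in which possibly corrupted metadata prevents the BE from starting. The workaround tried here: remove this BE node from the cluster, change the data directory (so that the metadata files are regenerated), and let the node rejoin the cluster;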
-- Remove the node from the cluster
ALTER SYSTEM DECOMMISSION backend "be_host:be_heartbeat_service_port";
-- Edit the BE configuration file and set a new storage_root_path
storage_root_path = /data/startrocks/storage_v2,medium:SSD;
-- Restart the BE
sh stop_be.sh
sh start_be.sh --daemon
-- Add the node back to the cluster
ALTER SYSTEM ADD BACKEND "be_host:be_heartbeat_service_port";
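After the node was added back to the cluster it worked normally again. Its memory usage was initially only a few tens of MB, because SR still has to rebalance data automatically and move data shards onto the node before memory usage rises (checked again a month later, memory usage had reached 23 GB);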
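The remove-and-re-add sequence can be sketched as a small Python helper that only builds the two SQL statements for a given backend address. This is an illustration, not part of the original procedure: the function name `backend_statements` is hypothetical, and `be_host` remains a placeholder. Note that rejoining the cluster uses `ALTER SYSTEM ADD BACKEND`, not a second `DECOMMISSION`.

```python
from typing import List


def backend_statements(host: str, heartbeat_port: int) -> List[str]:
    """Return the SQL for removing one BE and later adding it back.

    The address has the form host:heartbeat_service_port.
    """
    address = f"{host}:{heartbeat_port}"
    return [
        f'ALTER SYSTEM DECOMMISSION BACKEND "{address}";',
        f'ALTER SYSTEM ADD BACKEND "{address}";',
    ]


# 9050 is the heartbeat_service_port from the configuration above;
# "be_host" is a placeholder for the real BE address.
for statement in backend_statements("be_host", 9050):
    print(statement)
```

The statements would be run through the MySQL client against the FE, with the storage_root_path change and the BE restart performed between them.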
- Appendix: commands used during troubleshooting
-- Connect to SR over the MySQL protocol:
mysql -h 127.0.0.1 -P9030 -uroot -p
-- Show the status of the BE nodes:
SHOW PROC '/backends';
-- Show the status of the FE nodes:
SHOW PROC '/frontends';
-- Restart the BE:
sh stop_be.sh
sh start_be.sh --daemon
-- View BE memory usage (8040 is the BE webserver_port):
http://10.6.xx.107:8040/memz
-- Check FE health
http://10.6.xx.107:8030/api/health
-- Check BE health
http://10.6.xx.107:8040/api/health
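The port and health checks above can be scripted. The following is a minimal sketch using only the Python standard library; the function names (`port_open`, `health_url`, `http_ok`) are illustrative, the host in the usage section is a placeholder, and treating HTTP 200 as "healthy" is an assumption rather than documented behaviour.

```python
import http.client
import socket
import urllib.request


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_url(host: str, port: int) -> str:
    """Build the /api/health URL used in the commands above."""
    return f"http://{host}:{port}/api/health"


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    host = "127.0.0.1"  # placeholder: replace with the FE/BE address
    for role, port in (("FE", 8030), ("BE", 8040)):
        reachable = port_open(host, port)
        healthy = http_ok(health_url(host, port)) if reachable else False
        print(f"{role} port {port}: reachable={reachable} healthy={healthy}")
```

Separating the TCP check from the HTTP check distinguishes "process not listening at all" (as with port 8040 in this incident) from "listening but unhealthy".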
- Recommendation: use SR version 2.5 or later