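A normally running oceanbase container failed to come back up after being restarted, and rebooting the server did not help either. Startup reported an "obshell failed" error. This post records how the problem was handled.
I. Symptoms
1. A normally running oceanbase container will not start after a restart.
2. Running docker logs oceanbase to check the logs shows the following errors.
The core errors are these two lines: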
[ERROR] 127.0.0.1 obshell failed
[ERROR] oceanbase-ce start failed
The output also suggests running "obd display-trace 3d1c71c4-f80a-11ee-947f-0242ac110002" to view the detailed obd logs.
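II. Analysis
1. Locating the problem
At this point the container cannot start, so it is not possible to enter it and run the obd display-trace command. Fortunately, the data directory is mounted from the host at /app/dockerdata/oceanbase/obd, so the corresponding log file can be read directly on the host.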
[root@localhost ~]# cat /app/dockerdata/oceanbase/obd/log/obd
....
[2024-04-11 13:48:56.356] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] -- exited code 2, error output:
[2024-04-11 13:48:56.356] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] ls: cannot access '/proc/118': No such file or directory
[2024-04-11 13:48:56.356] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG]
[2024-04-11 13:48:56.356] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] -- root@127.0.0.1 set env OB_ROOT_PASSWORD to ''
[2024-04-11 13:48:56.356] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] -- start obshell: cd /root/ob; /root/ob/bin/obshell admin start --ip 127.0.0.1 --port 2886
[2024-04-11 13:48:56.356] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] -- local execute: cd /root/ob; /root/ob/bin/obshell admin start --ip 127.0.0.1 --port 2886
[2024-04-11 13:48:57.414] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] -- exited code 29, error output:
[2024-04-11 13:48:57.415] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] open /root/ob/run/daemon.pid: file exists
[2024-04-11 13:48:57.415] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG]
[2024-04-11 13:48:57.415] [3d1c71c4-f80a-11ee-947f-0242ac110002] [ERROR] 127.0.0.1 obshell failed
[2024-04-11 13:48:57.416] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - sub start ref count to 0
[2024-04-11 13:48:57.416] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - export start
[2024-04-11 13:48:57.416] [3d1c71c4-f80a-11ee-947f-0242ac110002] [ERROR] oceanbase-ce start failed
[2024-04-11 13:48:57.420] [3d1c71c4-f80a-11ee-947f-0242ac110002] [INFO] See https://www.oceanbase.com/product/ob-deployer/error-codes .
[2024-04-11 13:48:57.420] [3d1c71c4-f80a-11ee-947f-0242ac110002] [INFO] Trace ID: 3d1c71c4-f80a-11ee-947f-0242ac110002
[2024-04-11 13:48:57.420] [3d1c71c4-f80a-11ee-947f-0242ac110002] [INFO] If you want to view detailed obd logs, please run: obd display-trace 3d1c71c4-f80a-11ee-947f-0242ac110002
[2024-04-11 13:48:57.421] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - share lock /root/.obd/lock/mirror_and_repo release, count 1
[2024-04-11 13:48:57.421] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - share lock /root/.obd/lock/mirror_and_repo release, count 0
[2024-04-11 13:48:57.421] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - unlock /root/.obd/lock/mirror_and_repo
[2024-04-11 13:48:57.421] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - exclusive lock /root/.obd/lock/deploy_obcluster release, count 0
[2024-04-11 13:48:57.421] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - unlock /root/.obd/lock/deploy_obcluster
[2024-04-11 13:48:57.421] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - share lock /root/.obd/lock/global release, count 0
[2024-04-11 13:48:57.421] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - unlock /root/.obd/lock/global
The key error messages are:
[2024-04-11 13:48:57.415] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] open /root/ob/run/daemon.pid: file exists
[2024-04-11 13:48:57.415] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG]
[2024-04-11 13:48:57.415] [3d1c71c4-f80a-11ee-947f-0242ac110002] [ERROR] 127.0.0.1 obshell failed
[2024-04-11 13:48:57.416] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - sub start ref count to 0
[2024-04-11 13:48:57.416] [3d1c71c4-f80a-11ee-947f-0242ac110002] [DEBUG] - export start
[2024-04-11 13:48:57.416] [3d1c71c4-f80a-11ee-947f-0242ac110002] [ERROR] oceanbase-ce start failed
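In other words, when the container starts ob, obshell finds that /root/ob/run/daemon.pid already exists, assumes the program is still running, and exits (exit code 29). obshell therefore fails to start, which in turn causes oceanbase-ce to fail to start.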
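The filtering done by eye above can be sketched as a small host-side helper. This is a minimal sketch, not an obd feature: trace_errors is a hypothetical function name, and it assumes the log format shown above, where every line carries the trace ID in square brackets and failed commands are followed by an "error output:" line.

```shell
#!/usr/bin/env bash
# Sketch: print the error-related lines of one obd trace from a host-side log.
# trace_errors is a hypothetical helper name, not part of obd.
trace_errors() {
    local log_file="$1" trace_id="$2"
    # 1) keep only the lines that belong to this trace (fixed-string match);
    # 2) keep lines tagged [ERROR], plus each "error output:" line and the
    #    line right after it, which holds the failing command's message.
    grep -F "[$trace_id]" "$log_file" | grep -A1 -E '\[ERROR\]|error output:'
}
```

With the paths from this post it would be called as: trace_errors /app/dockerdata/oceanbase/obd/log/obd 3d1c71c4-f80a-11ee-947f-0242ac110002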
III. Solution
/root/ob/run/daemon.pid inside the container corresponds to /app/dockerdata/oceanbase/ob/run/daemon.pid on the host. Check the contents of the file:
[root@localhost ~]# cat /app/dockerdata/oceanbase/ob/run/daemon.pid
98
The value is the PID of the daemon process from the container's previous run. Delete the file and restart the container:
[root@localhost ~]# rm /app/dockerdata/oceanbase/ob/run/daemon.pid
rm: remove regular file '/app/dockerdata/oceanbase/ob/run/daemon.pid'? y
[root@localhost ~]# docker restart oceanbase
oceanbase
[root@localhost ~]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e2f1998af148 oceanbase/oceanbase-ce "/bin/sh -c _boot" 38 minutes ago Up 6 seconds 0.0.0.0:3306->2881/tcp oceanbase
The container is back to normal. Try logging in:
[root@localhost ~]# mysql -h127.0.0.1 -uroot -p -P3306
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3221487687
Server version: 5.7.25 OceanBase_CE 4.3.0.1 (r100000242024032211-0193a343bc60b4699ec47792c3fc4ce166a182f9) (Built Mar 22 2024 13:19:48)
Copyright (c) 2000, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| LBACSYS |
| mysql |
| oceanbase |
| ocs |
| ORAAUDITOR |
| SYS |
| test |
+--------------------+
8 rows in set (0.02 sec)
mysql> exit
Bye
[root@localhost ~]#
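The service has recovered.
Further investigation shows this is a bug in the oceanbase container (seen here with OceanBase_CE 4.3.0.1): after docker restart oceanbase (where oceanbase is the container name) the container always fails to start, and the pid file has to be deleted before it will start normally again. :-(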
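Until the bug is fixed, the manual steps above could be wrapped in a small script so that the pid file is cleaned up as part of every restart. The following is a minimal sketch under the assumptions of this post: the container is named oceanbase and /root/ob is mounted from /app/dockerdata/oceanbase/ob on the host. Both function names are hypothetical. The container is stopped first, so that whatever pid file is left is known to be stale.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: drop the leftover pid file before starting the
# container again. Function names are hypothetical; the default container
# name and host path are the ones used in this post.

# Remove a leftover pid file and report the pid it recorded. No-op if absent.
remove_stale_pidfile() {
    local pid_file="$1"
    if [ -f "$pid_file" ]; then
        echo "removing stale pid file $pid_file (pid $(cat "$pid_file"))"
        rm -f -- "$pid_file"
    fi
}

# Stop the container, clean up the pid file on the host, then start it again.
restart_oceanbase_clean() {
    local container="${1:-oceanbase}"
    local pid_file="${2:-/app/dockerdata/oceanbase/ob/run/daemon.pid}"
    docker stop "$container" || return 1
    remove_stale_pidfile "$pid_file"
    docker start "$container"
}
```

After sourcing the script, running restart_oceanbase_clean with no arguments would take the place of docker restart oceanbase. If the data directory is mounted somewhere else, pass the container name and the host-side pid file path as the two arguments.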