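I. Background
A BI cluster with more than 60 nodes and over 2 PB of data. All machines have been in service for more than three years.
II. Symptoms
Hive job submissions failed frequently and succeeded only some of the time. Failures were more likely in the morning and successes more likely in the afternoon.
Exception logs:
Log 1: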
2021-09-30 08:28:35.451 [AMRM Callback Handler Thread] INFO com.aaa.lever.master.RMCallbackHandler.onContainersCompleted(RMCallbackHandler.java:77) --> got container status for containerID=container_e155_1632330508050_62782_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch.
Container id: container_e155_1632330508050_62782_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
2021-09-30 08:28:35.602 [main] INFO com.aaa.lever.master.LeverMasterManipulator.finish(LeverMasterManipulator.java:185) --> Application completed. Stopping running containers
2021-09-30 08:28:35.614 [main] INFO com.aaa.lever.master.LeverMasterManipulator.finish(LeverMasterManipulator.java:189) --> Application completed. Signalling finish to RM
2021-09-30 08:28:35.722 [main] INFO com.aaa.lever.master.LeverMaster.main(LeverMaster.java:58) --> Application Master failed:Exception from container-launch.
Container id: container_e155_1632330508050_62782_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
2021-09-30 08:28:35.723 [main] INFO com.aaa.lever.master.LeverMaster.main(LeverMaster.java:59) --> exiting now
Log 2:
Exception in thread "main" java.lang.RuntimeException: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
at com.aaa.lever.task.SyncTask.call(SyncTask.java:58)
at com.aaa.lever.action.SqlActionMain.executeSql(SqlActionMain.java:119)
at com.aaa.lever.action.SqlActionMain.main(SqlActionMain.java:86)
Caused by: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
at org.apache.hive.jdbc.HiveStatement.executeUpdate(HiveStatement.java:406)
at org.apache.hive.jdbc.HivePreparedStatement.executeUpdate(HivePreparedStatement.java:119)
at com.aaa.lever.task.SqlExecutorTask.doTask(SqlExecutorTask.java:110)
at com.aaa.lever.task.SyncTask.call(SyncTask.java:45)
... 2 more
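III. Investigation
1. Our first suspicion was that, as in the earlier Hadoop cluster incident, a single faulty node was responsible. However, after that node was repaired, the Hive problem persisted.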
Earlier incident: qingchen, "hadoop集群异常问题排查记录3" (Hadoop cluster anomaly troubleshooting notes, part 3)
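2. Starting from Log 1, we looked into the various causes of exit code 1.
Log 1 only reports that the container exited with a non-zero exit code and gives no specific cause, so a more detailed log was needed to find the real reason.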
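3. Starting from Log 2, most search results concern the exception
"Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask",
which is usually attributed to insufficient YARN cluster resources or to permission problems.
The exception in our case, however, is
"Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask".
One is MapRedTask; the other is MapredLocalTask.
MapredLocalTask relates to local tasks. To improve efficiency, Hive automatically converts a common join into a map join, which loads the small table into memory in a local task instead of shuffling it across the cluster. If the machine running that local task does not have enough memory, the task fails.
From the early hours until late morning the cluster runs many jobs and memory usage is high, with some machines at full memory load. A local task that lands on one of these heavily loaded machines does not get enough memory, so failures are likely.
In the afternoon the cluster runs fewer jobs and is relatively idle, so load is low and most jobs succeed.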
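Because the two exceptions differ only in the task class name and are easy to confuse when searching, it can help to extract that class name from the error text before deciding where to look. The following is a minimal sketch, not part of the original investigation: the function name `classify_hive_failure` and the hint wording are hypothetical, and the hints simply restate the distinction drawn above.

```python
import re

# Matches the tail of a Hive failure line such as
# "FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask"
_FAILED_RE = re.compile(r"return code (\d+) from ([\w.$]+)")

# Hypothetical hints that restate the distinction described in the text.
_HINTS = {
    "MapRedTask": "cluster-side job failed; commonly attributed to YARN resources or permissions",
    "MapredLocalTask": "local task failed; check memory on the machine running the local map-join step",
}


def classify_hive_failure(message):
    """Return (return_code, task_class, hint), or None if no failure line is found."""
    match = _FAILED_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    # Keep only the simple class name, ignoring a trailing sentence period if present.
    task = match.group(2).rstrip(".").rsplit(".", 1)[-1]
    return code, task, _HINTS.get(task, "unrecognised task class")


log2 = ("java.sql.SQLException: Error while processing statement: FAILED: "
        "Execution Error, return code 1 from "
        "org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask")
print(classify_hive_failure(log2))
```

Applied to the first line of Log 2, this reports `MapredLocalTask` rather than `MapRedTask`, which is the detail that redirected the investigation from cluster resources to local memory.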
References:
"org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask", choulanlan, 51CTO blog
"Hive job error: return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask"
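IV. Solution
Set both hive.auto.convert.join and hive.auto.convert.join.noconditionaltask to false.
With these settings Hive no longer converts joins to map joins automatically, so no join work runs as a local task on a single node. The cost is some loss of performance.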
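The post does not say whether the two properties were changed cluster-wide or per session. As one possible illustration, the sketch below assumes they are applied per session, by issuing `SET` statements on the same Hive connection before the query. The helper name `with_map_join_disabled` and the sample query are hypothetical; only the two property names come from the text above.

```python
# The two properties named in the solution, with the values it prescribes.
MAP_JOIN_OFF = {
    "hive.auto.convert.join": "false",
    "hive.auto.convert.join.noconditionaltask": "false",
}


def with_map_join_disabled(query):
    """Return the statements to execute, in order, on a single Hive session.

    Each statement is meant to be sent on its own (for example through a
    JDBC or DB-API cursor), so trailing semicolons are stripped.
    """
    settings = [f"SET {key}={value}" for key, value in MAP_JOIN_OFF.items()]
    return settings + [query.strip().rstrip(";")]


# Hypothetical query, only to show the ordering of the statements.
for statement in with_map_join_disabled("SELECT a.id FROM a JOIN b ON a.id = b.id;"):
    print(statement)
```

Scoping the change to the affected jobs in this way would keep automatic map joins available to other workloads that are not hitting the memory limit. Whether that trade-off suits a given cluster is a judgment call the original post does not address.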
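V. Conclusions
1. Logs matter: make sure you are looking at the right exception log.
2. Quantity changes quality. An operation that is an optimization at small data volumes may well cause problems at large volumes, so performance tuning is never settled once and for all and has to be revisited case by case.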