Contents
- Yarn ResourceManager goes down for no apparent reason
- Restarting Yarn ResourceManager: error 1
- Restarting Yarn ResourceManager: error 2
- Resolved
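Yarn ResourceManager goes down for no apparent reason
A colleague reported that requests to the yarn RM port kept timing out. The logs, however, showed no clues at all, and the RM process was still alive. Most likely the RM service had hung: still running, but no longer responding.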
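A quick way to tell a hung RM from a stopped one is to send an HTTP request with a timeout. The sketch below is only an illustration. The host name `rm-host.example` is hypothetical, and port 8088 with the `/ws/v1/cluster/info` REST path are the stock defaults, assumed here because the original does not say which port was timing out.

```python
import socket
import urllib.error
import urllib.request


def probe_http(url: str, timeout: float = 5.0) -> str:
    """Classify one HTTP GET as 'http <code>', 'timeout', or 'unreachable'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx status.
        return f"http {exc.code}"
    except (socket.timeout, TimeoutError):
        # Connection accepted but no response arrived in time:
        # the typical signature of a process that is alive but hung.
        return "timeout"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        # For example, connection refused: nothing is listening.
        return "unreachable"
    except OSError:
        return "unreachable"
```

Calling `probe_http("http://rm-host.example:8088/ws/v1/cluster/info")` and getting `"timeout"` while the process still shows up in `ps` would be consistent with the hang described above.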
Restarting Yarn ResourceManager: error 1
The specific max attemts: 0 for application:24998 is invalid.because it is out of [1,2] ……
ApplicationNotFoundException: Application with id ‘application_1687657423545_009’ doesn’t exist in RM.Please check that the job submission was successful.
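After the restart, the RM soon went down again on its own, and several more restarts ended the same way. So I started digging for the real cause.
Eventually, the out log file turned up some useful information.
Seeing an Out of Memory Error, things finally started to make sense. Since memory was being exhausted, the fix was to increase the RM's JVM memory. It had been 2 GB; I raised it to 3 GB and restarted again.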
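When the log is large, scanning it for OutOfMemoryError lines is faster than reading it by eye. The following is a minimal sketch; the function name and the regular expression are illustrative assumptions, not part of any Hadoop tooling.

```python
import re

# Matches "java.lang.OutOfMemoryError" with an optional ": detail" suffix.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<detail>.*))?")


def find_oom_lines(lines):
    """Return (line_number, detail) for every line mentioning OutOfMemoryError."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = OOM_PATTERN.search(line)
        if match:
            detail = (match.group("detail") or "").strip()
            hits.append((number, detail))
    return hits
```

It accepts any iterable of lines, so an open file object can be passed directly, for example `with open(path) as f: find_oom_lines(f)`, where `path` points at the RM's out log.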
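Restarting Yarn ResourceManager: error 2
This restart failed differently from the earlier ones. Shortly after starting, the RM flooded the log with the same entries it had printed before, then stopped a little later.
The last thing it printed before exiting was the following error, so I looked up material on GC overhead limit exceeded:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Oracle's official documentation gives the cause of this error and how to resolve it: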
Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded
Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is running all the time and Java program is making very slow progress. After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so far the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. This exception is typically thrown because the amount of live data barely fits into the Java heap having little free space for new allocations.
Action: Increase the heap size. The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit.
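Cause:
The JVM is spending so much time on garbage collection that the program makes very slow progress. If the Java process spends more than roughly 98% of its time in garbage collection while recovering less than 2% of the heap, and this has held for the last 5 consecutive collections, the JVM throws this error. It means the Java heap has almost no free space left for new allocations.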
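The rule can be written out as a small check. This only illustrates the thresholds quoted above and is not how the JVM implements it; the function name and the per-collection sample format (fraction of time spent in GC, fraction of heap recovered) are assumptions made for the example.

```python
from typing import Sequence, Tuple


def gc_overhead_limit_hit(
    samples: Sequence[Tuple[float, float]],
    time_limit: float = 0.98,   # more than ~98% of time spent in GC
    free_limit: float = 0.02,   # less than 2% of the heap recovered
    window: int = 5,            # last 5 consecutive collections
) -> bool:
    """Return True if the last `window` samples all breach both thresholds."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return all(
        gc_time > time_limit and recovered < free_limit
        for gc_time, recovered in recent
    )
```

A single collection that frees a meaningful share of the heap breaks the streak, which is why the error usually appears only when live data nearly fills the heap.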
Adjustment: Evidently 3 GB of JVM memory was still not enough for the RM, so I raised it again to 4 GB.
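Resolved
With the RM's JVM memory raised to 4 GB, I restarted the RM once more. The service printed another batch of the same error and warning logs and then finally settled down, and it did not go down again. I logged in to the application scheduler page, manually killed some stale applications, and submitted jobs again; they were processed normally.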
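The stale applications here were killed by hand from the web page. With many of them, the same cleanup could be scripted around the `yarn application -kill` command. The sketch below is an assumption-laden alternative, not what was done in this incident: the helper names are made up, the application id in the usage note is a placeholder, and it defaults to a dry run that only returns the commands.

```python
import subprocess
from typing import Iterable, List


def build_kill_commands(app_ids: Iterable[str]) -> List[List[str]]:
    """Build one `yarn application -kill <id>` argv list per application id."""
    return [["yarn", "application", "-kill", app_id] for app_id in app_ids]


def kill_applications(app_ids: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    """Return the kill commands; execute them only when dry_run is False."""
    commands = build_kill_commands(app_ids)
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the yarn CLI fails.
            subprocess.run(cmd, check=True)
    return commands
```

For example, `kill_applications(["application_0000000000000_0001"])` only returns the command for review; passing `dry_run=False` runs it on a host where the `yarn` CLI is on the PATH.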