Symptoms
Production was throwing large numbers of OutOfMemoryError: unable to create new native thread errors:
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) [na:1.8.0_291]
at java.lang.Thread.start(Thread.java:717) [na:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) [na:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) [na:1.8.0_291]
at com.alibaba.nacos.shaded.io.grpc.internal.DnsNameResolver.resolve(DnsNameResolver.java:349) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.DnsNameResolver.refresh(DnsNameResolver.java:197) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.ManagedChannelImpl.refreshNameResolution(ManagedChannelImpl.java:456) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.ManagedChannelImpl.refreshAndResetNameResolution(ManagedChannelImpl.java:450) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.ManagedChannelImpl.handleInternalSubchannelState(ManagedChannelImpl.java:896) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.ManagedChannelImpl.access$5000(ManagedChannelImpl.java:106) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.ManagedChannelImpl$SubchannelImpl$1ManagedInternalSubchannelCallback.onStateChange(ManagedChannelImpl.java:1465) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.InternalSubchannel.gotoState(InternalSubchannel.java:326) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.InternalSubchannel.gotoNonErrorState(InternalSubchannel.java:316) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.InternalSubchannel.access$300(InternalSubchannel.java:65) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.InternalSubchannel$TransportListener$2.run(InternalSubchannel.java:544) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.internal.InternalSubchannel$TransportListener.transportShutdown(InternalSubchannel.java:535) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.grpc.netty.ClientTransportLifecycleManager.notifyShutdown(ClientTransportLifecycleManager.java:53) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler.onConnectionError(NettyClientHandler.java:485) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.onError(Http2ConnectionHandler.java:641) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:380) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[nacos-client-2.0.4.jar!/:na]
at com.alibaba.nacos.shaded.io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[nacos-client-2.0.4.jar!/:na]
... 1 common frames omitted
Investigation
The container's request and limits are both 4 GB. Checking the process memory with arthas showed that memory was plentiful:
[arthas@1]$ memory
Memory                  used   total  max    usage
heap                    705M   3399M  3641M  19.37%
ps_eden_space           458M   660M   1333M  34.35%
ps_survivor_space       8M     8M     8M     95.77%
ps_old_gen              238M   2731M  2731M  8.75%
nonheap                 270M   285M   -1     94.65%
code_cache              101M   102M   240M   42.15%
metaspace               151M   164M   -1     92.47%
compressed_class_space  17M    19M    1024M  1.70%
direct                  277K   277K   -      100.00%
mapped
Check the thread count:
[arthas@1]$ thread
Threads Total: 288, NEW: 0, RUNNABLE: 57, BLOCKED: 0, WAITING: 82, TIMED_WAITING: 137, TERMINATED: 0, Internal threads: 12
Check pid_max, threads-max, and the other kernel parameters that affect thread creation:
[root@eis-module-system-pro-5b669cc4cf-dg4kh kernel]# pwd
/proc/sys/kernel
[root@eis-module-system-pro-5b669cc4cf-dg4kh kernel]# more pid_max
98304
[root@eis-module-system-pro-5b669cc4cf-dg4kh kernel]# more threads-max
12377471
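The same two values can be collected in one pass instead of paging through each file. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name read_kernel_limits and its root parameter are hypothetical and exist only to make the example easy to try against another directory.

```python
from pathlib import Path


def read_kernel_limits(root="/proc/sys/kernel"):
    """Return pid_max and threads-max as integers.

    A value is None when the file is missing or does not contain a number.
    `root` is a hypothetical parameter so the function can be pointed at
    a different directory.
    """
    limits = {}
    for name in ("pid_max", "threads-max"):
        try:
            limits[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


if __name__ == "__main__":
    for name, value in read_kernel_limits().items():
        print(f"{name}: {value}")
```

These numbers are only the ceiling. They say nothing about how many threads already exist, so healthy-looking limits do not rule out thread exhaustion.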
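Everything I checked inside the container looked fine, no matter what I tried. Then a colleague asked whether the thread count might have exceeded a limit on the physical machine. That made sense: containers use the host's resources, and we run more than 200 pods on a single physical machine, so the combined thread count of all containers could plausibly exceed the host's limit. The next time the alert fired, I asked the operations team to log in to the physical machine and investigate.
Total thread count:
Thread limit on the physical machine:
While running commands, the operations engineer intermittently hit fork: cannot allocate memory, which indirectly confirmed that the host really had run out of threads (free showed that memory was sufficient). After operations raised this parameter, the OOM errors stopped.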
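A little later, operations found that a single instance of one application was creating more than 60,000 threads. That application had 6 instances in total, and the other workloads on the physical machines hosting those 6 instances were all affected.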
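One way to find such a process on the host is to rank processes by thread count. The sketch below is not the command operations actually ran, which the write-up does not record. It assumes a Linux host and reads the Threads: field from each /proc/<pid>/status file; the function names and the proc_root parameter are hypothetical.

```python
import os


def parse_thread_count(status_text):
    """Return the integer on the 'Threads:' line of a /proc/<pid>/status file, or None."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    return None


def top_thread_consumers(proc_root="/proc", n=5):
    """Return up to n (thread_count, pid) pairs, largest thread count first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'sys'
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # the process exited between listing and reading
        count = parse_thread_count(text)
        if count is not None:
            results.append((count, int(entry)))
    results.sort(reverse=True)
    return results[:n]


if __name__ == "__main__":
    for count, pid in top_thread_consumers():
        print(f"pid={pid} threads={count}")
```

Run on the host rather than inside a container, this lists processes from every pod, because a container normally sees only the processes in its own PID namespace.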
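Summary
Unlike a virtual machine, a container has no operating system kernel of its own: all containers share the resources of the host's operating system. When the combined thread count of all containers exceeds the host's thread limit, creating a new thread fails.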
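The condition in the summary can be checked on the host before it turns into an outage. The sketch below rests on two assumptions. First, the fourth field of /proc/loadavg has the form runnable/total, where total counts every kernel scheduling entity (processes and threads) on the machine. Second, the binding limit is the smaller of pid_max and threads-max; the write-up does not say which parameter was actually raised. The function names are hypothetical, and the result is an approximation, not an exact accounting.

```python
def parse_total_tasks(loadavg_text):
    """Extract the total number of scheduling entities from /proc/loadavg content."""
    fields = loadavg_text.split()
    _runnable, total = fields[3].split("/")
    return int(total)


def headroom(total_tasks, pid_max, threads_max):
    """Return (remaining, usage_ratio) against the smaller of the two limits.

    Assumption: every thread consumes an ID from the PID space, so
    whichever of pid_max and threads-max is lower is reached first.
    """
    limit = min(pid_max, threads_max)
    return limit - total_tasks, total_tasks / limit


def read_int(path):
    """Read a file that contains a single integer."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        total = parse_total_tasks(f.read())
    remaining, ratio = headroom(
        total,
        read_int("/proc/sys/kernel/pid_max"),
        read_int("/proc/sys/kernel/threads-max"),
    )
    print(f"tasks={total} remaining={remaining} usage={ratio:.1%}")
```

Alerting on the usage ratio at the host level catches the case described here, where every container looks healthy on its own and only the host-wide total is near the limit.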