### 现象:
1. 单节点CPU持续高
2.写入骤降
3.线程池队列积压,但没有reject
4.使用方没有记录日志
### 排查
1.ES监控
只能看到相应的结果指标,无法反应出原因。
2.ES日志:大量日志打印相关异常(routate等调用栈)
core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:263)
at org.apache.logging.log4j.core.appender.FileManager.writeToDestination
3.查询CPU的使用,GET _nodes/hot_threads
35.3% (176.7ms out of 500ms) cpu usage by thread 'elasticsearch[xxxxx-es-hot2-13][write][T#10]'
10/10 snapshots sharing following 179 elements
app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:433)
app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:374)
app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$3657/0x0000000800d2f440.accept(Unknown Source)
app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
app//org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$14(IndexShard.java:2588)
app//org.elasticsearch.index.shard.IndexShard$$Lambda$3659/0x0000000800d2fc40.accept(Unknown Source)
app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
app//org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:273)
app//org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:240)
app//org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2563)
app//org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:996)
app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:370)
....
35.0% (174.7ms out of 500ms) cpu usage by thread 'elasticsearch[xxxxxx-es-hot2-13][write][T#5]'
5/10 snapshots sharing following 216 elements
app//org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeChunkedText(TextEncoderHelper.java:146)
app//org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:58)
app//org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:68)
app//org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
app//org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:220)
app//org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:58)
app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:177)
app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
app//org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
app//org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
app//org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
app//org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
“CPU高” 和写入、日志打印相关,无法获取更详细的信息,且由于瞬时抓取,也并不非常精准。
4.火焰图
大致确认和日志相关。
5. 根据以往经验,可能和单分片doc数量限制相关
6.继续搜索日志,确认是单分片超过限制
2023-08-21 02:31:10,215 elasticsearch[xxxx-es-hot2-13][write][T#1] ERROR Recovering from StringBuilderEncoder.encode('[2023-08-21T02:31:10,201][DEBUG][o.e.a.b.TransportShardBulkAction] [xxxxx-es-hot2-13][cp0001001_2023_08][0] failed to execute bulk item (index) index {[xxxxx001_2023_08][event_xxx][xxxxxxxxx], source[{"id":"9f61ef55-0334-4363-9bcf-xxxx","rowkey":"xxxxxxd83ce110","column01":"1007922682","datachangelasttime":1692584511322,"column19":"xxx","column20":"80,295",xxx.......}]}
2023-08-21T02:31:10.237858677Z java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519
### 处理
删除索引重建,并设计好分片