Live-Network / Production / Field Issue Record

2025/3/1 4:55:38

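For information-security reasons, anything involving company-confidential information is replaced with a generic placeholder ("a certain ...").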

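Contents

  • Symptom
  • Troubleshooting Process
    • Checking the Node Logs
    • Analyzing the Restart Cause
    • Discovering the Kafka Message Backlog
    • Analyzing the Dump
    • Tracing the Code
  • Conclusion
  • Lessons Learned and Coding Takeaways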

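Symptom

The microservices were upgraded in the early hours of the morning. After the upgrade, the maintenance team reported that a certain microservice was restarting frequently, roughly once every ten-plus minutes.

Troubleshooting Process

Checking the Node Logs

The first step, of course, is to look at the logs.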

[2024-07-15 20:09:49,340]-[5c603b4abb964f14]-[1g7m]-[plioservice-mateinfo-cse-tenant-thread-118]-[com.huawei.ies.plioservice.service.impl.AlarmWriteBackServiceImpl.invokeUpdateAlarmsInBatch(AlarmWriteBackServiceImpl.java:359)]-[WARN] updateAlarmsInBatch: The batch update request body = {"body":{"alarms":[{"condition":[{"field":"identifier","values":["1721074052000_3ed485f0a0a9e05053c19d2343bde3428f7b2a6ddacda60a2377a63d43ff8316"],"id":"1","operator":"="}],"expression":"1","updateFields":{"circuit_count":1,"customer_count":1}}]},"header":{"commonValues":{"currentTenant":"1g7m"},"tracker_id":"5c603b4abb964f14"}}
[2024-07-15 20:09:49,372]-[5c603b4abb964f14]-[1g7m]-[plioservice-mateinfo-cse-tenant-thread-118]-[com.huawei.ies.plioservice.service.impl.AlarmWriteBackServiceImpl.invokeUpdateAlarmsInBatch(AlarmWriteBackServiceImpl.java:369)]-[WARN] updateAlarmsInBatch: End call serviceName:[app.service.ict_alarm_interface.updateAlarmsInBatches] it costs [31]ms
[2024-07-15 20:09:53,508]-[5c603b4abb964f14]-[1g7m]-[plioservice-mateinfo-cse-tenant-thread-118]-[com.huawei.ies.plioservice.service.impl.AlarmWriteBackServiceImpl.invokeUpdateAlarmsInBatch(AlarmWriteBackServiceImpl.java:375)]-[WARN] updateAlarmsInBatch: results = ServiceMessage{id='null', comeFrom='null', header={"commonValues":{"currentUser":"","currentTenant":"1g7m"},"tracker_id":"5c603b4abb964f14"}, body={"result":{"failedNum":"0","successNum":"1"},"total":1,"flag":"true","errorMessage":"","errorCode":""}, error=null}
[2024-07-15 20:10:41,500]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1960641160]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 4121 millis SELECT 1 FROM dual
[2024-07-15 20:10:52,646]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[transport-vert.x-eventloop-thread-5,1,main] has been blocked for 3276 ms, time limit is 2000 ms
[2024-07-15 20:10:57,834]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[transport-vert.x-eventloop-thread-4,1,main] has been blocked for 4621 ms, time limit is 2000 ms
[2024-07-15 20:11:01,955]-[]-[1g7m]-[LoadBalancerStatsTimer]-[org.apache.servicecomb.registry.consumer.SimpleMicroserviceInstancePing.ping(SimpleMicroserviceInstancePing.java:50)]-[WARN] pin*****#*#*****e8b11efb6a00255ac100019 endpoint rest://172.18.0.79:28443?sslEnabled=true&urlPrefix=%2Fadc-service%2Fcse failed
[2024-07-15 20:11:01,955]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[registry-vert.x-eventloop-thread-1,1,main] has been blocked for 4120 ms, time limit is 2000 ms
[2024-07-15 20:11:06,534]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[registry-vert.x-eventloop-thread-1,1,main] has been blocked for 4576 ms, time limit is 2000 ms
[2024-07-15 20:11:39,726]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[transport-vert.x-eventloop-thread-3,1,main] has been blocked for 4594 ms, time limit is 2000 ms
[2024-07-15 20:11:43,927]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1815678973]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 4196 millis SELECT 1 FROM dual
[2024-07-15 20:11:43,930]-[]-[]-[registry-vert.x-eventloop-thread-1]-[org.apache.servicecomb.serviceregistry.client.http.RestClientUtil.lambda$null$3(RestClientUtil.java:145)]-[ERROR] PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat fail, endpoint is 10.2.16.130:30100, message: The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100
[2024-07-15 20:11:48,634]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1815678973]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 4696 millis SELECT 1 FROM dual
[2024-07-15 20:11:52,486]-[]-[]-[registry-vert.x-eventloop-thread-1]-[org.apache.servicecomb.serviceregistry.client.http.ServiceRegistryClientImpl.retry(ServiceRegistryClientImpl.java:128)]-[WARN] invoke service [10.2.16.130:30100] failed, retry address [10.2.16.130:30100].
[2024-07-15 20:11:52,493]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1960641160]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 9138 millis SELECT 1 FROM dual
[2024-07-15 20:11:52,494]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1815678973]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 3858 millis SELECT 1 FROM dual
[2024-07-15 20:11:52,495]-[]-[]-[registry-vert.x-eventloop-thread-1]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:205)]-[ERROR] The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100[N]io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100
io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100
[2024-07-15 20:12:40,597]-[]-[]-[Catalina-utility-1]-[com.huawei.dsp.boot.core.config.impl.BootConfigManager.startRefresherThread(BootConfigManager.java:272)]-[WARN] boot.core.config.refresher.period is unknown, value=null
[2024-07-15 20:12:42,869]-[]-[]-[Catalina-utility-1]-[com.netflix.config.sources.URLConfigurationSource.<init>(URLConfigurationSource.java:126)]-[WARN] No URLs will be polled as dynamic configuration sources.
[2024-07-15 20:12:43,361]-[]-[]-[Catalina-utility-1]-[org.apache.commons.logging.impl.SLF4JLog.warn(SLF4JLog.java:176)]-[WARN] Multiple PropertySourcesPlaceholderConfigurer beans register*****#*#*****laceholderConfigurer, org.apache.servicecomb.core.ConfigurationSpringInitializer#0], falling back to Environment
[2024-07-15 20:12:44,078]-[]-[]-[Catalina-utility-1]-[com.huawei.mateinfo.sdk.starter.common.tracing.spi.impl.ConfigCenterBasedTracingConfig.getTracingCollectorAddress(ConfigCenterBasedTracingConfig.java:38)]-[WARN] get tracing collector address http://10.2.17.56:9411 through http
[2024-07-15 20:12:45,834]-[]-[]-[Catalina-utility-1]-[com.huawei.ies.plioservice.listener.AlarmSimulationListener.initAlarmConsumer(AlarmSimulationListener.java:59)]-[WARN] init kafka simulator alarm and affected service topic consumer
[2024-07-15 20:12:45,835]-[]-[]-[Catalina-utility-1]-[com.huawei.ies.plioservice.listener.AlarmSimulationListener.initAlarmConsumer(AlarmSimulationListener.java:65)]-[WARN] kafka simulator alarm consumer is working

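Key log entries such as "init kafka simulator alarm and affected service topic consumer" and "boot.core.config.refresher.period" show that the service was restarting. The cause was that the heartbeat check had failed.
In addition, the tomcat-catalina.log file can be used to determine when the service restarted.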

[2024-07-16 04:21:14] WARNING [main] org.apache.tomcat.util.net.SSLHostConfig.setProtocols The protocol [TLSv1.3] was added to the list of protocols on the SSLHostConfig named [_default_]. Check if a +/- prefix is missing.
[2024-07-16 04:21:14] WARNING [main] org.apache.tomcat.util.net.SSLHostConfig.setProtocols The protocol [TLSv1.3] was added to the list of protocols on the SSLHostConfig named [_default_]. Check if a +/- prefix is missing.
[2024-07-16 04:21:14] WARNING [main] org.apache.tomcat.util.digester.SetPropertiesRule.begin Match [Server/Service/Engine/Host] failed to set property [hostConfigClass] to [org.apache.catalina.core.startup.SortedHostConfig]
[2024-07-16 04:21:14] INFO [main] org.apache.catalina.core.AprLifecycleListener.lifecycleEvent The Apache Tomcat Native library which allows using OpenSSL was not found on the java.library.path: [/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons::/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib]
[2024-07-16 04:21:16] INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["https-jsse-nio-172.18.15.87-18443"]
[2024-07-16 04:21:16] INFO [main] org.apache.tomcat.util.net.AbstractEndpoint.logCertificate Connector [https-jsse-nio-172.18.15.87-18443], TLS virtual host [_default_], certificate type [UNDEFINED] configured from [/opt/mateinfo/conf/security/mateinfo.keystore] using alias [tomcat] and with trust store [/opt/mateinfo/conf/security/mateinfo.keystore]
[2024-07-16 04:21:17] INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["https-jsse-nio-172.18.15.87-28443"]
[2024-07-16 04:21:17] INFO [main] org.apache.tomcat.util.net.AbstractEndpoint.logCertificate Connector [https-jsse-nio-172.18.15.87-28443], TLS virtual host [_default_], certificate type [UNDEFINED] configured from [/opt/mateinfo/conf/security/mateinfo.keystore] using alias [tomcat] and with trust store [/opt/mateinfo/conf/security/mateinfo.keystore]
[2024-07-16 04:21:17] INFO [main] org.apache.catalina.startup.Catalina.load Server initialization in [3635] milliseconds
[2024-07-16 04:21:17] INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service [Catalina]
[2024-07-16 04:21:17] INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet engine: [Platform app/2.8]
[2024-07-16 04:21:17] WARNING [Catalina-utility-1] org.apache.tomcat.util.digester.SetPropertiesRule.begin Match [Context/Manager] failed to set property [sessionIdLength] to [24]
[2024-07-16 04:21:34] INFO [Catalina-utility-1] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
[2024-07-16 04:21:34] INFO [Catalina-utility-1] org.apache.catalina.core.ApplicationContext.log 2 Spring WebApplicationInitializers detected on classpath
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/mateinfo/app/webapps/plioservice/WEB-INF/lib/mateinfo-sdk-base-common-23.5.13.B209.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/mateinfo/app/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [com.huawei.mateinfo.sdk.common.log.logger.Log4jLoggerFactory]
### Excluding compile: org.apache.logging.log4j.core.config.AppenderControl::getAppender
             __  __       _       _        __
            |  \/  | __ _| |_ ___(_)_ __  / _| ___
    __      | |\/| |/ _` | __/ _ \ | '_ \| |_ / _ \
 _/    \    | |  | | (_| | ||  __/ | | | |  _| (_) |
(______/__) |_|  |_|\__,_|\__\___|_|_| |_|_|  \___/

 :: Spring Boot ::                  (v2.6.7)
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
[2024-07-16 04:21:43] INFO [Catalina-utility-1] org.apache.catalina.core.ApplicationContext.log Initializing Spring embedded WebApplicationContext
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
[2024-07-16 04:22:02] SEVERE [Catalina-utility-1] org.apache.tomcat.util.descriptor.web.SecurityConstraint.findUncoveredHttpMethods For security constraints with URL pattern [/*] only the HTTP methods [TRACE HEAD OPTIONS] are covered. All other methods are uncovered.
[2024-07-16 04:22:02] INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["https-jsse-nio-172.18.15.87-18443"]
[2024-07-16 04:22:02] INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["https-jsse-nio-172.18.15.87-28443"]
[2024-07-16 04:22:02] INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [45079] milliseconds
[2024-07-16 04:22:10] INFO [https-jsse-nio-172.18.15.87-18443-exec-4] org.apache.catalina.core.ApplicationContext.log Initializing Spring DispatcherServlet 'dispatcherServlet'
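
To see how often the service restarts, the startup-complete lines can be pulled out of tomcat-catalina.log. The following is a minimal sketch, not a tool used in the original investigation; the class name RestartTimes and the method findStartups are made up for illustration, and the regular expression assumes the exact line format shown in the excerpt above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: lists when Tomcat finished starting, based on catalina log lines. */
final class RestartTimes {
    // Matches lines such as:
    // [2024-07-16 04:22:02] INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [45079] milliseconds
    private static final Pattern STARTUP = Pattern.compile(
        "^\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\].*Catalina\\.start Server startup in \\[(\\d+)\\] milliseconds");

    /** Returns one "timestamp (startup N ms)" entry per completed startup found in the given lines. */
    static List<String> findStartups(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STARTUP.matcher(line);
            if (m.find()) {
                result.add(m.group(1) + " (startup " + m.group(2) + " ms)");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "[2024-07-16 04:21:17] INFO [main] org.apache.catalina.startup.Catalina.load Server initialization in [3635] milliseconds",
            "[2024-07-16 04:22:02] INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [45079] milliseconds");
        // Only the second line is a completed startup.
        findStartups(sample).forEach(System.out::println);
    }
}
```

Comparing consecutive timestamps in the returned list gives the interval between restarts.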

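Analyzing the Restart Cause

From the logs between two consecutive restarts, we determined that the service was pulling a certain service's business-analysis results from Kafka and writing them to the sharded database. The data flow is roughly as shown in the figure below:
[Figure: data-flow diagram, image not available]

Discovering the Kafka Message Backlog

Based on experience, we first checked how much data was being consumed from Kafka to see whether messages were backing up.
(The configured upper limit per poll is currently 500 records.)
A global search of the logs showed that after the upgrade every poll returned exactly 500 records, so we suspected that a large backlog had built up in Kafka. A later check of the Kafka management console confirmed this.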

[2024-07-15 20:25:47,024]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.handleMessage(AlarmListener.java:106)]-[WARN] consume records size is 500  
[2024-07-15 20:25:47,048]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.siaAssetNameFilter(AlarmListener.java:128)]-[WARN] key=1g7m_sia_asset_ip_az,topic=topic-1g7m-sia-service-outer,value={"service_catalog":"1","old_service_status":2,"start_time":1721070029496,"asset_name":"sia_asset_ip_az","service_name":"default","service_id":"MCQLL0268","service_status":"0","specialLineMessages":[{"route_id":"MCQLL0268","ids":[],"status":0}],"alarm_nodes*****#*#*****  
[2024-07-15 20:25:47,054]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.siaAssetNameFilter(AlarmListener.java:128)]-[WARN] key=1g7m_sia_asset_ip_az,topic=topic-1g7m-sia-service-outer,value={"service_catalog":"1","old_service_status":2,"start_time":1721070029496,"asset_name":"sia_asset_ip_az","service_name":"default","service_id":"MCQLL0305","service_status":"0","specialLineMessages":[{"route_id":"MCQLL0305","ids":[],"status":0}],"alarm_nodes*****#*#*****  
[2024-07-15 20:25:47,054]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.siaAssetNameFilter(AlarmListener.java:128)]-[WARN] key=1g7m_sia_asset_ip_az,topic=topic-1g7m-sia-service-outer,value={"service_catalog":"1","old_service_status":2,"start_time":1721070029496,"asset_name":"sia_asset_ip_az","service_name":"default","service_id":"MCSLL0230","service_status":"0","specialLineMessages":[{"route_id":"MCSLL0230","ids":[],"status":0}],"alarm_nodes*****#*#*****
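
The heuristic used here (every poll comes back at the configured maximum, so a backlog is likely) can be checked mechanically. The sketch below is illustrative only: the class name PollSaturation and the method saturatedRatio are hypothetical, and it relies on the "consume records size is N" message format seen in the log above. A high ratio is only a hint; the backlog itself still has to be confirmed on the Kafka side, as was done in this case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: estimates how often a consumer poll hit the configured batch limit. */
final class PollSaturation {
    private static final Pattern SIZE = Pattern.compile("consume records size is (\\d+)");

    /** Fraction of polls whose batch size reached maxPollRecords; 0.0 if no poll lines are found. */
    static double saturatedRatio(List<String> lines, int maxPollRecords) {
        int polls = 0;
        int full = 0;
        for (String line : lines) {
            Matcher m = SIZE.matcher(line);
            if (m.find()) {
                polls++;
                if (Integer.parseInt(m.group(1)) >= maxPollRecords) {
                    full++;
                }
            }
        }
        return polls == 0 ? 0.0 : (double) full / polls;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "[WARN] consume records size is 500",
            "[WARN] consume records size is 500",
            "[WARN] consume records size is 120");
        // Two of the three polls reached the limit of 500.
        System.out.println(saturatedRatio(sample, 500));
    }
}
```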

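Analyzing the Dump

Using Arthas's heapdump command, we took a memory snapshot of the running service. Heap usage was not high, about 700 MB against a maximum heap of 2 GB. We guessed that this was because the service had only just restarted when the dump was taken.

[Screenshot omitted]
Continuing through the logs, we found a large number of entries like the following:

Asset is not supported.||AssetName = MBB_SIA_Asset_Wireless

Tracing this log entry showed that the data pulled from Kafka included data we do not support analyzing: MBB_SIA_Asset_Wireless belongs to another product's service. Our initial suspicion was therefore that the problem was caused by a large volume of unsupported data. Test scenarios generally struggle to cover this kind of situation.

We also found many extremely long log entries, as shown in the figure below, and the log4j configuration had no truncation configured.

[Figure: example of an over-long log entry, image not available]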

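Tracing the Code

Tracing to the place where the code writes this log, we found that we print the entire value pulled from Kafka. Combined with the earlier logs, if all of the pulled data is of an unsupported type, our code just loops printing logs without doing any other processing. We therefore suspected that excessive asynchronous logging was occupying memory.

/**
 * Filters records by asset package name.
 *
 * @param records consumer records
 * @param siaAssetNameList list of SIA asset names
 * @return List<SiaResultModel> list of models
 */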
public static List<SiaResultModel> siaAssetNameFilter(ConsumerRecords<String, String> records,  
    List<String> siaAssetNameList) {  
    List<SiaResultModel> siaResultModelList = new ArrayList<>();  
    for (ConsumerRecord<String, String> record : records) {  
        String key = record.key();  
        String topic = record.topic();  
        String value = record.value();  
        LOGGER.warn("key={},topic={},value={}", key, topic, value);  
        if (StringUtils.isEmpty(value)) {  
            LOGGER.error("kafka message is empty");  
            continue;  
        }  
        SiaResultModel siaResultModel = JSON.parseObject(value, SiaResultModel.class);  
        if (StringUtils.hasText(siaResultModel.getAssetName()) && !siaAssetNameList.contains(  
            siaResultModel.getAssetName())) {  
            LOGGER.error("Asset is not supported.||AssetName = {}", siaResultModel.getAssetName());  
            continue;  
        }  
        if (!SiaResultModel.isValid(siaResultModel)) {  
            LOGGER.error("siaResultModel is invalid.||siaResultModel = {}", siaResultModel);  
            continue;  
        }  
        siaResultModelList.add(siaResultModel);  
    }  
    return siaResultModelList;  
}
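
One way to keep a full Kafka payload out of the log is to shorten it before it reaches the logger. This is a minimal sketch under stated assumptions, not the change the team actually made: the class LogTruncation, the method truncate, and the 512-character limit in the usage note are all hypothetical.

```java
/** Hypothetical helper: caps the length of a value before it is written to the log. */
final class LogTruncation {
    /**
     * Returns s unchanged when it is null or at most maxChars long; otherwise returns
     * its first maxChars characters followed by a marker that records the original length.
     * Note: this cuts by UTF-16 char, so a surrogate pair at the boundary may be split.
     */
    static String truncate(String s, int maxChars) {
        if (s == null || s.length() <= maxChars) {
            return s;
        }
        return s.substring(0, maxChars) + "...(truncated, " + s.length() + " chars)";
    }

    public static void main(String[] args) {
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            big.append('x');
        }
        // Prints 16 'x' characters followed by the truncation marker.
        System.out.println(truncate(big.toString(), 16));
    }
}
```

In siaAssetNameFilter, the first log call could then pass LogTruncation.truncate(value, 512) instead of value. Truncation set in the logging configuration would cover every call site, whereas a helper like this only protects the calls that use it.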

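Next we checked the GC logs and found many entries like the following. The previous GC emptied the eden space of the young generation at second 19 (03:51:19), and by second 21 (03:51:21) eden was 100% full again, which suggests that some code was continuously creating objects and filling the young generation.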

{Heap before GC invocations=0 (full 0):
 par new generation   total 824000K, used 775552K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K, 100% used [0x0000000089400000, 0x00000000b8960000, 0x00000000b8960000)
  from space 48448K,   0% used [0x00000000b8960000, 0x00000000b8960000, 0x00000000bb8b0000)
  to   space 48448K,   0% used [0x00000000bb8b0000, 0x00000000bb8b0000, 0x00000000be800000)
 concurrent mark-sweep generation total 561152K, used 0K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22029K, capacity 22618K, committed 22784K, reserved 1069056K
  class space    used 2512K, capacity 2709K, committed 2816K, reserved 1048576K
2024-07-16T03:51:19.489+0000: 6.418: [GC (Allocation Failure) 2024-07-16T03:51:19.489+0000: 6.418: [ParNew: 775552K->43478K(824000K), 0.0411209 secs] 775552K->43478K(1385152K), 0.0412038 secs] [Times: user=0.06 sys=0.02, real=0.04 secs] 
Heap after GC invocations=1 (full 0):
 par new generation   total 824000K, used 43478K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K,   0% used [0x0000000089400000, 0x0000000089400000, 0x00000000b8960000)
  from space 48448K,  89% used [0x00000000bb8b0000, 0x00000000be325bf0, 0x00000000be800000)
  to   space 48448K,   0% used [0x00000000b8960000, 0x00000000b8960000, 0x00000000bb8b0000)
 concurrent mark-sweep generation total 561152K, used 0K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22029K, capacity 22618K, committed 22784K, reserved 1069056K
  class space    used 2512K, capacity 2709K, committed 2816K, reserved 1048576K
}
{Heap before GC invocations=1 (full 0):
 par new generation   total 824000K, used 819030K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K, 100% used [0x0000000089400000, 0x00000000b8960000, 0x00000000b8960000)
  from space 48448K,  89% used [0x00000000bb8b0000, 0x00000000be325bf0, 0x00000000be800000)
  to   space 48448K,   0% used [0x00000000b8960000, 0x00000000b8960000, 0x00000000bb8b0000)
 concurrent mark-sweep generation total 561152K, used 0K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22405K, capacity 23046K, committed 23296K, reserved 1069056K
  class space    used 2543K, capacity 2743K, committed 2816K, reserved 1048576K
2024-07-16T03:51:21.011+0000: 7.941: [GC (Allocation Failure) 2024-07-16T03:51:21.012+0000: 7.941: [ParNew: 819030K->48447K(824000K), 0.1400724 secs] 819030K->85515K(1385152K), 0.1401505 secs] [Times: user=0.25 sys=0.05, real=0.14 secs] 
Heap after GC invocations=2 (full 0):
 par new generation   total 824000K, used 48447K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K,   0% used [0x0000000089400000, 0x0000000089400000, 0x00000000b8960000)
  from space 48448K,  99% used [0x00000000b8960000, 0x00000000bb8afff8, 0x00000000bb8b0000)
  to   space 48448K,   0% used [0x00000000bb8b0000, 0x00000000bb8b0000, 0x00000000be800000)
 concurrent mark-sweep generation total 561152K, used 37067K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22405K, capacity 23046K, committed 23296K, reserved 1069056K
  class space    used 2543K, capacity 2743K, committed 2816K, reserved 1048576K
}
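
The two ParNew events above are enough for a rough allocation-rate estimate: the young generation held 43478K after the first collection and 819030K before the second, about 1.5 seconds of JVM uptime later, which works out to roughly 0.5 GB allocated per second. The sketch below does the same arithmetic over any number of such lines. The class name YoungGcRate and its method are hypothetical, and the pattern assumes the ParNew log format shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: estimates the young-generation allocation rate from ParNew GC log lines. */
final class YoungGcRate {
    // Captures JVM uptime in seconds, then young-gen occupancy before and after the collection, e.g.
    // "6.418: [GC (Allocation Failure) ... [ParNew: 775552K->43478K(824000K), 0.0411209 secs]"
    private static final Pattern PAR_NEW = Pattern.compile(
        "(\\d+\\.\\d+): \\[GC \\(Allocation Failure\\).*?\\[ParNew: (\\d+)K->(\\d+)K\\(\\d+K\\)");

    /** Average allocation rate in KB per second between the first and last ParNew event; -1 if fewer than two. */
    static double allocationRateKBps(List<String> lines) {
        double firstUptime = -1;
        double lastUptime = -1;
        long prevAfter = -1;
        long allocatedKb = 0;
        for (String line : lines) {
            Matcher m = PAR_NEW.matcher(line);
            if (!m.find()) {
                continue;
            }
            double uptime = Double.parseDouble(m.group(1));
            long before = Long.parseLong(m.group(2));
            long after = Long.parseLong(m.group(3));
            if (prevAfter >= 0) {
                // KB allocated since the previous collection finished.
                allocatedKb += before - prevAfter;
                lastUptime = uptime;
            } else {
                firstUptime = uptime;
            }
            prevAfter = after;
        }
        if (lastUptime < 0) {
            return -1;
        }
        return allocatedKb / (lastUptime - firstUptime);
    }

    public static void main(String[] args) {
        // Shortened copies of the two GC lines from the log above.
        List<String> sample = Arrays.asList(
            "6.418: [GC (Allocation Failure) 6.418: [ParNew: 775552K->43478K(824000K), 0.0411209 secs]",
            "7.941: [GC (Allocation Failure) 7.941: [ParNew: 819030K->48447K(824000K), 0.1400724 secs]");
        System.out.printf("%.0f KB/s%n", allocationRateKBps(sample));
    }
}
```

The estimate ignores the GC pause time itself, so it slightly understates the real rate.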

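Because the service was still restarting repeatedly, we kept using Arthas to monitor the Java process and took another heap dump at a moment when memory usage was high but the service had not yet restarted.
[Screenshot omitted]
After clicking calculate on the right, the largest retained object was computed. It turned out to be a JSONArray that occupied almost the entire heap.
[Screenshot omitted]
Expanding it, we found its GC root.
[Screenshot omitted]
Tracing the code led to the getFmListAll method, where we saw that it queries the database in a loop.

Using Arthas to watch the arguments of getFmListAll, we found that total was a staggering 5271948!
[Screenshot omitted]
Then, using Arthas's stack command to monitor the call stack of getFmListAll, we found that it was called from getHistoryAlramByIdentifier, that is, from the query path.
[Screenshot omitted]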

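Conclusion

This completes the diagnosis. While querying, the service retrieved a huge amount of data, which filled the heap; the heartbeat check then failed and the service was restarted. Because our service continuously pulls messages from Kafka and queries the corresponding data, the problem is triggered whenever there are unconsumed messages in Kafka, which in turn causes the frequent restarts.

That leaves one last question: why did the query return so much data?

After aligning with the developers of the neighboring service, we learned how they store the data we query. If the data type is A, the ID alone is the unique primary key. If the data type is B, the data is stored in ES and the unique primary key becomes ID + occurrence time. A large round of stress testing had been run in this environment a few days earlier, leaving a large batch of records that share an ID but have different occurrence times. Our service queries type-B data by ID only, so it retrieved that entire batch of alarms.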
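
The key mismatch can be shown with a toy in-memory store. This is purely illustrative: the type TypeBRow, the field names, and both lookup methods are hypothetical and stand in for the real ES-backed storage, whose schema is not shown in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration: querying by only part of a composite key returns many rows. */
final class KeyMismatchDemo {
    /** A type-B row, unique only by the pair (id, occurTime). */
    static final class TypeBRow {
        final String id;
        final long occurTime;

        TypeBRow(String id, long occurTime) {
            this.id = id;
            this.occurTime = occurTime;
        }
    }

    /** Lookup by id alone: returns every row that shares the id. */
    static List<TypeBRow> findById(List<TypeBRow> store, String id) {
        List<TypeBRow> hits = new ArrayList<>();
        for (TypeBRow row : store) {
            if (row.id.equals(id)) {
                hits.add(row);
            }
        }
        return hits;
    }

    /** Lookup by the full key: returns at most one row when the key is unique. */
    static List<TypeBRow> findByIdAndTime(List<TypeBRow> store, String id, long occurTime) {
        List<TypeBRow> hits = new ArrayList<>();
        for (TypeBRow row : store) {
            if (row.id.equals(id) && row.occurTime == occurTime) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<TypeBRow> store = Arrays.asList(
            new TypeBRow("alarm-1", 1000L),
            new TypeBRow("alarm-1", 2000L),
            new TypeBRow("alarm-1", 3000L),
            new TypeBRow("alarm-2", 1000L));
        System.out.println(findById(store, "alarm-1").size());
        System.out.println(findByIdAndTime(store, "alarm-1", 2000L).size());
    }
}
```

With stress-test data, the first kind of lookup is the one that grows without bound.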

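Lessons Learned and Coding Takeaways

  1. Guard against external data when reading it: validate key data volumes and never read everything in one go. In the code above, the while loop should have a maximum iteration count, or the value of total should be validated.
  2. Configure length truncation for log output; otherwise logging puts heavy pressure on memory.
  3. Do not write a large variable directly into the log!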
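
The first lesson can be sketched as a paged read with two guards: reject an implausible total up front, and cap the number of loop iterations. The actual getFmListAll code is not shown in this article, so everything below is an assumption for illustration: the class BoundedPaging, the limits MAX_TOTAL and MAX_PAGES, and the fetchPage callback that stands in for the real database query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical sketch: a paged read that refuses unbounded totals and caps its loop. */
final class BoundedPaging {
    static final long MAX_TOTAL = 10_000; // illustrative ceiling on rows we agree to load
    static final int MAX_PAGES = 100;     // illustrative ceiling on loop iterations

    /**
     * Loads up to total rows page by page. fetchPage.apply(offset, pageSize) stands in
     * for the real query and returns the rows starting at offset.
     */
    static <T> List<T> fetchAll(long total, int pageSize,
                                BiFunction<Integer, Integer, List<T>> fetchPage) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        if (total > MAX_TOTAL) {
            // Fail fast instead of trying to hold millions of rows in the heap.
            throw new IllegalStateException("total " + total + " exceeds limit " + MAX_TOTAL);
        }
        List<T> all = new ArrayList<>();
        int pages = 0;
        while (all.size() < total && pages < MAX_PAGES) {
            List<T> page = fetchPage.apply(all.size(), pageSize);
            if (page.isEmpty()) {
                break; // the source returned fewer rows than announced
            }
            all.addAll(page);
            pages++;
        }
        return all;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(0, 1, 2, 3, 4);
        List<Integer> rows = BoundedPaging.<Integer>fetchAll(5, 2, (offset, size) ->
            source.subList(Math.min(offset, source.size()), Math.min(offset + size, source.size())));
        System.out.println(rows);
    }
}
```

Whether to throw, truncate the result, or skip the message when the limit is exceeded is a design choice that depends on the business meaning of the data; throwing is used here only because it makes the failure visible.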
