JMX Exporter: Source Code Walkthrough, Production Best Practices, and Fixing Metric Scrape Timeouts

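Table of Contents

Background
Version 1 config: query all MBeans
Version 2 config: add an allowlist
Version 3 config: enable the rule cache
Version 4 config: patch the jmx_exporter source to disable the default JVM export
Version 5 config: version 4 plus excludeObjectNameAttributes
Version 6 config: patch the jmx_exporter source and configure includeObjectNameAttributes
- Patching the source on the release-1.0.1 branch
- Updating the config
jmx-exporter source code walkthrough
Contributing the change to the jmx_exporter community
Production configuration recommendations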

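Background

In the post "JMX Quick Start and Usage + Monitoring with JMX Exporter + Integrating OpenTelemetry" I raised a problem that was still unsolved at the time:

The problem we currently face: even though includeObjectNames already lists very few MBeans, jmx_exporter occasionally takes tens of seconds, sometimes more than a hundred seconds, to scrape metrics, and we have not found the cause.

Our system scales out (adds machines) and scales in (removes machines) as follows:

prometheus periodically pulls data from jmx-exporter
grafana alert rules are configured on top of the data in prometheus
when an alert rule fires, a script calls the relevant AWS service APIs to bring machines online or offline and to power them on or off

As you can see, this mechanism depends heavily on the data reported by jmx-exporter. If jmx-exporter keeps timing out now and then and prometheus cannot collect the data, the downstream script logic is seriously affected. (PS: please don't ask why we built this ourselves instead of just using K8S. The answer is, well... reality is complicated.)

This problem had been troubling the DevOps team for a while and they could not solve it, partly because they lacked Java knowledge, so they asked me to take a look. What follows is a full record of how I worked through it.

First, not every server was affected, only servers running certain kinds of code. For example, a few servers run code that generates reports and runs scheduled report-generation jobs, and servers of this kind hit the problem very easily.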

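Version 1 config: query all MBeans

After spending some time understanding our overall monitoring architecture (described in the post "JMX Quick Start and Usage + Monitoring with JMX Exporter + Integrating OpenTelemetry"), I first checked the Tomcat logs. The exceptions related to the jmx-exporter agent were Broken pipe and Stream is closed, so I concluded that the connection between the jmx-exporter agent and the otel collector's prometheus-receiver was being dropped on timeout, which triggered these exceptions.

Why would it time out? There are two likely causes:

too many metrics are being collected
the timeout configured on the prometheus-receiver is too short

I looked at the jmx_exporter config file on the production servers and found that the DevOps config was very simple: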

rules:
- pattern: ".*"

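It queried every MBean. There was a further problem: our code uses Ehcache, and every Cache registers its own MBean with JMX, so with many caches the number of MBeans in JMX becomes very large. We do not care about those MBeans.

So my first hypothesis was that Ehcache registered too many MBeans, which made querying all metrics slow.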
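
A hypothesis like this can be checked by counting the registered MBeans per domain. The following is only a minimal sketch, not something from our codebase: the class name MBeanDomainCounter is made up for illustration, and it uses nothing but the standard javax.management API. To see Ehcache's MBeans it has to run inside the JVM being monitored (for example from a temporary debug endpoint); run standalone, it only lists the JVM's own domains.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper for illustration; not part of jmx_exporter.
final class MBeanDomainCounter {

    /** Counts the MBeans registered in the given server, grouped by ObjectName domain. */
    static Map<String, Integer> countByDomain(MBeanServer server) {
        Map<String, Integer> counts = new TreeMap<>();
        // queryNames(null, null) returns the names of all registered MBeans.
        for (ObjectName name : server.queryNames(null, null)) {
            counts.merge(name.getDomain(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        countByDomain(server)
                .forEach((domain, count) -> System.out.println(domain + " -> " + count));
    }
}
```

A domain with thousands of entries is a good candidate for being left out of includeObjectNames.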

Version 2 config: add an allowlist

Based on the analysis above, I decided to use jmx-exporter's includeObjectNames to list only the MBeans we care about:

includeObjectNames:
  - "org.apache.commons.pool2:type=GenericObjectPool,*"
  - "tomcat.jdbc:*"
  - "Catalina:type=Manager,*"

Besides these three entries, the JVM-related MBeans are exported by jmx-exporter by default, and this cannot be configured. See the source:

JvmMetrics.builder().register(PrometheusRegistry.defaultRegistry);

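With the second config deployed, the problem persisted.

Version 3 config: enable the rule cache

I kept reading the docs, the official issues, and Google results, and came across an article reporting that enabling the cache on rules reduced jmx-exporter's scrape time. I changed the config as follows: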

includeObjectNames:
  - "org.apache.commons.pool2:type=GenericObjectPool,*"
  - "tomcat.jdbc:*"
  - "Catalina:type=Manager,*"
rules:
  - pattern: 'org.apache.commons.pool2<type=GenericObjectPool, name=(\w+)><>(NumActive)'
    cache: true
  - pattern: 'tomcat.jdbc<name=\"\w+/\w+\", .*><>(NumActive)'
    cache: true
  - pattern: 'Catalina<type=Manager,.*><>(activeSessions)'
    cache: true

I also increased the scrape timeout of the prometheus-receiver:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'otel-collector'
        scrape_interval: 30s
        scrape_timeout: 29s
        static_configs:
        - targets: ['127.0.0.1:12345']

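With the third config deployed, the problem persisted.

Version 4 config: patch the jmx_exporter source to disable the default JVM export

Since it still timed out, I suspected that the JVM's own metrics were too numerous: under high JVM load, querying those MBeans might take longer. I decided to comment out JvmMetrics.builder().register(PrometheusRegistry.defaultRegistry); in the JavaAgent source, and changed the config as follows: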

includeObjectNames:
  - "org.apache.commons.pool2:type=GenericObjectPool,*"
  - "tomcat.jdbc:*"
  - "Catalina:type=Manager,*"
  - "java.lang:type=Memory"
  - "java.lang:type=Threading"
rules:
  - pattern: 'org.apache.commons.pool2<type=GenericObjectPool, name=(\w+)><>(NumActive)'
    cache: true
  - pattern: 'tomcat.jdbc<name=\"\w+/\w+\", .*><>(NumActive)'
    cache: true
  - pattern: 'Catalina<type=Manager,.*><>(activeSessions)'
    cache: true
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(used|max)'
    cache: true
    name: jvm_memory_$1_bytes
    type: GAUGE
    help: $1 (bytes) of a given JVM memory area
    labels: {"area":"heap"}
  - pattern: 'java.lang<type=Threading><>ThreadCount'
    cache: true
    type: GAUGE
    help: Current thread count of a JVM
    name: jvm_threads_current

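The name fields keep the metric names compatible with the JVM metrics that the original source exported.

With the fourth config deployed, the problem persisted.

Version 5 config: version 4 plus excludeObjectNameAttributes

At this point I still had not read the jmx-exporter source in depth. I found the excludeObjectNameAttributes parameter in the docs and used it to list the MBean Attributes we do not care about. In this version I did not list the unwanted Attributes of every MBean, only those of the JVM-related MBeans and a few others. The main purpose was to test my hypothesis: is the slow query caused by some attribute of some JVM MBean? I changed the config as follows: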

# stock jmx_exporter exports the JVM's own metrics automatically (see https://groups.google.com/g/prometheus-users/c/2WTZn5Vi4FE); the patched agent from version 4 disables that, so the JVM MBeans are declared explicitly here
includeObjectNames:
  - "java.lang:type=Memory"
  - "java.lang:type=Threading"
  - "org.apache.commons.pool2:type=GenericObjectPool,*"
  - "tomcat.jdbc:*"
  - "Catalina:type=Manager,*"
excludeObjectNameAttributes:
  "java.lang:type=Memory":
    - "ObjectPendingFinalizationCount"
    - "NonHeapMemoryUsage"
    - "Verbose"
    - "ObjectName"
  "java.lang:type=Threading":
    - "ThreadAllocatedMemorySupported"
    - "ThreadAllocatedMemoryEnabled"
    - "CurrentThreadAllocatedBytes"
    - "ThreadContentionMonitoringEnabled"
    - "ThreadContentionMonitoringSupported"
    - "CurrentThreadCpuTimeSupported"
    - "ObjectMonitorUsageSupported"
    - "SynchronizerUsageSupported"
    - "ThreadCpuTimeEnabled"
    - "TotalStartedThreadCount"
    - "AllThreadIds"
    - "CurrentThreadCpuTime"
    - "CurrentThreadUserTime"
    - "ThreadCpuTimeSupported"
    - "PeakThreadCount"
    - "DaemonThreadCount"
    - "ObjectName"
# using cache parameters to increase performance, note that this parameter only caches bean name expressions to rule computation and not cache metrics. see https://github.com/prometheus/jmx_exporter/tree/release-1.0.1/docs
rules:
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(used|max)'
    cache: true
    name: jvm_memory_$1_bytes
    type: GAUGE
    help: $1 (bytes) of a given JVM memory area
    labels: {"area":"heap"}
  - pattern: 'java.lang<type=Threading><>ThreadCount'
    cache: true
    type: GAUGE
    help: Current thread count of a JVM
    name: jvm_threads_current
  - pattern: 'org.apache.commons.pool2<type=GenericObjectPool, name=(\w+)><>(NumActive)'
    cache: true
  - pattern: 'tomcat.jdbc<name=\"\w+/\w+\", .*><>(NumActive)'
    cache: true
  - pattern: 'Catalina<type=Manager,.*><>(activeSessions)'
    cache: true

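With the fifth config deployed, the problem persisted. This time, however, the scrape duration jmx_scrape_duration_seconds observed during timeouts went down, while the timeouts became more frequent.

This version confirmed two things:

the timeouts are not caused by querying the JVM's own MBeans
excluding unneeded MBean attributes reduces jmx_scrape_duration_seconds

The key question: which attribute of which MBean is responsible? The scope is now down to three MBeans.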
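
One way to answer this question directly is to time each attribute read on the suspect MBeans. This is not what I did at the time; it is only a diagnostic sketch using the standard JMX API. The class AttributeTimer and its method are hypothetical names, and the code must run inside the affected JVM (or be adapted to a remote MBeanServerConnection) to be meaningful.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical diagnostic helper; not part of jmx_exporter.
final class AttributeTimer {

    /** Reads every readable attribute of the MBeans matching the pattern and returns nanoseconds per read. */
    static Map<String, Long> timeAttributes(MBeanServer server, String pattern) {
        Map<String, Long> nanos = new LinkedHashMap<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
                for (MBeanAttributeInfo info : server.getMBeanInfo(name).getAttributes()) {
                    if (!info.isReadable()) {
                        continue;
                    }
                    long start = System.nanoTime();
                    try {
                        server.getAttribute(name, info.getName());
                    } catch (Exception e) {
                        // Some attributes throw when read; the time spent is still recorded.
                    }
                    nanos.put(name + "#" + info.getName(), System.nanoTime() - start);
                }
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not inspect MBeans for " + pattern, e);
        }
        return nanos;
    }

    public static void main(String[] args) {
        String pattern = args.length > 0 ? args[0] : "java.lang:type=Threading";
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Print the slowest attributes first.
        timeAttributes(server, pattern).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.printf("%12d ns  %s%n", e.getValue(), e.getKey()));
    }
}
```

Passing a pattern such as "Catalina:type=Manager,*" as the first argument would list the Manager attributes from slowest to fastest.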

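Version 6 config: patch the jmx_exporter source and configure includeObjectNameAttributes

Since the scope was down to three MBeans, why not keep adding excludeObjectNameAttributes entries and continue investigating? Because that kind of config is too tedious: if I later want to monitor one Attribute of an MBean that has hundreds of Attributes, writing the exclusion list would be exhausting. So I decided to patch the source so that jmx-exporter queries only the Attributes I declare. After reading the source, I made the following changes.

Patching the source on the release-1.0.1 branch

In the class io.prometheus.jmx.ObjectNameAttributeFilter, make the following changes: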

// New config key, alongside the existing excludeObjectNameAttributes
public static final String INCLUDE_OBJECT_NAME_ATTRIBUTES = "includeObjectNameAttributes";

// ObjectName -> names of the attributes that should be scraped
private final Map<ObjectName, Set<String>> includeObjectNameAttributesMap;

private ObjectNameAttributeFilter() {
    excludeObjectNameAttributesMap = new ConcurrentHashMap<>();
    includeObjectNameAttributesMap = new ConcurrentHashMap<>();
}

In the method io.prometheus.jmx.ObjectNameAttributeFilter#initialize, add:

if (yamlConfig.containsKey(INCLUDE_OBJECT_NAME_ATTRIBUTES)) {
    Map<Object, Object> objectNameAttributeMap =
            (Map<Object, Object>) yamlConfig.get(INCLUDE_OBJECT_NAME_ATTRIBUTES);

    for (Map.Entry<Object, Object> entry : objectNameAttributeMap.entrySet()) {
        ObjectName objectName = new ObjectName((String) entry.getKey());

        List<String> attributeNames = (List<String>) entry.getValue();

        // computeIfAbsent already stores the set in the map, so no extra put is needed
        Set<String> attributeNameSet =
                includeObjectNameAttributesMap.computeIfAbsent(
                        objectName, o -> Collections.synchronizedSet(new HashSet<>()));

        attributeNameSet.addAll(attributeNames);
    }
}

and add:

public boolean include(ObjectName objectName, String attributeName) {
    boolean result = false;

    // Exact lookup by ObjectName; returns false when nothing is configured for this MBean
    if (includeObjectNameAttributesMap.size() > 0) {
        Set<String> attributeNameSet = includeObjectNameAttributesMap.get(objectName);
        if (attributeNameSet != null) {
            result = attributeNameSet.contains(attributeName);
        }
    }

    return result;
}

In the method io.prometheus.jmx.JmxScraper#scrapeBean, add an include check:

if (objectNameAttributeFilter.include(mBeanName, mBeanAttributeInfo.getName())) {
    name2MBeanAttributeInfo.put(mBeanAttributeInfo.getName(), mBeanAttributeInfo);
}

Updating the config

includeObjectNames:
  - "java.lang:type=Memory"
  - "java.lang:type=Threading"
  - "org.apache.commons.pool2:type=GenericObjectPool,*"
  - "tomcat.jdbc:*"
  - "Catalina:type=Manager,*"
includeObjectNameAttributes:
  "Catalina:type=Manager,host=localhost,context=/<replace-with-real-xxx>":
    - "activeSessions"
  "java.lang:type=Memory":
    - "HeapMemoryUsage"
  "java.lang:type=Threading":
    - "ThreadCount"
  "org.apache.commons.pool2:type=GenericObjectPool,name=<replace-with-real-xxx>":
    - "NumActive"
  "tomcat.jdbc:name=\"jdbc/<replace-with-real-xxx>\",type=ConnectionPool,class=org.apache.tomcat.jdbc.pool.DataSource":
    - "NumActive"
# using cache parameters to increase performance, note that this parameter only caches bean name expressions to rule computation and not cache metrics. see https://github.com/prometheus/jmx_exporter/tree/release-1.0.1/docs
rules:
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(used|max)'
    cache: true
    name: jvm_memory_$1_bytes
    type: GAUGE
    help: $1 (bytes) of a given JVM memory area
    labels: {"area":"heap"}
  - pattern: 'java.lang<type=Threading><>ThreadCount'
    cache: true
    type: GAUGE
    help: Current thread count of a JVM
    name: jvm_threads_current
  - pattern: 'org.apache.commons.pool2<type=GenericObjectPool, name=(\w+)><>(NumActive)'
    cache: true
  - pattern: 'tomcat.jdbc<name=\"\w+/\w+\", .*><>(NumActive)'
    cache: true
  - pattern: 'Catalina<type=Manager,.*><>(activeSessions)'
    cache: true

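After deploying the new jar and config to production, scraping just these few Attributes takes only 0.000xxx seconds, which is the speed and result we wanted. The problem was finally solved.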
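
Note that the keys under includeObjectNameAttributes are full object names rather than patterns. That follows from the include method shown earlier: it looks the MBean up with Map.get, and an ObjectName pattern is never equal to a concrete ObjectName. The sketch below illustrates the difference. IncludeLookupDemo and includeByPattern are hypothetical names, and the pattern-aware variant is only one possible extension built on ObjectName.apply from the standard JMX API; it is not what my patch does.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical demo class; not part of jmx_exporter or of the patch above.
final class IncludeLookupDemo {

    static ObjectName name(String value) {
        try {
            return new ObjectName(value);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid object name: " + value, e);
        }
    }

    /** Exact lookup, mirroring the Map.get used by the include method in the patch. */
    static boolean includeExact(Map<ObjectName, Set<String>> config, ObjectName bean, String attribute) {
        Set<String> attributes = config.get(bean);
        return attributes != null && attributes.contains(attribute);
    }

    /** Assumed extension: treat each configured key as a pattern and test it with ObjectName.apply. */
    static boolean includeByPattern(Map<ObjectName, Set<String>> config, ObjectName bean, String attribute) {
        for (Map.Entry<ObjectName, Set<String>> entry : config.entrySet()) {
            if (entry.getKey().apply(bean) && entry.getValue().contains(attribute)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<ObjectName, Set<String>> config = Collections.singletonMap(
                name("Catalina:type=Manager,*"), Collections.singleton("activeSessions"));
        ObjectName bean = name("Catalina:type=Manager,host=localhost,context=/app");

        System.out.println(includeExact(config, bean, "activeSessions"));     // prints false
        System.out.println(includeByPattern(config, bean, "activeSessions")); // prints true
    }
}
```

The exact lookup is O(1) per attribute, while the pattern variant scans every configured key, so it trades a little scrape time for a shorter config.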

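jmx-exporter source code walkthrough

While patching the source and troubleshooting, I read through the jmx-exporter source. Its key flow is as follows:

maven-shade-plugin is used to set the agent's entry point
The entry-point class of an agent must contain a premain method. If you are interested in how to write a Java Agent, you can search for it yourself; I won't go into detail here.
The entry-point class JavaAgent mainly registers the metrics to scrape with PrometheusRegistry.defaultRegistry and then starts the HTTP server
The lowest-level method behind createHTTPServer is io.prometheus.metrics.exporter.httpserver.HTTPServer#HTTPServer; the core is MetricsHandler
The method io.prometheus.metrics.exporter.common.PrometheusScrapeHandler#handleRequest handles the actual request
The underlying method of scrape
It performs the actual MBean queries. The implementations of the Collector interface are the classes that query the JVM's various MBeans, for example JvmMetrics > JvmThreadsMetrics; each of them is added to collectors by calling io.prometheus.metrics.model.registry.PrometheusRegistry#register(io.prometheus.metrics.model.registry.Collector). The implementation of MultiCollector is JmxCollector, which queries the other MBeans. For the Registry, see the official Prometheus Java client documentation.
Pay particular attention to the collect method of MultiCollector
The final underlying method of doScrape
This is the key location. These steps are the MBean query methods covered in the JMX quick-start post. The collapsed code that follows them formats the queried Attributes into the Prometheus format; see the source if you are interested.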
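
The query steps mentioned above can be sketched with the plain JMX API: resolve the object-name pattern, list each MBean's attributes, and read only the wanted ones. This is a simplified illustration under my own assumptions, not the actual JmxScraper code. The class name MBeanScrapeSketch, the wanted parameter, and the "name#attribute" key format are invented for the example.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical sketch of the query flow; the real logic lives in io.prometheus.jmx.JmxScraper.
final class MBeanScrapeSketch {

    /** Returns "objectName#attribute" -> value for the wanted attributes of all MBeans matching the pattern. */
    static Map<String, Object> scrape(MBeanServerConnection conn, String pattern, Set<String> wanted) {
        Map<String, Object> values = new LinkedHashMap<>();
        try {
            // Step 1: resolve the pattern to concrete MBean names.
            for (ObjectName name : conn.queryNames(new ObjectName(pattern), null)) {
                // Step 2: list the MBean's attributes and keep the readable, wanted ones.
                List<String> toRead = new ArrayList<>();
                for (MBeanAttributeInfo info : conn.getMBeanInfo(name).getAttributes()) {
                    if (info.isReadable() && wanted.contains(info.getName())) {
                        toRead.add(info.getName());
                    }
                }
                // Step 3: read the selected attributes in one call.
                for (Attribute attribute : conn.getAttributes(name, toRead.toArray(new String[0])).asList()) {
                    values.put(name + "#" + attribute.getName(), attribute.getValue());
                }
            }
        } catch (JMException | IOException e) {
            throw new IllegalStateException("MBean query failed for " + pattern, e);
        }
        return values;
    }

    public static void main(String[] args) {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        System.out.println(scrape(conn, "java.lang:type=Threading", Collections.singleton("ThreadCount")));
    }
}
```

Filtering in step 2, before any value is read, is the point of the include check: attributes that are never requested cost nothing.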