Nightingale Monitoring System

Nightingale

Author: rab

Official documentation: https://flashcat.cloud/docs/content/flashcat-monitor/categraf/3-configuration/

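- Preface
- 1. Categraf Configuration File
- 2. Input Plugin Configuration Files
  - 2.1 Plugin Overview
  - 2.2 Common Settings
    - 2.2.1 Collection interval: interval
    - 2.2.2 Collection instances: instances
    - 2.2.3 Interval multiplier: interval_times
    - 2.2.4 Collection labels: labels
  - 2.3 Advanced Settings
    - 2.3.1 Filtering collected metrics
    - 2.3.2 Adding a prefix to metric names
    - 2.3.3 Converting metric values
    - 2.3.4 Collecting Categraf's own metrics
- 3. Common Plugins
  - 3.1 Basic metrics collection
  - 3.2 mysql
  - 3.3 redis
  - 3.4 nginx
  - 3.5 net_response
  - 3.6 exec
- Summary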

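Preface

The deployment article already covered how to deploy Categraf, the n9e collection agent, so that is not repeated here, and this article does not walk through every Categraf collection plugin either. The goal is only to show how n9e collects monitoring data.

The first step is to understand Categraf's configuration files and what the commonly used collection options do. Categraf provides a collection framework with close to a hundred metric collection plugins, and the framework defines some common settings shared by all of them, such as labels and interval.

The Categraf version used here is 0.3.45.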

1. Categraf Configuration File

The main Categraf configuration file lives in the conf directory: conf/config.toml.

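[global]
# Whether to print the configuration to stdout at startup
print_configs = false

# Hostname: the unique identifier of this machine. A label agent_hostname=$hostname
# is automatically attached to the time series data.
# If hostname is empty, the machine's own hostname is used.
# If hostname is not empty, the configured value is used as the hostname.
# The configured string may contain variables. Two are currently supported:
# $hostname and $ip. If they appear in the string they are replaced automatically:
# $hostname with the machine's hostname, $ip with the machine's IP.
# Use --test to check that the output is what you expect.
# In --test mode, the value configured here shows up as the agent_hostname=xxx label.
hostname = ""


# Whether to omit the hostname label. If true, the agent_hostname=$hostname label
# is not attached to the time series data.
omit_hostname = false

# Whether timestamps use ms or s. The default is ms because the remote write
# protocol uses ms as its timestamp unit.
precision = "ms"

# Global collection interval: collect once every 15 seconds
interval = 15

# Global labels, one per line. Labels written here are attached to all time series data.
# [global.labels]
# region = "shanghai"
# env = "localhost"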

[log]
# Default log output is standard output (stdout)
# If a file is specified, logs are written to that file
file_name = "stdout"

# The options below have no effect when file_name is stdout or stderr
# Maximum size of a log file when writing to a file, in MB
max_size = 100
# max_age is the maximum number of days to retain old log files based on the timestamp encoded in their filename.
max_age = 1
# max_backups is the maximum number of old log files to retain.
max_backups = 1
# local_time determines if the time used for formatting the timestamps in backup files is the computer's local time.
local_time = true
# Compress determines if the rotated log files should be compressed using gzip.
compress = false

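# Time series data sent to the backend is first put into Categraf's in-memory queues, one queue per collection plugin.
# chan_size defines the maximum queue length.
# batch is how many items are taken from the queue and sent to the backend at a time.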
[writer_opt]
# default: 2000
batch = 2000
# channel(as queue) size
chan_size = 10000

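# Backend configuration. In TOML, [[ ]] denotes an array, so multiple writers can be configured.
# Each writer can have its own url and its own basic auth credentials.
[[writers]]
# Mind the port number:
# v5 uses port 19000
# v6 uses port 17000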
url = "http://127.0.0.1:19000/prometheus/v1/write"

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

# Whether to enable the push gateway
[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"

# Whether to enable the alert self-healing agent
[ibex]
enable = false
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["127.0.0.1:20090"]
## temp script dir
meta_dir = "./meta"

# Report heartbeats (with resource information, used by the object list) to Nightingale v6
# For v5, this section is not needed
[heartbeat]
enable = true

# report os version cpu.util mem.util metadata
url = "http://127.0.0.1:17000/v1/n9e/heartbeat"

# interval, unit: s
interval = 10

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

That is the official explanation of the main Categraf configuration file. Next comes the configuration of the individual collection plugins.
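Going back to the hostname option in [global], the substitution rule described in its comments can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Categraf's actual code; the function name expand_hostname and the sample values are assumptions made for this example.

```python
def expand_hostname(template: str, hostname: str, ip: str) -> str:
    """Return the value that would appear in the agent_hostname label.

    template is the `hostname` setting from config.toml; hostname and ip
    stand for the machine's real hostname and IP (passed in explicitly here).
    """
    if template == "":
        # Empty setting: fall back to the machine's own hostname.
        return hostname
    # Non-empty setting: replace the two supported variables.
    return template.replace("$hostname", hostname).replace("$ip", ip)


print(expand_hostname("", "web01", "10.0.0.8"))
print(expand_hostname("$hostname-$ip", "web01", "10.0.0.8"))
```

Running ./categraf --test remains the reliable way to see what the agent really produces.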

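2. Input Plugin Configuration Files

2.1 Plugin Overview

Plugin configuration files live in the conf directory, in subdirectories whose names start with input., in the form conf/input.xxxx/xxx.toml.

Metrics collected by a plugin get a default prefix: metrics from input.redis start with redis_, and metrics from input.mysql start with mysql_. Some plugins, such as input.prometheus and input.exec, also let you define the prefix yourself.

Use the following command to inspect the metrics:

./categraf --test --inputs xxx

# xxx: the name of your plugin
# Note: this only tests whether the plugin can fetch data; nothing is reported to n9e or n9e-edge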

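2.2 Common Settings

2.2.1 Collection interval: interval

Each plugin configuration usually starts with an interval setting, which is the collection interval. If it is commented out, the plugin falls back to the interval in the main config.toml. A numeric value is in seconds; a string value must include the unit, for example: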

interval = 60
interval = "60s"
interval = "1m"

All three forms mean a collection interval of one minute. When using a string, the available units are:

seconds: s
minutes: m
hours: h
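The rule above (a bare number means seconds, a string carries its own unit) can be sketched as a small Python helper. It is a minimal sketch for illustration only: parse_interval is a hypothetical name, and it handles just a single number followed by one of the three units listed above.

```python
from typing import Union

# Seconds per unit, limited to the three units listed above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(value: Union[int, str]) -> int:
    """Convert an interval setting to a number of seconds."""
    if isinstance(value, int):
        # A plain number is already in seconds.
        return value
    text = value.strip()
    if not text or text[-1] not in _UNIT_SECONDS:
        raise ValueError(f"interval string needs a unit (s, m or h): {value!r}")
    return int(text[:-1]) * _UNIT_SECONDS[text[-1]]


for setting in (60, "60s", "1m"):
    print(setting, "->", parse_interval(setting), "seconds")
```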

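2.2.2 Collection instances: instances

Many plugin configurations contain an instances section wrapped in [[ ]], which marks it as an array, so multiple [[instances]] sections may appear. For example, to have the ping plugin probe four targets, configure it like this: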

[[instances]]
targets = [
    "www.baidu.com",
    "127.0.0.1",
    "10.4.5.6",
    "10.4.5.7"
]

It can also be configured like this:

[[instances]]
targets = [
    "www.baidu.com",
    "127.0.0.1"
]

[[instances]]
targets = [
    "10.4.5.6",
    "10.4.5.7"
]

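2.2.3 Interval multiplier: interval_times

If an instances section has an interval_times setting, it is a multiplier applied to interval. Take ping monitoring: some addresses should be probed every 15 seconds, while others should be probed less often, say every 30 seconds. Set interval to 15, and set interval_times to 2 on the instances that do not need frequent collection.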
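To make the arithmetic concrete, here is a minimal Python sketch of the relationship effective interval = interval * interval_times. The function name collection_times and the choice to start counting at second 0 are assumptions of this example, not a description of Categraf's scheduler.

```python
def collection_times(interval_s: int, interval_times: int, horizon_s: int) -> list:
    """List the seconds (from an assumed start at 0) at which an instance is collected."""
    step = interval_s * interval_times  # effective interval for this instance
    return list(range(0, horizon_s, step))


# interval = 15 for the plugin; one instance keeps the default multiplier,
# the other sets interval_times = 2 and is collected every 30 seconds.
print(collection_times(15, 1, 60))
print(collection_times(15, 2, 60))
```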

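2.2.4 Collection labels: labels

The labels under instances work like global.labels in config.toml; only the scope differs. Both attach labels to time series data: labels under instances apply to that instance, while global.labels apply to all time series data.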
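The two scopes can be pictured as a simple dictionary merge. The sketch below is illustrative only: merge_labels is a hypothetical helper, and letting the instance-level value win when both scopes define the same key is an assumption of this example that the source does not state.

```python
def merge_labels(global_labels: dict, instance_labels: dict) -> dict:
    """Combine global.labels with the labels of one [[instances]] section."""
    merged = dict(global_labels)
    # Assumption: on a key collision the instance-level value takes precedence.
    merged.update(instance_labels)
    return merged


global_labels = {"region": "shanghai", "env": "localhost"}
instance_labels = {"instance": "n9e-10.2.3.6:3306"}
print(merge_labels(global_labels, instance_labels))
```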

2.3 Advanced Settings

2.3.1 Filtering collected metrics

The http_response plugin serves as the demonstration. A sample http_response.toml looks like this:

[[instances]]
targets = [
     "https://flashcat.cloud"
]

The collected data looks like this:

ulric@192 categraf % ./categraf --test --inputs http_response
...
16:27:08 http_response_cert_expire_timestamp agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 1688947199
16:27:08 http_response_response_code agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 200
16:27:08 http_response_response_time agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 0.227738958
16:27:08 http_response_result_code agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 0

Suppose I want to drop the metrics whose names end in timestamp. That can be configured like this:

[[instances]]
targets = [
     "https://flashcat.cloud"
]

metrics_drop = ["*timestamp"]

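Run the test again and http_response_cert_expire_timestamp is gone from the result. metrics_drop is an array, and each item may contain wildcards.

What if I want to keep only the metrics ending in timestamp and ignore everything else? An example: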

[[instances]]
targets = [
     "https://flashcat.cloud"
]

metrics_pass = ["*timestamp"]

Run the test again and only http_response_cert_expire_timestamp remains; the other three metrics are all ignored.
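The effect of metrics_drop and metrics_pass on the four sample metrics can be reproduced with a short Python sketch. It uses Python's fnmatch as a stand-in for the wildcard matching, which is an assumption; the exact glob rules Categraf applies may differ, and filter_metrics is a hypothetical name.

```python
from fnmatch import fnmatchcase


def filter_metrics(names, metrics_pass=(), metrics_drop=()):
    """Keep names that match metrics_pass (if given) and do not match metrics_drop."""
    kept = []
    for name in names:
        if metrics_pass and not any(fnmatchcase(name, p) for p in metrics_pass):
            continue  # a pass list exists and this name is not on it
        if any(fnmatchcase(name, p) for p in metrics_drop):
            continue  # explicitly dropped
        kept.append(name)
    return kept


names = [
    "http_response_cert_expire_timestamp",
    "http_response_response_code",
    "http_response_response_time",
    "http_response_result_code",
]
print(filter_metrics(names, metrics_drop=["*timestamp"]))
print(filter_metrics(names, metrics_pass=["*timestamp"]))
```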

2.3.2 Adding a prefix to metric names

Add the ulricqin_ prefix to every probe metric for flashcat.cloud:

[[instances]]
targets = [
     "https://flashcat.cloud"
]

metrics_name_prefix = "ulricqin_"

The test result:

16:37:39 ulricqin_http_response_response_time agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 0.203779417
16:37:39 ulricqin_http_response_result_code agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 0
16:37:39 ulricqin_http_response_cert_expire_timestamp agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 1688947199
16:37:39 ulricqin_http_response_response_code agent_hostname=192.168.10.121 method=GET target=https://flashcat.cloud 200

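2.3.3 Converting metric values

Some plugins collect values that are enumerable strings. Strings cannot be written to time series databases in the Prometheus ecosystem, so these strings have to be converted to numbers.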

[[processor_enum]]
  metrics = ["*status"]
  [processor_enum.value_mappings]
    up = 1
    down = 0

This means: for metrics ending in status, two values were collected, up and down, both strings, and processor_enum converts each string to its own number.
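The mapping step can be sketched in Python as follows. This is an illustration of the idea only: apply_enum is a hypothetical helper, fnmatch stands in for the wildcard matching, and leaving unmapped strings untouched is an assumption of this sketch rather than documented Categraf behavior.

```python
from fnmatch import fnmatchcase


def apply_enum(samples, metrics, value_mappings):
    """Replace string values of matching metrics with their mapped numbers."""
    converted = []
    for name, value in samples:
        matches = any(fnmatchcase(name, pattern) for pattern in metrics)
        if matches and isinstance(value, str) and value in value_mappings:
            value = value_mappings[value]
        converted.append((name, value))
    return converted


samples = [("service_status", "up"), ("service_status", "down"), ("cpu_usage", 12.5)]
print(apply_enum(samples, metrics=["*status"], value_mappings={"up": 1, "down": 0}))
```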

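2.3.4 Collecting Categraf's own metrics

These are the monitoring data of the Categraf agent process itself.

touch conf/input.self_metrics/self_metrics.toml

# self_metrics.toml must be created manually (it does not exist by default)

Test:

./categraf --test --inputs self_metrics

Once enabled, Categraf pushes its own resource metrics to the remote write backend, including CPU and memory usage, the number of open file descriptors, the number of goroutines and threads, and GC information.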

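3. Common Plugins

3.1 Basic metrics collection

Basic metrics are the ones Categraf collects automatically once it is downloaded and installed: CPU, memory, disk, disk I/O, network, and so on.

The plugins involved are:

cpu: CPU metrics, including user/system CPU usage, interrupts, and soft interrupts
mem: memory metrics, including used, total, cache, buffer, and available memory
disk: space and inode usage for each disk partition
diskio: disk I/O metrics, including read/write throughput, read/write counts, and read/write latency
net: network metrics, including NIC traffic, packet counts, and error counts

These basic metrics are useful in both test and production environments, and the default configuration is sufficient.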

3.2 mysql

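1. Purpose

The MySQL collection plugin works by connecting to a MySQL instance, running a set of SQL statements, parsing the output, and reporting the result as monitoring data.

Scenario: for multiple MySQL instances, primary/replica setups and similar cases, configure multiple [[instances]] sections on the agent, for example: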

[[instances]]
address = "10.2.3.6:3306"
username = "root"
password = "1234"
labels = { instance="n9e-10.2.3.6:3306" }

[[instances]]
address = "10.2.6.9:3306"
username = "root"
password = "1234"
labels = { instance="zbx-10.2.6.9:3306" }

[[instances]]
...

2. Example

vim conf/input.mysql/mysql.toml

[[instances]]
address = "10.xxx.xx.17:3306"
username = "root"
password = "xxxxxxxx"

# # set tls=custom to enable tls
parameters = "tls=false"

extra_status_metrics = true
extra_innodb_metrics = false
gather_processlist_processes_by_state = false
gather_processlist_processes_by_user = false
gather_schema_size = true
gather_table_size = false
gather_system_table_size = false
gather_slave_status = true

# # timeout
timeout_seconds = 3

# # interval = global.interval * interval_times
interval_times = 1

# important! use global unique string to specify instance
labels = { instance="myn9e-10.206.0.17:3306" }

./categraf --test --inputs mysql

# Collects data for all databases on the current MySQL instance (InnoDB storage, buffer pool, and so on)

3.3 redis

1. Purpose

Collects Redis data such as the slow log and persistence size.

2. Example

vim conf/input.redis/redis.toml

[[instances]]
address = "127.0.0.1:6379"
username = ""
password = ""
pool_size = 2

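# Whether to collect the slowlog
gather_slowlog = true
# Maximum number of slowlog entries to collect
slowlog_max_len = 100
# Collect slowlog entries from the last N seconds
# Note: keep this value no smaller than the plugin's collection interval, otherwise some slowlog entries will be missed
slowlog_time_window=30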

3.4 nginx

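1. Purpose

Collects data from Nginx when it acts as a web server or reverse proxy.

2. Example: a simple web server

**Requirement 1:** nginx must have the http_stub_status_module module enabled

Download locations for the third-party modules: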

nginx_upstream_check_module: https://github.com/yaoweibin/nginx_upstream_check_module

ngx_devel_kit-0.3.0: https://github.com/vision5/ngx_devel_kit/releases/tag/v0.3.0

lua-nginx-module-0.10.13: https://github.com/openresty/lua-nginx-module/releases/tag/v0.10.13

nginx-module-vts: https://github.com/vozlt/nginx-module-vts/releases

ngx-fancyindex-0.5.2: https://github.com/aperezdc/ngx-fancyindex/releases/tag/v0.5.2

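# Building from source is the recommended way to install modules. If you are not sure which modules to install, use the following as a reference: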
cd /opt/nginx-1.20.1 && ./configure \
--prefix=/usr/share/nginx \
--sbin-path=/usr/sbin/nginx \
--modules-path=/usr/lib64/nginx/modules \
--conf-path=/etc/nginx/nginx.conf \
--error-log-path=/var/log/nginx/error.log \
--http-log-path=/var/log/nginx/access.log \
--http-client-body-temp-path=/var/lib/nginx/tmp/client_body \
--http-proxy-temp-path=/var/lib/nginx/tmp/proxy \
--http-fastcgi-temp-path=/var/lib/nginx/tmp/fastcgi \
--http-uwsgi-temp-path=/var/lib/nginx/tmp/uwsgi \
--http-scgi-temp-path=/var/lib/nginx/tmp/scgi \
--pid-path=/var/run/nginx.pid \
--lock-path=/run/lock/subsys/nginx \
--user=nginx \
--group=nginx \
--with-compat \
--with-threads \
--with-http_addition_module \
--with-http_auth_request_module \
--with-http_dav_module \
--with-http_flv_module \
--with-http_gunzip_module \
--with-http_gzip_static_module \
--with-http_mp4_module \
--with-http_random_index_module \
--with-http_realip_module \
--with-http_secure_link_module \
--with-http_slice_module \
--with-http_ssl_module \
--with-http_stub_status_module \
--with-http_sub_module \
--with-http_v2_module \
--with-mail \
--with-mail_ssl_module \
--with-stream \
--with-stream_realip_module \
--with-stream_ssl_module \
--with-stream_ssl_preread_module \
--with-select_module \
--with-poll_module \
--with-file-aio \
--with-http_xslt_module=dynamic \
--with-http_image_filter_module=dynamic \
--with-http_perl_module=dynamic \
--with-stream=dynamic \
--with-mail=dynamic \
--add-module=/etc/nginx/third-modules/nginx_upstream_check_module \
--add-module=/etc/nginx/third-modules/ngx_devel_kit-0.3.0 \
--add-module=/etc/nginx/third-modules/lua-nginx-module-0.10.13 \
--add-module=/etc/nginx/third-modules/nginx-module-vts \
--add-module=/etc/nginx/third-modules/ngx-fancyindex-0.5.2

# Set -j according to the number of CPU cores
make -j2
make install

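# Note: the third-party modules nginx_upstream_check_module, lua-nginx-module and nginx-module-vts are required dependencies of the related plugins.

**Requirement 2:** nginx must enable stub_status in its configuration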

vim /etc/nginx/conf.d/default.conf

server {
    listen       80;
    server_name  10.206.0.17;

    location / {
        root   /usr/share/nginx/html;
        index  index.html index.htm;
    }

    location /status {
        stub_status on;
    }
}

Start collecting: put your server_name plus the status path into urls

vim conf/input.nginx/nginx.toml

## collect interval
# interval = 15

## Set the mapping of extra tags in batches
[mappings]
# "http://192.168.0.216:8000/nginx_status" = { "job" = "local" }
# "https://www.baidu.com/ngx_status" = { "job" = "baidu" }

[[instances]]
## An array of Nginx stub_status URI to gather stats.
urls = [
    "http://10.206.0.17/status/"   # configure your URL here
]

## append some labels for series
# labels = { region="cloud", product="n9e" }

## interval = global.interval * interval_times
# interval_times = 1

## Set response_timeout (default 5 seconds)
response_timeout = "5s"

## Whether to follow redirects from the server (defaults to false)
# follow_redirects = false

## Optional HTTP Basic Auth Credentials
#username = "admin"
#password = "admin"

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

## Optional TLS Config
# use_tls = false
# tls_ca = "/etc/categraf/ca.pem"
# tls_cert = "/etc/categraf/cert.pem"
# tls_key = "/etc/categraf/key.pem"
## Use TLS but skip chain & host verification
# insecure_skip_verify = false

Explanation of the status output:

curl http://127.0.0.1/status

Active connections: 1 
server accepts handled requests
 78 78 78 
Reading: 0 Writing: 1 Waiting: 0

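# What the nginx status fields mean (values taken from the sample output above):
Active connections: number of client connections currently open, including idle ones (1 here).
accepts: total number of client connections accepted since nginx started (78 here).
handled: total number of accepted connections that nginx handled (78 here); it normally equals accepts unless a resource limit such as worker_connections was reached.
requests: total number of client requests processed (78 here).
Reading: number of connections where nginx is currently reading the request header.
Writing: number of connections where nginx is currently writing the response back to the client.
Waiting: number of idle keep-alive connections waiting for the next request; it equals active - (reading + writing).
Dropped connections = accepts - handled. In this sample the two are equal, so no connection was dropped.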
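To show how these fields relate, here is a small Python sketch that parses the stub_status text shown above and derives the dropped-connection count. It is written for illustration only (parse_stub_status and its key names are made up for this example) and is not how the Categraf nginx plugin is implemented.

```python
import re


def parse_stub_status(text: str) -> dict:
    """Parse the plain-text output of nginx stub_status into a dict of integers."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    counters = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    if not (active and counters and states):
        raise ValueError("not a stub_status response")
    accepts, handled, requests = (int(v) for v in counters.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
        "dropped": accepts - handled,  # connections accepted but never handled
    }


sample = """Active connections: 1
server accepts handled requests
 78 78 78
Reading: 0 Writing: 1 Waiting: 0
"""
print(parse_stub_status(sample))
```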

3.5 net_response

1. Purpose

TCP port liveness probing and TCP network latency probing.

2. Example

vim conf/input.net_response/net_response.toml

# # collect interval
# interval = 15

[mappings]
# "127.0.0.1:22"= {region="local",ssh="test"}
# "127.0.0.1:22"= {region="local",ssh="redis"}

[[instances]]
targets = [
#     "127.0.0.1:22",
#     "localhost:6379",
#     ":9090"
    "127.0.0.1:9090"  # add your target here
]

# # append some labels for series
labels = { env="pre" }  # labels

# # interval = global.interval * interval_times
interval_times = 1

## Protocol, must be "tcp" or "udp"
## NOTE: because the "udp" protocol does not respond to requests, it requires
## a send/expect string pair (see below).
protocol = "tcp"   # protocol

## Set timeout
timeout = "15s"

## Set read timeout (only used if expecting a response)
read_timeout = "15s"

## The following options are required for UDP checks. For TCP, they are
## optional. The plugin will send the given string to the server and then
## expect to receive the given 'expect' string back.
## string sent to the server
# send = "ssh"
## expected string in answer
# expect = "ssh"

./categraf --test --inputs net_response
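For intuition about what a TCP liveness and latency probe does, here is a minimal Python sketch: it opens a TCP connection to a host:port target and measures how long the connect takes. The function probe_tcp and the keys of its result are hypothetical names for this example and do not correspond to the metric names or result codes the net_response plugin reports.

```python
import socket
import time


def probe_tcp(target: str, timeout_s: float = 15.0) -> dict:
    """Try to open a TCP connection to 'host:port' and time the attempt."""
    host, _, port = target.rpartition(":")
    host = host or "127.0.0.1"  # a target like ":9090" is treated as local here
    start = time.monotonic()
    try:
        with socket.create_connection((host, int(port)), timeout=timeout_s):
            ok = True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        ok = False
    return {"target": target, "ok": ok, "response_time": time.monotonic() - start}


if __name__ == "__main__":
    print(probe_tcp("127.0.0.1:9090", timeout_s=2.0))
```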

3.6 exec

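1. Purpose

Use the exec plugin when the plugins under the input directories cannot meet a collection requirement, or when a special scenario calls for custom monitoring of a specific service.

The monitoring script collects its data and prints it to stdout in one of the supported formats; Categraf captures the stdout content, parses it, and sends it to the server.

As noted earlier, the exec plugin lets you choose the metric names yourself, whereas most other plugins produce names that start with the plugin name.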

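2. Data formats

Three formats are supported: influx, prometheus, and falcon.

influx format

measurement,labelkey1=labelval1,labelkey2=labelval2 field1=1.2,field2=2.3

Notes on the influx format

measurement defines the metric name (or prefix), for example connections.
measurement is followed by a comma and then the labels; if there are no labels, no comma is needed after measurement.
Labels use the k=v form, and multiple labels are separated by commas, for example region=beijing,env=test.
The labels are followed by a space.
After the space come the fields; multiple fields are separated by commas.
Each field uses the name=value form; in Categraf the value must be a number.

In the end, measurement is joined with each field name to form the metric name.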
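The rules above can be traced with a short Python sketch that splits one influx line into labels and final metric names. It is a simplified illustration: parse_influx_line is a hypothetical helper, it ignores escaping and quoted values, and joining the measurement and field name with an underscore is an assumption of this sketch.

```python
def parse_influx_line(line: str):
    """Split 'measurement,k=v,... field=value,...' into (metrics, labels)."""
    head, rest = line.strip().split(" ", 1)
    measurement, *label_parts = head.split(",")
    labels = dict(part.split("=", 1) for part in label_parts)
    fields_part = rest.split(" ")[0]  # drop an optional trailing timestamp
    metrics = {}
    for field in fields_part.split(","):
        key, value = field.split("=", 1)
        # Assumption: measurement and field name are joined with "_".
        metrics[f"{measurement}_{key}"] = float(value)
    return metrics, labels


metrics, labels = parse_influx_line("connections,region=beijing,env=test active=10,idle=3.5")
print(metrics)
print(labels)
```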

falcon format

[
    {
        "endpoint": "test-endpoint",
        "metric": "test-metric",
        "timestamp": 1658490609,
        "step": 60,
        "value": 1,
        "counterType": "GAUGE",
        "tags": "idc=lg,loc=beijing"
    },
    {
        "endpoint": "test-endpoint",
        "metric": "test-metric2",
        "timestamp": 1658490609,
        "step": 60,
        "value": 2,
        "counterType": "GAUGE",
        "tags": "idc=lg,loc=beijing"
    }
]

Categraf ignores the timestamp, step, and counterType fields when processing this format, and reports endpoint as a label.
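As a rough picture of that handling, the Python sketch below turns a falcon-style JSON payload into samples with labels. It is not Categraf's code: convert_falcon and the shape of its output are assumptions made for this example, based only on the field handling described above.

```python
import json


def convert_falcon(payload: str) -> list:
    """Convert a falcon JSON array into samples of metric, value and labels."""
    samples = []
    for item in json.loads(payload):
        labels = {}
        if item.get("tags"):
            # tags look like "idc=lg,loc=beijing"
            labels.update(pair.split("=", 1) for pair in item["tags"].split(","))
        labels["endpoint"] = item["endpoint"]  # endpoint is reported as a label
        # timestamp, step and counterType are deliberately not carried over.
        samples.append({"metric": item["metric"], "value": float(item["value"]), "labels": labels})
    return samples


payload = json.dumps([{
    "endpoint": "test-endpoint", "metric": "test-metric", "timestamp": 1658490609,
    "step": 60, "value": 1, "counterType": "GAUGE", "tags": "idc=lg,loc=beijing",
}])
print(convert_falcon(payload))
```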

prometheus format

echo '# HELP demo_http_requests_total Total number of http api requests'
echo '# TYPE demo_http_requests_total counter'
echo 'demo_http_requests_total{api="add_product"} 4633433'

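The lines starting with # are comments; Categraf ignores them, so they can be left out.

Note that label values must be wrapped in double quotes, otherwise an error is reported. If a label value comes from a shell variable, the whole string has to be in double quotes so the variable expands, and the inner quotes must then be escaped:
echo "demo_http_requests_total{api=\"add_product\"} 4633433"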
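If the exec script is written in Python instead of shell, the quoting can be handled by a small formatting helper. The sketch below builds one line in the Prometheus text format and escapes backslashes, double quotes, and newlines in label values as that format requires; prom_line and escape_label_value are names chosen for this example.

```python
def escape_label_value(value: str) -> str:
    """Escape the characters that are special inside a quoted label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def prom_line(name: str, labels: dict, value) -> str:
    """Format one sample as 'name{k="v",...} value'."""
    if not labels:
        return f"{name} {value}"
    body = ",".join(f'{k}="{escape_label_value(str(v))}"' for k, v in labels.items())
    return f"{name}{{{body}}} {value}"


api = "add_product"  # the label value comes from a variable
print(prom_line("demo_http_requests_total", {"api": api}, 4633433))
```

A script like this only needs to print such lines to stdout, one sample per line.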

3. Example

Write the script (using the prometheus data format here):

vim /opt/tools/test.sh

#!/bin/sh
echo '# buyu service heartbeat'
echo 'proc_status{env="pro",servername="buyu"} 1'

# Make the script executable so that Categraf can run it
chmod +x /opt/tools/test.sh

Start collecting:

vim conf/input.exec/exec.toml

# # collect interval
# interval = 15

[[instances]]
# # commands, support glob
commands = [
    "/opt/tools/test.sh"
]

# # timeout for each command to complete
# timeout = 5

# # interval = global.interval * interval_times
# interval_times = 1

# # choices: influx prometheus falcon
# # influx stdout example: measurement,labelkey1=labelval1,labelkey2=labelval2 field1=1.2,field2=2.3
data_format = "prometheus"