Why a centralized logging system is needed
Current situation: each service runs on three machines in production. To pin down a production issue, you log in to one machine and hunt through the logs with Linux commands such as cd / tail / less / grep / sed / awk. If that machine yields no clue, you repeat the search on the other two.
In a distributed system, however, services are spread across dozens or even hundreds of servers. Quickly searching, analyzing, and archiving logs with traditional Linux commands is slow and laborious, to say nothing of analyzing and summarizing them.
To handle large volumes of archived logs, slow file searches, and multi-dimensional queries, log management has to be centralized: the logs from every server are collected and aggregated in one place. The common approach is to build a centralized log collection system.
LPG
Loki + Promtail + Grafana
- Loki: aggregates and stores log data; it acts as a data source for Grafana and feeds it the data to visualize
- Promtail: the log collector, similar to Filebeat; it collects log lines from log files and pushes them to Loki
- Grafana: pulls log data from Loki and visualizes it
Loki is an open-source project from the Grafana Labs team: a horizontally scalable, highly available, multi-tenant log aggregation system. It is designed to be very cost-effective and easy to operate, because it does not index the log content; instead it indexes a set of labels for each log stream. It is optimized in particular for Prometheus and Kubernetes users.
The project was inspired by Prometheus; the official tagline is: Like Prometheus, But For Logs.
Project address:
https://github.com/grafana/loki/
Compared with other log aggregation systems, Loki has the following characteristics:
- It does not build a full-text index of the logs. By storing compressed, unstructured logs and indexing only metadata, Loki is simpler to operate and cheaper to run
- It indexes and groups log streams with the same labels already used in Prometheus, which makes scaling and operating the logs more efficient and allows integration with Alertmanager
- It is particularly well suited to storing Kubernetes Pod logs; metadata such as Pod labels is automatically scraped and indexed
- It is natively supported by Grafana, so there is no switching back and forth between Kibana and Grafana
Log collection flow diagram
Architecture overview
Deployment
Install Docker
# Install dependencies
sudo yum install -y yum-utils device-mapper-persistent-data lvm2
# Add the yum repository
sudo yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
# Refresh the package cache
sudo yum makecache fast
# Install docker-ce
sudo yum -y install docker-ce
# Create the configuration directory
sudo mkdir -p /etc/docker
# Create the configuration file
sudo touch /etc/docker/daemon.json
# Write the configuration (the /mitr/middleware/docker data directory has already been created)
cat <<EOF | sudo tee /etc/docker/daemon.json
{
"registry-mirrors":["https://docker.mirrors.ustc.edu.cn"],
"data-root": "/mitr/middleware/docker"
}
EOF
# Reload systemd, enable Docker on boot, and start it
systemctl daemon-reload && systemctl enable docker && systemctl start docker
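A quick sanity check after installation (a minimal sketch; the grep pattern assumes the English-language output of docker info):
# Confirm the daemon is running and the custom data-root took effect
docker version
docker info | grep -i 'Docker Root Dir'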
Install Grafana
# Pull the image
docker pull grafana/grafana
# Create and start the Grafana container (initial credentials: admin/admin)
docker run --name=grafana -id -p 3000:3000 grafana/grafana
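To confirm Grafana is up before continuing, its health endpoint can be probed (a minimal check; /api/health is part of Grafana's HTTP API):
# Should return JSON containing "database": "ok"
curl -s http://localhost:3000/api/health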
Install Promtail
# Pull the image
docker pull grafana/promtail
# Create the configuration directory
mkdir -p /etc/promtail
cat <<EOF |sudo tee /etc/promtail/promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://10.0.0.58:3100/loki/api/v1/push
scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      host: localhost
      __path__: /var/log/*
EOF
# Create and start the container
docker run -di --name=promtail -v /etc/promtail/promtail-config.yaml:/etc/promtail/config.yml -v /var/log:/var/log -p 9080:9080 grafana/promtail
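To verify that Promtail started and is tailing files, two quick checks are possible (a sketch; the positions file path comes from the config above, and /metrics is Promtail's standard Prometheus metrics endpoint):
# Promtail's own metrics should already reference the files it is reading
curl -s http://localhost:9080/metrics | grep '^promtail_'
# The positions file records how far each file has been read
docker exec promtail cat /tmp/positions.yaml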
Install Loki
Loki configuration file reference
# Pull the image
docker pull grafana/loki
# Create the configuration directory
mkdir -p /etc/loki
cat <<EOF |sudo tee /etc/loki/local-config.yaml
# Authentication
# Enables authentication through the X-Scope-OrgID header, which must be present
# if true. If false, the OrgID will always be set to "fake"
auth_enabled: false
# Server port
server:
  http_listen_port: 3100
ingester:
  # The WAL (Write Ahead Log) records incoming logs and stores them on the local filesystem,
  # guaranteeing persistence of acknowledged data if the process crashes.
  wal:
    enabled: true
    dir: /loki/wal
  # Lifecycler configuration
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        # Backend storage to use for the ring. Supported values are: consul, etcd,
        # inmemory, memberlist
        # CLI flag: -ring.store
        store: inmemory
      # The number of ingesters to write to and read from
      # default = 3
      replication_factor: 1
    # Duration to sleep before exiting, to ensure metrics are scraped. default = 30s
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to this compressed size (1MB here), flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
# Where Loki stores its data
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h  # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks
# The compactor compacts index shards to improve performance
compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: filesystem
limits_config:
  reject_old_samples: true
  # Maximum accepted sample age before rejecting
  reject_old_samples_max_age: 168h
chunk_store_config:
  # Maximum time to look back for log lines. Applies only to instant log queries. default = 30s
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: true
  retention_period: 360h
# ruler:
#   storage:
#     type: local
#     local:
#       directory: /loki/rules
#   rule_path: /loki/rules-temp
#   alertmanager_url: http://localhost:9093
#   ring:
#     kvstore:
#       store: inmemory
#   enable_api: true
EOF
# Create and start the container
docker run -di --name=loki -v loki-data:/loki -v /etc/loki/local-config.yaml:/etc/loki/local-config.yaml -p 3100:3100 grafana/loki
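Before wiring Loki into Grafana, it can be checked that Loki is ready and already receiving data from Promtail (a sketch; /ready and the labels endpoint are part of Loki's HTTP API):
# Readiness probe; should eventually print "ready"
curl -s http://localhost:3100/ready
# Labels indexed so far; should eventually include "job", "host" and "filename"
curl -s http://localhost:3100/loki/api/v1/labels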
Usage
Configure the Loki data source in Grafana
As shown in the screenshot below:
Set the data source URL to http://10.0.0.74:3100 and save.
After saving, switch to Explore in Grafana's left-hand sidebar to open the Loki page.
Then click Log browser to list the log labels currently being collected, and filter the logs by those labels. For example, selecting /var/log/messages and 10.0.0.74 shows the logs from that file.
What is displayed here are the logs under the /var/log directory inside the Promtail container.
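The same logs can also be pulled straight from Loki's HTTP API, which is handy when Grafana is not at hand (a sketch; the address is the data source URL configured above and the job label comes from the Promtail config):
# Most recent lines for the varlogs job, limited to 10 entries
curl -sG http://10.0.0.74:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="varlogs"}' \
  --data-urlencode 'limit=10'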
The Promtail container's /etc/promtail/config.yml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /var/log/*log
Here the job label is varlogs and the file path is /var/log/*log.
Label-only indexing: as mentioned several times above, the biggest difference between Loki and ES is that Loki indexes only the labels, never the log content.
Static label matching
Take a simple Promtail configuration as an example:
scrape_configs:
- job_name: system
  pipeline_stages:
  static_configs:
  - targets:
      - localhost
    labels:
      job: message
      __path__: /var/log/messages
Interpretation:
- The configuration above starts one log scraping job
- The job carries one fixed label, job: message
- The scrape path is /var/log/messages; the file path is also attached as a fixed label named filename
- Promtail's web page shows a target information page much like Prometheus's
You can query with the same label-matching syntax as in Prometheus, e.g. {job="message"}.
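If the logcli command-line client happens to be installed (an assumption; it is not part of the deployment above), the same selector works there as well:
# --addr points at the Loki instance from the deployment section
logcli --addr=http://10.0.0.74:3100 query '{job="message"}'
The configuration below extends this to two scrape jobs: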
scrape_configs:
- job_name: system
  pipeline_stages:
  static_configs:
  - targets:
      - localhost
    labels:
      job: syslog
      __path__: /var/log/syslog
- job_name: system
  pipeline_stages:
  static_configs:
  - targets:
      - localhost
    labels:
      job: apache
      __path__: /var/log/apache.log
With two jobs configured as above, {job=~"syslog|apache"} matches both jobs at once; regular expressions are supported in matchers.
Characteristics of the label matching model
- Same as Prometheus: one set of labels corresponds to one stream, mirroring how Prometheus handles a series
- In Prometheus, an identical set of labels maps to the same hash and refId (a monotonically increasing positive integer), i.e. the same series
- Time-series samples are continuously appended to that memSeries
- When any label value changes, a new hash and refId are produced, i.e. a new series
Loki handles logs in the same way: one set of label values produces one stream. Log lines are appended to that stream over time and eventually compressed into chunks. When any label value changes, a new hash is produced, which corresponds to a new stream.
Query process:
- Loki first computes the hash from the labels and finds the matching chunks through the inverted index
- It then filters those chunks by the keywords and other conditions in the query, which keeps things fast
- Hashing labels, looking up IDs in an inverted index, and locating the stored blocks is exactly the approach already proven in Prometheus
- It is low-overhead and fast
Dynamic labels and high cardinality
- Dynamic label: a label whose value is not fixed
- High-cardinality label: a label with far too many possible values, 100,000, 1,000,000, or even more
Take an Apache access log as an example:
11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
In Promtail, a regex pipeline stage is used to extract two labels, action and status_code:
scrape_configs:
- job_name: system
  pipeline_stages:
  static_configs:
  - targets:
      - localhost
    labels:
      job: syslog
      __path__: /var/log/syslog
- job_name: system
  pipeline_stages:
  static_configs:
  - targets:
      - localhost
    labels:
      job: apache
      __path__: /var/log/apache.log
- job_name: system
  pipeline_stages:
    - regex:
        expression: "^(?P<ip>\\S+) (?P<identd>\\S+) (?P<user>\\S+) \\[(?P<timestamp>[\\w:/]+\\s[+\\-]\\d{4})\\] \"(?P<action>\\S+)\\s?(?P<path>\\S+)?\\s?(?P<protocol>\\S+)?\" (?P<status_code>\\d{3}|-) (?P<size>\\d+|-)\\s?\"?(?P<referer>[^\"]*)\"?\\s?\"?(?P<useragent>[^\"]*)?\"?$"
    - labels:
        action:
        status_code:
  static_configs:
  - targets:
      - localhost
    labels:
      job: apache
      env: dev
      __path__: /var/log/apache.log
With action=GET/POST and status_code=200/400 this yields four streams:
11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
Those four log lines become four separate streams and start filling four separate chunks. If another unique label combination appears (say status_code=500), yet another stream is created.
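Concretely, with the static labels from the config above plus the two extracted labels, the four streams carry these label sets (illustrative only; the filename label that Promtail also attaches is omitted):
{job="apache", env="dev", action="GET", status_code="200"}
{job="apache", env="dev", action="POST", status_code="200"}
{job="apache", env="dev", action="GET", status_code="400"}
{job="apache", env="dev", action="POST", status_code="400"}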
The high-cardinality problem: as above, if ip were also turned into a label, every distinct client IP would become its own stream. You could quickly generate thousands or tens of thousands of streams; that is high cardinality, and it can kill Loki.
Won't queries be slow if a field is not indexed as a label? Loki's answer is to break the query into small pieces and fan them out in parallel, so that large volumes of log data can still be searched in a short time.
The problem with full-text indexes
A large index is both complex and expensive. The full-text index of log data is typically as large as, or larger than, the log data itself.
To query the logs, that index has to be loaded, and for performance it will likely have to sit in memory. This is hard to scale, and as the logs grow, the index grows rapidly.
Loki's index is typically an order of magnitude smaller than the ingested log volume, and it grows very slowly.
Speeding up queries on fields that are not labels: taking the ip field mentioned above as an example, use a filter expression:
{job="apache"} |= "11.11.11.11"
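For reference, LogQL line filters offer a few operators besides |= (contains): != (does not contain), |~ (matches a regular expression), and !~ (does not match). For example, to grep a whole address range rather than a single IP (an illustrative query against the labels defined above):
{job="apache"} |~ "11\\.11\\.11\\..*"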
Sharding at query time in Loki (a grep segmented by time range)
- Loki breaks the query into smaller shards, opens each chunk of the streams that match the labels, and starts searching for the IP address
- The size of those shards and the degree of parallelism are configurable and depend on the resources available
- If needed, the shard interval can be set to, say, 5 minutes; deploy 20 queriers and gigabytes of logs are processed in seconds
- Or run 200 queriers and process terabytes of logs
Comparing the two indexing models
- Elasticsearch's big index has to exist at all times, whether or not anything is being queried, which means, for example, permanently tying up a lot of memory
- Loki's approach is to spin up many segmented, parallel queries only at query time
Use fewer labels when log volume is low
- Every additional chunk that has to be loaded adds overhead
- For example, if the query is {app="loki", level!="bug"}
- Without a level label, only the chunks for app="loki" need to be loaded
- With a level label, the chunks for every level value (level=info, warn, error, critical, ...) have to be loaded and then queried
Add a label only once it is actually needed
- With chunk_target_size=1MB, chunks are cut at 1MB of compressed data
- That corresponds to roughly 5MB-10MB of raw logs; if a stream can reach that volume within max_chunk_age, consider adding a label to split it
Logs should arrive in increasing time order
- This is the same issue as handling old data in a TSDB
- For performance reasons, Loki currently rejects out-of-order (old) data outright
Integrating with Prometheus monitoring
Edit prometheus.yml:
scrape_configs:
  - job_name: 'Loki'
    static_configs:
      - targets: ["10.0.0.74:3100"]
Then reload Prometheus.
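A minimal sketch of the reload, assuming Prometheus also runs as a container named prometheus (adjust to however it is actually deployed):
# Send SIGHUP so Prometheus re-reads prometheus.yml without a restart
docker kill --signal=HUP prometheus
# Or, if Prometheus was started with --web.enable-lifecycle:
curl -X POST http://localhost:9090/-/reload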
Grafana dashboard
Install the plugin
grafana-cli plugins install grafana-piechart-panel
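Since Grafana runs in the container created earlier, the command can be executed inside that container and Grafana restarted so the plugin is picked up (the container name grafana comes from the run command above):
docker exec grafana grafana-cli plugins install grafana-piechart-panel
docker restart grafana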
Dashboard screenshot
Import the dashboard JSON
{
"annotations": {
"list": [
{
"$$hashKey": "object:7",
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"description": "Loki metrics via 2.0",
"editable": true,
"gnetId": 13407,
"graphTooltip": 0,
"id": 28,
"links": [],
"panels": [
{
"datasource": "Prometheus",
"description": "",
"gridPos": {
"h": 2,
"w": 24,
"x": 0,
"y": 0
},
"id": 21,
"interval": "",
"options": {
"content": "<div class=\"left dashboard-header\">\n <img src=\"https://marketplace-assets.digitalocean.com/logos/loki-logo.svg\" style=\"height:29px;\"/>\n <span>Loki Global Metrics</span>\n</div>",
"mode": "html"
},
"pluginVersion": "8.1.3",
"targets": [
{
"expr": "loki_build_info",
"format": "table",
"instant": false,
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"transparent": true,
"type": "text"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"decimals": 4,
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 3,
"w": 6,
"x": 0,
"y": 2
},
"id": 59,
"interval": "",
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"last"
],
"fields": "/^version$/",
"values": false
},
"text": {
},
"textMode": "auto"
},
"pluginVersion": "8.1.3",
"targets": [
{
"expr": "loki_build_info",
"format": "table",
"instant": false,
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "Loki Version",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"decimals": 0,
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 3,
"w": 6,
"x": 6,
"y": 2
},
"id": 10,
"interval": "",
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"last"
],
"fields": "",
"values": false
},
"text": {
},
"textMode": "auto"
},
"pluginVersion": "8.1.3",
"targets": [
{
"expr": "sum(log_messages_total)",
"instant": false,
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "Message Total",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"decimals": 4,
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 3,
"w": 4,
"x": 12,
"y": 2
},
"id": 23,
"interval": "",
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"mean"
],
"fields": "",
"values": false
},
"text": {
},
"textMode": "auto"
},
"pluginVersion": "8.1.3",
"targets": [
{
"expr": "sum(loki_store_series_total)",
"instant": false,
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "Store Series Total",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"decimals": 4,
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "bytes"
},
"overrides": []
},
"gridPos": {
"h": 3,
"w": 5,
"x": 16,
"y": 2
},
"id": 41,
"interval": "",
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"mean"
],
"fields": "",
"values": false
},
"text": {
},
"textMode": "auto"
},
"pluginVersion": "8.1.3",
"targets": [
{
"expr": "sum(loki_ingester_chunk_stored_bytes_total)",
"instant": true,
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "Store Chunks Total",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"decimals": 4,
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 3,
"w": 3,
"x": 21,
"y": 2
},
"id": 24,
"interval": "",
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"last"
],
"fields": "",
"values": false
},
"text": {
},
"textMode": "auto"
},
"pluginVersion": "8.1.3",
"targets": [
{
"expr": "loki_panic_total",
"instant": false,
"interval": "",
"legendFormat":