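Table of contents
- What is Thanos
- Key features of Thanos
- Thanos components
- Thanos deployment architectures
- Sidecar
- Receive
- Choosing an architecture
- Getting started with the deployment
- Deployment layout
- Create the namespace
- node-exporter deployment
- kube-state-metrics deployment
- Prometheus + Thanos-Sidecar deployment
- Label the dedicated node
- Generate the secrets
- MinIO configuration
- etcd certificates
- Start Prometheus + Thanos-Sidecar
- Thanos-store-gateway deployment
- Thanos-compact deployment
- Thanos-query deployment
- Thanos-query-global deployment
- Thanos-query-frontend deployment
- Grafana deployment
- Add Thanos and MinIO monitoring
- Grafana dashboard
- coredns
- etcd
- Thanos
- node-exporter
- Closing thoughts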
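What is Thanos
- Thanos official website
- Thanos image repository on quay.io
- Thanos GitHub
Thanos is a set of components that extends Prometheus and addresses its storage, scalability, and high-availability limits in large environments.
It is well suited to monitoring large clusters, especially when monitoring data has to be kept long term and queried globally.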
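Key features of Thanos
Global Query View
- The Querier component can query multiple Prometheus instances and deduplicate the results across those data sources.
- Even with many Prometheus instances running in a large cluster, users can query all monitoring data through a single interface.
Unlimited Retention
- Prometheus on its own is only suited to short-term storage; Thanos adds the ability to ship monitoring data to long-term storage (object stores such as Amazon S3, Google Cloud Storage, or MinIO).
Prometheus Compatible
- Grafana and other tools that support the Prometheus query API can query Prometheus data through Thanos.
Downsampling & Compaction
- The Compactor component periodically compacts, deduplicates, and downsamples the data held in object storage, reducing storage cost and improving query performance.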
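Thanos components
Following the KISS and Unix philosophies, Thanos is made up of a set of components, each with a specific role.
Sidecar
- Deployed alongside each Prometheus instance. It uploads Prometheus data blocks to object storage and exposes Prometheus data to the Querier. Prometheus v2.13.0+ is strongly recommended because of its improved remote read, which the Sidecar relies on.
Store Gateway
- Often just called Store. It serves historical monitoring data from object storage (AWS S3, Google Cloud Storage, MinIO, and so on).
Compactor
- Compacts, deduplicates, and downsamples the data held in object storage, improving query performance and reducing storage cost.
Receiver
- Receives and stores the data that Prometheus instances send via Remote Write.
Ruler/Rule
- Evaluates recording and alerting rules against the data stored in Thanos, much like rule evaluation in Prometheus itself; the alerts it fires are sent on to Alertmanager.
Querier/Query
- The global query component. It pulls data from multiple Prometheus instances and from object storage and exposes a single, unified query interface.
Query Frontend
- A proxy layer placed in front of Query. It uses query splitting, caching, and request queuing to speed up complex queries and to improve response times under heavy load.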
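To give splitting and caching a concrete shape, here is a rough sketch of the query-frontend flags involved. It is only an illustration: the downstream host name, the cache file path /etc/thanos/cache.yaml, and all values are my own assumptions, and the flag names are the ones I know from the Thanos query-frontend documentation, so check `thanos query-frontend --help` for your version.

```yaml
# Hypothetical args fragment for a thanos query-frontend container.
args:
  - query-frontend
  - --http-address=0.0.0.0:10902
  # Querier that the frontend forwards (sub)queries to; the host name is a placeholder.
  - --query-frontend.downstream-url=http://thanos-query.example.svc:10902
  # Query splitting: a long range query is cut into 24h pieces.
  - --query-range.split-interval=24h
  # Retry a failed sub-query a few times before giving up.
  - --query-range.max-retries-per-request=5
  # Response caching: cache settings are read from this (assumed) file.
  - --query-range.response-cache-config-file=/etc/thanos/cache.yaml
---
# Assumed content of /etc/thanos/cache.yaml: a small in-memory results cache.
type: IN-MEMORY
config:
  max_size: "256MB"
  validity: 6h
```

A smaller split interval spreads one long query over more sub-queries, at the cost of more requests hitting the Querier.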
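Thanos deployment architectures
Sidecar
The Sidecar uses Prometheus's reload endpoint, so make sure Prometheus is started with the --web.enable-lifecycle flag.
- Pros
  - Lightweight: the Sidecar is a small proxy that only has to run next to the Prometheus instance; Prometheus itself needs no major changes.
  - Real-time data access: the Sidecar lets Thanos read Prometheus's live monitoring data directly, so the newest data is always queryable.
  - Long-term storage integration: Prometheus data is uploaded to object storage on a regular basis, which makes up for Prometheus's lack of native long-term storage.
- Cons
  - Depends on Prometheus: the Sidecar needs a running Prometheus instance; if that instance goes down, the Sidecar cannot serve queries either.
  - Limited horizontal scaling: the Sidecar is not designed for large-scale ingestion. It is a companion to Prometheus and cannot scale out to absorb large data volumes the way the Receiver can.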
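Receive
- Pros
  - Large-scale ingestion: the Receiver can efficiently take in large volumes of data from Prometheus instances, which suits large deployments.
  - Multi-tenancy: it can handle and isolate data from multiple tenants, which is useful when several independent environments are monitored.
  - Horizontal scaling: by sharding data and adding Receiver instances, it can keep up with a growing ingestion load.
  - Deduplication and high availability: with deduplication, multiple instances provide high availability without storing duplicate data.
- Cons
  - No query capability of its own: data the Receiver has accepted is queried and analyzed through other Thanos components (such as Querier and Store).
  - Lower freshness: compared with querying a Prometheus instance directly, data may arrive with some delay because of the extra processing on the way.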
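Sidecar vs. Receiver at a glance (copied from ChatGPT)
Feature | Thanos Sidecar | Thanos Receiver |
---|---|---|
Main function | Integrates with a Prometheus instance, providing real-time data access and long-term storage | Receives and stores data remote-written by Prometheus instances |
Data source | Reads data directly from Prometheus | Prometheus Remote Write data |
Storage method | Periodically uploads Prometheus data blocks to object storage | Stores received data locally or in object storage |
Horizontal scalability | None; tied to a single Prometheus instance | Scales horizontally by adding instances |
Real-time data queries | Supports querying Prometheus's real-time data | Cannot be queried directly |
Multi-tenancy | Not supported | Supported; suited to multi-tenant environments |
High availability | Depends on the Prometheus instance | Supports highly available deployment and deduplication |
Typical use case | Integrating with existing Prometheus instances and storing data long term | Data ingestion and storage in large-scale, multi-tenant environments |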
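Choosing an architecture
- Multi-cluster Thanos monitoring and alerting in practice (多集群 thanos 监控告警实践)
- Building a large cloud-native distributed monitoring system (3): Thanos deployment and practice (打造云原生大型分布式监控系统 (三): Thanos 部署与实践)
The suggestions below come from these two blog posts. Which architecture to pick is something you will have to verify and judge against your own situation.
- The main difference between Sidecar and Receiver is how the most recent data is queried.
  - With Sidecar, the most recent data is read straight from the Prometheus data directory.
  - With Receiver, Prometheus pushes everything out, so all data, including the newest, is served from the Thanos side (the Receiver and storage services such as S3).
- If the Prometheus setup is small and scrapes few services, query latency is usually acceptable even when the Sidecar and the global Query sit in different data centers, as long as both are in the same country.
- If the Prometheus setup is large and scrapes a lot of data, still prefer the Sidecar architecture where possible: once data volume surges, the load on the Receiver becomes very heavy, and it needs a lot of resources and very fast storage.
- Unless the main goal is analysis of historical metrics, or Prometheus cannot persist data in some special scenario, Sidecar is the recommended choice.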
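To make the first point concrete: in Receiver mode Prometheus itself has to be told to push its samples out, whereas in Sidecar mode prometheus.yml needs no such section. A minimal sketch of that Receiver-side setting, assuming a Service named thanos-receive in the monitoring namespace (a hypothetical name, nothing in this post deploys it) that listens on Thanos Receive's default remote-write port 19291:

```yaml
# prometheus.yml fragment, only needed for the Receiver architecture.
# The host name below is a placeholder for wherever your Receiver is reachable.
remote_write:
  - url: http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive
```

With the Sidecar architecture used in the rest of this post, Prometheus keeps writing only to its local TSDB and the Sidecar picks the blocks up from disk.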
Getting started with the deployment
The deployment uses the Sidecar mode.
Deployment layout
Alerting will be handled by Prometheus's own rules, so Thanos-rule is not deployed here.
k8s cluster A | k8s cluster B |
---|---|
Prometheus:v2.54.1 | Prometheus:v2.54.1 |
node-exporter:v1.8.2 | node-exporter:v1.8.2 |
kube-state-metrics:v2.11.0 | kube-state-metrics:v2.11.0 |
Thanos-sidecar:v0.36.1 | Thanos-sidecar:v0.36.1 |
Thanos-query:v0.36.1 | Thanos-query:v0.36.1 |
Thanos-store-gateway:v0.36.1 | Thanos-store-gateway:v0.36.1 |
Thanos-compact:v0.36.1 | |
Thanos-query-global:v0.36.1 | |
Thanos-query-frontend:v0.36.1 | |
Grafana | |
For the MinIO deployment, see my earlier post "Deploying a distributed MinIO cluster on a k8s 1.28.2 cluster", and have the MinIO cluster ready in advance.
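Create the namespace
In everything below, k stands for kubectl. Only one environment is shown; I have two k8s clusters, so Prometheus has to be deployed twice, once per cluster. All manifests in this post use the monitoring namespace.
k create ns monitoring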
node-exporter deployment
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter-svc
  namespace: monitoring
spec:
  clusterIP: None
  ports:
    - name: http
      port: 9100
      protocol: TCP
  selector:
    app.kubernetes.io/name: node-exporter
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
        - args:
            - --path.rootfs=/rootfs
            - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
          image: docker.m.daocloud.io/prom/node-exporter:v1.8.2
          name: node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: http
          volumeMounts:
            - mountPath: /rootfs
              name: root
              readOnly: true
      hostIPC: true
      hostNetwork: true
      hostPID: true
      volumes:
        - hostPath:
            path: /
          name: root
kube-state-metrics deployment
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - serviceaccounts
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs:
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
    verbs:
      - list
      - watch
  - apiGroups:
      - batch
    resources:
      - cronjobs
      - jobs
    verbs:
      - list
      - watch
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - list
      - watch
  - apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
    verbs:
      - create
  - apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
    verbs:
      - create
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - list
      - watch
  - apiGroups:
      - certificates.k8s.io
    resources:
      - certificatesigningrequests
    verbs:
      - list
      - watch
  - apiGroups:
      - discovery.k8s.io
    resources:
      - endpointslices
    verbs:
      - list
      - watch
  - apiGroups:
      - storage.k8s.io
    resources:
      - storageclasses
      - volumeattachments
    verbs:
      - list
      - watch
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      - mutatingwebhookconfigurations
      - validatingwebhookconfigurations
    verbs:
      - list
      - watch
  - apiGroups:
      - networking.k8s.io
    resources:
      - networkpolicies
      - ingressclasses
      - ingresses
    verbs:
      - list
      - watch
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - list
      - watch
  - apiGroups:
      - rbac.authorization.k8s.io
    resources:
      - clusterrolebindings
      - clusterroles
      - rolebindings
      - roles
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics-sa
    namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  clusterIP: None
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics
    - name: telemetry
      port: 8081
      targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      automountServiceAccountToken: true
      containers:
        - image: docker.m.daocloud.io/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.11.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /livez
              port: http-metrics
            initialDelaySeconds: 5
            timeoutSeconds: 5
          name: kube-state-metrics
          ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
          readinessProbe:
            httpGet:
              path: /readyz
              port: telemetry
            initialDelaySeconds: 5
            timeoutSeconds: 5
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 65534
            seccompProfile:
              type: RuntimeDefault
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics-sa
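Prometheus + Thanos-Sidecar deployment
Label the dedicated node
k label node 192.168.22.125 prometheus=true
Generate the secrets
MinIO configuration
The configuration contains the MinIO access_key and secret_key, so avoid a ConfigMap, which would expose them in plain text, and read them from a Secret instead. Join the output of the command below into a single line and use it to replace the config value in the Secret further down.
cat <<EOF | base64 -
type: S3
config:
  bucket: "prom-thanos-sidecar"
  endpoint: "minio.api.devops.icu"
  access_key: "gsl2dzAHviNzabSn0ikw"
  secret_key: "82zQ0UMDlOo3LxCQM9TqSygEYrMuxSSRYQdO1KXF"
  insecure: true
EOF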
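If you would rather skip the manual base64 step, Kubernetes can do the encoding for you: a Secret accepts plain text under stringData and stores it base64-encoded under data. Below is a sketch of the same Secret written that way, as an alternative to the base64 version used in the manifests in this post. The access_key and secret_key values are placeholders that I made up; fill in your own bucket, endpoint, and keys, and keep the file out of version control since it holds credentials in clear text.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-config
  namespace: monitoring
# stringData is write-only convenience input; the API server stores it under data.
stringData:
  config: |
    type: S3
    config:
      bucket: "prom-thanos-sidecar"
      endpoint: "minio.api.devops.icu"
      access_key: "<your-access-key>"   # placeholder
      secret_key: "<your-secret-key>"   # placeholder
      insecure: true
```

The key name config matters: the Sidecar container mounts it with subPath: config.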
etcd certificates
My k8s cluster was deployed with kubeadm, so the certificates live in /etc/kubernetes/pki/etcd; I create the Secret directly from the local files.
certs_dir=/etc/kubernetes/pki/etcd; \
k create secret generic etcd-pki -n monitoring \
  --from-file=ca=${certs_dir}/ca.crt \
  --from-file=cert=${certs_dir}/server.crt \
  --from-file=key=${certs_dir}/server.key
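Start Prometheus + Thanos-Sidecar
Prometheus stores its data on a local hostPath. Because Thanos has to read that data, both containers must run as the same user; otherwise permission problems stop Thanos from reading the data and from uploading it to MinIO. The error looks like this:
ts=2024-10-21T06:09:16.284378709Z caller=sidecar.go:410 level=warn err="upload 01JAP2JAZ0AQT8BEYFY30A4VVD: hard link block: hard link file chunks/000001: link /etc/prometheus/data/01JAP2JAZ0AQT8BEYFY30A4VVD/chunks/000001 /etc/prometheus/data/thanos/upload/01JAP2JAZ0AQT8BEYFY30A4VVD/chunks/000001: operation not permitted" uploaded=0
- Prometheus flags in brief
  - --storage.tsdb.min-block-duration=2h: a new data block is cut after at least 2 hours.
  - --storage.tsdb.max-block-duration=2h: a data block spans at most 2 hours. Setting min and max to the same value turns off Prometheus's local compaction, which the Sidecar needs in order to upload blocks.
  - --storage.tsdb.retention.time=6h: how long Prometheus keeps data locally. The default is 15 days; adjust it to the disk space you actually have.
  - --storage.tsdb.wal-compression: compresses the WAL, which shrinks the WAL files and lowers the storage requirement.
  - --storage.tsdb.no-lockfile: disables the lock file so that it does not get in the way of Thanos uploading blocks to MinIO.
  - --web.enable-lifecycle: enables hot reloading; a POST to localhost:9090/-/reload reloads the configuration file.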
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: prometheus-sa
    namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: prometheus-svc
  namespace: monitoring
spec:
  ports:
    - name: http
      port: 9090
      targetPort: 9090
    - name: grpc
      port: 10901
      targetPort: 10901
  selector:
    app: prometheus
  type: ClusterIP
---
apiVersion: v1
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
      scrape_timeout: 10s
      external_labels:
        cluster: devops
        replica: $(POD_NAME)
    rule_files:
      - /etc/prometheus/rules/*.yml
    scrape_configs:
      - job_name: prometheus
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_app]
            regex: prometheus
            action: keep
          - source_labels: [__meta_kubernetes_pod_ip]
            regex: (.+)
            target_label: __address__
            replacement: ${1}:9090
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: replace
            target_label: endpoint
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
      - job_name: kube-apiserver
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: replace
            target_label: endpoint
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
      - job_name: kubelet
        metrics_path: /metrics/cadvisor
        scheme: https
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - source_labels: [instance]
            action: replace
            target_label: node
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: replace
            target_label: endpoint
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
      - job_name: etcd
        kubernetes_sd_configs:
          - role: pod
        scheme: https
        tls_config:
          ca_file: /etc/prometheus/etcd-ssl/ca
          cert_file: /etc/prometheus/etcd-ssl/cert
          key_file: /etc/prometheus/etcd-ssl/key
          insecure_skip_verify: false
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_component]
            regex: etcd
            action: keep
          - source_labels: [__meta_kubernetes_pod_ip]
            regex: (.+)
            target_label: __address__
            replacement: ${1}:2379
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: replace
            target_label: endpoint
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
      - job_name: coredns
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_k8s_app]
            regex: kube-dns
            action: keep
          - source_labels: [__meta_kubernetes_pod_ip]
            regex: (.+)
            target_label: __address__
            replacement: ${1}:9153
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: replace
            target_label: endpoint
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
      - job_name: node-exporter
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__
            action: replace
          - source_labels: [__meta_kubernetes_node_address_InternalIP]
            action: replace
            target_label: ip
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
      - job_name: kube-state-metrics
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
            regex: monitoring;kube-state-metrics
            action: keep
          - source_labels: [__meta_kubernetes_pod_ip]
            regex: (.+)
            target_label: __address__
            replacement: ${1}:8080
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: replace
            target_label: endpoint
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
kind: ConfigMap
metadata:
  name: prometheus-cm
  namespace: monitoring
---
apiVersion: v1
data:
  config: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogInByb20tdGhhbm9zLXNpZGVjYXIiCiAgZW5kcG9pbnQ6ICJtaW5pby5hcGkuZGV2b3BzLmljdSIKICBhY2Nlc3Nfa2V5OiAiZ3NsMmR6QUh2aU56YWJTbjBpa3ciCiAgc2VjcmV0X2tleTogIjgyelEwVU1EbE9vM0x4Q1FNOVRxU3lnRVlyTXV4U1NSWVFkTzFLWEYiCiAgaW5zZWN1cmU6IHRydWUK
kind: Secret
metadata:
  labels:
    app.kubernetes.io/name: prometheus
  name: thanos-config
  namespace: monitoring
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: prometheus
                    operator: In
                    values:
                      - "true"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - prometheus
              topologyKey: kubernetes.io/hostname
      containers:
        - args:
            - --config.file=/etc/prometheus/config/prometheus.yml
            - --storage.tsdb.path=/etc/prometheus/data
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
            - --storage.tsdb.retention.time=6h
            - --storage.tsdb.wal-compression
            - --storage.tsdb.no-lockfile
            - --web.enable-lifecycle
          command:
            - /bin/prometheus
          env:
            - name: TZ
              value: Asia/Shanghai
          image: quay.io/prometheus/prometheus:v2.54.1
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 60
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: http
            timeoutSeconds: 1
          name: prometheus
          ports:
            - containerPort: 9090
              name: http
          readinessProbe:
            failureThreshold: 60
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: http
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 500m
              memory: 1024Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/prometheus/data
              name: prometheus-home
            - mountPath: /etc/prometheus/config
              name: prometheus-config
            - mountPath: /etc/prometheus/etcd-ssl
              name: etcd-ssl
        - args:
            - sidecar
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --tsdb.path=/etc/prometheus/data
            - --prometheus.url=http://localhost:9090
            - --objstore.config-file=/etc/thanos/config/thanos-sidecar.yml
          image: quay.io/thanos/thanos:v0.36.1
          imagePullPolicy: IfNotPresent
          name: thanos-sidecar
          ports:
            - containerPort: 10901
              name: grpc
          volumeMounts:
            - mountPath: /etc/prometheus/data
              name: prometheus-home
            - mountPath: /etc/thanos/config/thanos-sidecar.yml
              name: thanos-config
              readOnly: true
              subPath: config
      imagePullSecrets:
        - name: harbor-secret
      initContainers:
        - command:
            - sh
            - -c
            - '[ -d /etc/prometheus/data/thanos ] || chown -R 65534:65534 /etc/prometheus/data'
          image: quay.io/prometheus/prometheus:v2.54.1
          imagePullPolicy: IfNotPresent
          name: init-dir
          securityContext:
            runAsUser: 0
          volumeMounts:
            - mountPath: /etc/prometheus/data
              name: prometheus-home
      securityContext:
        runAsUser: 65534
      serviceAccount: prometheus-sa
      terminationGracePeriodSeconds: 0
      volumes:
        - hostPath:
            path: /approot/k8s_data/prometheus
            type: DirectoryOrCreate
          name: prometheus-home
        - configMap:
            name: prometheus-cm
          name: prometheus-config
        - name: thanos-config
          secret:
            secretName: thanos-config
        - name: etcd-ssl
          secret:
            secretName: etcd-pki
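One detail in the ConfigMap above is worth double-checking. The replica external label is written as $(POD_NAME), but as far as I know that syntax is only substituted by Kubernetes inside a container's command, args, and env, not inside a mounted config file, and the Prometheus container here does not define POD_NAME at all. With a single replica the label still works as a constant, but with two or more replicas every Prometheus would carry the same replica value. A possible way to get a per-pod value, sketched under the assumption that you run Prometheus 2.x, where environment expansion in external_labels sits behind the expand-external-labels feature flag:

```yaml
# 1) prometheus.yml: use ${VAR} syntax in external_labels.
global:
  external_labels:
    cluster: devops
    replica: ${POD_NAME}
---
# 2) prometheus container in the StatefulSet: turn the expansion on
#    and provide the variable through the Downward API.
args:
  - --enable-feature=expand-external-labels
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```

Both fragments are additions to the existing settings, not replacements for the full args and env lists. After a restart, the external labels shown on the Prometheus status page should carry the pod name.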
Thanos-store-gateway deployment
The content of the Secret is the same as the one used for the Sidecar; remember to replace it with your own.
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-sa
  namespace: monitoring
---
apiVersion: v1
data:
  config: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogInByb20tdGhhbm9zLXNpZGVjYXIiCiAgZW5kcG9pbnQ6ICJtaW5pby5hcGkuZGV2b3BzLmljdSIKICBhY2Nlc3Nfa2V5OiAiZ3NsMmR6QUh2aU56YWJTbjBpa3ciCiAgc2VjcmV0X2tleTogIjgyelEwVU1EbE9vM0x4Q1FNOVRxU3lnRVlyTXV4U1NSWVFkTzFLWEYiCiAgaW5zZWN1cmU6IHRydWUK
kind: Secret
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-objstore-config
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-headless
  namespace: monitoring
spec:
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
    - name: http
      port: 10902
      protocol: TCP
      targetPort: http
  selector:
    app.kubernetes.io/name: thanos-store-gateway
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-store-gateway
  serviceName: thanos-store-gateway-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-store-gateway
    spec:
      containers:
        - args:
            - store
            - --log.level=info
            - --log.format=logfmt
            - --data-dir=/var/thanos/store
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --no-cache-index-header
            - --objstore.config-file=/etc/thanos/objstore.yaml
          env:
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: HOST_IP_ADDRESS
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          image: quay.io/thanos/thanos:v0.36.1
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 4
            httpGet:
              path: /-/healthy
              port: http
              scheme: HTTP
            initialDelaySeconds: 0
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 1
          name: thanos-store-gateway
          ports:
            - containerPort: 10901
              name: grpc
              protocol: TCP
            - containerPort: 10902
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: http
              scheme: HTTP
            initialDelaySeconds: 0
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsGroup: 65532
            runAsNonRoot: true
            runAsUser: 65534
            seccompProfile:
              type: RuntimeDefault
          volumeMounts:
            - mountPath: /etc/thanos/objstore.yaml
              name: objstore-config
              readOnly: true
              subPath: config
            - mountPath: /var/thanos/store
              name: data
              readOnly: false
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-store-gateway-sa
      volumes:
        - name: objstore-config
          secret:
            secretName: thanos-objstore-config
        - emptyDir:
            sizeLimit: 100Mi
          name: data
Thanos-compact deployment
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact-headless
  namespace: monitoring
spec:
  clusterIP: None
  ports:
    - name: http
      port: 10902
      protocol: TCP
      targetPort: http
  selector:
    app.kubernetes.io/name: thanos-compact
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-compact
  serviceName: thanos-compact-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-compact
    spec:
      containers:
        - args:
            - compact
            # keep running and compact continuously instead of exiting after one pass
            - --wait
            - --log.level=info
            - --log.format=logfmt
            - --data-dir=/var/thanos/compact
            - --http-address=0.0.0.0:10902
            # reuses the thanos-objstore-config Secret created with the store gateway
            - --objstore.config-file=/etc/thanos/objstore.yaml
            # merge overlapping blocks and deduplicate series that differ only in the replica label
            - --compact.enable-vertical-compaction
            - --deduplication.replica-label=replica
            - --deduplication.func=penalty
            # blocks marked for deletion are removed from the bucket after this delay
            - --delete-delay=1d
            # retention per resolution: raw samples, 5m downsampled, 1h downsampled
            - --retention.resolution-raw=7d
            - --retention.resolution-5m=15d
            - --retention.resolution-1h=30d
          env:
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: HOST_IP_ADDRESS
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          image: quay.io/thanos/thanos:v0.36.1
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 4
            httpGet:
              path: /-/healthy
              port: http
              scheme: HTTP
            initialDelaySeconds: 0
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 1
          name: thanos-compact
          ports:
            - containerPort: 10902
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: http
              scheme: HTTP
            initialDelaySeconds: 0
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsGroup: 65532
            runAsNonRoot: true
            runAsUser: 65534
            seccompProfile:
              type: RuntimeDefault
          volumeMounts:
            - mountPath: /etc/thanos/objstore.yaml
              name: objstore-config
              readOnly: true
              subPath: config
            - mountPath: /var/thanos/compact
              name: data
              readOnly: false
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-compact-sa
      volumes:
        - name: objstore-config
          secret:
            secretName: thanos-objstore-config
        - emptyDir:
            sizeLimit: 100Mi
          name: data
Thanos-query deployment
- The --query.replica-label flag names the label used for deduplication; it is the label configured in Prometheus's external_labels.
- Thanos-query's gRPC port gets its own Service, exposed as a NodePort. A global Thanos-query then registers the Thanos-query of each cluster, and queries finally go through Thanos-query-frontend.
- If you have the resources, you can of course deploy one more Thanos-query in each cluster to act as the global query, so that in-cluster and cross-cluster queries are kept apart.
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-svc
  namespace: monitoring
spec:
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
    - name: http
      port: 10902
      protocol: TCP
      targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-np-svc
  namespace: monitoring
spec:
  ports:
    - name: grpc
      nodePort: 31901
      port: 10901
      targetPort: grpc
  selector:
    app.kubernetes.io/name: thanos-query
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query
    spec:
      containers:
        - args:
            - query
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --query.replica-label=replica
            - --endpoint=dnssrv+_grpc._tcp.thanos-store-gateway-headless.monitoring.svc.cluster.local
            - --endpoint=dnssrv+_grpc._tcp.prometheus-svc.monitoring.svc.cluster.local
          env:
            - name: HOST_IP_ADDRESS
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          image: quay.io/thanos/thanos:v0.36.1
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 4
            httpGet:
              path: /-/healthy
              port: http
              scheme: HTTP
            initialDelaySeconds: 0
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 1
          name: thanos-query
          ports:
            - containerPort: 10901
              name: grpc
              protocol: TCP
            - containerPort: 10902
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: http
              scheme: HTTP
            initialDelaySeconds: 0
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsGroup: 65532
            runAsNonRoot: true
            runAsUser: 65534
            seccompProfile:
              type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-sa
Thanos-query-globle 部署
--endpoint
我是两个集群各挑了两个节点
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle-svc
  namespace: monitoring
spec:
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query-globle
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query-globle
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query-globle
    spec:
      containers:
      - args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --endpoint=192.168.22.112:31901
        - --endpoint=192.168.22.113:31901
        - --endpoint=192.168.22.122:31901
        - --endpoint=192.168.22.123:31901
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query-globle
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-globle-sa
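The --endpoint addresses above are node IPs on port 31901, which suggests that each cluster exposes its thanos-query gRPC port through a NodePort. If such a Service is not already in place, a minimal sketch could look like the following. The Service name and the selector label are assumptions based on the naming pattern used in this article; adjust them to match the actual thanos-query Deployment.

```yaml
# Hypothetical NodePort Service, applied in each member cluster.
# The name and selector label are assumed, not taken from the article.
apiVersion: v1
kind: Service
metadata:
  name: thanos-query-nodeport
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: thanos-query
  ports:
  - name: grpc
    port: 10901
    protocol: TCP
    targetPort: grpc
    # Must match the port used in the --endpoint flags
    nodePort: 31901
```

Listing two nodes per cluster means the global query can still reach a cluster if one of the listed nodes is down.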
Thanos-query-frontend deployment
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend-svc
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query-frontend
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query-frontend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query-frontend
    spec:
      containers:
      - args:
        - query-frontend
        - --log.level=info
        - --log.format=logfmt
        - --http-address=0.0.0.0:10902
        - --query-frontend.downstream-url=http://thanos-query-globle-svc.monitoring.svc.cluster.local:10902
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query-frontend
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-frontend-sa
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-query-frontend
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: thanos.devops.icu
    http:
      paths:
      - backend:
          service:
            name: thanos-query-frontend-svc
            port:
              number: 10902
        path: /
        pathType: Prefix
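The Deployment above runs query-frontend without a response cache, so it only splits and forwards range queries. If caching is wanted, one possible variant of the container args is sketched below. This is an optional addition rather than part of the original setup, and the 512MB cache size is an arbitrary illustrative value.

```yaml
# Sketch: replacement for spec.template.spec.containers[0].args
# in the thanos-query-frontend Deployment. Cache values are illustrative.
- query-frontend
- --log.level=info
- --log.format=logfmt
- --http-address=0.0.0.0:10902
- --query-frontend.downstream-url=http://thanos-query-globle-svc.monitoring.svc.cluster.local:10902
# Split long range queries into one sub-query per day
- --query-range.split-interval=24h
# Inline YAML for an in-memory response cache
- |-
  --query-range.response-cache-config=type: IN-MEMORY
  config:
    max_size: 512MB
    max_size_items: 0
    validity: 0s
```

An in-memory cache is lost whenever the pod restarts and is not shared between replicas, which is acceptable for the single-replica Deployment shown here.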
Grafana deployment
The dashboard JSON files are persisted on NFS here, so changing or adding a dashboard only requires uploading the file to the NFS share.
---
apiVersion: v1
data:
  grafana.ini: |
    provisioning = /etc/grafana/provisioning
kind: ConfigMap
metadata:
  name: grafana-cm
  namespace: monitoring
---
apiVersion: v1
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://thanos-query-globle-svc.monitoring.svc.cluster.local:10902
kind: ConfigMap
metadata:
  name: grafana-datasource
  namespace: monitoring
---
apiVersion: v1
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
    - name: 'a unique provider name'
      orgId: 1
      folder: ''
      folderUid: ''
      type: file
      disableDeletion: false
      editable: true
      updateIntervalSeconds: 10
      allowUiUpdates: true
      options:
        # <string, required> path to dashboard files on disk. Required
        path: /etc/grafana/provisioning/dashboards/views
kind: ConfigMap
metadata:
  name: grafana-dashboard
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: grafana
  name: grafana-svc
  namespace: monitoring
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: http-grafana
  selector:
    app.kubernetes.io/name: grafana
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: grafana
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  template:
    metadata:
      labels:
        app.kubernetes.io/name: grafana
    spec:
      containers:
      - env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: docker.m.daocloud.io/grafana/grafana:11.3.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 3000
          timeoutSeconds: 1
        name: grafana
        ports:
        - containerPort: 3000
          name: http-grafana
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 2
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 250m
            memory: 750Mi
        volumeMounts:
        - mountPath: /etc/grafana/grafana.ini
          name: grafana-config
          subPath: grafana.ini
        - mountPath: /etc/grafana/provisioning/datasources/prometheus.yaml
          name: grafana-datasource
          subPath: prometheus.yaml
        - mountPath: /etc/grafana/provisioning/dashboards/grafana-dashboard.yaml
          name: grafana-dashboard
          subPath: dashboards.yaml
        - mountPath: /etc/grafana/provisioning/dashboards/views
          name: grafana
          subPathExpr: $(POD_NAME)
      securityContext:
        fsGroup: 472
        supplementalGroups:
        - 0
      volumes:
      - configMap:
          name: grafana-cm
        name: grafana-config
      - configMap:
          name: grafana-datasource
        name: grafana-datasource
      - configMap:
          name: grafana-dashboard
        name: grafana-dashboard
  volumeClaimTemplates:
  - metadata:
      name: grafana
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: nfs-client
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: grafana.devops.icu
    http:
      paths:
      - backend:
          service:
            name: grafana-svc
            port:
              number: 3000
        path: /
        pathType: Prefix
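The data source above talks to thanos-query-globle-svc directly, so Grafana queries bypass the query-frontend. If dashboard queries should go through the query-frontend instead, the prometheus.yaml key of the grafana-datasource ConfigMap could be changed along these lines. This is an optional variant, not the original configuration, and isDefault is an added setting.

```yaml
# Sketch: alternative content for the prometheus.yaml key
# of the grafana-datasource ConfigMap.
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  # Route dashboard queries through the query-frontend Service
  url: http://thanos-query-frontend-svc.monitoring.svc.cluster.local:10902
  isDefault: true
```

Because the file is mounted with subPath, the Grafana pod needs a restart to pick up a changed ConfigMap.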
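Adding Thanos and MinIO monitoring
Prometheus needs to authenticate in order to scrape MinIO metrics. There are two options: configure JWT authentication with the mc command (see the official documentation for mc admin prometheus generate), or set MINIO_PROMETHEUS_AUTH_TYPE=public on MinIO, which takes effect after MinIO is restarted and lets Prometheus access the metrics API directly. The scrape jobs below carry no credentials, so they rely on the public setting.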
- job_name: minio
  metrics_path: /minio/v2/metrics/cluster
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
    regex: storage;minio-svc
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:9000
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
- job_name: thanos-query
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
    regex: monitoring;thanos-query-svc
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10902
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
- job_name: thanos-store-gateway
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
    regex: monitoring;thanos-store-gateway-headless
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10902
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
- job_name: thanos-compact
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
    regex: monitoring;thanos-compact-headless
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10902
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
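For the JWT route mentioned at the start of this section, the MinIO job needs to send a bearer token. A minimal sketch follows, assuming the token printed by mc admin prometheus generate is pasted in place of the placeholder. The job name is hypothetical, and the target address is derived from the storage namespace, the minio-svc Service, and port 9000 used in the job above.

```yaml
# Sketch: MinIO scrape job with JWT authentication.
# Replace the placeholder with the token generated by mc.
- job_name: minio-jwt
  metrics_path: /minio/v2/metrics/cluster
  scheme: http
  authorization:
    type: Bearer
    credentials: <token from mc admin prometheus generate>
  static_configs:
  - targets: ['minio-svc.storage.svc.cluster.local:9000']
```

With this variant MINIO_PROMETHEUS_AUTH_TYPE can stay at its default, so the metrics endpoint is not open to unauthenticated clients.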
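Grafana dashboard
Here are the IDs of the dashboards I use. Because I run two k8s clusters, each dashboard needs a cluster variable added, and most of them need some further tuning.
coredns
14981
etcd
Uses the official template: grafana.json
Thanos
12937
node-exporter
12633 or 21902
16098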
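A cluster variable in a dashboard only works if every series carries a cluster label. One common way to get it is an external label on each Prometheus instance, sketched below. The label value is a placeholder, and this setting may already be present in the Prometheus configuration of each cluster.

```yaml
# Sketch: fragment of prometheus.yml, one distinct value per cluster.
global:
  external_labels:
    # Placeholder value; use a different one in the second cluster
    cluster: cluster-a
```

The dashboard variable can then be filled with a Grafana query such as label_values(up, cluster), and each panel query filtered with cluster="$cluster".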
Finally
The YAML manifests and dashboard JSON files are available on gitee: https://gitee.com/chen2ha/yaml_for_kubernetes/tree/master/thanos