A Case Study: A Failed TLS Enablement in TiDB Caused PD Scale-Out to Fail

news2025/1/10 20:59:02

Author: Dora. Original source: https://tidb.net/blog/8ee8f295

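Background

  1. The cluster's TLS certificates were lost when the TiUP directory was deleted, so TLS had to be enabled again.

  2. While the TLS enablement procedure was being rehearsed in a test environment, the subsequent scale-out of two PD nodes failed. The steps were:

    1. Scale in two PD nodes
    2. Enable TLS
    3. Scale out the original two PD nodes; the PD nodes reported errors on startup
    4. Restart the cluster

After the cluster restart, all three PD nodes were down, and the cluster had to be recovered with pd-recover or destroyed and rebuilt.

Note: TLS can only be enabled while the cluster has a single PD node. If there are several PD nodes, scale in to one first.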

Investigation

TiUP log

  1. The TiUP log shows that an error was reported while TLS was being enabled:
2024-01-04T16:11:00.718+0800    DEBUG   TaskFinish  {"task": "Restart Cluster", "error": "failed to stop: failed to stop: xxxx node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be stopped after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be stopped after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStopped\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:130\ngithub.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:338\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to stop: 10.196.32.49 node_exporter-9100.service, please check the instance's log() for more detail.\nfailed to stop"}
2024-01-04T16:11:00.718+0800    INFO    Execute command finished    {"code": 1, "error": "failed to stop: failed to stop: xxxx node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be stopped after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be stopped after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStopped\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:130\ngithub.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:338\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to stop: 10.196.32.49 node_exporter-9100.service, please check the instance's log() for more detail.\nfailed to stop"}
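
Entries like the two above are easy to miss in a long debug log. The following is a minimal sketch, not part of TiUP, that pulls the failed tasks out of such lines. It assumes each line has the shape "timestamp, level, message, JSON fields" as seen above; the helper names `parse_tiup_log_line` and `failed_tasks` are hypothetical, and the sample line is a shortened version of the real entry.

```python
import json


def parse_tiup_log_line(line: str) -> dict:
    """Split one log line into timestamp, level, message and JSON fields."""
    brace = line.find("{")
    head = line if brace == -1 else line[:brace]
    # Everything from the first "{" onward is assumed to be one JSON object.
    fields = json.loads(line[brace:]) if brace != -1 else {}
    parts = head.split()
    return {
        "time": parts[0] if len(parts) > 0 else None,
        "level": parts[1] if len(parts) > 1 else None,
        "message": " ".join(parts[2:]),
        "fields": fields,
    }


def failed_tasks(lines):
    """Return (task, error) pairs for every TaskFinish entry that has an error."""
    failures = []
    for line in lines:
        if not line.strip():
            continue
        record = parse_tiup_log_line(line)
        error = record["fields"].get("error")
        if record["message"] == "TaskFinish" and error:
            failures.append((record["fields"].get("task"), error))
    return failures


# Shortened sample based on the DEBUG entry above (errorVerbose omitted).
sample = (
    '2024-01-04T16:11:00.718+0800    DEBUG   TaskFinish  '
    '{"task": "Restart Cluster", '
    '"error": "timed out waiting for port 9100 to be stopped after 2m0s"}'
)

for task, error in failed_tasks([sample]):
    print(f"{task}: {error}")
```

Run against the real log file, a non-empty result for the "Restart Cluster" task would have been a signal to stop before scaling out PD.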

PD log

One of the PD nodes started with http rather than https.

The picture was now roughly clear: the error raised while enabling TLS left the PD configuration wrong for the later steps.

Reproduction

Scale in PD

tiup cluster scale-in tidb-cc -N 172.16.201.159:52379

Enable TLS

When TLS is enabled, the last step turns out to be an update of the PD member configuration:

+ [ Serial ] - Reload PD Members Update pd-172.16.201.73-52379 peerURLs [ https://172.16.201.73:52380 ]

[tidb@vm172-16-201-73 /tidb-deploy/cc/pd-52379/scripts]$ tiup cluster tls tidb-cc enable
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.14.1
   Local installed version:    v1.12.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.12.3/tiup-cluster tls tidb-cc enable
Enable/Disable TLS will stop and restart the cluster `tidb-cc`
Do you want to continue? [y/N]:(default=N) y
Generate certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.99
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ Copy certificate to remote host
  - Generate certificate pd -> 172.16.201.73:52379 ... Done
  - Generate certificate tikv -> 172.16.201.73:25160 ... Done
  - Generate certificate tikv -> 172.16.201.159:25160 ... Done
  - Generate certificate tikv -> 172.16.201.99:25160 ... Done
  - Generate certificate tidb -> 172.16.201.73:54000 ... Done
  - Generate certificate tidb -> 172.16.201.159:54000 ... Done
  - Generate certificate prometheus -> 172.16.201.73:59090 ... Done
  - Generate certificate grafana -> 172.16.201.73:43000 ... Done
  - Generate certificate alertmanager -> 172.16.201.73:59093 ... Done
+ Copy monitor certificate to remote host
  - Generate certificate node_exporter -> 172.16.201.159 ... Done
  - Generate certificate node_exporter -> 172.16.201.99 ... Done
  - Generate certificate node_exporter -> 172.16.201.73 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.73 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.159 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.99 ... Done
+ Refresh instance configs
  - Generate config pd -> 172.16.201.73:52379 ... Done
  - Generate config tikv -> 172.16.201.73:25160 ... Done
  - Generate config tikv -> 172.16.201.159:25160 ... Done
  - Generate config tikv -> 172.16.201.99:25160 ... Done
  - Generate config tidb -> 172.16.201.73:54000 ... Done
  - Generate config tidb -> 172.16.201.159:54000 ... Done
  - Generate config prometheus -> 172.16.201.73:59090 ... Done
  - Generate config grafana -> 172.16.201.73:43000 ... Done
  - Generate config alertmanager -> 172.16.201.73:59093 ... Done
+ Refresh monitor configs
  - Generate config node_exporter -> 172.16.201.73 ... Done
  - Generate config node_exporter -> 172.16.201.159 ... Done
  - Generate config node_exporter -> 172.16.201.99 ... Done
  - Generate config blackbox_exporter -> 172.16.201.73 ... Done
  - Generate config blackbox_exporter -> 172.16.201.159 ... Done
  - Generate config blackbox_exporter -> 172.16.201.99 ... Done
+ [ Serial ] - Save meta
+ [ Serial ] - Restart Cluster
Stopping component alertmanager
        Stopping instance 172.16.201.73
        Stop alertmanager 172.16.201.73:59093 success
Stopping component grafana
        Stopping instance 172.16.201.73
        Stop grafana 172.16.201.73:43000 success
Stopping component prometheus
        Stopping instance 172.16.201.73
        Stop prometheus 172.16.201.73:59090 success
Stopping component tidb
        Stopping instance 172.16.201.159
        Stopping instance 172.16.201.73
        Stop tidb 172.16.201.159:54000 success
        Stop tidb 172.16.201.73:54000 success
Stopping component tikv
        Stopping instance 172.16.201.99
        Stopping instance 172.16.201.73
        Stopping instance 172.16.201.159
        Stop tikv 172.16.201.73:25160 success
        Stop tikv 172.16.201.99:25160 success
        Stop tikv 172.16.201.159:25160 success
Stopping component pd
        Stopping instance 172.16.201.73
        Stop pd 172.16.201.73:52379 success
Stopping component node_exporter
        Stopping instance 172.16.201.99
        Stopping instance 172.16.201.73
        Stopping instance 172.16.201.159
        Stop 172.16.201.73 success
        Stop 172.16.201.99 success
        Stop 172.16.201.159 success
Stopping component blackbox_exporter
        Stopping instance 172.16.201.99
        Stopping instance 172.16.201.73
        Stopping instance 172.16.201.159
        Stop 172.16.201.73 success
        Stop 172.16.201.99 success
        Stop 172.16.201.159 success
Starting component pd
        Starting instance 172.16.201.73:52379
        Start instance 172.16.201.73:52379 success
Starting component tikv
        Starting instance 172.16.201.99:25160
        Starting instance 172.16.201.73:25160
        Starting instance 172.16.201.159:25160
        Start instance 172.16.201.99:25160 success
        Start instance 172.16.201.73:25160 success
        Start instance 172.16.201.159:25160 success
Starting component tidb
        Starting instance 172.16.201.159:54000
        Starting instance 172.16.201.73:54000
        Start instance 172.16.201.159:54000 success
        Start instance 172.16.201.73:54000 success
Starting component prometheus
        Starting instance 172.16.201.73:59090
        Start instance 172.16.201.73:59090 success
Starting component grafana
        Starting instance 172.16.201.73:43000
        Start instance 172.16.201.73:43000 success
Starting component alertmanager
        Starting instance 172.16.201.73:59093
        Start instance 172.16.201.73:59093 success
Starting component node_exporter
        Starting instance 172.16.201.159
        Starting instance 172.16.201.99
        Starting instance 172.16.201.73
        Start 172.16.201.73 success
        Start 172.16.201.99 success
        Start 172.16.201.159 success
Starting component blackbox_exporter
        Starting instance 172.16.201.159
        Starting instance 172.16.201.99
        Starting instance 172.16.201.73
        Start 172.16.201.73 success
        Start 172.16.201.99 success
        Start 172.16.201.159 success
+ [ Serial ] - Reload PD Members
        Update pd-172.16.201.73-52379 peerURLs: [https://172.16.201.73:52380]
Enabled TLS between TiDB components for cluster `tidb-cc` successfully

Simulate a failure to stop node_exporter while enabling TLS

Check the member information

[tidb@vm172-16-201-73 ~]$ tiup ctl:v7.1.1 pd -u https://172.16.201.73:52379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt member

{
  "header": {
    "cluster_id": 7299695983639846267
  },
  "members": [
    {
      "name": "pd-172.16.201.73-52379",
      "member_id": 781118713925753452,
      "peer_urls": [
        "http://172.16.201.73:52380"
      ],
      "client_urls": [
        "https://172.16.201.73:52379"
      ],

The peer_urls entry in members is indeed http rather than https, so the problem is reproduced.
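
This check can be automated. The sketch below is an illustration only: it assumes the complete JSON printed by the pd-ctl `member` command has been captured as a string with no extra text around it, and the function name `non_tls_urls` is hypothetical. The sample data reuses the values from the output above.

```python
import json
from urllib.parse import urlsplit


def non_tls_urls(member_json: str):
    """Return (member name, field, url) for every PD URL that is not https."""
    doc = json.loads(member_json)
    bad = []
    for member in doc.get("members", []):
        for field in ("peer_urls", "client_urls"):
            for url in member.get(field, []):
                if urlsplit(url).scheme != "https":
                    bad.append((member.get("name"), field, url))
    return bad


# Sample built from the member output shown above.
sample = json.dumps({
    "members": [{
        "name": "pd-172.16.201.73-52379",
        "peer_urls": ["http://172.16.201.73:52380"],
        "client_urls": ["https://172.16.201.73:52379"],
    }]
})

for name, field, url in non_tls_urls(sample):
    print(f"{name}: {field} still uses {url}")
```

An empty result means every member advertises https for both peer and client URLs, which is the state to expect before scaling out additional PD nodes.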

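Summary

  1. Enabling TLS failed because node_exporter could not be stopped. The failure was considered unimportant, so the following steps were carried out anyway.
  2. The last step of enabling TLS updates the PD member configuration. Because stopping node_exporter failed, this step never ran.
  3. In future, confirm that TLS has been enabled successfully before moving on to the next step. pd-ctl can also be used to check the member status.
  4. If this happens again, disable TLS and then enable it again. If stopping node_exporter still takes too long, pass the --wait-timeout parameter to tiup and manually kill the node_exporter process on the host so that the TLS procedure can run to completion.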
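
The third point can be enforced mechanically by never starting a step after a failed one. The sketch below assumes that the command exits with a non-zero status when it fails (the TiUP log above records `"code": 1`, but treat the process exit status as an assumption). The function `run_steps` is hypothetical, and the demo uses stand-in commands rather than real tiup invocations so that it runs anywhere.

```python
import subprocess
import sys


def run_steps(steps):
    """Run commands in order and stop at the first one that exits non-zero.

    Returns the index of the failed step, or None if every step succeeded.
    """
    for index, argv in enumerate(steps):
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(
                f"step {index} failed with exit code {result.returncode}: {argv}",
                file=sys.stderr,
            )
            return index
    return None


# Stand-in commands; in practice each entry would be a tiup command line.
demo = [
    [sys.executable, "-c", "print('enable TLS (stand-in)')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],  # simulated failure
    [sys.executable, "-c", "print('scale out PD (stand-in)')"],
]

failed_at = run_steps(demo)
```

In the demo the third command never runs, which mirrors the rule that PD should not be scaled out until the TLS step has finished cleanly.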
