The problem I ran into was the following:

2023-08-17 20:24:21.566 CST [1556001] LOG: database system was interrupted; last known up at 2023-08-17 20:21:41 CST
2023-08-17 20:24:21.770 CST [1556001] LOG: restored log file "00000009.history" from archive
cp: 无法获取'/home/postgres/pgarch/0000000A.history' 的文件状态(stat): 没有那个文件或目录
2023-08-17 20:24:21.771 CST [1556001] LOG: entering standby mode
2023-08-17 20:24:21.772 CST [1556001] LOG: restored log file "00000009.history" from archive
cp: 无法获取'/home/postgres/pgarch/000000090000010200000066' 的文件状态(stat): 没有那个文件或目录
2023-08-17 20:24:21.784 CST [1556001] LOG: restored log file "000000080000010200000066" from archive
2023-08-17 20:24:21.851 CST [1556001] FATAL: requested timeline 9 is not a child of this server's history
2023-08-17 20:24:21.851 CST [1556001] DETAIL: Latest checkpoint is at 102/66000060 on timeline 8, but in the history of the requested timeline, the server forked off from that timeline at 102/580000A0.
2023-08-17 20:24:21.851 CST [1555991] LOG: startup process (PID 1556001) exited with exit code 1
2023-08-17 20:24:21.851 CST [1555991] LOG: aborting startup due to startup process failure
2023-08-17 20:24:21.851 CST [1555991] LOG: database system is shut down

The cause of the above was a split-brain in the repmgr cluster: two nodes were running as primary at the same time.
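To make the FATAL message above more concrete: it compares the latest checkpoint (102/66000060 on timeline 8) with the point where timeline 9 forked off timeline 8 (102/580000A0). The following is a minimal bash sketch of that comparison, not PostgreSQL's actual code; the function names lsn_to_int and diverged_past_fork are made up for this illustration.

```shell
#!/usr/bin/env bash
# Convert a PostgreSQL LSN such as 102/66000060 into a 64-bit integer.
lsn_to_int() {
  local hi=${1%/*} lo=${1#*/}
  echo $(( (16#$hi << 32) | 16#$lo ))
}

# Succeed if the checkpoint LSN lies beyond the fork point, i.e. this data
# directory kept writing on the old timeline after the new one branched off.
diverged_past_fork() {
  local checkpoint=$1 fork=$2
  (( $(lsn_to_int "$checkpoint") > $(lsn_to_int "$fork") ))
}

# Values taken from the DETAIL line of the log above.
if diverged_past_fork 102/66000060 102/580000A0; then
  echo "checkpoint is past the fork point: this node cannot follow timeline 9"
fi
```

Since 102/66000060 is greater than 102/580000A0, the data on timeline 8 contains changes that timeline 9 never saw, which is why the server refuses to switch to it.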

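On host db206 I set shared_preload_libraries = 'pg_stat_statements' and tried to restart, but the server would not start. At the time I blamed this on not having created the pg_stat_statements extension beforehand. The error below, however, says that the library file itself could not be found, which points to the contrib module not being installed for this PostgreSQL build; preloading a library does not require CREATE EXTENSION.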

[postgres@db206 data]$ vi postgresql.conf
[postgres@db206 data]$ pg_ctl restart
waiting for server to shut down...... done
server stopped
waiting for server to start....2023-08-17 18:11:53.086 CST [6497] FATAL: could not access file "pg_stat_statements": 没有那个文件或目录
2023-08-17 18:11:53.086 CST [6497] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
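A restart like this one can be guarded by checking that the library file exists before touching a running primary. This is a minimal sketch that assumes a Linux build where modules are installed as .so files; the function name preload_libs_present is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only if every library named in a comma-separated
# shared_preload_libraries value has a .so file in the given directory.
preload_libs_present() {
  local libdir=$1 libs=$2 lib
  for lib in ${libs//,/ }; do
    if [ ! -f "$libdir/$lib.so" ]; then
      echo "missing: $libdir/$lib.so" >&2
      return 1
    fi
  done
  return 0
}

# Usage (pg_config must belong to the same installation as the server):
#   preload_libs_present "$(pg_config --pkglibdir)" pg_stat_statements && pg_ctl restart
```

With a check like this, the failed restart (and the failover it triggered) would probably have been avoided, because pg_ctl restart is only reached when the file is present.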

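At this point I edited postgresql.conf again, removed shared_preload_libraries = 'pg_stat_statements', and started the database. It came up, and I tried to create the extension, but by then the standby had already taken over as primary.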

I then thought of setting shared_preload_libraries = 'pg_stat_statements' on db223 first (that is, adding it on the standby before the primary).

[postgres@db223 ~]$ vi pg14/data/postgresql.conf

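That is when I noticed there were two primaries. I did not know why at the time; presumably the automatic failover promoted db223 while db206 was down, and db206 then came back up as a primary as well. The timelines also differed: the new primary was on timeline 9, the old primary on timeline 8.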

[postgres@db206 data]$ repmgr -f ~/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+----------------------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | db223 | standby | ! running as primary | | default | 100 | 9 | host=db223 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2 | db206 | primary | * running | | default | 100 | 8 | host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2

WARNING: following issues were detected
- node "db223" (ID: 1) is registered as standby but running as primary
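A split-brain like this can be spotted automatically by counting how many rows of the cluster show table are acting as primary. The sketch below only parses the text layout seen in this post; the function name count_acting_primaries and the matching rules are assumptions, not part of repmgr.

```shell
#!/usr/bin/env bash
# Read "repmgr cluster show" output on stdin and print how many nodes are
# acting as primary, judged by the Role and Status columns.
count_acting_primaries() {
  awk -F'|' '
    NF >= 4 {
      role = $3; status = $4
      gsub(/ /, "", role)
      # Either registered as standby but running as primary,
      # or registered as primary and running (not demoted to standby).
      if (status ~ /running as primary/ ||
          (role == "primary" && status ~ /running/ && status !~ /as standby/))
        n++
    }
    END { print n + 0 }'
}

# Usage:
#   repmgr -f ~/repmgr/repmgr.conf cluster show | count_acting_primaries
```

Any result greater than 1 means more than one node is accepting writes and needs manual attention before anything is re-registered.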

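I tried to re-register the node as a standby, starting with an unregister, but the command could not connect because the database was not running.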

[postgres@db206 data]$ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister
INFO: connecting to local standby
ERROR: connection to database failed
DETAIL:
connection to server at "db206" (172.20.101.206), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?

DETAIL: attempted to connect using:
user=repmgr password=repmgr connect_timeout=2 dbname=repmgr host=db206 fallback_application_name=repmgr options=-csearch_path=

After starting the database and rerunning the command, repmgr reported that this node is a primary, not a standby.

[postgres@db206 data]$ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister
INFO: connecting to local standby
INFO: connecting to primary database
ERROR: node 2 is not a standby server

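Next I tried to unregister the primary, but repmgr reported that node db223 still has this node as its upstream, and hinted at using "repmgr standby follow" to make such nodes follow the current primary.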

[postgres@db206 data]$ repmgr -f /home/postgres/repmgr/repmgr.conf primary unregister
ERROR: 1 other node still has this node as its upstream node
HINT: ensure these nodes are following the current primary with "repmgr standby follow"
DETAIL: the affected node(s) are:
db223 (ID: 1)

I then tried to rejoin db223 to the cluster, only to find that the command cannot be executed on a running node.

[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf node rejoin -d 'host=db206 port=5432 user=repmgr dbname=repmgr password=repmgr'
ERROR: database is still running in state "in production"
HINT: "repmgr node rejoin" cannot be executed on a running node

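After stopping the database I ran it again, and this time there was no error. Note, however, that the output ends with a hint to provide --force-rewind, so the node does not appear to have actually been rejoined.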

[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf node rejoin -d 'host=db206 port=5432 user=repmgr dbname=repmgr password=repmgr' -F
NOTICE: rejoin target is node "db206" (ID: 2)
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
HINT: provide --force-rewind

I restarted db223 and found that it still came up as a primary, which was very frustrating.

pg_ctl start

[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | db223 | primary | * running | | default | 100 | 9 | host=db223 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2 | db206 | primary | ! running | | default | 100 | 8 | host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2

WARNING: following issues were detected
- node "db206" (ID: 2) is running but the repmgr node record is inactive

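Would adding the pg_rewind step fix it? It did not: pg_rewind could not read a WAL file on timeline 9. I did not understand why it wanted timeline 9; my guess was that the node was still being treated as a primary.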

[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf node rejoin -d 'host=db206 port=5432 user=repmgr dbname=repmgr password=repmgr' --force-rewind
NOTICE: rejoin target is node "db206" (ID: 2)
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/home/postgres/pg14/bin/pg_rewind -D '/home/postgres/pg14/data' --source-server='host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2'"
ERROR: pg_rewind execution failed
DETAIL: pg_rewind: servers diverged at WAL location 102/580000A0 on timeline 8
pg_rewind: error: could not open file "/home/postgres/pg14/data/pg_wal/000000090000010200000058": 没有那个文件或目录
pg_rewind: fatal: could not find previous WAL record at 102/580000A0
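The file name in the pg_rewind error can be decoded to see what it was looking for. pg_rewind runs against db223's own data directory, which is on timeline 9, so it looks for db223's own WAL segment covering the divergence point 102/580000A0; the file was presumably no longer in pg_wal. The sketch below assumes the default 16 MB WAL segment size, and decode_wal_name is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Split a 24-hex-digit WAL segment file name into its timeline and the
# LSN at which the segment starts (assumes 16 MB segments).
decode_wal_name() {
  local name=$1
  local tli=$(( 16#${name:0:8} ))
  local log=${name:8:8} seg=${name:16:8}
  # With 16 MB segments, segment N of log L starts at LSN L/NN000000.
  printf 'timeline=%d start_lsn=%X/%X\n' "$tli" "$(( 16#$log ))" "$(( 16#$seg << 24 ))"
}

decode_wal_name 000000090000010200000058
# prints: timeline=9 start_lsn=102/58000000
```

So the missing file is the timeline 9 segment starting at 102/58000000, which is exactly the segment containing the divergence point 102/580000A0.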

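The last resort was to wipe the data directory and clone again. What got deleted here was the timeline 9 data on db223. The clone itself succeeded, but pg_ctl start then failed.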

[postgres@db223 data]$ rm -rf *
[postgres@db223 data]$ ll
总用量 0
[postgres@db223 data]$ repmgr -h db206 -U repmgr -d repmgr -f /home/postgres/repmgr/repmgr.conf standby clone
NOTICE: destination directory "/home/postgres/pg14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=db206 user=repmgr dbname=repmgr
DETAIL: current installation size is 12 GB
INFO: replication slot usage not requested; no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/home/postgres/pg14/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
/home/postgres/pg14/bin/pg_basebackup -l "repmgr base backup" -D /home/postgres/pg14/data -h db206 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /home/postgres/pg14/data start
HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record
[postgres@db223 data]$ ^C
[postgres@db223 data]$ pg_ctl start
waiting for server to start....2023-08-17 19:48:33.265 CST [1532642] LOG: redirecting log output to logging collector process
2023-08-17 19:48:33.265 CST [1532642] HINT: Future log output will appear in directory "log".
stopped waiting
pg_ctl: could not start server

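The log is the same one shown at the top of this post. The freshly cloned standby still tries to follow timeline 9, but the primary db206 is on timeline 8 and has no timeline 9. Another breakdown...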

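I then looked at the parameters on db223, wondering whether the archive path being read was wrong, and came across recovery_target_timeline, which controls which timeline recovery follows.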

archive_mode = on

archive_command = 'scp %p postgres@172.20.101.208:/home/postgres/pgarch/%f'

archive_cleanup_command = 'pg_archivecleanup /home/postgres/pgarch %r'

restore_command = 'cp /home/postgres/pgarch/%f %p'

recovery_target_timeline = 'latest'

After changing it to recovery_target_timeline = 'current', db223 started successfully.
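One plausible reading of why 'latest' failed: the startup log shows 00000009.history being restored from the archive, so with 'latest' the new standby targeted timeline 9 even though it had been cloned from db206 on timeline 8. A quick way to see which timeline 'latest' would pick is to look for the highest .history file in the archive. This is a rough sketch; latest_archived_timeline is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print the highest timeline ID that has a .history file in an archive
# directory, or 0 if there is none.
latest_archived_timeline() {
  local dir=$1 f tli max=0
  for f in "$dir"/*.history; do
    [ -e "$f" ] || continue            # glob matched nothing
    tli=$(basename "$f" .history)      # e.g. 00000009
    tli=$(( 16#$tli ))                 # timeline IDs are hexadecimal
    if (( tli > max )); then max=$tli; fi
  done
  echo "$max"
}

# Usage, with the archive path from restore_command above:
#   latest_archived_timeline /home/postgres/pgarch
```

If this prints a timeline higher than the primary's, the stale history file from the abandoned promotion is still in the archive. Keep in mind that with 'current' the standby will probably not follow a new timeline after a future failover, so it may be worth cleaning up the archive and going back to 'latest' afterwards.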

[postgres@db206 ~]$ repmgr -f ~/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | db223 | standby | running | db206 | default | 100 | 8 | host=db223 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2 | db206 | primary | * running | | default | 100 | 8 | host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2