文章目录

1. 要求
2. 主库设置
3. 主库创建复制槽、复制用户
4. 导出主库
5. 备库设置
6. 备库创建standby.signal
7. 启动备库
8. 备库创建复制槽、查询数据
9. 查看日志
10. 测试切换
11. 同步测试
12. 监控主备状态

1. 要求

1.使用流复制，配置复制槽
2.只有一个备库时不配置同步，多个备库时ANY
3.测试切换

2. 主库设置

postgresql.conf

# WAL参数
# wal_level=replica              # 默认replica
# synchronous_commit=on     # 默认为on。只要不配置synchronous_standby_names，这里不配置off结果都一样。
# full_page_writes=on            # 默认为on，必须为on
wal_compression=on    #默认为off，开启wal压缩
# wal_writer_delay   wal_writer_flush_after 这两参数就用默认值，如果写磁盘过于频繁增加该值

# checkpoint参数
checkpoint_timeout # 默认5min，增加能提高性能但要崩溃后恢复的时间更长。checkpoint对数据库的影响还是挺大的。
checkpoint_completion_target=0.8 # 用上面0.8倍的时间进行checkpoint，平滑I/O

# 归档参数
archive_mode=always  # 默认为off。主备环境要开启，万一WAL没了还能用归档拿。设置为always切换后也不用再设置归档。
archive_command='test ! -f /home/postgres/pg_arch/%f && cp %p /home/postgres/pg_arch/%f'   # 改成实际路径
# archive_timeout    默认不切换，流复制不需要发送wal文件不需要切换
# archive_cleanup_command 主库不会自动清除，备库会自动清除
# restore_command # 在备库上配置时是通过log方式复制，流复制不需要。主库上配置时仅在recovery时有效果。
archive_cleanup_command ='pg_archivecleanup /home/postgres/pg_arch %r' 

# send 配置
# max_wal_senders        # 默认10，够用
# max_replication_slots  # 默认10，够用
wal_keep_segments=64     # 保留1G应该够了吧
track_commit_timestamp=on # 开启保留时间戳功能

# 主库设置
# synchronous_standby_names 不配置，不设置同步，除非有明确需求
# vacuum_defer_cleanup_age 默认为0，立即删除。

pg_hba.conf

host    replication     standby     192.168.0.38  md5

3. 主库创建复制槽、复制用户

create database test;
\c test;
create user replicat_user replication password 'bfd@dev24';
SELECT * FROM pg_create_physical_replication_slot('slot_01');
SELECT slot_name, slot_type, active FROM pg_replication_slots;
create table t1(id int);
insert into t1 values(1),(2),(3);

4. 导出主库

在备库执行，需要有superuser或者replication的权限才行.
用pg_basebackup导过来之后，整个数据目录都导过来了，包括数据文件，因此只需要修改配置文件就行了

pg_basebackup -h abc07 -U replicat_user -D pg_data -X s

5. 备库设置

postgres.conf

Primary_conninfo = 'host=abc07 port=5432 user=replicat_user password=bfd@dev24 '
primary_slot_name = 'slot_01'

pg_hba.conf

host    replication     standby      192.168.0.37  md5

6. 备库创建standby.signal

数据目录下：

touch standby.signal

一定要用启动数据库的用户创建

7. 启动备库

略

8. 备库创建复制槽、查询数据

同步后replication_user用户也被同步到备库，不需要再创建。

SELECT * FROM pg_create_physical_replication_slot('slot_01');
SELECT slot_name, slot_type, active FROM pg_replication_slots;
select * from t1;

9. 查看日志

2023-04-28 09:24:02.312 CST [26908] LOG:  started streaming WAL from primary at 0/7000000 on timeline 1

关闭主库后

2023-04-28 09:55:11.791 CST [26908] LOG:  replication terminated by primary server
2023-04-28 09:55:11.791 CST [26908] DETAIL:  End of WAL reached on timeline 1 at 0/80000A0.
2023-04-28 09:55:11.791 CST [26908] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2023-04-28 09:55:11.791 CST [26903] LOG:  invalid record length at 0/80000A0: wanted 24, got 0
2023-04-28 09:55:11.807 CST [32659] FATAL:  could not connect to the primary server: server closed the connection un
expectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2023-04-28 09:55:16.798 CST [32669] FATAL:  could not connect to the primary server: could not connect to server: Co
nnection refused
                Is the server running on host "abc07" (192.168.0.37) and accepting
                TCP/IP connections on port 5432?

10. 测试切换

https://blog.csdn.net/pg_hgdb/article/details/125264019
不设置promote_trigger_file，使用pg_ctl promote或者pg_promote来提升备库为主库。任务有执行该函数权限的用户都可以执行。

pg_promote(wait boolean DEFAULT true, wait_seconds integer DEFAULT 60)。 当wait为true时直到提升成功或者达到wait_seconds的时间才返回true，当wait为false时，直接返回true。

可以使用以下两种方式：

$ pg_ctl promote    [-D DATADIR] [-W] [-t SECS] [-s]

或者：

postgres=# select pg_promote(true,60);
 pg_promote 
------------
 t
(1 row)

提升后日志显示：

2023-04-28 10:05:35.761 CST [26903] LOG:  received promote request
2023-04-28 10:05:35.761 CST [26903] LOG:  redo done at 0/8000028
2023-04-28 10:05:35.762 CST [26903] LOG:  selected new timeline ID: 2
2023-04-28 10:05:35.882 CST [26903] LOG:  archive recovery complete
2023-04-28 10:05:35.891 CST [26899] LOG:  database system is ready to accept connections

此时数据目录中standby.signal没有了。

11. 同步测试

修改备库配置文件，或者在命令行修改。

Primary_conninfo = 'host=abc07 port=5432 user=replicat_user password=bfd@dev24 application_name=s38'

修改主库配置文件，或者在命令行修改。

synchronous_commit=on
synchronous_standby_names='1 (s38)'

测试关闭备库，然后看主库提交等待情况：
在这里插入图片描述

一直卡住不提交。
从这点看，如果没有监控备库的状态就用同步，会对主库造成致命问题。
所以，要用ANY的策略，ANY能保证至少X个备库同步完成就行，并且尝试同步所有备库。
当仅有一个备库时，不做同步复制，太危险了

12. 监控主备状态

主库上的pg_stat_replication记录了复制信息：
https://www.postgresql.org/docs/12/monitoring-stats.html


postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 19189
usesysid         | 16386
usename          | replicat_user
application_name | s38
client_addr      | 192.168.0.38
client_hostname  | 
client_port      | 12214
backend_start    | 2023-04-28 14:42:16.180019+08
backend_xmin     | 
state            | streaming
sent_lsn         | 0/B0004B8
write_lsn        | 0/B0004B8
flush_lsn        | 0/B0004B8
replay_lsn       | 0/B0004B8
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 1
sync_state       | sync
reply_time       | 2023-04-28 15:43:23.142803+08

sent_lsn：发送的WAL日志位置
write_lsn：standby写入到磁盘的WAL的位置。(写到缓冲)
flush_lsn：standby flush到磁盘的WAL的位置。(写到磁盘上)
replay_lsn：已经Apply的WAL的位置
write_lag：主库WAL本地化与standby romote_write(写到缓冲)之间的lag。(同步复制时才有)
flush_lag：主库WAL本地化与standby 写到磁盘之间的lag。(同步复制时才有)
replay_lag: 主库WAL本地化与standby apply之间的lag。(同步复制时才有)

sync_state：
备库的状态，如下：
async：该standy异步复制。不在synchronous_standby_names中的都是异步。
potential：该standy现在是异步复制，但是在同步的standby down的时候有可能成为同步的
sync：该standy同步复制
quorum：备选的standbys。意义不明

reply_time：上次发送的apply的时间

主备的pg_current_wal_lsn()记录了当前的WAL位置。

postgres=# select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 0/B0004B8
(1 row)

备库的pg_last_wal_receive_lsn()记录了接收位置：

postgres=# select pg_last_wal_receive_lsn();
 pg_last_wal_receive_lsn 
-------------------------
 0/B0004B8
(1 row)

备库的pg_stat_wal_receiver显示复制信息：

postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid                   | 21791
status                | streaming
receive_start_lsn     | 0/B000000
receive_start_tli     | 1
received_lsn          | 0/B0004B8
received_tli          | 1
last_msg_send_time    | 2023-04-28 17:23:04.078542+08
last_msg_receipt_time | 2023-04-28 17:23:04.111514+08
latest_end_lsn        | 0/B0004B8
latest_end_time       | 2023-04-28 14:42:16.182656+08
slot_name             | slot_01
sender_host           | abc07
sender_port           | 5432
conninfo              | user=replicat_user password=******** dbname=replication host=abc07 port=5432 application_name=s38 fallback_application_name=walreceiver sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any

通过主备的视图能比较准确的判断复制的延迟：
https://www.postgresql.org/docs/12/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

select pid,usename,client_addr,state,
pg_wal_lsn_diff(pg_current_wal_lsn(),write_lsn) write_delay,
pg_wal_lsn_diff(pg_current_wal_lsn(),flush_lsn) flush_delay,
pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) replay_dely
from pg_stat_replication;