我有两个具有以下规格的服务器:
- 8 个 vCPU、32768 MB RAM、640 GB SSD
主 Postgres 13.3 数据库 (db1) 安装在第一台服务器 (Ubuntu 16.04.7) 上,配置如下:
shared_buffers = 16GB
work_mem = 128MB
maintenance_work_mem = 8GB
effective_cache_size = 16GB
effective_io_concurrency = 400
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
wal_level = logical
synchronous_commit = on
max_wal_size = 4GB
min_wal_size = 32MB
wal_keep_size = 16384
wal_sender_timeout = 60s
checkpoint_completion_target = 0.7
synchronous_standby_names = 'FIRST 1 (db2_slave)'
max_standby_archive_delay = 1800s
max_standby_streaming_delay = 1800s
备用数据库是安装在第二台服务器(Ubuntu 20.04.3)上的 Postgres 13.4 数据库(db2),配置如下:
shared_buffers = 24GB
work_mem = 128MB
maintenance_work_mem = 16GB
effective_cache_size = 24GB
effective_io_concurrency = 400
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
wal_level = logical
synchronous_commit = on
max_wal_size = 4GB
min_wal_size = 32MB
checkpoint_completion_target = 0.7
primary_conninfo = 'host=... port=5432 user=repluser passfile=''...'' application_name=db2_slave'
primary_slot_name = 'db2'
hot_standby = on
max_standby_archive_delay = 1800s
max_standby_streaming_delay = 1800s
如果我在备用服务器上运行 iotop -u postgresql ,我会看到两个进程:
2229172 postgres: 13/main: walreceiver streaming DDFD/8E9FE9E0
2229138 postgres: 13/main: startup recovering 000000010000DDFD0000008E
在我在待机(SELECT COUNT(*) FROM big_table;
2229138 postgres: 13/main: startup recovering 000000010000DE0400000017 waiting
我在master上运行了这个查询:
SELECT client_addr as client,
usename as user,
application_name as name,
state,
sync_state as mode,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as pending,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn)) as write,
pg_size_pretty(pg_wal_lsn_diff(write_lsn, flush_lsn)) as flush,
pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn)) as replay,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as total_lag
FROM pg_stat_replication;
输出是:
client | user | name | state | mode | pending | write | flush | replay | total_lag
-------------+----------+-----------+-----------+------+---------+---------+---------+--------+-----------
... | repluser | db2_slave | streaming | sync | 0 bytes | 0 bytes | 0 bytes | 21 MB | 21 MB
(1 row)
如果我多次执行此请求,则在执行此查询 ( SELECT COUNT(*) FROM big_table
) 期间,重播和总延迟会一直增加。因此,我想知道问题的答案:
- 为什么在执行副本的分析查询期间重播延迟不断增加?
- 为什么我一开始对备用服务器的请求,恢复进程就处于“等待”状态?
原因是恢复冲突。数据库可以延迟复制或取消冲突查询。这由
max_standby_streaming_delay
备用数据库上的参数控制,该参数确定 PostgreSQL 在取消有问题的查询之前准备好延迟重播复制信息的时间。hot_standby_feedback
您可以通过在备用服务器上设置为来减少冲突的数量on
(冒着使主服务器上的表膨胀的风险),但这并不能完全解决问题。本质上,您将无法在备用数据库上运行不间断的查询,并且同时没有复制延迟。
有关详细信息,请参阅我关于该主题的文章。
重播需要删除运行计数的查询仍然有权查看的元组。所以重播会等待。