We are running:
user@primary/client-n:~$ psql -d database -c "SELECT version();"
version
---------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 10.7 (Ubuntu 10.7-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
(1 row)
on:
user@primary/client-n:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial
and have one primary and two streaming replication clients set up with the following configuration:
user@client-n:~$ psql -d postgres -c "SELECT name, setting FROM pg_settings WHERE name IN ( 'hot_standby', 'hot_standby_feedback', 'max_standby_streaming_delay' );"
name | setting
----------------------------+---------
hot_standby | on
hot_standby_feedback | on
max_standby_streaming_delay | 150000
(3 rows)
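As a sanity check (not part of the incident data below), we confirm that both clients are attached and streaming with a query along the following lines on the primary; pg_stat_replication and its replay_lag column are available as of PostgreSQL 10:
user@primary:~$ psql -d postgres -c "SELECT application_name, client_addr, state, sync_state, replay_lag FROM pg_stat_replication;"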
We have only one database (apart from the standard ones) and a single table in it. Roughly 3 to 4 times a day we run into the special-case autovacuum behaviour that the documentation describes as:
[...] it (autovacuum) does not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained
We are monitoring pg_locks and have been able to observe the autovacuum daemon taking a table-level ACCESS EXCLUSIVE lock, which in turn blocks a large number of processes on the clients, as the log entries below show:
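The monitoring boils down to a query of roughly the following shape (a sketch, not our exact query; relation::regclass resolves the lock's oid to the table name, and backend_type identifies the autovacuum worker):
postgres=# SELECT l.pid, l.relation::regclass, l.mode, l.granted, a.backend_type FROM pg_locks l JOIN pg_stat_activity a USING (pid) WHERE l.mode = 'AccessExclusiveLock';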
Primary:
...
2019-06-04 05:59:29.154 BST [8998-1] LOG: automatic vacuum of table "database.schema.table": index scans: 1
...
Client 1:
...
2019-06-04 05:59:03.660 BST [21167-858] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 21167 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.222 ms
2019-06-04 05:59:03.660 BST [21167-859] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415.
2019-06-04 05:59:03.660 BST [21167-860] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
2019-06-04 05:59:03.730 BST [1215-51] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 1215 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.188 ms at character 15
2019-06-04 05:59:03.730 BST [1215-52] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415.
2019-06-04 05:59:03.730 BST [1215-53] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
2019-06-04 05:59:19.975 BST [22242-4569] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 22242 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.281 ms at character 15
2019-06-04 05:59:19.975 BST [22242-4570] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415, 2423, 1289, 24009, 22441, 2640, 1843, 1056, 23336, 28060, 1860, 1134, 19419, 14649, 2721, 29540, 20138, 22242.
2019-06-04 05:59:19.975 BST [22242-4571] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT...
...
and the process holding the lock:
postgres=# SELECT pid, backend_type, wait_event_type, wait_event FROM pg_stat_activity WHERE pid = 2741;
pid | backend_type | wait_event_type | wait_event
------+--------------+-----------------+----------------
2741 | startup | Activity | RecoveryWalAll
(1 row)
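As far as we can tell, the startup process is the WAL-replay process on the standby, which takes the same ACCESS EXCLUSIVE lock while replaying the truncation shipped from the primary. During an incident the blocked backends and the pid blocking them can also be listed with pg_blocking_pids(), available since 9.6; a sketch:
postgres=# SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, wait_event FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;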
Client 2:
...
2019-06-04 06:00:08.964 BST [16153-1] [PostgreSQL JDBC Driver@ip_address(port):role@database] | FATAL: terminating connection due to conflict with recovery
2019-06-04 06:00:08.964 BST [16153-2] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: User was holding a relation lock for too long.
2019-06-04 06:00:08.964 BST [16153-3] [PostgreSQL JDBC Driver@ip_address(port):role@database] | HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-06-04 06:00:09.964 BST [5747-537] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 5747 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.248 ms
2019-06-04 06:00:09.964 BST [5747-538] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228.
2019-06-04 06:00:09.964 BST [5747-539] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
2019-06-04 06:00:09.975 BST [19765-6847] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 19765 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.180 ms
2019-06-04 06:00:09.975 BST [19765-6848] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228.
2019-06-04 06:00:09.975 BST [19765-6849] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
2019-06-04 06:01:25.487 BST [15873-1] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 15873 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.218 ms at character 15
2019-06-04 06:01:25.487 BST [15873-2] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228, 16127, 16285, 15873.
2019-06-04 06:01:25.487 BST [15873-3] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
2019-06-04 06:01:29.160 BST [16127-6] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 16127 acquired AccessShareLock on relation 16390 of database 16388 after 8560.748 ms at character 15
2019-06-04 06:01:29.160 BST [16127-7] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
and again the process holding the lock:
postgres=# SELECT pid, backend_type, wait_event_type, wait_event FROM pg_stat_activity WHERE pid = 12709;
pid | backend_type | wait_event_type | wait_event
------+--------------+-----------------+----------------
12709 | startup | Activity | RecoveryWalAll
(1 row)
The blocked queries on the clients cause API latencies of between 10 and 20 seconds, sometimes accompanied by a burst of 5xx responses. Our SRE team has been tasked with reducing the API latency during these incidents, so we are looking for ways to tackle what we believe is a rather niche problem. We are currently experimenting with recovery_min_apply_delay = 120s on client 1 (hence the later timestamps in its log entries) so that the two clients do not take the lock at the same time. This has somewhat reduced the number of error responses and slightly flattened the latency spikes. We are not sure how to solve the problem completely, or indeed whether that is possible at all, and would appreciate any advice. We did find this related post, but unfortunately it remained unresolved as well.
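For reference, the experiment on client 1 amounts to the following lines in recovery.conf on that standby (a minimal sketch; the connection string is elided):
standby_mode = 'on'
primary_conninfo = '...'            # connection to the primary (elided)
recovery_min_apply_delay = '120s'   # delay WAL application by two minutes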