We are running:
user@primary/client-n:~$ psql -d database -c "SELECT version();"
version
---------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 10.7 (Ubuntu 10.7-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
(1 row)
on:
user@primary/client-n:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial
and have a setup with one primary and two streaming replication clients, configured with:
user@client-n:~$ psql -d postgres -c "SELECT name, setting FROM pg_settings WHERE name IN ( 'hot_standby', 'hot_standby_feedback', 'max_standby_streaming_delay' );"
name | setting
----------------------------+---------
hot_standby | on
hot_standby_feedback | on
max_standby_streaming_delay | 150000
(3 rows)
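For reference, those settings correspond to roughly the following lines in each client's postgresql.conf (note that pg_settings reports max_standby_streaming_delay in milliseconds, i.e. 2.5 minutes):
hot_standby = on                      # allow read-only queries while the server is in recovery
hot_standby_feedback = on             # report the standby's xmin back to the primary
max_standby_streaming_delay = 150000  # ms (2.5 min) before cancelling queries that conflict with WAL replay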
We have only one database (besides the default ones) and one table in that database. Roughly 3 to 4 times a day we run into a special autovacuum situation that the documentation describes as:
[...] (autovacuum) will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained
We are monitoring pg_locks and have been able to observe the autovacuum daemon taking the table-level ACCESS EXCLUSIVE lock, which in turn leads to a series of blocked processes on the clients, as illustrated by the log entries below; a rough sketch of the check we poll follows this paragraph.
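The pg_locks monitoring boils down to a query of roughly this shape, run periodically on each node (16390 is the relation OID that appears in the log entries below):
-- rough sketch of the periodic pg_locks check; 16390 is the relation OID from the logs
SELECT locktype, relation::regclass AS relation, pid, mode, granted
FROM pg_locks
WHERE relation = 16390
ORDER BY granted DESC, pid;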
Primary:
...
2019-06-04 05:59:29.154 BST [8998-1] LOG: automatic vacuum of table "database.schema.table": index scans: 1
...
Client 1:
...
2019-06-04 05:59:03.660 BST [21167-858] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 21167 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.222 ms
2019-06-04 05:59:03.660 BST [21167-859] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415.
2019-06-04 05:59:03.660 BST [21167-860] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
2019-06-04 05:59:03.730 BST [1215-51] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 1215 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.188 ms at character 15
2019-06-04 05:59:03.730 BST [1215-52] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415.
2019-06-04 05:59:03.730 BST [1215-53] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
2019-06-04 05:59:19.975 BST [22242-4569] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 22242 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.281 ms at character 15
2019-06-04 05:59:19.975 BST [22242-4570] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415, 2423, 1289, 24009, 22441, 2640, 1843, 1056, 23336, 28060, 1860, 1134, 19419, 14649, 2721, 29540, 20138, 22242.
2019-06-04 05:59:19.975 BST [22242-4571] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT...
...
And the process holding the lock:
postgres=# SELECT pid, backend_type, wait_event_type, wait_event FROM pg_stat_activity WHERE pid = 2741;
pid | backend_type | wait_event_type | wait_event
------+--------------+-----------------+----------------
2741 | startup | Activity | RecoveryWalAll
(1 row)
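To cross-check what that startup process is actually holding, a join of pg_locks against pg_stat_activity along these lines can be used (again with the relation OID taken from the logs):
-- which locks on relation 16390 are held or awaited, and by what kind of backend
SELECT a.pid, a.backend_type, l.relation::regclass AS relation, l.mode, l.granted
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 16390
ORDER BY l.granted DESC, a.pid;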
Client 2:
...
2019-06-04 06:00:08.964 BST [16153-1] [PostgreSQL JDBC Driver@ip_address(port):role@database] | FATAL: terminating connection due to conflict with recovery
2019-06-04 06:00:08.964 BST [16153-2] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: User was holding a relation lock for too long.
2019-06-04 06:00:08.964 BST [16153-3] [PostgreSQL JDBC Driver@ip_address(port):role@database] | HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-06-04 06:00:09.964 BST [5747-537] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 5747 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.248 ms
2019-06-04 06:00:09.964 BST [5747-538] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228.
2019-06-04 06:00:09.964 BST [5747-539] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
2019-06-04 06:00:09.975 BST [19765-6847] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 19765 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.180 ms
2019-06-04 06:00:09.975 BST [19765-6848] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228.
2019-06-04 06:00:09.975 BST [19765-6849] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
2019-06-04 06:01:25.487 BST [15873-1] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 15873 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.218 ms at character 15
2019-06-04 06:01:25.487 BST [15873-2] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228, 16127, 16285, 15873.
2019-06-04 06:01:25.487 BST [15873-3] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
2019-06-04 06:01:29.160 BST [16127-6] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 16127 acquired AccessShareLock on relation 16390 of database 16388 after 8560.748 ms at character 15
2019-06-04 06:01:29.160 BST [16127-7] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ...
...
And the process holding the lock, again:
postgres=# SELECT pid, backend_type, wait_event_type, wait_event FROM pg_stat_activity WHERE pid = 12709;
pid | backend_type | wait_event_type | wait_event
------+--------------+-----------------+----------------
12709 | startup | Activity | RecoveryWalAll
(1 row)
The blocked queries on the clients lead to API latencies of 10 to 20 seconds and, occasionally, to a high number of 5xx responses. Our SRE team has been tasked with reducing the API latency during these incidents, and we are looking for ways to address what we understand to be a niche situation. At the moment we are experimenting with recovery_min_apply_delay = 120s on client 1 (hence its later log entries above) so that the two clients do not block at the same time. This has somewhat reduced the number of error responses and slightly lowered the latency spikes. We are not sure how to resolve this issue completely, or indeed whether that is possible at all, and would appreciate your advice. We found this related post, but unfortunately it was not resolved either.
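For completeness, the experiment on client 1 amounts to one extra line in that standby's recovery.conf (PostgreSQL 10 syntax; the primary_conninfo value is a placeholder, not our real connection string):
standby_mode = 'on'
primary_conninfo = 'host=primary port=5432 user=replication_user'  # placeholder connection string
recovery_min_apply_delay = '120s'  # delay WAL apply so the two clients do not hit the truncation lock simultaneously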