I have two servers running PostgreSQL 10.11 that are configured identically, run on comparable hardware, and contain the same databases and (mostly) the same tables.
Several Nagios checks using check_postgres monitor these servers routinely, including one that checks for "bloat".
On one server, this bloat check takes 3 to 4 seconds.
On the other server, the bloat check takes 24 to 30 minutes. EDIT: It takes 55 to 60 minutes as of January 14, 2020.
Also, this condition started 3 days ago. Before that, the bloat check ran in a reasonable amount of time (under ~10 seconds) on the now "slow" server.
After further investigation into the "check_bloat" query used by check_postgres, I found what I think may be the reason it is so much slower on one server than on the other. The pg_stats view contains approximately 815,000 entries on the slow server. On the "fast" server, it contains only ~45,000 entries. EDIT: Forty-odd days later, pg_stats contains ~838,000 entries as of January 14, 2020.
What can I do about this? Is there a way to shrink the pg_stats (pg_statistic) table?
I tried restarting the "slow" server and running vacuum analyze; on it. No effect.
I tried running select pg_stat_reset() in every database on the "slow" server, but it didn't seem to do anything. (The number of rows in pg_stats did not change.)
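A note on why the reset had no effect: pg_stat_reset() clears the cumulative activity counters behind the pg_stat_* views, whereas pg_statistic holds the planner statistics that ANALYZE writes, roughly one row per analyzed column. Its row count therefore tracks how many columns (and so how many tables) exist. A minimal sketch for comparing the relevant catalogs on both servers, run once per database, might look like this (the column aliases are illustrative; reltuples is only an estimate):

```sql
-- Estimated row counts and on-disk sizes of the catalogs the bloat query reads.
-- reltuples is the planner's estimate, refreshed by VACUUM/ANALYZE, not an exact count.
SELECT c.relname                                     AS catalog,
       c.reltuples::bigint                           AS estimated_rows,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'pg_catalog'
  AND c.relname IN ('pg_statistic', 'pg_attribute', 'pg_class')
ORDER BY pg_total_relation_size(c.oid) DESC;
```

If pg_attribute and pg_class are also far larger on the slow server, the cause is probably the number of tables rather than anything specific to the statistics.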
EDIT: Here is the SQL:
SELECT ns.nspname, tbl.relname, hdr, ma, bs,
       SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth,
       MAX(coalesce(null_frac,0)) AS maxfracsum,
       hdr+(
         SELECT 1+count(*)/8
         FROM pg_stats s2
         WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
       ) AS nullhdr
FROM pg_attribute att
JOIN pg_class tbl ON att.attrelid = tbl.oid
JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
LEFT JOIN pg_stats s ON s.schemaname=ns.nspname
     AND s.tablename = tbl.relname
     AND s.inherited=false
     AND s.attname=att.attname,
(
  SELECT
    (SELECT current_setting('block_size')::numeric) AS bs,
    CASE WHEN SUBSTRING(SPLIT_PART(v, ' ', 2) FROM '#\"[0-9]+.[0-9]+#\"%' for '#')
      IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
    CASE WHEN v ~ 'mingw32' OR v ~ '64-bit' THEN 8 ELSE 4 END AS ma
  FROM (SELECT version() AS v) AS foo
) AS constants
WHERE att.attnum > 0 AND tbl.relkind='r'
GROUP BY 1,2,3,4,5;
EDIT: Here is the EXPLAIN (ANALYZE,BUFFERS) output from the "slow" server:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1935395.99..4529587255.06 rows=4097995 width=188) (actual time=106609.972..396360.540 rows=96069 loops=1)
Group Key: ns.nspname, tbl.relname, CASE WHEN ("substring"(split_part(version(), ' '::text, 2), '#\"[0-9]+.[0-9]+#\"%'::text, '#'::text) = ANY ('{8.0,8.1,8.2}'::text[])) THEN 27 ELSE 23 END, CASE WHEN ((version() ~ 'mingw32'::text) OR (version() ~ '64-bit'::text)) THEN 8 ELSE 4 END, $0
Buffers: shared hit=81434823 read=621336, temp read=340059 written=340054
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.022..0.023 rows=1 loops=1)
-> Merge Left Join (cost=1935395.97..2063714.73 rows=4097995 width=176) (actual time=106594.140..263322.505 rows=13304172 loops=1)
Merge Cond: ((ns.nspname = s.schemaname) AND (tbl.relname = s.tablename) AND (att.attname = s.attname))
Buffers: shared hit=6832941 read=331003, temp read=340059 written=340054
-> Sort (cost=1427752.62..1437997.60 rows=4097995 width=192) (actual time=100214.124..113229.530 rows=13304172 loops=1)
Sort Key: ns.nspname, tbl.relname, att.attname
Sort Method: external merge Disk: 2629600kB
Buffers: shared hit=52022 read=282240, temp read=328700 written=328701
-> Hash Join (cost=15335.36..599474.35 rows=4097995 width=192) (actual time=242.121..56135.695 rows=13304172 loops=1)
Hash Cond: (tbl.relnamespace = ns.oid)
Buffers: shared hit=52022 read=282240
-> Hash Join (cost=15246.41..588609.50 rows=4097995 width=132) (actual time=237.991..35301.264 rows=13304172 loops=1)
Hash Cond: (att.attrelid = tbl.oid)
Buffers: shared hit=52006 read=282219
-> Seq Scan on pg_attribute att (cost=0.00..532962.28 rows=15390604 width=68) (actual time=0.004..13225.400 rows=13870158 loops=1)
Filter: (attnum > 0)
Rows Removed by Filter: 1147617
Buffers: shared hit=42449 read=282219
-> Hash (cost=14050.06..14050.06 rows=95708 width=72) (actual time=237.495..237.496 rows=96069 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 10782kB
Buffers: shared hit=9557
-> Seq Scan on pg_class tbl (cost=0.00..14050.06 rows=95708 width=72) (actual time=0.005..149.710 rows=96069 loops=1)
Filter: (relkind = 'r'::"char")
Rows Removed by Filter: 270863
Buffers: shared hit=9557
-> Hash (cost=58.42..58.42 rows=2442 width=68) (actual time=4.095..4.096 rows=2393 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 266kB
Buffers: shared hit=13 read=21
-> Seq Scan on pg_namespace ns (cost=0.00..58.42 rows=2442 width=68) (actual time=0.007..2.020 rows=2393 loops=1)
Buffers: shared hit=13 read=21
-> Sort (cost=507643.36..507707.46 rows=25640 width=200) (actual time=6379.663..6409.276 rows=37000 loops=1)
Sort Key: s.schemaname, s.tablename, s.attname
Sort Method: quicksort Memory: 11365kB
Buffers: shared hit=6780919 read=48763, temp read=11359 written=11353
-> Subquery Scan on s (cost=23845.50..505765.73 rows=25640 width=200) (actual time=1299.167..6301.807 rows=37000 loops=1)
Buffers: shared hit=6780919 read=48763, temp read=11359 written=11353
-> Gather (cost=23845.50..505509.33 rows=25640 width=401) (actual time=1299.164..6213.093 rows=37000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=6780919 read=48763, temp read=11359 written=11353
-> Hash Left Join (cost=22845.50..501945.33 rows=10683 width=401) (actual time=1127.147..5861.412 rows=12333 loops=3)
Hash Cond: (c.relnamespace = n.oid)
Buffers: shared hit=6780919 read=48763, temp read=11359 written=11353
-> Hash Join (cost=22756.56..501828.29 rows=10683 width=140) (actual time=1121.277..5826.907 rows=12333 loops=3)
Hash Cond: (s_1.starelid = c.oid)
Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
Rows Removed by Join Filter: 266947
Buffers: shared hit=6780817 read=48763, temp read=11359 written=11353
-> Nested Loop (cost=0.43..462367.09 rows=416791 width=82) (actual time=0.061..2480.113 rows=279280 loops=3)
Buffers: shared hit=3354232 read=48763
-> Parallel Seq Scan on pg_statistic s_1 (cost=0.00..53955.73 rows=236437 width=14) (actual time=0.030..405.942 rows=279280 loops=3)
Filter: (NOT stainherit)
Buffers: shared hit=464 read=48763
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.43..3.02 rows=2 width=70) (actual time=0.004..0.004 rows=1 loops=837840)
Index Cond: ((attrelid = s_1.starelid) AND (attnum = s_1.staattnum))
Filter: (NOT attisdropped)
Buffers: shared hit=3353768
-> Hash (cost=14050.06..14050.06 rows=359445 width=72) (actual time=1118.810..1118.811 rows=366932 loops=3)
Buckets: 262144 Batches: 2 Memory Usage: 20667kB
Buffers: shared hit=28671, temp written=6186
-> Seq Scan on pg_class c (cost=0.00..14050.06 rows=359445 width=72) (actual time=0.007..551.732 rows=366932 loops=3)
Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
Buffers: shared hit=28671
-> Hash (cost=58.42..58.42 rows=2442 width=68) (actual time=5.839..5.840 rows=2393 loops=3)
Buckets: 4096 Batches: 1 Memory Usage: 266kB
Buffers: shared hit=102
-> Seq Scan on pg_namespace n (cost=0.00..58.42 rows=2442 width=68) (actual time=0.017..2.882 rows=2393 loops=3)
Buffers: shared hit=102
SubPlan 2
-> Aggregate (cost=1104.73..1104.74 rows=1 width=8) (actual time=1.149..1.150 rows=1 loops=96069)
Buffers: shared hit=74601882 read=290333
-> Nested Loop (cost=1.56..1104.72 rows=1 width=401) (actual time=1.144..1.147 rows=0 loops=96069)
Join Filter: (c_1.oid = s_2.starelid)
Buffers: shared hit=74601882 read=290333
-> Nested Loop (cost=1.14..1103.30 rows=1 width=10) (actual time=1.141..1.143 rows=0 loops=96069)
Buffers: shared hit=74440049 read=284290
-> Nested Loop (cost=0.70..16.75 rows=1 width=4) (actual time=0.181..0.356 rows=1 loops=96069)
Join Filter: (c_1.relnamespace = n_1.oid)
Rows Removed by Join Filter: 57
Buffers: shared hit=19268934 read=1187
-> Index Scan using pg_class_relname_nsp_index on pg_class c_1 (cost=0.42..8.44 rows=1 width=8) (actual time=0.006..0.069 rows=58 loops=96069)
Index Cond: (relname = tbl.relname)
Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
Buffers: shared hit=2617035 read=1168
-> Index Scan using pg_namespace_nspname_index on pg_namespace n_1 (cost=0.28..8.30 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=5531986)
Index Cond: (nspname = ns.nspname)
Buffers: shared hit=16651899 read=19
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1 (cost=0.43..1084.53 rows=201 width=6) (actual time=0.782..0.782 rows=0 loops=96069)
Index Cond: (attrelid = c_1.oid)
Filter: ((NOT attisdropped) AND has_column_privilege(c_1.oid, attnum, 'select'::text))
Rows Removed by Filter: 144
Buffers: shared hit=55171115 read=283103
-> Index Scan using pg_statistic_relid_att_inh_index on pg_statistic s_2 (cost=0.43..1.07 rows=28 width=6) (actual time=0.003..0.004 rows=0 loops=43518)
Index Cond: ((starelid = a_1.attrelid) AND (staattnum = a_1.attnum))
Filter: (stanullfrac <> '0'::double precision)
Rows Removed by Filter: 0
Buffers: shared hit=161833 read=6043
Planning time: 3.341 ms
Execution time: 397056.086 ms
(103 rows)
And here is the EXPLAIN (ANALYZE,BUFFERS) output from the "fast" server:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=18339.88..841921.77 rows=13555 width=188) (actual time=1379.248..8600.924 rows=2694 loops=1)
Group Key: ns.nspname, tbl.relname, CASE WHEN ("substring"(split_part(version(), ' '::text, 2), '#\"[0-9]+.[0-9]+#\"%'::text, '#'::text) = ANY ('{8.0,8.1,8.2}'::text[])) THEN 27 ELSE 23 END, CASE WHEN ((version() ~ 'mingw32'::text) OR (version() ~ '64-bit'::text)) THEN 8 ELSE 4 END, $0
Buffers: shared hit=3648911
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.026..0.027 rows=1 loops=1)
-> Merge Left Join (cost=18339.86..18764.78 rows=13555 width=176) (actual time=1372.967..2097.150 rows=55107 loops=1)
Merge Cond: ((ns.nspname = s.schemaname) AND (tbl.relname = s.tablename) AND (att.attname = s.attname))
Buffers: shared hit=262164
-> Sort (cost=5887.15..5921.04 rows=13555 width=192) (actual time=371.212..415.656 rows=55107 loops=1)
Sort Key: ns.nspname, tbl.relname, att.attname
Sort Method: quicksort Memory: 16174kB
Buffers: shared hit=3056
-> Hash Join (cost=708.49..4956.84 rows=13555 width=192) (actual time=9.314..282.912 rows=55107 loops=1)
Hash Cond: (tbl.relnamespace = ns.oid)
Buffers: shared hit=3056
-> Hash Join (cost=670.68..4883.30 rows=13555 width=132) (actual time=7.529..194.704 rows=55107 loops=1)
Hash Cond: (att.attrelid = tbl.oid)
Buffers: shared hit=3038
-> Seq Scan on pg_attribute att (cost=0.00..4007.49 rows=78125 width=68) (actual time=0.004..77.499 rows=82849 loops=1)
Filter: (attnum > 0)
Rows Removed by Filter: 37659
Buffers: shared hit=2595
-> Hash (cost=637.01..637.01 rows=2693 width=72) (actual time=7.497..7.497 rows=2694 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 306kB
Buffers: shared hit=443
-> Seq Scan on pg_class tbl (cost=0.00..637.01 rows=2693 width=72) (actual time=0.007..5.110 rows=2694 loops=1)
Filter: (relkind = 'r'::"char")
Rows Removed by Filter: 12831
Buffers: shared hit=443
-> Hash (cost=25.14..25.14 rows=1014 width=68) (actual time=1.757..1.758 rows=1036 loops=1)
Buckets: 2048 (originally 1024) Batches: 1 (originally 1) Memory Usage: 118kB
Buffers: shared hit=15
-> Seq Scan on pg_namespace ns (cost=0.00..25.14 rows=1014 width=68) (actual time=0.008..0.841 rows=1036 loops=1)
Buffers: shared hit=15
-> Sort (cost=12452.71..12453.04 rows=131 width=200) (actual time=1001.404..1031.723 rows=36997 loops=1)
Sort Key: s.schemaname, s.tablename, s.attname
Sort Method: quicksort Memory: 11364kB
Buffers: shared hit=259108
-> Subquery Scan on s (cost=5455.47..12448.10 rows=131 width=200) (actual time=709.076..941.593 rows=36997 loops=1)
Buffers: shared hit=259108
-> Hash Left Join (cost=5455.47..12446.79 rows=131 width=401) (actual time=709.073..886.682 rows=36997 loops=1)
Hash Cond: (c.relnamespace = n.oid)
Buffers: shared hit=259108
-> Hash Join (cost=5417.65..12408.63 rows=131 width=140) (actual time=707.296..823.908 rows=36997 loops=1)
Hash Cond: ((s_1.starelid = c.oid) AND (s_1.staattnum = a.attnum))
Buffers: shared hit=259093
-> Seq Scan on pg_statistic s_1 (cost=0.00..6793.45 rows=19622 width=14) (actual time=0.014..45.484 rows=45420 loops=1)
Filter: (NOT stainherit)
Buffers: shared hit=6401
-> Hash (cost=4852.66..4852.66 rows=37666 width=142) (actual time=706.898..706.899 rows=69779 loops=1)
Buckets: 131072 (originally 65536) Batches: 1 (originally 1) Memory Usage: 12881kB
Buffers: shared hit=252689
-> Hash Join (cost=831.03..4852.66 rows=37666 width=142) (actual time=29.082..629.863 rows=69779 loops=1)
Hash Cond: (a.attrelid = c.oid)
Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
Rows Removed by Join Filter: 50729
Buffers: shared hit=252689
-> Seq Scan on pg_attribute a (cost=0.00..3724.99 rows=112999 width=70) (actual time=0.005..117.968 rows=120508 loops=1)
Filter: (NOT attisdropped)
Buffers: shared hit=2595
-> Hash (cost=637.01..637.01 rows=15521 width=72) (actual time=28.956..28.957 rows=15525 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1705kB
Buffers: shared hit=443
-> Seq Scan on pg_class c (cost=0.00..637.01 rows=15521 width=72) (actual time=0.007..15.192 rows=15525 loops=1)
Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
Buffers: shared hit=443
-> Hash (cost=25.14..25.14 rows=1014 width=68) (actual time=1.760..1.761 rows=1036 loops=1)
Buckets: 2048 (originally 1024) Batches: 1 (originally 1) Memory Usage: 118kB
Buffers: shared hit=15
-> Seq Scan on pg_namespace n (cost=0.00..25.14 rows=1014 width=68) (actual time=0.012..0.858 rows=1036 loops=1)
Buffers: shared hit=15
SubPlan 2
-> Aggregate (cost=60.64..60.66 rows=1 width=8) (actual time=2.376..2.377 rows=1 loops=2694)
Buffers: shared hit=3386747
-> Nested Loop (cost=1.39..60.63 rows=1 width=401) (actual time=2.280..2.369 rows=6 loops=2694)
Join Filter: (c_1.oid = s_2.starelid)
Buffers: shared hit=3386747
-> Nested Loop (cost=1.10..60.06 rows=1 width=10) (actual time=2.228..2.283 rows=16 loops=2694)
Buffers: shared hit=3262332
-> Nested Loop (cost=0.69..16.73 rows=1 width=4) (actual time=1.108..2.211 rows=1 loops=2694)
Join Filter: (c_1.relnamespace = n_1.oid)
Rows Removed by Join Filter: 364
Buffers: shared hit=3247620
-> Index Scan using pg_class_relname_nsp_index on pg_class c_1 (cost=0.41..8.43 rows=1 width=8) (actual time=0.005..0.398 rows=365 loops=2694)
Index Cond: (relname = tbl.relname)
Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
Buffers: shared hit=290274
-> Index Scan using pg_namespace_nspname_index on pg_namespace n_1 (cost=0.28..8.29 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=983801)
Index Cond: (nspname = ns.nspname)
Buffers: shared hit=2957346
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1 (cost=0.42..43.29 rows=4 width=6) (actual time=0.019..0.045 rows=16 loops=2694)
Index Cond: (attrelid = c_1.oid)
Filter: ((NOT attisdropped) AND has_column_privilege(c_1.oid, attnum, 'select'::text))
Rows Removed by Filter: 10
Buffers: shared hit=14712
-> Index Scan using pg_statistic_relid_att_inh_index on pg_statistic s_2 (cost=0.29..0.55 rows=1 width=6) (actual time=0.003..0.003 rows=0 loops=43518)
Index Cond: ((starelid = a_1.attrelid) AND (staattnum = a_1.attnum))
Filter: (stanullfrac <> '0'::double precision)
Rows Removed by Filter: 0
Buffers: shared hit=124415
Planning time: 4.337 ms
Execution time: 8605.391 ms
(102 rows)
Well, mystery solved. Thanks to the comments above, I saw that the EXPLAIN output indicated that pg_attribute was also much larger than I expected. After looking into that, I saw what the problem was. One of our developers had a runaway process that created over a hundred thousand tables that were supposed to be temporary but weren't, and then "forgot" to clean them up. They were in a non-public schema, so I hadn't noticed them until now. Once I dropped those hundred-thousand-plus tables and ran vacuum full pg_attribute, vacuum full pg_class, and vacuum full pg_statistic, the problem I was having with the Nagios bloat query was finally resolved.
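For anyone hitting the same symptom, here is a rough sketch of how the stray tables could be located and queued for removal. The schema name runaway_schema and the scratch% name pattern are hypothetical placeholders, not the names from this incident; substitute whatever the first query reveals.

```sql
-- Step 1: count ordinary tables per schema to spot one with a runaway number.
SELECT n.nspname AS schema_name,
       count(*)  AS table_count
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname
ORDER BY table_count DESC
LIMIT 20;

-- Step 2: generate (but do not execute) DROP statements so they can be reviewed first.
-- 'runaway_schema' and 'scratch%' are hypothetical placeholders.
SELECT format('DROP TABLE %I.%I;', schemaname, tablename) AS drop_statement
FROM pg_tables
WHERE schemaname = 'runaway_schema'
  AND tablename LIKE 'scratch%';
```

Keep in mind that VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock, so running it on catalogs such as pg_class and pg_attribute blocks most other activity in that database until it finishes; a quiet maintenance window is the safer choice.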