Questions asked by Pawel Zieminski - dba

Pawel Zieminski

Asked: 2023-10-07 04:38:52 +0800 CST

Strange PostgreSQL behaviour with non-ASCII characters when a trigram index is present

I see some strange behaviour when using gin_trgm_ops or gist_trgm_ops indexes. There seems to be a big difference in the plans when using, say, ILIKE or ~ and searching for ASCII phrases versus multibyte-character phrases, as if there were a higher cost when the operand is non-ASCII.

Is what I am seeing expected? What is the reason for it?

I tried it on the latest releases of PostgreSQL 12 and 13.

Here is a scenario:

CREATE DATABASE postgres
    WITH
    OWNER = postgres
    ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.utf8'
    LC_CTYPE = 'en_US.utf8'
    TABLESPACE = pg_default
    CONNECTION LIMIT = -1
    IS_TEMPLATE = False;

-- snip

CREATE TABLE test_table (
    id uuid PRIMARY KEY,
    label varchar
);

-- insert 1m rows

VACUUM ANALYZE test_table;

In the data set I have 10 labels containing 'acl' and 10 containing '定す'.

When using the GIN index

CREATE INDEX test_table_label_gin_idx
    ON test_table USING gin
    (label gin_trgm_ops);

I see the following.

EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;

Limit  (cost=1000.00..16573.18 rows=100 width=52) (actual time=392.153..395.095 rows=10 loops=1)
  ->  Gather  (cost=1000.00..16728.91 rows=101 width=52) (actual time=392.135..394.830 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Parallel Seq Scan on test_table  (cost=0.00..15718.81 rows=42 width=52) (actual time=382.922..388.082 rows=3 loops=3)
              Filter: ((label)::text ~~* '%定す%'::text)
              Rows Removed by Filter: 338417
Planning Time: 0.656 ms
Execution Time: 395.233 ms


EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;

Limit  (cost=28.78..400.51 rows=100 width=52) (actual time=0.072..0.406 rows=10 loops=1)
  ->  Bitmap Heap Scan on test_table  (cost=28.78..404.23 rows=101 width=52) (actual time=0.053..0.197 rows=10 loops=1)
        Recheck Cond: ((label)::text ~~* '%acl%'::text)
        Heap Blocks: exact=10
        ->  Bitmap Index Scan on test_table_label_gin_idx  (cost=0.00..28.76 rows=101 width=0) (actual time=0.025..0.034 rows=10 loops=1)
              Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.231 ms
Execution Time: 0.542 ms

With GiST

DROP INDEX test_table_label_gin_idx;

CREATE INDEX test_table_label_gist_idx
    ON test_table USING gist
    (label gist_trgm_ops);

I see

EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定す%' LIMIT 100;

Limit  (cost=13.19..384.92 rows=100 width=52) (actual time=303.772..1557.498 rows=10 loops=1)
  ->  Bitmap Heap Scan on test_table  (cost=13.19..388.64 rows=101 width=52) (actual time=303.752..1557.286 rows=10 loops=1)
        Recheck Cond: ((label)::text ~~* '%定す%'::text)
        Rows Removed by Index Recheck: 1015250
        Heap Blocks: exact=10431
        ->  Bitmap Index Scan on test_table_label_gist_idx  (cost=0.00..13.17 rows=101 width=0) (actual time=301.046..301.053 rows=1015260 loops=1)
              Index Cond: ((label)::text ~~* '%定す%'::text)
Planning Time: 0.215 ms
Execution Time: 1557.643 ms


EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%acl%' LIMIT 100;

Limit  (cost=13.19..384.92 rows=100 width=52) (actual time=257.385..257.751 rows=10 loops=1)
  ->  Bitmap Heap Scan on test_table  (cost=13.19..388.64 rows=101 width=52) (actual time=257.366..257.551 rows=10 loops=1)
        Recheck Cond: ((label)::text ~~* '%acl%'::text)
        Heap Blocks: exact=10
        ->  Bitmap Index Scan on test_table_label_gist_idx  (cost=0.00..13.17 rows=101 width=0) (actual time=257.319..257.328 rows=10 loops=1)
              Index Cond: ((label)::text ~~* '%acl%'::text)
Planning Time: 0.377 ms
Execution Time: 257.948 ms

Just changing the characters of the operand changes the plan quite a lot.

Edit

SELECT show_trgm('定す');

"{0x145ed8,0x6628fa,0x6cb12d}"

SELECT encode('定す', 'escape')

\345\256\232\343\201\231

This problem looks similar to the question "Postgresql not using GIN trigram index when performing non-ASCII LIKE query?"
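One way to tell whether the plans differ because of the character set or because of the pattern length is to vary the two independently. The pg_trgm documentation notes that a LIKE or regular-expression pattern with no extractable trigrams degenerates to a full-index scan, and '定す' is two characters long while 'acl' is three. The sketch below is only a diagnostic, not a confirmed explanation. The patterns '%ac%' and '%定する%' are illustrative assumptions and need not match any rows in the table.

```sql
-- Diagnostic sketch: vary pattern length and character set independently.
-- Both patterns are illustrative assumptions, not values from the data set.

-- Two ASCII characters: same length as '定す'.
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%ac%' LIMIT 100;

-- Three non-ASCII characters: same length as 'acl'.
EXPLAIN ANALYZE SELECT * FROM test_table WHERE label ILIKE '%定する%' LIMIT 100;
```

If the two-character ASCII pattern also produces a sequential scan (with GIN) or a bitmap index scan that returns every row (with GiST), the pattern length rather than the encoding would be the likelier cause.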

Pawel Zieminski

Asked: 2020-03-04 16:06:50 +0800 CST

Postgres partial btree index on a jsonb array -> array expression seems to be corrupted for larger tables in version 9.5.x

I am looking into indexing jsonb attributes and I am seeing something suspicious with Postgres 9.5.x, but not in higher versions. Below is what I did to trigger the strange query errors. It may be that I am doing something wrong, but seeing this work in newer versions of Postgres makes me think it is a bug in 9.5.x (I tried up to version 9.5.21).

I see this consistently with a table size of about 1 million rows and above.

The json in the jsonb column contains attributes representing different json types, both single values and arrays. I have string, boolean, integer number, float number and date-formatted string. The error I see is with the < array operator on an integer array (I have not tried all of them). From the error it looks as if the column -> 'attribute' part of the expression fails to retrieve the correct part of the jsonb value and, say, for an int array gets the neighbouring string array, etc. This actually changes between runs, since the data is random.

The structure of the json in the properties column is fixed (deterministic) for each value of the type column. So every row where type = 8 always has an integer array in properties -> 'r'; type = 7 has that array in properties -> 'q', type = 9 has it in properties -> 's', etc. In other words, type is a logical type in terms of the structure (or "schema") of the json in properties, and all rows with the same type value have a homogeneous json structure in terms of node names and value types (the values themselves are random). Also, for now the arrays are always of length 3.
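The layout described above can be checked directly against the generated data. The query below is a sketch that lists, for each type, every top-level key of properties together with its jsonb type. Restricting it to types 7 to 9 is an arbitrary choice to keep the output short.

```sql
-- List the top-level keys of "properties" and their jsonb types per "type".
-- The filter on 7, 8 and 9 is arbitrary and only keeps the output short.
SELECT t.type,
       e.key,
       jsonb_typeof(e.value) AS value_type,
       count(*) AS row_count
FROM test1 t,
     jsonb_each(t.properties) e
WHERE t.type IN (7, 8, 9)
GROUP BY t.type, e.key, jsonb_typeof(e.value)
ORDER BY t.type, e.key;
```

Note that jsonb_typeof only reports object, array, string, number, boolean or null, so it confirms which keys hold arrays but does not distinguish an integer array from a float or string array.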

Is this a bug? Or am I doing something wrong?

CREATE TABLE test1 (
  id SERIAL PRIMARY KEY,
  type INTEGER NOT NULL,
  properties jsonb
);

-- generates test data wherein the json structure of "properties" column varies by "type" column
INSERT INTO test1 (type, properties)
SELECT
  s.type AS type,
  json_build_object(CHR(s.type + 100), md5(random() :: TEXT),
                    CHR(s.type + 101), (random() * 100)::INTEGER,
                    CHR(s.type + 102), (random() * 10)::DOUBLE PRECISION,
                    CHR(s.type + 103), random()::INTEGER::BOOLEAN ,
                    CHR(s.type + 104), to_char(to_timestamp((random() * 1500000000)::DOUBLE PRECISION), 'YYYY-MM-DD"T"HH24:MI:SS.MS"Z"'),
                    CHR(s.type + 105), ARRAY[md5(random() :: TEXT), md5(random() :: TEXT), md5(random() :: TEXT)],
                    CHR(s.type + 106), ARRAY[(random() * 100)::INTEGER, (random() * 100)::INTEGER, (random() * 100)::INTEGER],
                    CHR(s.type + 107), ARRAY[(random() * 10)::DOUBLE PRECISION, (random() * 10)::DOUBLE PRECISION, (random() * 10)::DOUBLE PRECISION],
                    CHR(s.type + 108), ARRAY[random()::INTEGER::BOOLEAN, random()::INTEGER::BOOLEAN, random()::INTEGER::BOOLEAN],
                    CHR(s.type + 109), ARRAY[
                      to_char(to_timestamp((random() * 1500000000)::DOUBLE PRECISION), 'YYYY-MM-DD"T"HH24:MI:SS.MS"Z"'),
                      to_char(to_timestamp((random() * 1500000000)::DOUBLE PRECISION), 'YYYY-MM-DD"T"HH24:MI:SS.MS"Z"'),
                      to_char(to_timestamp((random() * 1500000000)::DOUBLE PRECISION), 'YYYY-MM-DD"T"HH24:MI:SS.MS"Z"')
                    ]
  ) AS properties
FROM (SELECT (random() * 10) :: INT AS type
      FROM generate_series(1, 1000000)) s;

CREATE OR REPLACE FUNCTION jsonb_array_int_array(JSONB)
  RETURNS INTEGER[] AS
$$
DECLARE
  result INTEGER[];
BEGIN
  IF $1 ISNULL
  THEN
    result := NULL;
  ELSEIF jsonb_array_length($1) = 0
  THEN
    result := ARRAY [] :: INTEGER[];
  ELSE
    SELECT array_agg(x::INTEGER) FROM jsonb_array_elements_text($1) t(x) INTO result;
  END IF;
  RETURN result;
END;
$$
  LANGUAGE plpgsql
  IMMUTABLE;

--  properties -> 'r' field of type 8 is always an array of integers
CREATE INDEX test1_properties_r_int_array_index ON test1 USING btree (jsonb_array_int_array(properties -> 'r')) WHERE type = 8;

-- this works
SELECT count(*) FROM test1 WHERE type = 8 AND jsonb_array_int_array(properties -> 'r') < ARRAY[50];

-- this fails
SELECT count(*) FROM test1 WHERE type = 8 AND jsonb_array_int_array(properties -> 'r') < ARRAY[100];

-- but
DROP INDEX test1_properties_r_int_array_index;
-- now it works
SELECT count(*) FROM test1 WHERE type = 8 AND jsonb_array_int_array(properties -> 'r') < ARRAY[100];

-- also
CREATE INDEX test1_properties_r_int_array_index ON test1 USING gin (jsonb_array_int_array(properties -> 'r')) WHERE type = 8;

-- works here too
SELECT count(*) FROM test1 WHERE type = 8 AND jsonb_array_int_array(properties -> 'r') < ARRAY[100];

Thanks for the help.

Edit:

Here is some clarification on how it fails. I have just re-run the above and the query fails as follows

sql> SELECT count(*) FROM test1 WHERE type = 8 AND jsonb_array_int_array(properties -> 'r') < ARRAY[100]
[2020-03-04 00:46:20] [22P02] ERROR: invalid input syntax for integer: "1.73782130237668753"
[2020-03-04 00:46:20] Where: SQL statement "SELECT array_agg(x::INTEGER) FROM jsonb_array_elements_text($1) t(x)"
[2020-03-04 00:46:20] PL/pgSQL function jsonb_array_int_array(jsonb) line 12 at SQL statement

I looked up the random value from the error message

SELECT id AS txt FROM test1 WHERE position('1.73782130237668753' IN properties::text) > 0;

and found that the row that caused the error actually has type equal to 7, not 8 as in the query's WHERE clause. So it seems that the index condition is not satisfied for the row being returned.

Here is the plan for the failing query

Aggregate  (cost=69293.65..69293.66 rows=1 width=0)
  ->  Bitmap Heap Scan on test1  (cost=1228.78..69208.38 rows=34111 width=0)
        Recheck Cond: ((jsonb_array_int_array((properties -> 'r'::text)) < '{100}'::integer[]) AND (type = 8))
        ->  Bitmap Index Scan on test1_properties_r_int_array_index  (cost=0.00..1220.25 rows=34111 width=0)
              Index Cond: (jsonb_array_int_array((properties -> 'r'::text)) < '{100}'::integer[])

Edit 2:

Following Laurenz Albe's answer, I ran the following test. I defined a new function

CREATE OR REPLACE FUNCTION jsonb_array_int_array2(json_value JSONB, actual_type INTEGER, expected_type INTEGER)
  RETURNS INTEGER[] AS
$$
DECLARE
  result INTEGER[];
BEGIN
  IF actual_type <> expected_type THEN
    RAISE EXCEPTION 'unexpected type % instead of %', actual_type, expected_type;
  END IF;

  IF $1 ISNULL OR actual_type <> expected_type
  THEN
    result := NULL;
  ELSEIF jsonb_array_length(json_value) = 0
  THEN
    result := ARRAY [] :: INTEGER[];
  ELSE
    SELECT array_agg(x::INTEGER) FROM jsonb_array_elements_text(json_value) t(x) INTO result;
  END IF;
  RETURN result;
END;
$$
  LANGUAGE plpgsql
  IMMUTABLE;

I redefined the index and restructured the query as follows

CREATE INDEX test1_properties_r_int_array_index ON test1 USING btree (jsonb_array_int_array2(properties -> 'r', type, 8)) WHERE type = 8;

SELECT count(*) FROM test1 WHERE type = 8 AND jsonb_array_int_array2(properties -> 'r', type, 8) < ARRAY[100];

And now I get

[2020-03-04 09:47:34] [P0001] ERROR: unexpected type 7 instead of 8

This indicates that some step is executed on all rows, not only on those where type = 8. Is it perhaps this line from the plan?

Recheck Cond: ((jsonb_array_int_array((properties -> 'r'::text)) < '{50}'::integer[]) AND (type = 8))

If this is the order of evaluation, is it possible to reverse it and check type = 8 before jsonb_array_int_array((properties -> 'r'::text))?

Also, judging by the performance (once I remove the exception check and re-run), it looks like the whole table is scanned.

Is this expected?
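On the question of evaluation order: PostgreSQL does not promise to evaluate the conditions of a WHERE clause in any particular order, and its documentation suggests a CASE expression when the order matters. A possible sketch follows, assuming the guard is acceptable in both the index expression and the query. The index name test1_properties_r_guarded_index is made up for illustration.

```sql
-- Sketch: guard the conversion with CASE so it is only evaluated for type = 8.
-- For any other row the CASE yields NULL, so the comparison is not true.
CREATE INDEX test1_properties_r_guarded_index ON test1 USING btree
  ((CASE WHEN type = 8 THEN jsonb_array_int_array(properties -> 'r') END))
  WHERE type = 8;

-- The query repeats the same expression so that the index can be considered.
SELECT count(*)
FROM test1
WHERE type = 8
  AND (CASE WHEN type = 8 THEN jsonb_array_int_array(properties -> 'r') END) < ARRAY[100];
```

Whether the planner actually picks this index should be confirmed with EXPLAIN. The guard only ensures that rows of another type rechecked during the scan do not reach the integer cast.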

Edit 3:

I realize this has now become a different question, and Laurenz Albe's excellent and detailed answer addresses the original question of "why does it not work". The question now is how best to make the original scheme I was after work. I think I will have to distill it into a separate question.

Thanks!

By the way, as Laurenz predicted, I was able to reproduce the problem on Postgres 10.x with more data.

Edit 4:

For the record, this is not specific to arrays. Any value conversion in this scenario will eventually fail with large tables. So, given that properties ->> 'm' is always an integer when type = 8, this is not safe either

CREATE INDEX test1_properties_m_int_index ON test1 (((properties ->> 'm')::INTEGER)) WHERE type = 8;

and the query

SELECT count(*) FROM test1 WHERE type = 8 AND (properties ->> 'm')::INTEGER < 50;

fails with

[2020-03-05 09:35:24] [22P02] ERROR: invalid input syntax for integer: "["a1c815126aa058706476b21f37f60038", "450513bd0f25abf8bd39b1b4645a1427", "e51acc579414985eaa59d9bdc3dc8187"]"

The lesson here is that if the json schema is not fixed for the table column, whatever conversion is applied must anticipate any jsonb input, because parts of the table may be scanned indiscriminately.
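A minimal sketch of a conversion that anticipates arbitrary input, assuming it is acceptable to map anything that is not an array of integers to NULL. The function name jsonb_array_int_array_safe is hypothetical; jsonb_typeof and the two error condition names are standard PostgreSQL.

```sql
-- Hypothetical tolerant variant of jsonb_array_int_array:
-- returns NULL instead of raising an error for unexpected input.
CREATE OR REPLACE FUNCTION jsonb_array_int_array_safe(JSONB)
  RETURNS INTEGER[] AS
$$
DECLARE
  result INTEGER[];
BEGIN
  -- Anything that is missing or not a json array is mapped to NULL.
  IF $1 IS NULL OR jsonb_typeof($1) <> 'array' THEN
    RETURN NULL;
  END IF;

  -- array_agg returns NULL for zero rows, so coalesce to an empty array.
  SELECT coalesce(array_agg(x::INTEGER), ARRAY[]::INTEGER[])
  INTO result
  FROM jsonb_array_elements_text($1) t(x);

  RETURN result;
EXCEPTION
  -- Elements that are not integers (floats, strings, booleans, out of range).
  WHEN invalid_text_representation OR numeric_value_out_of_range THEN
    RETURN NULL;
END;
$$
  LANGUAGE plpgsql
  IMMUTABLE;
```

A PL/pgSQL block with an EXCEPTION clause is noticeably more expensive to enter and exit than one without, so this trades some speed for robustness. The index and the query would both need to call the new function for the index to remain usable.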
