Questions asked by Mad Scientist - dba

Mad Scientist

Asked: 2017-03-19 10:15:42 +0800 CST

Inconsistent statistics on a jsonb column with a btree index

7

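I noticed that the performance of a query involving a jsonb column varied significantly between VACUUM ANALYZE runs while I was testing it. After analyzing the table I get completely different execution plans, seemingly at random.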

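I'm using Postgres 9.6 here. My test setup is as follows: I insert a key "x" into the jsonb column "params" with values between 1 and 6, where 1 is the rarest value and 6 the most common one. I also have a regular int column "single_param" that contains the same distribution of values for comparison: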

CREATE TABLE test_data (
    id      serial,
    single_param    int,
    params      jsonb
);

INSERT INTO test_data
SELECT 
    generate_series(1, 1000000) AS id, 
    floor(log(random() * 9999999 + 1)) AS single_param,
    json_build_object(
        'x', floor(log(random() * 9999999 + 1))
    ) AS params;

CREATE INDEX idx_test_btree ON test_data (cast(test_data.params->>'x' AS int));
CREATE INDEX idx_test_gin ON test_data USING GIN (params);
CREATE INDEX ON test_data(id);
CREATE INDEX ON test_data(single_param);

The query I'm testing is a typical query for paginating results: I order by id and limit the output to the first 50 rows.

SELECT * FROM test_data where (params->>'x')::int = 1 ORDER BY id DESC LIMIT 50;

After running VACUUM ANALYZE, I get one of two EXPLAIN ANALYZE outputs, seemingly at random:

Limit  (cost=0.42..836.59 rows=50 width=33) (actual time=39.679..410.292 rows=10 loops=1)
  ->  Index Scan Backward using test_data_id_idx on test_data  (cost=0.42..44317.43 rows=2650 width=33) (actual time=39.678..410.283 rows=10 loops=1)
        Filter: (((params ->> 'x'::text))::integer = 1)
        Rows Removed by Filter: 999990
Planning time: 0.106 ms
Execution time: 410.314 ms

or

Limit  (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.034 rows=10 loops=1)
  ->  Sort  (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.032 rows=10 loops=1)
        Sort Key: id DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Index Scan using idx_test_btree on test_data  (cost=0.42..8.44 rows=1 width=33) (actual time=0.007..0.016 rows=10 loops=1)
              Index Cond: (((params ->> 'x'::text))::integer = 1)
Planning time: 0.320 ms
Execution time: 0.052 ms

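The difference is that the estimated number of rows matching the where clause differs between the two plans. The first plan estimates 2650 rows and the second estimates 1 row, while the actual number is 10 rows.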
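For comparison, the same lookup can be run against the plain int column from the setup above. This is only a sketch of the check: it uses nothing beyond the table and the index on single_param defined earlier, and the plan it produces will depend on the statistics gathered by the last ANALYZE.

```sql
-- Column-level statistics that ANALYZE stored for the plain int column.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'test_data'
  AND attname = 'single_param';

-- Same query shape as the jsonb test, but filtering on the int column.
EXPLAIN ANALYZE
SELECT * FROM test_data
WHERE single_param = 1
ORDER BY id DESC
LIMIT 50;
```

Running this after each VACUUM ANALYZE alongside the jsonb query would show whether the row estimate for the int column fluctuates in the same way.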

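The following version of the query, which could use the GIN index, seems to use a default estimate of 1% for the jsonb column, which also results in the bad query plan shown above: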

SELECT * FROM test_data where params @> '{"x": 1}' ORDER BY id DESC LIMIT 50;

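My original assumption was that Postgres would not have any statistics on the jsonb column and would always use a default estimate, as it does for the query with the @> operator. But for the query written so that it can use the btree index I created, it uses different estimates. Sometimes those are good enough, and sometimes they are bad.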

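Where do these estimates come from? My guess is that they are some kind of statistics that Postgres creates using the index. For column statistics there is an option to collect more accurate statistics; is there something similar for these? Or is there any other way to make Postgres choose the better plan in my case?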
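One way to see where such estimates come from is to look at what ANALYZE stored for the indexed expression. The sketch below rests on several assumptions that should be verified: that the statistics for the expression index appear in pg_stats under the index name, that the expression column of idx_test_btree is named int4 (check the attname returned by the first query), and that this Postgres version accepts a statistics target for an index column through ALTER TABLE. ANALYZE reads only a sample of the table, so a value that occurs about 10 times in a million rows may or may not end up in the sample, which would be consistent with the estimate changing between runs.

```sql
-- Statistics gathered for the indexed expression, if any.
-- Assumption: they are listed under the index name, not the table name.
SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'idx_test_btree';

-- Raise the statistics target for the expression column so that ANALYZE
-- samples more rows. "int4" is an assumed column name; substitute the
-- attname reported by the query above.
ALTER TABLE idx_test_btree ALTER COLUMN int4 SET STATISTICS 1000;

-- Re-collect statistics, then re-check the plan.
ANALYZE test_data;

EXPLAIN ANALYZE
SELECT * FROM test_data
WHERE (params->>'x')::int = 1
ORDER BY id DESC
LIMIT 50;
```

A larger target makes ANALYZE slower and the stored statistics bigger, so the value 1000 here is only illustrative.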

Mad Scientist

Asked: 2017-02-07 04:39:02 +0800 CST

Error with the temporal tables extension on array columns

4

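I'm using the temporal tables extension for Postgres (http://pgxn.org/dist/temporal_tables/) on Postgres 9.5.4 on Windows. This works in most cases, but I'm running into a strange problem with array columns.

I've created a minimal set of steps to reproduce this. The setup of the tables and the temporal tables extension is as follows: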

CREATE EXTENSION IF NOT EXISTS temporal_tables;

DROP TABLE IF EXISTS test;
DROP TABLE IF EXISTS test_history;

CREATE TABLE test
(
  id SERIAL PRIMARY KEY,
  a integer,
  directories text[],
  sys_period tstzrange NOT NULL
);

CREATE TABLE test_history (LIKE test);
CREATE TRIGGER versioning_trigger BEFORE INSERT OR UPDATE OR DELETE ON test FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'test_history', true);

Then I execute the following two commands in separate transactions:

INSERT INTO test(a) VALUES (1);

UPDATE test SET a = 5 WHERE id = 1;

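I get the following error:

ERROR: column "directories" of relation "test" is of type text[] but column "directories" of history relation "test_history" is of type text[]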

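This error is nonsensical to me: both columns have the same type, and the error even says so. It only happens when a text[] column exists in the table. It is not necessary to write to that column; it only has to exist. If I don't create the array column, there is no error.

The extension doesn't mention anything about being incompatible with arrays, and I would expect a limitation like that to be significant enough to be mentioned. I wonder whether there is anything wrong with the way I set up the tables, or whether I'm missing something else. Is there any fundamental reason why the types of two text[] array columns would differ?

Any idea what exactly causes this error, and how can I get rid of it?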
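A possible first diagnostic step is to compare the catalog entries of the two columns, since the formatted type name in the error message may hide a difference in another attribute. The query below is only a sketch: it reads pg_attribute for the directories column of both tables from the reproduction above. Which of these attributes the extension actually compares is an assumption and is not confirmed by anything in this question.

```sql
-- Compare the raw catalog attributes of the array column in both tables.
-- A mismatch in atttypid, atttypmod or attndims would be a candidate
-- explanation, although the extension's exact check is an assumption here.
SELECT c.relname,
       a.attname,
       format_type(a.atttypid, a.atttypmod) AS formatted_type,
       a.atttypid,
       a.atttypmod,
       a.attndims
FROM pg_attribute AS a
JOIN pg_class AS c ON c.oid = a.attrelid
WHERE c.relname IN ('test', 'test_history')
  AND a.attname = 'directories'
ORDER BY c.relname;
```

If the two rows differ in any column other than relname, that difference would be worth including in a bug report to the extension.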

Mad Scientist

Asked: 2017-01-21 07:37:32 +0800 CST

Aggregating a column causes a full table scan even though the right index exists

4

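I have a query in which I want to fetch the first few rows from the table datasets, ordered by the date_added column. The column I sort by is indexed, so the basic version of this query is very fast: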

SELECT datasets.id FROM datasets ORDER BY date_added LIMIT 25

"Limit  (cost=0.28..6.48 rows=25 width=12) (actual time=0.040..0.092 rows=25 loops=1)"
"  ->  Index Scan using datasets_date_added_idx2 on datasets  (cost=0.28..1244.19 rows=5016 width=12) (actual time=0.037..0.086 rows=25 loops=1)"
"Planning time: 0.484 ms"
"Execution time: 0.139 ms"

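But as soon as I make the query more complex, I run into problems. I want to join another table that represents a many-to-many relationship and aggregate the results into an array column. To do that I need to add a GROUP BY id clause: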

SELECT datasets.id FROM datasets GROUP BY datasets.id ORDER BY date_added LIMIT 25

"Limit  (cost=551.41..551.47 rows=25 width=12) (actual time=9.926..9.931 rows=25 loops=1)"
"  ->  Sort  (cost=551.41..563.95 rows=5016 width=12) (actual time=9.924..9.926 rows=25 loops=1)"
"        Sort Key: date_added"
"        Sort Method: top-N heapsort  Memory: 26kB"
"        ->  HashAggregate  (cost=359.70..409.86 rows=5016 width=12) (actual time=7.016..8.604 rows=5016 loops=1)"
"              Group Key: datasets_id"
"              ->  Seq Scan on datasets  (cost=0.00..347.16 rows=5016 width=12) (actual time=0.009..1.574 rows=5016 loops=1)"
"Planning time: 0.502 ms"
"Execution time: 10.235 ms"

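Just by adding the GROUP BY clause, the query now does a full scan of the datasets table instead of using the index on the date_added column as before.

A simplified version of the actual query I want to run is the following: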

SELECT 
    datasets.id,
    array_remove(array_agg(other_table.some_column), NULL) AS other_table
FROM datasets 
LEFT JOIN other_table 
    ON other_table.id = datasets.id
GROUP BY datasets.id 
ORDER BY date_added 
LIMIT 25

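Why does the GROUP BY clause cause the index to be ignored and force a full table scan? Is there a way to rewrite this query so that it uses the index on the column it is sorted by?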
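One possible rewrite, sketched here under the assumption that only the first 25 datasets by date_added are needed, is to apply the ORDER BY and LIMIT in a subquery first and join and aggregate afterwards. The aliases d and o are introduced for illustration only. Whether the planner then uses the date_added index for the inner query still depends on the data and statistics, so this is something to test rather than a guaranteed fix.

```sql
SELECT d.id,
       array_remove(array_agg(o.some_column), NULL) AS other_table
FROM (
    -- Pick the 25 rows first; this part has the same shape as the fast query.
    SELECT id, date_added
    FROM datasets
    ORDER BY date_added
    LIMIT 25
) AS d
LEFT JOIN other_table AS o
    ON o.id = d.id
-- date_added is added to the grouping because d is a subquery, so the
-- primary key of datasets can no longer be used to infer it.
GROUP BY d.id, d.date_added
ORDER BY d.date_added;
```

With this shape the aggregation only has to process the rows joined to 25 datasets instead of the whole table. If several rows share the same date_added at the cutoff, which of them make it into the 25 is arbitrary, as it is in the original query.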

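I'm using Postgres 9.5.4 on Windows. The table in question currently has 5000 rows, but it could grow to a few hundred thousand. I ran ANALYZE manually on both tables before the EXPLAIN ANALYZE.

Table definitions: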

CREATE TABLE public.datasets
(
  id integer NOT NULL DEFAULT nextval('datasets_id_seq'::regclass),
  date_added timestamp with time zone,
  ...
  CONSTRAINT datasets_pkey PRIMARY KEY (id)
)

CREATE TABLE public.other_table
(
  id integer NOT NULL,
  some_column integer NOT NULL,
  CONSTRAINT other_table_pkey PRIMARY KEY (id, some_column)
)

Output of \d datasets with the irrelevant columns anonymized:

                                                   Table "public.datasets"
             Column              |           Type           |                           Modifiers
---------------------------------+--------------------------+------------------------------------------------------
 id                              | integer                  | not null default nextval('datasets_id_seq'::regclass)
 key                             | text                     |
 date_added                      | timestamp with time zone |
 date_last_modified              | timestamp with time zone |
 *****                           | integer                  |
 ********                        | boolean                  | default false
 *****                           | boolean                  | default false
 ***************                 | integer                  |
 *********************           | integer                  |
 *********                       | boolean                  | default false
 ********                        | integer                  |
 ************                    | integer                  |
 ************                    | integer                  |
 ****************                | timestamp with time zone |
 ************                    | text                     | default ''::text
 *****                           | text                     |
 *******                         | integer                  |
 *********                       | integer                  |
 **********************          | text                     | default ''::text
 *******************             | text                     |
 ****************                | integer                  |
 **********************          | text                     | default ''::text
 *******************             | text                     | default ''::text
 **********                      | integer                  |
 ***********                     | text                     |
 ***********                     | text                     |
 **********************          | integer                  |
 ******************************* | text                     | default ''::text
 ************************        | text                     | default ''::text
 ***********                     | integer                  | default 0
 *************                   | text                     |
 *******************             | integer                  |
 ****************                | integer                  | default 0
 ***************                 | text                     |
 **************                  | text                     |
Indexes:
    "datasets_pkey" PRIMARY KEY, btree (id)
    "datasets_date_added_idx" btree (date_added)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx1" btree (*)
    "datasets_*_idx" btree (*)
