I am performance testing Postgres 12 with partitioning and a large amount of data. Each partition holds a single site with 400k rows, and I generated about 1k partitioned tables.
For the first test suite I used UUIDs as ids, but I figured that changing the id type to bigint would use less space and therefore perform better. After populating the tables, I ran the following SELECT a hundred times with different data:
SELECT site, SUM(amount)
FROM test_table
WHERE date >= '2021-02-06'
AND date <= '2021-02-07'
AND site IN ('c3b3771c-4b48-41a9-88eb-4c47d1630644', 'cbb11cdc-cd31-4da2-b14e-9ef878ce03c5', '2609ac86-995b-4320-a3b7-46ba175aa5e2') -- randomly picked from the site pool
GROUP BY site
ORDER BY site;
UUID test suite, without an index on date:
CREATE TABLE public.test_table
(
id UUID NOT NULL,
site UUID,
archive UUID,
location UUID,
col_1 UUID,
col_2 UUID,
col_3 UUID,
amount numeric(8,2),
date timestamp with time zone,
...
) PARTITION BY LIST (site);
CREATE TABLE test_table_${site} PARTITION OF test_table FOR VALUES IN ('${site}');
One table size: "265 MB"
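The post does not include the population script; below is a minimal sketch of how ~1k partitions with 400k rows each could have been generated. It assumes the pgcrypto extension for gen_random_uuid() (needed on Postgres 12) and only fills the columns used in the test query; the value ranges are illustrative.

CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- provides gen_random_uuid() on Postgres 12

DO $$
DECLARE
    s uuid;
BEGIN
    FOR i IN 1..1000 LOOP
        s := gen_random_uuid();
        -- partition names like test_table_c3b3771c_4b48_41a9_88eb_4c47d1630644
        EXECUTE format(
            'CREATE TABLE test_table_%s PARTITION OF test_table FOR VALUES IN (%L)',
            replace(s::text, '-', '_'), s);
        -- 400k rows per site, dates spread over a few months
        INSERT INTO test_table (id, site, amount, date)
        SELECT gen_random_uuid(), s,
               round((random() * 1000)::numeric, 2),
               timestamptz '2021-01-01' + random() * interval '90 days'
        FROM generate_series(1, 400000);
    END LOOP;
END $$;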
BIGINT test suite, without an index on date:
CREATE TABLE public.test_table
(
id bigint NOT NULL,
site bigint,
archive bigint,
location bigint,
col_1 bigint,
col_2 bigint,
col_3 bigint,
amount numeric(8,2),
date timestamp with time zone,
...
) PARTITION BY LIST (site);
CREATE TABLE test_table_${site} PARTITION OF test_table FOR VALUES IN ('${site}');
One table size: "118 MB"
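The post does not say how the sizes were measured; one way to get the per-partition figure quoted above (the partition name is taken from the BIGINT run below):

-- total size of one partition, including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('test_table_121'));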
Test results
UUID test results (ms) for 100 serial selects:
median: 1,425.00
95th percentile: 1,930.05

BIGINT test results (ms) for 100 serial selects:
median: 4,456.00
95th percentile: 9,037.50
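For reference, if the 100 measured run times are collected into a table, both statistics can be computed with ordered-set aggregates; a minimal sketch, assuming a hypothetical timings(ms) table:

-- median and 95th percentile over the recorded run times
SELECT percentile_cont(0.5)  WITHIN GROUP (ORDER BY ms) AS median_ms,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY ms) AS p95_ms
FROM timings;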
The corresponding EXPLAIN plans:
UUID
"GroupAggregate (cost=61944.56..61947.03 rows=90 width=88)"
" Group Key: test_table_c3b3771c_4b48_41a9_88eb_4c47d1630644.site"
" -> Sort (cost=61944.56..61944.78 rows=90 width=48)"
" Sort Key: test_table_c3b3771c_4b48_41a9_88eb_4c47d1630644.site"
" -> Gather (cost=1000.00..61941.63 rows=90 width=48)"
" Workers Planned: 3"
" -> Parallel Append (cost=0.00..60932.63 rows=30 width=48)"
" -> Parallel Seq Scan on test_table_c3b3771c_4b48_41a9_88eb_4c47d1630644 (cost=0.00..20311.16 rows=10 width=48)"
" Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{c3b3771c-4b48-41a9-88eb-4c47d1630644,cbb11cdc-cd31-4da2-b14e-9ef878ce03c5,2609ac86-995b-4320-a3b7-46ba175aa5e2}'::uuid[])))"
" -> Parallel Seq Scan on test_table_cbb11cdc_cd31_4da2_b14e_9ef878ce03c5 (cost=0.00..20311.16 rows=10 width=48)"
" Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{c3b3771c-4b48-41a9-88eb-4c47d1630644,cbb11cdc-cd31-4da2-b14e-9ef878ce03c5,2609ac86-995b-4320-a3b7-46ba175aa5e2}'::uuid[])))"
" -> Parallel Seq Scan on test_table_2609ac86_995b_4320_a3b7_46ba175aa5e2 (cost=0.00..20310.16 rows=10 width=48)"
" Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{c3b3771c-4b48-41a9-88eb-4c47d1630644,cbb11cdc-cd31-4da2-b14e-9ef878ce03c5,2609ac86-995b-4320-a3b7-46ba175aa5e2}'::uuid[])))"
BIGINT
"Finalize GroupAggregate (cost=47951.35..47954.22 rows=21 width=80)"
" Group Key: test_table_121.site"
" -> Gather Merge (cost=47951.35..47953.63 rows=18 width=80)"
" Workers Planned: 3"
" -> Partial GroupAggregate (cost=46951.31..46951.48 rows=6 width=80)"
" Group Key: test_table_121.site"
" -> Sort (cost=46951.31..46951.33 rows=6 width=40)"
" Sort Key: test_table_121.site"
" -> Parallel Append (cost=0.00..46951.24 rows=6 width=40)"
" -> Parallel Seq Scan on test_table_121 (cost=0.00..15651.09 rows=2 width=40)"
" Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{121,122,242}'::bigint[])))"
" -> Parallel Seq Scan on test_table_242 (cost=0.00..15651.09 rows=2 width=40)"
" Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{121,122,242}'::bigint[])))"
" -> Parallel Seq Scan on test_table_122 (cost=0.00..15649.02 rows=2 width=40)"
" Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{121,122,242}'::bigint[])))"
How can there be such a large difference in select time when the data is smaller? Or did I make a mistake somewhere in the test?
Thanks in advance!
My guess is that the difference comes from how you ran the tests. After looking at your sample test query, I believe one dataset may simply be hitting much more favorable test parameters than the other, in particular in your WHERE clause. Without seeing the equivalent test query you ran against the BIGINT dataset it is hard to compare, but I think the test could be unbalanced for the following possible reasons.

The date range you used may strongly favor the data partitioned by the UUID site values over the BIGINT ones, especially since the BIGINT values presumably order very differently from the UUIDs.

The way you pick the site values for the WHERE predicate may also favor the UUID test partitions over the BIGINT ones. You say you pick them randomly from the site pool, but that depends on how truly random the selection is, plus the fact that the ordering of partitions keyed by UUID will differ greatly from the ordering of those keyed by BIGINT. Again, without seeing the equivalent sample query for the BIGINT test, and how you randomly choose that predicate in both cases, it is hard to say how much impact this has.

All in all, I don't see anything else that would account for such a large difference in the results, which is what leads me to the theory above. Unfortunately, if the problem is in how you tested the data, as I suspect, then there is no reputable source that can hand you an answer. Instead, start by simplifying the test to eliminate the variables that could be skewing the results, and go from there.
For example, start with boundary tests: manually pick the first partition value, the last one, and one near the middle, then also run the test against all partitions, with no date predicate in any of these cases, to rule out the sources of error described above. Then, for the specific site predicates you are testing, introduce a date range predicate that you know covers an equal number of partitions with an equal number of rows in each; a sketch follows.
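A sketch of such a controlled run. The :'first_site', :'middle_site', :'last_site', :'range_start', and :'range_end' parameters are psql variables set by hand beforehand with \set, holding boundary values picked from each dataset's site pool:

-- Boundary test: fixed first/middle/last partition values, no date predicate
SELECT site, SUM(amount)
FROM test_table
WHERE site IN (:'first_site', :'middle_site', :'last_site')
GROUP BY site
ORDER BY site;

-- Then reintroduce a date range known to cover an equal number of rows
-- in every partition under test
SELECT site, SUM(amount)
FROM test_table
WHERE date >= :'range_start'
  AND date <= :'range_end'
  AND site IN (:'first_site', :'middle_site', :'last_site')
GROUP BY site
ORDER BY site;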
Essentially, controlled tests like this will give you far more meaningful information here than randomized ones.

I ran the same tests against the same data:
Test, 1 day interval, 3 site ids:
Test, 1 week interval, 3 site ids:
Test, 1 month interval, 3 site ids:
When I added 10 site ids to the select, the difference became even more pronounced:
Test, 1 day interval, 10 site ids:
Test, 1 week interval, 10 site ids:
Test, 1 month interval, 10 site ids:
I think the problem was that the first round of tests ran right after the data was generated, while the Postgres server was presumably still running some data reorganization tasks in the background.
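One way to rule that out before re-running the timings, as a sketch: force the maintenance work up front and check that no autovacuum workers are still busy:

VACUUM ANALYZE test_table;  -- on a partitioned table this processes every partition

-- lingering autovacuum workers show up in pg_stat_activity
SELECT pid, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';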