
Question

Asked: 2024-02-15 11:34:47 +0800 CST

Need help identifying opportunities to improve table configuration and query design


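I am working on a project with a fairly large data set. We need to run arbitrary aggregations over this data set, and they are generated at user request. Below is a basic description of our current setup on PostgreSQL v11 (yes, we know it is end-of-life; the upgrade is planned for next quarter).

The basic table structure is as follows: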

create table if not exists sales
(
    category_a  smallint,    -- sequential integer values from 0 - 10000
    category_b  varchar(3),  -- 3-digit ids (all numeric, padded with zeros)
    product     varchar(14), -- essentially random 14 character identifiers
    location_id varchar(5),  -- location id, 5-digit number (left padded with zeros)
    units       int,         -- value of interest
    sales       float,       -- second value of interest
    primary key (category_a, category_b, product, location_id)
) partition by range (category_a);

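We currently partition by A because these values roll out after about 200 values and are removed from the data set. The A partitions are further sub-partitioned by B. Each A_B partition contains roughly 50 to 70 million rows.

The values of B are non-contiguous and have gaps.

There are roughly one million distinct product values.

location_id: each category B has about 50-100 locations, and each location carries most of the products.

An example query looks like this: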

select category_a, category_b, product, sum(units), sum(sales)
from sales
where category_a between 1 and 100
  and sales.category_b in ('001', '010', '018', '019', '024')
  and product in ('00000000000147', '00000000000900', '00000000000140', '00000000009999')
group by category_a, category_b, product;

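The EXPLAIN for this query shows a full sequential scan of every partition in the data set. This seems odd, because we have a unique index whose three leftmost columns are exactly the three columns used in the WHERE and GROUP BY clauses. I do not understand why the index is not used.

Here is a query that loads sample data into the table: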

insert into sales
    (category_a, category_b, product, location_id, units, sales)
select cat_a,
       lpad(cat_b::varchar, 3, '0'),
       lpad(product::varchar, 14, '0'),
       lpad(location_id::varchar, 5, '0'),
       (random() * 10000)::int,
       (random() * 100000)::int
from generate_series(1, 50) cat_a
         cross join generate_series(1, 25) cat_b
         cross join generate_series(1, 10) location_id
         cross join generate_series(1, 5000) product;

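The EXPLAIN output for the query is long, but I can provide it if you think it would help.

These queries can be very slow (several minutes, sometimes more than 10 minutes). I am happy to provide more details, but this is the basic information (as far as I can tell, anyway).

What changes can or should we make to the table or the query to improve the performance of this query?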

2 Answers


Laurenz Albe · Answer 1 · 2024-02-15T15:41:53+08:00

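For the index to be used, you should make product the first column of the primary key. For the third index column to be used efficiently with an IN condition, all preceding index columns must be compared with =, not with IN.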

bobflux · Answer 2 · 2024-02-15T19:39:05+08:00

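I built the test data with this Pastebin code.

Using postgres 16.1.

A btree index works like a phone book, which I will use as an example. Entries are sorted by (last name, first name). So a query like this: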

last_name=constant AND first_name BETWEEN first AND last

...results in a single lookup of (last_name=constant, first_name=first), followed by a sequential read of index rows until it hits (last_name=constant, first_name=last). This is fast.
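To make the phone-book picture concrete, here is a minimal sketch in Python. A sorted list of tuples stands in for the btree's ordered entries; the names, the data and the `range_lookup` helper are all made up for illustration, and this is not how postgres is implemented internally.

```python
from bisect import bisect_left, bisect_right

# Toy "phone book": a sorted list of (last_name, first_name) tuples
# stands in for the ordered entries of a btree index.
phone_book = sorted([
    ("Adams", "Zoe"),
    ("Smith", "Alice"),
    ("Smith", "Bob"),
    ("Smith", "Carol"),
    ("Smith", "Dave"),
    ("Young", "Bob"),
])

def range_lookup(index, last_name, first_lo, first_hi):
    """last_name = constant AND first_name BETWEEN first_lo AND first_hi."""
    # One search finds the first entry >= (last_name, first_lo).
    start = bisect_left(index, (last_name, first_lo))
    # The matching entries are contiguous, so the result is a single slice.
    # (A real index would simply walk forward from `start`; the second
    # search here is just a compact way to find where that walk stops.)
    end = bisect_right(index, (last_name, first_hi))
    return index[start:end]

smith_b_to_d = range_lookup(phone_book, "Smith", "Bob", "Dave")
print(smith_b_to_d)
```

The point of the sketch is that the whole answer lives in one contiguous run of the sorted data, which is why this query shape is cheap.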

Unfortunately, for a query like this:

WHERE last_name BETWEEN ... AND first_name BETWEEN ....

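it is more complicated and needs a different approach: scan the index to find every last name that satisfies the condition, then do what the previous paragraph describes for each of those last names. Unfortunately, this approach is not currently implemented in postgres (as of the 16.1 used here). So we have to give it some manual help.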
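The "for each last name, do a range lookup" approach described above can be sketched as follows. This is only an illustration of the idea on a sorted Python list, with invented data and a hypothetical `double_range_lookup` helper; it makes no claim about how any database implements it.

```python
from bisect import bisect_left, bisect_right

# Toy index sorted by (last_name, first_name).
phone_book = sorted([
    ("Adams", "Carol"),
    ("Adams", "Zoe"),
    ("Baker", "Alice"),
    ("Baker", "Bob"),
    ("Smith", "Bob"),
    ("Smith", "Eve"),
    ("Young", "Bob"),
])

def double_range_lookup(index, last_lo, last_hi, first_lo, first_hi):
    """last_name BETWEEN last_lo AND last_hi
       AND first_name BETWEEN first_lo AND first_hi."""
    last_names = [last for last, _ in index]  # leading column only
    results = []
    # Jump to the first entry whose last name is >= last_lo.
    pos = bisect_left(last_names, last_lo)
    while pos < len(index) and last_names[pos] <= last_hi:
        last_name = last_names[pos]
        # Same contiguous range lookup as in the single-constant case.
        start = bisect_left(index, (last_name, first_lo), pos)
        end = bisect_right(index, (last_name, first_hi), start)
        results.extend(index[start:end])
        # Skip everything else under this last name and move to the next one.
        pos = bisect_right(last_names, last_name, end)
    return results

hits = double_range_lookup(phone_book, "Adams", "Smith", "Bob", "Dave")
print(hits)
```

Each distinct last name costs one extra search, so the method pays off when the leading column has few distinct values inside the requested range.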

select category_a, category_b, product, sum(units), sum(sales)
from sales
where category_a between 1 and 100
  and sales.category_b in ('001', '010', '018', '019', '024')
  and product in ('00000000000147', '00000000000900', '00000000000140', '00000000009999')
group by category_a, category_b, product;

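The original query takes 4300 ms.

Your sample data only populates the table with category_a between 1 and 50, so the category_a condition in the query is redundant, but I assume it is not redundant in the real queries. Either way, because category_a is the partition key, postgres only reads the partitions relevant to the query, so that part is already optimized.

"We currently partition by A because these values roll out after about 200 values and are removed from the data set. The A partitions are further sub-partitioned by B. Each A_B partition contains roughly 50 to 70 million rows."

It is not clear whether you partition by A or by A,B. I will assume you partition by A only, because that is what the question shows.

Now, as Laurenz Albe said, your primary key index is not adequate for this query. You need this: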

CREATE INDEX ON sales( category_b, product );

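If you partition by A,B instead of by A only, then you only need an index on (product), because category_b will be handled at the partition level.

With this index, the query above does an index scan and finishes in 4 ms, about 1000 times faster. Quoting just part of the EXPLAIN, there is one of these for each partition: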

 ->  Bitmap Heap Scan on sales_a001 sales_1  (cost=90.60..799.11 rows=197 width=33) (actual time=0.468..0.596 rows=150 loops=1)
       Recheck Cond: (((category_b)::text = ANY ('{001,010,018,019,024}'::text[])) AND ((product)::text = ANY ('{00000000000147,00000000000900,00000000000140,00000000009999}'::text[])))
       Filter: ((category_a >= 1) AND (category_a <= 10))
       Heap Blocks: exact=9
       ->  Bitmap Index Scan on sales_a001_category_b_product_idx  (cost=0.00..90.55 rows=197 width=0) (actual time=0.447..0.447 rows=150 loops=1)
             Index Cond: (((category_b)::text = ANY ('{001,010,018,019,024}'::text[])) AND ((product)::text = ANY ('{00000000000147,00000000000900,00000000000140,00000000009999}'::text[])))

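This is great, but it requires an additional index, and that index will be quite large. Can we do it without the index? Yes, with a little cheating. So first I drop the index created above, and then: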

select category_a, category_b, product, sum(units), sum(sales)
from sales
where category_a = 1
  and sales.category_b in ('001', '010', '018', '019', '024')
  and product in ('00000000000147', '00000000000900', '00000000000140', '00000000009999')
group by category_a, category_b, product;

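If we use a constant, category_a = 1, then postgres can use an index scan.

What it actually does is take the two lists for category_b and product, iterate over them to enumerate all combinations, and do an index search for each combination. That is still much faster than a seq scan.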
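As a rough illustration of "enumerate all combinations and probe the index once per combination", here is a small Python model. The sorted list stands in for one partition's primary key index; the toy rows and the `probe` helper are hypothetical, and the real executor works differently in its details.

```python
from bisect import bisect_left
from itertools import product as cartesian

# Toy stand-in for the primary key index of one partition:
# sorted (category_a, category_b, product, location_id) tuples.
pkey_index = sorted(
    (1, f"{b:03d}", f"{p:014d}", f"{loc:05d}")
    for b in (1, 10, 18)
    for p in (140, 147, 900)
    for loc in (1, 2)
)

def probe(index, prefix):
    """Return every entry that starts with `prefix`, found by one binary search."""
    pos = bisect_left(index, prefix)
    rows = []
    while pos < len(index) and index[pos][:len(prefix)] == prefix:
        rows.append(index[pos])
        pos += 1
    return rows

category_a = 1
category_b_list = ["001", "010", "018", "019", "024"]
product_list = ["00000000000147", "00000000000900",
                "00000000000140", "00000000009999"]

probes = 0
matches = []
# One index probe per (category_b, product) combination.
for b, p in cartesian(category_b_list, product_list):
    probes += 1
    matches.extend(probe(pkey_index, (category_a, b, p)))

print(probes, len(matches))
```

With a constant category_a, the work is bounded by the product of the two list lengths, regardless of how many rows the partition holds.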

This also works:

select category_a, category_b, product, sum(units), sum(sales)
from sales
where category_a IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100)
  and sales.category_b in ('001', '010', '018', '019', '024')
  and product in ('00000000000147', '00000000000900', '00000000000140', '00000000009999')
group by category_a, category_b, product;

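Replacing the range with a list of known values makes an index scan possible. However, the number of combinations to enumerate is quite large. Postgres does not know that it can skip all the category_a values that do not exist in a given partition, so it runs this search on every partition: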

->  Index Scan using sales_a002_pkey on sales_a002 sales_2  (cost=0.43..8308.58 rows=202 width=33) (actual time=0.035..1.421 rows=150 loops=1)
Index Cond: ((category_a = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100}'::integer[])) AND ((category_b)::text = ANY ('{001,010,018,019,024}'::text[])) AND ((product)::text = ANY ('{00000000000147,00000000000900,00000000000140,00000000009999}'::text[])))

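So this query is much faster (175 ms) than the full seq scan query (4300 ms), but also much slower than using the index on (category_b, product) (4 ms). Still, if this is an occasional query and you do not want to carry a huge index around just for it, this is an interesting middle ground.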
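A back-of-the-envelope count shows why the IN-list form does so much more index work than a constant category_a. The list sizes come from the queries above; the partition count is an assumption (one partition per category_a value, 50 of them, matching the question's sample data), so treat the totals as illustrative only.

```python
# List sizes taken from the queries in this answer.
n_category_a = 100   # category_a IN (1, ..., 100)
n_category_b = 5
n_product = 4

# Assumption: one partition per category_a value, 50 partitions in total.
n_partitions = 50

# Combinations enumerated in EACH partition.
combos_in_list = n_category_a * n_category_b * n_product    # IN-list on category_a
combos_constant = 1 * n_category_b * n_product              # category_a = constant

# Totals if every partition is searched.
total_in_list = combos_in_list * n_partitions
total_constant = combos_constant * n_partitions

print(combos_in_list, combos_constant, total_in_list, total_constant)
```

Under these assumptions, most of the combinations probed in each partition can never match, because each partition holds only one category_a value.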

Now, of course, there are more ways to cheat:

select category_a, category_b, product, sum(units), sum(sales) from sales_a001
        where category_a = 1 AND category_b in ('001', '010', '018', '019', '024')
        and product in ('00000000000147', '00000000000900', '00000000000140', '00000000009999')
        group by category_a, category_b, product
UNION ALL select category_a, category_b, product, sum(units), sum(sales) from sales_a002
        where category_a = 2 AND
        -- ... repeat for all values of category_a

This combines many levels of bad design, but it runs in under 20 ms without the additional index.

I also tried clickhouse.
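Nobody wants to type that UNION ALL by hand, so here is a small sketch that generates the query text. It assumes one partition per category_a value, named like sales_a001 as in the EXPLAIN output above; that naming rule, the `build_union_all` helper and the minimal `quote_literal` quoting are assumptions for illustration, and real code should pass values as bound parameters where possible.

```python
def quote_literal(value: str) -> str:
    """Minimal SQL string-literal quoting: doubles embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def build_union_all(category_a_values, category_b_values, products):
    """Build one SELECT per partition and glue them together with UNION ALL."""
    b_list = ", ".join(quote_literal(b) for b in category_b_values)
    p_list = ", ".join(quote_literal(p) for p in products)
    branches = []
    for a in category_a_values:
        a = int(a)                      # category_a is a smallint
        partition = f"sales_a{a:03d}"   # assumed naming, e.g. sales_a001
        branches.append(
            "select category_a, category_b, product, sum(units), sum(sales)\n"
            f"from {partition}\n"
            f"where category_a = {a}\n"
            f"  and category_b in ({b_list})\n"
            f"  and product in ({p_list})\n"
            "group by category_a, category_b, product"
        )
    return "\nUNION ALL\n".join(branches)

sql = build_union_all(range(1, 4), ["001", "010"], ["00000000000147"])
print(sql)
```

Each branch names its partition directly and uses a constant category_a, which is the form that allowed an index scan on the primary key earlier in this answer.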

CREATE TABLE IF NOT EXISTS sales2( 
    category_a  Int16           NOT NULL CODEC(Delta,ZSTD(9)),    -- sequential integer values from 0 - 10000
    category_b  FixedString(3)  NOT NULL CODEC(ZSTD(9)),  -- 3-digit ids (all numeric, padded with zeros)
    product     FixedString(14)     NOT NULL CODEC(ZSTD(9)), -- essentially random 14 character identifiers
    location_id FixedString(5)     NOT NULL CODEC(ZSTD(9)),  -- location id, 5-digit number (left padded with zeros)
    units       Int32           NOT NULL CODEC(ZSTD(9)),         -- value of interest
    sales       Float32         NOT NULL CODEC(ZSTD(9))       -- second value of interest
)   ENGINE=MergeTree
    PARTITION BY (category_a)
    ORDER BY (category_a,category_b,product)
    PRIMARY KEY (category_a,category_b,product)
    SETTINGS index_granularity=256;

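I copied the data from postgres and loaded it. The compression ratio is about 6:1, which is not ideal because of the text columns (storing the numbers as integers would give better results). The whole table plus index takes about 340MB, versus 4.3GB for postgres: thanks to compression, the same RAM caches about 12 times more data.

The query from the question takes 20-70 ms, depending on the index granularity setting.

Switching the row storage order and the primary key to (category_a, product, category_b) seems to make it faster, because product is more selective.

It could probably be optimized further, but this was only a 10-minute test.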
