Asked: 2024-05-22 19:51:32 +0800 CST

What is the fastest way to insert a large amount of data into a partitioned table in PostgreSQL?

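I have a table that is natively partitioned by date, with each partition covering 1 month. I have another very large table (19GB) whose data I want to copy into the partitioned table. I already used pg_partman for this, but the partman.partition_data_proc procedure took 12 hours to move 9GB of data into 60 new partitions. For reference, I am running Postgres 15 on Amazon RDS (M5 Large).

I tried using partman.partition_data_proc to move the data. For a more concrete case, consider the following queries: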

-- NOTE: Both tables have more columns, this is a minimal example
CREATE TABLE IF NOT EXISTS table1(
    id bigint not null,
    date timestamp not null,
    col_a integer,
    col_b double precision,
    col_c varchar(255)
);

-- insert some data into "table1" at this step
-- for example using something like this:
-- insert into table1 (
--      "id",
--  "date",
--  "col_a" ,
--  "col_b",
--  "col_c"
-- )
-- select
--  i,
--  get_random_date_between(start:='10 years', end:='1 day'),
--  random()::int,
--  (random()* 100)::numeric(10, 2),
--  'Some Text'
-- from
--  generate_series(1,300000000) s(i);

CREATE TABLE IF NOT EXISTS partitioned_table(
    id bigint not null,
    date timestamp not null,
    col_a integer,
    col_b double precision,
    col_c varchar(255)
) PARTITION BY RANGE (date);

-- NOTE: you will need to have pg_partman extension installed
-- https://github.com/pgpartman/pg_partman
SELECT partman.create_parent(
        p_parent_table => 'public.partitioned_table',
        p_control => 'date',
        p_interval => '1 month'
    );

          
-- This operation takes a very long time
call partman.partition_data_proc(
    p_parent_table := 'public.partitioned_table',
    p_interval := '1 month',
    p_source_table := 'public.table1'
);

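I also tried moving the data with DBeaver's export/import feature (it was slower, and it inserted the data into the default partition). Is there a faster way to do this? I would like to be able to transfer the data within 8 hours, without having to upgrade the RDS instance to a more expensive one.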

1 Answer

bobflux · Answer 1 · 2024-05-24T00:54:45+08:00

Fixed test data generation:

CREATE UNLOGGED TABLE IF NOT EXISTS table1(
    id bigint not null,
    date timestamp not null,
    col_a integer,
    col_b double precision,
    col_c varchar(255)
);

-- populate table1 with 300M rows of random test data
insert into table1 ("id","date","col_a","col_b","col_c")
select i,
 '2000-01-01'::DATE + ('1 DAY'::INTERVAL*(random()*7200)),
 (random()*65536)::int,
 (random()* 100)::numeric(10, 2),
 'Some Text'
from generate_series(1,300000000) s(i);

INSERT 0 300000000
Time: 517982,474 ms (08:37,982)

select pg_relation_size('table1')/1e9;
 22.9682298880000000

Create the partitioned table:

CREATE TABLE IF NOT EXISTS partitioned_table(
    id bigint not null,
    date timestamp not null,
    col_a integer,
    col_b double precision,
    col_c varchar(255)
) PARTITION BY RANGE (date);

Create a smaller test set to play with:

CREATE UNLOGGED TABLE IF NOT EXISTS table1small AS SELECT * FROM table1 LIMIT 10000000;

Try partman with p_wait := 0; otherwise it sleeps after moving each batch of rows, which takes a while:

call partman.partition_data_proc(
    p_parent_table := 'public.partitioned_table',
    p_interval := '1 month',
    p_source_table := 'public.table1small',
    p_wait := 0
);

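I noticed that it is very slow (about 40k rows/s), and that it moves the rows from table1small into the partitioned table rather than copying them. Moving rows is slow because the rows also have to be deleted from the source table.

I had never used pgpartman before, so maybe there is a setting that makes it copy rows instead of moving them. That would be much faster.

For example:

Moving the rows from table1small into the partitioned table with pgpartman took 140 seconds, about 71k rows/s.
INSERT INTO partitioned_table SELECT * FROM table1small took only 12 seconds, about 833k rows/s.
INSERT INTO dummy_non_partitioned_table SELECT * FROM table1small took 3 seconds, about 3.3M rows/s.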

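So if you want to quickly insert the entire contents of the old table into a new partitioned table, it is faster to do the following:

Create the new partitioned table as UNLOGGED: if the server crashes during the operation, you can always redo it
Create all the partitions "manually" (with a script)
INSERT INTO partitioned_table SELECT * FROM table1
Once you are happy with the result, set the new table to LOGGED to make it crash-safe, then drop or truncate the old table.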
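
The "create all the partitions with a script" step could look roughly like the sketch below. It only prints SQL; it does not connect to a database. The partition naming scheme (`<parent>_pYYYY_MM`), the helper names, and the date range (taken from the test data above: 2000-01-01 plus up to 7200 days) are assumptions for illustration. My understanding is that the UNLOGGED/LOGGED setting applies per partition rather than being inherited from the parent, so the sketch creates each partition as UNLOGGED and emits one SET LOGGED statement per partition; check this against your PostgreSQL version.

```python
from datetime import date


def month_ranges(first: date, last: date):
    """Return (start, end) pairs, one per calendar month from first's month to last's month."""
    ranges = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        start = date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        ranges.append((start, date(year, month, 1)))  # end bound is exclusive
    return ranges


def partition_ddl(parent: str, first: date, last: date):
    """One CREATE statement per monthly partition (hypothetical <parent>_pYYYY_MM names)."""
    return [
        f"CREATE UNLOGGED TABLE IF NOT EXISTS {parent}_p{start:%Y_%m} "
        f"PARTITION OF {parent} FOR VALUES FROM ('{start}') TO ('{end}');"
        for start, end in month_ranges(first, last)
    ]


def set_logged_ddl(parent: str, first: date, last: date):
    """One ALTER statement per partition, to run after the bulk INSERT has been verified."""
    return [
        f"ALTER TABLE {parent}_p{start:%Y_%m} SET LOGGED;"
        for start, _ in month_ranges(first, last)
    ]


if __name__ == "__main__":
    # Assumed range: the test data starts at 2000-01-01 and reaches into September 2019.
    for stmt in partition_ddl("partitioned_table", date(2000, 1, 1), date(2019, 9, 18)):
        print(stmt)
```

The printed statements can be saved to a file and run with psql before the bulk INSERT; the `set_logged_ddl` output is meant for the last step.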

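This still won't squeeze all the juice out of your box: I saw it use only 1 core, with write throughput below 100 MB/s. Inserting into a partitioned table appears to be slower than inserting into a non-partitioned table, most likely because a lot of time is spent checking the range bounds to decide which partition each row should go into.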

If this quick test holds, then at 300M rows and 833k rows/s it should finish in about 6 minutes.
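
The estimate can be reproduced from the three timings measured on the 10M-row sample. This is just the arithmetic, and it assumes throughput scales linearly from the sample to the full 300M-row table, which may not hold on a different machine such as an RDS instance. The function and constant names are made up for the example.

```python
def rows_per_second(rows: int, seconds: float) -> float:
    """Average throughput of one run."""
    return rows / seconds


def eta_minutes(total_rows: int, rate: float) -> float:
    """Time to process total_rows at a constant rate, in minutes."""
    return total_rows / rate / 60


SAMPLE_ROWS = 10_000_000   # rows in table1small
FULL_ROWS = 300_000_000    # rows in table1

# Wall-clock seconds reported in the answer for the 10M-row sample.
timings = {
    "pg_partman move": 140,
    "INSERT into partitioned table": 12,
    "INSERT into non-partitioned table": 3,
}

if __name__ == "__main__":
    for label, seconds in timings.items():
        rate = rows_per_second(SAMPLE_ROWS, seconds)
        minutes = eta_minutes(FULL_ROWS, rate)
        print(f"{label}: {rate:,.0f} rows/s, full table in about {minutes:.0f} min")
```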

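There may be a way to speed this up by taking advantage of synchronized sequential scans: use a script to run one query per partition, all in parallel. INSERT INTO partition SELECT * FROM table1 WHERE date >= 'start of partition' AND date < 'end of partition', or something along those lines.

That needs one process per partition, and there are a lot of partitions. So maybe split it into two steps: first split by year to get one temporary table per year, then split each of those by month into the final partitions. This should be able to eat all the CPU available in the box, which may be a feature or a problem depending on the circumstances...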
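
A rough sketch of that idea: generate one INSERT per monthly partition and hand the statements to a small worker pool. To keep it self-contained, the function that actually executes SQL is passed in as `run_sql`; in practice it would open its own database connection per call (for example through a driver or by shelling out to psql), which is not shown here. The `<parent>_pYYYY_MM` partition names and all helper names are assumptions, not something pg_partman or PostgreSQL prescribes.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def month_ranges(first: date, last: date):
    """Return (start, end) pairs, one per calendar month from first's month to last's month."""
    ranges = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        start = date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        ranges.append((start, date(year, month, 1)))  # end bound is exclusive
    return ranges


def insert_statements(source: str, parent: str, first: date, last: date):
    """One INSERT per partition, each reading only its own month from the source table."""
    return [
        f"INSERT INTO {parent}_p{start:%Y_%m} SELECT * FROM {source} "
        f"WHERE date >= '{start}' AND date < '{end}';"
        for start, end in month_ranges(first, last)
    ]


def run_parallel(statements, run_sql, workers: int = 4):
    """Run statements concurrently; run_sql is a caller-supplied function (hypothetical)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_sql, statements))


if __name__ == "__main__":
    stmts = insert_statements("table1", "partitioned_table", date(2000, 1, 1), date(2000, 3, 31))
    # Stand-in executor that only prints; replace with real database access.
    run_parallel(stmts, print, workers=2)
```

Capping `workers` keeps the number of concurrent sessions below the number of partitions, which is one way around the "one process per partition" concern.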
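
The two-step split could be planned as in the sketch below, which only builds the SQL text. One deliberate deviation from the wording above: real TEMPORARY tables are visible only to the session that created them, so parallel workers could not read them, and the sketch uses UNLOGGED staging tables instead. The staging table names (`<source>_yYYYY`), the partition names (`<parent>_pYYYY_MM`), and the function name are assumptions for illustration.

```python
from datetime import date


def two_step_plan(source: str, parent: str, first_year: int, last_year: int):
    """Map each year to (staging CREATE statement, list of 12 monthly INSERT statements)."""
    plan = {}
    for year in range(first_year, last_year + 1):
        staging = f"{source}_y{year}"  # hypothetical staging table name
        # Step 1: one scan of the big table per year.
        create = (
            f"CREATE UNLOGGED TABLE {staging} AS SELECT * FROM {source} "
            f"WHERE date >= '{date(year, 1, 1)}' AND date < '{date(year + 1, 1, 1)}';"
        )
        # Step 2: split the much smaller yearly table into its monthly partitions.
        inserts = []
        for month in range(1, 13):
            start = date(year, month, 1)
            end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
            inserts.append(
                f"INSERT INTO {parent}_p{start:%Y_%m} SELECT * FROM {staging} "
                f"WHERE date >= '{start}' AND date < '{end}';"
            )
        plan[year] = (create, inserts)
    return plan


if __name__ == "__main__":
    plan = two_step_plan("table1", "partitioned_table", 2000, 2019)
    create, inserts = plan[2000]
    print(create)
    print(inserts[0])
```

With this layout the first step needs about 20 concurrent scans of the big table instead of one per month, and each yearly staging table can be dropped once its 12 inserts have finished.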
