Summary
I am trying to bulk-insert data into a timescaledb hypertable. No matter what I try, memory usage grows gradually until the server process is killed for running out of memory. I observe this with datasets as small as 1.6 million rows, on a server with 128 GB of RAM available to postgres/timescaledb, so I must be doing something wrong. Performing exactly the same insert into a table that is not a hypertable works fine, so the problem must be related to timescaledb.
What I am trying to do
The table I want to insert into is defined as follows:
CREATE TABLE test (
    gid BIGINT NOT NULL,
    location_ GEOGRAPHY(POINT),
    -- about 15 other VARCHAR and BIGINT columns omitted for brevity
    far_end_gid BIGINT NOT NULL,
    day_partition_key DATE NOT NULL,
    PRIMARY KEY (gid, far_end_gid, day_partition_key)
);
SELECT create_hypertable(
    'test', 'day_partition_key', chunk_time_interval => INTERVAL '1 day');
The data to be inserted sits in another table in the database (a temporary table populated via a \COPY operation); this other table has the same fields as the target hypertable, except that some fields need parsing (converting strings to dates, etc.). The failing INSERT query is
INSERT INTO test
SELECT
    tobigintornull(gid) AS gid_as_int,
    -- about 15 other VARCHAR and BIGINT columns omitted
    CAST(CAST(REPLACE(far_end_gid, ',', '.') AS DOUBLE PRECISION) AS BIGINT),
    CAST(tobigintornull(far_end_cell_id) AS BIGINT),
    TO_DATE(day_partition_key, 'YYYYMMDD')
FROM test_temp_table
WHERE
    tobigintornull(gid) IS NOT NULL
    AND tobigintornull(REPLACE(far_end_gid, ',', '.')) IS NOT NULL
    AND day_partition_key IS NOT NULL
ON CONFLICT DO NOTHING;
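For context, the staging table mentioned above was populated along these lines (a sketch; the exact column layout, file name, and CSV options are assumptions, since they were not shown):

```sql
-- hypothetical staging table: every column arrives as text and is parsed later
CREATE TABLE test_temp_table (
    gid               TEXT,
    -- about 15 other text columns omitted, mirroring the hypertable
    far_end_gid       TEXT,
    far_end_cell_id   TEXT,
    day_partition_key TEXT
);
-- psql meta-command: client-side COPY from a local file
\COPY test_temp_table FROM 'input.csv' WITH (FORMAT csv, HEADER true)
```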
The ON CONFLICT clause is meant to silently drop duplicate primary keys, and the function tobigintornull does what its name suggests: it converts its input to a bigint if possible, and returns null otherwise (which helps drop rows that cannot be parsed). Its definition is
CREATE OR REPLACE FUNCTION tobigintornull(text) RETURNS BIGINT AS $$
DECLARE
    x BIGINT;
BEGIN
    x = $1::BIGINT;
    RETURN x;
EXCEPTION WHEN others THEN
    RETURN NULL;
END;
$$
STRICT
LANGUAGE plpgsql IMMUTABLE;
A similar function is defined for DOUBLE PRECISION input. Note that the WHERE clause only removes a small fraction of the rows (definitely less than 1%).
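For completeness, the DOUBLE PRECISION variant mentioned above would look roughly like this (a sketch mirroring tobigintornull; the function name is an assumption, since the actual definition was omitted):

```sql
-- hypothetical name; converts text to double precision, or null if unparseable
CREATE OR REPLACE FUNCTION todoubleornull(text) RETURNS DOUBLE PRECISION AS $$
DECLARE
    x DOUBLE PRECISION;
BEGIN
    x = $1::DOUBLE PRECISION;
    RETURN x;
EXCEPTION WHEN others THEN
    RETURN NULL;
END;
$$
STRICT
LANGUAGE plpgsql IMMUTABLE;
```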
The full dataset is 593 million rows and is not sorted in any way. Suspecting that the lack of ordering in the input was part of the problem, I created a subset of 1.6 million rows in which all rows share the same value of day_partition_key (from timescaledb's point of view, that subset should be perfectly sorted).
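The single-day subset can be produced with something like this (a sketch; the subset table name and the date value are assumptions):

```sql
-- keep only rows for one day, so the insert only ever targets a single chunk
CREATE TABLE test_temp_subset AS
SELECT *
FROM test_temp_table
WHERE day_partition_key = '20210801';  -- example value; yields ~1.6 M rows
```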
The problem
The problem manifests as postgres's memory usage growing steadily until the full 128 GB available to the database is in use (this takes about 5 minutes). After a minute or so at 100% memory usage, the insert query crashes. The logs show the following:
timescaledb_1 | 2021-08-23 11:56:50.727 UTC [1] LOG: server process (PID 1231) was terminated by signal 9: Killed
timescaledb_1 | 2021-08-23 11:56:50.727 UTC [1] DETAIL: Failed process was running: INSERT INTO test
timescaledb_1 | SELECT
(...)
timescaledb_1 | 2021-08-23 11:56:50.727 UTC [1] LOG: terminating any other active server processes
timescaledb_1 | 2021-08-23 11:56:50.741 UTC [1215] WARNING: terminating connection because of crash of another server process
timescaledb_1 | 2021-08-23 11:56:50.741 UTC [1215] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
timescaledb_1 | 2021-08-23 11:56:50.741 UTC [1215] HINT: In a moment you should be able to reconnect to the database and repeat your command.
timescaledb_1 | 2021-08-23 11:56:50.744 UTC [1221] WARNING: terminating connection because of crash of another server process
timescaledb_1 | 2021-08-23 11:56:50.744 UTC [1221] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
timescaledb_1 | 2021-08-23 11:56:50.744 UTC [1221] HINT: In a moment you should be able to reconnect to the database and repeat your command.
timescaledb_1 | 2021-08-23 11:56:50.771 UTC [1] LOG: all server processes terminated; reinitializing
timescaledb_1 | 2021-08-23 11:56:51.371 UTC [1259] LOG: database system was interrupted; last known up at 2021-08-23 11:50:40 UTC
timescaledb_1 | 2021-08-23 11:56:51.711 UTC [1259] LOG: database system was not properly shut down; automatic recovery in progress
timescaledb_1 | 2021-08-23 11:56:51.720 UTC [1259] LOG: redo starts at 25A/16567EF8
timescaledb_1 | 2021-08-23 11:56:51.788 UTC [1259] LOG: invalid record length at 25A/165C1A30: wanted 24, got 0
timescaledb_1 | 2021-08-23 11:56:51.788 UTC [1259] LOG: redo done at 25A/165C19D0
timescaledb_1 | 2021-08-23 11:56:51.893 UTC [1] LOG: database system is ready to accept connections
timescaledb_1 | 2021-08-23 11:56:52.039 UTC [1265] LOG: TimescaleDB background worker launcher connected to shared catalogs
Obviously, no rows are present in the target table after the database recovers.
What I have tried
- I reduced the dataset from 593 million rows to 1.6 million, making sure the subset contains only a single value of the date column used for chunking (day_partition_key). The result is exactly the same.
- Following the discussion at https://github.com/timescale/timescaledb/issues/643, I changed the timescaledb configuration with SET timescaledb.max_open_chunks_per_insert=1;. The problem remained exactly the same.
- I tried creating another target table without making it a hypertable. Inserting the 1.6 M row subset then works fine. I expected the full set to work as well, but I have not yet taken the time to run it.
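The non-hypertable control test amounts to this (a sketch; the plain table's name is an assumption):

```sql
-- identical schema to the hypertable, but no create_hypertable() call
CREATE TABLE test_plain (LIKE test INCLUDING ALL);
-- running the same INSERT ... SELECT against test_plain completes normally
```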
Versions, hardware, and configuration
The docker image timescale/timescaledb-postgis:latest-pg13 (bf76e5594c98) is used to run timescaledb. It contains:
- PostgreSQL 13.3 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 64-bit
- TimescaleDB 2.3.0
- POSTGIS="2.5.5" [EXTENSION] PGSQL="130" GEOS="3.8.1-CAPI-1.13.3" PROJ="Rel. 7.1.1, September 1st, 2020" GDAL="GDAL 3.1.4, released 2020/10/20" LIBXML="2.9.10" LIBJSON="0.15" LIBPROTOBUF="1.3.3" RASTER
The docker container is limited to 16 cores and 128 GB of memory. For the tests mentioned above, I used the configuration parameters suggested by https://pgtune.leopard.in.ua/#/ for a data warehouse with my parameters, except that I lowered the available memory reported to pgtune to 64 GB, hoping this would produce a configuration that makes the database use less memory (I also tried the settings recommended for 128 GB of memory, with the same result). The settings are as follows:
max_connections = 30
shared_buffers = 16GB
effective_cache_size = 48GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 34952kB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
max_parallel_maintenance_workers = 4