UPDATE FROM with a large table is slow and uses Seq Scans

Asked by Joe on 2022-06-16 07:26:42 +0800 CST

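I have a large table (it may eventually hold 1 billion rows, but currently has about 26 million), and I want to set a flag on the row with the highest PK in each given group, in one pass.

I chose to create a temporary table that stores the PKs that should get current=true; all other rows should get current=false. I used a temporary table rather than a materialized view, but I don't think that makes a real difference.

Finding the maximum ID for each group is not too painful: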

CREATE TABLE assertion (
    pk integer NOT NULL,
    a bigint NOT NULL,
    b bigint NOT NULL,
    c bigint NOT NULL,
    d integer NOT NULL,
    current boolean DEFAULT false NOT NULL
);

CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true);
CREATE INDEX assertion_current_idx1 ON assertion USING btree (current);
CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true);

SELECT COUNT(pk) FROM assertion;

-- 26916858
-- Time: 2912.403 ms (00:02.912)

CREATE TEMPORARY TABLE assertion_current AS
    (SELECT MAX(pk) as pk, a, b, c, d
      FROM assertion
      GROUP BY a, b, c, d);

-- Time: 72218.755 ms (01:12.219)

ANALYZE assertion_current;

CREATE INDEX ON assertion_current(pk);

-- Time: 22107.698 ms (00:22.108)

SELECT COUNT(pk) FROM assertion_current;

-- 26455092
-- Time: 15650.078 ms (00:15.650)

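Based on the count of assertion_current, we need to set the "current" flag to true for 98% of the rows.

The tricky part is how to update the assertion table from these values in a reasonable time. There is a unique constraint on a, b, c, d, current that must be maintained, so the updates to the current column need to be atomic to avoid violating the constraint.

I have a few options:

Option 1

Update only those rows whose current value changes. This has the benefit of updating the minimum number of rows, located via indexed fields:
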
BEGIN;
UPDATE assertion
   SET current = false
   WHERE assertion.current = true AND PK NOT IN (SELECT pk FROM assertion_current);
UPDATE assertion
   SET current = true
   WHERE assertion.current = false AND PK IN (SELECT pk FROM assertion_current);
COMMIT;

But both queries involve a sequential scan of assertion_current which (I think) has to be multiplied by a large number of rows.

Update on assertion  (cost=0.12..431141.55 rows=0 width=0)
   ->  Index Scan using assertion_current_idx on assertion  (cost=0.12..431141.55 rows=1 width=7)
         Index Cond: (current = true)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                 ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

and

 Update on assertion  (cost=595242.56..596693.92 rows=0 width=0)
   ->  Nested Loop  (cost=595242.56..596693.92 rows=17974196 width=13)
         ->  HashAggregate  (cost=595242.00..595244.00 rows=200 width=10)
               Group Key: assertion_current.pk
               ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=10)
         ->  Index Scan using assertion_pkey on assertion  (cost=0.56..8.58 rows=1 width=10)
               Index Cond: (pk = assertion_current.pk)
               Filter: (NOT current)

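This means that one of these queries (many rows currently true, or many currently false) will always take a long time.

Option 2

A single pass, but it has to touch every row unnecessarily.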

UPDATE assertion
   SET current =
     (CASE WHEN assertion.pk IN (select PK from assertion_current)
     THEN TRUE ELSE FALSE END);

But this again results in a sequential scan of assertion_current:

 Update on assertion  (cost=0.00..15498697380303.70 rows=0 width=0)
   ->  Seq Scan on assertion  (cost=0.00..15498697380303.70 rows=35948392 width=7)
         SubPlan 1
           ->  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                 ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

Option 3

Similar to Option 1, but using WHERE in the UPDATE itself:

BEGIN;
UPDATE assertion SET current = false WHERE current = true;
UPDATE assertion SET current = true FROM assertion_current
  WHERE assertion.pk = assertion_current.pk;
COMMIT;

But the second query involves two seq scans:

 Update on assertion  (cost=1654256.82..2721576.65 rows=0 width=0)
   ->  Hash Join  (cost=1654256.82..2721576.65 rows=29982560 width=13)
         Hash Cond: (assertion_current.pk = assertion.pk)
         ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=10)
         ->  Hash  (cost=1029371.92..1029371.92 rows=35948392 width=10)
               ->  Seq Scan on assertion  (cost=0.00..1029371.92 rows=35948392 width=10)

Option 4

Thanks @jjanes. This took more than 6 hours, so I cancelled it.

UPDATE assertion
   SET current = not current
   WHERE current <>
     (CASE WHEN assertion.pk IN (select PK from assertion_current)
     THEN TRUE ELSE FALSE END);

produces

 Update on assertion  (cost=0.00..11832617068493.14 rows=0 width=0)
   ->  Seq Scan on assertion  (cost=0.00..11832617068493.14 rows=27307890 width=7)
         Filter: (current <> CASE WHEN (SubPlan 1) THEN true ELSE false END)
         SubPlan 1
           ->  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                 ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

Option 5

Thanks @a_horse_with_no_name. This takes 24 minutes on my machine.

UPDATE assertion tg SET current = EXISTS (SELECT pk FROM assertion_current cr WHERE cr.pk = tg.pk);

gives

 Update on assertion tg  (cost=0.00..233024784.94 rows=0 width=0)
   ->  Seq Scan on assertion tg  (cost=0.00..233024784.94 rows=27445116 width=7)
         SubPlan 1
           ->  Index Only Scan using assertion_current_pk_idx on assertion_current cr  (cost=0.44..8.46 rows=1 width=0)
                 Index Cond: (pk = tg.pk)

Is there a better way to achieve this in a timely manner?

2 Answers

Erwin Brandstetter · Answer 1 · 2022-06-16T16:22:20+08:00

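...we need to set the "current" flag to true for 98% of the rows

So NOT current is going to be the rare case. So far you seem to have been approaching this back to front.

Your current indexes do not help with the given data distribution: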

CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true);  
CREATE INDEX assertion_current_idx1 ON assertion USING btree (current);  
CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true);

We need to keep the UNIQUE index to enforce your requirement, and it is also useful:

CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion (a, b, c, d) WHERE current;

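But simplify (current = true) to current. There is no point in evaluating a redundant expression, just use the boolean value.

~~assertion_current_idx~~ is absolutely pointless and will never be used. Yet it still has to be kept up to date. Drop it.

~~assertion_current_idx1~~ is almost as pointless. It would at least help to look up current = false. But this partial index is much cheaper for that, and it also supports the second query I suggest below: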

CREATE INDEX assertion_not_current_idx ON assertion (a, b, c, d, pk) WHERE NOT current;

Create this index after the "Initial query" below.
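
The index changes above can be sketched as the following sequence. This is only an outline under the assumption that nothing else depends on the two dropped indexes; the index names are the ones from the question.

```sql
-- 1. Drop the two indexes on (current) that do not help with this data distribution.
DROP INDEX assertion_current_idx;   -- partial index on (current) WHERE (current = true)
DROP INDEX assertion_current_idx1;  -- plain index on (current)

-- 2. Run the "Initial query" here, so that it does not have to maintain
--    the new partial index while it flips most rows to current = true.

-- 3. Only then create the partial index on the (now rare) non-current rows.
CREATE INDEX assertion_not_current_idx ON assertion (a, b, c, d, pk) WHERE NOT current;
```

A plain DROP INDEX locks the table while it runs; on a busy system DROP INDEX CONCURRENTLY may be preferable, with the restriction that it cannot run inside a transaction block.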

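Note that I skip the temporary table altogether in favor of CTEs. Less overhead.

Initial query

You revealed in a later comment that initially "all rows have current = false". We can use a simpler, faster query to initialize. Just update the rows for which no other row in the same group has a greater PK. No unique violations, no updates to "not current":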

UPDATE assertion a
SET    current = true
WHERE  NOT EXISTS (
   SELECT FROM assertion x
   WHERE (x.a, x.b, x.c, x.d)
       = (a.a, a.b, a.c, a.d)
   AND   x.pk > a.pk
   );
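
After the initial query, a sanity check along these lines might be worth running once. It is a sketch, not part of the original answer: it lists every group whose current row is missing, duplicated, or not the row with the greatest pk, so it should return no rows. It aggregates the whole table, so expect a runtime similar to the GROUP BY in the question.

```sql
SELECT a, b, c, d
FROM   assertion
GROUP  BY a, b, c, d
HAVING count(*) FILTER (WHERE current) <> 1               -- no current row, or more than one
    OR max(pk) <> max(pk) FILTER (WHERE current);         -- current row is not the greatest pk
```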

General query

Assuming that a group may have no current row.

WITH new_current AS (  -- only these groups require updates
   SELECT *
   FROM  (
      -- get row with greatest non-current pk per group
      SELECT a, b, c, d, MAX(pk) AS new_pk
          , (SELECT a2.pk  -- get current row of the same group
             FROM   assertion a2
             WHERE (a2.a, a2.b, a2.c, a2.d)
                 = (a1.a, a1.b, a1.c, a1.d)
             AND    a2.current
            ) AS old_pk
      FROM   assertion a1
      WHERE  NOT current
      GROUP  BY a, b, c, d
      ) a1
   WHERE (old_pk < new_pk   -- only if old pk is lower (!)
       OR old_pk IS NULL)   -- or does not exist
   )
, up1(dummy) AS (  -- update current to false FIRST
   UPDATE assertion a
   SET    current = false
   FROM   new_current n
   WHERE  (a.a, a.b, a.c, a.d, a.pk)
        = (n.a, n.b, n.c, n.d, n.old_pk)  -- only matches existing old pk
   RETURNING true
   )
UPDATE assertion a  -- then update the new current row
SET    current = true
FROM   new_current n
LEFT   JOIN (SELECT FROM up1 LIMIT 1) AS force_update_order ON true  -- !!!
WHERE  (a.a, a.b, a.c, a.d, a.pk)
     = (n.a, n.b, n.c, n.d, n.new_pk);

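This should dwarf anything else you have tried so far.

Subquery a1 gets the non-current row with the greatest pk in each group with a simple max() aggregate. That is optimal as long as there are few candidate rows (NOT current) per group. Otherwise, this step can be optimized further with an emulated index skip scan.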
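
For illustration, an emulated index skip scan over the non-current rows could look roughly like this. It is a sketch under assumptions: it relies on the partial index assertion_not_current_idx suggested above, the CTE name grp is made up for the example, and whether the planner actually uses the index at each step should be checked with EXPLAIN.

```sql
WITH RECURSIVE grp AS (
   (  -- first group in index order
   SELECT x.a, x.b, x.c, x.d
   FROM   assertion x
   WHERE  NOT x.current
   ORDER  BY x.a, x.b, x.c, x.d
   LIMIT  1
   )
   UNION ALL
   SELECT n.a, n.b, n.c, n.d
   FROM   grp g
   CROSS  JOIN LATERAL (  -- jump to the next group after the previous one
      SELECT x.a, x.b, x.c, x.d
      FROM   assertion x
      WHERE  NOT x.current
      AND    (x.a, x.b, x.c, x.d) > (g.a, g.b, g.c, g.d)
      ORDER  BY x.a, x.b, x.c, x.d
      LIMIT  1
      ) n
   )
SELECT g.a, g.b, g.c, g.d
     , (SELECT max(x.pk)  -- greatest non-current pk of this group
        FROM   assertion x
        WHERE  (x.a, x.b, x.c, x.d) = (g.a, g.b, g.c, g.d)
        AND    NOT x.current) AS new_pk
FROM   grp g;
```

The recursion ends when the LATERAL subquery finds no further group. This only pays off when there are many non-current rows per group; with few rows per group the plain GROUP BY is the better choice.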

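The tricky part is keeping the UNIQUE constraint happy. It does not allow two current rows per group at any given time. There is no order of execution among CTEs of the same query, as long as one does not reference the other. I put in such a dummy reference to force the order of the updates.

When there is no current row yet, CTE up1 returns no rows. So we use LEFT JOIN. And LIMIT 1 so that rows are never duplicated.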

jjanes · Answer 2 · 2022-06-16T10:49:32+08:00

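I don't think you should try to avoid every seq scan. But you do not want seq scans that are executed many times (such as in an unhashed subplan, or on the second child of a nested loop).

In your first plan, it does do a seq scan in an unhashed subplan, but it claims this will be executed only once, so if that is accurate it would not be too bad. But that seems to contradict your description that "we need to set the 'current' flag to true for 98% of the rows", so maybe the statistics are badly off.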
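
One way to check whether the statistics are off is to refresh them and look at what the planner believes about the flag. This is a sketch; the UPDATE under EXPLAIN is the first statement of option 1 from the question, and plain EXPLAIN does not execute it.

```sql
-- Refresh planner statistics for both tables.
ANALYZE assertion;
ANALYZE assertion_current;

-- What does the planner think the distribution of the flag looks like?
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'assertion'
AND    attname   = 'current';

-- Re-check the row estimates without running the update.
EXPLAIN
UPDATE assertion
SET    current = false
WHERE  current = true
AND    pk NOT IN (SELECT pk FROM assertion_current);
```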

You can increase your work_mem until the subplan switches to a hashed subplan, or you can rewrite the query from NOT IN to NOT EXISTS.
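
Applied to option 1, both suggestions might look like this. It is a sketch: the work_mem value is an arbitrary assumption to be sized to the machine, and the rewrite is only equivalent to the NOT IN version because assertion_current.pk contains no NULLs (it is MAX(pk) over a NOT NULL column).

```sql
BEGIN;

-- Assumed value; applies to this transaction only.
SET LOCAL work_mem = '1GB';

-- Unset the flag on rows that are current but are not listed in assertion_current.
UPDATE assertion a
SET    current = false
WHERE  a.current
AND    NOT EXISTS (SELECT FROM assertion_current c WHERE c.pk = a.pk);

-- Set the flag on rows that are listed but not current yet.
UPDATE assertion a
SET    current = true
WHERE  NOT a.current
AND    EXISTS (SELECT FROM assertion_current c WHERE c.pk = a.pk);

COMMIT;
```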

For option 2, you can just add a WHERE to eliminate the no-op updates:

UPDATE assertion
   SET current = not current
   WHERE current <>  
     (CASE WHEN assertion.pk IN (select PK from assertion_current)
     THEN TRUE ELSE FALSE END);

But again, using EXISTS rather than IN would probably be better.
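
With EXISTS, the same statement could be written as follows (a sketch; the aliases a and c are only for illustration):

```sql
UPDATE assertion a
SET    current = NOT a.current
WHERE  a.current <> EXISTS (SELECT FROM assertion_current c WHERE c.pk = a.pk);
```

Note that a single statement like this does not control the order in which rows are flipped. Because a unique index is checked row by row, a group that has its flag moved from one row to another can raise a unique violation if the new row is set before the old one is cleared.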
