AskOverflow.Dev


Morris de Oryx's questions

Morris de Oryx
Asked: 2020-01-29 02:58:09 +0800 CST

Custom error code classes and numbers for Postgres stored functions

  • 5

I'm writing a collection of stored functions in Postgres 11.5, and want to RAISE EXCEPTION when various preconditions aren't satisfied. For example, a null or empty string where a string is required, an integer argument that's out of range, and so on.

I can RAISE EXCEPTION and provide details, hints, and a message... but what range should I use for the error codes? I checked the docs but didn't find any guidance here:

https://www.postgresql.org/docs/11/errcodes-appendix.html

I searched on StackOverflow and found similar questions from years back... but no clear answer.

Is there some safe or conventional block or prefix to use for custom error codes returned from stored functions/procedures?
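For illustration, one common convention (an assumption on my part, not something the docs prescribe) is to pick a five-character alphanumeric SQLSTATE whose class doesn't appear in the errcodes appendix, since Postgres accepts any such code in `ERRCODE`. A hypothetical sketch, with a made-up function name and code:

```sql
CREATE OR REPLACE FUNCTION require_nonblank(val text)
  RETURNS text AS
$BODY$
BEGIN
    IF val IS NULL OR val = '' THEN
        RAISE EXCEPTION 'value must be a non-empty string'
            USING ERRCODE = 'AB001',   -- hypothetical custom SQLSTATE
                  HINT    = 'Pass a non-empty string.';
    END IF;
    RETURN val;
END;
$BODY$
LANGUAGE plpgsql;
```

Callers can then trap that specific code with `EXCEPTION WHEN SQLSTATE 'AB001' THEN ...`.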

postgresql plpgsql
  • 1 Answer
  • 1031 Views
Morris de Oryx
Asked: 2019-11-23 18:54:31 +0800 CST

Triggers on a partitioned table in Postgres 11.5

  • 1

I asked a question about a history table design for deletions in PG 11.5, and got the suggestion to partition the table. It's a great idea, since the table is likely to get big and the information is low-value. Meaning, I'll eventually want to purge data.

When I reimplemented the table with partitions, I found that PG (11 and 12) doesn't support BEFORE ROW triggers on the main partitioned table, only on the individual partitions. That leads to a ton of binding code. Is there a better way? In this case, all the trigger I've got does is subtract two timestamps and store the seconds. It's 11.5, so no generated columns.

I'm including the code even though it's long, since that's kind of the point.

/* Column order tweaked a bit with a Column Tetris search from
https://www.2ndquadrant.com/en/blog/on-rocks-and-sand/
Totally geeky, but this table could get big, so it's worth saving some room.
Note that we can also roll up data and discard a lot of the details in this
table, if we want to save room.
 */
BEGIN;

DROP TABLE IF EXISTS data.need_history CASCADE;

CREATE TABLE IF NOT EXISTS data.need_history (
    id uuid NOT NULL DEFAULT NULL,
    item_id uuid NOT NULL DEFAULT NULL,
    facility_id uuid NOT NULL DEFAULT NULL,
    hsys_id uuid NOT NULL DEFAULT NULL,
    perc_down double precision NOT NULL DEFAULT 0,
    created_dts timestamptz NOT NULL DEFAULT NULL,
    deleted_dts timestamptz NOT NULL DEFAULT NOW(),
    total_qty integer NOT NULL DEFAULT 0,
    sterile_qty integer NOT NULL DEFAULT 0,
    available_qty integer NOT NULL DEFAULT 0,
    still_need_qty integer NOT NULL DEFAULT 0,
    usage_ integer NOT NULL DEFAULT 0,
    duration_seconds int4 NOT NULL DEFAULT 0,
    need_for_case citext NOT NULL DEFAULT NULL,
    status citext NOT NULL DEFAULT NULL,

CONSTRAINT need_history_id_pkey
    PRIMARY KEY (id,deleted_dts)
) PARTITION BY RANGE (deleted_dts);

ALTER TABLE data.need_history OWNER TO user_change_structure;

/* It's a bit confusingly documented, but ranges are *inclusive* FROM and *exclusive* TO.
  So, to get January, you want 01-01 to 02-01, not 01-01 to 01-31. In practice,
  this makes the range descriptions a bit nicer, I'd say. */

CREATE TABLE ascendco.need_history_2019_11 PARTITION OF need_history 
    FOR VALUES FROM ('2019-11-01') TO ('2019-12-01');

CREATE TABLE ascendco.need_history_2019_12 PARTITION OF need_history 
    FOR VALUES FROM ('2019-12-01') TO ('2020-01-01');

CREATE TABLE ascendco.need_history_2020_01 PARTITION OF need_history 
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');

CREATE TABLE ascendco.need_history_2020_02 PARTITION OF need_history 
    FOR VALUES FROM ('2020-02-01') TO ('2020-03-01');

CREATE TABLE ascendco.need_history_2020_03 PARTITION OF need_history 
    FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');

CREATE TABLE ascendco.need_history_2020_04 PARTITION OF need_history 
    FOR VALUES FROM ('2020-04-01') TO ('2020-05-01');

CREATE TABLE ascendco.need_history_2020_05 PARTITION OF need_history 
    FOR VALUES FROM ('2020-05-01') TO ('2020-06-01');

CREATE TABLE ascendco.need_history_2020_06 PARTITION OF need_history 
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');

CREATE TABLE ascendco.need_history_2020_07 PARTITION OF need_history 
    FOR VALUES FROM ('2020-07-01') TO ('2020-08-01');

CREATE TABLE ascendco.need_history_2020_08 PARTITION OF need_history 
    FOR VALUES FROM ('2020-08-01') TO ('2020-09-01');

CREATE TABLE ascendco.need_history_2020_09 PARTITION OF need_history 
    FOR VALUES FROM ('2020-09-01') TO ('2020-10-01');

CREATE TABLE ascendco.need_history_2020_10 PARTITION OF need_history 
    FOR VALUES FROM ('2020-10-01') TO ('2020-11-01');

CREATE TABLE ascendco.need_history_2020_11 PARTITION OF need_history 
    FOR VALUES FROM ('2020-11-01') TO ('2020-12-01');

CREATE TABLE ascendco.need_history_2020_12 PARTITION OF need_history 
    FOR VALUES FROM ('2020-12-01') TO ('2021-01-01');

CREATE TABLE ascendco.need_history_default PARTITION OF need_history DEFAULT;       


COMMIT;

/* Define the trigger function to update the duration count.
  In PG 12 we'll be able to do this with a generated column...easier. */

CREATE OR REPLACE FUNCTION data.need_history_insert_trigger() 
  RETURNS trigger AS
$BODY$
BEGIN
/* Use DATE_TRUNC seconds to get just the whole seconds part of the timestamps. */
NEW.duration_seconds =
      EXTRACT(EPOCH FROM (
        DATE_TRUNC('second', NEW.deleted_dts) - 
        DATE_TRUNC('second', NEW.created_dts)
        ));
  RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;


/* 
Bind a trigger event to the function. 
Note: In PG 11 & 12, BEFORE ROW triggers must be applied to the individual partitions, not the partition table.
*/

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2019_11 ON data.need_history_2019_11;
CREATE TRIGGER trigger_need_history_before_insert_2019_11 
    BEFORE INSERT ON data.need_history_2019_11
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2019_12 ON data.need_history_2019_12;
CREATE TRIGGER trigger_need_history_before_insert_2019_12 
    BEFORE INSERT ON data.need_history_2019_12
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();   

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_01 ON data.need_history_2020_01;
CREATE TRIGGER trigger_need_history_before_insert_2020_01 
    BEFORE INSERT ON data.need_history_2020_01
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_02 ON data.need_history_2020_02;
CREATE TRIGGER trigger_need_history_before_insert_2020_02 
    BEFORE INSERT ON data.need_history_2020_02
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_03 ON data.need_history_2020_03;
CREATE TRIGGER trigger_need_history_before_insert_2020_03 
    BEFORE INSERT ON data.need_history_2020_03
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_04 ON data.need_history_2020_04;
CREATE TRIGGER trigger_need_history_before_insert_2020_04 
    BEFORE INSERT ON data.need_history_2020_04
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_05 ON data.need_history_2020_05;
CREATE TRIGGER trigger_need_history_before_insert_2020_05 
    BEFORE INSERT ON data.need_history_2020_05
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_06 ON data.need_history_2020_06;
CREATE TRIGGER trigger_need_history_before_insert_2020_06 
    BEFORE INSERT ON data.need_history_2020_06
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_07 ON data.need_history_2020_07;
CREATE TRIGGER trigger_need_history_before_insert_2020_07 
    BEFORE INSERT ON data.need_history_2020_07
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_08 ON data.need_history_2020_08;
CREATE TRIGGER trigger_need_history_before_insert_2020_08 
    BEFORE INSERT ON data.need_history_2020_08
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_09 ON data.need_history_2020_09;
CREATE TRIGGER trigger_need_history_before_insert_2020_09 
    BEFORE INSERT ON data.need_history_2020_09
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_10 ON data.need_history_2020_10;
CREATE TRIGGER trigger_need_history_before_insert_2020_10 
    BEFORE INSERT ON data.need_history_2020_10
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_11 ON data.need_history_2020_11;
CREATE TRIGGER trigger_need_history_before_insert_2020_11 
    BEFORE INSERT ON data.need_history_2020_11
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_2020_12 ON data.need_history_2020_12;
CREATE TRIGGER trigger_need_history_before_insert_2020_12 
    BEFORE INSERT ON data.need_history_2020_12
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();

DROP TRIGGER IF EXISTS trigger_need_history_before_insert_default ON data.need_history_default;
CREATE TRIGGER trigger_need_history_before_insert_default 
    BEFORE INSERT ON data.need_history_default
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();
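One way to trim the per-partition boilerplate above (a sketch I haven't run against this schema) is to generate the triggers dynamically by looping over the partitions listed in pg_inherits:

```sql
DO $$
DECLARE
    part regclass;
BEGIN
    -- Find every partition of the parent table.
    FOR part IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'data.need_history'::regclass
    LOOP
        -- Build and run one CREATE TRIGGER per partition.
        EXECUTE format(
            'CREATE TRIGGER %I BEFORE INSERT ON %s
               FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();',
            'trigger_' || replace(part::text, '.', '_') || '_before_insert',
            part);
    END LOOP;
END
$$;
```

The same loop, rerun after adding next year's partitions, picks up the new ones automatically.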


  [1]: https://dba.stackexchange.com/questions/253891/history-table-design-for-deletions-in-pg-11-5
postgresql trigger
  • 1 Answer
  • 1979 Views
Morris de Oryx
Asked: 2019-11-22 21:03:30 +0800 CST

History table design for deletions in PG 11.5

  • 2

I've got a question about history table design in Postgres.

The setup is that I have a table with a needs list. A facility recalculates its needed items every five minutes, and pushes that list to Postgres. The current "hot" list is then available to various client applications to pull. So, every five minutes, the rows related to a particular facility are deleted, and then repopulated with whatever is now hot. Imagine a screen on a warehouse wall where people look up to see urgent tasks, that kind of thing. This is more or less a queue/notice table rather than a real storage table.

What we're tracking in the needed-items list are specific parts, with IDs. It's valuable to us to collect data (or at least stats) over time. We may find that particular items show up on the list every day, while others appear only rarely. That can help guide purchasing choices and so on.

That's the background; I'm on Postgres 11.5, so no generated columns. Does the strategy described below look right, or can it be improved? The base table is called need, and the history table is called need_history.

need

-- Stores the data of interest.
-- Has created_dts assigned NOW() as part of the table setup.
-- Has a PER STATEMENT after trigger to grab a "transition table" of the deleted rows.
-- The statement trigger INSERTS INTO need_history to save the data.

need_history

-- Is pretty much a clone of need, with a few extra fields. Specifically, deleted_dts, assigned NOW() as a default when the data is inserted, and duration_seconds to store the ~number of seconds the record existed in the need table.
-- Since this is PG 11.5 and there are no generated columns, I need an EACH ROW trigger to calculate duration_seconds.

Shorter:

need pushes to need_history with a statement-level delete trigger.

need_history uses a row-level trigger to calculate duration_seconds, since I don't have the generated columns available in PG 12.x.

And, to address the obvious question: no, I don't have to store the derived duration_seconds value, since it can be generated on the fly, but in this case I want to denormalize to simplify a variety of queries, sorts, and summaries.

My brain is also saying "ask about fillfactor", and I don't know why.

Below is the initial setup code, in case the summary above isn't clear. I haven't pushed any data through this yet, so it may have flaws.

I'd be grateful for any suggestions or advice about how best to do this kind of thing in Postgres.

BEGIN;

DROP TABLE IF EXISTS data.need CASCADE;

CREATE TABLE IF NOT EXISTS data.need (
    id uuid NOT NULL DEFAULT NULL,
    item_id uuid NOT NULL DEFAULT NULL,
    facility_id uuid NOT NULL DEFAULT NULL,
    hsys_id uuid NOT NULL DEFAULT NULL,
    total_qty integer NOT NULL DEFAULT 0,
    available_qty integer NOT NULL DEFAULT 0,
    sterile_qty integer NOT NULL DEFAULT 0,
    still_need_qty integer NOT NULL DEFAULT 0,
    perc_down double precision NOT NULL DEFAULT '0',
    usage_ integer NOT NULL DEFAULT 0,
    need_for_case citext NOT NULL DEFAULT NULL,
    status citext NOT NULL DEFAULT NULL,
    created_dts timestamptz NOT NULL DEFAULT NOW(),

CONSTRAINT need_id_pkey
    PRIMARY KEY (id)
);


ALTER TABLE data.need OWNER TO user_change_structure;

COMMIT;

/* Define the trigger function to copy the deleted rows to the history table. */
CREATE FUNCTION data.need_delete_copy_to_history()  
  RETURNS trigger AS
$BODY$
BEGIN
        /* need.deleted_dts      is auto-assigned on INSERT over in need, and 
           need.duration_seconds is calculated in an INSERT trigger (PG 11.5, not PG 12, no generated columns). */

   INSERT INTO data.need_history 
            (id,
            item_id,
            facility_id,
            hsys_id,
            total_qty,
            available_qty,
            sterile_qty,
            still_need_qty,
            perc_down,
            usage_,
            need_for_case,
            status,
            created_dts)

     SELECT id,
            item_id,
            facility_id,
            hsys_id,
            total_qty,
            available_qty,
            sterile_qty,
            still_need_qty,
            perc_down,
            usage_,
            need_for_case,
            status,
            created_dts

       FROM deleted_rows;

    RETURN NULL; -- result is ignored since this is an AFTER trigger       
END;
$BODY$
LANGUAGE plpgsql;

 /* Bind a trigger event to the function. */
DROP TRIGGER IF EXISTS trigger_need_after_delete ON data.need;
CREATE TRIGGER trigger_need_after_delete 
    AFTER DELETE ON data.need
    REFERENCING OLD TABLE AS deleted_rows
    FOR EACH STATEMENT EXECUTE FUNCTION data.need_delete_copy_to_history();

/* Define the table. */
BEGIN;

DROP TABLE IF EXISTS data.need_history CASCADE;

CREATE TABLE IF NOT EXISTS data.need_history (
    id uuid NOT NULL DEFAULT NULL,
    item_id uuid NOT NULL DEFAULT NULL,
    facility_id uuid NOT NULL DEFAULT NULL,
    hsys_id uuid NOT NULL DEFAULT NULL,
    total_qty integer NOT NULL DEFAULT 0,
    available_qty integer NOT NULL DEFAULT 0,
    sterile_qty integer NOT NULL DEFAULT 0,
    still_need_qty integer NOT NULL DEFAULT 0,
    perc_down double precision NOT NULL DEFAULT '0',
    usage_ integer NOT NULL DEFAULT 0,
    need_for_case citext NOT NULL DEFAULT NULL,
    status citext NOT NULL DEFAULT NULL,
    created_dts timestamptz NOT NULL DEFAULT NULL,
    deleted_dts timestamptz NOT NULL DEFAULT NOW(),
    duration_seconds int4 NOT NULL DEFAULT 0,

CONSTRAINT need_history_id_pkey
    PRIMARY KEY (id)
);


ALTER TABLE data.need_history OWNER TO user_change_structure;

COMMIT;

/* Define the trigger function to update the duration count.
  In PG 12 we'll be able to do this with a generated column...easier. */

CREATE OR REPLACE FUNCTION data.need_history_insert_trigger() 
  RETURNS trigger AS
$BODY$
BEGIN
/* Use DATE_TRUNC seconds to get just the whole seconds part of the timestamps. */
NEW.duration_seconds =
      EXTRACT(EPOCH FROM (
        DATE_TRUNC('second', NEW.deleted_dts) - 
        DATE_TRUNC('second', NEW.created_dts)
        ));
  RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;


/* Bind a trigger event to the function. */
DROP TRIGGER IF EXISTS trigger_need_history_before_insert ON data.need_history;
CREATE TRIGGER trigger_need_history_before_insert 
    BEFORE INSERT ON data.need_history
    FOR EACH ROW EXECUTE FUNCTION data.need_history_insert_trigger();
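A quick smoke test of the wiring above might look like the following. This is a hypothetical sketch I haven't run: the values are placeholders, and gen_random_uuid() assumes the pgcrypto extension is installed on PG 11.

```sql
-- Insert a row into the live table (NOT NULL columns need values).
INSERT INTO data.need (id, item_id, facility_id, hsys_id, need_for_case, status)
VALUES (gen_random_uuid(), gen_random_uuid(), gen_random_uuid(),
        gen_random_uuid(), 'C-100', 'pending');

-- Later, the five-minute refresh clears the rows...
DELETE FROM data.need;

-- ...and the statement trigger should have copied them into the history,
-- with deleted_dts defaulted and duration_seconds computed by the row trigger:
SELECT id, created_dts, deleted_dts, duration_seconds
FROM data.need_history;
```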
postgresql trigger
  • 2 Answers
  • 207 Views
Morris de Oryx
Asked: 2019-10-17 17:44:38 +0800 CST

When does a stored tsvector column pay for itself?

  • 4

I've been experimenting with full-text search using tsvector indexes, and found that it's a common practice to store the generated vectors in a column of type tsvector. We're on Postgres 11.4, but I've seen this practice used as an example for PG 12 generated columns. (Simpler than using a trigger for the same purpose.)

My question is, what's the benefit? I've tried an expression GIN index on the tsvector of a text field, and a GIN index on a stored tsvector. With roughly 8M rows locally, I can't measure any meaningful speed difference. Given that storing the vector as a column plus an index takes more space, I'm curious whether there are clear cases where the extra cost is justified. For example, when you have more characters?

Note: We store the text in the database, so this isn't one of those setups where you index external pages/documents/etc. without ingesting the source text into the database.
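For reference, the two setups being compared are sketched below, with a hypothetical table `docs(body text)`. The practical difference is that with the expression index, every query has to repeat the exact indexed expression; the stored column costs space and (on PG 11) a trigger to stay current, but the query is simpler:

```sql
-- Expression index: no extra column, but queries must repeat the expression.
CREATE INDEX docs_body_fts_expr
    ON docs USING gin (to_tsvector('english', body));

SELECT * FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'example');

-- Stored column + index: more space, simpler queries.
ALTER TABLE docs ADD COLUMN body_tsv tsvector;
CREATE INDEX docs_body_fts_col ON docs USING gin (body_tsv);

SELECT * FROM docs WHERE body_tsv @@ to_tsquery('english', 'example');
```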

postgresql full-text-search
  • 2 Answers
  • 272 Views
Morris de Oryx
Asked: 2019-10-16 14:53:41 +0800 CST

Postgres full-text search on words, not lexemes

  • 4

I have a table with a text column that I'd like to search by word rather than by lexeme. More to the point, I'd like it indexed by word rather than by lexeme. We've got a lot of code-referencing error dumps that don't fit any natural-language dictionary.

Is there a way in Postgres to have FTS parse on word boundaries without resolving the words into lexemes? If I have to define a list of boundary characters and a catalog of skip words, that might be fine. Does this require building some kind of custom dictionary, or is something like that already available?

I keep thinking that I'm missing something obvious, and then failing to find it.

For now, a trigram index is okay, but I'd really prefer a unique-keywords parse of the text.

Postgres 11.4 on RDS.
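One piece that may be relevant here (an observation, not a complete answer): the built-in `simple` text search configuration splits on word boundaries but does no stemming, so tokens come back as the words themselves, only lowercased:

```sql
-- The 'simple' configuration keeps words unstemmed (but lowercases them).
SELECT to_tsvector('simple', 'ERROR in parse_headers: retry failed');
```

Whether that handles code-style identifiers the way you want depends on how the default parser tokenizes things like underscores and digits, which is worth checking against real dump data.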

postgresql full-text-search
  • 1 Answer
  • 2139 Views
Morris de Oryx
Asked: 2019-10-01 16:04:06 +0800 CST

Improving distinct-value estimates in Postgres

  • 3

Full counts in Postgres can be slow, for reasons that are well understood and much discussed. So, I've been using estimation techniques where possible. For rows, pg_stats seems fine; for views, extracting the estimate returned by EXPLAIN works.

https://www.cybertec-postgresql.com/en/count-made-fast/

But what about distinct values? Here, I've had much less luck. Sometimes the estimates are 100% correct, sometimes they're off by factors of 2 or 20. Truncated tables seem to have particularly stale estimates(?).

I just ran this test, and am providing some results:

analyze assembly_prods; -- Doing an ANALYZE to give pg_stats every help.

select 'count(*) distinct' as method,
        count(*) as count
from (select distinct assembly_id 
      from assembly_prods) d 
union all
select 'n_distinct from pg_stats' as method,
        n_distinct as count
from pg_stats 
where tablename  = 'assembly_prods' and
      attname    = 'assembly_id';

Results:

method                      count
count(*) distinct           28088
n_distinct from pg_stats    13805

That's only off by a factor of 2, but my data seems far worse, to the point where I won't use the estimates. Is there anything else I can try? Is this something that's improved in PG 12?

Follow-up

I'd never tried SET STATISTICS before, because there are only so many hours in the day. Inspired by Laurenz's answer, I took a quick look. Here's a helpful comment from the documentation:

https://www.postgresql.org/docs/current/planner-stats.html

The amount of information stored in pg_statistic by ANALYZE, in particular the maximum number of entries in the most_common_vals and histogram_bounds arrays for each column, can be set on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target configuration variable. The default limit is presently 100 entries. Raising the limit might allow more accurate planner estimates to be made, particularly for columns with irregular data distributions, at the price of consuming more space in pg_statistic and slightly more time to compute the estimates. Conversely, a lower limit might be sufficient for columns with simple data distributions.

I regularly get tables with a few common values and many rare values. Or the other way around, so the right threshold will depend. For those who haven't used SET STATISTICS, it lets you set the sampling target as a number of entries. The default is 100, so 1000 should give higher fidelity. Here's what that looks like:

ALTER TABLE assembly_prods 
    ALTER COLUMN assembly_id
    SET STATISTICS 1000;

You can use SET STATISTICS on a table or an index. Here's an interesting piece about indexes:

https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/

Note that the current documentation does list SET STATISTICS for indexes.

So, I tried targets of 1, 10, 100, 1000, and 10,000, and got these results from a table with 467,767 rows and 28,088 distinct values:

Target   Estimate  Difference  Missing
     1   13,657    14,431      51%
    10   13,867    14,221      51%
   100   13,759    14,329      51%
 1,000   24,746     3,342      12%
10,000   28,088         0       0%

Obviously you can't draw general conclusions from one case, but SET STATISTICS looks quite useful, and I'll be glad to keep it in mind. I'm tempted to raise the target a bit across the board, since I suspect it would help in many situations in our system.
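When even a high target isn't enough, Postgres also lets you pin the distinct-count estimate directly on the column, bypassing sampling entirely. A sketch using the question's numbers (set-and-forget, so it needs manual updating if the real count drifts):

```sql
-- Override the sampled estimate with a fixed value...
ALTER TABLE assembly_prods
    ALTER COLUMN assembly_id SET (n_distinct = 28088);

-- ...which takes effect at the next ANALYZE.
ANALYZE assembly_prods;
```

A negative n_distinct value (between -1 and 0) instead records the distinct count as a fraction of the row count, which scales as the table grows.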

postgresql distinct
  • 1 Answer
  • 980 Views
Morris de Oryx
Asked: 2019-09-28 15:26:40 +0800 CST

Strategies for making Postgres support multiple versions of client code

  • 0

I'm about to embark on an overhaul of our code that pushes to Postgres, and want to solicit some feedback before proceeding.

The setup here is that we have a lot of deployed software, running against non-Postgres databases, that pushes rows into a central Postgres database at regular intervals. (RDS 11.4.) There is zero chance of updating all of the deployed copies of the code simultaneously. So we'll always have multiple active versions in the field... sometimes many different versions. That's fine in itself, but it does make baking naive INSERT statements into the client code unmanageable. Which is what I've done. It has only bitten us lightly so far, but could end up being very painful. Any and all of the following DDL changes on our central Postgres are likely to break pushes from older versions of the deployed software:

  • Deleting a field
  • Renaming a field
  • Retyping a field
  • Adding a NOT NULL DEFAULT NULL field

The obvious point is that I have to move the hard-coded/fixed references out of the client code and into something else. Most people probably have some kind of stack with an ORM etc. between their collector/field apps and Postgres. We don't. So, with a lot of help from people on SO, I've built the following solution, using "hsys" as a sample table.

  1. Create a custom type for each table version, such as hsys_v1.

    CREATE TYPE api.hsys_v1 AS (
        id uuid,
        name_ citext,
        marked_for_deletion boolean);
    
  2. For each table and version, write an INSERT handler function that takes an array of the custom type.

    CREATE OR REPLACE FUNCTION ascendco.insert_hsys_v1 (data_in api.hsys_v1[])
      RETURNS int
    AS $BODY$
    
    -- The CTE below is a roundabout way of returning an insertion count from a pure SQL function in Postgres.
    with inserted_rows as (
            INSERT INTO hsys (
                id,
                name_,
                marked_for_deletion)
    
            SELECT
                rows_in.id,
                rows_in.name_,
                rows_in.marked_for_deletion
    
            FROM unnest(data_in) as rows_in
    
            ON CONFLICT(id) DO UPDATE SET
                name_ = EXCLUDED.name_,
                marked_for_deletion = EXCLUDED.marked_for_deletion
    
            returning 1 as row_counter)
    
        select sum(row_counter)::integer from inserted_rows;
    
    $BODY$
    LANGUAGE sql;
    

    The idea here is that, when I change a table, I'll be able to create an hsys_v2 type and a matching insert_hsys_v2 (hsys_v2[]) function. Old clients can then keep pushing in the old format, so long as I rewrite insert_hsys_v1 to map/convert/coerce the incoming data into the new table format.
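A call from a client would then look something like this (the UUIDs and names are placeholder values for illustration):

```sql
SELECT ascendco.insert_hsys_v1(
    ARRAY[
        ROW('00000000-0000-0000-0000-000000000001', 'Main sterilizer', false)::api.hsys_v1,
        ROW('00000000-0000-0000-0000-000000000002', 'Backup unit',     true)::api.hsys_v1
    ]);
```

The function returns the inserted-or-updated row count, which clients can use to confirm the push.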

A few weeks ago I wrote a GUI to grab my table definitions, and put a code generator on top of it. Then I stopped. I realized that I'd like to know whether I'm missing something I ought to be considering, and would be glad for someone to poke a hole in this strategy. It's not that I'm too lazy to reach out to local programmers... there aren't any. (I'm in rural Australia.)

If there's nothing wrong with the strategy, I'll go ahead with the overhaul. As a bonus, the work so far makes it easy to add code builders for custom casts and views. Not sure those will be useful, but I'm generating them at the same time.

postgresql insert
  • 1 Answer
  • 88 Views
Morris de Oryx
Asked: 2019-09-18 22:42:47 +0800 CST

Why are Postgres 11 hash indexes so large?

  • 5

Postgres 11.4 on RDS, and 11.5 at home.

I'm taking a closer look at hash indexes today because I ran into a problem with a citext index being ignored, and I find that I don't understand why hash indexes are as large as they are. It's taking roughly 50 bytes/row when I'd have expected 10 bytes + some overhead.

I've got a sample database with a table named record_changes_log_detail that has 7,733,552 records, so roughly 8M. There's a citext field named old_value in that table that's the source of the hash index:

CREATE INDEX record_changes_log_detail_old_value_ix_hash
    ON record_changes_log_detail
    USING hash (old_value);

Here's a check on the index size:

select
'record_changes_log_detail_old_value_ix_hash' as index_name,
pg_relation_size ('record_changes_log_detail_old_value_ix_hash') as bytes,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_hash')) as pretty

That returns 379,322,368 bytes, roughly 362MB. I've gone digging in the source code, and this fine piece explains more.

It sounds like a hash index entry for a row is the TID paired with the hash key itself, plus some kind of index counter within the page. That's two 4-byte integers and, I'm guessing, a 1- or 2-byte integer. As a naive calculation, 10 bytes * 7,733,552 = 77,335,520. The actual index is about 5x that. Granted, you need room for the index structure itself, but that shouldn't push the rough per-row cost from ~10 bytes to ~50 bytes, should it?

Here are the details of the index, read with the pageinspect extension and then pivoted by hand for legibility.

select * 
from hash_metapage_info(get_raw_page('record_changes_log_detail_old_value_ix_hash',0));


magic   105121344
version 4
ntuples 7733552
ffactor 307
bsize   8152
bmsize  4096
bmshift 15
maxbucket   28671
highmask    32767
lowmask 16383
ovflpoint   32
firstfree   17631
nmaps   1
procid  17269
spares  {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,17631,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
mapp    {28673,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}

select *
from hash_page_stats(get_raw_page('record_changes_log_detail_old_value_ix_hash',1));

live_items  2
dead_items  0
page_size   8192
free_size   8108
hasho_prevblkno 28671
hasho_nextblkno 4294967295
hasho_bucket    0
hasho_flag  2
hasho_page_id   65408
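The per-row figure quoted above can be reproduced from the two numbers in the question:

```sql
-- 379,322,368 index bytes over 7,733,552 rows:
SELECT round(379322368::numeric / 7733552, 1) AS bytes_per_row;  -- ≈ 49.0
```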
postgresql index
  • 1 Answer
  • 302 Views
Morris de Oryx
Asked: 2019-09-18 18:15:50 +0800 CST

Expression index on a citext column ignored, why?

  • 4

Running on RDS with around 32M rows.

PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

Also testing locally on macOS with around 8M rows.

PostgreSQL 11.5 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit

I have a citext column named old_value. I've already asked about this, but posted many of my discovery steps along the way. This is a simplified version that I hope gets to the point.

Background

I have a field-change log table named record_changes_log_detail that includes a citext field named old_value, currently at 32M rows and growing.

The data is heavily skewed. Most values are under a dozen characters; some are over 5,000.

Postgres chokes on the large values with an error about B-tree entries being limited to 2172 characters. So I believe that, for a B-tree, I need to substring the source values.

My users are mostly interested in = searches, starts-with searches, and sometimes contains-this-substring searches. So =, string%, and %string%.

Goal

Create an index that supports those searches and that the planner will use.

Tried and failed

A straight B-tree can't be built in some cases because the values are too long.

An expression B-tree like this builds, but isn't used:

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON  record_changes_log_detail 
    USING btree (substring(old_value,1,1024));

Adding text_pattern_ops doesn't help:

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON  record_changes_log_detail 
    USING btree (substring(old_value,1,1024) text_pattern_ops);
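One thing worth checking with any expression index (a general planner rule, not specific to citext): it is only considered when the query repeats the indexed expression verbatim. So an index on `substring(old_value,1,1024)` can only match a query shaped like:

```sql
-- The WHERE clause must use the exact expression the index was built on.
SELECT *
FROM record_changes_log_detail
WHERE substring(old_value, 1, 1024) = 'some value';
```

A plain `WHERE old_value = 'some value'` can't use that index, which may account for some of the "builds, but isn't used" behavior.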

Tried and partly worked

A hash index works, but only for equality. (Like it says on the tin.)

This is the closest I've come to success:

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail 
    USING btree (old_value citext_pattern_ops);

This works for equality, but not for LIKE. The PG 11 release notes say that it should work for LIKE:

https://www.postgresql.org/docs/11/release-11.html

By "works", I mean "uses the index".

I haven't managed to get substringing to work with this approach.

What do people do with citext fields in this situation?

postgresql index
  • 3 Answers
  • 544 Views

© 2023 AskOverflow.DEV All Rights Reserved