Questions asked by James Hay - dba

James Hay

Asked: 2024-01-16 19:39:32 +0800 CST

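Does violating a unique key constraint in Postgres incur a performance penalty?

In my API, a user may send a request that tries to create a new row when a row with that unique key already exists.

Currently I catch the unique key error and return a message saying that X already exists. However, would it be more efficient to look up the row first (on the same connection) and run the INSERT statement only if the row does not exist?

My gut tells me that reading the error from Postgres should be more efficient, but I want to make sure I'm doing things the idiomatic way.

The Postgres version is 12.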
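
Besides the two approaches described above (catching the error, or looking the row up first), Postgres 9.5 and later also let the INSERT itself skip a conflicting row with ON CONFLICT DO NOTHING. The following is only a minimal sketch: the table `items` and its unique column `name` are hypothetical names chosen for illustration and do not come from the question.

```sql
-- Hypothetical table for illustration; not part of the original question.
CREATE TABLE items (
  id   serial PRIMARY KEY,
  name text NOT NULL UNIQUE
);

-- Inserts the row if no row with this name exists; otherwise does nothing
-- and raises no error. RETURNING yields one row when the insert happened
-- and zero rows when it was skipped because of a conflict.
INSERT INTO items (name)
VALUES ('X')
ON CONFLICT (name) DO NOTHING
RETURNING id;
```

If the statement returns no row, the API can report that X already exists without a separate lookup and without handling an exception.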

James Hay

Asked: 2016-01-04 17:40:08 +0800 CST

Improving sort performance in a GROUP BY clause

I have two tables in Postgres 9.4.1, events and event_refs, with the following schemas:

events table

CREATE TABLE events (
  id serial NOT NULL PRIMARY KEY,
  event_type text NOT NULL,
  event_path jsonb,
  event_data jsonb,
  created_at timestamp with time zone NOT NULL
);

-- Index on type and created time

CREATE INDEX events_event_type_created_at_idx
  ON events (event_type, created_at);

event_refs table

CREATE TABLE event_refs (
  event_id integer NOT NULL,
  reference_key text NOT NULL,
  reference_value text NOT NULL,
  CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
  CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
      REFERENCES events (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

Both tables contain 2M rows. This is the query I'm running:

SELECT
  EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) as funnel_time
FROM
  events
INNER JOIN
  event_refs
ON
  event_refs.event_id = events.id AND
  event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1

I'm aware of the operator precedence in the WHERE clause. It is only supposed to filter events of type 'event2' by date.
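
Because AND binds more tightly than OR in SQL, the WHERE clause above is evaluated as if it carried the parentheses shown below. This is only a restatement of the existing filter with the grouping made explicit; the `count(*)` select list is a placeholder used to keep the sketch short.

```sql
-- Same filter as in the query above, with the implicit grouping written out.
-- 'event1' rows are not restricted by date; only 'event2' rows are.
SELECT count(*)
FROM events
WHERE
    events.event_type = 'event1' OR
    (
      events.event_type = 'event2' AND
      events.created_at >= '2015-07-01 00:00:00+08:00' AND
      events.created_at < '2015-12-01 00:00:00+08:00'
    );
```

The Recheck Cond line in the EXPLAIN ANALYZE output shows the same grouping.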

Here is the EXPLAIN ANALYZE output:

GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
   Group Key: event_refs.reference_value
   Filter: (count(*) > 1)
   Rows Removed by Filter: 41315
   ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
         Sort Key: event_refs.reference_value
         Sort Method: external merge  Disk: 3904kB
         ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
               Hash Cond: (event_refs.event_id = events.id)
               ->  Seq Scan on event_refs  (cost=0.00..37739.00 rows=2000000 width=10) (actual time=0.007..368.661 rows=2000000 loops=1)
                     Filter: (reference_key = 'project'::text)
               ->  Hash  (cost=21730.79..21730.79 rows=147878 width=12) (actual time=101.524..101.524 rows=153766 loops=1)
                     Buckets: 16384  Batches: 2  Memory Usage: 3315kB
                     ->  Bitmap Heap Scan on events  (cost=3761.23..21730.79 rows=147878 width=12) (actual time=23.139..75.814 rows=153766 loops=1)
                           Recheck Cond: ((event_type = 'event1'::text) OR ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone)))
                           Heap Blocks: exact=14911
                           ->  BitmapOr  (cost=3761.23..3761.23 rows=150328 width=0) (actual time=21.210..21.210 rows=0 loops=1)
                                 ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..2349.42 rows=102533 width=0) (actual time=12.234..12.234 rows=99864 loops=1)
                                       Index Cond: (event_type = 'event1'::text)
                                 ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..1337.87 rows=47795 width=0) (actual time=8.975..8.975 rows=53902 loops=1)
                                       Index Cond: ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone))
 Planning time: 0.493 ms
 Execution time: 4178.517 ms

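I know the filter on the event_refs table scan isn't filtering anything out. That is a result of my test data; other reference keys will be added later.

Everything up to and including the Hash Join seems reasonable given my test data, but I'm wondering whether it is possible to speed up the Sort that comes from the GROUP BY clause?

I tried adding a b-tree index on the reference_value column, but it doesn't seem to be used. If I'm not mistaken (and I very well could be, so please tell me), it is sorting 153766 rows. Wouldn't an index benefit this sort?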
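
The plan reports `Sort Method: external merge  Disk: 3904kB`, which means the sort spilled to disk. One thing that could be tried, as a sketch rather than a confirmed fix, is raising work_mem for the current session and re-running EXPLAIN ANALYZE to see whether the sort switches to an in-memory method. The value 32MB is an arbitrary assumption for testing, not a recommendation.

```sql
-- Session-level only; 32MB is an arbitrary test value (assumption).
SET work_mem = '32MB';

EXPLAIN ANALYZE
SELECT
  EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) AS funnel_time
FROM
  events
INNER JOIN
  event_refs
ON
  event_refs.event_id = events.id AND
  event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1;

-- Return to the configured default for this session.
RESET work_mem;
```

Comparing the Sort node's timing and Sort Method between the two runs would show how much of the cost comes from the disk spill.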

James Hay

Asked: 2016-01-02 00:48:17 +0800 CST

Index not used in SELECT query

I have a table of about 3.25M rows in Postgres 9.4.1, with the following format:

CREATE TABLE stats
(
    id serial NOT NULL,
    type character varying(255) NOT NULL,
    "references" jsonb NOT NULL,
    path jsonb,
    data jsonb,
    "createdAt" timestamp with time zone NOT NULL,
    CONSTRAINT stats_pkey PRIMARY KEY (id)
)
WITH (
    OIDS=FALSE
);

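type is a simple string of no more than 50 characters.

The references column is an object containing a list of key-value pairs. It can be basically any simple key-value list, only 1 level deep, and the values are always strings. It could be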

{
    "fruit": "plum",
    "car": "toyota"
}

or it could be

{
    "project": "2532"
}

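The createdAt timestamp is not always generated by the database (but it is generated by default if no value is supplied).

I'm currently using the table with test data only. In this data every row has a project key as a reference, so there are 3.25M rows with a project key. There are exactly 400,000 distinct values for the project reference. The type field has only 5 distinct values; in production this would probably be no more than a few hundred.

So I'm trying to index the table to perform the following query quickly: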

SELECT
  EXTRACT(EPOCH FROM (MAX("createdAt") - MIN("createdAt"))) 
FROM
  stats
WHERE
  stats."references"::jsonb ? 'project' AND
  (
    stats."type" = 'event1' OR
    (
      stats."type" = 'event2' AND
      stats."createdAt" > '2015-11-02T00:00:00+08:00' AND
      stats."createdAt" < '2015-12-03T23:59:59+08:00'
    )
  )
GROUP BY stats."references"::jsonb->> 'project'

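The query returns the time distance between two events, based on two stats rows that share the same reference, in this case project. There is only ever 1 row for each type and selected reference value, but there may also be no row, in which case the returned result is 0 (this is averaged later in a different part of a larger query).

I have created an index on the createdAt, type and references columns, but the query execution plan appears to be doing a full scan instead.

The index: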

CREATE INDEX "stats_createdAt_references_type_idx"
    ON stats
    USING btree
    ("createdAt", "references", type COLLATE pg_catalog."default");

The execution plan:

 HashAggregate  (cost=111188.31..111188.33 rows=1 width=38) 
                (actual time=714.499..714.499 rows=0 loops=1)
   Group Key: ("references" ->> 'project'::text)
      ->  Seq Scan on stats  (cost=0.00..111188.30 rows=1 width=38) 
                             (actual time=714.498..714.498 rows=0 loops=1)
          Filter: (
              (("references" ? 'project'::text) 
               AND ((type)::text = 'event1'::text)) OR 
              (((type)::text = 'event2'::text) 
               AND ("createdAt" > '2015-11-02 05:00:00+13'::timestamp with time zone) 
               AND ("createdAt" < '2015-12-04 04:59:59+13'::timestamp with time zone)))

Rows Removed by Filter: 3258680
Planning time: 0.163 ms
Execution time: 714.534 ms

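I really don't know much about indexes and query execution plans, so if anyone could point me in the right direction, that would be great.

Edit

As Erwin pointed out, it appears that even if I did have the right index, a table scan would still occur, because the portion of the table returned by the query is so large. Does this mean that, for this set of data, this is the fastest query time I can get? I assume that if I added another 60M unrelated rows without a project reference, it might use an index (if I had the right one), but I don't see how the query could be made faster by adding more data. Maybe I'm missing something.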
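
One way to check whether the planner prefers the sequential scan on cost grounds, rather than because no usable index exists, is to discourage sequential scans for the current session and compare the resulting plan and timing. This is a diagnostic sketch only: enable_seqscan is a standard planner setting, but turning it off is meant for experiments, not for production use.

```sql
-- Session-level diagnostic: make sequential scans look very expensive
-- so the planner picks an index-based plan if one is possible.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT
  EXTRACT(EPOCH FROM (MAX("createdAt") - MIN("createdAt")))
FROM
  stats
WHERE
  stats."references"::jsonb ? 'project' AND
  (
    stats."type" = 'event1' OR
    (
      stats."type" = 'event2' AND
      stats."createdAt" > '2015-11-02T00:00:00+08:00' AND
      stats."createdAt" < '2015-12-03T23:59:59+08:00'
    )
  )
GROUP BY stats."references"::jsonb->> 'project';

-- Restore normal planner behavior for this session.
RESET enable_seqscan;
```

If the index-based plan turns out slower than the 714 ms sequential scan, that would support the view that the table scan is the cheaper choice for this data set.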
