I have a table with 2,395,015 rows in which a TEXT column holds one of three values and is never NULL. I am having intermittent query performance problems when counting the rows whose value matches the large majority (>99%) of the rows, and I want to fix this. These queries must return exact counts, so I cannot use approximate counts.
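For clarity, the count I need is the plain exact count below; the second query is shown only for contrast, since an estimate taken from the planner statistics would be fast but is only approximate and therefore not acceptable here:

-- The exact count I need:
SELECT COUNT(*) FROM metadata WHERE status = 'PROCESSED';

-- A fast but approximate alternative I cannot use (estimate from planner stats):
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'public.metadata'::regclass;

The table definition is: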
corpus=# \d metadata
Table "public.metadata"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+----------------
id | text | | not null |
priority | integer | | not null | 10
media_type | text | | not null |
modified | timestamp without time zone | | not null | now()
processed | timestamp without time zone | | |
status | text | | not null | 'QUEUED'::text
note | text | | |
content | text | | |
resolved | text | | |
response_time | integer | | |
luid | integer | | not null |
jamo_date | timestamp without time zone | | |
audit_path | text | | |
Indexes:
"metadata_pkey" PRIMARY KEY, btree (id)
"metadata_id_idx" btree (id)
"metadata_luid_idx" btree (luid)
"metadata_modified_idx" btree (modified DESC)
"metadata_processed_idx" btree (processed DESC)
"metadata_status_idx" btree (status)
Check constraints:
"media_type_ck" CHECK (media_type = ANY (ARRAY['text/json'::text, 'text/yaml'::text]))
"status_ck" CHECK (status = ANY (ARRAY['QUEUED'::text, 'PROCESSED'::text, 'ERROR'::text]))
Foreign-key constraints:
"metadata_luid_fkey" FOREIGN KEY (luid) REFERENCES concept(luid) ON DELETE CASCADE
corpus=#
I have some simple queries that count the rows matching one of the three status codes (QUEUED, PROCESSED, ERROR). QUEUED matches 0 rows, ERROR matches 9,794 rows, and PROCESSED matches 2,385,221 rows. When I run the same query for each status code, I usually get results back right away:
corpus=# EXPLAIN ANALYZE VERBOSE SELECT COUNT(*) FROM metadata WHERE status='QUEUED';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1947.17..1947.18 rows=1 width=8) (actual time=2.935..2.936 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using metadata_status_idx on public.metadata (cost=0.43..1915.97 rows=12480 width=0) (actual time=2.932..2.933 rows=0 loops=1)
Output: status
Index Cond: (metadata.status = 'QUEUED'::text)
Heap Fetches: 0
Planning Time: 0.734 ms
Execution Time: 2.988 ms
(8 rows)
corpus=# EXPLAIN ANALYZE VERBOSE SELECT COUNT(*) FROM metadata WHERE status='ERROR';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1184.19..1184.20 rows=1 width=8) (actual time=1484.763..1484.764 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using metadata_status_idx on public.metadata (cost=0.43..1165.26 rows=7569 width=0) (actual time=4.235..1484.029 rows=9794 loops=1)
Output: status
Index Cond: (metadata.status = 'ERROR'::text)
Heap Fetches: 9584
Planning Time: 0.072 ms
Execution Time: 1484.786 ms
(8 rows)
corpus=#
corpus=# EXPLAIN ANALYZE VERBOSE SELECT COUNT(*) FROM metadata WHERE status='PROCESSED';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=261398.83..261398.84 rows=1 width=8) (actual time=741.319..749.026 rows=1 loops=1)
Output: count(*)
-> Gather (cost=261398.62..261398.83 rows=2 width=8) (actual time=741.309..749.020 rows=3 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=260398.62..260398.63 rows=1 width=8) (actual time=735.099..735.100 rows=1 loops=3)
Output: PARTIAL count(*)
Worker 0: actual time=730.871..730.872 rows=1 loops=1
Worker 1: actual time=733.435..733.436 rows=1 loops=1
-> Parallel Index Only Scan using metadata_status_idx on public.metadata (cost=0.43..257903.37 rows=998100 width=0) (actual time=0.065..700.529 rows=795074 loops=3)
Output: status
Index Cond: (metadata.status = 'PROCESSED'::text)
Heap Fetches: 747048
Worker 0: actual time=0.060..702.980 rows=670975 loops=1
Worker 1: actual time=0.076..686.946 rows=1010099 loops=1
Planning Time: 0.085 ms
Execution Time: 749.068 ms
(18 rows)
corpus=#
But sometimes counting the PROCESSED rows takes far too long (occasionally several minutes):
corpus=# EXPLAIN ANALYZE VERBOSE SELECT COUNT(*) FROM metadata WHERE status='PROCESSED';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=261398.83..261398.84 rows=1 width=8) (actual time=30019.273..30019.336 rows=1 loops=1)
Output: count(*)
-> Gather (cost=261398.62..261398.83 rows=2 width=8) (actual time=30019.261..30019.326 rows=3 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=260398.62..260398.63 rows=1 width=8) (actual time=29967.734..29967.735 rows=1 loops=3)
Output: PARTIAL count(*)
Worker 0: actual time=29939.915..29939.916 rows=1 loops=1
Worker 1: actual time=29944.395..29944.395 rows=1 loops=1
-> Parallel Index Only Scan using metadata_status_idx on public.metadata (cost=0.43..257903.37 rows=998100 width=0) (actual time=75.385..29931.795 rows=795074 loops=3)
Output: status
Index Cond: (metadata.status = 'PROCESSED'::text)
Heap Fetches: 747151
Worker 0: actual time=128.857..29899.156 rows=916461 loops=1
Worker 1: actual time=28.609..29905.708 rows=854439 loops=1
Planning Time: 421.203 ms
Execution Time: 30019.440 ms
(18 rows)
corpus=#
While the above query is running slowly, I can query the same table for either of the other two codes and those queries return in under 1 second. I looked for table locks (there were none). This happens even when no other queries or inserts against the table are running.
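For completeness, this is roughly how I checked for locks and concurrent activity (a sketch; the exact filters are my own choice):

-- Any locks held on the table?
SELECT l.locktype, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'public.metadata'::regclass;

-- Any other sessions doing anything at the time?
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE datname = current_database()
  AND pid <> pg_backend_pid();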
- What are the possible causes of these intermittent slow queries?
- What additional debugging can I try to get more information about these slow queries?
- Are there any relevant server settings?
- Is there a more efficient way to index/encode this column (for example, should I use CHAR(1), or even SMALLINT)? If so, what index should I use for that column? (A sketch of the kind of re-encoding I mean follows this list.)
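To make that last question concrete, this is the kind of re-encoding I am considering. It is only a sketch; the status_code column, the numeric values, and the index name are hypothetical and I have not tried this:

-- Hypothetical: store the status as a smallint code instead of TEXT
ALTER TABLE metadata ADD COLUMN status_code smallint;
UPDATE metadata
SET status_code = CASE status
                      WHEN 'QUEUED'    THEN 0
                      WHEN 'PROCESSED' THEN 1
                      WHEN 'ERROR'     THEN 2
                  END;
ALTER TABLE metadata
    ALTER COLUMN status_code SET NOT NULL,
    ADD CONSTRAINT status_code_value_ck CHECK (status_code IN (0, 1, 2));
CREATE INDEX metadata_status_code_idx ON metadata (status_code);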
If I use CHAR(1), is there any difference between the following two constraints?
ALTER TABLE jgi_metadata ADD CONSTRAINT status_code_ck CHECK (status_code = ANY (ARRAY['Q'::char(1), 'P'::char(1), 'E'::char(1)]));
ALTER TABLE jgi_metadata ADD CONSTRAINT status_code_ck CHECK (status_code IN ('Q', 'P', 'E'));
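My understanding (which I would like confirmed) is that PostgreSQL rewrites an IN list of constants into the = ANY(ARRAY[...]) form, so the two constraints should end up stored identically. A sketch of how I would check the stored definition:

-- Compare the stored definition of the constraint after creating it either way:
SELECT conname, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'jgi_metadata'::regclass
  AND conname = 'status_code_ck';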
Could a partial index be used on this column, even though it is never NULL? Should I split PROCESSED out into a boolean column, use the status column only for the other codes, make it nullable, and put a partial index on it? (A rough sketch of both ideas follows.)
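Roughly what I mean by both ideas, as a sketch only (the new column and index names are placeholders):

-- Partial index on the existing column, covering only the rare codes:
CREATE INDEX metadata_status_rare_idx ON metadata (status)
    WHERE status <> 'PROCESSED';

-- Or: a boolean flag for the common case plus a nullable column for the rest:
ALTER TABLE metadata ADD COLUMN is_processed boolean NOT NULL DEFAULT false;
ALTER TABLE metadata ADD COLUMN rare_status text
    CHECK (rare_status IN ('QUEUED', 'ERROR'));
CREATE INDEX metadata_rare_status_idx ON metadata (rare_status)
    WHERE rare_status IS NOT NULL;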
This is PostgreSQL 11 with default settings, running on Linux.
Other things I have already tried:
- Increasing work_mem to 100MB via postgresql.conf (roughly as sketched below). No change in performance.
- Creating a partial index on the status column.
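For reference, the work_mem change was applied roughly like this (ALTER SYSTEM shown as an equivalent to editing postgresql.conf by hand):

-- In postgresql.conf:  work_mem = '100MB'
-- or equivalently, from a superuser session followed by a reload:
ALTER SYSTEM SET work_mem = '100MB';
SELECT pg_reload_conf();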
Update: I have discovered that this performance problem has nothing to do with the status column, but rather with the size of the table itself, as this 2-minute query shows:
corpus=# EXPLAIN ANALYZE VERBOSE SELECT COUNT(*) FROM metadata;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=196398.52..196398.53 rows=1 width=8) (actual time=118527.897..118554.762 rows=1 loops=1)
Output: count(*)
-> Gather (cost=196398.30..196398.51 rows=2 width=8) (actual time=118522.165..118554.756 rows=3 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=195398.30..195398.31 rows=1 width=8) (actual time=118491.043..118491.044 rows=1 loops=3)
Output: PARTIAL count(*)
Worker 0: actual time=118475.143..118475.144 rows=1 loops=1
Worker 1: actual time=118476.110..118476.111 rows=1 loops=1
-> Parallel Index Only Scan using metadata_status_idx on public.metadata (cost=0.43..192876.13 rows=1008870 width=0) (actual time=71.797..118449.265 rows=809820 loops=3)
Output: status
Heap Fetches: 552630
Worker 0: actual time=75.877..118434.476 rows=761049 loops=1
Worker 1: actual time=104.872..118436.647 rows=745770 loops=1
Planning Time: 592.040 ms
Execution Time: 118554.839 ms
(17 rows)
corpus=#
This now looks very similar to other existing questions, so I am trying the mitigation strategy from this answer:
VACUUM ANALYZE metadata;
The first COUNT(*) after that took 5 seconds, and subsequent counts took 190 ms.
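To understand why the manual VACUUM helped, I have also been looking at the table's vacuum and dead-tuple statistics, along these lines:

-- How stale was the table before/after the manual VACUUM?
SELECT n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'metadata';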
Other ideas:
- Would it help to split the status column out into its own table, referenced by a foreign key from the metadata table? (Roughly what I mean is sketched below.)
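Roughly what I have in mind, as a sketch only (the status_codes table and column names are made up):

-- Hypothetical lookup table for the status codes
CREATE TABLE status_codes (
    status_id smallint PRIMARY KEY,
    status    text NOT NULL UNIQUE
);
INSERT INTO status_codes VALUES (0, 'QUEUED'), (1, 'PROCESSED'), (2, 'ERROR');

ALTER TABLE metadata ADD COLUMN status_id smallint REFERENCES status_codes (status_id);
CREATE INDEX metadata_status_id_idx ON metadata (status_id);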
Note: I am increasingly convinced that this question is a duplicate of several others here:
- Extremely slow count in PostgreSQL
- count(*) query is too slow even with an index-only scan
- Why are some count queries so slow?
- Optimizing the result of a select count in Postgresql
- https://stackoverflow.com/questions/58449716/postgres-why-does-select-count-take-so-long
- https://stackoverflow.com/questions/16916633/if-postgresql-count-is-always-slow-how-to-paginate-complex-queries
- https://stackoverflow.com/questions/7943233/fast-way-to-discover-the-row-count-of-a-table-in-postgresql/7945274#7945274
This answer may be the best solution to the problem:
As requested, here is the query plan analysis with buffers:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT COUNT(*) FROM metadata;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=80771.95..80771.96 rows=1 width=8) (actual time=26711.481..26716.494 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=293915 read=19595 dirtied=282 written=12
-> Gather (cost=80771.73..80771.94 rows=2 width=8) (actual time=26711.203..26716.488 rows=3 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=293915 read=19595 dirtied=282 written=12
-> Partial Aggregate (cost=79771.73..79771.74 rows=1 width=8) (actual time=26565.622..26565.623 rows=1 loops=3)
Output: PARTIAL count(*)
Buffers: shared hit=293915 read=19595 dirtied=282 written=12
Worker 0: actual time=26530.890..26530.891 rows=1 loops=1
Buffers: shared hit=105264 read=6760 dirtied=145 written=5
Worker 1: actual time=26530.942..26530.942 rows=1 loops=1
Buffers: shared hit=84675 read=7529 dirtied=46 written=2
-> Parallel Index Only Scan using metadata_status_idx on public.metadata (cost=0.43..77241.05 rows=1012275 width=0) (actual time=42.254..26529.232 rows=809820 loops=3)
Output: status
Heap Fetches: 17185
Buffers: shared hit=293915 read=19595 dirtied=282 written=12
Worker 0: actual time=59.291..26494.376 rows=815113 loops=1
Buffers: shared hit=105264 read=6760 dirtied=145 written=5
Worker 1: actual time=31.165..26484.729 rows=1036972 loops=1
Buffers: shared hit=84675 read=7529 dirtied=46 written=2
Planning Time: 98.400 ms
Execution Time: 26716.529 ms
(25 rows)