Question

Parker

Asked: 2017-12-09 06:30:05 +0800 CST

How can I count DISTINCT values outside the GROUP BY in PostgreSQL without using a subquery?

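I'm refactoring some business logic that we usually run in Java as a nightly process, and am now trying to move it to PostgreSQL as a set of materialized views. I've reduced the problem to its essentials, which leads to the three tables below (category_source ~10 rows, category ~100k rows, category_tags ~10M rows).

The logic is fairly simple: for each Tag, add up its rank values across every Category in a given Source, then divide each of those sums by the total number of Categories in that Source.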

--DROP TABLE category_tags;
--DROP TABLE category;
--DROP TABLE category_source;

CREATE TABLE category_source
(
  id integer NOT NULL,
  PRIMARY KEY (id)
);

CREATE TABLE category
(
  label text NOT NULL,
  source integer NOT NULL REFERENCES category_source(id),
  PRIMARY KEY (label)
);

CREATE TABLE category_tags
(
  category text NOT NULL REFERENCES category(label),
  tag text NOT NULL,
  rank real NOT NULL,
  PRIMARY KEY (category,tag)
);

-- Load sample data.
INSERT INTO category_source VALUES (1);
INSERT INTO category_source VALUES (2);

INSERT INTO category VALUES ('C3035240',1);
INSERT INTO category VALUES ('C3035245',1);
INSERT INTO category VALUES ('C3035250',2);

INSERT INTO category_tags VALUES ('C3035240','test',24.00);
INSERT INTO category_tags VALUES ('C3035240','sample',24.00);
INSERT INTO category_tags VALUES ('C3035240','method',20.00);
INSERT INTO category_tags VALUES ('C3035240','variety',18.00);
INSERT INTO category_tags VALUES ('C3035240','explanation',15.00);
INSERT INTO category_tags VALUES ('C3035245','test',20.00);
INSERT INTO category_tags VALUES ('C3035245','extra',21.00);
INSERT INTO category_tags VALUES ('C3035245','method',20.00);
INSERT INTO category_tags VALUES ('C3035245','sample',18.00);
INSERT INTO category_tags VALUES ('C3035245','question',15.00);
INSERT INTO category_tags VALUES ('C3035250','method',10.00);
INSERT INTO category_tags VALUES ('C3035250','explanation',8.00);
INSERT INTO category_tags VALUES ('C3035250','test',6.00);
INSERT INTO category_tags VALUES ('C3035250','question',5.00);
INSERT INTO category_tags VALUES ('C3035250','sample',2.00);
INSERT INTO category_tags VALUES ('C3035250','variety',4.00);

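Query 1 below gets most of the way there, but its source_category_count column contains the (variable) number of Categories that carry the Tag, whereas what I really want is the total number of Categories in the Source.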

SELECT category_source.id,category_tags.tag,
       SUM(category_tags.rank) AS tag_total,
       COUNT(category.label) AS source_category_count,
       SUM(category_tags.rank)/COUNT(category.label) AS source_tag_rank
 FROM
  category_source,category,category_tags
 WHERE
  category_source.id=category.source
  AND category.label=category_tags.category
 GROUP BY category_source.id,category_tags.tag ORDER BY category_source.id,tag_total DESC;

 id |     tag     | tag_total | source_category_count | source_tag_rank
----+-------------+-----------+-----------------------+-----------------
  1 | test        |        44 |                     2 |              22
  1 | sample      |        42 |                     2 |              21
  1 | method      |        40 |                     2 |              20
  1 | extra       |        21 |                     1 |              21
  1 | variety     |        18 |                     1 |              18
  1 | explanation |        15 |                     1 |              15
  1 | question    |        15 |                     1 |              15
  2 | method      |        10 |                     1 |              10
  2 | explanation |         8 |                     1 |               8
  2 | test        |         6 |                     1 |               6
  2 | question    |         5 |                     1 |               5
  2 | variety     |         4 |                     1 |               4
  2 | sample      |         2 |                     1 |               2
(13 rows)

Query 2 below produces the result I actually want:

SELECT q1.*,q2.source_category_count,q1.tag_total/q2.source_category_count AS tag_source_rank FROM
(
  SELECT category_source.id AS source,category_tags.tag,SUM(category_tags.rank) AS tag_total
   FROM category_source
   INNER JOIN category ON (category_source.id=category.source)
   INNER JOIN category_tags ON (category.label=category_tags.category)
   GROUP BY category_source.id,category_tags.tag
) q1,
(
  SELECT source,COUNT(category) AS source_category_count FROM category GROUP BY source
) q2
 WHERE q1.source=q2.source
 ORDER BY source,tag_source_rank DESC
;

     source |     tag     | tag_total | source_category_count | tag_source_rank
    --------+-------------+-----------+-----------------------+-----------------
          1 | test        |        44 |                     2 |              22
          1 | sample      |        42 |                     2 |              21
          1 | method      |        40 |                     2 |              20
          1 | extra       |        21 |                     2 |            10.5
          1 | variety     |        18 |                     2 |               9
          1 | explanation |        15 |                     2 |             7.5
          1 | question    |        15 |                     2 |             7.5
          2 | method      |        10 |                     1 |              10
          2 | explanation |         8 |                     1 |               8
          2 | test        |         6 |                     1 |               6
          2 | question    |         5 |                     1 |               5
          2 | variety     |         4 |                     1 |               4
          2 | sample      |         2 |                     1 |               2
    (13 rows)

Query 3 produces equivalent results using WITH x AS (...) SELECT ...:

WITH category_counts AS
(
  SELECT source,COUNT(category) AS source_category_count FROM category GROUP BY source
) 
SELECT category_counts.source,category_tags.tag,
       SUM(category_tags.rank) AS tag_total,
       COUNT(category.label) AS source_category_freq,
       category_counts.source_category_count,
       SUM(category_tags.rank)/category_counts.source_category_count AS source_tag_rank
 FROM category_counts
 INNER JOIN category ON (category_counts.source=category.source)
 INNER JOIN category_tags ON (category.label=category_tags.category)
 GROUP BY category_counts.source,category_counts.source_category_count,category_tags.tag ORDER BY category_counts.source,tag_total DESC;

 source |     tag     | tag_total | source_category_freq | source_category_count | source_tag_rank
--------+-------------+-----------+----------------------+-----------------------+-----------------
      1 | test        |        44 |                    2 |                     2 |              22
      1 | sample      |        42 |                    2 |                     2 |              21
      1 | method      |        40 |                    2 |                     2 |              20
      1 | extra       |        21 |                    1 |                     2 |            10.5
      1 | variety     |        18 |                    1 |                     2 |               9
      1 | explanation |        15 |                    1 |                     2 |             7.5
      1 | question    |        15 |                    1 |                     2 |             7.5
      2 | method      |        10 |                    1 |                     1 |              10
      2 | explanation |         8 |                    1 |                     1 |               8
      2 | test        |         6 |                    1 |                     1 |               6
      2 | question    |         5 |                    1 |                     1 |               5
      2 | variety     |         4 |                    1 |                     1 |               4
      2 | sample      |         2 |                    1 |                     1 |               2
(13 rows)

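Although I now have two working queries that both produce the result I'm looking for, I'm not happy about using two subqueries on tables of this size (I haven't loaded all my data or done any performance testing yet; for now I'm only working with this test case).

I feel the source_category_count value I'm looking for is hidden somewhere in Query 1, and I just don't know how to get at it.

Another option I'm looking into is COUNT() OVER (PARTITION BY category_source), but I don't yet have a working query for that approach.

Is there a simpler query (i.e. a modification of Query 1) that produces the same results as Query 2 or Query 3?

Update: added a second working query.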

4 Answers

Evan Carroll · Answer 1 · 2017-12-09T10:03:55+08:00

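"whereas what I really want is the total number of Categories in the Source."

You are grouping by category_source.id, category_tags.tag, which means you can never ask for the total "in the Source" without going through that group, and in your case the group includes the tag. Two subselects with different GROUP BYs is an acceptable approach, but there are other options that generate the data you want, such as GROUPING SETS; the result just looks different.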

SELECT
        c.label,
        tag,
        SUM(ct.rank) AS tag_total,
        COUNT(c.label) AS source_category_count
FROM category AS c
JOIN category_tags AS ct
        ON (ct.category=c.label)
GROUP BY GROUPING SETS ((c.label, ct.tag), (c.label))

;
  label   |     tag     | tag_total | source_category_count 
----------+-------------+-----------+-----------------------
 C3035240 | explanation |        15 |                     1
 C3035240 | method      |        20 |                     1
 C3035240 | sample      |        24 |                     1
 C3035240 | test        |        24 |                     1
 C3035240 | variety     |        18 |                     1
 C3035240 |             |       101 |                     5
 C3035245 | extra       |        21 |                     1
 C3035245 | method      |        20 |                     1
 C3035245 | question    |        15 |                     1
 C3035245 | sample      |        18 |                     1
 C3035245 | test        |        20 |                     1
 C3035245 |             |        94 |                     5
 C3035250 | explanation |         8 |                     1
 C3035250 | method      |        10 |                     1
 C3035250 | question    |         5 |                     1
 C3035250 | sample      |         2 |                     1
 C3035250 | test        |         6 |                     1
 C3035250 | variety     |         4 |                     1
 C3035250 |             |        35 |                     6
(19 rows)

Here you can see that we have two GROUP BYs in the same SELECT, and how SQL presents the result of such a query.
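The same idea can be moved up to the Source level. The following is only a sketch and not part of the original answer: it assumes PostgreSQL 9.5 or later (where GROUPING SETS is available) and reuses the question's tables. The per-source category count then arrives as an extra row with a NULL tag rather than as a column on every row, and it only counts categories that have at least one row in category_tags.

```sql
-- Sketch: one SELECT, two grouping levels.
-- Rows with a tag carry the per-(source, tag) totals;
-- rows where tag IS NULL carry the per-source totals.
SELECT
    c.source,
    ct.tag,
    SUM(ct.rank)            AS tag_total,
    COUNT(DISTINCT c.label) AS category_count  -- distinct categories in the group
FROM category AS c
JOIN category_tags AS ct
    ON ct.category = c.label
GROUP BY GROUPING SETS ((c.source, ct.tag), (c.source))
ORDER BY c.source, ct.tag NULLS FIRST;
```

Dividing each tag_total by the per-source count would still need a second step, since the two values sit on different rows; that is why the result "looks different" from Query 2.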

As a side note, I'd recommend never using SQL-89 joins. There is no reason to. Write your joins explicitly with [INNER] JOIN.

ypercubeᵀᴹ · Answer 2 · 2017-12-09T11:37:24+08:00

Just for fun^*, the query could be written without subqueries, if Postgres had implemented DISTINCT in window aggregates:

select distinct
    source,
    tag,
    sum(rank) over (partition by source, tag) as tag_total,
    count(*) over (partition by source, tag) as count,
    count(distinct category) over (partition by source) as count_category,
    sum(rank) over (partition by source, tag)
    / count(distinct category) over (partition by source) as avg_rank
from 
    category c 
    join category_tags ct
        on label = category
order by 
    source,
    tag_total desc ;

Tested at dbfiddle.uk (Oracle), where it works fine.

At dbfiddle.uk (Postgres), it gives the error:

ERROR:  DISTINCT is not implemented for window functions  
LINE 6:     count(distinct category) over (partition by source) as c...  
            ^

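^* Even if the syntax were available, I wouldn't recommend the approach above. Needing two different sets of aggregates in the same result requires two different OVER () expressions plus SELECT DISTINCT. All in all, probably a recipe for mediocre efficiency.

A query with 2 derived tables that are then joined would probably be more efficient.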
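For completeness, a commonly used workaround for the missing DISTINCT window aggregate is to add an ascending and a descending dense_rank() and subtract 1. The sketch below is not from the original answer; the window names by_tag, cat_asc and cat_desc are made up for illustration. It assumes ct.category is never NULL (it is declared NOT NULL in the question) and, like the Oracle version above, it only counts categories that appear in the join, i.e. those with at least one tag.

```sql
-- Sketch: emulate count(DISTINCT category) OVER (PARTITION BY source).
-- Within a partition, ascending dense rank + descending dense rank - 1
-- equals the number of distinct non-NULL values.
SELECT DISTINCT
    c.source,
    ct.tag,
    sum(ct.rank) OVER by_tag AS tag_total,
    dense_rank() OVER cat_asc + dense_rank() OVER cat_desc - 1 AS count_category,
    sum(ct.rank) OVER by_tag
    / (dense_rank() OVER cat_asc + dense_rank() OVER cat_desc - 1) AS avg_rank
FROM category AS c
JOIN category_tags AS ct
    ON ct.category = c.label
WINDOW
    by_tag   AS (PARTITION BY c.source, ct.tag),
    cat_asc  AS (PARTITION BY c.source ORDER BY ct.category),
    cat_desc AS (PARTITION BY c.source ORDER BY ct.category DESC)
ORDER BY c.source, tag_total DESC;
```

The efficiency caveat in the footnote applies here as well: this needs several sorts plus a SELECT DISTINCT over every joined row, so on a 10M-row table it is likely to be slower than joining two derived tables.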

indiri · Answer 3 · 2017-12-09T11:57:28+08:00

Best Answer

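Instead of joining directly to the category table, you can join to a subquery with a window function. That way you query the table only once and still get the final result.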

SELECT category_source.id AS source,
    category_tags.tag,
    SUM(category_tags.rank) AS tag_total,
    MAX(category_source_count) AS source_category_count,
    SUM(category_tags.rank)/MAX(category_source_count) AS source_tag_rank
FROM category_source
   INNER JOIN 
        (
        SELECT label, source, 
            COUNT(*) OVER (PARTITION BY source) category_source_count 
            FROM category
        ) AS category ON category_source.id=category.source
   INNER JOIN category_tags ON category.label=category_tags.category
GROUP BY category_source.id,category_tags.tag 
ORDER BY category_source.id,tag_total DESC;

stefan · Answer 4 · 2017-12-09T10:03:26+08:00

Maybe a starting point...

select distinct
  CS.id
, CT.tag
, sum( CT.rank ) over ( partition by C.source, CT.tag order by C.source) as tag_total
, count( CT.category ) over ( partition by C.source, CT.tag order by C.source) as count
, ( sum( CT.rank ) over ( partition by C.source, CT.tag order by C.source) )
  / ( count( CT.category ) over ( partition by C.source, CT.tag order by C.source) 
  ) as tag_source_rank
from category_source CS
  join category C on CS.id = C.source
  join category_tags CT on C.label = CT.category
order by CS.id, tag_total desc
;

Result:

 id |     tag     | tag_total | count | tag_source_rank 
----+-------------+-----------+-------+-----------------
  1 | test        |        44 |     2 |              22
  1 | sample      |        42 |     2 |              21
  1 | method      |        40 |     2 |              20
  1 | extra       |        21 |     1 |              21
  1 | variety     |        18 |     1 |              18
  1 | explanation |        15 |     1 |              15
  1 | question    |        15 |     1 |              15
  2 | method      |        10 |     1 |              10
  2 | explanation |         8 |     1 |               8
  2 | test        |         6 |     1 |               6
  2 | question    |         5 |     1 |               5
  2 | variety     |         4 |     1 |               4
  2 | sample      |         2 |     1 |               2
(13 rows)

Dbfiddle (Postgresql 9.5)
