Yes, yet another greatest-n-per-group question.
Given a table releases with the following columns:
id | primary key |
volume | double precision |
chapter | double precision |
series | integer-foreign-key |
include | boolean | not null
I want to select the composite maximum of volume, then chapter, for a set of series.
Right now, if I query per distinct series, I can easily accomplish this as follows:
SELECT
releases.chapter AS releases_chapter,
releases.include AS releases_include,
releases.series AS releases_series
FROM releases
WHERE releases.series = 741
AND releases.include = TRUE
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST LIMIT 1;
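However, if I have a large set of series (and I do), this quickly runs into efficiency problems: I end up issuing 100+ queries to generate a single page.
I'd like to roll the whole thing into a single query, where I can simply say WHERE releases.series IN (1,2,3....), but I haven't figured out how to convince Postgres to let me do that.
The naive approach would be: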
SELECT releases.volume AS releases_volume,
releases.chapter AS releases_chapter,
releases.series AS releases_series
FROM
releases
WHERE
releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499,
556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137,
1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768,
1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255,
2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606,
2634, 2636, 2695, 2696 )
AND releases.include = TRUE
GROUP BY
releases_series
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST;
Which obviously doesn't work:
ERROR: column "releases.volume" must appear in the GROUP BY clause or be used in an aggregate function
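Without the GROUP BY, it does fetch everything, and with some simple procedural filtering it would even work, but there must be a "proper" way to do this in SQL.
Following the error and adding aggregates: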
SELECT max(releases.volume) AS releases_volume,
max(releases.chapter) AS releases_chapter,
releases.series AS releases_series
FROM
releases
WHERE
releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499,
556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137,
1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768,
1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255,
2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606,
2634, 2636, 2695, 2696 )
AND releases.include = TRUE
GROUP BY
releases_series;
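This mostly works, but the problem is that the two maximums aren't coherent. If I have two rows where volume:chapter is 1:5 and 4:1, I need to return 4:1, but the independent maximums return 4:5.
Frankly, this would be so simple to implement in my application code that I must be missing something obvious here. How can I implement a query that actually satisfies my requirements?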
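A minimal sketch of that mismatch, using an inline VALUES list as a stand-in for the releases table (the two rows are the 1:5 and 4:1 example; everything else is illustrative only):

```sql
-- Independent aggregates mix values from different rows.
SELECT max(volume) AS volume, max(chapter) AS chapter
FROM  (VALUES (1, 5), (4, 1)) AS t(volume, chapter);
-- returns volume = 4, chapter = 5

-- Ordering by the composite key and taking the first row keeps one real row.
SELECT volume, chapter
FROM  (VALUES (1, 5), (4, 1)) AS t(volume, chapter)
ORDER  BY volume DESC, chapter DESC
LIMIT  1;
-- returns volume = 4, chapter = 1
```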
The simple solution in Postgres is DISTINCT ON, which keeps only the first row of each series according to the ORDER BY clause.
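A sketch of what such a DISTINCT ON query could look like for the table in the question. The IN list is abbreviated here to a few of the ids from the question, and the alias r is only shorthand:

```sql
SELECT DISTINCT ON (r.series)
       r.volume  AS releases_volume,
       r.chapter AS releases_chapter,
       r.series  AS releases_series
FROM   releases r
WHERE  r.series IN (12, 17, 44, 79, 741)  -- abbreviated list, for illustration
AND    r.include
-- The DISTINCT ON expression must lead the ORDER BY; the remaining
-- sort keys decide which row of each series comes first and is kept.
ORDER  BY r.series, r.volume DESC NULLS LAST, r.chapter DESC NULLS LAST;
```

Because the row is picked as a whole, volume and chapter always come from the same release, which avoids the 4:5 result of the independent max() calls.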
Depending on data distribution, there may be faster techniques.
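Also, for long lists there are faster alternatives than IN (). Combining an unnested array with a LATERAL join tends to be faster. For best performance, you need a matching multicolumn index.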
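One way the unnest plus LATERAL combination and a matching index might be written. The array is abbreviated, and the aliases t, rel, latest and the index name are assumptions made for this sketch:

```sql
-- One small "top 1" subquery per requested series id.
SELECT latest.*
FROM   unnest('{12, 17, 44, 79, 741}'::int[]) AS t(series_id)  -- abbreviated list
CROSS  JOIN LATERAL (
   SELECT rel.volume  AS releases_volume,
          rel.chapter AS releases_chapter,
          rel.series  AS releases_series
   FROM   releases rel
   WHERE  rel.series = t.series_id
   AND    rel.include
   ORDER  BY rel.volume DESC NULLS LAST, rel.chapter DESC NULLS LAST
   LIMIT  1
   ) latest;
-- Note: CROSS JOIN LATERAL drops ids that have no qualifying row.

-- Index whose column order and sort order match the subquery above
-- (the index name is hypothetical).
CREATE INDEX releases_series_volume_chapter_idx
ON releases (series, volume DESC NULLS LAST, chapter DESC NULLS LAST);
```

With an index sorted this way, each LATERAL subquery can be answered by reading the first matching index entry for its series instead of sorting all rows of that series.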
If there are more than a few rows where include is not true, and you are only interested in rows with include = true, then consider a partial multicolumn index.
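A partial index along those lines could be declared as follows. This is a sketch only, and the index name is an assumption:

```sql
-- Same column order and sort order as the full multicolumn index,
-- but only rows with include = true are stored in it.
CREATE INDEX releases_series_volume_chapter_include_idx
ON releases (series, volume DESC NULLS LAST, chapter DESC NULLS LAST)
WHERE include;
```

The planner only considers a partial index when the query's WHERE clause implies the index predicate, so queries meant to use it should keep the filter on include.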