I've added a solution that doesn't use window functions, plus a benchmark over a large data set, below Martin's answer.
This is a follow-up to the thread GROUP BY with columns that are not in the SELECT list - when is it practical, elegant, or powerful?
In my solution to that challenge, I used a query that groups by an expression which is not part of the select list. This is often used together with window functions, where the logical grouping element involves data from other rows.
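As a minimal illustration, here is a query against the Measurements table defined below that groups by a column it never selects:

-- The grouping column [Sensor] does not appear in the SELECT list.
SELECT COUNT(*) AS [Readings]
FROM   [Measurements]
GROUP BY [Sensor];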
I came up with an example, perhaps an overkill one, but I think you may find the challenge itself interesting. I'll hold off on posting my solution; perhaps some of you can come up with a better one.
The Challenge
We have a table of sensors that log readings periodically. The sample times are not guaranteed to be at regular intervals.
You need to write a query that reports "exceptions": the occasions on which a sensor reported readings outside its thresholds, either too low or too high. Every period during which a sensor reports above or below its thresholds is considered an "exception". Once a reading returns to normal, the exception ends.
Sample Tables and Data
The script is T-SQL, and is part of my training materials.
------------------------------------------
-- Sensor Thresholds - 1 - Setup Example --
------------------------------------------
CREATE TABLE [Sensors]
(
    [Sensor] NVARCHAR(10) NOT NULL,
    [Lower Threshold] DECIMAL(7,2) NOT NULL,
    [Upper Threshold] DECIMAL(7,2) NOT NULL,
    CONSTRAINT [PK Sensors]
        PRIMARY KEY CLUSTERED ([Sensor]),
    CONSTRAINT [CK Value Range]
        CHECK ([Upper Threshold] > [Lower Threshold])
);
GO
INSERT INTO [Sensors]
(
    [Sensor],
    [Lower Threshold],
    [Upper Threshold]
)
VALUES (N'Sensor A', -50, 50 ),
(N'Sensor B', 40, 80),
(N'Sensor C', 0, 100);
GO
CREATE TABLE [Measurements]
(
    [Sensor] NVARCHAR(10) NOT NULL,
    [Measure Time] DATETIME2(0) NOT NULL,
    [Measurement] DECIMAL(7,2) NOT NULL,
    CONSTRAINT [PK Measurements]
        PRIMARY KEY CLUSTERED ([Sensor], [Measure Time]),
    CONSTRAINT [FK Measurements Sensors]
        FOREIGN KEY ([Sensor])
        REFERENCES [Sensors] ([Sensor])
);
GO
INSERT INTO [Measurements]
(
    [Sensor],
    [Measure Time],
    [Measurement]
)
VALUES ( N'Sensor A', N'20160101 08:00', -9),
( N'Sensor A', N'20160101 09:00', 30),
( N'Sensor A', N'20160101 10:30', 59),
( N'Sensor A', N'20160101 23:00', 66),
( N'Sensor A', N'20160102 08:00', 48),
( N'Sensor A', N'20160102 11:30', 08),
( N'Sensor B', N'20160101 08:00', 39), -- Note that this exception range has both over and under....
( N'Sensor B', N'20160101 10:30', 88),
( N'Sensor B', N'20160101 13:00', 75),
( N'Sensor B', N'20160102 08:00', 95),
( N'Sensor B', N'20160102 17:00', 75),
( N'Sensor C', N'20160101 09:00', 01),
( N'Sensor C', N'20160101 10:00', -1),
( N'Sensor C', N'20160101 18:00', -2),
( N'Sensor C', N'20160101 22:00', -2),
( N'Sensor C', N'20160101 23:30', -1);
GO
Expected Results
Sensor Exception Start Time Exception End Time Exception Duration (minutes) Min Measurement Max Measurement Lower Threshold Upper Threshold Maximal Delta From Thresholds
------ -------------------- ------------------ ---------------------------- --------------- --------------- --------------- --------------- -----------------------------
Sensor A 2016-01-01 10:30:00 2016-01-02 08:00:00 1290 59.00 66.00 -50.00 50.00 16.00
Sensor B 2016-01-01 08:00:00 2016-01-01 13:00:00 300 39.00 88.00 40.00 80.00 8.00
Sensor B 2016-01-02 08:00:00 2016-01-02 17:00:00 540 95.00 95.00 40.00 80.00 15.00
Sensor C 2016-01-01 10:00:00 2016-01-01 23:30:00 810 -2.00 -1.00 0.00 100.00 -2.00
I would probably use something like the following.
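A sketch of such a query (a reconstruction in the same spirit rather than the exact original; the Flagged and Grouped CTE names and the Grp running count are illustrative):

-- Classic "islands" pattern: a running count of the *normal* rows per sensor
-- stays constant across each run of consecutive exception rows, so it can
-- serve as the grouping expression even though it is never selected.
WITH Flagged AS
(
    SELECT  m.[Sensor],
            m.[Measure Time],
            m.[Measurement],
            s.[Lower Threshold],
            s.[Upper Threshold],
            CASE WHEN m.[Measurement] NOT BETWEEN s.[Lower Threshold]
                                              AND s.[Upper Threshold]
                 THEN 1 ELSE 0
            END AS IsException
    FROM    [Measurements] AS m
    JOIN    [Sensors] AS s
        ON  s.[Sensor] = m.[Sensor]
),
Grouped AS
(
    SELECT  *,
            -- constant within an exception run: counts preceding normal rows
            COUNT(CASE WHEN IsException = 0 THEN 1 END) OVER
                (PARTITION BY [Sensor] ORDER BY [Measure Time]
                 ROWS UNBOUNDED PRECEDING) AS Grp,
            -- the next reading (of any kind) closes the exception period
            LEAD([Measure Time]) OVER
                (PARTITION BY [Sensor] ORDER BY [Measure Time]) AS NextTime
    FROM    Flagged
)
SELECT  [Sensor],
        MIN([Measure Time]) AS [Exception Start Time],
        MAX(COALESCE(NextTime, [Measure Time])) AS [Exception End Time],
        DATEDIFF(MINUTE, MIN([Measure Time]),
                 MAX(COALESCE(NextTime, [Measure Time])))
            AS [Exception Duration (minutes)],
        MIN([Measurement]) AS [Min Measurement],
        MAX([Measurement]) AS [Max Measurement],
        MIN([Lower Threshold]) AS [Lower Threshold],
        MAX([Upper Threshold]) AS [Upper Threshold],
        -- signed delta of whichever threshold was exceeded the most
        CASE WHEN MAX([Measurement]) - MAX([Upper Threshold])
                  >= MIN([Lower Threshold]) - MIN([Measurement])
             THEN MAX([Measurement]) - MAX([Upper Threshold])
             ELSE MIN([Measurement]) - MIN([Lower Threshold])
        END AS [Maximal Delta From Thresholds]
FROM    Grouped
WHERE   IsException = 1
GROUP BY [Sensor], Grp              -- Grp never appears in the SELECT list
ORDER BY [Sensor], [Exception Start Time];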
It is able to use the index order and avoid a sort until it reaches the final GROUP BY (for me it uses a Stream Aggregate). In principle, the final grouping operation isn't actually needed: it should be possible to read an input stream sorted by Sensor, Measure Time and output the desired results in a streaming fashion, but I think you would need to write a SQLCLR procedure for that.
A streaming SQL CLR function implementation that reads rows in Sensor, [Measure Time] order:
Resources
Deployment script
The CREATE ASSEMBLY bits (slightly too long to post inline)
Note: because of its requirements, this assembly needs EXTERNAL_ACCESS permission, though it only ever reads from the same database. For testing purposes, it is enough to make the database TRUSTWORTHY, although there are good reasons not to do that in production; sign the assembly instead. The parameters are required so the function knows how to connect to the source data.
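As a rough outline of what deployment and usage look like (the assembly itself is not reproduced; the names [SensorStream], dbo.SensorExceptions, [SensorDemo], and the DLL path are placeholders):

-- Hypothetical deployment outline; all object names are illustrative.
ALTER DATABASE [SensorDemo] SET TRUSTWORTHY ON;   -- test systems only; sign the assembly in production
GO
CREATE ASSEMBLY [SensorStream]
FROM N'C:\CLR\SensorStream.dll'                   -- illustrative path
WITH PERMISSION_SET = EXTERNAL_ACCESS;            -- required for the loop-back read
GO
CREATE FUNCTION dbo.SensorExceptions
(
    @ServerName   NVARCHAR(128),   -- tells the function where the source data lives
    @DatabaseName NVARCHAR(128)
)
RETURNS TABLE
(
    [Sensor] NVARCHAR(10),
    [Exception Start Time] DATETIME2(0),
    [Exception End Time] DATETIME2(0),
    [Min Measurement] DECIMAL(7,2),
    [Max Measurement] DECIMAL(7,2)
)
AS EXTERNAL NAME [SensorStream].[UserDefinedFunctions].[SensorExceptions];
GO
-- Usage:
DECLARE @Server NVARCHAR(128) = @@SERVERNAME,
        @Db     NVARCHAR(128) = DB_NAME();
SELECT * FROM dbo.SensorExceptions(@Server, @Db);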
Execution plan
Results
I wrote my attempt at a solution without looking at the other answers, and I wasn't surprised to see that my query was very similar to Martin's. I seem to get the correct results with one fewer window function, though I doubt it makes much difference to performance. Here is the full code:
Here is a screenshot of the plan:
It is difficult to see the details, so I uploaded it to Paste The Plan as well.
To understand how part of it works, consider the rows for Sensor B in the q2 derived table: for that sensor there are two exception periods. The first row of an exception period must be an exception by definition. If the exception period contains a row that isn't an exception, then that row must come after all of the exception rows. Therefore, I can get the minimum exception time by taking the minimum time value, and I can get the maximum exception time by taking the minimum time value for a row that isn't an exception, or, if no such row exists, by taking the maximum time value.
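To sketch that start/end logic in isolation (this is not Joe's exact code; the inline q2 below hand-codes the Sensor B rows with an assumed IsException flag and period GroupID):

-- Start = MIN(time); end = earliest normal row's time, else the last row's time.
SELECT  q2.[Sensor],
        MIN(q2.[Measure Time]) AS [Exception Start Time],
        COALESCE(MIN(CASE WHEN q2.IsException = 0
                          THEN q2.[Measure Time] END),  -- first normal row, if any
                 MAX(q2.[Measure Time]))                -- otherwise the last row
            AS [Exception End Time]
FROM   (VALUES (N'Sensor B', CAST(N'20160101 08:00' AS DATETIME2(0)), 1, 1),
               (N'Sensor B', CAST(N'20160101 10:30' AS DATETIME2(0)), 1, 1),
               (N'Sensor B', CAST(N'20160101 13:00' AS DATETIME2(0)), 0, 1),
               (N'Sensor B', CAST(N'20160102 08:00' AS DATETIME2(0)), 1, 2),
               (N'Sensor B', CAST(N'20160102 17:00' AS DATETIME2(0)), 0, 2)
       ) AS q2 ([Sensor], [Measure Time], IsException, GroupID)
GROUP BY q2.[Sensor], q2.GroupID;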
Better late than never...
I promised to provide my solution to this challenge a few months ago, but since both Martin and Joe came up with solutions very similar to my original one, I decided to look for another. :-) For an extra challenge, I decided to try to find one without window functions, so that it would also be valid for other RDBMSs that don't yet support window functions.
Time went by, and I honestly just forgot about this challenge, but this morning I had an hour to spare, happened to recall it, and so here is an alternative solution without window functions. The general idea is to find the "nearest next normal measurement" for each exception row and use it as the grouping expression for the GROUP BY in the outer query. More details in the code comments.
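A sketch of that idea (a reconstruction of the approach described above rather than the original code; ExceptionRows and NextNormalTime are illustrative names):

-- Tag every exception row with the time of the nearest *next* normal reading
-- of the same sensor; all rows of one exception period share that value, so
-- it can serve as the grouping expression of the outer GROUP BY.
WITH ExceptionRows AS
(
    SELECT  m.[Sensor],
            m.[Measure Time],
            m.[Measurement],
            (SELECT MIN(n.[Measure Time])       -- nearest next normal reading
             FROM   [Measurements] AS n
             JOIN   [Sensors] AS sn
                 ON sn.[Sensor] = n.[Sensor]
             WHERE  n.[Sensor] = m.[Sensor]
               AND  n.[Measure Time] > m.[Measure Time]
               AND  n.[Measurement] BETWEEN sn.[Lower Threshold]
                                        AND sn.[Upper Threshold]
            ) AS NextNormalTime                 -- NULL if the period never closes
    FROM    [Measurements] AS m
    JOIN    [Sensors] AS s
        ON  s.[Sensor] = m.[Sensor]
    WHERE   m.[Measurement] NOT BETWEEN s.[Lower Threshold]
                                    AND s.[Upper Threshold]
)
SELECT  [Sensor],
        MIN([Measure Time]) AS [Exception Start Time],
        COALESCE(NextNormalTime, MAX([Measure Time])) AS [Exception End Time],
        MIN([Measurement])  AS [Min Measurement],
        MAX([Measurement])  AS [Max Measurement]
FROM    ExceptionRows
GROUP BY [Sensor], NextNormalTime   -- grouping by the correlated lookup's result
ORDER BY [Sensor], [Exception Start Time];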
While Paul's CLR solution is unique and might be very efficient (I didn't test it), I was really looking for SQL solutions. Still, kudos to Paul; if you have a case where the benefit of using CLR outweighs the challenges it introduces, you should consider it. I usually advise avoiding CLR in SQL Server like the plague and using it only as a last resort... It is also not portable to other platforms.
Both Martin's and Joe's solutions come up with nearly identical execution plans, with only a minor difference in operator order. I also find them similar in terms of clarity, so I'm granting the accepted answer to Martin, but only because he published his first.
Both Martin's and Joe's solutions seem to be more efficient than mine judging by the estimated query cost. For the small sample data, the optimizer came up with this plan for my solution:
You can see that there are 5 table access operators vs. only 2 for Joe's and Martin's solutions, but on the other hand, no spooling vs. their 2 spools...
The optimizer estimated that my solution would be about twice as expensive as Joe's and Martin's: 0.056 vs. 0.032 total estimated plan cost.
So, being curious, I decided to test all the solutions with a larger data set:
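The generation script itself isn't shown here; a sketch of a generator producing roughly that volume (interval, row count, and value range are illustrative) could look like this:

-- ~55,000 readings per sensor x 3 sensors ~ 165,000 rows, at 7-minute
-- intervals starting after the sample data so the primary key cannot collide.
WITH N AS
(
    SELECT TOP (55000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM   sys.all_columns AS c1
    CROSS JOIN sys.all_columns AS c2
)
INSERT INTO [Measurements] ([Sensor], [Measure Time], [Measurement])
SELECT  s.[Sensor],
        DATEADD(MINUTE, 7 * CAST(n.n AS INT),
                CAST(N'20160201' AS DATETIME2(0))),
        CAST(CHECKSUM(NEWID()) % 120 AS DECIMAL(7,2))  -- occasionally crosses thresholds
FROM    N AS n
CROSS JOIN [Sensors] AS s;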
This resulted in ~160,000 rows in the table. The estimated plan cost for my solution now increased to 30,501, vs. only 3.8 for Joe's and Martin's... Here is the plan for my solution with the large data set:
I decided to run an actual benchmark. I cleared the buffer pool before each execution, and these are the results on my laptop:
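(For reference, a typical way to clear the buffer pool between runs, using standard commands, and never on a production system:)

CHECKPOINT;              -- flush dirty pages so their buffers become clean
DBCC DROPCLEANBUFFERS;   -- drop all clean buffers, so the next run starts cold
SET STATISTICS TIME ON;  -- report CPU and elapsed time for the tested query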
Now that is a decisive result... long live window functions!
I played with it a bit, trying to optimize it further, but I added this solution mainly for educational purposes. It seems the optimizer's estimates are far off the mark.
Can you find a way to optimize it without using window functions? That would be interesting.
Thanks again for all the solutions; I learned something from each of them. And... apologies again for the (very) late reply.
Have a great weekend, everyone!