Let's say you have a stored procedure that accepts an array of datetimes. The values are loaded into a temp table and used to filter a datetime column in a table.
- Any number of values may be inserted as from and to dates.
- The date ranges may sometimes overlap, though that's not something I'd rely on happening often.
- Dates may also be supplied with a time component.
What's the most efficient way to write the query that does the filtering?
Setup
USE StackOverflow2013;
CREATE TABLE
#d
(
dfrom datetime,
dto datetime,
PRIMARY KEY (dfrom, dto)
)
INSERT
#d
(
dfrom,
dto
)
SELECT
dfrom = '2013-11-20',
dto = '2013-12-05'
UNION ALL
SELECT
dfrom = '2013-11-27',
dto = '2013-12-12';
CREATE INDEX
p
ON dbo.Posts
(CreationDate)
WITH
(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
Query
The best I've been able to come up with is EXISTS, used like this:
SELECT
c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE EXISTS
(
SELECT
1/0
FROM #d AS d
WHERE p.CreationDate BETWEEN d.dfrom
AND d.dto
);
This results in a rather sad-looking execution plan:
Nested Loops is the only join operator available, since we don't have an equality predicate.
What I'm looking for is alternative syntax that produces a different type of join.
Thanks!
Join
You may find that a join gives adequate performance, even though it still uses nested loops. This is a variation of David Browne's updated answer:
Runs in about 150 ms for me.
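The query itself was not preserved above; a minimal sketch of the join approach might look like the following. It counts distinct Posts rows to guard against overlapping ranges, and it assumes Id is the primary key of Posts — this is my reconstruction, not the original answer's exact query.

```sql
-- Sketch only: not the original answer's query.
-- DISTINCT guards against double-counting when ranges in #d overlap.
SELECT
    c = COUNT_BIG(DISTINCT p.Id)
FROM dbo.Posts AS p
JOIN #d AS d
    ON p.CreationDate >= d.dfrom
    AND p.CreationDate <= d.dto;
```

With only an index on CreationDate, this still drives a nested loops join (one range seek per row in #d), which is why the answer notes it can nonetheless perform acceptably.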
Remove overlaps and join
If you have a large number of overlapping rows, it may be worth reducing them to distinct, non-overlapping ranges, as described in Martin Smith's answer. My implementation of a similar idea is:
Takes about 40 ms for me.
Dynamic seeks without a join
Another approach is to generate the literal ranges dynamically:
With the sample data, this produces:
Which SQL Server simplifies down to a single range seek:
Takes about 35 ms for me.
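The generated statement is not preserved above; the kind of construction being described would presumably look something like this sketch (formatting and STRING_AGG usage are my assumptions; STRING_AGG requires SQL Server 2017+):

```sql
-- Sketch: build one range predicate per row in #d and OR them together.
DECLARE @sql nvarchar(max);

SELECT @sql = N'
SELECT c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE ' +
    -- for many ranges, CONVERT the pieces to nvarchar(max) first
    STRING_AGG(
        N'(p.CreationDate >= ''' + CONVERT(nchar(23), d.dfrom, 126) +
        N''' AND p.CreationDate <= ''' + CONVERT(nchar(23), d.dto, 126) + N''')',
        N' OR ')
FROM #d AS d;

EXECUTE sys.sp_executesql @sql;
```

Because every range appears as a constant in the statement, the optimizer can collapse overlapping literal ranges into a minimal set of seeks, which is the simplification the answer refers to.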
In general, the optimizer will simplify the ranges where possible (much as the merge interval does). There seems to be no limit on the number of separate range seeks within a single Index Seek operator. I got bored after 1,440 seeks.
Indexed view
Rather than counting rows over and over, it may help to store and maintain bucketed counts in an indexed view. The following implementation uses hour granularity:
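The view definition itself is not preserved above; a sketch of an hour-granularity indexed view follows. The object and column names are my assumptions, not the answer's.

```sql
-- Sketch: bucket Posts by hour and maintain a COUNT_BIG per bucket.
-- Requires the usual indexed-view SET options and deterministic expressions.
CREATE VIEW dbo.PostsPerHour
WITH SCHEMABINDING
AS
SELECT
    -- start of the hour containing CreationDate
    HourStart =
        DATEADD(
            HOUR,
            DATEDIFF(HOUR, CONVERT(datetime, '19000101', 112), p.CreationDate),
            CONVERT(datetime, '19000101', 112)),
    NumPosts = COUNT_BIG(*)
FROM dbo.Posts AS p
GROUP BY
    DATEADD(
        HOUR,
        DATEDIFF(HOUR, CONVERT(datetime, '19000101', 112), p.CreationDate),
        CONVERT(datetime, '19000101', 112));
GO
-- materialize the view
CREATE UNIQUE CLUSTERED INDEX cuq
ON dbo.PostsPerHour (HourStart);
```

CONVERT with an explicit style (112) is used instead of a bare string literal because indexed views require every expression to be deterministic.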
The view takes around 2s to create on my laptop. At hour granularity, the indexed view holds 47,469 rows for my copy of the StackOverflow2013 database's Posts table with 17,142,169 rows.
The majority of the work can be done from the view. Only the part-hour periods at the start and end of any range not covered by any other range need to be processed separately. For example:
The three queries above complete in less than 1 ms overall. There are no part-hour periods in the sample data. Performance will decrease somewhat with more and larger part periods, or if a less precise indexed view granularity is used.
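The queries themselves are not preserved above. For a single range, the split between whole-hour buckets and part-hour edges could be sketched like this, assuming an hour-granularity indexed view dbo.PostsPerHour (HourStart, NumPosts) — all names here are my assumptions:

```sql
-- Sketch for one range. Assumes the range spans at least one full hour.
DECLARE
    @from datetime = '20131120 10:30',
    @to   datetime = '20131205 17:45';

DECLARE
    @firstFullHour datetime = DATEADD(HOUR, DATEDIFF(HOUR, 0, @from) + 1, 0),
    @lastFullHour  datetime = DATEADD(HOUR, DATEDIFF(HOUR, 0, @to), 0);

SELECT c =
    -- whole hours come straight from the view
    (
        SELECT ISNULL(SUM(v.NumPosts), 0)
        FROM dbo.PostsPerHour AS v WITH (NOEXPAND)
        WHERE v.HourStart >= @firstFullHour
        AND v.HourStart < @lastFullHour
    ) +
    -- part hour at the start of the range
    (
        SELECT COUNT_BIG(*)
        FROM dbo.Posts AS p
        WHERE p.CreationDate >= @from
        AND p.CreationDate < @firstFullHour
    ) +
    -- part hour at the end of the range
    (
        SELECT COUNT_BIG(*)
        FROM dbo.Posts AS p
        WHERE p.CreationDate >= @lastFullHour
        AND p.CreationDate <= @to
    );
```

Only the two edge probes touch the base table; everything in between is answered from the pre-aggregated buckets, which is why the view-based queries complete so quickly.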
If your number of dates is small enough, you could materialize a sorted list of all the valid dates, e.g.
then do a merge join against it. Or go the other direction, e.g.
If you preprocess #d to merge the overlapping ranges, a plain COUNT_BIG(*) is all you need.
Using dynamic SQL to generate a series of UNION statements seems to eliminate the Nested Loops, resulting in a Hash Match instead (actual execution plan).
It seems to run pretty fast on my end. Of course, this could result in a long list of UNION clauses, depending on how many rows are in the #d temp table.
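The generated statement is not preserved above; for the sample data, it would presumably take a shape like this sketch. UNION (rather than UNION ALL) is assumed so that the duplicate-removal step — the Hash Match — also takes care of rows matched by overlapping ranges:

```sql
-- Sketch of the generated UNION statement for the two sample ranges.
SELECT c = COUNT_BIG(*)
FROM
(
    SELECT p.Id
    FROM dbo.Posts AS p
    WHERE p.CreationDate BETWEEN '20131120' AND '20131205'
    UNION  -- removes Ids matched by more than one range
    SELECT p.Id
    FROM dbo.Posts AS p
    WHERE p.CreationDate BETWEEN '20131127' AND '20131212'
) AS u;
```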
YMMV.
What I would really like is to be able to get a plan like
Where SQL Server uses its built-in Merge Interval operator to combine the overlapping ranges.
I don't think it is currently possible to get this driven by a table in this case, though.
My attempt to manually simulate this is below.
If I got the overlapping-range logic correct (it has taken a few stabs so far, but hopefully it is now right), this should collapse the distinct ranges down efficiently and then seek exactly the needed rows for those ranges from Posts.
The merging logic is:
- Order the rows by dfrom, dto (which conveniently is the order of the PK) and keep track of the largest dto value seen in preceding rows (MaxDtoSoFar).
- If the dfrom in a row is <= MaxDtoSoFar then we are continuing an existing interval and set IsIntervalStart to 0; otherwise we are starting a new interval and set that to 1.
- Take the rows where IsIntervalStart is 1 and fill in the relevant MaxDtoSoFar value.
- The end of an interval is MaxDtoSoFar and not GREATEST(dto, MaxDtoSoFar).
NB: The above method does do the interval merging with a single ordered clustered index scan and no sort operations, but it does have a wide execution plan, with multiples of each of the Segment / Sequence Project / Window Spool / Stream Aggregate operators.
Paul White provided an alternative method in the comments that can use batch mode Window Aggregates exclusively and has a much more streamlined plan (as well as likely being simpler SQL).
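Neither query is preserved above; a sketch along the lines of the window-aggregate formulation (my own reconstruction, not either author's exact query) might be:

```sql
-- Sketch: collapse overlapping ranges with window functions, then seek Posts.
WITH flagged AS
(
    SELECT
        d.dfrom,
        d.dto,
        -- 1 when this row starts a new interval, 0 when it extends one;
        -- the empty frame on the first row yields NULL, so it falls to ELSE 1
        IsIntervalStart =
            CASE
                WHEN d.dfrom <=
                    MAX(d.dto) OVER
                    (
                        ORDER BY d.dfrom, d.dto
                        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                    )
                THEN 0
                ELSE 1
            END
    FROM #d AS d
),
grouped AS
(
    SELECT
        f.dfrom,
        f.dto,
        -- running total of starts = interval group number
        grp = SUM(f.IsIntervalStart) OVER
              (ORDER BY f.dfrom, f.dto ROWS UNBOUNDED PRECEDING)
    FROM flagged AS f
),
merged AS
(
    SELECT
        dfrom = MIN(g.dfrom),
        dto   = MAX(g.dto)
    FROM grouped AS g
    GROUP BY g.grp
)
SELECT
    c = COUNT_BIG(*)
FROM merged AS m
JOIN dbo.Posts AS p
    ON p.CreationDate >= m.dfrom
    AND p.CreationDate <= m.dto;
```

Because the window ORDER BY matches the temp table's primary key (dfrom, dto), the merging can be done from an ordered scan without any sort operators.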
First I'd like to apologize for the odd query plan in my answer. My computer was recently hacked and I've been unable to remove the SSMS plugin.
Performance can be greatly improved by splitting each valid date into its own row. The trick is to also carry along the time information, so that data at the endpoints can be excluded where necessary. For example, consider the date range '20230709 18:00:00' to '20230711 04:00:00'. For that range, you would include rows with a date of 20230709 and a time >= 18:00:00, all rows for 20230710, and rows with a date of 20230711 and a time <= 04:00:00.
Below is the same sample data, along with an additional temp table that summarizes the date ranges in the way described above:
This query takes 64 ms to execute on my machine:
The relative speedup is around 300x. Here is the query plan:
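The setup and query themselves are not reproduced above; a sketch of the day-splitting idea follows. Table and column names are my assumptions, and it assumes the ranges in #d have already been merged so they no longer overlap (otherwise the same day could be counted twice):

```sql
-- Sketch: expand each range into one row per covered calendar day,
-- keeping per-day datetime bounds so the endpoint days can be trimmed.
-- Assumes each range covers fewer days than sys.all_columns has rows.
SELECT
    FromTime = CASE WHEN x.DayStart > d.dfrom THEN x.DayStart ELSE d.dfrom END,
    ToTime   = CASE WHEN b.DayEnd   < d.dto   THEN b.DayEnd   ELSE d.dto   END
INTO #dSplit
FROM #d AS d
CROSS APPLY
(
    SELECT TOP (DATEDIFF(DAY, d.dfrom, d.dto) + 1)
        DayStart = DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1,
                   CONVERT(datetime, CONVERT(date, d.dfrom)))
    FROM sys.all_columns AS ac
) AS x
CROSS APPLY
(
    -- last datetime tick of the day (datetime has 3 ms precision)
    SELECT DayEnd = DATEADD(MILLISECOND, -3, DATEADD(DAY, 1, x.DayStart))
) AS b;

SELECT c = COUNT_BIG(*)
FROM #dSplit AS s
JOIN dbo.Posts AS p
    ON p.CreationDate >= s.FromTime
    AND p.CreationDate <= s.ToTime;
```

Each day row now has an equality-friendly, narrow seek range, which is what lets the join run so much faster than one seek per wide range.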
There is an interesting concept called a static relational interval tree. Itzik Ben-Gan has written about it, but I think the sample code in his blog posts has some issues, so I didn't manage to get it working. In any case, here is a link to one example; a web search for those terms turns up more:
https://lucient.com/blog/a-static-relational-interval-tree/#
It seems your objective is to find the number of matching posts that fall in the given date ranges. Since you haven't given more information on the data distribution and volumes, it is hard to make proper recommendations.
How about the obvious, assuming that there's a primary key "id" in Posts:
and creating an index on Posts (CreationDate, id).