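I am trying to write a query in which I have to count a customer's visits while handling overlapping days. Suppose ItemID 20009 has a start date of the 23rd and an end date of the 26th. Item 20010 falls between those days, so we do not add its purchase date to our total count.
Example scenario: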
Item ID  Start Date  End Date    Number of days  Number of days candidate for visit count
20009    2015-01-23  2015-01-26  4               4
20010    2015-01-24  2015-01-24  1               0
20011    2015-01-23  2015-01-26  4               0
20012    2015-01-23  2015-01-27  5               1
20013    2015-01-23  2015-01-27  5               0
20014    2015-01-29  2015-01-30  2               2
The output should be 7 VisitDays.
Input table:
CREATE TABLE #Items
(
    CustID    INT,
    ItemID    INT,
    StartDate DATETIME,
    EndDate   DATETIME
);

INSERT INTO #Items
SELECT 11205, 20009, '2015-01-23', '2015-01-26'
UNION ALL
SELECT 11205, 20010, '2015-01-24', '2015-01-24'
UNION ALL
SELECT 11205, 20011, '2015-01-23', '2015-01-26'
UNION ALL
SELECT 11205, 20012, '2015-01-23', '2015-01-27'
UNION ALL
SELECT 11205, 20012, '2015-01-23', '2015-01-27'
UNION ALL
SELECT 11205, 20012, '2015-01-28', '2015-01-29';
What I have tried so far:
CREATE TABLE #VisitsTable
(
    StartDate DATETIME,
    EndDate   DATETIME
);

INSERT INTO #VisitsTable
SELECT DISTINCT
       StartDate,
       EndDate
FROM #Items items
WHERE CustID = 11205
ORDER BY StartDate ASC;

IF EXISTS (SELECT TOP 1 1 FROM #VisitsTable)
BEGIN
    SELECT ISNULL(SUM(VisitDays), 1)
    FROM (
        SELECT DISTINCT
               abc.StartDate,
               abc.EndDate,
               DATEDIFF(DD, abc.StartDate, abc.EndDate) + 1 VisitDays
        FROM #VisitsTable abc
        INNER JOIN #VisitsTable bc
                ON bc.StartDate NOT BETWEEN abc.StartDate AND abc.EndDate
    ) Visits
END

--DROP TABLE #Items
--DROP TABLE #VisitsTable
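There are many questions and articles about packing time intervals, for example Packing Intervals by Itzik Ben-Gan.
You can pack the intervals for a given user. Once packed, there are no overlaps, so you can simply sum up the durations of the packed intervals.
If your intervals are dates without times, I would use a Calendar table. This table simply holds a list of dates covering several decades. If you don't have a Calendar table, just create one; there are many ways to populate such a table.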
For example, 100K rows (about 270 years) starting from 1900-01-01:
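The code for this step did not survive in the text, so here is a minimal sketch of one way to build such a table. The table name dbo.Calendar, the column name dt, and the use of sys.all_objects as a row source are assumptions for illustration, not necessarily what the answer used.

```sql
-- Assumed names: dbo.Calendar and its single column dt are hypothetical.
CREATE TABLE dbo.Calendar
(
    dt DATE NOT NULL PRIMARY KEY
);

-- Cross join a system view with itself to get plenty of rows, number them,
-- and turn each number into a date offset from 1900-01-01.
INSERT INTO dbo.Calendar (dt)
SELECT TOP (100000)
       DATEADD(DAY,
               CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT) - 1,
               CONVERT(DATE, '19000101'))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;
```

Any other row generator (a Numbers table, stacked CTEs) would work just as well; only the resulting list of consecutive dates matters.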
See also Why are numbers tables "invaluable"?
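Once you have a Calendar table, here is how to use it. Each original row is joined with the Calendar table to return as many rows as there are dates between StartDate and EndDate.
Then we count the distinct dates, which removes the overlapping dates.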
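The join and the distinct count described above could look roughly like this. It is a sketch that assumes a Calendar table named dbo.Calendar with a date column dt (both hypothetical names) and the #Items table from the question.

```sql
-- Each #Items row expands to one row per calendar date it covers;
-- COUNT(DISTINCT ...) then counts every covered date only once per customer.
SELECT i.CustID,
       COUNT(DISTINCT c.dt) AS VisitDays
FROM #Items AS i
INNER JOIN dbo.Calendar AS c
        ON c.dt >= i.StartDate
       AND c.dt <= i.EndDate
GROUP BY i.CustID;
```

With the sample rows inserted in the question, the covered dates run from 2015-01-23 through 2015-01-29, so this should report 7 days for customer 11205.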
Result
I strongly agree that a Numbers table and a Calendar table are very useful, and this problem can be simplified a lot with a Calendar table. I'll suggest another solution though (one that needs neither a calendar table nor windowed aggregates, as some of the answers from the linked post by Itzik do). It may not be the most efficient in all cases (or may be the worst in all cases!) but I don't think it harms to test.
It works by first finding the start dates and end dates that do not overlap with other intervals, then puts them in two separate sets (start dates and end dates) in order to assign them row numbers, and finally matches the 1st start date with the 1st end date, the 2nd with the 2nd, etc.:
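The query itself is missing from the text, so the following is a reconstruction of the described steps rather than the answer's exact code. It assumes the #Items table from the question, and the CTE names start_dates and end_dates are illustrative.

```sql
WITH start_dates AS
(
    -- Start dates not covered by any interval that began strictly earlier.
    SELECT i.CustID,
           i.StartDate,
           ROW_NUMBER() OVER (PARTITION BY i.CustID
                              ORDER BY i.StartDate) AS rn
    FROM #Items AS i
    WHERE NOT EXISTS (SELECT 1
                      FROM #Items AS j
                      WHERE j.CustID = i.CustID
                        AND j.StartDate < i.StartDate
                        AND i.StartDate <= j.EndDate)
    GROUP BY i.CustID, i.StartDate
),
end_dates AS
(
    -- End dates not covered by any interval that ends strictly later.
    SELECT i.CustID,
           i.EndDate,
           ROW_NUMBER() OVER (PARTITION BY i.CustID
                              ORDER BY i.EndDate) AS rn
    FROM #Items AS i
    WHERE NOT EXISTS (SELECT 1
                      FROM #Items AS j
                      WHERE j.CustID = i.CustID
                        AND j.StartDate <= i.EndDate
                        AND i.EndDate < j.EndDate)
    GROUP BY i.CustID, i.EndDate
)
-- Pair the n-th surviving start with the n-th surviving end and add up the days.
SELECT s.CustID,
       SUM(DATEDIFF(DAY, s.StartDate, e.EndDate) + 1) AS VisitDays
FROM start_dates AS s
INNER JOIN end_dates AS e
        ON e.CustID = s.CustID
       AND e.rn = s.rn
GROUP BY s.CustID;
```

On the question's sample rows the surviving start dates are 2015-01-23 and 2015-01-28 and the surviving end dates are 2015-01-27 and 2015-01-29, which pair up as 5 + 2 = 7 days. Adjacent intervals are kept as separate packed intervals here, which does not change a day count.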
Two indexes, on (CustID, StartDate, EndDate) and on (CustID, EndDate, StartDate), would be useful for improving performance of the query.

An advantage over the Calendar (perhaps the only one) is that it can easily be adapted to work with datetime values and to count the length of the "packed intervals" in a different precision, larger (weeks, years) or smaller (hours, minutes, seconds, milliseconds, etc.), and not only in whole dates. A Calendar table of minute or second precision would be quite big, and (cross) joining it to a big table would be quite an interesting experience but possibly not the most efficient one.

(Thanks to Vladimir Baranov): It is rather difficult to make a proper comparison of performance, because the performance of the different methods would likely depend on the data distribution:
1) How long the intervals are. The shorter the intervals, the better the Calendar table would perform, because long intervals would produce a lot of intermediate rows.
2) How often the intervals overlap: mostly non-overlapping intervals vs. most intervals covering the same range. I think the performance of Itzik's solution depends on that.
There could be other ways to skew the data, and it's hard to tell how the efficiency of the various methods would be affected.
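For reference, the two indexes mentioned above might be created like this on the question's #Items table. The index names are made up for the example.

```sql
-- Supports the lookup of intervals that cover a given start date.
CREATE INDEX IX_Items_CustID_StartDate_EndDate
    ON #Items (CustID, StartDate, EndDate);

-- Supports the lookup of intervals that cover a given end date.
CREATE INDEX IX_Items_CustID_EndDate_StartDate
    ON #Items (CustID, EndDate, StartDate);
```

Whether they help in practice depends on the data distribution discussed above, so it is worth comparing execution plans with and without them.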
The first query creates distinct ranges of start and end dates with no overlaps.
Notes: the sample (id=0) is mixed with a sample from Ypercube (id=1).
Query:
Output:
If you use these start and end dates with DATEDIFF:
The output (with duplicates) is:
(SUM=7)
(SUM=10)
Then you just have to put everything together with a SUM and GROUP BY:
Output:
Using the data with 2 different IDs:
I think this would be straightforward with a calendar table, for example:
Test rig
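The example that followed is not preserved in the text. As a stand-in, here is one possible calendar-based formulation. It is only a sketch: dbo.Calendar and its date column dt are hypothetical names, and the EXISTS form is a choice made for this illustration, not necessarily the answer's query.

```sql
-- For every customer, count the calendar dates that fall inside
-- at least one of that customer's intervals.
SELECT cust.CustID,
       COUNT(*) AS VisitDays
FROM (SELECT DISTINCT CustID FROM #Items) AS cust
CROSS JOIN dbo.Calendar AS c
WHERE EXISTS (SELECT 1
              FROM #Items AS i
              WHERE i.CustID = cust.CustID
                AND c.dt BETWEEN i.StartDate AND i.EndDate)
GROUP BY cust.CustID;
```

Because each calendar date is tested once per customer, overlapping intervals cannot be counted twice. On a large Calendar table it would be sensible to restrict c.dt to the range of dates actually present in the data.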