AA.SC
Asked: 2016-02-23 23:59:56 +0800 CST

Calculate total visits


I am trying to write a query that calculates the number of visits for a customer while handling overlapping days. Suppose ItemID 20009 has a start date of the 23rd and an end date of the 26th. Item 20010 falls within those days, so its purchase date is not added to the total count.

Example scenario:

Item ID  Start Date   End Date     Number of days  Days counted toward visit total
20009    2015-01-23   2015-01-26   4               4
20010    2015-01-24   2015-01-24   1               0
20011    2015-01-23   2015-01-26   4               0
20012    2015-01-23   2015-01-27   5               1
20013    2015-01-23   2015-01-27   5               0
20014    2015-01-29   2015-01-30   2               2

The output should be 7 VisitDays.

Input table:

CREATE TABLE #Items    
(
CustID INT,
ItemID INT,
StartDate DATETIME,
EndDate DATETIME
)           


INSERT INTO #Items
SELECT 11205, 20009, '2015-01-23',  '2015-01-26'  
UNION ALL 
SELECT 11205, 20010, '2015-01-24',  '2015-01-24'    
UNION ALL  
SELECT 11205, 20011, '2015-01-23',  '2015-01-26' 
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'  
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'   
UNION ALL  
SELECT 11205, 20012, '2015-01-28',  '2015-01-29'  

What I have tried so far:

CREATE TABLE #VisitsTable
    (
      StartDate DATETIME,
      EndDate DATETIME
    )

INSERT  INTO #VisitsTable
        SELECT DISTINCT
                StartDate,
                EndDate
        FROM    #Items items
        WHERE   CustID = 11205
        ORDER BY StartDate ASC

IF EXISTS (SELECT TOP 1 1 FROM #VisitsTable) 
BEGIN 


SELECT  ISNULL(SUM(VisitDays),1)
FROM    ( SELECT DISTINCT
                    abc.StartDate,
                    abc.EndDate,
                    DATEDIFF(DD, abc.StartDate, abc.EndDate) + 1 VisitDays
          FROM      #VisitsTable abc
                    INNER JOIN #VisitsTable bc ON bc.StartDate NOT BETWEEN abc.StartDate AND abc.EndDate      
        ) Visits

END



--DROP TABLE #Items 
--DROP TABLE #VisitsTable      
sql-server sql-server-2008-r2
4 Answers

  1. Vladimir Baranov
    2016-02-24T02:54:49+08:00

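    There are many questions and articles about packing time intervals, for example Packing Intervals by Itzik Ben-Gan.

    You can pack the intervals for the given user. Once packed, there are no overlaps left, so you can simply sum the durations of the packed intervals.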
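
To make the packing idea concrete outside SQL, here is a minimal Python sketch, not the T-SQL from the linked article. It assumes inclusive date ranges for a single customer; the function name `packed_visit_days` is made up for illustration. It sorts the ranges, merges any that overlap or touch, and sums the lengths of the merged ranges.

```python
from datetime import date

def packed_visit_days(intervals):
    """Merge overlapping or adjacent inclusive date ranges, then sum their lengths."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is not None and (start - cur_end).days <= 1:
            # Overlaps or directly follows the current packed range: extend it.
            cur_end = max(cur_end, end)
        else:
            if cur_end is not None:
                total += (cur_end - cur_start).days + 1
            cur_start, cur_end = start, end
    if cur_end is not None:
        total += (cur_end - cur_start).days + 1
    return total

# Sample scenario from the question (items 20009 to 20014).
sample = [
    (date(2015, 1, 23), date(2015, 1, 26)),
    (date(2015, 1, 24), date(2015, 1, 24)),
    (date(2015, 1, 23), date(2015, 1, 26)),
    (date(2015, 1, 23), date(2015, 1, 27)),
    (date(2015, 1, 23), date(2015, 1, 27)),
    (date(2015, 1, 29), date(2015, 1, 30)),
]
print(packed_visit_days(sample))  # 7
```

The packed ranges here are 23 to 27 January (5 days) and 29 to 30 January (2 days), which gives the expected 7.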


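    If your intervals are dates without times, I would use a Calendar table. This table simply lists dates for several decades. If you don't have a Calendar table, just create one: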

    CREATE TABLE [dbo].[Calendar](
        [dt] [date] NOT NULL,
    CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED 
    (
        [dt] ASC
    ));
    

    There are many ways to populate such a table.

    For example, 100K rows (~270 years) starting from 1900-01-01:

    INSERT INTO dbo.Calendar (dt)
    SELECT TOP (100000) 
        DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '19000101') AS dt
    FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
    OPTION (MAXDOP 1);
    

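    See also Why are numbers tables "invaluable"?

    Once you have a Calendar table, here is how to use it.

    Each original row is joined with the Calendar table to return as many rows as there are dates between StartDate and EndDate.

    Then we count the distinct dates, which removes the overlapping dates.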

    SELECT COUNT(DISTINCT CA.dt) AS TotalCount
    FROM
        #Items AS T
        CROSS APPLY
        (
            SELECT dbo.Calendar.dt
            FROM dbo.Calendar
            WHERE
                dbo.Calendar.dt >= T.StartDate
                AND dbo.Calendar.dt <= T.EndDate
        ) AS CA
    WHERE T.CustID = 11205
    ;
    

    Result

    TotalCount
    7
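
For readers who want to check the count without a database, the same expand-then-count-distinct logic can be sketched in Python. This is only an illustration of the idea; `distinct_visit_days` is a hypothetical helper name, and a Python set plays the role of both the Calendar join and COUNT(DISTINCT ...).

```python
from collections import defaultdict
from datetime import date, timedelta

def distinct_visit_days(rows):
    """rows: iterable of (cust_id, start_date, end_date), both dates inclusive."""
    days_by_customer = defaultdict(set)
    for cust_id, start, end in rows:
        # Plays the role of the join to Calendar: one entry per date in the range.
        for offset in range((end - start).days + 1):
            days_by_customer[cust_id].add(start + timedelta(days=offset))
    # Sets drop duplicates, like COUNT(DISTINCT dt).
    return {cust_id: len(days) for cust_id, days in days_by_customer.items()}

# Rows as inserted into #Items in the question.
items = [
    (11205, date(2015, 1, 23), date(2015, 1, 26)),
    (11205, date(2015, 1, 24), date(2015, 1, 24)),
    (11205, date(2015, 1, 23), date(2015, 1, 26)),
    (11205, date(2015, 1, 23), date(2015, 1, 27)),
    (11205, date(2015, 1, 23), date(2015, 1, 27)),
    (11205, date(2015, 1, 28), date(2015, 1, 29)),
]
print(distinct_visit_days(items))  # {11205: 7}
```

As with the SQL version, long intervals produce many intermediate entries, so this is a way to verify results on small data rather than a performance suggestion.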
    
    • 8
  2. ypercubeᵀᴹ
    2016-02-24T05:38:44+08:00

    I strongly agree that a Numbers table and a Calendar table are very useful, and this problem can be simplified a lot with a Calendar table.

    I'll suggest another solution though (that doesn't need either a calendar table or windowed aggregates - as some of the answers from the linked post by Itzik do). It may not be the most efficient in all cases (or may be the worst in all cases!) but I don't think it harms to test.

    It works by first finding start and end dates that do not overlap with other intervals, then puts them in two rows (separately the start and end dates) in order to assign them row numbers and finally matches the 1st start date with the 1st end date, the 2nd with the 2nd, etc.:

    WITH 
      start_dates AS
        ( SELECT CustID, StartDate,
                 Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                         ORDER BY StartDate)
          FROM items AS i
          WHERE NOT EXISTS
                ( SELECT *
                  FROM Items AS j
                  WHERE j.CustID = i.CustID
                    AND j.StartDate < i.StartDate AND i.StartDate <= j.EndDate 
                )
          GROUP BY CustID, StartDate
        ),
      end_dates AS
        ( SELECT CustID, EndDate,
                 Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                         ORDER BY EndDate) 
          FROM items AS i
          WHERE NOT EXISTS
                ( SELECT *
                  FROM Items AS j
                  WHERE j.CustID = i.CustID
                    AND j.StartDate <= i.EndDate AND i.EndDate < j.EndDate 
                )
          GROUP BY CustID, EndDate
        )
    SELECT s.CustID, 
           Result = SUM( DATEDIFF(day, s.StartDate, e.EndDate) + 1 )
    FROM start_dates AS s
      JOIN end_dates AS e
        ON  s.CustID = e.CustID
        AND s.Rn = e.Rn 
    GROUP BY s.CustID ;
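
The pairing logic of the two CTEs can be traced with a small Python sketch. It handles the intervals of a single customer, and the function name `paired_boundaries_days` is invented for this illustration; the two `any(...)` filters mirror the two NOT EXISTS conditions, and `zip` stands in for the join on Rn.

```python
from datetime import date

def paired_boundaries_days(intervals):
    """Sum packed lengths by pairing uncovered start dates with uncovered end dates."""
    # A start date opens a packed range if no interval begins earlier and still covers it.
    starts = sorted({s for s, _ in intervals
                     if not any(js < s <= je for js, je in intervals)})
    # An end date closes a packed range if no interval covers it and ends later.
    ends = sorted({e for _, e in intervals
                   if not any(js <= e < je for js, je in intervals)})
    # Pair the 1st start with the 1st end, the 2nd with the 2nd, and so on.
    return sum((e - s).days + 1 for s, e in zip(starts, ends))

# Sample scenario from the question (items 20009 to 20014).
sample = [
    (date(2015, 1, 23), date(2015, 1, 26)),
    (date(2015, 1, 24), date(2015, 1, 24)),
    (date(2015, 1, 23), date(2015, 1, 26)),
    (date(2015, 1, 23), date(2015, 1, 27)),
    (date(2015, 1, 23), date(2015, 1, 27)),
    (date(2015, 1, 29), date(2015, 1, 30)),
]
print(paired_boundaries_days(sample))  # 7
```

For the sample, the surviving start dates are 23 and 29 January and the surviving end dates are 27 and 30 January, so the pairs are 23 to 27 (5 days) and 29 to 30 (2 days).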
    

    Two indexes, on (CustID, StartDate, EndDate) and on (CustID, EndDate, StartDate) would be useful for improving performance of the query.

    An advantage over the Calendar (perhaps the only one) is that it can easily be adapted to work with datetime values and to count the length of the "packed intervals" in a different precision, larger (weeks, years) or smaller (hours, minutes, seconds, milliseconds, etc.), and not only in whole dates. A Calendar table of minute or second precision would be quite big, and (cross) joining it to a big table would be quite an interesting experience but possibly not the most efficient one.
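
As a rough illustration of measuring packed length at a finer precision, here is a Python sketch that merges datetime ranges and reports the covered time. The visit times are hypothetical and not taken from the question, and `packed_duration` is an invented name. Unlike the whole-day case there is no "+ 1": the length of a range is simply end minus start.

```python
from datetime import datetime, timedelta

def packed_duration(intervals):
    """Total time covered by datetime ranges, counting overlapping stretches once."""
    total = timedelta(0)
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is not None and start <= cur_end:
            # Overlaps or touches the current packed range: extend it.
            cur_end = max(cur_end, end)
        else:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# Hypothetical visits with time-of-day values; not data from the question.
visits = [
    (datetime(2015, 1, 23, 9, 0), datetime(2015, 1, 23, 10, 30)),
    (datetime(2015, 1, 23, 10, 0), datetime(2015, 1, 23, 11, 0)),
    (datetime(2015, 1, 23, 14, 0), datetime(2015, 1, 23, 14, 45)),
]
covered = packed_duration(visits)
print(covered.total_seconds() / 60)    # 165.0 (minutes)
print(covered.total_seconds() / 3600)  # 2.75 (hours)
```

The same total can be reported in any unit after the fact, which is the flexibility a per-day Calendar table does not give directly.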

    (thanks to Vladimir Baranov): It is rather difficult to have a proper comparison of performance, because performance of different methods would likely depend on the data distribution. 1) how long are the intervals - the shorter the intervals, the better Calendar table would perform, because long intervals would produce a lot of intermediate rows 2) how often intervals overlap - mostly non-overlapping intervals vs. most intervals covering the same range. I think performance of Itzik's solution depends on that. There could be other ways to skew the data and it's hard to tell how efficiency of the various methods would be affected.

    • 7
  3. Best Answer
    Julien Vavasseur
    2016-02-24T02:55:59+08:00

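    The first query creates distinct Start Date and End Date ranges with no overlaps.

    Notes:

    • Your sample (id=0) is mixed with a sample from Ypercube (id=1)
    • This solution may not scale well with a large amount of data per id or a large number of ids. It has the advantage of not requiring a numbers table. With a large dataset, a numbers table will most likely give better performance.

    Query: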

    SELECT DISTINCT its.id
        , Start_Date = its.Start_Date 
        , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
        --, x1=itmax.End_Date, x2=itmin.Start_Date, x3=its.End_Date
    FROM @Items its
    OUTER APPLY (
        SELECT Start_Date = MAX(End_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
    ) itmin
    OUTER APPLY (
        SELECT End_Date = MIN(Start_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
    ) itmax;
    

    Output:

    id  | Start_Date                    | End_Date                      
    0   | 2015-01-23 00:00:00.0000000   | 2015-01-23 00:00:00.0000000   => 1
    0   | 2015-01-24 00:00:00.0000000   | 2015-01-27 00:00:00.0000000   => 4
    0   | 2015-01-29 00:00:00.0000000   | 2015-01-30 00:00:00.0000000   => 2
    1   | 2016-01-20 00:00:00.0000000   | 2016-01-22 00:00:00.0000000   => 3
    1   | 2016-01-23 00:00:00.0000000   | 2016-01-24 00:00:00.0000000   => 2
    1   | 2016-01-25 00:00:00.0000000   | 2016-01-29 00:00:00.0000000   => 5
    

    If you use these start and end dates with DATEDIFF:

    SELECT DATEDIFF(day
        , its.Start_Date 
        , COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
    ) + 1
    ...
    

    The output (with duplicates) is:

    • 1, 4 and 2 for id 0 (your sample => SUM=7)
    • 3, 2 and 5 for id 1 (Ypercube's sample => SUM=10)

    Then you only need to put everything together with a SUM and a GROUP BY:

    SELECT id 
        , Days = SUM(
            DATEDIFF(day, Start_Date, End_Date)+1
        )
    FROM (
        SELECT DISTINCT its.id
             , Start_Date = its.Start_Date 
            , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
        FROM @Items its
        OUTER APPLY (
            SELECT Start_Date = MAX(End_Date) FROM @Items std
            WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
        ) itmin
        OUTER APPLY (
            SELECT End_Date = MIN(Start_Date) FROM @Items std
            WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
        ) itmax
    ) as d
    GROUP BY id;
    

    Output:

    id  Days
    0   7
    1   10
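
To follow what the two OUTER APPLY lookups do row by row, the query can be traced in Python against the two-id sample data listed at the end of this answer. This is a sketch for tracing only, checked against these two samples and not validated as a general interval-packing method. The name `split_ranges` is made up, and the sketch restricts the lookups to rows with the same id, which the SQL leaves implicit because the two samples do not overlap in time.

```python
from datetime import date, timedelta

def split_ranges(rows):
    """rows: list of (id, item_id, start, end). Returns {id: total_days}."""
    pieces = set()  # a set stands in for SELECT DISTINCT
    for gid, item, start, end in rows:
        # Assumption: only compare against other items of the same id.
        others = [r for r in rows if r[0] == gid and r[1] != item]
        # itmin: ends of other ranges that begin earlier and still cover this start.
        covering_ends = [e for _, _, s, e in others if s < start and e > start]
        # itmax: starts of other ranges that begin strictly inside this range.
        inner_starts = [s for _, _, s, e in others if start < s < end]
        if inner_starts:
            piece_end = min(inner_starts) - timedelta(days=1)
        elif covering_ends and max(covering_ends) > end:
            piece_end = max(covering_ends)
        else:
            piece_end = end
        pieces.add((gid, start, piece_end))
    totals = {}
    for gid, start, piece_end in pieces:
        totals[gid] = totals.get(gid, 0) + (piece_end - start).days + 1
    return totals

rows = [
    (0, 20009, date(2015, 1, 23), date(2015, 1, 26)),
    (0, 20010, date(2015, 1, 24), date(2015, 1, 24)),
    (0, 20011, date(2015, 1, 23), date(2015, 1, 26)),
    (0, 20012, date(2015, 1, 23), date(2015, 1, 27)),
    (0, 20013, date(2015, 1, 23), date(2015, 1, 27)),
    (0, 20014, date(2015, 1, 29), date(2015, 1, 30)),
    (1, 20009, date(2016, 1, 20), date(2016, 1, 24)),
    (1, 20010, date(2016, 1, 23), date(2016, 1, 26)),
    (1, 20011, date(2016, 1, 25), date(2016, 1, 29)),
]
print(sorted(split_ranges(rows).items()))  # [(0, 7), (1, 10)]
```

The distinct pieces produced for id 0 are 23 to 23, 24 to 27 and 29 to 30 January, matching the first output table above.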
    

    Data used, with 2 different ids:

    INSERT INTO @Items
        (id, Item_ID, Start_Date, End_Date)
    VALUES 
        (0, 20009, '2015-01-23', '2015-01-26'),
        (0, 20010, '2015-01-24', '2015-01-24'),
        (0, 20011, '2015-01-23', '2015-01-26'),
        (0, 20012, '2015-01-23', '2015-01-27'),
        (0, 20013, '2015-01-23', '2015-01-27'),
        (0, 20014, '2015-01-29', '2015-01-30'),
    
        (1, 20009, '2016-01-20', '2016-01-24'),
        (1, 20010, '2016-01-23', '2016-01-26'),
        (1, 20011, '2016-01-25', '2016-01-29')
    
    • 5
  4. wBob
    2016-02-24T02:53:41+08:00

    I think this would be straightforward with a calendar table, e.g. something like this:

    SELECT i.CustID, COUNT( DISTINCT c.calendarDate ) days
    FROM #Items i
        INNER JOIN dbo.calendar c ON c.calendarDate Between i.StartDate And i.EndDate
    GROUP BY i.CustID
    

    Test rig

    USE tempdb
    GO
    
    -- Cutdown calendar script
    IF OBJECT_ID('dbo.calendar') IS NULL
    BEGIN
    
        CREATE TABLE dbo.calendar (
            calendarId      INT IDENTITY(1,1) NOT NULL,
            calendarDate    DATE NOT NULL,
    
            CONSTRAINT PK_calendar__main PRIMARY KEY ( calendarDate ASC ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
            CONSTRAINT UK_calendar__main UNIQUE NONCLUSTERED ( calendarId ASC ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
        ) ON [PRIMARY]
    END
    GO
    
    
    -- Populate calendar table once only
    IF NOT EXISTS ( SELECT * FROM dbo.calendar )
    BEGIN
    
        -- Populate calendar table
        WITH cte AS
        (
        SELECT 0 x
        UNION ALL
        SELECT x + 1
        FROM cte
        WHERE x < 11323 -- Generate days from 1 Jan 2010; the outer filter below keeps dates up to 31 Dec 2030 (extend if required)
        )
        INSERT INTO dbo.calendar ( calendarDate )
        SELECT
            calendarDate
        FROM
            (
            SELECT 
                DATEADD( day, x, '1 Jan 2010' ) calendarDate,
                DATEADD( month, -7, DATEADD( day, x, '1 Jan 2010' ) ) academicDate
            FROM cte
            ) x
        WHERE calendarDate < '1 Jan 2031'
        OPTION ( MAXRECURSION 0 )
    
        ALTER INDEX ALL ON dbo.calendar REBUILD
    
    END
    GO
    
    
    
    
    
    IF OBJECT_ID('tempdb..Items') IS NOT NULL DROP TABLE Items
    GO
    
    CREATE TABLE dbo.Items
        (
        CustID INT NOT NULL,
        ItemID INT NOT NULL,
        StartDate DATE NOT NULL,
        EndDate DATE NOT NULL,
    
        -- Inline INDEX definitions need SQL Server 2014 or later
        INDEX _cdx_Items CLUSTERED ( CustID, StartDate, EndDate )
        )
    GO
    
    INSERT INTO Items ( CustID, ItemID, StartDate, EndDate )
    SELECT 11205, 20009, '2015-01-23',  '2015-01-26'  
    UNION ALL 
    SELECT 11205, 20010, '2015-01-24',  '2015-01-24'    
    UNION ALL  
    SELECT 11205, 20011, '2015-01-23',  '2015-01-26' 
    UNION ALL  
    SELECT 11205, 20012, '2015-01-23',  '2015-01-27'  
    UNION ALL  
    SELECT 11205, 20012, '2015-01-23',  '2015-01-27'   
    UNION ALL  
    SELECT 11205, 20012, '2015-01-28',  '2015-01-29'
    GO
    
    
    -- Scale up : )
    ;WITH cte AS (
    SELECT TOP 1000000 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
    FROM master.sys.columns c1
        CROSS JOIN master.sys.columns c2
        CROSS JOIN master.sys.columns c3
    )
    INSERT INTO Items ( CustID, ItemID, StartDate, EndDate )
    SELECT 11206 + rn % 999, 20012 + rn, DATEADD( day, rn % 333, '1 Jan 2015' ), DATEADD( day, ( rn % 333 ) + rn % 7, '1 Jan 2015' )
    FROM cte
    GO
    --:exit
    
    
    
    -- My query: Pros: simple, one copy of items, easy to understand and maintain.  Scales well to 1 million + rows.
    -- Cons: requires calendar table.  Others?
    SELECT i.CustID, COUNT( DISTINCT c.calendarDate ) days
    FROM dbo.Items i
        INNER JOIN dbo.calendar c ON c.calendarDate Between i.StartDate And i.EndDate
    GROUP BY i.CustID
    --ORDER BY i.CustID
    GO
    
    
    -- Vladimir query: Pros: Effectively same as above
    -- Cons: I wouldn't use CROSS APPLY where it's not necessary.  Fortunately optimizer simplifies avoiding RBAR (I think).
    -- Point of style maybe, but in terms of queries being self-documenting I prefer number 1.
    SELECT T.CustID, COUNT( DISTINCT CA.calendarDate ) AS TotalCount
    FROM
        Items AS T
        CROSS APPLY
        (
            SELECT c.calendarDate
            FROM dbo.calendar c
            WHERE
                c.calendarDate >= T.StartDate
                AND c.calendarDate <= T.EndDate
        ) AS CA
    GROUP BY T.CustID
    --ORDER BY T.CustID
    --WHERE T.CustID = 11205
    GO
    
    
    /*  WARNING!! This is commented out as it can't compete in the scale test.  Will finish at scale 100, 1,000, 10,000, eventually.  I got 38 mins for 100,000.  Pegs CPU.  
    
    -- Julien:  Pros; does not require calendar table.
    -- Cons: over-complicated (eg versus Query 1 in terms of number of lines of code, clauses etc); three copies of dbo.Items table (we have already shown
    -- this query is possible with one); does not scale (even at 100,000 rows query ran for 38 minutes on my test rig versus sub-second for first two queries).  <<-- this is serious.
    -- Indexing could help.
    SELECT DISTINCT
        CustID,
         StartDate = CASE WHEN itmin.StartDate < its.StartDate THEN itmin.StartDate ELSE its.StartDate END
        , EndDate = CASE WHEN itmax.EndDate > its.EndDate THEN itmax.EndDate ELSE its.EndDate END
    FROM Items its
    OUTER APPLY (
        SELECT StartDate = MIN(StartDate) FROM Items std
        WHERE std.ItemID <> its.ItemID AND (
            (std.StartDate <= its.StartDate AND std.EndDate >= its.StartDate)
            OR (std.StartDate >= its.StartDate AND std.StartDate <= its.EndDate)
        )
    ) itmin
    OUTER APPLY (
        SELECT EndDate = MAX(EndDate) FROM Items std
        WHERE std.ItemID <> its.ItemID AND (
            (std.EndDate >= its.StartDate AND std.EndDate <= its.EndDate)
            OR (std.StartDate <= its.EndDate AND std.EndDate >= its.EndDate)
        )
    ) itmax
    GO
    */
    
    -- ypercube:  Pros; does not require calendar table.
    -- Cons: over-complicated (eg versus Query 1 in terms of number of lines of code, clauses etc); four copies of dbo.Items table (we have already shown
    -- this query is possible with one); does not scale well; at 1,000,000 rows query ran for 2:20 minutes on my test rig versus sub-second for first two queries.
    WITH 
      start_dates AS
        ( SELECT CustID, StartDate,
                 Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                         ORDER BY StartDate)
          FROM items AS i
          WHERE NOT EXISTS
                ( SELECT *
                  FROM Items AS j
                  WHERE j.CustID = i.CustID
                    AND j.StartDate < i.StartDate AND i.StartDate <= j.EndDate 
                )
          GROUP BY CustID, StartDate
        ),
      end_dates AS
        ( SELECT CustID, EndDate,
                 Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                         ORDER BY EndDate) 
          FROM items AS i
          WHERE NOT EXISTS
                ( SELECT *
                  FROM Items AS j
                  WHERE j.CustID = i.CustID
                    AND j.StartDate <= i.EndDate AND i.EndDate < j.EndDate 
                )
          GROUP BY CustID, EndDate
        )
    SELECT s.CustID, 
           Result = SUM( DATEDIFF(day, s.StartDate, e.EndDate) + 1 )
    FROM start_dates AS s
      JOIN end_dates AS e
        ON  s.CustID = e.CustID
        AND s.Rn = e.Rn 
    GROUP BY s.CustID ;
    
    • 2
