Erik Darling
Asked: 2023-07-11 06:20:49 +0800 CST

Improving performance of multiple date range predicates

  • 772

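Let's say

You have a stored procedure that accepts arrays of datetimes, which are loaded into a temp table and used to filter a datetime column in a table.

  • Any number of values can be inserted as start and end dates.
  • The date ranges may sometimes overlap, but that is not a condition I would regularly rely on.
  • Dates with times can also be supplied.

What is the most efficient way to write a query to perform the filtering?

Setup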

USE StackOverflow2013;

CREATE TABLE
    #d
(
    dfrom datetime,
    dto datetime,
    PRIMARY KEY (dfrom, dto)
);
INSERT
    #d
(
    dfrom,
    dto
)
SELECT
    dfrom = '2013-11-20',
    dto =   '2013-12-05'
UNION ALL
SELECT
    dfrom = '2013-11-27',
    dto =   '2013-12-12'; 

CREATE INDEX
    p
ON dbo.Posts
    (CreationDate)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Query

The best I have come up with is using EXISTS like this:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE EXISTS
(
    SELECT
        1/0
    FROM #d AS d
    WHERE p.CreationDate BETWEEN d.dfrom
                             AND d.dto
);

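This results in a rather sad-looking execution plan:

[Image: execution plan]

Nested Loops is the only join operator available, because we don't have an equality predicate.

What I'm looking for is alternative syntax that produces a different type of join.

Thanks!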

sql-server
7 Answers

  1. Best Answer
    Paul White
    2023-07-11T17:50:34+08:00

    Join

    You may find that a join provides adequate performance, despite still using nested loops. This is a variation on David Browne's updated answer:

    SELECT 
        NumRows = COUNT_BIG(DISTINCT P.Id) 
    FROM #d AS D
    JOIN dbo.Posts AS P
        ON P.CreationDate BETWEEN D.dfrom AND D.dto;
    

    [Image: join plan]

    For me, this runs in around 150 ms.
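The DISTINCT in the query above matters because a post that falls inside two overlapping ranges is returned by the join once per matching range. A minimal sketch in Python, using the two sample ranges from the question and a few made-up posts (the ids and creation dates are hypothetical, chosen only to illustrate the effect):

```python
from datetime import datetime

# The two overlapping sample ranges from #d (closed intervals, like BETWEEN).
ranges = [
    (datetime(2013, 11, 20), datetime(2013, 12, 5)),
    (datetime(2013, 11, 27), datetime(2013, 12, 12)),
]

# Hypothetical posts: id -> CreationDate.
posts = {
    1: datetime(2013, 11, 21),  # inside the first range only
    2: datetime(2013, 11, 30),  # inside both ranges
    3: datetime(2013, 12, 10),  # inside the second range only
    4: datetime(2014, 1, 1),    # outside both ranges
}

# A join produces one row per (post, matching range) pair.
joined = [
    post_id
    for post_id, created in posts.items()
    for dfrom, dto in ranges
    if dfrom <= created <= dto
]

print(len(joined))       # 4: post 2 is counted twice
print(len(set(joined)))  # 3: the DISTINCT count
```

If the ranges are first reduced to non-overlapping intervals, each post can match at most one interval and a plain count is enough.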

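    Remove overlaps and join

    If you have a large number of overlapping rows, it may be worth reducing them to distinct, non-overlapping ranges, as described in Martin Smith's answer. My implementation of a similar idea is: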

    WITH 
        Intervals AS
        (
        SELECT 
            IntervalStart =
                ISNULL
                (
                    LAG(Q1.NextStart) OVER (ORDER BY Q1.ThisFrom),
                    Q1.FirstFrom
                ),
            IntervalEnd = 
                IIF 
                (
                    Q1.NextStart IS NOT NULL, 
                    Q1.ThisEnd,
                    Q1.LastEnd
                )
        FROM 
        (
            SELECT
                ThisFrom = D.dfrom, 
                ThisEnd = D.dto,
                -- Remember the start of the next row because that row
                -- may get filtered out by the outer WHERE clause if it
                -- is not also the end of an interval.
                NextStart = LEAD(D.dfrom) OVER (
                    ORDER BY D.dfrom, D.dto),
                -- Start point of the first interval
                FirstFrom = MIN(D.dfrom) OVER (
                    ORDER BY D.dfrom, D.dto 
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
                -- End point of the last interval
                LastEnd = MAX(D.dto) OVER (
                    ORDER BY D.dfrom, D.dto 
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
            FROM #d AS D 
            WHERE 
                -- Valid intervals only
                D.dto >= D.dfrom
        ) AS Q1
        WHERE
            -- Interval ends only
            Q1.NextStart > Q1.ThisEnd
            OR Q1.NextStart IS NULL
    )
    SELECT COUNT_BIG(*) 
    FROM dbo.Posts AS P
    JOIN Intervals AS N
        ON P.CreationDate BETWEEN N.IntervalStart AND N.IntervalEnd;
    

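    [Image: merged intervals plan]

    This takes around 40 ms for me.

    Dynamic seeks without a join

    Another approach is to generate the literal ranges dynamically: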

    DECLARE @SQL nvarchar(max) =
    N'
    SELECT Rows = COUNT_BIG(*)
    FROM dbo.Posts AS P
    WHERE 0 = 1
    ';
    
    SELECT @SQL +=
        STRING_AGG
        (
            CONCAT
            (
                CONVERT(nvarchar(max), SPACE(0)),
                N'OR P.CreationDate BETWEEN ',
                N'CONVERT(datetime, ',
                NCHAR(39),
                CONVERT(nchar(23), D.dfrom, 121),
                NCHAR(39),
                N', 121)',
                N' AND CONVERT(datetime, ',
                NCHAR(39),
                CONVERT(nchar(23), D.dto, 121),
                NCHAR(39),
                N', 121)',
                NCHAR(13), NCHAR(10)
            ),
            N''
        )
    FROM #d AS D;
    
    SET @SQL += N'OPTION (RECOMPILE);' -- Plan reuse is unlikely
    
    EXECUTE (@SQL);
    

    With the sample data, that produces:

    SELECT Rows = COUNT_BIG(*)
    FROM dbo.Posts AS P
    WHERE 0 = 1
    OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-20 00:00:00.000', 121) 
        AND CONVERT(datetime, '2013-12-05 00:00:00.000', 121)
    OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-27 00:00:00.000', 121) 
        AND CONVERT(datetime, '2013-12-12 00:00:00.000', 121)
    OPTION (RECOMPILE);
    

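    Which SQL Server simplifies to a single range seek:

    [Image: index seek plan]

    This takes around 35 ms for me.

    In general, the optimizer will simplify the ranges as far as possible (much as merging intervals does). There appears to be no limit on the number of separate range seeks in a single index seek operator. I got bored after 1,440 seeks.

    Indexed view

    Storing and maintaining bucketed counts in an indexed view can help, rather than counting rows over and over. The following implementation uses hour granularity: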

    CREATE OR ALTER VIEW dbo.PostsTimeBucket
    WITH SCHEMABINDING
    AS
    SELECT
        HourBucket =
            DATEADD(HOUR, 
                DATEDIFF(HOUR, 
                    CONVERT(datetime, '20000101', 112), 
                    P.CreationDate),
                CONVERT(datetime, '20000101', 112)),
        NumRows = COUNT_BIG(*)
    FROM dbo.Posts AS P
    GROUP BY
        DATEADD(HOUR, 
            DATEDIFF(HOUR, 
                CONVERT(datetime, '20000101', 112), 
                P.CreationDate),
            CONVERT(datetime, '20000101', 112));
    GO
    CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.PostsTimeBucket HourBucket] 
    ON dbo.PostsTimeBucket (HourBucket);
    

    The view takes around 2s to create on my laptop. At hour granularity, the indexed view holds 47,469 rows for my copy of the StackOverflow2013 database's Posts table with 17,142,169 rows.

    The majority of the work can be done from the view. Only part-hour periods at the start and end of any range that are not covered by any other range need to be processed separately. For example:

    -- Helper inline function
    CREATE OR ALTER FUNCTION dbo.RoundToHour (@d datetime)
    RETURNS table
    AS
    RETURN
        SELECT
            HourBucket =
                DATEADD(HOUR, 
                    DATEDIFF(HOUR, 
                        CONVERT(datetime, '20000101', 112), @d),
                    CONVERT(datetime, '20000101', 112));
    GO
    
    -- Whole hours covered by any range in the table
    SELECT 
        SUM(PTB.NumRows) 
    FROM dbo.PostsTimeBucket AS PTB 
        WITH (NOEXPAND)
    WHERE 
        PTB.HourBucket >= (SELECT MIN(D.dfrom) FROM #d AS D)
        AND PTB.HourBucket <= (SELECT MAX(D.dto) FROM #d AS D)
        AND EXISTS
        (
            SELECT * 
            FROM #d AS D
            WHERE
                D.dfrom <= PTB.HourBucket
                AND D.dto >= DATEADD(HOUR, 1, PTB.HourBucket)
        )
    

    [Image: indexed view plan]

    -- Extra rows at start before hour boundary
    SELECT 
        COUNT_BIG(DISTINCT P.Id) 
    FROM #d AS D
    CROSS APPLY dbo.RoundToHour(D.dfrom) AS HF
    JOIN dbo.Posts AS P
        ON P.CreationDate >= D.dfrom
        AND P.CreationDate < DATEADD(HOUR, 1, HF.HourBucket)
    WHERE
        D.dfrom <> HF.HourBucket
        AND NOT EXISTS
        (
            -- Not covered by any hour-long period
            SELECT * 
            FROM #d AS D2
            WHERE
                D2.dfrom <= HF.HourBucket
                AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
        );
    

    [Image: extra rows at start plan]

    -- Extra rows at end
    SELECT 
        COUNT_BIG(DISTINCT P.Id) 
    FROM #d AS D
    CROSS APPLY dbo.RoundToHour(D.dto) AS HF
    JOIN dbo.Posts AS P
        ON P.CreationDate >= HF.HourBucket
        AND P.CreationDate < D.dto
    WHERE
        D.dto <> HF.HourBucket
        AND NOT EXISTS
        (
            -- Not covered by any hour-long period
            SELECT * 
            FROM #d AS D2
            WHERE
                D2.dfrom <= HF.HourBucket
                AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
        );
    

    [Image: extra rows at end plan]

    The three queries above complete in less than 1 ms overall. There are no part-hour periods in the sample data. Performance will decrease somewhat with more and larger part periods, or if a less precise indexed view granularity is used.
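The split into whole-hour buckets plus part-hour edges can be sketched outside SQL. The Python below is only an illustration for a single range. The function names `floor_to_hour` and `split_range` are hypothetical, the range is assumed to be half-open `[dfrom, dto)`, and a range that fits inside one hour is treated as a special case:

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)

def floor_to_hour(d):
    """Round a datetime down to the start of its hour."""
    return d.replace(minute=0, second=0, microsecond=0)

def split_range(dfrom, dto):
    """Split [dfrom, dto) into (head, whole-hour bucket starts, tail)."""
    first_whole = floor_to_hour(dfrom)
    if first_whole < dfrom:
        first_whole += HOUR          # first hour boundary at or after dfrom
    last_whole = floor_to_hour(dto)  # last hour boundary at or before dto
    if first_whole > last_whole:
        # The range lies inside a single hour: no whole buckets at all.
        return (dfrom, dto), [], None
    head = (dfrom, first_whole) if dfrom < first_whole else None
    tail = (last_whole, dto) if last_whole < dto else None
    buckets = []
    bucket = first_whole
    while bucket < last_whole:
        buckets.append(bucket)       # answered from the pre-aggregated counts
        bucket += HOUR
    return head, buckets, tail

head, buckets, tail = split_range(
    datetime(2023, 7, 9, 18, 30), datetime(2023, 7, 9, 21, 15)
)
print(head)     # part-hour 18:30 to 19:00, counted from the base table
print(buckets)  # whole hours starting 19:00 and 20:00
print(tail)     # part-hour 21:00 to 21:15, counted from the base table
```

Only the head and tail need to touch the base table. For the sample data, where both ranges start and end at midnight, there is no head or tail at all.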

    • 9
  2. David Browne - Microsoft
    2023-07-11T08:34:11+08:00

    If your number of dates is small enough, you could materialize a sorted list of all the valid dates, e.g.

    CREATE TABLE #d
    (
        d datetime primary key
    )
    

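    and then do a merge join. Or go the other direction, e.g.

    -- Note: the upper bound is exclusive here, unlike BETWEEN in the question
    SELECT COUNT_BIG(DISTINCT p.Id)
    FROM #d AS d
    CROSS APPLY (SELECT p.Id
           FROM dbo.Posts AS p
           WHERE p.CreationDate >= d.dfrom
             AND p.CreationDate < d.dto) AS p;
    

    If you preprocess #d to merge overlapping ranges, then you can just use COUNT_BIG(*).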

    • 4
  3. J.D.
    2023-07-11T11:27:56+08:00

    Using dynamic SQL to generate a series of UNION statements appears to eliminate the Nested Loops and results in a Hash Match:

    DECLARE @DynamicSQL NVARCHAR(MAX) = N'';
    
    SELECT @DynamicSQL = 
        CONCAT
        (
            '
            SELECT
                c = COUNT_BIG(*)
            FROM (
            ',
                STRING_AGG
                (
                    CONCAT
                    (
                        '
                        SELECT p.Id
                        FROM dbo.Posts AS p
                        WHERE p.CreationDate BETWEEN ''', CONVERT(varchar(23), dfrom, 126), ''' AND ''', CONVERT(varchar(23), dto, 126), ''''
                    ),
                    CONCAT(CHAR(13), CHAR(10), CHAR(13), CHAR(10), 'UNION', CHAR(13), CHAR(10))
                ),
            '
            ) AS PostIds
            '
        )
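    FROM #d;
    
    --PRINT @DynamicSQL;
    EXEC sys.sp_executesql @DynamicSQL;
    

    Actual execution plan

    It seems to run pretty quickly on my end. Of course, this could result in a long list of UNION clauses, depending on how many rows are in the #d temp table. YMMV.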

    • 4
  4. Martin Smith
    2023-07-11T18:12:18+08:00

    What I would really like is to be able to get a plan like

    [Image: execution plan]

    Where SQL Server uses its built-in merge interval operator to combine overlapping ranges.

    I don't think it is currently possible to get this driven by a table in this case, though.

    My attempt to manually simulate this is below:

    WITH T1 AS
    (
    SELECT *, 
            IsIntervalStart = CASE WHEN dfrom <= MAX(dto) OVER (ORDER BY dfrom, dto ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) THEN 0 ELSE 1 END, 
            IsLastRow = CASE WHEN LEAD(dfrom) OVER (ORDER BY dfrom, dto) IS NULL THEN 1 ELSE 0 END,
            MaxDtoSoFar = MAX(dto) OVER (ORDER BY dfrom, dto ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
    FROM #d
    )
    , T2
         AS (SELECT dfrom,
                    dto = CASE
                            WHEN IsLastRow = 1 THEN IIF(dto > MaxDtoSoFar, dto, ISNULL(MaxDtoSoFar, dto))
                            ELSE LEAD(CASE WHEN IsLastRow = 1 AND IsIntervalStart = 0 THEN IIF(dto > MaxDtoSoFar, dto, ISNULL(MaxDtoSoFar, dto)) ELSE MaxDtoSoFar END, 1) OVER (ORDER BY dfrom, dto) END,
                    IsIntervalStart
             FROM   T1
             WHERE  1 IN ( IsIntervalStart, IsLastRow ))
    ,Mergedintervals AS
    (
    SELECT *
    FROM T2
    WHERE IsIntervalStart = 1
    )
    SELECT Rows = COUNT_BIG(*)
    FROM Mergedintervals
    JOIN dbo.Posts AS p ON p.CreationDate  BETWEEN dfrom AND dto
    

    If I got the overlapping range logic correct (it has taken a few attempts so far, but hopefully it is now correct), then this should collapse down the distinct ranges efficiently and then seek exactly the needed rows for those ranges from Posts.

    [Image: execution plan]

    • Follow the rows in order of dfrom, dto (which conveniently is the order of the PK) and keep track of largest dto value seen in preceding rows (MaxDtoSoFar).
    • If the dfrom in a row is <= MaxDtoSoFar then we are continuing an existing interval and set IsIntervalStart to 0 otherwise we are starting a new interval and set that to 1.
    • Finally select all rows where IsIntervalStart = 1 and fill in the relevant MaxDtoSoFar value.
    • For this purpose we need to be careful not to remove the final row too early as we will need to get the values from it for the end date of the final interval.
    • And we need to also be careful about the case that the last row itself is an interval start - in which case the preceding interval should only be looking at MaxDtoSoFar and not GREATEST(dto, MaxDtoSoFar).
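The walk described in the bullets above can be sketched procedurally. This Python version is an illustration of the same idea, not a translation of the query. The function names are hypothetical, while the names `max_dto_so_far` and `is_interval_start` mirror the column aliases used in the SQL:

```python
def flag_interval_starts(rows):
    """Visit rows in (dfrom, dto) order, tracking the largest dto seen in preceding rows."""
    flagged = []
    max_dto_so_far = None
    for dfrom, dto in sorted(rows):
        # A row starts a new interval when it begins after everything seen so far.
        is_interval_start = max_dto_so_far is None or dfrom > max_dto_so_far
        flagged.append((dfrom, dto, is_interval_start))
        max_dto_so_far = dto if max_dto_so_far is None else max(max_dto_so_far, dto)
    return flagged

def merge_flagged(flagged):
    """Build one (start, end) pair per interval start, ending at the largest dto before the next start."""
    merged = []
    for dfrom, dto, is_interval_start in flagged:
        if is_interval_start:
            merged.append([dfrom, dto])
        else:
            merged[-1][1] = max(merged[-1][1], dto)
    return [tuple(interval) for interval in merged]

# Integers stand in for datetimes to keep the example short.
rows = [(1, 10), (2, 3), (11, 12)]
print(merge_flagged(flag_interval_starts(rows)))  # [(1, 10), (11, 12)]
```

The middle row (2, 3) is fully contained in (1, 10), which is why the running maximum is tracked instead of just comparing each row with its immediate predecessor.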

    NB: The above method does do the interval merging with a single ordered clustered index scan and no sort operations but does have a wide execution plan with multiple of each of Segment/Sequence Project/Window Spool/ Stream Aggregate operators.

    Paul White provided an alternative method in the comments that can use batch mode Window Aggregates exclusively and has a much more streamlined plan (as well as likely being simpler SQL).

    [Image: execution plan]

    • 3
  5. Joe Obbish
    2023-07-12T08:04:48+08:00

    First I'd like to apologize for the odd query plan in my answer. My computer was recently hacked and I've been unable to remove the SSMS plugin.

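    Performance can be greatly improved by splitting each valid date into its own row. The trick is to also carry along the time information, so that data at the endpoints can be excluded when necessary. For example, consider the date range '20230709 18:00:00' to '20230711 04:00:00'. For that range, you would include rows with a date of 20230709 and a time >= 18:00:00, all rows for 20230710, and rows with a date of 20230711 and a time <= 04:00:00.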
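As a rough illustration of that expansion for a single range, here is a Python sketch. The function name `expand_to_days` is hypothetical, and the full-day test follows the same rule as the IsFullDate column in the temp table below: the range must start at or before midnight and end at or after the following midnight.

```python
from datetime import datetime, time, timedelta

def expand_to_days(dfrom, dto):
    """Return (base_date, is_full_date) for each calendar day touched by the range."""
    rows = []
    day = dfrom.date()
    while day <= dto.date():
        day_start = datetime.combine(day, time.min)
        next_day_start = day_start + timedelta(days=1)
        # A day is "full" only when the range covers it from midnight to midnight.
        is_full_date = dfrom <= day_start and dto >= next_day_start
        rows.append((day, is_full_date))
        day += timedelta(days=1)
    return rows

for base_date, is_full_date in expand_to_days(
    datetime(2023, 7, 9, 18, 0), datetime(2023, 7, 11, 4, 0)
):
    print(base_date, is_full_date)
# 2023-07-09 False
# 2023-07-10 True
# 2023-07-11 False
```

Only the days flagged False need the extra time comparison against the original endpoints.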

    Below is the same sample data, along with an additional temp table that summarizes the date ranges in the way I described previously:

    DROP TABLE IF EXISTS #d;
    CREATE TABLE #d
    (
        dfrom datetime,
        dto datetime,
        PRIMARY KEY (dfrom, dto)
    );
    
    INSERT #d (dfrom, dto)
    SELECT
        dfrom = '2013-11-20',
        dto =   '2013-12-05'
    UNION ALL
    SELECT
        dfrom = '2013-11-27',
        dto =   '2013-12-12'; 
    
    DROP TABLE IF EXISTS #d_expanded;
    CREATE TABLE
        #d_expanded
    (
        base_date DATE,
        IsFullDate BIT,
        dfrom datetime,
        dto datetime,
        PRIMARY KEY (base_date)
    );
    
    INSERT INTO #d_expanded (base_date, IsFullDate, dfrom, dto)
    SELECT base_date, CASE WHEN MIN(dfrom) <= base_date AND MAX(dto) >= DATEADD(DAY, 1, base_date) THEN 1 ELSE 0 END , MIN(dfrom), MAX(dto)
    FROM (
        SELECT ca.base_date, dfrom, dto
        FROM #d d
        CROSS APPLY (
            SELECT TOP (1 + DATEDIFF_BIG(DAY, d.dfrom, d.dto))
            DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, CAST(d.dfrom AS DATE))
            FROM master..spt_values t1
            CROSS JOIN master..spt_values t2
        ) ca (base_date)
    ) q
    GROUP BY base_date;
    

    This query takes 64 ms to execute on my machine:

    SELECT COUNT_BIG(*)
    FROM #d_expanded d
    INNER JOIN dbo.Posts AS p ON p.CreationDate >= d.base_date AND p.CreationDate < DATEADD(DAY, 1, d.base_date)
    WHERE (d.IsFullDate = 1 OR p.CreationDate BETWEEN d.dfrom AND d.dto);
    

    That is a relative speedup of about 300x. Here is the query plan:

    [Image: query plan]

    • 2
  6. Stephen Morris - Mo64
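    2023-07-11T16:00:57+08:00

    There is an interesting concept called the Static Relational Interval Tree. Itzik wrote some material about it, but I think there were some problems with the sample code in his blog posts, so I could not get it to work. Anyway, I found a sample link, and there are more if you do a web search using those terms: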

    https://lucient.com/blog/a-static-relational-interval-tree/#

    • 0
  7. Grimaldi
    2023-07-12T00:10:01+08:00

    Seems your objective is to find the number of matching posts, that are in the given date ranges. Given you don't have more information on data distribution and quantity structure, it is hard to give proper recommendations.

    How about the obvious, assuming that there's a primary key "id" in Posts:

    SELECT count(distinct p.id) 
      FROM dbo.Posts AS p, 
           #d AS d
     WHERE p.CreationDate BETWEEN d.dfrom
                              AND d.dto
    

    and creating an index on Posts(CreationDate, id).

    • -2
