Question

Erik Darling

Asked: 2023-07-11 06:20:49 +0800 CST

Improving the performance of multiple date range predicates

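Let's say

You have a stored procedure that accepts arrays of datetimes, which are loaded into a temp table and used to filter a datetime column in a table.

Any number of values can be inserted as start and end dates.
Date ranges may sometimes overlap, but that is not a condition I would regularly rely on.
Dates with times may also be supplied.

What is the most efficient way to write the query to perform the filtering?

Setup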

USE StackOverflow2013;

CREATE TABLE
    #d
(
    dfrom datetime,
    dto datetime,
    PRIMARY KEY (dfrom, dto)
);
INSERT
    #d
(
    dfrom,
    dto
)
SELECT
    dfrom = '2013-11-20',
    dto =   '2013-12-05'
UNION ALL
SELECT
    dfrom = '2013-11-27',
    dto =   '2013-12-12'; 

CREATE INDEX
    p
ON dbo.Posts
    (CreationDate)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Query

The best I could come up with is to use EXISTS like so:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE EXISTS
(
    SELECT
        1/0
    FROM #d AS d
    WHERE p.CreationDate BETWEEN d.dfrom
                             AND d.dto
);

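This results in a rather sad-looking execution plan:

Nested Loops is the only join operator available, because we don't have an equality predicate.

What I'm looking for is an alternative syntax that produces a different type of join.

Thanks!

7 Answers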

Paul White · Answer 1 · 2023-07-11T17:50:34+08:00

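Join

You might find that a join provides adequate performance, despite still using nested loops. This is a variation on David Browne's updated answer: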

SELECT 
    NumRows = COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
JOIN dbo.Posts AS P
    ON P.CreationDate BETWEEN D.dfrom AND D.dto;

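For me, that runs in around 150 ms.

Remove overlaps and join

If you have a large number of overlapping rows, it may be worth reducing them to distinct, non-overlapping ranges, as described in Martin Smith's answer. My implementation of a similar idea is: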

WITH 
    Intervals AS
    (
    SELECT 
        IntervalStart =
            ISNULL
            (
                LAG(Q1.NextStart) OVER (ORDER BY Q1.ThisFrom),
                Q1.FirstFrom
            ),
        IntervalEnd = 
            IIF 
            (
                Q1.NextStart IS NOT NULL, 
                Q1.ThisEnd,
                Q1.LastEnd
            )
    FROM 
    (
        SELECT
            ThisFrom = D.dfrom, 
            ThisEnd = D.dto,
            -- Remember the start of the next row because that row
            -- may get filtered out by the outer WHERE clause if it
            -- is not also the end of an interval.
            NextStart = LEAD(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto),
            -- Start point of the first interval
            FirstFrom = MIN(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
            -- End point of the last interval
            LastEnd = MAX(D.dto) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        FROM #d AS D 
        WHERE 
            -- Valid intervals only
            D.dto >= D.dfrom
    ) AS Q1
    WHERE
        -- Interval ends only
        Q1.NextStart > Q1.ThisEnd
        OR Q1.NextStart IS NULL
)
SELECT COUNT_BIG(*) 
FROM dbo.Posts AS P
JOIN Intervals AS N
    ON P.CreationDate BETWEEN N.IntervalStart AND N.IntervalEnd;

That takes around 40 ms for me.

Dynamic seek without a join

Another approach is to generate literal ranges dynamically:

DECLARE @SQL nvarchar(max) =
N'
SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
';

SELECT @SQL +=
    STRING_AGG
    (
        CONCAT
        (
            CONVERT(nvarchar(max), SPACE(0)),
            N'OR P.CreationDate BETWEEN ',
            N'CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dfrom, 121),
            NCHAR(39),
            N', 121)',
            N' AND CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dto, 121),
            NCHAR(39),
            N', 121)',
            NCHAR(13), NCHAR(10)
        ),
        N''
    )
FROM #d AS D;

SET @SQL += N'OPTION (RECOMPILE);' -- Plan reuse is unlikely

EXECUTE (@SQL);

With the sample data, that produces:

SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-20 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-05 00:00:00.000', 121)
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-27 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-12 00:00:00.000', 121)
OPTION (RECOMPILE);

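Which SQL Server simplifies to a single range seek:

That takes around 35 ms for me.

In general, the optimizer will simplify the ranges as far as it can (much as a merge interval does). There doesn't seem to be a limit on the number of separate range seeks in a single index seek operator. I got bored after 1,440 seeks.

Indexed view

Rather than counting rows over and over, it might help to store and maintain bucketed counts in an indexed view. The following implementation uses hour granularity: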

CREATE OR ALTER VIEW dbo.PostsTimeBucket
WITH SCHEMABINDING
AS
SELECT
    HourBucket =
        DATEADD(HOUR, 
            DATEDIFF(HOUR, 
                CONVERT(datetime, '20000101', 112), 
                P.CreationDate),
            CONVERT(datetime, '20000101', 112)),
    NumRows = COUNT_BIG(*)
FROM dbo.Posts AS P
GROUP BY
    DATEADD(HOUR, 
        DATEDIFF(HOUR, 
            CONVERT(datetime, '20000101', 112), 
            P.CreationDate),
        CONVERT(datetime, '20000101', 112));
GO
CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.PostsTimeBucket HourBucket] 
ON dbo.PostsTimeBucket (HourBucket);

The view takes around 2s to create on my laptop. At hour granularity, the indexed view holds 47,469 rows for my copy of the StackOverflow2013 database's Posts table with 17,142,169 rows.
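The bucketing expression in the view can be tried on its own. The following is a minimal sketch, not part of the original answer: the three sample values are made up for illustration, and the @base variable is only a convenience for an ad hoc query. The view itself repeats CONVERT(datetime, '20000101', 112) inline, since an indexed view needs a deterministic expression and a string-to-datetime conversion is only deterministic with an explicit style.

```sql
-- Fixed reference point, the same one used in the view definition.
-- (A variable is fine here, but could not be used inside the view.)
DECLARE @base datetime = CONVERT(datetime, '20000101', 112);

SELECT
    V.d,
    -- Count whole hour boundaries crossed since the reference point,
    -- then add that many hours back to get the start of the hour.
    HourBucket = DATEADD(HOUR, DATEDIFF(HOUR, @base, V.d), @base)
FROM
(
    -- Hypothetical sample values, not taken from dbo.Posts
    VALUES
        (CONVERT(datetime, '2013-11-20T00:00:00', 126)),
        (CONVERT(datetime, '2013-11-20T14:37:12', 126)),
        (CONVERT(datetime, '2013-11-20T14:59:59.997', 126))
) AS V (d);

-- HourBucket is 2013-11-20 00:00:00.000 for the first value
-- and 2013-11-20 14:00:00.000 for the other two.
```

Every CreationDate within the same clock hour maps to the same HourBucket, which is what lets the view keep one counted row per hour.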

The majority of the work can be done from the view. Only part-hour periods at the start and end of a range, where they are not covered by any other range, need to be processed separately. For example:

-- Helper inline function
CREATE OR ALTER FUNCTION dbo.RoundToHour (@d datetime)
RETURNS table
AS
RETURN
    SELECT
        HourBucket =
            DATEADD(HOUR, 
                DATEDIFF(HOUR, 
                    CONVERT(datetime, '20000101', 112), @d),
                CONVERT(datetime, '20000101', 112));
GO

-- Whole hours covered by any range in the table
SELECT 
    SUM(PTB.NumRows) 
FROM dbo.PostsTimeBucket AS PTB 
    WITH (NOEXPAND)
WHERE 
    PTB.HourBucket >= (SELECT MIN(D.dfrom) FROM #d AS D)
    AND PTB.HourBucket <= (SELECT MAX(D.dto) FROM #d AS D)
    AND EXISTS
    (
        SELECT * 
        FROM #d AS D
        WHERE
            D.dfrom <= PTB.HourBucket
            AND D.dto >= DATEADD(HOUR, 1, PTB.HourBucket)
    )

-- Extra rows at start before hour boundary
SELECT 
    COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
CROSS APPLY dbo.RoundToHour(D.dfrom) AS HF
JOIN dbo.Posts AS P
    ON P.CreationDate >= D.dfrom
    AND P.CreationDate < DATEADD(HOUR, 1, HF.HourBucket)
WHERE
    D.dfrom <> HF.HourBucket
    AND NOT EXISTS
    (
        -- Not covered by any hour-long period
        SELECT * 
        FROM #d AS D2
        WHERE
            D2.dfrom <= HF.HourBucket
            AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
    );

-- Extra rows at end
SELECT 
    COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
CROSS APPLY dbo.RoundToHour(D.dto) AS HF
JOIN dbo.Posts AS P
    ON P.CreationDate >= HF.HourBucket
    AND P.CreationDate < D.dto
WHERE
    D.dto <> HF.HourBucket
    AND NOT EXISTS
    (
        -- Not covered by any hour-long period
        SELECT * 
        FROM #d AS D2
        WHERE
            D2.dfrom <= HF.HourBucket
            AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
    );

The three queries above complete in less than 1 ms overall. There are no part-hour periods in the sample data. Performance will decrease somewhat with more and larger part periods, or if a less precise indexed view granularity is used.

David Browne - Microsoft · Answer 2 · 2023-07-11T08:34:11+08:00

If the number of dates is small enough, you could materialize a sorted list of all the valid dates, e.g.

CREATE TABLE #d
(
    d datetime primary key
)

and then do a merge join. Or go in the other direction, e.g.

SELECT COUNT_BIG(DISTINCT p.Id)
FROM #d AS d
CROSS APPLY (SELECT p.Id
       FROM dbo.Posts AS p
       WHERE p.CreationDate >= d.dfrom
         AND p.CreationDate < d.dto) AS p;

If you preprocess #d to merge overlapping ranges, then a plain COUNT_BIG(*) is enough.

J.D. · Answer 3 · 2023-07-11T11:27:56+08:00

Using dynamic SQL to generate a series of UNION statements seems to eliminate the Nested Loops and results in a Hash Match instead:

DECLARE @DynamicSQL NVARCHAR(MAX) = N'';

SELECT @DynamicSQL = 
    CONCAT
    (
        '
        SELECT
            c = COUNT_BIG(*)
        FROM (
        ',
            STRING_AGG
            (
                CONCAT
                (
                    '
                    SELECT p.Id
                    FROM dbo.Posts AS p
                    WHERE p.CreationDate BETWEEN ''', CONVERT(varchar(23), dfrom, 126), ''' AND ''', CONVERT(varchar(23), dto, 126), ''''
                ),
                CONCAT(CHAR(13), CHAR(10), CHAR(13), CHAR(10), 'UNION', CHAR(13), CHAR(10))
            ),
        '
        ) AS PostIds
        '
    )
FROM #d

--PRINT @DynamicSQL;
EXEC sp_executesql @DynamicSQL;

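Actual execution plan

It seems to run pretty quickly on my end. Of course, this could result in a long list of UNION clauses, depending on how many rows are in the temp table #d. YMMV.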

Martin Smith · Answer 4 · 2023-07-11T18:12:18+08:00

What I would really like is to be able to get a plan where SQL Server uses its built-in merge interval operator to combine overlapping ranges.

I don't think it is currently possible to get this driven by a table, though.

My attempt to manually simulate this is below

WITH T1 AS
(
SELECT *, 
        IsIntervalStart = CASE WHEN dfrom <= MAX(dto) OVER (ORDER BY dfrom, dto ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) THEN 0 ELSE 1 END, 
        IsLastRow = CASE WHEN LEAD(dfrom) OVER (ORDER BY dfrom, dto) IS NULL THEN 1 ELSE 0 END,
        MaxDtoSoFar = MAX(dto) OVER (ORDER BY dfrom, dto ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
FROM #d
)
, T2
     AS (SELECT dfrom,
                dto = CASE
                        WHEN IsLastRow = 1 THEN IIF(dto > MaxDtoSoFar, dto, ISNULL(MaxDtoSoFar, dto))
                        ELSE LEAD(CASE WHEN IsLastRow = 1 AND IsIntervalStart = 0 THEN IIF(dto > MaxDtoSoFar, dto, ISNULL(MaxDtoSoFar, dto)) ELSE MaxDtoSoFar END, 1) OVER (ORDER BY dfrom, dto) END,
                IsIntervalStart
         FROM   T1
         WHERE  1 IN ( IsIntervalStart, IsLastRow ))
,Mergedintervals AS
(
SELECT *
FROM T2
WHERE IsIntervalStart = 1
)
SELECT Rows = COUNT_BIG(*)
FROM Mergedintervals
JOIN dbo.Posts AS p ON p.CreationDate  BETWEEN dfrom AND dto

If I got the overlapping range logic correct (it has taken a few attempts so far, but hopefully it is now right), then this should collapse the input down to distinct ranges efficiently and then seek exactly the needed rows for those ranges from Posts.

Follow the rows in order of dfrom, dto (which conveniently is the order of the PK) and keep track of largest dto value seen in preceding rows (MaxDtoSoFar).
If the dfrom in a row is <= MaxDtoSoFar then we are continuing an existing interval and set IsIntervalStart to 0 otherwise we are starting a new interval and set that to 1.
Finally select all rows where IsIntervalStart = 1 and fill in the relevant MaxDtoSoFar value.
For this purpose we need to be careful not to remove the final row too early as we will need to get the values from it for the end date of the final interval.
And we need to also be careful about the case that the last row itself is an interval start - in which case the preceding interval should only be looking at MaxDtoSoFar and not GREATEST(dto, MaxDtoSoFar).

NB: The above method does do the interval merging with a single ordered clustered index scan and no sort operations but does have a wide execution plan with multiple of each of Segment/Sequence Project/Window Spool/ Stream Aggregate operators.

Paul White provided an alternative method in the comments that can use batch mode Window Aggregates exclusively and has a much more streamlined plan (as well as likely being simpler SQL)
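The bookkeeping described in this answer (MaxDtoSoFar and IsIntervalStart) can be looked at in isolation. The sketch below is a simplified gaps-and-islands formulation, not the query from this answer or the one from the comments: it marks interval starts the same way, but then numbers the intervals with a running total and collapses them with GROUP BY. The table variable @d and its third row are made up for illustration, and no claim is made about how its plan compares with the queries above.

```sql
-- Hypothetical stand-in for #d, with one extra non-overlapping row
DECLARE @d table
(
    dfrom datetime NOT NULL,
    dto datetime NOT NULL,
    PRIMARY KEY (dfrom, dto)
);

INSERT @d (dfrom, dto)
VALUES
    ('20131120', '20131205'),
    ('20131127', '20131212'),  -- overlaps the first row
    ('20140101', '20140105');  -- starts after everything before it has ended

WITH Marked AS
(
    SELECT
        D.dfrom,
        D.dto,
        -- 1 when the row starts after the largest dto seen in all
        -- preceding rows (or there are no preceding rows), else 0
        IsIntervalStart =
            CASE
                WHEN D.dfrom <= MAX(D.dto) OVER (
                    ORDER BY D.dfrom, D.dto
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 0
                ELSE 1
            END
    FROM @d AS D
),
Numbered AS
(
    SELECT
        M.dfrom,
        M.dto,
        -- Running total of interval starts numbers the merged intervals
        IntervalId = SUM(M.IsIntervalStart) OVER (
            ORDER BY M.dfrom, M.dto
            ROWS UNBOUNDED PRECEDING)
    FROM Marked AS M
)
SELECT
    IntervalStart = MIN(N.dfrom),
    IntervalEnd = MAX(N.dto)
FROM Numbered AS N
GROUP BY N.IntervalId
ORDER BY IntervalStart;

-- Returns two merged intervals:
--   2013-11-20 to 2013-12-12
--   2014-01-01 to 2014-01-05
```

The first two rows share IntervalId 1 because the second row starts before the first one ends; the third row starts a new interval.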

Joe Obbish · Answer 5 · 2023-07-12T08:04:48+08:00

First I'd like to apologize for the odd query plan in my answer. My computer was recently hacked and I've been unable to remove the SSMS plugin.

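Performance can be improved greatly by splitting every valid date into its own row. The trick is to carry the time information along as well, so that data at the endpoints can be excluded when necessary. For example, consider the date range '20230709 18:00:00' to '20230711 04:00:00'. For that range, you would include rows with a date of 20230709 and a time >= 18:00:00, all rows for 20230710, and rows with a date of 20230711 and a time <= 04:00:00.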
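As a small illustration of that example range, the sketch below expands it into one row per calendar day and flags the days that are covered in full. It is a simplified, single-range version of the idea, not the code from this answer: the variables and the inline list of offsets (0 to 6, enough for this range only) are assumptions made for the demonstration.

```sql
DECLARE @dfrom datetime = '20230709 18:00:00';
DECLARE @dto   datetime = '20230711 04:00:00';

SELECT
    base_date = DATEADD(DAY, N.n, CONVERT(date, @dfrom)),
    -- A day is fully covered when the range starts at or before its
    -- midnight and ends at or after the following midnight
    IsFullDate =
        CASE
            WHEN @dfrom <= DATEADD(DAY, N.n, CONVERT(date, @dfrom))
                AND @dto >= DATEADD(DAY, N.n + 1, CONVERT(date, @dfrom))
            THEN 1
            ELSE 0
        END
FROM (VALUES (0), (1), (2), (3), (4), (5), (6)) AS N (n)  -- hypothetical small number list
WHERE N.n <= DATEDIFF(DAY, @dfrom, @dto)
ORDER BY base_date;

-- Returns:
--   2023-07-09, IsFullDate = 0  (only times >= 18:00:00 qualify)
--   2023-07-10, IsFullDate = 1  (whole day qualifies)
--   2023-07-11, IsFullDate = 0  (only times <= 04:00:00 qualify)
```

Only the rows with IsFullDate = 0 need the extra check against the original start and end times; full days can be matched on the date alone.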

Below is the same sample data, along with an additional temp table that summarizes the date ranges in the way I described above:

DROP TABLE IF EXISTS #d;
CREATE TABLE #d
(
    dfrom datetime,
    dto datetime,
    PRIMARY KEY (dfrom, dto)
);

INSERT #d (dfrom, dto)
SELECT
    dfrom = '2013-11-20',
    dto =   '2013-12-05'
UNION ALL
SELECT
    dfrom = '2013-11-27',
    dto =   '2013-12-12'; 

DROP TABLE IF EXISTS #d_expanded;
CREATE TABLE
    #d_expanded
(
    base_date DATE,
    IsFullDate BIT,
    dfrom datetime,
    dto datetime,
    PRIMARY KEY (base_date)
);

INSERT INTO #d_expanded (base_date, IsFullDate, dfrom, dto)
SELECT base_date, CASE WHEN MIN(dfrom) <= base_date AND MAX(dto) >= DATEADD(DAY, 1, base_date) THEN 1 ELSE 0 END , MIN(dfrom), MAX(dto)
FROM (
    SELECT ca.base_date, dfrom, dto
    FROM #d d
    CROSS APPLY (
        SELECT TOP (1 + DATEDIFF_BIG(DAY, d.dfrom, d.dto))
        DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, CAST(d.dfrom AS DATE))
        FROM master..spt_values t1
        CROSS JOIN master..spt_values t2
    ) ca (base_date)
) q
GROUP BY base_date;

This query takes 64 ms to execute on my machine:

SELECT COUNT_BIG(*)
FROM #d_expanded d
INNER JOIN dbo.Posts AS p ON p.CreationDate >= d.base_date AND p.CreationDate < DATEADD(DAY, 1, d.base_date)
WHERE (d.IsFullDate = 1 OR p.CreationDate BETWEEN d.dfrom AND d.dto);

That is a relative speedup of about 300x. Here is the query plan:

Stephen Morris - Mo64 · Answer 6 · 2023-07-11T16:00:57+08:00

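There is an interesting concept called a static relational interval tree. Itzik wrote something about it, but I think there were some problems with the sample code in his blog post, so I couldn't get it to work. Anyway, I found an example link, and there are more if you do a web search using those terms.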

https://lucient.com/blog/a-static-relational-interval-tree/#

Grimaldi · Answer 7 · 2023-07-12T00:10:01+08:00

Seems your objective is to find the number of matching posts, that are in the given date ranges. Given you don't have more information on data distribution and quantity structure, it is hard to give proper recommendations.

How about the obvious, assuming that there's primary key "id" in Posts:

SELECT count(distinct p.id) 
  FROM dbo.Posts AS p, 
       #d AS d
 WHERE p.CreationDate BETWEEN d.dfrom
                          AND d.dto

and creating an index on Posts(CreationDate, id).
