Index scan on a table with one record and 2.2 billion executions

Question

AcePL

Asked: 2017-04-20 06:50:06 +0800 CST

There is something in my query that I'm not sure how to resolve.

First, the definitions:

The courier services table. It has one record.

CREATE TABLE [dbo].[CS](
    [ServiceID] [int] IDENTITY(1,1) NOT NULL,
    [CSID] [nvarchar](6) NULL,
    [CSDescription] [varchar](50) NULL,
    [OperatingDays] [int] NULL,
    [DefaultService] [bit] NULL,
 CONSTRAINT [CourierServices_PK] PRIMARY KEY CLUSTERED 
(
    [ServiceID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
       ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90
) ON [PRIMARY]
) ON [PRIMARY]

GO
SET IDENTITY_INSERT [dbo].[CS] ON 

INSERT [dbo].[CS] ([ServiceID], [CSID], [OperatingDays], [DefaultService])
           VALUES (1, N'RM48', 2, 1)
SET IDENTITY_INSERT [dbo].[CS] OFF
SET ANSI_PADDING ON

GO
/****** Object:  Index [ix_CourierServices]    Script Date: 19/04/2017 14:27:03 ******/
CREATE NONCLUSTERED INDEX [ix_CourierServices] ON [dbo].[CS]
(
    [CSID] ASC,
    [DefaultService] ASC,
    [OperatingDays] ASC
)
INCLUDE (   [CSDescription]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

The calendar database and table, with code written by the genius Jim Horn:

CREATE TABLE [dbo].[days](
    [PKDate] [date] NOT NULL,
    [calendar_year] [smallint] NULL,
    [calendar_quarter] [tinyint] NULL,
    [calendar_quarter_desc] [varchar](10) NULL,
    [calendar_month] [tinyint] NULL,
    [calendar_month_name_long] [varchar](30) NULL,
    [calendar_month_name_short] [varchar](10) NULL,
    [calendar_week_in_year] [tinyint] NULL,
    [calendar_week_in_month] [tinyint] NULL,
    [calendar_day_in_year] [smallint] NULL,
    [calendar_day_in_week] [tinyint] NULL,
    [calendar_day_in_month] [tinyint] NULL,
    [dmy_name_long] [varchar](30) NULL,
    [dmy_name_long_with_suffix] [varchar](30) NULL,
    [day_name_long] [varchar](10) NULL,
    [day_name_short] [varchar](10) NULL,
    [continuous_year] [tinyint] NULL,
    [continuous_quarter] [smallint] NULL,
    [continuous_month] [smallint] NULL,
    [continuous_week] [smallint] NULL,
    [continuous_day] [int] NULL,
    [description] [varchar](100) NULL,
    [is_weekend] [tinyint] NULL,
    [is_holiday] [tinyint] NULL,
    [is_workday] [tinyint] NULL,
    [is_event] [tinyint] NULL,
PRIMARY KEY CLUSTERED 
(
    [PKDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
 ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

/****** Object:  Index [ix_days]    Script Date: 19/04/2017 14:38:47 ******/
CREATE NONCLUSTERED INDEX [ix_days] ON [dbo].[days]
(
    [PKDate] ASC
)
INCLUDE (   [is_weekend],
    [is_holiday],
    [is_workday],
    [is_event]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
 SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
 ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Now I'm running a query that references both tables, along the lines of this bit of code:

Select
    OID
   ,case when
     Cast(o.[CreationDate] as time) > '16:00:00' 
        then (select top 1 [PKDate] from [calendar].[dbo].days
              where is_weekend <> 1 and is_holiday <>1 and 
              PKDate > cast(o.[CreationDate] as date)
              order by PKDate asc)
        else (select top 1 [PKDate] from [calendar].[dbo].days
              where is_weekend <> 1 and is_holiday <>1 and 
              PKDate >= Cast(o.[CreationDate] as date) 
              order by PKDate asc)
        end  OperatingDate
   ,case when
     Cast(o.[CreationDate] as time) > '16:00:00' 
        then (select top 1 [PKDate] from [calendar].[dbo].days
              where is_weekend <> 1 and is_holiday <>1 and 
              PKDate > dateadd(day,isnull(
                  (select top 1 [operatingdays]
                  from [dbo].[CS]
                  where DefaultService = 1)
                 ,2)+1,Cast(o.[CreationDate] as date))
                 order by PKDate asc)
            else (select top 1 [PKDate] from [calendar].[dbo].days
                  where is_weekend <> 1 and is_holiday <>1 and
                  PKDate > dateadd(day,isnull(
                      (select top 1 [operatingdays]
                       from [dbo].[CS]
                       where DefaultService = 1)
                      ,2), Cast(o.[CreationDate] as date))
                      order by PKDate asc)
            end EstimatedDeliveryDate
  ,(select dateadd(day,3,o.[CreationDate])) DeliveryDate
From o

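Now, the question concerns the index scans and the number of executions: why 2 billion? Or 6 billion? Admittedly, the whole query outputs 1.7 million rows, but that does not explain the insane numbers shown in the query plan: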

https://www.brentozar.com/pastetheplan/?id=H1iahxHAe

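If I can flatten all of those scans I can reduce the query time significantly, but first: how do I interpret these numbers to find a solution?

The days table contains 7.6k rows (covering 2000-2020).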

3 Answers

Joe Obbish · Answer 1 · 2017-04-20T08:11:50+08:00

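Let's start by looking at the upper right of the plan. That part calculates the OperatingDate column:

Since we return 1.72 M rows for the outer rowset, we can expect about 1.72 M index seeks against ix_days. That is indeed what happens. There are 478k rows for which Cast(o.[CreationDate] as time) > '16:00:00', so the CASE statement sends 478k seeks to one branch and the rest to the other.

Note that the index you have is not the most efficient one for this query. We can only do a seek predicate against PKDate. The remaining filters are applied as a residual predicate. This means the seek may traverse many rows before finding a match. I assume that most days in your calendar table are not weekends or holidays, so it may not make a practical difference for this query. However, you could define an index on is_weekend, is_holiday, PKDate. That should let you seek immediately to the first row that you want.

To make this clearer, let's walk through a simple example: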

-- does a scan
SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend <> 1 AND is_holiday <> 1
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

-- does a seek, reads 3 rows to return 1
SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend = 0 AND is_holiday = 0
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

-- create new index
CREATE NONCLUSTERED INDEX [ix_days_2] ON [dbo].[days]
(
    [is_weekend],
    [is_holiday],
    PkDate
)

-- does a seek, reads 1 row to return 1
SELECT TOP 1 PkDate
FROM [Days]
WHERE is_weekend = 0 AND is_holiday = 0
AND PkDate >= '2000-04-01'
ORDER BY PkDate;

DROP INDEX [days].[ix_days_2];

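Let's move on to the more interesting part, the branch that calculates the EstimatedDeliveryDate column. I'll only include half of it:

I suspect that what you wanted the optimizer to do was to compute this as a scalar: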

dateadd(day,isnull(
                  (select top 1 [operatingdays]
                  from [dbo].[CS]
                  where DefaultService = 1)
                 ,2)+1,Cast(o.[CreationDate] as date))

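and to use that value to do an index seek against ix_days. Unfortunately, the optimizer does not do that. Instead it applies a row goal against the index and does a scan. For each row returned from the scan, it checks whether the value matches the filter built from [dbo].[CS]. The scan stops as soon as it finds a matching row. SQL Server estimated that it would pull back only 3.33 rows on average from the scan before finding a match. If that were true, you would see about 1.5 million executions against [dbo].[CS]. Instead, the query did 2 billion executions against that table, so the estimate was off by more than 1000x.

As a general rule, you should carefully examine any scans on the inner side of a nested loop. Of course, there are some queries for which that is exactly what you want. And just because you have a seek doesn't mean that the query will be efficient. For example, if a seek returns many rows, there may not be much difference from doing a scan. You didn't post the full query here, but I'll go over a few ideas that might help.

This query is a bit odd: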

select top 1 [operatingdays]
from [dbo].[CS]
where DefaultService = 1

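It is nondeterministic because you have TOP without ORDER BY. However, the table itself has 1 row, and you always pull the same value from it for each row from o. If possible, I would try saving the value of this query into a local variable and using that in the query. That should save you a total of 8 billion scans against [dbo].[CS], and I would expect to see index seeks instead of scans against ix_days. I was able to mock up some data on my machine. Here is part of the query plan:

Now we have all seeks, and those seeks shouldn't process too many extra rows. However, the real query may be more complicated than this, so you may not be able to use a variable.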
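The local-variable idea described above could look roughly like the sketch below. It is not the exact query behind the plan. It covers only the EstimatedDeliveryDate column, the variable name @OperatingDays is made up for illustration, and the ORDER BY ServiceID is an added assumption that makes the TOP deterministic.

```sql
-- Sketch only: assumes [dbo].[CS] holds a single default service row,
-- as stated in the question. @OperatingDays is a hypothetical name, and
-- ORDER BY ServiceID is an added assumption to make TOP (1) deterministic.
DECLARE @OperatingDays int;

SELECT TOP (1) @OperatingDays = [OperatingDays]
FROM [dbo].[CS]
WHERE DefaultService = 1
ORDER BY ServiceID;

-- Same fallback as the isnull(..., 2) in the original subquery.
SET @OperatingDays = ISNULL(@OperatingDays, 2);

SELECT
    OID
   ,CASE WHEN CAST(o.[CreationDate] AS time) > '16:00:00'
        THEN (SELECT TOP (1) [PKDate] FROM [calendar].[dbo].days
              WHERE is_weekend <> 1 AND is_holiday <> 1 AND
              PKDate > DATEADD(day, @OperatingDays + 1, CAST(o.[CreationDate] AS date))
              ORDER BY PKDate ASC)
        ELSE (SELECT TOP (1) [PKDate] FROM [calendar].[dbo].days
              WHERE is_weekend <> 1 AND is_holiday <> 1 AND
              PKDate > DATEADD(day, @OperatingDays, CAST(o.[CreationDate] AS date))
              ORDER BY PKDate ASC)
    END AS EstimatedDeliveryDate
FROM o;
```

Because the lookup against [dbo].[CS] now runs once, before the main statement, the comparison value for PKDate depends only on the current row of o and a constant, which is the shape that should allow a seek on PKDate.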

Let's say I write a different filter condition that doesn't use TOP. Instead I'll use MIN. SQL Server is able to process that subquery in a more efficient way. TOP can prevent certain query transformations. Here is my subquery:

WHERE PKDate > dateadd(day,isnull(
                      (select MIN([operatingdays])
                       from [dbo].[CS]
                       where DefaultService = 1)
                      ,2), Cast(o.[CreationDate] as date))

Here is what the plan might look like:

Now we'll only do around 1.5 million scans against the CS table. We also get a much more efficient index seek against the ix_days index, which is able to use the results of the subquery:

Of course, I'm not saying that you should rewrite your code to use that. It'll probably return incorrect results. The important point is that you can get the index seeks that you want with a subquery. You just need to write your subquery in the right way.

For one more example, let's assume that you absolutely need to keep the TOP operator in the subquery. It might be possible to add a redundant filter against PkDate to get better performance. I'm going to assume that the results of the subquery are non-negative and small. That means that this query will be equivalent:

  PKDate > Cast(o.[CreationDate] as date) AND 
  PKDate > dateadd(day,isnull(
      (select top 1 [operatingdays]
      from [dbo].[CS]
      where DefaultService = 1)
     ,2)+1,Cast(o.[CreationDate] as date))

This changes the plan to use seeks:

It's important to realize that the seeks may return more than just one row. The important point is that SQL Server can start seeking at o.[CreationDate]. If there's a large gap in the dates then the index seek will process many extra rows and the query will not be as efficient.

SqlWorldWide · Answer 2 · 2017-04-20T07:57:54+08:00

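The question now concerns the index scans and the number of executions: why 2 billion? Or 6 billion?

You are getting those numbers from the nested loops joins.

In its simplest form, a nested loops join compares each row from one table (known as the outer table) to each row from the other table (known as the inner table), looking for rows that satisfy the join predicate. (Note that the terms "inner" and "outer" are overloaded; their meaning must be inferred from context. "Inner table" and "outer table" refer to the inputs to the join. "Inner join" and "outer join" refer to the logical operations.)

We can express the algorithm in pseudo-code as:

for each row R1 in the outer table
    for each row R2 in the inner table
        if R1 joins with R2
            return (R1, R2)

It's the nesting of the for loops in this algorithm that gives the nested loops join its name.

The total number of rows compared, and thus the cost of this algorithm, is proportional to the size of the outer table multiplied by the size of the inner table. Since this cost grows quickly as the size of the input tables grows, in practice we try to minimize the cost by reducing the number of inner rows that we must consider for each outer row.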
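As a rough worked example of that multiplication, the sketch below uses the rounded figures from the question and the first answer (about 1.72 million outer rows from o, about 7.6k rows in days). It gives only an upper bound for a single subquery, assuming every outer row scans the whole days table; a row goal normally stops the scan earlier, which is why the observed counts are lower.

```sql
-- Upper bound on inner rows touched if every outer row scanned all of days.
-- 1720000 and 7600 are rounded figures taken from the thread, not measured here.
-- CAST to bigint avoids int overflow in the multiplication.
SELECT CAST(1720000 AS bigint) * 7600 AS worst_case_inner_rows;
-- 13072000000
```

Execution counts of 2 billion or 6 billion are therefore well within what a scan on the inner side of a nested loop can produce at this data size.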

In your example, here is one instance of how you get 2B records.

And another of how you get 5B+.

A few links on how to avoid large nested loops joins:

AcePL · Answer 3 · 2017-04-21T01:54:31+08:00

The information that both other answers try to convey but fail (only partly due to assumptions that I understand exactly what they say) is this:

With the query written the way it was in the question the observed performance was inevitable.

While it was fancy and mostly easy to see the purpose it was simply too heavy for the optimizer to work magic on it. It wasn't quite the nested loop problem SqlWorldWide indicated, but the subqueries simply had to be executed for each row and since they were index seeks and scans they multiplied, and multiplied... and multiplied.

What I ended up having was this:

Select
    OID
   ,case when
     Cast(o.[CreationDate] as time) > '16:00:00' 
        then (select top 1 [PKDate] from [calendar].[dbo].days
              where is_workday = 1 and continuous_day > da.continuous_day
              and continuous_day < da.continuous_day+7 order by PKDate asc)
        else (select top 1 [PKDate] from [calendar].[dbo].days
              where is_workday = 1 and continuous_day >= da.continuous_day
              and continuous_day < da.continuous_day+7 order by PKDate asc)
        end  OperatingDate
   ,case when
     Cast(o.[CreationDate] as time) > '16:00:00' 
        then (select top 1 d.[PKDate] from [calendar].[dbo].days d
              where is_workday = 1 and
              continuous_day > (da.continuous_day+isnull(dt.DeliveryDays,2)) and 
              d.continuous_day < da.continuous_day+7 order by PKDate asc)
            else (select top 1 d.[PKDate] from [calendar].[dbo].days d
              where is_workday = 1 and
              continuous_day >= (da.continuous_day+isnull(dt.DeliveryDays,2)) and 
              d.continuous_day < da.continuous_day+7 order by PKDate asc)
            end EstimatedDeliveryDate
  ,(select dateadd(day,3,o.[CreationDate])) DeliveryDate
From o
left join deliverytype dt on o.deliverytypeid = dt.deliverytypeid
join calendar.dbo.days da on cast(o.creationdate as date) = da.pkdate

In addition to streamlining the query - which still is not optimal - I've also reworked the calendar.dbo.days table's indexes. Dropped the constraint (which I really didn't have to, but what the hell, it might cause more problems further down the line) and added this:

/****** Object:  Index [ixc_days]    Script Date: 20/04/2017 10:40:58 ******/
CREATE UNIQUE CLUSTERED INDEX [ixc_days] ON [dbo].[days]
(
    [PKDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

SET ANSI_PADDING ON

GO

/****** Object:  Index [ix_days]    Script Date: 20/04/2017 10:40:58 ******/
CREATE NONCLUSTERED INDEX [ix_days] ON [dbo].[days]
(
    [is_workday] ASC,
    [PKDate] ASC,
    [continuous_day] ASC
)
INCLUDE (   [calendar_year],
    [calendar_month],
    [calendar_week_in_year],
    [calendar_week_in_month],
    [calendar_day_in_year],
    [calendar_day_in_week],
    [calendar_day_in_month],
    [dmy_name_long],
    [description]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

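I admit this is mostly so that I can make fuller use of the calendar table (did I mention that Jim Horn is a genius?), but once people see my accounts, they want more and more things stored in... everywhere.

So, in the end, while it is important to look at all aspects of a query (logic, indexes, predicates and so on), sometimes the only sensible way to improve it is to change the code. In my case, the full query (several inserts, updates and CTEs) now finishes in about 2 minutes, versus 15 minutes before.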
