AskOverflow.Dev

AcePL
Asked: 2017-04-20 06:50:06 +0800 CST

Index scan on a one-record table executed 2.2 billion times

There are a few things in my query that I'm not sure how to resolve.

First, the definitions:

Courier services table. It has one record.

CREATE TABLE [dbo].[CS](
    [ServiceID] [int] IDENTITY(1,1) NOT NULL,
    [CSID] [nvarchar](6) NULL,
    [CSDescription] [varchar](50) NULL,
    [OperatingDays] [int] NULL,
    [DefaultService] [bit] NULL,
 CONSTRAINT [CourierServices_PK] PRIMARY KEY CLUSTERED 
(
    [ServiceID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
       ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90
) ON [PRIMARY]
) ON [PRIMARY]

GO
SET IDENTITY_INSERT [dbo].[CS] ON 

INSERT [dbo].[CS] ([ServiceID], [CSID], [OperatingDays], [DefaultService])
           VALUES (1, N'RM48', 2, 1)
SET IDENTITY_INSERT [dbo].[CS] OFF
SET ANSI_PADDING ON

GO
/****** Object:  Index [ix_CourierServices]    Script Date: 19/04/2017 14:27:03 ******/
CREATE NONCLUSTERED INDEX [ix_CourierServices] ON [dbo].[CS]
(
    [CSID] ASC,
    [DefaultService] ASC,
    [OperatingDays] ASC
)
INCLUDE (   [CSDescription]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Calendar database and table, with code written by the genius Jim Horn:

CREATE TABLE [dbo].[days](
    [PKDate] [date] NOT NULL,
    [calendar_year] [smallint] NULL,
    [calendar_quarter] [tinyint] NULL,
    [calendar_quarter_desc] [varchar](10) NULL,
    [calendar_month] [tinyint] NULL,
    [calendar_month_name_long] [varchar](30) NULL,
    [calendar_month_name_short] [varchar](10) NULL,
    [calendar_week_in_year] [tinyint] NULL,
    [calendar_week_in_month] [tinyint] NULL,
    [calendar_day_in_year] [smallint] NULL,
    [calendar_day_in_week] [tinyint] NULL,
    [calendar_day_in_month] [tinyint] NULL,
    [dmy_name_long] [varchar](30) NULL,
    [dmy_name_long_with_suffix] [varchar](30) NULL,
    [day_name_long] [varchar](10) NULL,
    [day_name_short] [varchar](10) NULL,
    [continuous_year] [tinyint] NULL,
    [continuous_quarter] [smallint] NULL,
    [continuous_month] [smallint] NULL,
    [continuous_week] [smallint] NULL,
    [continuous_day] [int] NULL,
    [description] [varchar](100) NULL,
    [is_weekend] [tinyint] NULL,
    [is_holiday] [tinyint] NULL,
    [is_workday] [tinyint] NULL,
    [is_event] [tinyint] NULL,
PRIMARY KEY CLUSTERED 
(
    [PKDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
 ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

/****** Object:  Index [ix_days]    Script Date: 19/04/2017 14:38:47 ******/
CREATE NONCLUSTERED INDEX [ix_days] ON [dbo].[days]
(
    [PKDate] ASC
)
INCLUDE (   [is_weekend],
    [is_holiday],
    [is_workday],
    [is_event]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
 SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
 ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Now I'm running a query that references both tables, as in this bit of code:

Select
    OID
   ,case when
     Cast(o.[CreationDate] as time) > '16:00:00' 
        then (select top 1 [PKDate] from [calendar].[dbo].days
              where is_weekend <> 1 and is_holiday <>1 and 
              PKDate > cast(o.[CreationDate] as date)
              order by PKDate asc)
        else (select top 1 [PKDate] from [calendar].[dbo].days
              where is_weekend <> 1 and is_holiday <>1 and 
              PKDate >= Cast(o.[CreationDate] as date) 
              order by PKDate asc)
        end  OperatingDate
   ,case when
     Cast(o.[CreationDate] as time) > '16:00:00' 
        then (select top 1 [PKDate] from [calendar].[dbo].days
              where is_weekend <> 1 and is_holiday <>1 and 
              PKDate > dateadd(day,isnull(
                  (select top 1 [operatingdays]
                  from [dbo].[CS]
                  where DefaultService = 1)
                 ,2)+1,Cast(o.[CreationDate] as date))
                 order by PKDate asc)
            else (select top 1 [PKDate] from [calendar].[dbo].days
                  where is_weekend <> 1 and is_holiday <>1 and
                  PKDate > dateadd(day,isnull(
                      (select top 1 [operatingdays]
                       from [dbo].[CS]
                       where DefaultService = 1)
                      ,2), Cast(o.[CreationDate] as date))
                      order by PKDate asc)
            end EstimatedDeliveryDate
  ,(select dateadd(day,3,o.[CreationDate])) DeliveryDate
From o

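Now the problem, which concerns the index scans and the number of executions: why is it 2 billion? Or 6 billion? Admittedly, the output of the whole query is 1.7 million rows, but that doesn't explain the crazy numbers shown in the query plan: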

https://www.brentozar.com/pastetheplan/?id=H1iahxHAe

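If I can knock out all of those scans, I can cut the query time significantly, but first: how do I interpret these numbers to find a solution?

The days table contains 7.6k rows (covering 2000-2020).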

sql-server sql-server-2012

3 Answers

  1. Best Answer
    Joe Obbish
    2017-04-20T08:11:50+08:00

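    Let's start by looking at the top right of the plan. That part calculates the OperatingDate column:

    [image: OperatingDate]

    Since we get 1.72 M rows back for the outer row set, we can expect around 1.72 M index seeks against ix_days. That is indeed what happens. There are 478k rows for which Cast(o.[CreationDate] as time) > '16:00:00', so the CASE statement sends 478k seeks to one branch and the rest to the other.

    Note that the index you have isn't the most efficient one possible for this query. We can only do a seek predicate against PKDate. The rest of the filters are applied as a residual predicate. That means the seek might traverse many rows before it finds a match. I assume that most days in your calendar table aren't weekends or holidays, so it might not make a practical difference for this query. However, you could define an index on is_weekend, is_holiday, PKDate. That should let you seek immediately to the first row that you want.

    [image: seek vs. predicate]

    To make the point clearer, let's walk through a simple example: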

    -- does a scan
    SELECT TOP 1 PkDate
    FROM [Days]
    WHERE is_weekend <> 1 AND is_holiday <> 1
    AND PkDate >= '2000-04-01'
    ORDER BY PkDate;
    
    -- does a seek, reads 3 rows to return 1
    SELECT TOP 1 PkDate
    FROM [Days]
    WHERE is_weekend = 0 AND is_holiday = 0
    AND PkDate >= '2000-04-01'
    ORDER BY PkDate;
    
    -- create new index
    CREATE NONCLUSTERED INDEX [ix_days_2] ON [dbo].[days]
    (
        [is_weekend],
        [is_holiday],
        PkDate
    )
    
    -- does a seek, reads 1 row to return 1
    SELECT TOP 1 PkDate
    FROM [Days]
    WHERE is_weekend = 0 AND is_holiday = 0
    AND PkDate >= '2000-04-01'
    ORDER BY PkDate;
    
    DROP INDEX [days].[ix_days_2];
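
    Applied to the question's query, the equality form could look like the sketch below. It covers only the branch for orders created at or before 16:00, and it assumes that is_weekend and is_holiday only ever hold 0 or 1; if other values exist, = 0 and <> 1 are not interchangeable. The alias d is introduced here for readability.

    ```sql
    -- Sketch only: assumes is_weekend / is_holiday contain just 0 or 1,
    -- so "= 0" returns the same rows as the original "<> 1".
    SELECT
        o.OID,
        (SELECT TOP (1) d.PKDate
         FROM [calendar].[dbo].days AS d
         WHERE d.is_weekend = 0
           AND d.is_holiday = 0
           AND d.PKDate >= CAST(o.[CreationDate] AS date)
         ORDER BY d.PKDate ASC) AS OperatingDate
    FROM o;
    ```

    With an index keyed on (is_weekend, is_holiday, PKDate), all three conditions can act as seek predicates rather than residual filters.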
    

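    Let's move on to the more interesting part, the branch that calculates the EstimatedDeliveryDate column. I'll only include half of it:

    [image: delivery date branch]

    I suspect that what you wanted the optimizer to do was to compute this as a scalar: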

    dateadd(day,isnull(
                      (select top 1 [operatingdays]
                      from [dbo].[CS]
                      where DefaultService = 1)
                     ,2)+1,Cast(o.[CreationDate] as date))
    

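    and to use that value to do an index seek against ix_days. Unfortunately, the optimizer doesn't do that. Instead it applies a row goal against the index and does a scan. For each row returned from the scan, it checks whether the value matches the filter against [dbo].[CS]. The scan stops as soon as a matching row is found. SQL Server estimated that it would pull back only 3.33 rows on average from the scan before finding a match. If that were true, you would see around 1.5 million executions against [dbo].[CS]. Instead, the plan did 2 billion executions against that table, so the estimate is off by more than a factor of 1000.

    As a general rule, you should carefully examine any scan on the inner side of a nested loop. Of course, for some queries that is exactly what you want. And just because you get a seek doesn't mean that the query will be efficient. For example, if a seek returns many rows, there may not be much difference from doing a scan. You didn't post the full query here, but I'll go over some ideas that might help.

    This query is a bit strange: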

    select top 1 [operatingdays]
    from [dbo].[CS]
    where DefaultService = 1
    

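    It's non-deterministic because you have TOP without ORDER BY. However, the table itself has 1 row, and you always get the same value for every row from o. If possible, I would try saving the value of this query into a local variable and then using the variable in the query. That should save you 8 billion scans against [dbo].[CS], and I'd expect to see an index seek instead of a scan against ix_days. I was able to mock up some data on my machine. Here's part of the query plan:

    [image: good query plan 1]

    Now we have all seeks, and those seeks shouldn't process too many extra rows. However, the real query may be more complicated than this, so you may not be able to use a variable.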
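
    A minimal sketch of the local-variable idea, under a few assumptions: the variable name @OperatingDays is made up for illustration, MIN stands in for TOP 1 so that the assignment is deterministic (with a single row in [dbo].[CS] the result is the same), and only the branch for orders created at or before 16:00 is shown.

    ```sql
    -- Look up the default service's operating days once, outside the main query.
    -- ISNULL(..., 2) keeps the question's fallback of 2 days when no row matches.
    DECLARE @OperatingDays int;

    SELECT @OperatingDays = ISNULL(MIN([OperatingDays]), 2)
    FROM [dbo].[CS]
    WHERE DefaultService = 1;

    -- The correlated subquery now depends only on o.[CreationDate] and a constant,
    -- so [dbo].[CS] is no longer touched once per candidate calendar row.
    SELECT
        o.OID,
        (SELECT TOP (1) d.PKDate
         FROM [calendar].[dbo].days AS d
         WHERE d.is_weekend <> 1
           AND d.is_holiday <> 1
           AND d.PKDate > DATEADD(DAY, @OperatingDays, CAST(o.[CreationDate] AS date))
         ORDER BY d.PKDate ASC) AS EstimatedDeliveryDate
    FROM o;
    ```

    The after-16:00 branch would use @OperatingDays + 1 in the same position.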

    Let's say I write a different filter condition that doesn't use TOP. Instead I'll use MIN. SQL Server is able to process that subquery in a more efficient way. TOP can prevent certain query transformations. Here is my subquery:

    WHERE PKDate > dateadd(day,isnull(
                          (select MIN([operatingdays])
                           from [dbo].[CS]
                           where DefaultService = 1)
                          ,2), Cast(o.[CreationDate] as date))
    

    Here is what the plan might look like:

    [image: good plan 2]

    Now we'll only do around 1.5 million scans against the CS table. We also get a much more efficient index seek against the ix_days index, which is able to use the results of the subquery:

    [image: nice seek]

    Of course, I'm not saying that you should rewrite your code to use that. It'll probably return incorrect results. The important point is that you can get the index seeks that you want with a subquery. You just need to write your subquery in the right way.

    For one more example, let's assume that you absolutely need to keep the TOP operator in the subquery. It might be possible to add a redundant filter against PkDate to get better performance. I'm going to assume that the results of the subquery are non-negative and small. That means that this query will be equivalent:

      PKDate > Cast(o.[CreationDate] as date) AND 
      PKDate > dateadd(day,isnull(
          (select top 1 [operatingdays]
          from [dbo].[CS]
          where DefaultService = 1)
         ,2)+1,Cast(o.[CreationDate] as date))
    

    This changes the plan to use seeks:

    [image: seeks again]

    It's important to realize that the seeks may return more than just one row. The important point is that SQL Server can start seeking at o.[CreationDate]. If there's a large gap in the dates, the index seek will process many extra rows and the query will not be as efficient.

    • 5
  2. SqlWorldWide
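    2017-04-20T07:57:54+08:00

    Now the problem, which concerns the index scans and the number of executions: why is it 2 billion? Or 6 billion?

    You are getting those numbers from the nested loops joins.

    In its simplest form, a nested loops join compares each row from one table (known as the outer table) to each row from the other table (known as the inner table), looking for rows that satisfy the join predicate. (Note that the terms "inner" and "outer" are overloaded; their meaning must be inferred from context. "Inner table" and "outer table" refer to the inputs to the join. "Inner join" and "outer join" refer to the logical operations.)

    We can express the algorithm in pseudo-code as:

    for each row R1 in the outer table
        for each row R2 in the inner table
            if R1 joins with R2
                return (R1, R2)

    It's the nesting of the for loops in this algorithm that gives the nested loops join its name.

    The total number of rows compared, and thus the cost of this algorithm, is proportional to the size of the outer table multiplied by the size of the inner table. Since this cost grows quickly as the input tables grow, in practice we try to minimize the cost by reducing the number of inner rows that we must consider for each outer row.

    In your example, here is one instance of how you get 2B records. [image]

    Another one showing how you get 5B+. [image]

    A few links on how to avoid large nested loops joins:

    1. How to optimize a query that is running slowly on Nested Loops (Inner Join)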
    2. https://stackoverflow.com/questions/28441468/why-are-nested-loops-chosen-causing-long-execution-time-for-self-join
    3. https://www.littlekendra.com/2016/09/06/estimated-vs-actual-number-of-rows-in-nested-loop-operators/
    • 3
  3. AcePL
    2017-04-21T01:54:31+08:00

    The information that both other answers try to convey but fail (only partly due to assumptions that I understand exactly what they say) is this:

    With the query written the way it was in the question the observed performance was inevitable.

    While it was fancy and mostly easy to see the purpose it was simply too heavy for the optimizer to work magic on it. It wasn't quite the nested loop problem SqlWorldWide indicated, but the subqueries simply had to be executed for each row and since they were index seeks and scans they multiplied, and multiplied... and multiplied.

    What I ended up having was this:

    Select
        OID
       ,case when
         Cast(o.[CreationDate] as time) > '16:00:00' 
            then (select top 1 [PKDate] from [calendar].[dbo].days
                  where is_workday = 1 and continuous_day > da.continuous_day
                  and continuous_day < da.continuous_day+7 order by PKDate asc)
            else (select top 1 [PKDate] from [calendar].[dbo].days
                  where is_workday = 1 and continuous_day >= da.continuous_day
                  and continuous_day < da.continuous_day+7 order by PKDate asc)
            end  OperatingDate
       ,case when
         Cast(o.[CreationDate] as time) > '16:00:00' 
            then (select top 1 d.[PKDate] from [calendar].[dbo].days d
                  where is_workday = 1 and
                  continuous_day > (da.continuous_day+isnull(dt.DeliveryDays,2)) and 
                  d.continuous_day < da.continuous_day+7 order by PKDate asc)
                else (select top 1 d.[PKDate] from [calendar].[dbo].days d
                  where is_workday = 1 and
                  continuous_day >= (da.continuous_day+isnull(dt.DeliveryDays,2)) and 
                  d.continuous_day < da.continuous_day+7 order by PKDate asc)
                end EstimatedDeliveryDate
      ,(select dateadd(day,3,o.[CreationDate])) DeliveryDate
    From o
    left join deliverytype dt on o.deliverytypeid = dt.deliverytypeid
    join calendar.dbo.days da on cast(o.creationdate as date) = da.pkdate
    

    In addition to streamlining the query - which still is not optimal - I've also reworked the calendar.dbo.days table's indexes. Dropped the constraint (which I really didn't have to, but what the hell, it might cause more problems further down the line) and added this:

    /****** Object:  Index [ixc_days]    Script Date: 20/04/2017 10:40:58 ******/
    CREATE UNIQUE CLUSTERED INDEX [ixc_days] ON [dbo].[days]
    (
        [PKDate] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
    IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
    ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO
    
    SET ANSI_PADDING ON
    
    GO
    
    /****** Object:  Index [ix_days]    Script Date: 20/04/2017 10:40:58 ******/
    CREATE NONCLUSTERED INDEX [ix_days] ON [dbo].[days]
    (
        [is_workday] ASC,
        [PKDate] ASC,
        [continuous_day] ASC
    )
    INCLUDE (   [calendar_year],
        [calendar_month],
        [calendar_week_in_year],
        [calendar_week_in_month],
        [calendar_day_in_year],
        [calendar_day_in_week],
        [calendar_day_in_month],
        [dmy_name_long],
        [description]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
    SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
    ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO
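
    The step of dropping the original constraint is not shown above. Because the days table was created with an unnamed PRIMARY KEY, the constraint name is system generated, so it has to be looked up first. The following is one possible sketch, assuming the table lives in the calendar database as in the query; the variable names are illustrative. It would run before the CREATE UNIQUE CLUSTERED INDEX statement.

    ```sql
    USE calendar;

    -- Find the system-generated name of the primary key on dbo.days.
    DECLARE @pk sysname, @sql nvarchar(400);

    SELECT @pk = kc.name
    FROM sys.key_constraints AS kc
    WHERE kc.parent_object_id = OBJECT_ID(N'dbo.days')
      AND kc.type = 'PK';

    -- Drop it only if one was found; QUOTENAME guards against odd characters.
    IF @pk IS NOT NULL
    BEGIN
        SET @sql = N'ALTER TABLE dbo.days DROP CONSTRAINT ' + QUOTENAME(@pk) + N';';
        EXEC sys.sp_executesql @sql;
    END;
    ```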
    

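    I admit this was done mostly so that I can make fuller use of the calendar table (did I mention that Jim Horn is a genius?), but when people see my accounts, they want more and more stuff stored... everywhere.

    So, at the end of the day, while it's important to look at all aspects of a query (logic, indexes, predicates and so on), sometimes the only sensible way to improve it is to change the code. In my case, the full query (several inserts, updates and CTEs) now completes in about 2 minutes, where previously it took 15.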

    • 0
