What this is not about:
This is not about comprehensive queries that accept user input or use variables.
This is strictly about queries that use ISNULL() in the WHERE clause to replace NULL with a canary value for comparison against a predicate, and the different ways those queries can be rewritten to be SARGable in SQL Server.
Why don't you have a seat over there?
Our example query targets a local copy of the Stack Overflow database on SQL Server 2016, and looks for users with a NULL Age, or an Age < 18.
SELECT COUNT(*)
FROM dbo.Users AS u
WHERE ISNULL(u.Age, 17) < 18;
The query plan shows a scan of a thoughtfully designed nonclustered index. The scan operator shows (thanks to additions to the actual execution plan XML in newer versions of SQL Server) that we read every last row. In total, we did 9157 reads and used about half a second of CPU time:
Table 'Users'. Scan count 1, logical reads 9157, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 485 ms, elapsed time = 483 ms.
Question: Is there any way to rewrite this query to make it more efficient, maybe even SARGable?
Feel free to offer other suggestions as well. I don't think my answer is necessarily the answer, and there are enough smart people out there to come up with alternatives that may well be better.
If you want to play along on your own computer, head here to download the SO database.
Thanks!
Answers
There are a number of ways to rewrite this using different T-SQL constructs. We'll look at the pros and cons of each and do an overall comparison below.
First up: using OR
Using OR gives us a more efficient Seek plan that reads exactly the rows we need, but it adds what the technical world calls a whole mess of malarkey to the query plan. Also note that the Seek here is executed twice, which should be more obvious from the graphical operator:
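The rewritten query itself isn't reproduced above; based on the predicates described, a minimal sketch of the OR form looks like this:

```sql
SELECT COUNT(*)
FROM dbo.Users AS u
WHERE u.Age < 18
   OR u.Age IS NULL;  -- both predicates are SARGable, but each drives its own Seek
```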
Second: using a derived table and UNION ALL
Our query can also be rewritten like this. It produces the same type of plan, with less malarkey and a more obvious degree of honesty about how many times the index was seeked (sought?).
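A sketch of the derived table / UNION ALL form (the alias names are my own, not necessarily the answer's originals):

```sql
SELECT SUM(agg.Records) AS Records
FROM
(
    SELECT COUNT(*) AS Records    -- one branch per predicate
    FROM dbo.Users AS u
    WHERE u.Age < 18

    UNION ALL

    SELECT COUNT(*)
    FROM dbo.Users AS u
    WHERE u.Age IS NULL
) AS agg;
```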
It does the same number of reads (8233) as the OR query, but shaves off about 100 ms of CPU time. However, you have to be careful with it: if this plan attempts to go parallel, the two separate COUNT operations will be serialized, because each one is considered a global scalar aggregate. If we force a parallel plan with trace flag 8649, the problem becomes obvious. It can be avoided by changing our query slightly.
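The exact rewrite isn't shown above; one way to make the change described is to return rows from each branch and aggregate only once at the end, so neither branch contains a global scalar aggregate (a sketch, assuming Id is the table's key):

```sql
SELECT COUNT(*) AS Records   -- single aggregate over the concatenated rows
FROM
(
    SELECT u.Id
    FROM dbo.Users AS u
    WHERE u.Age < 18

    UNION ALL

    SELECT u.Id
    FROM dbo.Users AS u
    WHERE u.Age IS NULL
) AS agg;
```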
Now both nodes performing a Seek are fully parallel until we hit the concatenation operator.
For what it's worth, the fully parallel version has some benefit. At the cost of about 100 extra reads and about 90 ms of extra CPU time, the elapsed time shrinks to 93 ms.
What about CROSS APPLY? No answer is complete without the magic of CROSS APPLY!
Unfortunately, we run into more problems with COUNT. This plan is horrible. It's the kind of plan you end up with when you show up last to St. Patrick's Day. And although it's nicely parallel, for some reason it scans the PK/CX. Hmm. The plan has a cost of 2198.
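The original query isn't reproduced here; one plausible shape for a CROSS APPLY with a COUNT inside (a sketch only, not necessarily the answer's exact query) is:

```sql
SELECT SUM(ca.Records) AS Records
FROM dbo.Users AS u
CROSS APPLY
(
    -- correlated, FROM-less SELECT: returns 1 when the row qualifies, 0 otherwise
    SELECT COUNT(*) AS Records
    WHERE u.Age < 18
       OR u.Age IS NULL
) AS ca;
```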
That's an odd choice, because if we force it to use the nonclustered index, the cost drops significantly, to 1798.
Hey, Seeks! Check you out over there. Also note that, with the magic of CROSS APPLY, we don't need to do anything silly to get a mostly fully parallel plan.
CROSS APPLY ends up doing better without the COUNT stuff. The plan looks fine, but the reads and CPU aren't an improvement.
Rewriting the cross apply as a derived join results in exactly the same everything. I won't re-post the query plans and statistics — they really didn't change.
Relational algebra: to be thorough, and to keep Joe Celko from haunting my dreams, we at least need to try some weird relational algebra. No luck here!
An attempt with INTERSECT:
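One way INTERSECT can express the NULL test, exploiting the fact that INTERSECT treats NULLs as equal (an illustration, not necessarily the answer's original query):

```sql
SELECT COUNT(*)
FROM dbo.Users AS u
WHERE u.Age < 18
   -- true only when u.Age IS NULL: INTERSECT matches NULL with NULL
   OR EXISTS ( SELECT u.Age INTERSECT SELECT CONVERT(int, NULL) );
```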
Here's an attempt with EXCEPT:
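A sketch of an EXCEPT attempt: removing the users whose Age is at least 18 leaves exactly the under-18 and NULL-age rows, since `Age >= 18` can never match a NULL (again an illustration, assuming Id is the key):

```sql
SELECT COUNT(*)
FROM
(
    SELECT u.Id
    FROM dbo.Users AS u

    EXCEPT

    SELECT u.Id
    FROM dbo.Users AS u
    WHERE u.Age >= 18
) AS agg;
```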
There are probably other ways to write these, but I'll leave that to people who use EXCEPT and INTERSECT more often than I do.
If you really just need a count
I used COUNT in my queries as shorthand (read: sometimes I'm too lazy to come up with more involved scenarios). If you just need a count, you can use a CASE expression to do much the same thing. Both get the same plan and have the same CPU and read characteristics.
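A sketch of the CASE-expression form (COUNT and SUM over a CASE behave equivalently here):

```sql
SELECT SUM(CASE WHEN u.Age < 18 OR u.Age IS NULL THEN 1 ELSE 0 END) AS Records
FROM dbo.Users AS u;
```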
And the winner is? In my testing, the forced parallel plan using SUM over a derived table performed best. And yes, many of these queries could be helped by adding a couple of filtered indexes to cover both predicates, but I wanted to leave some experimentation to others.
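The filtered indexes hinted at above would look something like this (index names are hypothetical; one filtered index per predicate):

```sql
-- covers the Age < 18 predicate
CREATE NONCLUSTERED INDEX ix_Users_Age_Under18
    ON dbo.Users (Age)
    WHERE Age < 18;

-- covers the Age IS NULL predicate
CREATE NONCLUSTERED INDEX ix_Users_Age_Null
    ON dbo.Users (Age)
    WHERE Age IS NULL;
```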
Thanks!
I wasn't game to restore a 110 GB database for just one table so I created my own data. The age distributions should match what's on Stack Overflow but obviously the table itself won't match. I don't think that it's too much of an issue because the queries are going to hit indexes anyway. I'm testing on a 4 CPU computer with SQL Server 2016 SP1. One thing to note is that for queries that finish this quickly it's important not to include the actual execution plan. That can slow things down quite a bit.
I started by going through some of the solutions in Erik's excellent answer. For this one:
I got the following results from sys.dm_exec_sessions over 10 trials (the query naturally went parallel for me):
The query that worked better for Erik actually performed worse on my machine:
Results from 10 trials:
I'm not immediately able to explain why it's that bad, but it's also not clear why we'd want to force nearly every operator in the query plan to go parallel. In the original plan we have a serial zone that finds all rows with AGE < 18. There are only a few thousand such rows. On my machine I get 9 logical reads for that part of the query and 9 ms of reported CPU time and elapsed time. There's also a serial zone for the global aggregate over the rows with AGE IS NULL, but that only processes one row per DOP. On my machine this is just four rows.
My takeaway is that it's most important to optimize the part of the query that finds the rows with a NULL for Age, because there are millions of those rows. I wasn't able to create an index with fewer pages covering the data than a simple page-compressed one on the column. I assume that there's a minimum index size per row, or that a lot of the index space can't be avoided with the tricks that I tried. So if we're stuck with about the same number of logical reads to get the data, then the only way to make the query faster is to make it more parallel, but this needs to be done in a different way than Erik's query that used TF 8649. In the query above we have a ratio of 3.62 for CPU time to elapsed time, which is pretty good. The ideal on my machine would be a ratio of 4.0.
One possible area of improvement is to divide the work more evenly among threads. In the screenshot below we can see that one of my CPUs decided to take a little break:
Index scan is one of the few operators that can be implemented in parallel and we can't do anything about how the rows are distributed to threads. There's an element of chance to it as well but pretty consistently I saw one underworked thread. One way to work around this is to do parallelism the hard way: on the inner part of a nested loop join. Anything on the inner part of a nested loop will be implemented in a serial way but many serial threads can run concurrently. As long as we get a favorable parallel distribution method (such as round robin), we can control exactly how many rows are sent to each thread.
I'm running queries at DOP 4, so I need to evenly divide the NULL rows in the table into four buckets. One way to do this is to create a bunch of indexes on computed columns:
I'm not quite sure why four separate indexes are a little faster than one index, but that's what I found in my testing.
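The DDL isn't reproduced above; one way to build such buckets (column and index names are hypothetical, and this is a reconstruction of the idea rather than the answer's original code) is to flag each Id-modulo-4 bucket of the NULL-age rows in its own computed column and index each one:

```sql
-- each column is non-NULL only for NULL-age rows in its bucket
ALTER TABLE dbo.Users ADD Null_Bucket_0 AS CASE WHEN Age IS NULL AND Id % 4 = 0 THEN 1 END;
ALTER TABLE dbo.Users ADD Null_Bucket_1 AS CASE WHEN Age IS NULL AND Id % 4 = 1 THEN 1 END;
ALTER TABLE dbo.Users ADD Null_Bucket_2 AS CASE WHEN Age IS NULL AND Id % 4 = 2 THEN 1 END;
ALTER TABLE dbo.Users ADD Null_Bucket_3 AS CASE WHEN Age IS NULL AND Id % 4 = 3 THEN 1 END;

CREATE INDEX IX_Null_Bucket_0 ON dbo.Users (Null_Bucket_0);
CREATE INDEX IX_Null_Bucket_1 ON dbo.Users (Null_Bucket_1);
CREATE INDEX IX_Null_Bucket_2 ON dbo.Users (Null_Bucket_2);
CREATE INDEX IX_Null_Bucket_3 ON dbo.Users (Null_Bucket_3);
```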
To get a parallel nested loop plan I'm going to use the undocumented trace flag 8649. I'm also going to write the code a little strangely to encourage the optimizer not to process more rows than necessary. Below is one implementation which appears to work well:
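The implementation isn't shown above; a sketch of the shape described, assuming four hypothetical computed columns Null_Bucket_0 through Null_Bucket_3 that each flag one Id-modulo-4 bucket of the NULL-age rows, drives the inner side of a nested loop with one row per bucket:

```sql
-- TF 8649 is undocumented; it forces a parallel plan regardless of cost
SELECT SUM(ca.NullCount) AS NullCount
FROM (VALUES (1), (2), (3), (4)) AS driver (n)   -- one outer row per bucket/thread
CROSS APPLY
(
    SELECT COUNT(*) AS NullCount
    FROM dbo.Users AS u
    WHERE (driver.n = 1 AND u.Null_Bucket_0 = 1)
       OR (driver.n = 2 AND u.Null_Bucket_1 = 1)
       OR (driver.n = 3 AND u.Null_Bucket_2 = 1)
       OR (driver.n = 4 AND u.Null_Bucket_3 = 1)
) AS ca
OPTION (QUERYTRACEON 8649);
```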
The results from ten trials:
With that query, our CPU-to-elapsed-time ratio is 3.85! We shaved 17 ms off the runtime, and it only took 4 computed columns and indexes to do it! Each thread processes nearly the same total number of rows, because each index holds nearly the same number of rows and each thread scans exactly one index:
As a final note, we can also hit the easy button and add a nonclustered CCI on the Age column. The following query finishes in 3 ms on my machine:
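A sketch of the easy button (the index name is hypothetical):

```sql
-- nonclustered columnstore index on the single column we aggregate over
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Users_Age ON dbo.Users (Age);

SELECT COUNT(*)
FROM dbo.Users AS u
WHERE u.Age < 18
   OR u.Age IS NULL;
```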
That's going to be hard to beat.
While I don't have a local copy of the Stack Overflow database, I can try a few queries. My idea is to get the number of users from the system catalog views (rather than counting rows directly from the underlying table), then count the rows that match (or maybe the rows that don't match) Erik's criteria and do some simple math.
I used the Stack Exchange Data Explorer (along with SET STATISTICS TIME ON; and SET STATISTICS IO ON;) to test the queries. For a point of reference, here are some queries and their CPU/IO statistics:
QUERY 1
QUERY 2
QUERY 3
1st Attempt
This was slower than all of Erik's queries I listed here...at least in terms of elapsed time.
2nd Attempt
Here I opted for a variable to store the total number of users (instead of a sub-query). The scan count increased from 1 to 17 compared to the 1st attempt. Logical reads stayed the same. However, elapsed time dropped considerably.
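The query isn't reproduced above; a sketch of the variable approach, pulling the total row count from catalog metadata and subtracting the SARGable complement (`Age >= 18` never matches NULL, so everything else is under 18 or NULL):

```sql
-- total rows from sys.partitions instead of scanning the table
DECLARE @TotalUsers bigint =
(
    SELECT SUM(p.rows)
    FROM sys.partitions AS p
    WHERE p.[object_id] = OBJECT_ID(N'dbo.Users')
      AND p.index_id IN (0, 1)   -- heap or clustered index only
);

SELECT @TotalUsers - COUNT(*) AS Records
FROM dbo.Users AS u
WHERE u.Age >= 18;   -- SARGable complement of the original predicate
```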
Other Notes: DBCC TRACEON is not permitted on Stack Exchange Data Explorer, as noted below:
Using a variable?
Per the comments, the variable can be skipped.
A simple solution is to compute count(*) - count(age >= 18):
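A sketch of that form, relying on COUNT(expr) ignoring NULLs so the CASE counts only the adults:

```sql
SELECT COUNT(*) - COUNT(CASE WHEN u.Age >= 18 THEN 1 END) AS Records
FROM dbo.Users AS u;
```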
Or:
The results are here.
Putting SET ANSI_NULLS OFF; to good use
Here's something that just popped into my head. I ran it on https://data.stackexchange.com:
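The query isn't reproduced above; with ANSI_NULLS OFF, the comparison `Age = NULL` evaluates to true for NULL values, so the trick presumably looked something like this (a sketch; note that SET ANSI_NULLS OFF is deprecated):

```sql
SET ANSI_NULLS OFF;

SELECT COUNT(*)
FROM dbo.Users AS u
WHERE u.Age = NULL      -- matches NULLs only while ANSI_NULLS is OFF
   OR u.Age < 18;

SET ANSI_NULLS ON;
```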
But it's not as efficient as @blitz_erik's.