在下面的查询中,两个执行计划估计在唯一索引上执行 1,000 次查找。
查找是由同一源表上的有序扫描驱动的,因此看起来应该最终以相同的顺序查找相同的值。
两个嵌套循环都有<NestedLoops Optimized="false" WithOrderedPrefetch="true">
有谁知道为什么这个任务在第一个计划中的成本是 0.172434 而在第二个计划中是 3.01702?
(这个问题的原因是第一个查询被建议给我作为优化,因为计划成本明显低得多。实际上在我看来它好像做了更多的工作但我只是试图解释差异.. .)
设置
CREATE TABLE dbo.Target(KeyCol int PRIMARY KEY, OtherCol char(32) NOT NULL);
CREATE TABLE dbo.Staging(KeyCol int PRIMARY KEY, OtherCol char(32) NOT NULL);
INSERT INTO dbo.Target
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY @@SPID), LEFT(NEWID(),32)
FROM master..spt_values v1,
master..spt_values v2;
INSERT INTO dbo.Staging
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY @@SPID), LEFT(NEWID(),32)
FROM master..spt_values v1;
查询 1 “粘贴计划”链接
WITH T
AS (SELECT *
FROM Target AS T
WHERE T.KeyCol IN (SELECT S.KeyCol
FROM Staging AS S))
MERGE T
USING Staging S
ON ( T.KeyCol = S.KeyCol )
WHEN NOT MATCHED THEN
INSERT ( KeyCol, OtherCol )
VALUES(S.KeyCol, S.OtherCol )
WHEN MATCHED AND T.OtherCol > S.OtherCol THEN
UPDATE SET T.OtherCol = S.OtherCol;
查询 2 “粘贴计划”链接
MERGE Target T
USING Staging S
ON ( T.KeyCol = S.KeyCol )
WHEN NOT MATCHED THEN
INSERT ( KeyCol, OtherCol )
VALUES( S.KeyCol, S.OtherCol )
WHEN MATCHED AND T.OtherCol > S.OtherCol THEN
UPDATE SET T.OtherCol = S.OtherCol;
查询 1
查询 2
以上是在 SQL Server 2014 (SP2) (KB3171021) - 12.0.5000.0 (X64) 上测试的
@Joe Obbish在评论中指出,更简单的复制是
SELECT *
FROM staging AS S
LEFT OUTER JOIN Target AS T
ON T.KeyCol = S.KeyCol;
对比
SELECT *
FROM staging AS S
LEFT OUTER JOIN (SELECT * FROM Target) AS T
ON T.KeyCol = S.KeyCol;
对于 1,000 行的临时表,以上两个仍然具有相同的计划形状,带有嵌套循环,并且没有派生表的计划看起来更便宜,但是对于 10,000 行的临时表和与上面相同的目标表,成本差异确实改变了计划形状(完整扫描和合并连接看起来比昂贵的搜索更具吸引力)显示这种成本差异可能会产生影响,而不仅仅是使比较计划变得更加困难。
一般而言,嵌套循环连接下的内部搜索在假设随机 I/O 模式的情况下进行计算。后续访问有一个简单的基于替换的减少,考虑到所需页面已经被前一次迭代带入内存的可能性。此基本评估产生标准(较高)成本。
还有另一个成本计算输入,Smart Seek Costing,关于它的细节知之甚少。我的猜测(现阶段仅此而已)是 SSC 试图更详细地评估内部寻道 I/O 成本,可能通过考虑本地排序和/或要获取的值的范围。谁知道。
For example, the first seeking operation brings in not just the requested row, but all rows on that page (in index order). Given the overall access pattern, fetching the 1000 rows in 1000 seeks requires only 2 physical reads, even with read-ahead and prefetching disabled. From that perspective, the default I/O costing represents a significant overestimate, and the SSC-adjusted cost is closer to reality.
It seems reasonable to expect that SSC would be most effective where the loop drives an index seek more or less directly, and the join outer reference is the basis of the seeking operation. From what I can tell, SSC is always attempted for suitable physical operations, but most often produces no downward adjustment when the seek is separated from the join by other operations. Simple filters are one exception to this, perhaps because SQL Server can often push these into the data access operator. In any case, the optimizer has pretty deep support for selections.
It is unfortunate that the Compute Scalar for the subquery outer projections seems to interfere with SSC here. Compute Scalars are usually relocated above the join, but these ones have to stay where they are. Even so, most normal Compute Scalars are pretty transparent to optimization, so this is a bit surprising.
Regardless, when the physical operation
PhyOp_Range
is produced from a simple selection on an indexSelIdxToRng
, SSC is effective. When the more complexSelToIdxStrategy
(selection on a table to an index strategy) is employed, the resultingPhyOp_Range
runs SSC but results in no reduction. Again, it seems that simpler, more direct operations work best with SSC.I wish I could tell you exactly what SSC does, and show the exact calculations, but I don't know those details. If you want to explore the limited trace output available for yourself, you can employ undocumented trace flag 2398. An example output is:
That example relates to memo group 7, alternative 1, showing a cost upper bound, and a factor of 0.001. To see cleaner factors, be sure to rebuild the tables without parallelism so the pages are as dense as possible. Without doing that, the factor is more like 0.000821 for your example Target table. There are some fairly obvious relationships there, of course.
SSC can also be disabled with undocumented trace flag 2399. With that flag active, both costs are the higher value.
不确定这是一个答案,但评论有点长。造成差异的原因纯粹是我的猜测,也许可以引起其他人的思考。
使用执行计划简化查询。
这些实际上可能导致相同执行计划的等效查询之间的主要区别在于计算标量运算符。我不知道为什么它必须在那里,但我想这是优化器可以优化派生表的范围。
我的猜测是,计算标量的存在是造成第二个查询的 IO 成本的原因。
从优化器内部:计划成本核算
在我的例子中,该表需要 5618 页,要从 1000000 行中获取 1000 行,估计所需的页数为 5.618,IO 成本为 0.015625。
两个查询接缝的 CPU 成本相同,
0.0001581 * 1000 executions = 0.1581
.因此,根据上面链接的文章,我们可以计算出第一次查询的成本为 0.173725。
假设我关于计算标量如何使 IO 成本变得一团糟的说法是正确的,它可以计算为 3.2831。
不完全是计划中显示的内容,但它就在附近。
(This would be better as a comment to Paul's answer, but I don't have enough rep yet.)
I wanted to provide the list of trace flags (and a couple
DBCC
statements) I used to come to a near-conclusion, in case it will be helpful to investigate similar discrepancies in the future. All of these should not be used on production.First, I had a look at the Final Memo to see what physical operators were being used. They certainly look the same according to the graphical execution plans. So, I used trace flags
3604
and8615
, the first directs output to the client and the second reveals the Final Memo:Tracing back from the
Root Group
, I found these nearly identicalPhyOp_Range
operators:PhyOp_Range 1 ASC 2.0 Cost(RowGoal 0,ReW 0,ReB 999,Dist 1000,Total 1000)= 0.175559(Distance = 2)
PhyOp_Range 1 ASC 3.0 Cost(RowGoal 0,ReW 0,ReB 999,Dist 1000,Total 1000)= 3.01702(Distance = 2)
The only obvious difference to me was the
2.0
and3.0
, which refer to their respective "memo group 2, original" and "memo group 3, original". Checking the memo, these refer to the same thing - so no differences revealed yet.Second, I looked into a whole mess of trace flags that proved fruitless to me - but have some interesting content. I lifted most from Benjamin Nevarez. I was looking for clues as to optimization rules that were applied in one case and not the other.
Third, I looked at which rules were applied for our
PhyOp_Range
s that look so similar. I used a couple trace flags mentioned by Paul in a blog post.从输出中,我们看到直接
JOIN
应用此规则来获取我们的PhyOp_Range
运算符:Rule Result: group=7 2 <SelIdxToRng>PhyOp_Range 1 ASC 2 (Distance = 2)
。subselect 改为应用此规则:Rule Result: group=9 2 <SelToIdxStrategy>PhyOp_Range 1 ASC 3 (Distance = 2)
。这也是您看到与每个规则关联的“智能搜索成本核算”信息的地方。对于直接JOIN
- 这是输出(对我来说)Smart seek costing (7.2) :: 1.34078e+154 , 0.001
:. 对于子选择,这是输出:Smart seek costing (9.2) :: 1.34078e+154 , 1
。最后,我无法得出太多结论 - 但保罗的回答弥补了大部分差距。我想查看有关智能搜索成本核算的更多信息。
这也不是真正的答案 - 正如 Mikael 指出的那样,很难在评论中讨论这个问题......
有趣的是,如果将子查询
(select KeyCol FROM Target)
转换为内联 TVF,您会看到计划及其成本与简单的原始查询相同:查询计划(pastetheplan 链接):
推论使我相信成本核算引擎对这种类型的子查询可能产生的潜在影响感到困惑。
例如,以下内容:
你会怎么花这个钱?查询优化器选择一个与上面的“子查询”变体非常相似的计划,包含一个计算标量(pastetheplan.com 链接):
计算标量与上面显示的“子查询”变体具有完全不同的成本,但是它仍然只是一个猜测,因为查询优化器无法先验地知道返回的行数可能是多少。该计划对左外连接使用散列匹配,因为行估计是不可知的,因此设置为目标表中的行数。
除了我同意 Mikael 在他的回答中所做的工作之外,我没有得出很好的结论,并且希望其他人能提出更好的答案。