I often need to select a number of rows from each group in a result set.

For example, I might want to list the 'n' highest or lowest recent order values per customer.

In more complex cases, the number of rows to list might vary per group (defined by an attribute of the grouping/parent record). This part is definitely optional/for extra credit, and isn't intended to dissuade people from answering.

What are the main options for solving these types of problems in SQL Server 2005 and later? What are the main advantages and disadvantages of each method?
AdventureWorks examples (for clarity, optional)

- List the five most recent transaction dates and IDs from the TransactionHistory table, for each product that starts with a letter from M to R inclusive.
- Same again, but with n history rows per product, where n is five times the DaysToManufacture Product attribute.
- Same again, for the special case where exactly one history row per product is required (the single most recent entry by TransactionDate, tie-break on TransactionID).
Let's start with the basic scenario.
If I want to get some number of rows out of a table, I have two main options: ranking functions, or TOP.

First, let's consider the whole set from Production.TransactionHistory for a particular ProductID:
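The query under discussion did not survive in this copy; it would have been along these lines (the ProductID value is an arbitrary stand-in):

```sql
-- All transaction history for one product, most recent first
SELECT TransactionID, TransactionDate
FROM Production.TransactionHistory
WHERE ProductID = 877 -- hypothetical ProductID; any value demonstrates the point
ORDER BY TransactionDate DESC;
```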
This returns 418 rows, and the plan shows that it checks every row in the table looking for this - an unrestricted Clustered Index Scan, with a Predicate to provide the filter. 797 reads here, which is ugly.
So let's be fair to it, and create an index that would be more useful. Our conditions call for an equality match on ProductID, followed by a search for the most recent by TransactionDate. We need the TransactionID returned too, so let's go with: CREATE INDEX ix_FindingMostRecent ON Production.TransactionHistory (ProductID, TransactionDate) INCLUDE (TransactionID);

Having done this, our plan changes significantly, and drops the reads down to just 3. So we're already improving things by over 250x or so...
Now that we've levelled the playing field, let's look at the top options - ranking functions and TOP.
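The two queries being compared were lost in this copy; hedged reconstructions of the ranking-function and TOP variants (the ProductID value is a stand-in):

```sql
-- Option 1: ranking function
SELECT TransactionID, TransactionDate
FROM
(
    SELECT TransactionID, TransactionDate,
           ROW_NUMBER() OVER (ORDER BY TransactionDate DESC) AS RowNum
    FROM Production.TransactionHistory
    WHERE ProductID = 877 -- hypothetical ProductID
) AS h
WHERE RowNum <= 5;

-- Option 2: TOP
SELECT TOP (5) TransactionID, TransactionDate
FROM Production.TransactionHistory
WHERE ProductID = 877 -- hypothetical ProductID
ORDER BY TransactionDate DESC;
```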
You will notice that the second (TOP) query is much simpler than the first, both in query and in plan. But very significantly, they both use TOP to limit the number of rows actually being pulled out of the index. The costs are only estimates and worth ignoring, but you can see a lot of similarity in the two plans, with the ROW_NUMBER() version doing a tiny amount of extra work to assign numbers and filter accordingly, and both queries end up doing just 2 reads to do their work. The Query Optimizer certainly recognises the idea of filtering on a ROW_NUMBER() field, realising that it can use a Top operator to ignore rows that aren't going to be needed. Both these queries are good enough - TOP isn't so much better that it's worth changing code, but it is simpler and probably clearer for beginners.

So this works across a single product. But we need to consider what happens if we need to do this across multiple products.
The iterative programmer is going to consider the idea of looping through the products of interest, calling this query multiple times, and we can actually get away with writing a query in this form - not using cursors, but using APPLY. I'm using OUTER APPLY, figuring that we might want to return the Product with NULL, if there are no Transactions for it.

The plan for this is the iterative programmers' method - a Nested Loop, doing a Top operation and Seek (those 2 reads we had before) for each Product. This gives 4 reads against Product, and 360 against TransactionHistory.
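The OUTER APPLY query described here would have looked roughly like this (a sketch, not the author's exact text):

```sql
SELECT p.Name, t.TransactionID, t.TransactionDate
FROM Production.Product AS p
OUTER APPLY
(
    SELECT TOP (5) th.TransactionID, th.TransactionDate
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    ORDER BY th.TransactionDate DESC
) AS t
WHERE p.Name LIKE N'[M-R]%'; -- products starting with M through R
```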
Using ROW_NUMBER(), the method is to use PARTITION BY in the OVER clause, so that we restart the numbering for each Product. This can then be filtered like before. The plan ends up being quite different. The logical reads are about 15% lower on TransactionHistory, with a full Index Scan going on to get the rows out.

Significantly, though, this plan has an expensive Sort operator. Because the Merge Join doesn't seem to maintain the order of rows in TransactionHistory, the data must be re-sorted to be able to find the row numbers. It's fewer reads, but this blocking Sort could feel painful. Using APPLY, the Nested Loop will return the first rows very quickly, after just a few reads, but with a Sort, ROW_NUMBER() will only return rows after most of the work has been finished.

Interestingly, if the ROW_NUMBER() query uses INNER JOIN instead of LEFT JOIN, then a different plan comes up. This plan uses a Nested Loop, just like with APPLY. But there's no Top operator, so it pulls all the transactions for each product, and uses a lot more reads than before - 492 reads against TransactionHistory. There isn't a good reason for it not to choose the Merge Join option here, so I guess the plan was considered 'Good Enough'. Still - it doesn't block, which is nice - just not as nice as APPLY.

The PARTITION BY column that I used for ROW_NUMBER() was h.ProductID in both cases, because I had wanted to give the QO the option of producing the RowNum value before joining to the Product table. If I use p.ProductID, we see the same shape plan as with the INNER JOIN variation, but the Join operator says 'Left Outer Join' instead of 'Inner Join'. The number of reads is still just under 500 against the TransactionHistory table.
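A hedged sketch of the ROW_NUMBER() variant discussed in this section (partitioning on h.ProductID as described; the query itself was lost in this copy):

```sql
WITH NumberedRows AS
(
    SELECT h.ProductID, h.TransactionID, h.TransactionDate,
           ROW_NUMBER() OVER (PARTITION BY h.ProductID
                              ORDER BY h.TransactionDate DESC) AS RowNum
    FROM Production.TransactionHistory AS h
)
SELECT p.Name, nr.TransactionID, nr.TransactionDate
FROM Production.Product AS p
LEFT JOIN NumberedRows AS nr
    ON nr.ProductID = p.ProductID
   AND nr.RowNum <= 5
WHERE p.Name LIKE N'[M-R]%';
```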
Anyway - back to the question at hand...
We've answered question 1, with two options that you could pick and choose from. Personally, I like the APPLY option.

To extend this to use a variable number (question 2), the 5 just needs to be changed accordingly. Oh, and I added another index, so that there was an index on Production.Product.Name that included the DaysToManufacture column.

And both plans are almost identical to what they were before!
Again, ignore the estimated costs - but I still like the TOP scenario, as it is so much simpler, and the plan has no blocking operator. The reads are lower on TransactionHistory because of the high number of zeroes in DaysToManufacture, but in real life, I doubt we'd be picking that column. ;)

One way to avoid the block is to come up with a plan that handles the ROW_NUMBER() bit to the right (in the plan) of the join. We can persuade this to happen by doing the join outside the CTE.

The plan here looks simpler - it's not blocking, but there's a hidden danger.
Notice the Compute Scalar that's pulling data from the Product table. This is working out the 5 * p.DaysToManufacture value. This value isn't being passed into the branch that's pulling data from the TransactionHistory table; it's being used in the Merge Join. As a Residual.

So the Merge Join is consuming ALL the rows - not just the first however-many-are-needed, but all of them - and then doing a residual check. This is dangerous as the number of transactions increases. I'm not a fan of this scenario - residual predicates in Merge Joins can quickly escalate. It's another reason why I prefer the APPLY/TOP scenario.

In the special case where it's exactly one row, for question 3, we can obviously use the same queries, but with 1 instead of 5. But then we have an extra option, which is to use regular aggregates.

A query like this would be a useful start, and we could easily modify it to pull out the TransactionID as well for tie-break purposes (using a concatenation which would then be broken down), but we either look at the whole index, or we dive in product by product, and we don't really get a big improvement on what we had before in this scenario.
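The aggregate query referred to as "a useful start" was dropped from this copy; a sketch of its likely shape:

```sql
-- One row per product: the most recent TransactionDate only
SELECT p.Name, MAX(th.TransactionDate) AS LatestTransactionDate
FROM Production.Product AS p
LEFT JOIN Production.TransactionHistory AS th
    ON th.ProductID = p.ProductID
WHERE p.Name LIKE N'[M-R]%'
GROUP BY p.Name;
```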
But I should point out that we're looking at a particular scenario here. With real data, and with an indexing strategy that may not be ideal, mileage may vary considerably. Despite the fact that we've seen that APPLY is strong here, it can be slower in some situations. It rarely blocks, though, as it has a tendency to use Nested Loops, which many people (myself included) find very appealing.

I haven't tried to explore parallelism here, or dived very hard into question 3, which I see as a special case that people rarely want, given the complication of concatenating and splitting. The main thing to consider here is that these two options are both very strong.
I prefer APPLY. It's clear, it uses the Top operator well, and it rarely causes blocking.

The typical way to do this in SQL Server 2005 and up is to use a CTE and windowing functions. For top n per group, you can simply use ROW_NUMBER() with a PARTITION clause, and filter against that in the outer query. So, for example, the 5 most recent orders per customer could be shown this way:
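The CTE query itself was lost in this copy; a sketch, using a hypothetical dbo.Orders table:

```sql
;WITH o AS
(
    SELECT CustomerID, OrderID, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerID
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders -- hypothetical table
)
SELECT CustomerID, OrderID, OrderDate
FROM o
WHERE rn <= 5; -- keep the 5 most recent orders per customer
```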
You could also do this with CROSS APPLY:

Using the additional option Paul specified, suppose the Customers table has a column indicating how many rows to include for each customer:
Again, using CROSS APPLY and incorporating the added option, where the number of rows per customer is determined by some column in the Customers table:

Note that these will vary based on data distribution and the availability of supporting indexes, so optimizing performance and getting the best plan will really depend on local factors.
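A sketch of that CROSS APPLY variant; the RowsToReturn column on a hypothetical dbo.Customers table is an assumption for illustration:

```sql
SELECT c.CustomerID, o.OrderID, o.OrderDate
FROM dbo.Customers AS c -- hypothetical table with a per-customer limit column
CROSS APPLY
(
    SELECT TOP (c.RowsToReturn) oi.OrderID, oi.OrderDate
    FROM dbo.Orders AS oi
    WHERE oi.CustomerID = c.CustomerID
    ORDER BY oi.OrderDate DESC
) AS o;
```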
Personally, I prefer the CTE and windowing solutions over CROSS APPLY/TOP because they separate the logic better and are (to me) more intuitive. In general (both in this case and in my experience), the CTE approach produces more efficient plans (examples below), but this should not be treated as a universal truth - you should always test your scenarios, especially if indexes have changed or the data has skewed significantly.

AdventureWorks examples - without any changes
Here is how the two compare on runtime metrics:

CTE/OVER() plan:

CROSS APPLY plan:

The CTE plan looks more complex, but it is actually more efficient. Pay little attention to the estimated cost percentages; focus on the more important actual observations, such as far fewer reads and a much shorter duration. I also ran these without parallelism, and it made no difference. Runtime metrics and the CTE plan (the CROSS APPLY plan remained the same):

Only a very minor change is needed here. For the CTE, we can add a column to the inner query and filter on it in the outer query; for CROSS APPLY, we can perform the calculation inside the correlated TOP. You would think this would lend some efficiency to the CROSS APPLY solution, but that doesn't happen in this case. Queries:

Runtime results:
Parallel CTE/OVER() plan:

Single-threaded CTE/OVER() plan:

CROSS APPLY plan:

Again, the changes here are small. In the CTE solution, we add TransactionID to the OVER() clause and change the outer filter to rn = 1. For CROSS APPLY, we change the TOP to TOP (1) and add TransactionID to the inner ORDER BY.

Runtime results:
Parallel CTE/OVER() plan:

Single-threaded CTE/OVER() plan:

CROSS APPLY plan:

Windowing functions aren't always the best alternative (go try COUNT(*) OVER()), and these are not the only two approaches to solving the n-rows-per-group problem, but in this specific case - given the schema, the existing indexes, and the data distribution - the CTE fared better by all meaningful accounts.

AdventureWorks examples - with the flexibility to add an index
However, if you add a supporting index, similar to the one Paul mentioned in a comment, but with the 2nd and 3rd columns ordered DESC:
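The index definition itself did not survive in this copy; per the description (columns 2 and 3 ordered DESC), it would be something like the following - the index name is an assumption:

```sql
CREATE INDEX ix_TransactionHistory_Recent -- hypothetical name
ON Production.TransactionHistory
    (ProductID, TransactionDate DESC, TransactionID DESC);
```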
You actually get much more favorable plans all around, and the metrics flip to favor the CROSS APPLY approach in all three scenarios. If this were my production environment, I'd probably be satisfied with the durations in this case and wouldn't bother optimizing further.
This gets much uglier in SQL Server 2000, which does not support the APPLY or OVER() clauses.

In DBMSs, like MySQL, that do not have window functions or CROSS APPLY, the way to do this would be to use standard SQL (SQL-89). The slow way would be a triangular cross join with an aggregate. The faster way (but still probably not as efficient as using CROSS APPLY or the ROW_NUMBER() function) would be what I call the "poor man's CROSS APPLY". It would be interesting to compare this query with the others.

Assumption: Orders (CustomerID, OrderDate) has a UNIQUE constraint.

For the extra problem of customized top rows per group:
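The queries themselves were lost in this copy; hedged sketches of both the basic version and the customized-rows-per-group version (the TopN column on Customers is an assumed name):

```sql
-- Top 5 orders per customer, the "poor man's CROSS APPLY":
SELECT o.CustomerID, o.OrderID, o.OrderDate
FROM Customers AS c
JOIN Orders AS o
  ON  o.CustomerID = c.CustomerID
  AND o.OrderID IN
      ( SELECT TOP (5) oi.OrderID
        FROM Orders AS oi
        WHERE oi.CustomerID = c.CustomerID
        ORDER BY oi.OrderDate DESC );

-- Customized top rows per group; the limit comes from a column on Customers:
SELECT o.CustomerID, o.OrderID, o.OrderDate
FROM Customers AS c
JOIN Orders AS o
  ON  o.CustomerID = c.CustomerID
  AND o.OrderID IN
      ( SELECT TOP (c.TopN) oi.OrderID -- TopN: hypothetical per-customer limit
        FROM Orders AS oi
        WHERE oi.CustomerID = c.CustomerID
        ORDER BY oi.OrderDate DESC );
```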
Note: In MySQL, instead of AND o.OrderID IN (SELECT TOP(@top) oi.OrderID ...) one would use AND o.OrderDate >= (SELECT oi.OrderDate ... LIMIT 1 OFFSET (@top - 1)). SQL Server added the FETCH / OFFSET syntax in the 2012 version. The queries here were adjusted with IN (TOP...) to work with earlier versions.

I took a slightly different approach, mainly to see how this technique would compare to the others, because having options is good, right?
The Testing
Why don't we start by just looking at how the various methods stack up against each other? I did three sets of tests, all of them TransactionDate-based queries against Production.TransactionHistory.

Additional test details:
- AdventureWorks2012 on SQL Server 2012, SP2 (Developer Edition).
- RowCounts appear to be "off" for my method. This is due to my method being a manual implementation of what CROSS APPLY is doing: it runs the initial query against Production.Product and gets 161 rows back, which it then uses for the queries against Production.TransactionHistory. Hence, the RowCount values for my entries are always 161 more than the other entries. In the third set of tests (with caching) the row counts are the same for all methods.
- Instead of the Name >= N'M' AND Name < N'S' construct, I chose to use Name LIKE N'[M-R]%'; SQL Server treats them the same.

The Results
No Supporting Index
This is essentially out-of-the-box AdventureWorks2012. In all cases my method is clearly better than some of the others, but never as good as the top 1 or 2 methods.
Test 1
Aaron's CTE is clearly the winner here.
Test 2
Aaron's CTE is (again) the winner, with Mikael's second apply row_number() method a close second.

Test 3
Aaron's CTE (again) is the winner.
Conclusion
When there is no supporting index on TransactionDate, my method is better than doing a standard CROSS APPLY, but still, using the CTE method is clearly the way to go.

With Supporting Index (no Caching)
For this set of tests I added the obvious index on TransactionHistory.TransactionDate since all of the queries sort on that field. I say "obvious" since most other answers also agree on this point. And since the queries are all wanting the most recent dates, the TransactionDate field should be ordered DESC, so I just grabbed the CREATE INDEX statement at the bottom of Mikael's answer and added an explicit FILLFACTOR:
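The CREATE INDEX statement was dropped from this copy; based on the surrounding description (keyed on TransactionDate DESC, taken from Mikael's answer, without TransactionId at the end, plus an explicit FILLFACTOR), it would be along these lines - the name, the leading ProductID key, the INCLUDE, and the FILLFACTOR value are all assumptions:

```sql
CREATE INDEX ix_TransactionHistory_TransactionDate -- hypothetical name
ON Production.TransactionHistory (ProductID, TransactionDate DESC)
INCLUDE (TransactionID)
WITH (FILLFACTOR = 100); -- explicit FILLFACTOR; value assumed
```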
Once this index is in place, the results change quite a bit.
Test 1
This time it is my method that comes out ahead, at least in terms of Logical Reads. The CROSS APPLY method, previously the worst performer for Test 1, wins on Duration and even beats the CTE method on Logical Reads.

Test 2
This time it is Mikael's first apply row_number() method that is the winner when looking at Reads, whereas previously it was one of the worst performers. And now my method comes in at a very close second place when looking at Reads. In fact, outside of the CTE method, the rest are all fairly close in terms of Reads.

Test 3
Here the CTE is still the winner, but now the difference between the other methods is barely noticeable compared to the drastic difference that existed prior to creating the index.
Conclusion
The applicability of my method is more apparent now, though it is less resilient to not having proper indexes in place.
With Supporting Index AND Caching
For this set of tests I made use of caching because, well, why not? My method allows for using in-memory caching that the other methods cannot access. So, to be fair, I created the following temp table that was used in place of Production.Product for all references in those other methods across all three tests. The DaysToManufacture field is only used in Test Number 2, but it was easier to be consistent across the SQL scripts to use the same table, and it didn't hurt to have it there.

Test 1
All methods seem to benefit equally from caching, and my method still comes out ahead.
Test 2
Here we now see a difference in the lineup, as my method comes out barely ahead, only 2 Reads better than Mikael's first apply row_number() method, whereas without the caching my method was behind by 4 Reads.

Test 3
Please see update towards the bottom (below the line). Here we again see some difference. The "parameterized" flavor of my method is now barely in the lead by 2 Reads compared to Aaron's CROSS APPLY method (with no caching they were equal). But the really strange thing is that for the first time we see a method that is negatively affected by the caching: Aaron's CTE method (which was previously the best for Test Number 3). But, I am not going to take credit where it is not due, and since without the caching Aaron's CTE method is still faster than my method is here with the caching, the best approach for this particular situation appears to be Aaron's CTE method.
Conclusion Please see update towards the bottom (below the line)
Situations that make repeated use of the results of a secondary query can often (but not always) benefit from caching those results. But when caching is a benefit, using memory for said caching has some advantage over using temporary tables.
The Method
Generally
I separated the "header" query (i.e. getting the ProductIDs, and in one case also the DaysToManufacture, based on the Name starting with certain letters) from the "detail" queries (i.e. getting the TransactionIDs and TransactionDates). The concept was to perform very simple queries and not allow the optimizer to get confused when JOINing them. Clearly this is not always advantageous, as it also disallows the optimizer from, well, optimizing. But as we saw in the results, depending on the type of query, this method does have its merits.

The differences between the various flavors of this method are:
- Constants: Submit any replaceable values as inline constants instead of parameters. This would refer to ProductID in all three tests, and also the number of rows to return in Test 2, as that is a function of "five times the DaysToManufacture Product attribute". This sub-method means that each ProductID will get its own execution plan, which can be beneficial if there is a wide variation in data distribution for ProductID. But if there is little variation in the data distribution, the cost of generating the additional plans will likely not be worth it.
- Parameterized: Submit at least ProductID as @ProductID, allowing for execution plan caching and reuse. There is an additional test option to also treat the variable number of rows to return for Test 2 as a parameter.
- Optimize Unknown: When referencing ProductID as @ProductID, if there is wide variation of data distribution then it is possible to cache a plan that has a negative effect on other ProductID values, so it would be good to know if using this Query Hint helps any.
- Cache Products: Rather than querying the Production.Product table each time, only to get the exact same list, run the query once (and, while we are at it, filter out any ProductIDs that aren't even in the TransactionHistory table so we don't waste any resources there) and cache that list. The list should include the DaysToManufacture field. Using this option there is a slightly higher initial hit on Logical Reads for the first execution, but after that it is only the TransactionHistory table that is queried.

Specifically
Ok, but so, um, how is it possible to issue all of the sub-queries as separate queries without using a CURSOR and dumping each result set to a temporary table or table variable? Clearly doing the CURSOR / Temp Table method would reflect quite obviously in the Reads and Writes. Well, by using SQLCLR :). By creating a SQLCLR stored procedure, I was able to open a result set and essentially stream the results of each sub-query to it, as a continuous result set (and not multiple result sets). Outside of the Product information (i.e. ProductID, Name, and DaysToManufacture), none of the sub-query results had to be stored anywhere (memory or disk); they just got passed through as the main result set of the SQLCLR stored procedure. This allowed me to do a simple query to get the Product info and then cycle through it, issuing very simple queries against TransactionHistory.

And this is why I had to use SQL Server Profiler to capture the statistics. The SQLCLR stored procedure did not return an execution plan, either by setting the "Include Actual Execution Plan" Query Option, or by issuing SET STATISTICS XML ON.
For the Product Info caching, I used a readonly static Generic List (i.e. _GlobalProducts in the code below). It seems that adding to collections does not violate the readonly option, hence this code works when the assembly has a PERMISSION_SET of SAFE :), even if that is counter-intuitive.

The Generated Queries
The queries produced by this SQLCLR stored procedure are as follows:
Product Info
Test Numbers 1 and 3 (no Caching)
Test Number 2 (no Caching)
Test Numbers 1, 2, and 3 (Caching)
Transaction Info
Test Numbers 1 and 2 (Constants)
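The generated statements themselves are not reproduced in this copy; the Constants flavor of the detail query would have been shaped roughly like this (the ProductID value and row count shown are placeholders):

```sql
SELECT TOP (5) th.TransactionID, th.TransactionDate
FROM Production.TransactionHistory th
WHERE th.ProductID = 977 -- inline constant: each ProductID gets its own plan
ORDER BY th.TransactionDate DESC;
```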
Test Numbers 1 and 2 (Parameterized)
Test Numbers 1 and 2 (Parameterized + OPTIMIZE UNKNOWN)
Test Number 2 (Parameterized Both)
Test Number 2 (Parameterized Both + OPTIMIZE UNKNOWN)
Test Number 3 (Constants)
Test Number 3 (Parameterized)
Test Number 3 (Parameterized + OPTIMIZE UNKNOWN)
The Code
The Test Queries
There is not enough room to post the tests here so I will find another location.
The Conclusion
For certain scenarios, SQLCLR can be used to manipulate certain aspects of queries that cannot be done in T-SQL. And there is the ability to use memory for caching instead of temp tables, though that should be done sparingly and carefully as the memory does not get automatically released back to the system. This method is also not something that will help ad hoc queries, though it is possible to make it more flexible than I have shown here simply by adding parameters to tailor more aspects of the queries being executed.
UPDATE
Additional Test
My original tests that included a supporting index on TransactionHistory used the following definition:

I had decided at the time to forgo including TransactionId DESC at the end, figuring that while it might help Test Number 3 (which specifies tie-breaking on the most recent TransactionId - well, "most recent" is assumed since it was not explicitly stated, but everyone seems to agree on this assumption), there likely wouldn't be enough ties to make a difference.

But then Aaron retested with a supporting index that did include TransactionId DESC, and found that the CROSS APPLY method was the winner across all three tests. This was different from my testing, which indicated that the CTE method was best for Test Number 3 (when no caching was used, which mirrors Aaron's test). It was clear that there was an additional variation that needed to be tested.
TransactionId
, and cleared the plan cache (just to be sure):I re-ran Test Number 1 and the results were the same, as expected. I then re-ran Test Number 3 and the results did indeed change:
The above results are for the standard, non-caching test. This time, not only does CROSS APPLY beat the CTE (just as Aaron's test indicated), but the SQLCLR proc took the lead by 30 Reads (woo hoo).

The above results are for the test with caching enabled. This time the CTE's performance is not degraded, though CROSS APPLY still beats it. However, now the SQLCLR proc takes the lead by 23 Reads (woo hoo, again).

Takeaways
There are various options to use. It is best to try several, as they each have their strengths. The tests done here show a rather small variance in both Reads and Duration between the best and worst performers across all tests (with a supporting index); the variation in Reads is about 350, and in Duration about 55 ms. While the SQLCLR proc did win in all but 1 test (in terms of Reads), only saving a few Reads usually isn't worth the maintenance cost of going the SQLCLR route. But in AdventureWorks2012, the Product table has only 504 rows and TransactionHistory
has only 113,443 rows. The performance difference across these methods probably becomes more pronounced as the row counts increase.

While this question was specifically about retrieving a particular set of rows, it should not be overlooked that the single biggest factor in performance is indexing, not the particular SQL. A good index needs to be in place before it can be determined which method is truly best.

The most important lesson found here is not about CROSS APPLY vs CTE vs SQLCLR: it is about TESTING. Do not assume. Get ideas from several people, and test as many scenarios as you can.
APPLY TOP or ROW_NUMBER()? What more could there possibly be to say on that matter?

A short recap of the differences; to really keep it short, I will only show the plans for option 2, and I have added the index on Production.TransactionHistory.

The row_number() query:

The apply top version:

The main difference between these is that apply top filters on the top expression below the nested loops join, whereas the row_number version filters after the join. That means there are more reads from Production.TransactionHistory than is really necessary.

If only there existed a way to push the operators responsible for enumerating rows down to the lower branch, below the join, then the row_number version might do better.

So enter the apply row_number() version.

As you can see, apply row_number() is pretty much the same as apply top, only slightly more complicated. Execution time is also about the same, or a bit slower.

So why did I bother to come up with an answer that is no better than what we already have? Well, you have one more thing to try out in the real world, and there actually is a difference in reads. One that I don't have an explanation for*.
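The apply row_number() query itself was lost in this copy; a hedged sketch of its shape for option 2 (n = 5 * DaysToManufacture):

```sql
SELECT p.Name, t.TransactionID, t.TransactionDate
FROM Production.Product AS p
CROSS APPLY
(
    SELECT th.TransactionID, th.TransactionDate
    FROM
    (
        SELECT th.TransactionID, th.TransactionDate,
               ROW_NUMBER() OVER (ORDER BY th.TransactionDate DESC) AS rn
        FROM Production.TransactionHistory AS th
        WHERE th.ProductID = p.ProductID -- correlation pushed below the join
    ) AS th
    WHERE th.rn <= 5 * p.DaysToManufacture
) AS t
WHERE p.Name LIKE N'[M-R]%';
```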
While I'm at it, I might as well throw in a second row_number() version that in certain cases might be the way to go. Those certain cases would be when you expect you actually need most of the rows from Production.TransactionHistory, because here you get a merge join between Production.Product and the enumerated Production.TransactionHistory.

To get the above shape without a sort operator, you also have to change the supporting index to order by TransactionDate descending.

* Edit: The extra logical reads are due to the nested loops prefetching used with the apply top. You can disable this with undocumented TF 8744 (and/or 9115 on later versions) to get the same number of logical reads. Prefetching could be an advantage of the apply top alternative in the right circumstances. - Paul White
I usually use a combination of CTEs and window functions. You could achieve this answer with something like the following:
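A sketch of that combination (the table and column names are hypothetical):

```sql
;WITH cte AS
(
    SELECT CustomerID, OrderID, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerID
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders -- hypothetical table
)
SELECT CustomerID, OrderID, OrderDate
FROM cte
WHERE rn <= 5; -- top 5 rows per customer
```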
For the extra credit portion, where different groups may want to return different numbers of rows, you could use a separate table. Let's say using geographic criteria such as state:
In order to achieve this where the values may be different, you would need to join your CTE to the State table, similar to this:
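A hedged sketch; the State table and its RowsToReturn column are assumptions based on the description:

```sql
;WITH cte AS
(
    SELECT c.State, c.CustomerID, o.OrderID, o.OrderDate,
           ROW_NUMBER() OVER (PARTITION BY c.CustomerID
                              ORDER BY o.OrderDate DESC) AS rn
    FROM dbo.Customers AS c
    JOIN dbo.Orders AS o
        ON o.CustomerID = c.CustomerID
)
SELECT cte.CustomerID, cte.OrderID, cte.OrderDate
FROM cte
JOIN dbo.State AS s -- hypothetical table holding a per-state row limit
    ON s.State = cte.State
WHERE cte.rn <= s.RowsToReturn;
```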