AskOverflow.Dev

Home / dba / Questions / 86415
Accepted
Paul White
Asked: 2014-12-18 06:33:48 +0800 CST

Retrieving n rows per group

  • 772

I often need to select a number of rows from each group in a result set.

For example, I might want to list the 'n' highest or lowest recent order values per customer.

In more complex cases, the number of rows to list might vary per group (defined by an attribute of the grouping/parent record). This part is definitely optional/for extra credit, and is not intended to dissuade people from answering.

What are the main options for solving these types of problem in SQL Server 2005 and later? What are the main advantages and disadvantages of each method?

AdventureWorks examples (for clarity, optional)

  1. List the five most recent transaction dates and IDs from the TransactionHistory table, for each product that starts with a letter from M to R inclusive.
  2. Same again, but with n rows of history per product, where n is five times the DaysToManufacture attribute of Product.
  3. Same, for the special case where exactly one history row per product is required (the single most recent entry by TransactionDate, tie-break on TransactionID).
sql-server t-sql
  • 6 answers
  • 71469 Views

6 Answers

  1. Best Answer
    Rob Farley
    2014-12-23T03:28:40+08:00

    Let's start with the basic scenario.

    If I want to get some number of rows out of a table, I have two main options: ranking functions; or TOP.

    First, let's consider the whole set from Production.TransactionHistory for a particular ProductID:

    SELECT h.TransactionID, h.ProductID, h.TransactionDate
    FROM Production.TransactionHistory h
    WHERE h.ProductID = 800;
    

    This returns 418 rows, and the plan shows that it checks every row in the table looking for this - an unrestricted Clustered Index Scan, with a Predicate to provide the filter. 797 reads here, which is ugly.

    Expensive scan with 'residual' predicate

    So let's be fair to it, and create an index that would be more useful. Our conditions call for an equality match on ProductID, followed by a search for the most recent by TransactionDate. We need the TransactionID returned too, so let's go with: CREATE INDEX ix_FindingMostRecent ON Production.TransactionHistory (ProductID, TransactionDate) INCLUDE (TransactionID);.

    Having done this, our plan changes significantly, and drops the reads down to just 3. So we're already improving things by over 250x or so...

    Improved plan

    Now that we've levelled the playing field, let's look at the top options - ranking functions and TOP.

    WITH Numbered AS
    (
    SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (ORDER BY TransactionDate DESC) AS RowNum
    FROM Production.TransactionHistory h
    WHERE h.ProductID = 800
    )
    SELECT TransactionID, ProductID, TransactionDate
    FROM Numbered
    WHERE RowNum <= 5;
    
    SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
    FROM Production.TransactionHistory h
    WHERE h.ProductID = 800
    ORDER BY TransactionDate DESC;
    

    Two plans - basic TOP \ RowNum

    You will notice that the second (TOP) query is much simpler than the first, both in query and in plan. But very significantly, they both use TOP to limit the number of rows actually being pulled out of the index. The costs are only estimates and worth ignoring, but you can see a lot of similarity in the two plans, with the ROW_NUMBER() version doing a tiny amount of extra work to assign numbers and filter accordingly, and both queries end up doing just 2 reads to do their work. The Query Optimizer certainly recognises the idea of filtering on a ROW_NUMBER() field, realising that it can use a Top operator to ignore rows that aren't going to be needed. Both these queries are good enough - TOP isn't so much better that it's worth changing code, but it is simpler and probably clearer for beginners.

    So this works for a single product. But we need to consider what happens if we need to do this across multiple products.

    The iterative programmer is going to consider the idea of looping through the products of interest, and calling this query multiple times, and we can actually get away with writing a query in this form - not using cursors, but using APPLY. I'm using OUTER APPLY, figuring that we might want to return the Product with NULL, if there are no Transactions for it.

    SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
    FROM 
    Production.Product p
    OUTER APPLY (
        SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
        FROM Production.TransactionHistory h
        WHERE h.ProductID = p.ProductID
        ORDER BY TransactionDate DESC
    ) t
    WHERE p.Name >= 'M' AND p.Name < 'S';
    

    The plan for this is the iterative programmers' method - Nested Loop, doing a Top operation and Seek (those 2 reads we had before) for each Product. This gives 4 reads against Product, and 360 against TransactionHistory.

    APPLY plan

    Using ROW_NUMBER(), the method is to use PARTITION BY in the OVER clause, so that we restart the numbering for each Product. This can then be filtered like before. The plan ends up being quite different. The logical reads are about 15% lower on TransactionHistory, with a full Index Scan going on to get the rows out.

    WITH Numbered AS
    (
    SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
    FROM Production.Product p
    LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
    WHERE p.Name >= 'M' AND p.Name < 'S'
    )
    SELECT Name, ProductID, TransactionID, TransactionDate
    FROM Numbered n
    WHERE RowNum <= 5;
    

    ROW_NUMBER plan

    Significantly, though, this plan has an expensive Sort operator. The Merge Join doesn't seem to maintain the order of rows in TransactionHistory, so the data must be re-sorted to find the row numbers. It's fewer reads, but this blocking Sort could feel painful. Using APPLY, the Nested Loop will return the first rows very quickly, after just a few reads, but with a Sort, ROW_NUMBER() will only return rows after most of the work has been finished.

    Interestingly, if the ROW_NUMBER() query uses INNER JOIN instead of LEFT JOIN, then a different plan comes up.

    ROW_NUMBER() with INNER JOIN

    This plan uses a Nested Loop, just like with APPLY. But there's no Top operator, so it pulls all the transactions for each product, and uses a lot more reads than before - 492 reads against TransactionHistory. There isn't a good reason for it not to choose the Merge Join option here, so I guess the plan was considered 'Good Enough'. Still - it doesn't block, which is nice - just not as nice as APPLY.

    The PARTITION BY column that I used for ROW_NUMBER() was h.ProductID in both cases, because I had wanted to give the QO the option of producing the RowNum value before joining to the Product table. If I use p.ProductID, we see the same shape plan as with the INNER JOIN variation.

    WITH Numbered AS
    (
    SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY p.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
    FROM Production.Product p
    LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
    WHERE p.Name >= 'M' AND p.Name < 'S'
    )
    SELECT Name, ProductID, TransactionID, TransactionDate
    FROM Numbered n
    WHERE RowNum <= 5;
    

    But the Join operator says 'Left Outer Join' instead of 'Inner Join'. The number of reads is still just under 500 against the TransactionHistory table.

    PARTITION BY on p.ProductID instead of h.ProductID

    Anyway - back to the question at hand...

    We've answered question 1, with two options that you could pick and choose from. Personally, I like the APPLY option.

    To extend this to use a variable number (question 2), the 5 just needs to be changed accordingly. Oh, and I added another index, so that there was an index on Production.Product.Name that included the DaysToManufacture column.

    WITH Numbered AS
    (
    SELECT p.Name, p.ProductID, p.DaysToManufacture, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
    FROM Production.Product p
    LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
    WHERE p.Name >= 'M' AND p.Name < 'S'
    )
    SELECT Name, ProductID, TransactionID, TransactionDate
    FROM Numbered n
    WHERE RowNum <= 5 * DaysToManufacture;
    
    SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
    FROM 
    Production.Product p
    OUTER APPLY (
        SELECT TOP (5 * p.DaysToManufacture) h.TransactionID, h.ProductID, h.TransactionDate
        FROM Production.TransactionHistory h
        WHERE h.ProductID = p.ProductID
        ORDER BY TransactionDate DESC
    ) t
    WHERE p.Name >= 'M' AND p.Name < 'S';
    

    And both plans are almost identical to what they were before!

    Variable rows

    Again, ignore the estimated costs - but I still like the TOP scenario, as it is so much simpler, and the plan has no blocking operator. The reads on TransactionHistory are lower because of the high number of zeroes in DaysToManufacture, but in real life, I doubt we'd be picking that column. ;)

    One way to avoid the block is to come up with a plan that handles the ROW_NUMBER() bit to the right (in the plan) of the join. We can persuade this to happen by doing the join outside the CTE.

    WITH Numbered AS
    (
    SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY TransactionDate DESC) AS RowNum
    FROM Production.TransactionHistory h
    )
    SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
    FROM Production.Product p
    LEFT JOIN Numbered t ON t.ProductID = p.ProductID
        AND t.RowNum <= 5 * p.DaysToManufacture
    WHERE p.Name >= 'M' AND p.Name < 'S';
    

    The plan here looks simpler - it's not blocking, but there's a hidden danger.

    Joining outside the CTE

    Notice the Compute Scalar that's pulling data from the Product table. This is working out the 5 * p.DaysToManufacture value. This value isn't being passed into the branch that's pulling data from the TransactionHistory table, it's being used in the Merge Join. As a Residual.

    Subtle residual!

    So the Merge Join is consuming ALL the rows, not just the first however-many-are-needed, but all of them and then doing a residual check. This is dangerous as the number of transactions increases. I'm not a fan of this scenario - residual predicates in Merge Joins can quickly escalate. Another reason why I prefer the APPLY/TOP scenario.

    In the special case where it's exactly one row, for question 3, we can obviously use the same queries, but with 1 instead of 5. But then we have an extra option, which is to use regular aggregates.

    SELECT ProductID, MAX(TransactionDate)
    FROM Production.TransactionHistory
    GROUP BY ProductID;
    

    A query like this would be a useful start, and we could easily modify it to pull out the TransactionID as well for tie-break purposes (using a concatenation which would then be broken down), but we either look at the whole index, or we dive in product by product, and we don't really get a big improvement on what we had before in this scenario.
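
    For illustration, the concatenate-then-split idea mentioned above could be sketched like this. This is only a hypothetical sketch (the column alias packed and the exact widths are my own choices, not from the answer): it packs the date, in a lexicographically sortable ISO 8601 form, together with a zero-padded TransactionID into a single MAX()-able string, then splits it back out.

    ```sql
    -- Hypothetical sketch of the concatenate-then-split tie-break idea.
    -- Style 126 gives ISO 8601 (yyyy-mm-ddThh:mi:ss.mmm), 23 characters,
    -- which sorts lexicographically in date order; TransactionID is
    -- zero-padded to 10 digits so the whole string stays sortable.
    SELECT ProductID,
           CONVERT(datetime, LEFT(packed, 23), 126) AS TransactionDate,
           CONVERT(int, SUBSTRING(packed, 24, 10))  AS TransactionID
    FROM (
        SELECT ProductID,
               MAX(CONVERT(char(23), TransactionDate, 126)
                   + RIGHT('0000000000' + CONVERT(varchar(10), TransactionID), 10)) AS packed
        FROM Production.TransactionHistory
        GROUP BY ProductID
    ) AS x;
    ```

    As Rob says, this still reads the whole index (or seeks product by product), so it's a curiosity more than an improvement here.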

    But I should point out that we're looking at a particular scenario here. With real data, and with an indexing strategy that may not be ideal, mileage may vary considerably. Despite the fact that we've seen that APPLY is strong here, it can be slower in some situations. It rarely blocks though, as it has a tendency to use Nested Loops, which many people (myself included) find very appealing.

    I haven't tried to explore parallelism here, or dived very hard into question 3, which I see as a special case that people rarely want based on the complication of concatenating and splitting. The main thing to consider here is that these two options are both very strong.

    I prefer APPLY. It's clear, it uses the Top operator well, and it rarely causes blocking.

    • 84
  2. Aaron Bertrand
    2014-12-18T06:55:41+08:00

    The typical way to do this in SQL Server 2005 and above is to use a CTE and windowing functions. For top n per group, you can simply use ROW_NUMBER() with a PARTITION BY clause, and filter against that in the outer query. So, for example, the top 5 most recent orders per customer could be shown this way:

    DECLARE @top INT;
    SET @top = 5;
    
    ;WITH grp AS 
    (
       SELECT CustomerID, OrderID, OrderDate,
         rn = ROW_NUMBER() OVER
         (PARTITION BY CustomerID ORDER BY OrderDate DESC)
       FROM dbo.Orders
    )
    SELECT CustomerID, OrderID, OrderDate
      FROM grp
      WHERE rn <= @top
      ORDER BY CustomerID, OrderDate DESC;
    

    You can also do this with CROSS APPLY:

    DECLARE @top INT;
    SET @top = 5;
    
    SELECT c.CustomerID, o.OrderID, o.OrderDate
    FROM dbo.Customers AS c
    CROSS APPLY 
    (
        SELECT TOP (@top) OrderID, OrderDate 
        FROM dbo.Orders AS o
        WHERE CustomerID = c.CustomerID
        ORDER BY OrderDate DESC
    ) AS o
    ORDER BY c.CustomerID, o.OrderDate DESC;
    

    With the additional option Paul specified, say the Customers table has a column indicating how many rows to include per customer:

    ;WITH grp AS 
    (
       SELECT CustomerID, OrderID, OrderDate,
         rn = ROW_NUMBER() OVER
         (PARTITION BY CustomerID ORDER BY OrderDate DESC)
       FROM dbo.Orders
    )
    SELECT c.CustomerID, grp.OrderID, grp.OrderDate
      FROM grp 
      INNER JOIN dbo.Customers AS c
      ON grp.CustomerID = c.CustomerID
      AND grp.rn <= c.Number_of_Recent_Orders_to_Show
      ORDER BY c.CustomerID, grp.OrderDate DESC;
    

    And again, using CROSS APPLY and incorporating the added option that the number of rows for a customer be dictated by some column in the Customers table:

    SELECT c.CustomerID, o.OrderID, o.OrderDate
    FROM dbo.Customers AS c
    CROSS APPLY 
    (
        SELECT TOP (c.Number_of_Recent_Orders_to_Show) OrderID, OrderDate 
        FROM dbo.Orders AS o
        WHERE CustomerID = c.CustomerID
        ORDER BY OrderDate DESC
    ) AS o
    ORDER BY c.CustomerID, o.OrderDate DESC;
    

    Note that these will perform differently depending on data distribution and the availability of supporting indexes, so optimizing performance and getting the best plan will really depend on local factors.

    Personally, I prefer the CTE and windowing solutions over CROSS APPLY / TOP because they separate the logic better and are more intuitive (to me). In general (both in this case and in my broader experience), the CTE approach produces more efficient plans (examples below), but this should not be taken as a universal truth - you should always test your scenarios, especially if the indexes have changed or the data has skewed significantly.


    AdventureWorks examples - without any changes

    1. List the five most recent transaction dates and IDs from the TransactionHistory table, for each product that starts with a letter from M to R inclusive.
    -- CTE / OVER()
    
    ;WITH History AS
    (
      SELECT p.ProductID, p.Name, t.TransactionID, t.TransactionDate,
        rn = ROW_NUMBER() OVER 
        (PARTITION BY t.ProductID ORDER BY t.TransactionDate DESC)
      FROM Production.Product AS p
      INNER JOIN Production.TransactionHistory AS t
      ON p.ProductID = t.ProductID
      WHERE p.Name >= N'M' AND p.Name < N'S'
    )
    SELECT ProductID, Name, TransactionID, TransactionDate
    FROM History 
    WHERE rn <= 5;
    
    -- CROSS APPLY
    
    SELECT p.ProductID, p.Name, t.TransactionID, t.TransactionDate
    FROM Production.Product AS p
    CROSS APPLY
    (
      SELECT TOP (5) TransactionID, TransactionDate
      FROM Production.TransactionHistory
      WHERE ProductID = p.ProductID
      ORDER BY TransactionDate DESC
    ) AS t
    WHERE p.Name >= N'M' AND p.Name < N'S';
    

    Comparing these two on run-time metrics:

    enter image description here

    CTE / OVER() plan:

    enter image description here

    CROSS APPLY plan:

    enter image description here

    The CTE plan looks more complicated, but it is actually much more efficient. Pay little attention to the estimated cost % numbers, and focus instead on more important actual observations, such as far fewer reads and a much lower duration. I also ran them without parallelism, and that wasn't the difference. Run-time metrics and the CTE plan (the CROSS APPLY plan remained the same):

    enter image description here

    enter image description here

    2. Same again, but with n rows of history per product, where n is five times the DaysToManufacture attribute of Product.

    Very minor changes needed here. For the CTE, we can add a column to the inner query and filter on it in the outer query; for the CROSS APPLY, we can perform the calculation inside the correlated TOP. You would think this might lend some efficiency to the CROSS APPLY solution, but that doesn't happen in this case. Queries:

    -- CTE / OVER()
    
    ;WITH History AS
    (
      SELECT p.ProductID, p.Name, p.DaysToManufacture, t.TransactionID, t.TransactionDate,
        rn = ROW_NUMBER() OVER 
        (PARTITION BY t.ProductID ORDER BY t.TransactionDate DESC)
      FROM Production.Product AS p
      INNER JOIN Production.TransactionHistory AS t
      ON p.ProductID = t.ProductID
      WHERE p.Name >= N'M' AND p.Name < N'S'
    )
    SELECT ProductID, Name, TransactionID, TransactionDate
    FROM History 
    WHERE rn <= (5 * DaysToManufacture);
    
    -- CROSS APPLY
    
    SELECT p.ProductID, p.Name, t.TransactionID, t.TransactionDate
    FROM Production.Product AS p
    CROSS APPLY
    (
      SELECT TOP (5 * p.DaysToManufacture) TransactionID, TransactionDate
      FROM Production.TransactionHistory
      WHERE ProductID = p.ProductID
      ORDER BY TransactionDate DESC
    ) AS t
    WHERE p.Name >= N'M' AND p.Name < N'S';
    

    Run-time results:

    enter image description here

    CTE / OVER() parallel plan:

    enter image description here

    CTE / OVER() single-threaded plan:

    enter image description here

    CROSS APPLY plan:

    enter image description here

    3. Same, for the special case where exactly one history row per product is required (the single most recent entry by TransactionDate, tie-break on TransactionID).

    Again, minor changes here. In the CTE solution, we add TransactionID to the OVER() clause and change the outer filter to rn = 1. For the CROSS APPLY, we change TOP to TOP (1) and add TransactionID to the inner ORDER BY.

    -- CTE / OVER()
    
    ;WITH History AS
    (
      SELECT p.ProductID, p.Name, t.TransactionID, t.TransactionDate,
        rn = ROW_NUMBER() OVER 
        (PARTITION BY t.ProductID ORDER BY t.TransactionDate DESC, TransactionID DESC)
      FROM Production.Product AS p
      INNER JOIN Production.TransactionHistory AS t
      ON p.ProductID = t.ProductID
      WHERE p.Name >= N'M' AND p.Name < N'S'
    )
    SELECT ProductID, Name, TransactionID, TransactionDate
    FROM History 
    WHERE rn = 1;
    
    -- CROSS APPLY
    
    SELECT p.ProductID, p.Name, t.TransactionID, t.TransactionDate
    FROM Production.Product AS p
    CROSS APPLY
    (
      SELECT TOP (1) TransactionID, TransactionDate
      FROM Production.TransactionHistory
      WHERE ProductID = p.ProductID
      ORDER BY TransactionDate DESC, TransactionID DESC
    ) AS t
    WHERE p.Name >= N'M' AND p.Name < N'S';
    

    Run-time results:

    enter image description here

    CTE / OVER() parallel plan:

    enter image description here

    CTE / OVER() single-threaded plan:

    enter image description here

    CROSS APPLY plan:

    enter image description here

    Window functions are not always the best alternative (try COUNT(*) OVER()), and these are not the only two approaches to solving the n-rows-per-group problem, but in this specific case - given the schema, existing indexes, and data distribution - the CTE fared better on all meaningful counts.


    AdventureWorks examples - with the flexibility to add indexes

    However, if you add a supporting index, similar to the one Paul mentioned in a comment but with the 2nd and 3rd columns ordered DESC:

    CREATE UNIQUE NONCLUSTERED INDEX UQ3 ON Production.TransactionHistory 
      (ProductID, TransactionDate DESC, TransactionID DESC);
    

    You would actually get much more favorable plans all around, and the metrics would flip to favor the CROSS APPLY approach in all three cases:

    enter image description here

    If this were my production environment, I'd probably be satisfied with the duration in this case, and wouldn't bother to optimize further.


    This was all much uglier in SQL Server 2000, which didn't support APPLY or the OVER() clause.

    • 49
  3. ypercubeᵀᴹ
    2014-12-18T07:47:27+08:00

    In DBMSs, like MySQL, that have neither window functions nor CROSS APPLY, the way to do this is with standard (SQL-89) SQL. The slow way is a triangular cross join with an aggregate. The faster way (though still probably not as efficient as CROSS APPLY or ROW_NUMBER()) is what I call the "poor man's CROSS APPLY". It would be interesting to compare this query with the others:
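
    The slow "triangular cross join with aggregate" mentioned above could be sketched like this. A hypothetical sketch only, relying on the same UNIQUE (CustomerID, OrderDate) assumption as the query below: each order is joined to every order of the same customer that is at least as recent, so the group count is that order's recency rank.

    ```sql
    -- Hypothetical sketch of the slow "triangular" self-join with aggregate:
    -- COUNT(*) is the number of same-customer orders at least as recent,
    -- i.e. the order's rank (1 = most recent) under the UNIQUE assumption.
    SELECT o.CustomerID, o.OrderID, o.OrderDate
      FROM dbo.Orders AS o
        JOIN dbo.Orders AS o2
          ON  o2.CustomerID = o.CustomerID
          AND o2.OrderDate >= o.OrderDate
      GROUP BY o.CustomerID, o.OrderID, o.OrderDate
      HAVING COUNT(*) <= 5
      ORDER BY o.CustomerID, o.OrderDate DESC;
    ```

    The join output grows quadratically with each customer's order count, which is why this is the slow way.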

    Assumption: Orders (CustomerID, OrderDate) has a UNIQUE constraint:

    DECLARE @top INT;
    SET @top = 5;
    
    SELECT o.CustomerID, o.OrderID, o.OrderDate
      FROM dbo.Customers AS c
        JOIN dbo.Orders AS o
          ON  o.CustomerID = c.CustomerID
          AND o.OrderID IN
              ( SELECT TOP (@top) oi.OrderID
                FROM dbo.Orders AS oi
                WHERE oi.CustomerID = c.CustomerID
                ORDER BY oi.OrderDate DESC
              )
      ORDER BY CustomerID, OrderDate DESC ;
    

    For the extra problem of customized top rows per group:

    SELECT o.CustomerID, o.OrderID, o.OrderDate
      FROM dbo.Customers AS c
        JOIN dbo.Orders AS o
          ON  o.CustomerID = c.CustomerID
          AND o.OrderID IN
              ( SELECT TOP (c.Number_of_Recent_Orders_to_Show) oi.OrderID
                FROM dbo.Orders AS oi
                WHERE oi.CustomerID = c.CustomerID
                ORDER BY oi.OrderDate DESC
              )
      ORDER BY CustomerID, OrderDate DESC ;
    

    Note: In MySQL, instead of AND o.OrderID IN (SELECT TOP(@top) oi.OrderID ...), one would use AND o.OrderDate >= (SELECT oi.OrderDate ... LIMIT 1 OFFSET (@top - 1)). SQL Server added FETCH / OFFSET syntax in the 2012 version. The queries here were written with IN (TOP ...) to work with earlier versions.
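
    A sketch of that MySQL-flavoured rewrite, under the same UNIQUE (CustomerID, OrderDate) assumption (hypothetical; MySQL's LIMIT/OFFSET take literal values, so @top = 5 becomes OFFSET 4):

    ```sql
    -- Hypothetical MySQL-flavoured version of the first query above.
    -- The correlated scalar subquery fetches the customer's 5th most
    -- recent OrderDate (OFFSET 4 = @top - 1).
    SELECT o.CustomerID, o.OrderID, o.OrderDate
      FROM Customers AS c
        JOIN Orders AS o
          ON  o.CustomerID = c.CustomerID
          AND o.OrderDate >= ( SELECT oi.OrderDate
                               FROM Orders AS oi
                               WHERE oi.CustomerID = c.CustomerID
                               ORDER BY oi.OrderDate DESC
                               LIMIT 1 OFFSET 4 )
      ORDER BY CustomerID, OrderDate DESC;
    ```

    One semantic caveat: a customer with fewer than 5 orders gets NULL from the subquery and drops out entirely, unlike the TOP version, which returns whatever rows exist.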

    • 25
  4. Solomon Rutzky
    2014-12-26T01:39:48+08:00

    I took a slightly different approach, mainly to see how this technique would compare to the others, because having options is good, right?

    The Testing

    Why don't we start by just looking at how the various methods stacked up against each other. I did three sets of tests:

    1. The first set ran with no DB modifications
    2. The second set ran after an index was created to support TransactionDate-based queries against Production.TransactionHistory.
    3. The third set made a slightly different assumption. Since all three tests ran against the same list of Products, what if we cached that list? My method uses an in-memory cache while the other methods used an equivalent temp table. The supporting index created for the second set of tests still exists for this set of tests.

    Additional test details:

    • The tests were run against AdventureWorks2012 on SQL Server 2012, SP2 (Developer Edition).
    • For each test I labeled whose answer I took the query from and which particular query it was.
    • I used the "Discard results after execution" option of Query Options | Results.
    • Please note that for the first two sets of tests, the RowCounts appear to be "off" for my method. This is due to my method being a manual implementation of what CROSS APPLY is doing: it runs the initial query against Production.Product and gets 161 rows back, which it then uses for the queries against Production.TransactionHistory. Hence, the RowCount values for my entries are always 161 more than the other entries. In the third set of tests (with caching), the row counts are the same for all methods.
    • I used SQL Server Profiler to capture the stats instead of relying on the execution plans. Aaron and Mikael already did a great job showing the plans for their queries and there is no need to reproduce that information. And the intent of my method is to reduce the queries to such a simple form that it wouldn't really matter. There is an additional reason for using Profiler, but that will be mentioned later.
    • Rather than using the Name >= N'M' AND Name < N'S' construct, I chose to use Name LIKE N'[M-R]%', and SQL Server treats them the same.

    The Results

    No Supporting Index

    This is essentially out-of-the-box AdventureWorks2012. In all cases, my method is clearly better than some of the others, but never as good as the top 1 or 2 methods.

    Test 1 Test 1 Results - no index
    Aaron's CTE is clearly the winner here.

    Test 2 Test 2 Results - no index
    Aaron's CTE wins again, with Mikael's second apply row_number() method a close second.

    Test 3 Test 3 Results - no index
    Aaron's CTE (again) is the winner.

    Conclusion
    When there is no supporting index on TransactionDate, my method is better than doing a standard CROSS APPLY, but still, using the CTE method is clearly the way to go.

    With Supporting Index (no Caching)

    For this set of tests I added the obvious index on TransactionHistory.TransactionDate since all of the queries sort on that field. I say "obvious" since most other answers also agree on this point. And since the queries are all wanting the most recent dates, the TransactionDate field should be ordered DESC, so I just grabbed the CREATE INDEX statement at the bottom of Mikael's answer and added an explicit FILLFACTOR:

    CREATE INDEX [IX_TransactionHistoryX]
        ON Production.TransactionHistory (ProductID ASC, TransactionDate DESC)
        WITH (FILLFACTOR = 100);
    

    Once this index is in place, the results change quite a bit.

    Test 1 Test 1 Results - with supporting index
    This time it is my method that comes out ahead, at least in terms of Logical Reads. The CROSS APPLY method, previously the worst performer for Test 1, wins on Duration and even beats the CTE method on Logical Reads.

    Test 2 Test 2 Results-with supporting index
    This time it is Mikael's first apply row_number() method that is the winner when looking at Reads, whereas previously it was one of the worst performers. And now my method comes in at a very close second place when looking at Reads. In fact, outside of the CTE method, the rest are all fairly close in terms of Reads.

    Test 3 Test 3 Results-with supporting index
    Here the CTE is still the winner, but now the difference between the other methods is barely noticeable compared to the drastic difference that existed prior to creating the index.

    Conclusion
    The applicability of my method is more apparent now, though it is less resilient to not having proper indexes in place.

    With Supporting Index AND Caching

    For this set of tests I made use of caching because, well, why not? My method allows for using in-memory caching that the other methods cannot access. So to be fair, I created the following temp table that was used in place of Product.Product for all references in those other methods across all three tests. The DaysToManufacture field is only used in Test Number 2, but it was easier to be consistent across the SQL scripts to use the same table and it didn't hurt to have it there.

    CREATE TABLE #Products
    (
        ProductID INT NOT NULL PRIMARY KEY,
        Name NVARCHAR(50) NOT NULL,
        DaysToManufacture INT NOT NULL
    );
    
    INSERT INTO #Products (ProductID, Name, DaysToManufacture)
        SELECT  p.ProductID, p.Name, p.DaysToManufacture
        FROM    Production.Product p
        WHERE   p.Name >= N'M' AND p.Name < N'S'
        AND    EXISTS (
                        SELECT  *
                        FROM    Production.TransactionHistory th
                        WHERE   th.ProductID = p.ProductID
                    );
    
    ALTER TABLE #Products REBUILD WITH (FILLFACTOR = 100);
    

    Test 1 Test 1 Results-with supporting index AND caching
    All methods seem to benefit equally from caching, and my method still comes out ahead.

    Test 2 Test 2 Results-with supporting index AND caching
    Here we now see a difference in the lineup as my method comes out barely ahead, only 2 Reads better than Mikael's first apply row_number() method, whereas without the caching my method was behind by 4 Reads.

    Test 3 Test 3 Results-with supporting index AND caching
    Please see update towards the bottom (below the line). Here we again see some difference. The "parameterized" flavor of my method is now barely in the lead by 2 Reads compared to Aaron's CROSS APPLY method (with no caching they were equal). But the really strange thing is that for the first time we see a method that is negatively affected by the caching: Aaron's CTE method (which was previously the best for Test Number 3). But, I am not going to take credit where it is not due, and since without the caching Aaron's CTE method is still faster than my method is here with the caching, the best approach for this particular situation appears to be Aaron's CTE method.

    Conclusion Please see update towards the bottom (below the line)
    Situations that make repeated use of the results of a secondary query can often (but not always) benefit from caching those results. But when caching is a benefit, using memory for said caching has some advantage over using temporary tables.

    The Method

    Generally

    I separated the "header" query (i.e. getting the ProductIDs, and in one case also the DaysToManufacture, based on the Name starting with certain letters) from the "detail" queries (i.e. getting the TransactionIDs and TransactionDates). The concept was to perform very simple queries and not allow the optimizer to get confused when JOINing them. Clearly this is not always advantageous as it also disallows the optimizer from, well, optimizing. But as we saw in the results, depending on the type of query, this method does have its merits.

    The difference between the various flavors of this method are:

    • Constants: Submit any replaceable values as inline constants instead of being parameters. This would refer to ProductID in all three tests and also the number of rows to return in Test 2 as that is a function of "five times the DaysToManufacture Product attribute". This sub-method means that each ProductID will get its own execution plan, which can be beneficial if there is a wide variation in data distribution for ProductID. But if there is little variation in the data distribution, the cost of generating the additional plans will likely not be worth it.

    • Parameterized: Submit at least ProductID as @ProductID, allowing for execution plan caching and reuse. There is an additional test option to also treat the variable number of rows to return for Test 2 as a parameter.

    • Optimize Unknown: When referencing ProductID as @ProductID, if there is a wide variation in data distribution then it is possible to cache a plan that has a negative effect on other ProductID values, so it would be good to know whether using this query hint helps at all.

    • Cache Products: Rather than querying the Production.Product table each time, only to get the exact same list, run the query once (and while we are at it, filter out any ProductIDs that aren't even in the TransactionHistory table so we don't waste any resources there) and cache that list. The list should include the DaysToManufacture field. Using this option there is a slightly higher initial hit on Logical Reads for the first execution, but after that it is only the TransactionHistory table that is queried.

    Specifically

    Ok, but so, um, how is it possible to issue all of the sub-queries as separate queries without using a CURSOR and dumping each result set to a temporary table or table variable? Clearly doing the CURSOR / Temp Table method would reflect quite obviously in the Reads and Writes. Well, by using SQLCLR :). By creating a SQLCLR stored procedure, I was able to open a result set and essentially stream the results of each sub-query to it, as a continuous result set (and not multiple result sets). Outside of the Product information (i.e. ProductID, Name, and DaysToManufacture), none of the sub-query results had to be stored anywhere (memory or disk) and just got passed through as the main result set of the SQLCLR stored procedure. This allowed me to do a simple query to get the Product info and then cycle through it, issuing very simple queries against TransactionHistory.

    And, this is why I had to use SQL Server Profiler to capture the statistics. The SQLCLR stored procedure did not return an execution plan, either by setting the "Include Actual Execution Plan" Query Option, or by issuing SET STATISTICS XML ON;.

    For the Product Info caching, I used a readonly static Generic List (i.e. _GlobalProducts in the code below). It seems that adding to collections does not violate the readonly option, hence this code works when the assembly has a PERMISSION_SET of SAFE :), even if that is counter-intuitive.

    The Generated Queries

    The queries produced by this SQLCLR stored procedure are as follows:

    Product Info

    Test Numbers 1 and 3 (no Caching)

    SELECT prod1.ProductID, prod1.Name, 1 AS [DaysToManufacture]
    FROM   Production.Product prod1
    WHERE  prod1.Name LIKE N'[M-R]%';
    

    Test Number 2 (no Caching)

    ;WITH cte AS
    (
        SELECT prod1.ProductID
        FROM   Production.Product prod1 WITH (INDEX(AK_Product_Name))
        WHERE  prod1.Name LIKE N'[M-R]%'
    )
    SELECT prod2.ProductID, prod2.Name, prod2.DaysToManufacture
    FROM   Production.Product prod2
    INNER JOIN cte
            ON cte.ProductID = prod2.ProductID;
    

    Test Numbers 1, 2, and 3 (Caching)

    ;WITH cte AS
    (
        SELECT prod1.ProductID
        FROM   Production.Product prod1 WITH (INDEX(AK_Product_Name))
        WHERE  prod1.Name LIKE N'[M-R]%'
        AND    EXISTS (
                    SELECT *
                    FROM Production.TransactionHistory th
                    WHERE th.ProductID = prod1.ProductID
                      )
    )
    SELECT prod2.ProductID, prod2.Name, prod2.DaysToManufacture
    FROM   Production.Product prod2
    INNER JOIN cte
            ON cte.ProductID = prod2.ProductID;
    

    Transaction Info

    Test Numbers 1 and 2 (Constants)

    SELECT TOP (5) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = 977
    ORDER BY th.TransactionDate DESC;
    

    Test Numbers 1 and 2 (Parameterized)

    SELECT TOP (5) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = @ProductID
    ORDER BY th.TransactionDate DESC
    ;
    

    Test Numbers 1 and 2 (Parameterized + OPTIMIZE UNKNOWN)

    SELECT TOP (5) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = @ProductID
    ORDER BY th.TransactionDate DESC
    OPTION (OPTIMIZE FOR (@ProductID UNKNOWN));
    

    Test Number 2 (Parameterized Both)

    SELECT TOP (@RowsToReturn) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = @ProductID
    ORDER BY th.TransactionDate DESC
    ;
    

    Test Number 2 (Parameterized Both + OPTIMIZE UNKNOWN)

    SELECT TOP (@RowsToReturn) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = @ProductID
    ORDER BY th.TransactionDate DESC
    OPTION (OPTIMIZE FOR (@ProductID UNKNOWN));
    

    Test Number 3 (Constants)

    SELECT TOP (1) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = 977
    ORDER BY th.TransactionDate DESC, th.TransactionID DESC;
    

    Test Number 3 (Parameterized)

    SELECT TOP (1) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = @ProductID
    ORDER BY th.TransactionDate DESC, th.TransactionID DESC
    ;
    

    Test Number 3 (Parameterized + OPTIMIZE UNKNOWN)

    SELECT TOP (1) th.TransactionID, th.TransactionDate
    FROM   Production.TransactionHistory th
    WHERE  th.ProductID = @ProductID
    ORDER BY th.TransactionDate DESC, th.TransactionID DESC
    OPTION (OPTIMIZE FOR (@ProductID UNKNOWN));
    

    The Code

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    
    public class ObligatoryClassName
    {
        private class ProductInfo
        {
            public int ProductID;
            public string Name;
            public int DaysToManufacture;
    
            public ProductInfo(int ProductID, string Name, int DaysToManufacture)
            {
                this.ProductID = ProductID;
                this.Name = Name;
                this.DaysToManufacture = DaysToManufacture;
    
                return;
            }
        }
    
        private static readonly List<ProductInfo> _GlobalProducts = new List<ProductInfo>();
    
        private static void PopulateGlobalProducts(SqlBoolean PrintQuery)
        {
            if (_GlobalProducts.Count > 0)
            {
                if (PrintQuery.IsTrue)
                {
                    SqlContext.Pipe.Send(String.Concat("I already haz ", _GlobalProducts.Count,
                                " entries :)"));
                }
    
                return;
            }
    
            SqlConnection _Connection = new SqlConnection("Context Connection = true;");
            SqlCommand _Command = new SqlCommand();
            _Command.CommandType = CommandType.Text;
            _Command.Connection = _Connection;
            _Command.CommandText = @"
       ;WITH cte AS
       (
         SELECT prod1.ProductID
         FROM   Production.Product prod1 WITH (INDEX(AK_Product_Name))
         WHERE  prod1.Name LIKE N'[M-R]%'
         AND    EXISTS (
                         SELECT *
                         FROM Production.TransactionHistory th
                         WHERE th.ProductID = prod1.ProductID
                       )
       )
       SELECT prod2.ProductID, prod2.Name, prod2.DaysToManufacture
       FROM   Production.Product prod2
       INNER JOIN cte
               ON cte.ProductID = prod2.ProductID;
    ";
    
            SqlDataReader _Reader = null;
    
            try
            {
                _Connection.Open();
    
                _Reader = _Command.ExecuteReader();
    
                while (_Reader.Read())
                {
                    _GlobalProducts.Add(new ProductInfo(_Reader.GetInt32(0), _Reader.GetString(1),
                                                        _Reader.GetInt32(2)));
                }
            }
            catch
            {
                throw;
            }
            finally
            {
                if (_Reader != null && !_Reader.IsClosed)
                {
                    _Reader.Close();
                }
    
                if (_Connection != null && _Connection.State != ConnectionState.Closed)
                {
                    _Connection.Close();
                }
    
                if (PrintQuery.IsTrue)
                {
                    SqlContext.Pipe.Send(_Command.CommandText);
                }
            }
    
            return;
        }
    
    
        [Microsoft.SqlServer.Server.SqlProcedure]
        public static void GetTopRowsPerGroup(SqlByte TestNumber,
            SqlByte ParameterizeProductID, SqlBoolean OptimizeForUnknown,
            SqlBoolean UseSequentialAccess, SqlBoolean CacheProducts, SqlBoolean PrintQueries)
        {
            SqlConnection _Connection = new SqlConnection("Context Connection = true;");
            SqlCommand _Command = new SqlCommand();
            _Command.CommandType = CommandType.Text;
            _Command.Connection = _Connection;
    
            List<ProductInfo> _Products = null;
            SqlDataReader _Reader = null;
    
            int _RowsToGet = 5; // default value is for Test Number 1
            string _OrderByTransactionID = "";
            string _OptimizeForUnknown = "";
            CommandBehavior _CmdBehavior = CommandBehavior.Default;
    
            if (OptimizeForUnknown.IsTrue)
            {
                _OptimizeForUnknown = "OPTION (OPTIMIZE FOR (@ProductID UNKNOWN))";
            }
    
            if (UseSequentialAccess.IsTrue)
            {
                _CmdBehavior = CommandBehavior.SequentialAccess;
            }
    
            if (CacheProducts.IsTrue)
            {
                PopulateGlobalProducts(PrintQueries);
            }
            else
            {
                _Products = new List<ProductInfo>();
            }
    
    
            if (TestNumber.Value == 2)
            {
                _Command.CommandText = @"
       ;WITH cte AS
       (
         SELECT prod1.ProductID
         FROM   Production.Product prod1 WITH (INDEX(AK_Product_Name))
         WHERE  prod1.Name LIKE N'[M-R]%'
       )
       SELECT prod2.ProductID, prod2.Name, prod2.DaysToManufacture
       FROM   Production.Product prod2
       INNER JOIN cte
               ON cte.ProductID = prod2.ProductID;
    ";
            }
            else
            {
                _Command.CommandText = @"
         SELECT prod1.ProductID, prod1.Name, 1 AS [DaysToManufacture]
         FROM   Production.Product prod1
         WHERE  prod1.Name LIKE N'[M-R]%';
    ";
                if (TestNumber.Value == 3)
                {
                    _RowsToGet = 1;
                    _OrderByTransactionID = ", th.TransactionID DESC";
                }
            }
    
            try
            {
                _Connection.Open();
    
                // Populate Product list for this run if not using the Product Cache
                if (!CacheProducts.IsTrue)
                {
                    _Reader = _Command.ExecuteReader(_CmdBehavior);
    
                    while (_Reader.Read())
                    {
                        _Products.Add(new ProductInfo(_Reader.GetInt32(0), _Reader.GetString(1),
                                                      _Reader.GetInt32(2)));
                    }
    
                    _Reader.Close();
    
                    if (PrintQueries.IsTrue)
                    {
                        SqlContext.Pipe.Send(_Command.CommandText);
                    }
                }
                else
                {
                    _Products = _GlobalProducts;
                }
    
                SqlDataRecord _ResultRow = new SqlDataRecord(
                    new SqlMetaData[]{
                        new SqlMetaData("ProductID", SqlDbType.Int),
                        new SqlMetaData("Name", SqlDbType.NVarChar, 50),
                        new SqlMetaData("TransactionID", SqlDbType.Int),
                        new SqlMetaData("TransactionDate", SqlDbType.DateTime)
                    });
    
                SqlParameter _ProductID = new SqlParameter("@ProductID", SqlDbType.Int);
                _Command.Parameters.Add(_ProductID);
                SqlParameter _RowsToReturn = new SqlParameter("@RowsToReturn", SqlDbType.Int);
                _Command.Parameters.Add(_RowsToReturn);
    
                SqlContext.Pipe.SendResultsStart(_ResultRow);
    
                for (int _Row = 0; _Row < _Products.Count; _Row++)
                {
                    // Tests 1 and 3 use previously set static values for _RowsToGet
                    if (TestNumber.Value == 2)
                    {
                        if (_Products[_Row].DaysToManufacture == 0)
                        {
                            continue; // no use in issuing SELECT TOP (0) query
                        }
    
                        _RowsToGet = (5 * _Products[_Row].DaysToManufacture);
                    }
    
                    _ResultRow.SetInt32(0, _Products[_Row].ProductID);
                    _ResultRow.SetString(1, _Products[_Row].Name);
    
                    switch (ParameterizeProductID.Value)
                    {
                        case 0x01:
                            _Command.CommandText = String.Format(@"
       SELECT TOP ({0}) th.TransactionID, th.TransactionDate
       FROM   Production.TransactionHistory th
       WHERE  th.ProductID = @ProductID
       ORDER BY th.TransactionDate DESC{2}
       {1};
    ", _RowsToGet, _OptimizeForUnknown, _OrderByTransactionID);
    
                            _ProductID.Value = _Products[_Row].ProductID;
                            break;
                        case 0x02:
                            _Command.CommandText = String.Format(@"
       SELECT TOP (@RowsToReturn) th.TransactionID, th.TransactionDate
       FROM   Production.TransactionHistory th
       WHERE  th.ProductID = @ProductID
       ORDER BY th.TransactionDate DESC
       {0};
    ", _OptimizeForUnknown);
    
                            _ProductID.Value = _Products[_Row].ProductID;
                            _RowsToReturn.Value = _RowsToGet;
                            break;
                        default:
                            _Command.CommandText = String.Format(@"
       SELECT TOP ({0}) th.TransactionID, th.TransactionDate
       FROM   Production.TransactionHistory th
       WHERE  th.ProductID = {1}
       ORDER BY th.TransactionDate DESC{2};
    ", _RowsToGet, _Products[_Row].ProductID, _OrderByTransactionID);
                            break;
                    }
    
    
                    _Reader = _Command.ExecuteReader(_CmdBehavior);
    
                    while (_Reader.Read())
                    {
                        _ResultRow.SetInt32(2, _Reader.GetInt32(0));
                        _ResultRow.SetDateTime(3, _Reader.GetDateTime(1));
    
                        SqlContext.Pipe.SendResultsRow(_ResultRow);
                    }
                    _Reader.Close();
                }
    
            }
            catch
            {
                throw;
            }
            finally
            {
                if (SqlContext.Pipe.IsSendingResults)
                {
                    SqlContext.Pipe.SendResultsEnd();
                }
    
                if (_Reader != null && !_Reader.IsClosed)
                {
                    _Reader.Close();
                }
    
                if (_Connection != null && _Connection.State != ConnectionState.Closed)
                {
                    _Connection.Close();
                }
    
                if (PrintQueries.IsTrue)
                {
                    SqlContext.Pipe.Send(_Command.CommandText);
                }
            }
    
    
        }
    }
    

    The Test Queries

    There is not enough room to post the tests here so I will find another location.

    The Conclusion

    For certain scenarios, SQLCLR can be used to manipulate certain aspects of queries that cannot be done in T-SQL. And there is the ability to use memory for caching instead of temp tables, though that should be done sparingly and carefully as the memory does not get automatically released back to the system. This method is also not something that will help ad hoc queries, though it is possible to make it more flexible than I have shown here simply by adding parameters to tailor more aspects of the queries being executed.


    UPDATE

    Additional Test
    My original tests that included a supporting index on TransactionHistory used the following definition:

    ProductID ASC, TransactionDate DESC
    

    I had decided at the time to forgo including TransactionId DESC at the end, figuring that while it might help Test Number 3 (which specifies tie-breaking on the most recent TransactionId--well, "most recent" is assumed since not explicitly stated, but everyone seems to agree on this assumption), there likely wouldn't be enough ties to make a difference.

    But, then Aaron retested with a supporting index that did include TransactionId DESC and found that the CROSS APPLY method was the winner across all three tests. This was different than my testing which indicated that the CTE method was best for Test Number 3 (when no caching was used, which mirrors Aaron's test). It was clear that there was an additional variation that needed to be tested.

    I removed the current supporting index, created a new one that included TransactionID DESC, and cleared the plan cache (just to be sure):

    DROP INDEX [IX_TransactionHistoryX] ON Production.TransactionHistory;
    
    CREATE UNIQUE INDEX [UIX_TransactionHistoryX]
        ON Production.TransactionHistory (ProductID ASC, TransactionDate DESC, TransactionID DESC)
        WITH (FILLFACTOR = 100);
    
    DBCC FREEPROCCACHE WITH NO_INFOMSGS;
    

    I re-ran Test Number 1 and the results were the same, as expected. I then re-ran Test Number 3 and the results did indeed change:

    Test 3 Results-with supporting index (with TransactionId DESC)
    The above results are for the standard, non-caching test. This time, not only does the CROSS APPLY beat the CTE (just as Aaron's test indicated), but the SQLCLR proc took the lead by 30 Reads (woo hoo).

    Test 3 Results-with supporting index (with TransactionId DESC) AND caching
    The above results are for the test with caching enabled. This time the CTE's performance is not degraded, though the CROSS APPLY still beats it. However, now the SQLCLR proc takes the lead by 23 Reads (woo hoo, again).

    Take Aways

    1. There are various options to use. It is best to try several as they each have their strengths. The tests done here show a rather small variance in both Reads and Duration between the best and worst performers across all tests (with a supporting index); the variation in Reads is about 350 and Duration is 55 ms. While the SQLCLR proc did win in all but 1 test (in terms of Reads), only saving a few Reads usually isn't worth the maintenance cost of going the SQLCLR route. But in AdventureWorks2012, the Product table has only 504 rows and TransactionHistory has only 113,443 rows. The performance difference across these methods probably becomes more pronounced as the row counts increase.

    2. While this question was specific to getting a particular set of rows, it should not be forgotten that the single biggest performance factor was indexing, not the particular SQL. A good index needs to be in place before determining which method is truly best.

    3. The most important lesson here is not about CROSS APPLY vs CTE vs SQLCLR: it is about TESTING. Don't assume. Get ideas from several people and test as many scenarios as you can.

    • 23
  5. Mikael Eriksson
    2014-12-23T07:36:14+08:00

    APPLY TOP or ROW_NUMBER()? What more could there possibly be to say on that matter?

    A short recap of the differences; to really keep it short, I will only show the plans for option 2, and I have added this index on Production.TransactionHistory:

    create index IX_TransactionHistoryX on 
      Production.TransactionHistory(ProductID, TransactionDate)
    

    The row_number() query:

    with C as
    (
      select T.TransactionID,
             T.TransactionDate,
             P.DaysToManufacture,
             row_number() over(partition by P.ProductID order by T.TransactionDate desc) as rn
      from Production.Product as P
        inner join Production.TransactionHistory as T
          on P.ProductID = T.ProductID
      where P.Name >= N'M' and
            P.Name < N'S'
    )
    select C.TransactionID,
           C.TransactionDate
    from C
    where C.rn <= 5 * C.DaysToManufacture;
    

    (execution plan image)

    The apply top version:

    select T.TransactionID, 
           T.TransactionDate
    from Production.Product as P
      cross apply (
                  select top(cast(5 * P.DaysToManufacture as bigint))
                    T.TransactionID,
                    T.TransactionDate
                  from Production.TransactionHistory as T
                  where P.ProductID = T.ProductID
                  order by T.TransactionDate desc
                  ) as T
    where P.Name >= N'M' and
          P.Name < N'S';
    

    (execution plan image)

    The main difference between these is that apply top filters on the top expression below the nested loops join, whereas the row_number version filters after the join. That means there are more reads from Production.TransactionHistory than are really necessary.

    If only there existed a way to push the operators responsible for enumerating rows down to the lower branch, below the join, then the row_number version might do better.

    So, enter the apply row_number() version.

    select T.TransactionID, 
           T.TransactionDate
    from Production.Product as P
      cross apply (
                  select T.TransactionID,
                         T.TransactionDate
                  from (
                       select T.TransactionID,
                              T.TransactionDate,
                              row_number() over(order by T.TransactionDate desc) as rn
                       from Production.TransactionHistory as T
                       where P.ProductID = T.ProductID
                       ) as T
                  where T.rn <= cast(5 * P.DaysToManufacture as bigint)
                  ) as T
    where P.Name >= N'M' and
          P.Name < N'S';
    

    (execution plan image)

    As you can see, apply row_number() is pretty much the same as apply top, only slightly more complicated. Execution time is also about the same, or a bit slower.

    So why did I bother to come up with an answer that is no better than what we already have? Well, you have one more thing to try out in the real world, and there actually is a difference in reads, one that I don't have an explanation for*.

    APPLY - ROW_NUMBER
    (961 row(s) affected)
    Table 'TransactionHistory'. Scan count 115, logical reads 230, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Product'. Scan count 1, logical reads 15, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    
    APPLY - TOP
    (961 row(s) affected)
    Table 'TransactionHistory'. Scan count 115, logical reads 268, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Product'. Scan count 1, logical reads 15, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    

    While I'm at it, I might as well throw in a second row_number() version that in certain cases might be the way to go. Those cases are when you expect to actually need most of the rows from Production.TransactionHistory, because here you get a merge join between Production.Product and the enumerated Production.TransactionHistory.

    with C as
    (
      select T.TransactionID,
             T.TransactionDate,
             T.ProductID,
             row_number() over(partition by T.ProductID order by T.TransactionDate desc) as rn
      from Production.TransactionHistory as T
    )
    select C.TransactionID,
           C.TransactionDate
    from C
     inner join Production.Product as P
          on P.ProductID = C.ProductID
    where P.Name >= N'M' and
          P.Name < N'S' and
          C.rn <= 5 * P.DaysToManufacture;
    

    (execution plan image)

    To get the above shape without a sort operator, you also have to change the supporting index to order by TransactionDate descending.

    create index IX_TransactionHistoryX on 
      Production.TransactionHistory(ProductID, TransactionDate desc)
    

    * Edit: The extra logical reads are due to the nested loops prefetching used with the apply-top. You can disable this with undoc'd TF 8744 (and/or 9115 on later versions) to get the same number of logical reads. Prefetching could be an advantage of the apply-top alternative in the right circumstances. - Paul White
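    As a sketch of that note, the trace flag can be applied per query with the QUERYTRACEON hint (requires sysadmin; the flag itself is undocumented, so this is for investigation only, not production use):

    ```sql
    -- Re-run the apply-top version with nested loops prefetching disabled,
    -- to compare its logical reads against the apply row_number() version.
    select T.TransactionID,
           T.TransactionDate
    from Production.Product as P
      cross apply (
                  select top(cast(5 * P.DaysToManufacture as bigint))
                    T.TransactionID,
                    T.TransactionDate
                  from Production.TransactionHistory as T
                  where P.ProductID = T.ProductID
                  order by T.TransactionDate desc
                  ) as T
    where P.Name >= N'M' and
          P.Name < N'S'
    option (querytraceon 8744);
    ```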

    • 19
  6. Kris Gruttemeyer
    2014-12-18T07:16:43+08:00

    I typically use a combination of CTEs and windowing functions. You could achieve this using something like the following:

    ;WITH GiveMeCounts
    AS (
        SELECT CustomerID
            ,OrderDate
            ,TotalAmt
            ,ROW_NUMBER() OVER (
                PARTITION BY CustomerID ORDER BY
                --You can change the following field or sort order to whatever you'd like to order by.
                TotalAmt DESC
                ) AS MySeqNum
        FROM dbo.Orders --the source table is not named in the original; substitute your own orders table
        )
    SELECT CustomerID, OrderDate, TotalAmt
    FROM GiveMeCounts
    --Set n per group here
    WHERE MySeqNum <= 10;
    

    For the extra credit portion, where different groups may want to return different numbers of rows, you could use a separate table. Let's say we use geographic criteria such as state:

    +-------+-----------+
    | State | MaxSeqnum |
    +-------+-----------+
    | AK    |        10 |
    | NY    |         5 |
    | NC    |        23 |
    +-------+-----------+
    

    To achieve this where the values may differ, you would need to join your CTE to the State table, similar to this:

    SELECT [CustomerID]
        ,[OrderDate]
        ,[TotalAmt]
        ,[State]
    FROM GiveMeCounts gmc
    INNER JOIN StateTable st ON gmc.[State] = st.[State]
        AND gmc.MySeqNum <= st.MaxSeqNum
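    A minimal sketch of the lookup table used above (names and values taken from the example; adjust types to fit your schema):

    ```sql
    -- Hypothetical lookup table driving the per-group row limit.
    CREATE TABLE dbo.StateTable
    (
        [State]   char(2) NOT NULL PRIMARY KEY,
        MaxSeqNum int     NOT NULL
    );

    INSERT INTO dbo.StateTable ([State], MaxSeqNum)
    VALUES ('AK', 10), ('NY', 5), ('NC', 23);
    ```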
    
    • 12
