I need to calculate a rolling sum over a date range. To illustrate, using the AdventureWorks sample database, the following hypothetical syntax would do exactly what I need:
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45 = SUM(TH.ActualCost) OVER (
PARTITION BY TH.ProductID
ORDER BY TH.TransactionDate
RANGE BETWEEN
INTERVAL 45 DAY PRECEDING
AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;
Unfortunately, the RANGE window frame extent does not currently allow an interval in SQL Server.
I know I can write a solution using a subquery and a regular (non-window) aggregate:
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45 =
(
SELECT SUM(TH2.ActualCost)
FROM Production.TransactionHistory AS TH2
WHERE
TH2.ProductID = TH.ProductID
AND TH2.TransactionDate <= TH.TransactionDate
AND TH2.TransactionDate >= DATEADD(DAY, -45, TH.TransactionDate)
)
FROM Production.TransactionHistory AS TH
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;
Given the following index:
CREATE UNIQUE INDEX i
ON Production.TransactionHistory
(ProductID, TransactionDate, ReferenceOrderID)
INCLUDE
(ActualCost);
The execution plan is:
While that is not terribly inefficient, it seems it should be possible to express this query using only the window aggregate and analytic functions supported in SQL Server 2012, 2014, or 2016 (so far).
To be clear, I am looking for a solution that performs a single pass over the data.
In T-SQL, that is likely to mean the OVER clause will do the work, and the execution plan will feature Window Spools and Window Aggregates. All language elements that use the OVER clause are fair game. A SQLCLR solution is acceptable, provided it is guaranteed to produce correct results.
For T-SQL solutions, the fewer Hashes, Sorts, and Window Spools/Aggregates in the execution plan, the better. Feel free to add indexes, but separate structures are not allowed (so no pre-computed tables kept in sync with triggers, for example). Reference tables are allowed (tables of numbers, dates, and so on).
Ideally, solutions will produce exactly the same results, in the same order, as the subquery version above, but anything provably correct is also acceptable. Performance is always a consideration, so solutions should be at least reasonably efficient.
Dedicated chat room: I have created a public chat room for discussions related to this question and its answers. Any user with at least 20 reputation points can take part directly. Please ping me in a comment below if you have less than 20 rep and would like to join in.
Great question, Paul! I used a couple different approaches, one in T-SQL and one in CLR.
T-SQL quick summary
The T-SQL approach can be summarized as the following steps:
Using SET STATISTICS IO ON, this approach reports Table 'TransactionHistory'. Scan count 1, logical reads 484, which confirms the "single pass" over the table. For reference, the original loop-seek query reports Table 'TransactionHistory'. Scan count 113444, logical reads 438366.
As reported by SET STATISTICS TIME ON, the CPU time is 514ms. This compares favorably to 2231ms for the original query.
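For anyone reproducing these numbers, the measurements above come from wrapping the query under test like this:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the query being measured here

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;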
CLR quick summary
The CLR approach can be summarized as the following steps:
Using SET STATISTICS IO ON, this approach reports that no logical I/O has occurred! Wow, a perfect solution! (Actually, it seems that SET STATISTICS IO does not report I/O incurred within CLR. But from the code, it is easy to see that exactly one scan of the table is made, retrieving the data in order by the index Paul suggested.)
As reported by SET STATISTICS TIME ON, the CPU time is now 187ms. So this is quite an improvement over the T-SQL approach. Unfortunately, the overall elapsed time of both approaches is very similar at about half a second each. However, the CLR-based approach does have to output 113K rows to the console (vs. just 52K for the T-SQL approach that groups by product/date), so that's why I've focused on CPU time instead.
Another big advantage of this approach is that it yields exactly the same results as the original loop/seek approach, including a row for every transaction even in cases where a product is sold multiple times on the same day. (On AdventureWorks, I specifically compared row-by-row results and confirmed that they tie out with Paul's original query.)
A disadvantage of this approach, at least in its current form, is that it reads all the data into memory. However, the algorithm only strictly needs the current window frame in memory at any given time, and could be updated to work for data sets that exceed memory. Paul has illustrated this point in his answer by producing an implementation of this algorithm that stores only the sliding window in memory. This comes at the expense of granting higher permissions to the CLR assembly, but would definitely be worthwhile in scaling this solution up to arbitrarily large data sets.
T-SQL - one scan, grouped by date
Initial setup
The query
The execution plan
From the execution plan, we see that the original index proposed by Paul is sufficient to allow us to perform a single ordered scan of Production.TransactionHistory, using a merge join to combine the transaction history with each possible product/date combination.
Assumptions
There are a few significant assumptions baked into this approach. I suppose it will be up to Paul to decide whether they are acceptable :)
It relies on the Production.Product table. This table is freely available in AdventureWorks2012 and the relationship is enforced by a foreign key from Production.TransactionHistory, so I interpreted this as fair game.
It assumes that transactions do not fall outside the known overall date range of AdventureWorks2012; if they did, generating the full set of product/date combinations would no longer be possible without first taking a pass over the transaction history.
It produces a result set grouped by product/date, with a NumOrders column to indicate how many sales occurred. See the following screenshot for a comparison of the results of the original query vs. the proposed query in cases where a product was sold multiple times on the same date (e.g., ProductID 319 on 2007-09-05 00:00:00.000).
CLR - one scan, full ungrouped result set
The main function body
There isn't a ton to see here; the main body of the function declares the inputs (which must match the corresponding SQL function), sets up a SQL connection, and opens the SQLReader.
The core logic
I've separated out the main logic so that it's easier to focus on:
Helpers
The following logic could be written inline, but it's a little easier to read when they are split out into their own methods.
Tying it all together in SQL
Everything up to this point has been in C#, so let's see the actual SQL involved. (Alternatively, you can use this deployment script to create the assembly directly from the bits of my assembly rather than compiling yourself.)
Caveats
The CLR approach provides a lot more flexibility to optimize the algorithm, and it could probably be tuned even further by an expert in C#. However, there are also downsides to the CLR strategy. A few things to keep in mind:
It requires setting the database to TRUSTWORTHY and granting EXTERNAL_ACCESS to the CLR assembly. So there is some hassle and a potential security implication, but the payoff is a streaming approach that can better scale to much larger data sets than those on AdventureWorks.
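For reference, granting that access typically looks something like the following; the assembly name and DLL path here are purely illustrative:

EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;

ALTER DATABASE AdventureWorks2012 SET TRUSTWORTHY ON; -- or sign the assembly with a certificate/asymmetric key instead

CREATE ASSEMBLY RollingSumClr              -- hypothetical assembly name
FROM N'C:\Temp\RollingSumClr.dll'          -- hypothetical path to the compiled DLL
WITH PERMISSION_SET = EXTERNAL_ACCESS;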
Bonus: T-SQL #2 - the practical approach I'd actually use
After trying to think about the problem creatively for a while, I thought I'd also post the fairly simple, practical way that I would likely choose to tackle this problem if it came up in my daily work. It does make use of SQL 2012+ window functionality, but not in the type of groundbreaking way that the question was hoping for:
This actually yields a fairly simple overall query plan, even when looking at the two relevant query plans together:
A few reasons I like this approach:
It runs in about 900ms on the provided data set, rather than the 2700ms of the original loop-seek.
A couple potential caveats:
This is a long answer, so I decided to add a summary here.
The main idea is to cross join ProductIDs with the date range of each Product, to sum the costs for each day (because there are several transactions with the same dates), and to join the result back to the original rows.
A second variant avoids the extra scan for date ranges by using external knowledge about the global range of dates, plus the Product table that has all ProductIDs.
I will use the AdventureWorks2014 database and SQL Server Express 2014.
Changes to the original database:
Changed [Production].[TransactionHistory].[TransactionDate] from datetime to date. The time component was zero anyway.
Added a calendar table [dbo].[Calendar].
Added an index to [Production].[TransactionHistory].
The MSDN article on the OVER clause has a link to an excellent blog post about window functions by Itzik Ben-Gan. In that post he explains how OVER works, the difference between the ROWS and RANGE options, and mentions this very problem of calculating a rolling sum over a date interval. He mentions that the current version of SQL Server doesn't implement RANGE in full and doesn't implement temporal interval data types. His explanation of the difference between ROWS and RANGE gave me an idea.
Dates without gaps or duplicates
If the TransactionHistory table contained dates without gaps and without duplicates, the following query would produce correct results:
Indeed, a window of 45 rows would cover exactly 45 days.
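As a minimal sketch of that idea (an illustration only, assuming at most one row per ProductID per day and no missing days):

SELECT
    TH.ProductID,
    TH.TransactionDate,
    TH.ActualCost,
    RollingSum45 = SUM(TH.ActualCost) OVER (
        PARTITION BY TH.ProductID
        ORDER BY TH.TransactionDate
        ROWS BETWEEN 45 PRECEDING AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
    TH.ProductID,
    TH.TransactionDate;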
Dates with gaps, without duplicates
Unfortunately, our data has gaps in the dates. To solve this problem we can use a Calendar table to generate a set of dates without gaps, then LEFT JOIN the original data to this set and use the same query with ROWS BETWEEN 45 PRECEDING AND CURRENT ROW. This would produce correct results only if the dates did not repeat (within the same ProductID).
Dates with gaps, with duplicates
Unfortunately, our data has both gaps in the dates and dates that can repeat within the same ProductID. To solve this problem we can GROUP the original data by ProductID, TransactionDate to generate a set of dates without duplicates. Then use the Calendar table to generate a set of dates without gaps. Then we can use the query with ROWS BETWEEN 45 PRECEDING AND CURRENT ROW to calculate the rolling SUM. This would produce correct results. See the comments in the query below.
I confirmed that this query produces the same results as the approach from the question that uses the subquery.
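A rough sketch of that shape (not the exact query; it assumes TransactionDate already has no time component and that dbo.Calendar contains one row per day in a column named dt):

WITH CTE_ProductRange AS (
    -- date range per product (this is the extra scan discussed below)
    SELECT ProductID,
           MIN(TransactionDate) AS MinDate,
           MAX(TransactionDate) AS MaxDate
    FROM Production.TransactionHistory
    GROUP BY ProductID
), CTE_Daily AS (
    -- one row per product per day that has transactions (removes duplicate dates)
    SELECT ProductID, TransactionDate, SUM(ActualCost) AS DailyCost
    FROM Production.TransactionHistory
    GROUP BY ProductID, TransactionDate
), CTE_AllDays AS (
    -- gap-free sequence of days per product, via the calendar table
    SELECT PR.ProductID, C.dt, D.DailyCost
    FROM CTE_ProductRange AS PR
    JOIN dbo.Calendar AS C
        ON C.dt BETWEEN PR.MinDate AND PR.MaxDate
    LEFT JOIN CTE_Daily AS D
        ON D.ProductID = PR.ProductID
       AND D.TransactionDate = C.dt
)
SELECT ProductID,
       dt AS TransactionDate,
       DailyCost,
       SUM(DailyCost) OVER (
           PARTITION BY ProductID
           ORDER BY dt
           ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
FROM CTE_AllDays;
-- rows where DailyCost IS NULL exist only to keep the 45-row frame aligned with days;
-- filter them out (or join back to Production.TransactionHistory) for the final result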
Execution plans
The first query uses the subquery, the second uses this approach. You can see that the duration and the number of reads are much lower with this approach. Most of the estimated cost in this approach is the final ORDER BY, see below.
The subquery approach has a simple plan with nested loops and O(n*n) complexity.
The plan for this approach scans TransactionHistory several times, but there are no loops. As you can see, more than 70% of the estimated cost is the Sort for the final ORDER BY.
Top result - subquery, bottom - OVER.
Avoiding extra scans
The last Index Scan, Merge Join and Sort in the plan above are caused by the final INNER JOIN with the original table, which makes the final result exactly the same as the slow approach with the subquery. The number of returned rows is the same as in the TransactionHistory table. There are rows in TransactionHistory where several transactions occurred on the same day for the same product. If it is OK to show only a daily summary in the result, then this final JOIN can be removed and the query becomes a little simpler and a little faster. The last Index Scan, Merge Join and Sort from the previous plan are replaced with a Filter, which removes the rows added by the Calendar.
Still, TransactionHistory is scanned twice. One extra scan is needed to get the range of dates for each product. I was interested to see how it compares with another approach, where we use external knowledge about the global range of dates in TransactionHistory, plus an extra table Product that has all ProductIDs, to avoid that extra scan. I removed the calculation of the number of transactions per day from this query to make the comparison valid. It can be added to both queries, but I'd like to keep it simple for the comparison. I also had to use other dates, because I'm using the 2014 version of the database.
Both queries return the same result in the same order.
Comparison
Here are the time and IO statistics.
The two-scan variant is a bit faster and has fewer reads, because the one-scan variant has to use a Worktable a lot. Besides, the one-scan variant generates more rows than needed, as you can see in the plans. It generates dates for each ProductID that is in the Product table, even if a ProductID doesn't have any transactions. There are 504 rows in the Product table, but only 441 products have transactions in TransactionHistory. Also, it generates the same range of dates for each product, which is more than needed. If TransactionHistory had a longer overall history, with each individual product having a relatively short history, the number of extra unneeded rows would be even higher.
On the other hand, it is possible to optimize the two-scan variant a bit further by creating another, narrower index on just (ProductID, TransactionDate). This index would be used to calculate Start/End dates for each product (CTE_Products), and it would have fewer pages than the covering index and as a result cause fewer reads.
So we can choose: either have an extra explicit simple scan, or have an implicit Worktable.
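For example, the narrower index mentioned above could be created like this (the index name is just illustrative):

CREATE INDEX IX_TransactionHistory_ProductID_TransactionDate
ON Production.TransactionHistory (ProductID, TransactionDate);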
BTW, if it is OK to have a result with only daily summaries, then it is better to create an index that doesn't include ReferenceOrderID. It would use fewer pages => less IO.
Single pass solution using CROSS APPLY
This is becoming a really long answer, but here is one more variant that again returns only a daily summary; it does only one scan of the data and it doesn't require external knowledge about the range of dates or the list of ProductIDs. It doesn't do intermediate Sorts either. Overall performance is similar to the previous variants, though it seems to be a bit worse.
The main idea is to use a table of numbers to generate rows that would fill the gaps in the dates. For each existing date, use LEAD to calculate the size of the gap in days and then use CROSS APPLY to add the required number of rows into the result set. At first I tried it with a permanent table of numbers. The plan showed a large number of reads in this table, though the actual duration was pretty much the same as when I generated the numbers on the fly using a CTE.
This plan is "longer", because the query uses two window functions (LEAD and SUM).
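To make the gap-filling idea concrete, here is a rough sketch (not the actual query; the numbers here come from ROW_NUMBER() over a system view purely for brevity, and it assumes no gap is longer than that view has rows):

WITH CTE_Daily AS (
    -- one row per product per day, plus the date of the next transaction day
    SELECT ProductID, TransactionDate,
           SUM(ActualCost) AS DailyCost,
           LEAD(TransactionDate) OVER (PARTITION BY ProductID
                                       ORDER BY TransactionDate) AS NextDate
    FROM Production.TransactionHistory
    GROUP BY ProductID, TransactionDate
), CTE_Filled AS (
    -- CROSS APPLY adds one row per day of the gap, up to (but not including) NextDate
    SELECT D.ProductID,
           DATEADD(DAY, N.n, D.TransactionDate) AS dt,
           CASE WHEN N.n = 0 THEN D.DailyCost END AS DailyCost
    FROM CTE_Daily AS D
    CROSS APPLY (
        SELECT TOP (ISNULL(DATEDIFF(DAY, D.TransactionDate, D.NextDate), 1))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
        FROM sys.all_columns
    ) AS N
)
SELECT ProductID,
       dt AS TransactionDate,
       DailyCost,
       SUM(DailyCost) OVER (
           PARTITION BY ProductID
           ORDER BY dt
           ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
FROM CTE_Filled;
-- filler rows (DailyCost IS NULL) can be filtered out in an outer query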
An alternative SQLCLR solution that executes faster and requires less memory:
Deployment Script
That requires the EXTERNAL_ACCESS permission set because it uses a loopback connection to the target server and database instead of the (slow) context connection. This is how to call the function:
Produces exactly the same results, in the same order, as the question.
Execution plan:
Profiler logical reads: 481
The main advantage of this implementation is that it is faster than using the context connection, and it uses less memory. It only keeps two things in memory at any one time:
This minimal caching should ensure this method scales well; certainly better than trying to hold the entire input set in CLR memory.
Source code
If you are on a 64-bit Enterprise, Developer, or Evaluation edition of SQL Server 2014 you can use In-Memory OLTP. The solution will not be a single scan and will hardly use any window functions at all, but it might add some value to this question, and the algorithm used could possibly serve as inspiration for other solutions.
First you need to enable In-Memory OLTP on the AdventureWorks database.
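Enabling it is typically done by adding a memory-optimized filegroup and container, along these lines (the filegroup name and file path are illustrative):

ALTER DATABASE AdventureWorks2014
    ADD FILEGROUP AdventureWorks2014_imoltp CONTAINS MEMORY_OPTIMIZED_DATA;

ALTER DATABASE AdventureWorks2014
    ADD FILE (NAME = AdventureWorks2014_imoltp_container,
              FILENAME = 'C:\Data\AdventureWorks2014_imoltp')  -- illustrative path
    TO FILEGROUP AdventureWorks2014_imoltp;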
The parameter to the procedure is an In-Memory table variable and that has to be defined as a type.
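The type could look roughly like this; the column list and bucket count here are assumptions based on the description, not the exact definition:

CREATE TYPE dbo.TransactionHistoryRolling AS TABLE
(
    ID              int   NOT NULL,  -- row number; unique per ProductID/TransactionDate combination
    ProductID       int   NOT NULL,
    TransactionDate date  NOT NULL,
    ActualCost      money NOT NULL,
    INDEX ix_product_date HASH (ProductID, TransactionDate) WITH (BUCKET_COUNT = 131072)
)
WITH (MEMORY_OPTIMIZED = ON);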
ID is not unique in this table; it is unique for each combination of ProductID and TransactionDate.
There are some comments in the procedure that tell you what it does, but overall it is calculating the running total in a loop, and for each iteration it does a lookup for the running total as it was 45 days ago (or more).
The current running total minus the running total as it was 45 days ago is the rolling 45 days sum we are looking for.
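The arithmetic behind that can be sketched in plain T-SQL as well (this is only an illustration of the idea, not the In-Memory procedure; #rt and the index are made up for the example):

-- 1) running total per product (RANGE makes duplicate dates share the same total)
SELECT ProductID, TransactionDate,
       SUM(ActualCost) OVER (
           PARTITION BY ProductID
           ORDER BY TransactionDate
           RANGE UNBOUNDED PRECEDING) AS RunningTotal
INTO #rt
FROM Production.TransactionHistory;

CREATE CLUSTERED INDEX cx_rt ON #rt (ProductID, TransactionDate);

-- 2) rolling sum = current running total minus the running total from just before the 46-day window
SELECT cur.ProductID, cur.TransactionDate,
       cur.RunningTotal - ISNULL(prev.RunningTotal, 0) AS RollingSum45
FROM #rt AS cur
OUTER APPLY (
    SELECT TOP (1) p.RunningTotal
    FROM #rt AS p
    WHERE p.ProductID = cur.ProductID
      AND p.TransactionDate < DATEADD(DAY, -45, cur.TransactionDate)
    ORDER BY p.TransactionDate DESC
) AS prev;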
Invoke the procedure like this.
Testing this on my computer, Client Statistics reports a total execution time of about 750 milliseconds. For comparison, the sub-query version takes 3.5 seconds.
Extra ramblings:
This algorithm could also be used in regular T-SQL. Calculate the running total, using range not rows, and store the result in a temp table. Then you can query that table with a self join to the running total as it was 45 days ago and calculate the rolling sum. However, the implementation of range compared to rows is quite slow, due to the fact that it needs to treat duplicates in the order by clause differently, so I did not get all that good performance with this approach. A workaround could be to use another window function like last_value() over a calculated running total using rows to simulate a range running total. Another way is to use max() over(). Both had some issues: finding the appropriate index to use to avoid sorts, and avoiding spools with the max() over() version. I gave up optimising those things, but if you are interested in the code I have so far please let me know.

Well that was fun :) My solution is a bit slower than @GeoffPatterson's, but part of that is the fact that I'm tying back to the original table in order to eliminate one of Geoff's assumptions (i.e. one row per product/date pair). I went with the assumption that this was a simplified version of a final query and may require additional information out of the original table.
Note: I'm borrowing Geoff's calendar table and in fact ended up with a very similar solution:
Here is the query itself:
Basically I decided that the easiest way to deal with it was to use the ROWS option of the frame clause. But that required that I have only one row per ProductID, TransactionDate combination, and not just that, but one row per ProductID and possible date. I did that by combining the Product, calendar and TransactionHistory tables in a CTE. Then I had to create another CTE to generate the rolling information. I had to do this because if I joined it back to the original table directly I got row elimination that threw off my results. After that it was a simple matter of joining my second CTE back to the original table. I did add the TBE column (to be eliminated) to get rid of the blank rows created in the CTEs. Also I used a CROSS APPLY in the initial CTE to generate boundaries for my calendar table.
I then added the recommended index:
And got the final execution plan:
EDIT: In the end I added an index on the calendar table that sped up performance by a reasonable margin.
I have a few alternate solutions that don't use indexes or reference tables. Perhaps they could be useful in situations in which you don't have access to any additional tables and cannot create indexes. It does appear to be possible to get correct results when grouping by TransactionDate with just a single pass of the data and just a single window function. However, I could not figure out a way to do it with just one window function when you cannot group by TransactionDate.
To provide a frame of reference, on my machine the original solution posted in the question has a CPU time of 2808 ms without the covering index and 1950 ms with the covering index. I am testing with the AdventureWorks2014 database and SQL Server Express 2014.
Let's start with a solution for when we can group by TransactionDate. A running sum over the last X days can also be expressed in the following way:
In SQL, one way to express this is by making two copies of your data and, for the second copy, multiplying the cost by -1 and adding X+1 days to the date column. Computing a running sum over all of the data will implement the above formula. I'll show this for some example data. Below is some sample data for a single ProductID. I represent dates as numbers to make the calculations easier. Starting data:
Take the running sum ordered by Date ascending and CopiedRow descending:
Filter out the copied rows to get the desired result:
The following SQL is one way to implement the above algorithm:
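A minimal sketch of such a query, grouping by product and date as in the example above (names and details are illustrative, so this is not necessarily the exact query):

WITH CTE_Daily AS (
    SELECT ProductID, TransactionDate, SUM(ActualCost) AS DailyCost
    FROM Production.TransactionHistory
    GROUP BY ProductID, TransactionDate
), CTE_TwoCopies AS (
    -- original rows plus a negated copy shifted forward by X+1 = 46 days
    SELECT ProductID, TransactionDate, DailyCost, 0 AS CopiedRow
    FROM CTE_Daily
    UNION ALL
    SELECT ProductID, DATEADD(DAY, 46, TransactionDate), -1 * DailyCost, 1
    FROM CTE_Daily
), CTE_Running AS (
    SELECT ProductID, TransactionDate, DailyCost, CopiedRow,
           SUM(DailyCost) OVER (
               PARTITION BY ProductID
               ORDER BY TransactionDate, CopiedRow DESC
               ROWS UNBOUNDED PRECEDING) AS RollingSum45
    FROM CTE_TwoCopies
)
SELECT ProductID, TransactionDate, DailyCost, RollingSum45
FROM CTE_Running
WHERE CopiedRow = 0   -- drop the copied rows only after the running sum is computed
ORDER BY ProductID, TransactionDate;

Note that the filter on CopiedRow has to come after the window function (hence the extra CTE level); otherwise the negated copies would never take part in the running sum.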
On my machine this took 702 ms of CPU time with the covering index and 734 ms of CPU time without the index. The query plan can be found here: https://www.brentozar.com/pastetheplan/?id=SJdCsGVSl
One downside of this solution is that there appears to be an unavoidable sort when ordering by the new TransactionDate column. I don't think that this sort can be resolved by adding indexes because we need to combine two copies of the data before doing the ordering. I was able to get rid of a sort at the end of the query by adding in a different column to ORDER BY. If I ordered by FilterFlag I found that SQL Server would optimize out that column from the sort and would perform an explicit sort.
Solutions for when we need to return a result set with duplicate TransactionDate values for the same ProductId were much more complicated. I would summarize the problem as simultaneously needing to partition by and order by the same column. The syntax that Paul provided resolves that issue, so it's not surprising that it's so difficult to express with the current window functions available in SQL Server (if it wasn't difficult to express there would be no need to expand the syntax).
If I use the above query without grouping then I get different values for the rolling sum when there are multiple rows with the same ProductId and TransactionDate. One way to resolve this is to do the same running sum calculation as above but also to flag the last row in the partition. This can be done with LEAD (assuming ProductID is never NULL) without an additional sort. For the final running sum value, I use MAX as a window function to apply the value in the last row of the partition to all rows in the partition.
On my machine this took 2464ms of CPU time without the covering index. As before there appears to be an unavoidable sort. The query plan can be found here: https://www.brentozar.com/pastetheplan/?id=HyWxhGVBl
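To make that description concrete, a rough sketch of the LEAD/MAX shape could look like this (illustrative only; it uses ReferenceOrderID as a deterministic tie-breaker and is not the query behind the timing above):

WITH CTE_TwoCopies AS (
    SELECT ProductID, TransactionDate, ReferenceOrderID, ActualCost, 0 AS CopiedRow
    FROM Production.TransactionHistory
    UNION ALL
    SELECT ProductID, DATEADD(DAY, 46, TransactionDate), ReferenceOrderID, -1 * ActualCost, 1
    FROM Production.TransactionHistory
), CTE_Running AS (
    SELECT ProductID, TransactionDate, ReferenceOrderID, ActualCost, CopiedRow,
           SUM(ActualCost) OVER (
               PARTITION BY ProductID
               ORDER BY TransactionDate, CopiedRow DESC, ReferenceOrderID
               ROWS UNBOUNDED PRECEDING) AS RunningSum,
           -- flag the last row for each ProductID/TransactionDate
           CASE WHEN LEAD(TransactionDate) OVER (
                         PARTITION BY ProductID
                         ORDER BY TransactionDate, CopiedRow DESC, ReferenceOrderID)
                     = TransactionDate
                THEN 0 ELSE 1 END AS IsLastOfDate
    FROM CTE_TwoCopies
)
SELECT ProductID, TransactionDate, ActualCost,
       -- spread the last row's running sum to every row of the same product/date
       MAX(CASE WHEN IsLastOfDate = 1 THEN RunningSum END)
           OVER (PARTITION BY ProductID, TransactionDate) AS RollingSum45
FROM CTE_Running
WHERE CopiedRow = 0
ORDER BY ProductID, TransactionDate, ReferenceOrderID;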
I think there is room for improvement in the query above. There are certainly other ways to use window functions to get the desired result.