Accepted
Paul White
Asked: 2015-09-08 12:13:41 +0800 CST

Date range rolling sum using window functions

  • 772

I need to calculate a rolling sum over a date range. To illustrate, using the AdventureWorks sample database, the following hypothetical syntax would do exactly what I need:

SELECT
    TH.ProductID,
    TH.TransactionDate,
    TH.ActualCost,
    RollingSum45 = SUM(TH.ActualCost) OVER (
        PARTITION BY TH.ProductID
        ORDER BY TH.TransactionDate
        RANGE BETWEEN 
            INTERVAL 45 DAY PRECEDING
            AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
    TH.ProductID,
    TH.TransactionDate,
    TH.ReferenceOrderID;

Unfortunately, the RANGE window frame extent does not currently allow an interval in SQL Server.

I know I can write a solution using a subquery and a regular (non-window) aggregate:

SELECT 
    TH.ProductID,
    TH.TransactionDate,
    TH.ActualCost,
    RollingSum45 =
    (
        SELECT SUM(TH2.ActualCost)
        FROM Production.TransactionHistory AS TH2
        WHERE
            TH2.ProductID = TH.ProductID
            AND TH2.TransactionDate <= TH.TransactionDate
            AND TH2.TransactionDate >= DATEADD(DAY, -45, TH.TransactionDate)
    )
FROM Production.TransactionHistory AS TH
ORDER BY
    TH.ProductID,
    TH.TransactionDate,
    TH.ReferenceOrderID;

Given the following index:

CREATE UNIQUE INDEX i
ON Production.TransactionHistory
    (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE
    (ActualCost);

The execution plan is:

Execution plan

While not terribly inefficient, it seems it should be possible to express this query using only the window aggregate and analytic functions supported in SQL Server 2012, 2014, or 2016 (so far).

To be clear, I am looking for a solution that performs a single pass over the data.

In T-SQL, this is likely to mean that the OVER clause will do the work, and the execution plan will feature Window Spools and Window Aggregates. All language elements that use the OVER clause are fair game. A SQLCLR solution is acceptable, provided it is guaranteed to produce correct results.

For T-SQL solutions, the fewer Hashes, Sorts, and Window Spools/Aggregates in the execution plan, the better. Feel free to add indexes, but separate structures are not allowed (so no pre-computed tables kept in sync with triggers, for example). Reference tables are allowed (tables of numbers, dates, and so on).

Ideally, solutions will produce exactly the same results in the same order as the subquery version above, but anything provably correct is also acceptable. Performance is always a consideration, so solutions should be at least reasonably efficient.

Dedicated chat room: I have created a public chat room for discussions related to this question and its answers. Any user with at least 20 reputation points can take part directly. Please ping me in a comment below if you have less than 20 rep and would like to join.

  • Discussion for "Date range rolling sum using window functions"
sql-server sql-server-2012
  • 6 Answers
  • 62821 Views

6 Answers

  • Voted
  1. Best Answer
    Geoff Patterson
    2015-09-09T07:41:26+08:00

    Great question, Paul! I used a couple different approaches, one in T-SQL and one in CLR.

    T-SQL quick summary

    The T-SQL approach can be summarized as the following steps:

    • Take the cross-product of products/dates
    • Merge in the observed sales data
    • Aggregate that data to the product/date level
    • Compute rolling sums over the past 45 days based on this aggregate data (which contains any "missing" days filled in)
    • Filter those results to only the product/date pairings that had one or more sales

    Using SET STATISTICS IO ON, this approach reports Table 'TransactionHistory'. Scan count 1, logical reads 484, which confirms the "single pass" over the table. For reference, the original loop-seek query reports Table 'TransactionHistory'. Scan count 113444, logical reads 438366.

    As reported by SET STATISTICS TIME ON, the CPU time is 514ms. This compares favorably to 2231ms for the original query.
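
    (For reference, those measurements can be reproduced by wrapping any of the queries below in the standard statistics switches; the query itself is omitted in this sketch.)

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    
    -- ... run the candidate query here ...
    
    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;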

    CLR quick summary

    The CLR approach can be summarized as the following steps:

    • Read the data into memory, ordered by product and date
    • While processing each transaction, add to a running total of the costs. Whenever a transaction is a different product than the previous transaction, reset the running total to 0.
    • Maintain a pointer to the first transaction that has the same (product, date) as the current transaction. Whenever the last transaction with that (product, date) is encountered, compute the rolling sum for that transaction and apply it to all transactions with the same (product, date)
    • Return all of the results to the user!

    Using SET STATISTICS IO ON, this approach reports that no logical I/O has occurred! Wow, a perfect solution! (Actually, it seems that SET STATISTICS IO does not report I/O incurred within CLR. But from the code, it is easy to see that exactly one scan of the table is made, retrieving the data in order by the index Paul suggested.)

    As reported by SET STATISTICS TIME ON, the CPU time is now 187ms. So this is quite an improvement over the T-SQL approach. Unfortunately, the overall elapsed time of both approaches is very similar at about half a second each. However, the CLR based approach does have to output 113K rows to the console (vs. just 52K for the T-SQL approach that groups by product/date), so that's why I've focused on CPU time instead.

    Another big advantage of this approach is that it yields exactly the same results as the original loop/seek approach, including a row for every transaction even in cases where a product is sold multiple times on the same day. (On AdventureWorks, I specifically compared row-by-row results and confirmed that they tie out with Paul's original query.)
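
    One way to do that kind of comparison is a pair of EXCEPT queries over the two result sets; this is only a sketch, assuming both outputs have first been materialized into temp tables #clrResults and #originalResults with identical column lists (EXCEPT ignores duplicate row counts, so an exact tie-out may also want a GROUP BY comparison):

    -- Both queries should return zero rows
    SELECT * FROM #clrResults
    EXCEPT
    SELECT * FROM #originalResults;
    
    SELECT * FROM #originalResults
    EXCEPT
    SELECT * FROM #clrResults;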

    A disadvantage of this approach, at least in its current form, is that it reads all the data into memory. However, the algorithm as designed only strictly needs the current window frame in memory at any given time, and could be updated to work for data sets that exceed memory. Paul has illustrated this point in his answer by producing an implementation of this algorithm that stores only the sliding window in memory. This comes at the expense of granting higher permissions to the CLR assembly, but would definitely be worthwhile in scaling this solution up to arbitrarily large data sets.


    T-SQL - one scan, grouped by date

    Initial setup

    USE AdventureWorks2012
    GO
    -- Create Paul's index
    CREATE UNIQUE INDEX i
    ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
    INCLUDE (ActualCost);
    GO
    -- Build calendar table for 2000 ~ 2020
    CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
    GO
    DECLARE @d DATETIME = '1/1/2000'
    WHILE (@d < '1/1/2021')
    BEGIN
        INSERT INTO dbo.calendar (d) VALUES (@d)
        SELECT @d =  DATEADD(DAY, 1, @d)
    END
    GO
    

    The query

    DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
            @maxAnalysisDate DATE = '2008-09-03'  -- Customizable end date depending on business needs
    SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
    FROM (
        SELECT ProductID, TransactionDate, NumOrders, ActualCost,
            SUM(ActualCost) OVER (
                    PARTITION BY ProductId ORDER BY TransactionDate 
                    ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
                ) AS RollingSum45
        FROM (
            -- The full cross-product of products and dates, combined with actual cost information for that product/date
            SELECT p.ProductID, c.d AS TransactionDate,
                COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
            FROM Production.Product p
            JOIN dbo.calendar c
                ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
            LEFT OUTER JOIN Production.TransactionHistory TH
                ON TH.ProductId = p.productId
                AND TH.TransactionDate = c.d
            GROUP BY P.ProductID, c.d
        ) aggsByDay
    ) rollingSums
    WHERE NumOrders > 0
    ORDER BY ProductID, TransactionDate
    -- MAXDOP 1 to avoid parallel scan inflating the scan count
    OPTION (MAXDOP 1)
    

    The execution plan

    From the execution plan, we see that the original index proposed by Paul is sufficient to allow us to perform a single ordered scan of Production.TransactionHistory, using a merge join to combine the transaction history with each possible product/date combination.

    enter image description here

    Assumptions

    There are a few significant assumptions baked into this approach. I suppose it will be up to Paul to decide whether they are acceptable :)

    • I am using the Production.Product table. This table is freely available on AdventureWorks2012 and the relationship is enforced by a foreign key from Production.TransactionHistory, so I interpreted this as fair game.
    • This approach relies on the fact that transactions do not have a time component on AdventureWorks2012; if they did, generating the full set of product/date combinations would no longer be possible without first taking a pass over the transaction history. (A quick verification query is sketched just after this list.)
    • I am producing a rowset that contains just one row per product/date pair. I think that this is "arguably correct" and in many cases a more desirable result to return. For each product/date, I have added a NumOrders column to indicate how many sales occurred. See the following screenshot for a comparison of the results of the original query vs. the proposed query in cases where a product was sold multiple times on the same date (e.g., 319 / 2007-09-05 00:00:00.000)
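
    A quick way to check the no-time-component assumption (a minimal sketch; a nonzero count would invalidate it):

    SELECT COUNT(*) AS RowsWithTimeComponent
    FROM Production.TransactionHistory
    WHERE TransactionDate <> CAST(CAST(TransactionDate AS DATE) AS DATETIME);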

    enter image description here


    CLR - one scan, full ungrouped result set

    The main function body

    There isn't a ton to see here; the main body of the function declares the inputs (which must match the corresponding SQL function), sets up a SQL connection, and opens the SQLReader.

    // SQL CLR function for rolling SUMs on AdventureWorks2012.Production.TransactionHistory
    [SqlFunction(DataAccess = DataAccessKind.Read,
        FillRowMethodName = "RollingSum_Fill",
        TableDefinition = "ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT," +
                          "ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT")]
    public static IEnumerable RollingSumTvf(SqlInt32 rollingPeriodDays) {
        using (var connection = new SqlConnection("context connection=true;")) {
            connection.Open();
            List<TrxnRollingSum> trxns;
            using (var cmd = connection.CreateCommand()) {
                //Read the transaction history (note: the order is important!)
                cmd.CommandText = @"SELECT ProductId, TransactionDate, ReferenceOrderID,
                                        CAST(ActualCost AS FLOAT) AS ActualCost 
                                    FROM Production.TransactionHistory 
                                    ORDER BY ProductId, TransactionDate";
                using (var reader = cmd.ExecuteReader()) {
                    trxns = ComputeRollingSums(reader, rollingPeriodDays.Value);
                }
            }
    
            return trxns;
        }
    }
    

    The core logic

    I've separated out the main logic so that it's easier to focus on:

    // Given a SqlReader with transaction history data, computes / returns the rolling sums
    private static List<TrxnRollingSum> ComputeRollingSums(SqlDataReader reader,
                                                            int rollingPeriodDays) {
        var startIndexOfRollingPeriod = 0;
        var rollingSumIndex = 0;
        var trxns = new List<TrxnRollingSum>();
    
        // Prior to the loop, initialize "next" to be the first transaction
        var nextTrxn = GetNextTrxn(reader, null);
        while (nextTrxn != null)
        {
            var currTrxn = nextTrxn;
            nextTrxn = GetNextTrxn(reader, currTrxn);
            trxns.Add(currTrxn);
    
            // If the next transaction is not the same product/date as the current
            // transaction, we can finalize the rolling sum for the current transaction
            // and all previous transactions for the same product/date
            var finalizeRollingSum = nextTrxn == null || (nextTrxn != null &&
                                    (currTrxn.ProductId != nextTrxn.ProductId ||
                                    currTrxn.TransactionDate != nextTrxn.TransactionDate));
            if (finalizeRollingSum)
            {
                // Advance the pointer to the first transaction (for the same product)
                // that occurs within the rolling period
                while (startIndexOfRollingPeriod < trxns.Count
                    && trxns[startIndexOfRollingPeriod].TransactionDate <
                        currTrxn.TransactionDate.AddDays(-1 * rollingPeriodDays))
                {
                    startIndexOfRollingPeriod++;
                }
    
                // Compute the rolling sum as the cumulative sum (for this product),
                // minus the cumulative sum for prior to the beginning of the rolling window
                var sumPriorToWindow = trxns[startIndexOfRollingPeriod].PrevSum;
                var rollingSum = currTrxn.ActualCost + currTrxn.PrevSum - sumPriorToWindow;
                // Fill in the rolling sum for all transactions sharing this product/date
                while (rollingSumIndex < trxns.Count)
                {
                    trxns[rollingSumIndex++].RollingSum = rollingSum;
                }
            }
    
            // If this is the last transaction for this product, reset the rolling period
            if (nextTrxn != null && currTrxn.ProductId != nextTrxn.ProductId)
            {
                startIndexOfRollingPeriod = trxns.Count;
            }
        }
    
        return trxns;
    }
    

    Helpers

    The following pieces of logic could be written inline, but they are a little easier to read when split out into their own methods.

    private static TrxnRollingSum GetNextTrxn(SqlDataReader r, TrxnRollingSum currTrxn) {
        TrxnRollingSum nextTrxn = null;
        if (r.Read()) {
            nextTrxn = new TrxnRollingSum {
                ProductId = r.GetInt32(0),
                TransactionDate = r.GetDateTime(1),
                ReferenceOrderId = r.GetInt32(2),
                ActualCost = r.GetDouble(3),
                PrevSum = 0 };
            if (currTrxn != null) {
                nextTrxn.PrevSum = (nextTrxn.ProductId == currTrxn.ProductId)
                        ? currTrxn.PrevSum + currTrxn.ActualCost : 0;
            }
        }
        return nextTrxn;
    }
    
    // Represents the output to be returned
    // Note that the ReferenceOrderId/PrevSum fields are for debugging only
    private class TrxnRollingSum {
        public int ProductId { get; set; }
        public DateTime TransactionDate { get; set; }
        public int ReferenceOrderId { get; set; }
        public double ActualCost { get; set; }
        public double PrevSum { get; set; }
        public double RollingSum { get; set; }
    }
    
    // The function that generates the result data for each row
    // (Such a function is mandatory for SQL CLR table-valued functions)
    public static void RollingSum_Fill(object trxnWithRollingSumObj,
                                        out int productId,
                                        out DateTime transactionDate, 
                                        out int referenceOrderId, out double actualCost,
                                        out double prevCumulativeSum,
                                        out double rollingSum) {
        var trxn = (TrxnRollingSum)trxnWithRollingSumObj;
        productId = trxn.ProductId;
        transactionDate = trxn.TransactionDate;
        referenceOrderId = trxn.ReferenceOrderId;
        actualCost = trxn.ActualCost;
        prevCumulativeSum = trxn.PrevSum;
        rollingSum = trxn.RollingSum;
    }
    

    Tying it all together in SQL

    Everything up to this point has been in C#, so let's see the actual SQL involved. (Alternatively, you can use this deployment script to create the assembly directly from the bits of my assembly rather than compiling yourself.)

    USE AdventureWorks2012; /* GPATTERSON2\SQL2014DEVELOPER */
    GO
    
    -- Enable CLR
    EXEC sp_configure 'clr enabled', 1;
    GO
    RECONFIGURE;
    GO
    
    -- Create the assembly based on the dll generated by compiling the CLR project
    -- I've also included the "assembly bits" version that can be run without compiling
    CREATE ASSEMBLY ClrPlayground
    -- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version
    FROM 'C:\FullPathGoesHere\ClrPlayground\bin\Debug\ClrPlayground.dll'
    WITH PERMISSION_SET = safe;
    GO
    
    --Create a function from the assembly
    CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT)
    RETURNS TABLE ( ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT,
                    ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT)
    -- The function yields rows in order, so let SQL Server know to avoid an extra sort
    ORDER (ProductID, TransactionDate, ReferenceOrderID)
    AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf;
    GO
    
    -- Now we can actually use the TVF!
    SELECT * 
    FROM dbo.RollingSumTvf(45) 
    ORDER BY ProductId, TransactionDate, ReferenceOrderId
    GO
    

    Caveats

    The CLR approach provides a lot more flexibility to optimize the algorithm, and it could probably be tuned even further by an expert in C#. However, there are also downsides to the CLR strategy. A few things to keep in mind:

    • This CLR approach keeps a copy of the data set in memory. It is possible to use a streaming approach, but I encountered initial difficulties and found that there is an outstanding Connect issue complaining that changes in SQL 2008+ make it more difficult to use this type of approach. It's still possible (as Paul demonstrates), but requires a higher level of permissions by setting the database as TRUSTWORTHY and granting EXTERNAL_ACCESS to the CLR assembly (a sketch of that setup follows this list). So there is some hassle and a potential security implication, but the payoff is a streaming approach that can better scale to much larger data sets than those in AdventureWorks.
    • CLR may be less accessible to some DBAs, making such a function more of a black box that is not as transparent, not as easily modified, not as easily deployed, and perhaps not as easily debugged. This is a pretty big disadvantage when compared to a T-SQL approach.
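
    For completeness, the extra permission setup mentioned in the first bullet would look roughly like this. This is only a sketch: the database and assembly names come from the scripts above, and in production, signing the assembly with a certificate or asymmetric key is generally preferable to TRUSTWORTHY.

    ALTER DATABASE AdventureWorks2012 SET TRUSTWORTHY ON;
    GO
    
    -- Re-grant the assembly the broader permission set needed for the streaming (loopback) approach
    ALTER ASSEMBLY ClrPlayground WITH PERMISSION_SET = EXTERNAL_ACCESS;
    GO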

    Bonus: T-SQL #2 - the practical approach I'd actually use

    After trying to think about the problem creatively for a while, I thought I'd also post the fairly simple, practical way that I would likely choose to tackle this problem if it came up in my daily work. It does make use of SQL 2012+ window functionality, but not in the kind of groundbreaking way that the question was hoping for:

    -- Compute all running costs into a #temp table; Note that this query could simply read
    -- from Production.TransactionHistory, but a CROSS APPLY by product allows the window 
    -- function to be computed independently per product, supporting a parallel query plan
    SELECT t.*
    INTO #runningCosts
    FROM Production.Product p
    CROSS APPLY (
        SELECT t.ProductId, t.TransactionDate, t.ReferenceOrderId, t.ActualCost,
            -- Running sum of the cost for this product, including all ties on TransactionDate
            SUM(t.ActualCost) OVER (
                ORDER BY t.TransactionDate 
                RANGE UNBOUNDED PRECEDING) AS RunningCost
        FROM Production.TransactionHistory t
        WHERE t.ProductId = p.ProductId
    ) t
    GO
    
    -- Key the table in our output order
    ALTER TABLE #runningCosts
    ADD PRIMARY KEY (ProductId, TransactionDate, ReferenceOrderId)
    GO
    
    SELECT r.ProductId, r.TransactionDate, r.ReferenceOrderId, r.ActualCost,
        -- Cumulative running cost - running cost prior to the sliding window
        r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45
    FROM #runningCosts r
    OUTER APPLY (
        -- For each transaction, find the running cost just before the sliding window begins
        SELECT TOP 1 b.RunningCost
        FROM #runningCosts b
        WHERE b.ProductId = r.ProductId
            AND b.TransactionDate < DATEADD(DAY, -45, r.TransactionDate)
        ORDER BY b.TransactionDate DESC
    ) w
    ORDER BY r.ProductId, r.TransactionDate, r.ReferenceOrderId
    GO
    

    This actually yields a fairly simple overall query plan, even when looking at the two relevant query plans together:

    enter image description here enter image description here

    A few reasons I like this approach:

    • It yields the full result set requested in the problem statement (as opposed to most other T-SQL solutions, which return a grouped version of the results).
    • It is easy to explain, understand, and debug; I won't come back a year later and wonder how the heck I can make a small change without ruining the correctness or performance
    • It runs in about 900ms on the provided data set, rather than the 2700ms of the original loop-seek
    • If the data were much denser (more transactions per day), the computational complexity does not grow quadratically with the number of transactions in the sliding window (as it does for the original query); I think this addresses part of Paul's concern about wanting to avoid multiple scans
    • It results in essentially no tempdb I/O in recent updates of SQL 2012+ due to new tempdb lazy write functionality
    • For very large data sets, it is trivial to split the work into separate batches for each product if memory pressure were to become a concern

    A couple potential caveats:

    • While it technically does scan Production.TransactionHistory just once, it's not truly a "one scan" approach because it materializes a #temp table of similar size and needs to perform additional logical I/O on that table as well. However, I don't see this as too different from a separate work table that we have more manual control over, since we have defined its precise structure
    • Depending on your environment, the usage of tempdb could be viewed as a positive (e.g., it's on a separate set of SSD drives) or a negative (high concurrency on the server, lots of tempdb contention already)
    • 44
  2. Vladimir Baranov
    2015-09-08T23:34:21+08:00

    This is a long answer, so I decided to add a summary here.

    • Initially I present a solution that produces exactly the same result, in the same order, as in the question. It scans the main table 3 times: to get a list of ProductIDs with the range of dates for each Product, to sum up the costs for each day (because there are several transactions with the same dates), and to join the result with the original rows.
    • Then I compare two approaches that simplify the task and avoid one last scan of the main table. Their result is a daily summary, i.e. if several transactions for a Product have the same date, they are rolled into a single row. My approach from the previous step scans the table twice. Geoff Patterson's approach scans the table once, because he uses external knowledge about the range of dates and the list of Products.
    • Finally, I present a single-pass solution that again returns a daily summary, but it doesn't require external knowledge about the range of dates or the list of ProductIDs.

    I will use the AdventureWorks2014 database and SQL Server Express 2014.

    Changes to the original database:

    • Changed the type of [Production].[TransactionHistory].[TransactionDate] from datetime to date. The time component was zero anyway.
    • Added a calendar table [dbo].[Calendar]
    • Added an index to [Production].[TransactionHistory]

    CREATE TABLE [dbo].[Calendar]
    (
        [dt] [date] NOT NULL,
        CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED 
    (
        [dt] ASC
    ))
    
    CREATE UNIQUE NONCLUSTERED INDEX [i] ON [Production].[TransactionHistory]
    (
        [ProductID] ASC,
        [TransactionDate] ASC,
        [ReferenceOrderID] ASC
    )
    INCLUDE ([ActualCost])
    
    -- Init calendar table
    INSERT INTO dbo.Calendar (dt)
    SELECT TOP (50000)
        DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '2000-01-01') AS dt
    FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
    OPTION (MAXDOP 1);
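
    The first change listed above (TransactionDate from datetime to date) is not part of this script. A minimal sketch of it would be something like the following; note it would have to run before the index above is created (or after dropping it), because an indexed column cannot be altered:

    ALTER TABLE [Production].[TransactionHistory]
    ALTER COLUMN [TransactionDate] date NOT NULL;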
    

    The MSDN article on the OVER clause has a link to an excellent blog post about window functions by Itzik Ben-Gan. In that post he explains how OVER works, the difference between the ROWS and RANGE options, and mentions this very problem of calculating a rolling sum over a date range. He mentions that the current version of SQL Server doesn't implement RANGE in full and doesn't implement temporal interval data types. His explanation of the difference between ROWS and RANGE gave me an idea.
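
    A tiny illustration of that ROWS vs RANGE difference (a sketch with made-up values): RANGE treats rows with the same ORDER BY value as peers and gives them all the same running total, while ROWS advances one physical row at a time.

    WITH T AS
    (
        SELECT * FROM (VALUES
            (CAST('2015-01-01' AS date), 10.0),
            ('2015-01-01', 20.0),  -- duplicate date: a "peer" for RANGE
            ('2015-01-02',  5.0)
        ) AS v(dt, cost)
    )
    SELECT dt, cost,
        SUM(cost) OVER (ORDER BY dt ROWS  UNBOUNDED PRECEDING) AS RunningRows,  -- e.g. 10, 30, 35 (tie order is arbitrary)
        SUM(cost) OVER (ORDER BY dt RANGE UNBOUNDED PRECEDING) AS RunningRange  -- 30, 30, 35 (peers share the total)
    FROM T;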

    Dates without gaps and duplicates

    If the TransactionHistory table contained dates without gaps and without duplicates, the following query would produce correct results:

    SELECT
        TH.ProductID,
        TH.TransactionDate,
        TH.ActualCost,
        RollingSum45 = SUM(TH.ActualCost) OVER (
            PARTITION BY TH.ProductID
            ORDER BY TH.TransactionDate
            ROWS BETWEEN 
                45 PRECEDING
                AND CURRENT ROW)
    FROM Production.TransactionHistory AS TH
    ORDER BY
        TH.ProductID,
        TH.TransactionDate,
        TH.ReferenceOrderID;
    

    Indeed, a window of 45 rows would cover exactly 45 days.

    Dates with gaps, without duplicates

    Unfortunately, our data does have gaps in the dates. To solve this problem we can use a Calendar table to generate a set of dates without gaps, then LEFT JOIN the original data to this set and use the same query with ROWS BETWEEN 45 PRECEDING AND CURRENT ROW. This would produce correct results only if dates do not repeat (within the same ProductID).
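
    A sketch of that idea (hedged: correct only under the no-duplicate-dates assumption above, and the gap-filler rows added by Calendar would still need to be removed in an outer query):

    WITH CTE_Products AS
    (
        SELECT ProductID, MIN(TransactionDate) AS MinDate, MAX(TransactionDate) AS MaxDate
        FROM Production.TransactionHistory
        GROUP BY ProductID
    )
    SELECT
        P.ProductID,
        C.dt AS TransactionDate,
        TH.ActualCost,
        SUM(TH.ActualCost) OVER (
            PARTITION BY P.ProductID
            ORDER BY C.dt
            ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
    FROM CTE_Products AS P
    INNER JOIN dbo.Calendar AS C ON
        C.dt >= P.MinDate AND
        C.dt <= P.MaxDate
    LEFT JOIN Production.TransactionHistory AS TH ON
        TH.ProductID = P.ProductID AND
        TH.TransactionDate = C.dt;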

    Dates with gaps and duplicates

    Unfortunately, our data has gaps in the dates and dates can repeat within the same ProductID. To solve this problem we can GROUP the original data by ProductID, TransactionDate to generate a set of dates without duplicates. Then use the Calendar table to generate a set of dates without gaps. Then we can use the query with ROWS BETWEEN 45 PRECEDING AND CURRENT ROW to calculate the rolling SUM. This would produce correct results. See the comments in the query below.

    WITH
    
    -- calculate Start/End dates for each product
    CTE_Products
    AS
    (
        SELECT TH.ProductID
            ,MIN(TH.TransactionDate) AS MinDate
            ,MAX(TH.TransactionDate) AS MaxDate
        FROM [Production].[TransactionHistory] AS TH
        GROUP BY TH.ProductID
    )
    
    -- generate set of dates without gaps for each product
    ,CTE_ProductsWithDates
    AS
    (
        SELECT CTE_Products.ProductID, C.dt
        FROM
            CTE_Products
            INNER JOIN dbo.Calendar AS C ON
                C.dt >= CTE_Products.MinDate AND
                C.dt <= CTE_Products.MaxDate
    )
    
    -- generate set of dates without duplicates for each product
    -- calculate daily cost as well
    ,CTE_DailyCosts
    AS
    (
        SELECT TH.ProductID, TH.TransactionDate, SUM(ActualCost) AS DailyActualCost
        FROM [Production].[TransactionHistory] AS TH
        GROUP BY TH.ProductID, TH.TransactionDate
    )
    
    -- calculate rolling sum over 45 days
    ,CTE_Sum
    AS
    (
        SELECT
            CTE_ProductsWithDates.ProductID
            ,CTE_ProductsWithDates.dt
            ,CTE_DailyCosts.DailyActualCost
            ,SUM(CTE_DailyCosts.DailyActualCost) OVER (
                PARTITION BY CTE_ProductsWithDates.ProductID
                ORDER BY CTE_ProductsWithDates.dt
                ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
        FROM
            CTE_ProductsWithDates
            LEFT JOIN CTE_DailyCosts ON 
                CTE_DailyCosts.ProductID = CTE_ProductsWithDates.ProductID AND
                CTE_DailyCosts.TransactionDate = CTE_ProductsWithDates.dt
    )
    
    -- remove rows that were added by Calendar, which fill the gaps in dates
    -- add back duplicate dates that were removed by GROUP BY
    SELECT
        TH.ProductID
        ,TH.TransactionDate
        ,TH.ActualCost
        ,CTE_Sum.RollingSum45
    FROM
        [Production].[TransactionHistory] AS TH
        INNER JOIN CTE_Sum ON
            CTE_Sum.ProductID = TH.ProductID AND
            CTE_Sum.dt = TH.TransactionDate
    ORDER BY
        TH.ProductID
        ,TH.TransactionDate
        ,TH.ReferenceOrderID
    ;
    

    I confirmed that this query produces the same results as the approach from the question that uses the subquery.

    Execution plans

    Statistics

    The first query uses the subquery, the second one uses this approach. You can see that the duration and the number of reads are much lower with this approach. The majority of the estimated cost in this approach is the final ORDER BY, see below.

    subquery

    The subquery approach has a simple plan with nested loops and O(n*n) complexity.

    over

    The plan for this approach scans TransactionHistory several times, but there are no loops. As you can see, more than 70% of the estimated cost is the Sort for the final ORDER BY.

    io

    Top result - subquery, bottom - OVER.


    Avoiding extra scans

    The last Index Scan, Merge Join and Sort in the plan above is caused by the final INNER JOIN with the original table, which makes the final result exactly the same as the slow approach with the subquery. The number of returned rows is the same as in the TransactionHistory table. There are rows in TransactionHistory where several transactions occurred on the same day for the same product. If it is OK to show only the daily summary in the result, then this final JOIN can be removed and the query becomes a bit simpler and a bit faster. The last Index Scan, Merge Join and Sort from the previous plan are replaced with a Filter, which removes the rows added by Calendar.

    WITH
    -- two scans
    -- calculate Start/End dates for each product
    CTE_Products
    AS
    (
        SELECT TH.ProductID
            ,MIN(TH.TransactionDate) AS MinDate
            ,MAX(TH.TransactionDate) AS MaxDate
        FROM [Production].[TransactionHistory] AS TH
        GROUP BY TH.ProductID
    )
    
    -- generate set of dates without gaps for each product
    ,CTE_ProductsWithDates
    AS
    (
        SELECT CTE_Products.ProductID, C.dt
        FROM
            CTE_Products
            INNER JOIN dbo.Calendar AS C ON
                C.dt >= CTE_Products.MinDate AND
                C.dt <= CTE_Products.MaxDate
    )
    
    -- generate set of dates without duplicates for each product
    -- calculate daily cost as well
    ,CTE_DailyCosts
    AS
    (
        SELECT TH.ProductID, TH.TransactionDate, SUM(ActualCost) AS DailyActualCost
        FROM [Production].[TransactionHistory] AS TH
        GROUP BY TH.ProductID, TH.TransactionDate
    )
    
    -- calculate rolling sum over 45 days
    ,CTE_Sum
    AS
    (
        SELECT
            CTE_ProductsWithDates.ProductID
            ,CTE_ProductsWithDates.dt
            ,CTE_DailyCosts.DailyActualCost
            ,SUM(CTE_DailyCosts.DailyActualCost) OVER (
                PARTITION BY CTE_ProductsWithDates.ProductID
                ORDER BY CTE_ProductsWithDates.dt
                ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
        FROM
            CTE_ProductsWithDates
            LEFT JOIN CTE_DailyCosts ON 
                CTE_DailyCosts.ProductID = CTE_ProductsWithDates.ProductID AND
                CTE_DailyCosts.TransactionDate = CTE_ProductsWithDates.dt
    )
    
    -- remove rows that were added by Calendar, which fill the gaps in dates
    SELECT
        CTE_Sum.ProductID
        ,CTE_Sum.dt AS TransactionDate
        ,CTE_Sum.DailyActualCost
        ,CTE_Sum.RollingSum45
    FROM CTE_Sum
    WHERE CTE_Sum.DailyActualCost IS NOT NULL
    ORDER BY
        CTE_Sum.ProductID
        ,CTE_Sum.dt
    ;
    

    two scans

    Still, TransactionHistory is scanned twice. An extra scan is needed to get the range of dates for each product. I was interested to see how it compares with another approach, where we use external knowledge about the global range of dates in TransactionHistory, plus an extra table Product that has all the ProductIDs, to avoid that extra scan. I removed the calculation of the number of transactions per day from this query to make the comparison valid. It can be added to both queries, but I'd like to keep it simple for the comparison. I also had to use other dates, because I use the 2014 version of the database.

    DECLARE @minAnalysisDate DATE = '2013-07-31', 
    -- Customizable start date depending on business needs
            @maxAnalysisDate DATE = '2014-08-03'  
    -- Customizable end date depending on business needs
    SELECT 
        -- one scan
        ProductID, TransactionDate, ActualCost, RollingSum45
    --, NumOrders
    FROM (
        SELECT ProductID, TransactionDate, 
        --NumOrders, 
        ActualCost,
            SUM(ActualCost) OVER (
                    PARTITION BY ProductId ORDER BY TransactionDate 
                    ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
                ) AS RollingSum45
        FROM (
            -- The full cross-product of products and dates, 
            -- combined with actual cost information for that product/date
            SELECT p.ProductID, c.dt AS TransactionDate,
                --COUNT(TH.ProductId) AS NumOrders, 
                SUM(TH.ActualCost) AS ActualCost
            FROM Production.Product p
            JOIN dbo.calendar c
                ON c.dt BETWEEN @minAnalysisDate AND @maxAnalysisDate
            LEFT OUTER JOIN Production.TransactionHistory TH
                ON TH.ProductId = p.productId
                AND TH.TransactionDate = c.dt
            GROUP BY P.ProductID, c.dt
        ) aggsByDay
    ) rollingSums
    --WHERE NumOrders > 0
    WHERE ActualCost IS NOT NULL
    ORDER BY ProductID, TransactionDate
    -- MAXDOP 1 to avoid parallel scan inflating the scan count
    OPTION (MAXDOP 1);
    

    one scan

    Both queries return the same result in the same order.

    Comparison

    Here are the time and IO statistics.

    statistics2

    io2

    The two-scan variant is a bit faster and has fewer reads, because the one-scan variant has to use a Worktable a lot. Besides, the one-scan variant generates more rows than needed, as you can see in the plans. It generates dates for each ProductID that is in the Product table, even if a ProductID doesn't have any transactions. There are 504 rows in the Product table, but only 441 products have transactions in TransactionHistory. Also, it generates the same range of dates for each product, which is more than needed. If TransactionHistory had a longer overall history, with each individual product having a relatively short history, the number of extra unneeded rows would be even higher.
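
    Those row counts are easy to verify (a quick sketch):

    SELECT COUNT(*) AS ProductRows
    FROM Production.Product;                          -- 504
    
    SELECT COUNT(DISTINCT ProductID) AS ProductsWithTransactions
    FROM Production.TransactionHistory;               -- 441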

    On the other hand, it is possible to optimize the two-scan variant a bit further by creating another, narrower index on just (ProductID, TransactionDate). This index would be used to calculate the Start/End dates for each product (CTE_Products); it would have fewer pages than the covering index and as a result cause fewer reads.
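
    A sketch of that narrower index (the name is arbitrary):

    CREATE NONCLUSTERED INDEX [i_ProductID_TransactionDate] ON [Production].[TransactionHistory]
    (
        [ProductID] ASC,
        [TransactionDate] ASC
    );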

    So, we can choose: either have an extra explicit simple scan, or have an implicit Worktable.

    BTW, if it is OK to have a result with only daily summaries, then it is better to create an index that doesn't include ReferenceOrderID. It would use fewer pages => less IO.

    CREATE NONCLUSTERED INDEX [i2] ON [Production].[TransactionHistory]
    (
        [ProductID] ASC,
        [TransactionDate] ASC
    )
    INCLUDE ([ActualCost])
    

    Single pass solution using CROSS APPLY

    This is becoming a really long answer, but here is one more variant that again returns only a daily summary. It does only one scan of the data and it doesn't require external knowledge about the range of dates or the list of ProductIDs. It doesn't do any intermediate Sorts either. Overall performance is similar to the previous variants, though it seems to be a bit worse.

    The main idea is to use a table of numbers to generate rows that fill the gaps in the dates. For each existing date, use LEAD to calculate the size of the gap in days, and then use CROSS APPLY to add the required number of rows to the result set. At first I tried it with a permanent table of numbers. The plan showed a large number of reads on this table, though the actual duration was pretty much the same as when I generated the numbers on the fly using a CTE.

    WITH 
    e1(n) AS
    (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ) -- 10
    ,e2(n) AS (SELECT 1 FROM e1 CROSS JOIN e1 AS b) -- 10*10
    ,e3(n) AS (SELECT 1 FROM e1 CROSS JOIN e2) -- 10*100
    ,CTE_Numbers
    AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY n) AS Number
        FROM e3
    )
    ,CTE_DailyCosts
    AS
    (
        SELECT
            TH.ProductID
            ,TH.TransactionDate
            ,SUM(ActualCost) AS DailyActualCost
            ,ISNULL(DATEDIFF(day,
                TH.TransactionDate,
                LEAD(TH.TransactionDate) 
                OVER(PARTITION BY TH.ProductID ORDER BY TH.TransactionDate)), 1) AS DiffDays
        FROM [Production].[TransactionHistory] AS TH
        GROUP BY TH.ProductID, TH.TransactionDate
    )
    ,CTE_NoGaps
    AS
    (
        SELECT
            CTE_DailyCosts.ProductID
            ,CTE_DailyCosts.TransactionDate
            ,CASE WHEN CA.Number = 1 
            THEN CTE_DailyCosts.DailyActualCost
            ELSE NULL END AS DailyCost
        FROM
            CTE_DailyCosts
            CROSS APPLY
            (
                SELECT TOP(CTE_DailyCosts.DiffDays) CTE_Numbers.Number
                FROM CTE_Numbers
                ORDER BY CTE_Numbers.Number
            ) AS CA
    )
    ,CTE_Sum
    AS
    (
        SELECT
            ProductID
            ,TransactionDate
            ,DailyCost
            ,SUM(DailyCost) OVER (
                PARTITION BY ProductID
                ORDER BY TransactionDate
                ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
        FROM CTE_NoGaps
    )
    SELECT
        ProductID
        ,TransactionDate
        ,DailyCost
        ,RollingSum45
    FROM CTE_Sum
    WHERE DailyCost IS NOT NULL
    ORDER BY 
        ProductID
        ,TransactionDate
    ;
    

    This plan is "longer", because the query uses two window functions (LEAD and SUM).

    cross apply

    ca statistics

    ca io

    • 26
  3. Paul White
    2015-09-15T02:03:04+08:00

    An alternative SQLCLR solution that executes faster and requires less memory:

    Deployment Script

    That requires the EXTERNAL_ACCESS permission set because it uses a loopback connection to the target server and database instead of the (slow) context connection. This is how to call the function:

    SELECT 
        RS.ProductID,
        RS.TransactionDate,
        RS.ActualCost,
        RS.RollingSum45
    FROM dbo.RollingSum
    (
        N'.\SQL2014',           -- Instance name
        N'AdventureWorks2012'   -- Database name
    ) AS RS 
    ORDER BY
        RS.ProductID,
        RS.TransactionDate,
        RS.ReferenceOrderID;
    

    Produces exactly the same results, in the same order, as the question.

    Execution plan:

    SQLCLR TVF execution plan

    SQLCLR source query execution plan

    Plan Explorer performance statistics

    Profiler logical reads: 481

    The main advantage of this implementation is that it is faster than using the context connection, and it uses less memory. It only keeps two things in memory at any one time:

    1. Any duplicate rows (same product and transaction date). This is required because until either the product or date changes, we do not know what the final running sum will be. In the sample data, there is one combination of product and date that has 64 rows.
    2. A sliding 45 day range of cost and transaction dates only, for the current product. This is necessary to adjust the simple running sum for rows that leave the 45-day sliding window.

    This minimal caching should ensure this method scales well; certainly better than trying to hold the entire input set in CLR memory.

    Source code

    • 24
  4. Mikael Eriksson
    2015-09-16T04:38:42+08:00

    If you are on a 64-bit Enterprise, Developer, or Evaluation edition of SQL Server 2014 you can use In-Memory OLTP. The solution will not be a single scan and will hardly use any window functions at all, but it might add some value to this question, and the algorithm used could possibly serve as inspiration for other solutions.

    First you need to enable In-Memory OLTP on AdventureWorks database.

    alter database AdventureWorks2014 
      add filegroup InMem contains memory_optimized_data;
    
    alter database AdventureWorks2014 
      add file (name='AW2014_InMem', 
                filename='D:\SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\AW2014') 
        to filegroup InMem;
    
    alter database AdventureWorks2014 
      set memory_optimized_elevate_to_snapshot = on;
    

    The parameter to the procedure is an In-Memory table variable and that has to be defined as a type.

    create type dbo.TransHistory as table
    (
      ID int not null,
      ProductID int not null,
      TransactionDate datetime not null,
      ReferenceOrderID int not null,
      ActualCost money not null,
      RunningTotal money not null,
      RollingSum45 money not null,
    
      -- Index used in while loop
      index IX_T1 nonclustered hash (ID) with (bucket_count = 1000000),
    
      -- Used to lookup the running total as it was 45 days ago (or more)
      index IX_T2 nonclustered (ProductID, TransactionDate desc)
    ) with (memory_optimized = on);
    

    ID is not unique in this table, it is unique for each combination of ProductID and TransactionDate.

    There are some comments in the procedure that tell you what it does but overall it is calculating the running total in a loop and for each iteration it does a lookup for the running total as it was 45 days ago (or more).

    The current running total minus the running total as it was 45 days ago is the rolling 45 days sum we are looking for.

    create procedure dbo.GetRolling45
      @TransHistory dbo.TransHistory readonly
    with native_compilation, schemabinding, execute as owner as
    begin atomic with(transaction isolation level = snapshot, language = N'us_english')
    
      -- Table to hold the result
      declare @TransRes dbo.TransHistory;
    
      -- Loop variable
      declare @ID int = 0;
    
      -- Current ProductID
      declare @ProductID int = -1;
    
      -- Previous ProductID used to restart the running total
      declare @PrevProductID int;
    
      -- Current transaction date used to get the running total 45 days ago (or more)
      declare @TransactionDate datetime;
    
      -- Sum of actual cost for the group ProductID and TransactionDate
      declare @ActualCost money;
    
      -- Running total so far
      declare @RunningTotal money = 0;
    
      -- Running total as it was 45 days ago (or more)
      declare @RunningTotal45 money = 0;
    
      -- While loop for each unique occurrence of the combination of ProductID, TransactionDate
      while @ProductID <> 0
      begin
        set @ID += 1;
        set @PrevProductID = @ProductID;
    
        -- Get the current values
        select @ProductID = min(ProductID),
               @TransactionDate = min(TransactionDate),
               @ActualCost = sum(ActualCost)
        from @TransHistory 
        where ID = @ID;
    
        if @ProductID <> 0
        begin
          set @RunningTotal45 = 0;
    
          if @ProductID <> @PrevProductID
          begin
            -- New product, reset running total
            set @RunningTotal = @ActualCost;
          end
          else
          begin
            -- Same product as last row, aggregate running total
            set @RunningTotal += @ActualCost;
    
            -- Get the running total as it was 45 days ago (or more)
            select top(1) @RunningTotal45 = TR.RunningTotal
            from @TransRes as TR
            where TR.ProductID = @ProductID and
                  TR.TransactionDate < dateadd(day, -45, @TransactionDate)
            order by TR.TransactionDate desc;
    
          end;
    
          -- Add all rows that match ID to the result table
          -- RollingSum45 is calculated by using the current running total and the running total as it was 45 days ago (or more)
          insert into @TransRes(ID, ProductID, TransactionDate, ReferenceOrderID, ActualCost, RunningTotal, RollingSum45)
          select @ID, 
                 @ProductID, 
                 @TransactionDate, 
                 TH.ReferenceOrderID, 
                 TH.ActualCost, 
                 @RunningTotal, 
                 @RunningTotal - @RunningTotal45
          from @TransHistory as TH
          where ID = @ID;
    
        end
      end;
    
      -- Return the result table to caller
      select TR.ProductID, TR.TransactionDate, TR.ReferenceOrderID, TR.ActualCost, TR.RollingSum45
      from @TransRes as TR
      order by TR.ProductID, TR.TransactionDate, TR.ReferenceOrderID;
    
    end;
    

    Invoke the procedure like this.

    -- Parameter to stored procedure GetRollingSum
    declare @T dbo.TransHistory;
    
    -- Load data to in-mem table
    -- ID is unique for each combination of ProductID, TransactionDate
    insert into @T(ID, ProductID, TransactionDate, ReferenceOrderID, ActualCost, RunningTotal, RollingSum45)
    select dense_rank() over(order by TH.ProductID, TH.TransactionDate),
           TH.ProductID, 
           TH.TransactionDate, 
           TH.ReferenceOrderID,
           TH.ActualCost,
           0, 
           0
    from Production.TransactionHistory as TH;
    
    -- Get the rolling 45 days sum
    exec dbo.GetRolling45 @T;
    

    Testing this on my computer, Client Statistics reports a total execution time of about 750 milliseconds. For comparison, the sub-query version takes 3.5 seconds.

    Extra ramblings:

    This algorithm could also be used by regular T-SQL. Calculate the running total, using range not rows, and store the result in a temp table. Then you can query that table with a self join to the running total as it was 45 days ago and calculate the rolling sum. However, the implementation of range compared to rows is quite slow, due to the fact that it needs to treat duplicates of the order by clause differently, so I did not get all that good performance with this approach. A workaround could be to use another window function like last_value() over a calculated running total using rows to simulate a range running total. Another way is to use max() over(). Both had some issues: finding the appropriate index to use to avoid sorts, and avoiding spools with the max() over() version. I gave up optimising those things, but if you are interested in the code I have so far please let me know.
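
    For reference, a sketch of that regular T-SQL variant (running totals computed with range so that ties on the order by share the same value, stored in a temp table, then a self join back to the running total as it was 45 days ago; the temp table name and aliases are just illustrative):

    SELECT TH.ProductID, TH.TransactionDate, TH.ReferenceOrderID, TH.ActualCost,
           SUM(TH.ActualCost) OVER (
               PARTITION BY TH.ProductID
               ORDER BY TH.TransactionDate
               RANGE UNBOUNDED PRECEDING) AS RunningTotal
    INTO #rt
    FROM Production.TransactionHistory AS TH;
    
    SELECT r.ProductID, r.TransactionDate, r.ReferenceOrderID, r.ActualCost,
           r.RunningTotal - ISNULL(w.RunningTotal, 0) AS RollingSum45
    FROM #rt AS r
    OUTER APPLY
    (
        -- Running total as it was just before the 45 day window began
        SELECT TOP (1) prev.RunningTotal
        FROM #rt AS prev
        WHERE prev.ProductID = r.ProductID
          AND prev.TransactionDate < DATEADD(day, -45, r.TransactionDate)
        ORDER BY prev.TransactionDate DESC
    ) AS w
    ORDER BY r.ProductID, r.TransactionDate, r.ReferenceOrderID;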

    • 17
  5. Kenneth Fisher
    2015-09-11T08:34:51+08:00

    Well that was fun :) My solution is a bit slower than @GeoffPatterson's but part of that is the fact that I'm tying back to the original table in order to eliminate one of Geoff's assumptions (i.e. one row per product/date pair). I went with the assumption this was a simplified version of a final query and may require additional information out of the original table.

    Note: I'm borrowing Geoff's calendar table and in fact ended up with a very similar solution:

    -- Build calendar table for 2000 ~ 2020
    CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
    GO
    DECLARE @d DATETIME = '1/1/2000'
    WHILE (@d < '1/1/2021')
    BEGIN
        INSERT INTO dbo.calendar (d) VALUES (@d)
        SELECT @d =  DATEADD(DAY, 1, @d)
    END
    

    Here is the query itself:

    WITH myCTE AS (SELECT PP.ProductID, calendar.d AS TransactionDate, 
                        SUM(ActualCost) AS CostPerDate
                    FROM Production.Product PP
                    CROSS JOIN calendar
                    LEFT OUTER JOIN Production.TransactionHistory PTH
                        ON PP.ProductID = PTH.ProductID
                        AND calendar.d = PTH.TransactionDate
                    CROSS APPLY (SELECT MAX(TransactionDate) AS EndDate,
                                    MIN(TransactionDate) AS StartDate
                                FROM Production.TransactionHistory) AS Boundaries
                    WHERE calendar.d BETWEEN Boundaries.StartDate AND Boundaries.EndDate
                    GROUP BY PP.ProductID, calendar.d),
        RunningTotal AS (
            SELECT ProductId, TransactionDate, CostPerDate AS TBE,
                    SUM(myCTE.CostPerDate) OVER (
                        PARTITION BY myCTE.ProductID
                        ORDER BY myCTE.TransactionDate
                        ROWS BETWEEN 
                            45 PRECEDING
                            AND CURRENT ROW) AS RollingSum45
            FROM myCTE)
    SELECT 
        TH.ProductID,
        TH.TransactionDate,
        TH.ActualCost,
        RollingSum45
    FROM Production.TransactionHistory AS TH
    JOIN RunningTotal
        ON TH.ProductID = RunningTotal.ProductID
        AND TH.TransactionDate = RunningTotal.TransactionDate
    WHERE RunningTotal.TBE IS NOT NULL
    ORDER BY
        TH.ProductID,
        TH.TransactionDate,
        TH.ReferenceOrderID;
    

    Basically I decided that the easiest way to deal with it was to use the ROWS option of the window frame clause. But that required not only that I have one row per ProductID, TransactionDate combination, but one row per ProductID and possible date. I did that by combining the Product, calendar and TransactionHistory tables in a CTE. Then I had to create another CTE to generate the rolling information. I had to do this because if I joined it back to the original table directly I got row elimination that threw off my results. After that it was a simple matter of joining my second CTE back to the original table. I did add the TBE column (to be eliminated) to get rid of the blank rows created in the CTEs. Also I used a CROSS APPLY in the initial CTE to generate boundaries for my calendar table.

    I then added the recommended index:

    CREATE NONCLUSTERED INDEX [TransactionHistory_IX1]
    ON [Production].[TransactionHistory] ([TransactionDate])
    INCLUDE ([ProductID],[ReferenceOrderID],[ActualCost])
    

    And got the final execution plan:

    enter image description here enter image description here enter image description here

    EDIT: In the end I added an index on the calendar table that sped up performance by a reasonable margin.

    CREATE INDEX ix_calendar ON calendar(d)
    
    • 13
  6. Joe Obbish
    2016-12-31T09:49:14+08:00

    I have a few alternate solutions that don't use indexes or reference tables. Perhaps they could be useful in situations in which you don't have access to any additional tables and cannot create indexes. It does appear to be possible to get correct results when grouping by TransactionDate with just a single pass of the data and just a single window function. However, I could not figure out a way to do it with just one window function when you cannot group by TransactionDate.

    To provide a frame of reference, on my machine the original solution posted in the question has a CPU time of 2808 ms without the covering index and 1950 ms with the covering index. I am testing with the AdventureWorks2014 database and SQL Server Express 2014.

    Let's start with a solution for when we can group by TransactionDate. A running sum over the last X days can also be expressed in the following way:

    Running sum for a row = running sum of all previous rows - running sum of all previous rows for which the date is outside the date window.

    In SQL, one way to express this is by making two copies of your data and, for the second copy, multiplying the cost by -1 and adding X+1 days to the date column. Computing a running sum over all of the data will implement the above formula. I'll show this for some example data. Below is some sample data for a single ProductID. I represent dates as numbers to make the calculations easier. Starting data:

    ╔══════╦══════╗
    ║ Date ║ Cost ║
    ╠══════╬══════╣
    ║    1 ║    3 ║
    ║    2 ║    6 ║
    ║   20 ║    1 ║
    ║   45 ║   -4 ║
    ║   47 ║    2 ║
    ║   64 ║    2 ║
    ╚══════╩══════╝
    

    Add in a second copy of the data. The second copy has 46 days added to the date and the cost multiplied by -1:

    ╔══════╦══════╦═══════════╗
    ║ Date ║ Cost ║ CopiedRow ║
    ╠══════╬══════╬═══════════╣
    ║    1 ║    3 ║         0 ║
    ║    2 ║    6 ║         0 ║
    ║   20 ║    1 ║         0 ║
    ║   45 ║   -4 ║         0 ║
    ║   47 ║   -3 ║         1 ║
    ║   47 ║    2 ║         0 ║
    ║   48 ║   -6 ║         1 ║
    ║   64 ║    2 ║         0 ║
    ║   66 ║   -1 ║         1 ║
    ║   91 ║    4 ║         1 ║
    ║   93 ║   -2 ║         1 ║
    ║  110 ║   -2 ║         1 ║
    ╚══════╩══════╩═══════════╝
    

    Take the running sum ordered by Date ascending and CopiedRow descending:

    ╔══════╦══════╦═══════════╦════════════╗
    ║ Date ║ Cost ║ CopiedRow ║ RunningSum ║
    ╠══════╬══════╬═══════════╬════════════╣
    ║    1 ║    3 ║         0 ║          3 ║
    ║    2 ║    6 ║         0 ║          9 ║
    ║   20 ║    1 ║         0 ║         10 ║
    ║   45 ║   -4 ║         0 ║          6 ║
    ║   47 ║   -3 ║         1 ║          3 ║
    ║   47 ║    2 ║         0 ║          5 ║
    ║   48 ║   -6 ║         1 ║         -1 ║
    ║   64 ║    2 ║         0 ║          1 ║
    ║   66 ║   -1 ║         1 ║          0 ║
    ║   91 ║    4 ║         1 ║          4 ║
    ║   93 ║   -2 ║         1 ║          0 ║
    ║  110 ║   -2 ║         1 ║          0 ║
    ╚══════╩══════╩═══════════╩════════════╝
    

    Filter out the copied rows to get the desired result:

    ╔══════╦══════╦═══════════╦════════════╗
    ║ Date ║ Cost ║ CopiedRow ║ RunningSum ║
    ╠══════╬══════╬═══════════╬════════════╣
    ║    1 ║    3 ║         0 ║          3 ║
    ║    2 ║    6 ║         0 ║          9 ║
    ║   20 ║    1 ║         0 ║         10 ║
    ║   45 ║   -4 ║         0 ║          6 ║
    ║   47 ║    2 ║         0 ║          5 ║
    ║   64 ║    2 ║         0 ║          1 ║
    ╚══════╩══════╩═══════════╩════════════╝
    

    The following SQL is one way to implement the above algorithm:

    WITH THGrouped AS 
    (
        SELECT
        ProductID,
        TransactionDate,
        SUM(ActualCost) ActualCost
        FROM Production.TransactionHistory
        GROUP BY ProductID,
        TransactionDate
    )
    SELECT
    ProductID,
    TransactionDate,
    ActualCost,
    RollingSum45
    FROM
    (
        SELECT
        TH.ProductID,
        TH.ActualCost,
        t.TransactionDate,
        SUM(t.ActualCost) OVER (PARTITION BY TH.ProductID ORDER BY t.TransactionDate, t.OrderFlag) AS RollingSum45,
        t.OrderFlag,
        t.FilterFlag -- define this column to avoid another sort at the end
        FROM THGrouped AS TH
        CROSS APPLY (
            VALUES
            (TH.ActualCost, TH.TransactionDate, 1, 0),
            (-1 * TH.ActualCost, DATEADD(DAY, 46, TH.TransactionDate), 0, 1)
        ) t (ActualCost, TransactionDate, OrderFlag, FilterFlag)
    ) tt
    WHERE tt.FilterFlag = 0
    ORDER BY
    tt.ProductID,
    tt.TransactionDate,
    tt.OrderFlag
    OPTION (MAXDOP 1);
    

    On my machine this took 702 ms of CPU time with the covering index and 734 ms of CPU time without the index. The query plan can be found here: https://www.brentozar.com/pastetheplan/?id=SJdCsGVSl

    One downside of this solution is that there appears to be an unavoidable sort when ordering by the new TransactionDate column. I don't think that this sort can be resolved by adding indexes because we need to combine two copies of the data before doing the ordering. I was able to get rid of a sort at the end of the query by adding a different column (OrderFlag) to the ORDER BY. If I ordered by FilterFlag instead, I found that SQL Server would optimize that column out of the sort (presumably because the WHERE filter makes it a constant) and would perform an explicit sort.

    Solutions for when we need to return a result set with duplicate TransactionDate values for the same ProductID were much more complicated. I would summarize the problem as simultaneously needing to partition by and order by the same column. The syntax that Paul provided resolves that issue, so it's not surprising that it is so difficult to express with the window functions currently available in SQL Server (if it weren't difficult to express there would be no need to expand the syntax).

    If I use the above query without grouping then I get different values for the rolling sum when there are multiple rows with the same ProductID and TransactionDate. One way to resolve this is to do the same running sum calculation as above, but also to flag the last row in each partition. This can be done with LEAD (assuming ProductID is never NULL) without an additional sort. For the final running sum value, I use MAX as a window function to apply the value from the last row of the partition to all rows in the partition.

    SELECT
    ProductID,
    TransactionDate,
    ReferenceOrderID,
    ActualCost,
    MAX(CASE WHEN LastRowFlag = 1 THEN RollingSum ELSE NULL END) OVER (PARTITION BY ProductID, TransactionDate) RollingSum45
    FROM
    (
        SELECT
        TH.ProductID,
        TH.ActualCost,
        TH.ReferenceOrderID,
        t.TransactionDate,
        SUM(t.ActualCost) OVER (PARTITION BY TH.ProductID ORDER BY t.TransactionDate, t.OrderFlag, TH.ReferenceOrderID) RollingSum,
        CASE WHEN LEAD(TH.ProductID) OVER (PARTITION BY TH.ProductID, t.TransactionDate ORDER BY t.OrderFlag, TH.ReferenceOrderID) IS NULL THEN 1 ELSE 0 END LastRowFlag,
        t.OrderFlag,
        t.FilterFlag -- define this column to avoid another sort at the end
        FROM Production.TransactionHistory AS TH
        CROSS APPLY (
            VALUES
            (TH.ActualCost, TH.TransactionDate, 1, 0),
            (-1 * TH.ActualCost, DATEADD(DAY, 46, TH.TransactionDate), 0, 1)
        ) t (ActualCost, TransactionDate, OrderFlag, FilterFlag)
    ) tt
    WHERE tt.FilterFlag = 0
    ORDER BY
    tt.ProductID,
    tt.TransactionDate,
    tt.OrderFlag,
    tt.ReferenceOrderID
    OPTION (MAXDOP 1);  
    

    On my machine this took 2464 ms of CPU time without the covering index. As before, there appears to be an unavoidable sort. The query plan can be found here: https://www.brentozar.com/pastetheplan/?id=HyWxhGVBl
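
    A quick way to gain confidence that a rewrite like this matches the subquery version is to compare the two outputs with EXCEPT. This is only a sketch: #Subquery and #Windowed are hypothetical temp tables holding the two result sets, with ReferenceOrderID included so that every row is unique (EXCEPT is set-based and removes duplicates).

    -- #Subquery: output of the original subquery version (hypothetical)
    -- #Windowed: output of the rewrite above (hypothetical)
    -- Both queries should return zero rows if the two result sets agree.
    SELECT ProductID, TransactionDate, ReferenceOrderID, ActualCost, RollingSum45
    FROM #Subquery
    EXCEPT
    SELECT ProductID, TransactionDate, ReferenceOrderID, ActualCost, RollingSum45
    FROM #Windowed;

    SELECT ProductID, TransactionDate, ReferenceOrderID, ActualCost, RollingSum45
    FROM #Windowed
    EXCEPT
    SELECT ProductID, TransactionDate, ReferenceOrderID, ActualCost, RollingSum45
    FROM #Subquery;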

    I think there is room for improvement in the query above. There are certainly other ways of using window functions to get the desired result.
