Eu preciso converter dados entre dois sistemas.
O primeiro sistema armazena horários como uma lista simples de datas. Cada data incluída na agenda é uma linha. Podem existir várias lacunas na sequência de datas (fins de semana, feriados e pausas mais longas, alguns dias da semana podem ser excluídos da programação). Não pode haver nenhuma lacuna, mesmo fins de semana podem ser incluídos. O cronograma pode durar até 2 anos. Geralmente dura algumas semanas.
Aqui está um exemplo simples de uma programação que abrange duas semanas excluindo fins de semana (existem exemplos mais complicados no script abaixo):
+----+------------+------------+---------+--------+
| ID | ContractID | dt | dowChar | dowInt |
+----+------------+------------+---------+--------+
| 10 | 1 | 2016-05-02 | Mon | 2 |
| 11 | 1 | 2016-05-03 | Tue | 3 |
| 12 | 1 | 2016-05-04 | Wed | 4 |
| 13 | 1 | 2016-05-05 | Thu | 5 |
| 14 | 1 | 2016-05-06 | Fri | 6 |
| 15 | 1 | 2016-05-09 | Mon | 2 |
| 16 | 1 | 2016-05-10 | Tue | 3 |
| 17 | 1 | 2016-05-11 | Wed | 4 |
| 18 | 1 | 2016-05-12 | Thu | 5 |
| 19 | 1 | 2016-05-13 | Fri | 6 |
+----+------------+------------+---------+--------+
ID
é único, mas não é necessariamente sequencial (é a chave primária). As datas são únicas dentro de cada Contrato (há um índice único em (ContractID, dt)
).
O segundo sistema armazena as programações como intervalos com a lista de dias da semana que fazem parte da programação. Cada intervalo é definido por suas datas de início e término (inclusive) e uma lista de dias da semana incluídos na programação. Nesse formato, você pode definir com eficiência padrões semanais repetitivos, como de segunda a quarta, mas torna-se um problema quando um padrão é interrompido, por exemplo, em um feriado.
Veja como ficará o exemplo simples acima:
+------------+------------+------------+----------+----------------------+
| ContractID | StartDT | EndDT | DayCount | WeekDays |
+------------+------------+------------+----------+----------------------+
| 1 | 2016-05-02 | 2016-05-13 | 10 | Mon,Tue,Wed,Thu,Fri, |
+------------+------------+------------+----------+----------------------+
[StartDT;EndDT]
intervalos que pertençam ao mesmo Contrato não devem se sobrepor.
Preciso converter dados do primeiro sistema para o formato usado pelo segundo sistema. No momento, estou resolvendo isso no lado do cliente em C # para o único contrato fornecido, mas gostaria de fazer isso em T-SQL no lado do servidor para processamento em massa e exportação/importação entre servidores. Provavelmente, isso poderia ser feito usando CLR UDF, mas neste estágio não posso usar SQLCLR.
O desafio aqui é tornar a lista de intervalos o mais curta e amigável possível.
Por exemplo, este cronograma:
+-----+------------+------------+---------+--------+
| ID | ContractID | dt | dowChar | dowInt |
+-----+------------+------------+---------+--------+
| 223 | 2 | 2016-05-05 | Thu | 5 |
| 224 | 2 | 2016-05-06 | Fri | 6 |
| 225 | 2 | 2016-05-09 | Mon | 2 |
| 226 | 2 | 2016-05-10 | Tue | 3 |
| 227 | 2 | 2016-05-11 | Wed | 4 |
| 228 | 2 | 2016-05-12 | Thu | 5 |
| 229 | 2 | 2016-05-13 | Fri | 6 |
| 230 | 2 | 2016-05-16 | Mon | 2 |
| 231 | 2 | 2016-05-17 | Tue | 3 |
+-----+------------+------------+---------+--------+
deve se tornar isso:
+------------+------------+------------+----------+----------------------+
| ContractID | StartDT | EndDT | DayCount | WeekDays |
+------------+------------+------------+----------+----------------------+
| 2 | 2016-05-05 | 2016-05-17 | 9 | Mon,Tue,Wed,Thu,Fri, |
+------------+------------+------------+----------+----------------------+
,isso não:
+------------+------------+------------+----------+----------------------+
| ContractID | StartDT | EndDT | DayCount | WeekDays |
+------------+------------+------------+----------+----------------------+
| 2 | 2016-05-05 | 2016-05-06 | 2 | Thu,Fri, |
| 2 | 2016-05-09 | 2016-05-13 | 5 | Mon,Tue,Wed,Thu,Fri, |
| 2 | 2016-05-16 | 2016-05-17 | 2 | Mon,Tue, |
+------------+------------+------------+----------+----------------------+
Eu tentei aplicar uma gaps-and-islands
abordagem para este problema. Eu tentei fazer isso em duas passagens. Na primeira passagem encontro ilhas de dias consecutivos simples, ou seja, o fim da ilha é qualquer intervalo na sequência de dias, seja final de semana, feriado ou qualquer outra coisa. Para cada ilha encontrada, construo uma lista separada por vírgulas de arquivos WeekDays
. Na segunda passagem, o grupo I encontrou ilhas mais adiante, observando a lacuna na sequência dos números da semana ou uma mudança no WeekDays
.
Com essa abordagem, cada semana parcial acaba sendo um intervalo extra, conforme mostrado acima, porque, embora os números das semanas sejam consecutivos, a WeekDays
mudança. Além disso, pode haver intervalos regulares dentro de uma semana (veja ContractID=3
nos dados de amostra, que contém dados apenas para Mon,Wed,Fri,
) e essa abordagem geraria intervalos separados para cada dia nessa programação. Pelo lado bom, gera um intervalo se o cronograma não tiver nenhuma lacuna (veja ContractID=7
nos dados de amostra que incluem finais de semana) e, nesse caso, não importa se a semana de início ou fim é parcial.
Por favor, veja outros exemplos no script abaixo para ter uma ideia melhor do que estou procurando. Você pode ver que muitas vezes os fins de semana são excluídos, mas quaisquer outros dias da semana também podem ser excluídos. No exemplo 3 apenas Mon
, Wed
e Fri
fazem parte do cronograma. Além disso, finais de semana podem ser incluídos, como no exemplo 7. A solução deve tratar todos os dias da semana igualmente. Qualquer dia da semana pode ser incluído ou excluído da programação.
Para verificar se a lista de intervalos gerada descreve o cronograma fornecido corretamente, você pode usar o seguinte pseudocódigo:
- percorrer todos os intervalos
- para cada intervalo de loop por todas as datas do calendário entre as datas de início e término (inclusive).
- para cada data, verifique se o dia da semana está listado no arquivo
WeekDays
. Se sim, então esta data está incluída no cronograma.
Esperançosamente, isso esclarece em quais casos um novo intervalo deve ser criado. Nos exemplos 4 e 5 uma segunda-feira ( 2016-05-09
) é retirada do meio do horário e tal horário não pode ser representado por um único intervalo. No exemplo 6, há um longo intervalo no cronograma, portanto, são necessários dois intervalos.
Os intervalos representam padrões semanais na programação e quando um padrão é interrompido/alterado, um novo intervalo deve ser adicionado. No exemplo 11, as três primeiras semanas têm um padrão Tue
, então esse padrão muda para Thu
. Como resultado, precisamos de dois intervalos para descrever tal cronograma.
Estou usando o SQL Server 2008 no momento, então a solução deve funcionar nesta versão. Se uma solução para SQL Server 2008 puder ser simplificada/melhorada usando recursos de versões posteriores, isso é um bônus, mostre-o também.
Eu tenho uma Calendar
tabela (lista de datas) e uma Numbers
tabela (lista de números inteiros começando em 1), então não há problema em usá-los, se necessário. Também não há problema em criar tabelas temporárias e ter várias consultas que processam dados em vários estágios. O número de estágios em um algoritmo deve ser corrigido, porém, cursores e WHILE
loops explícitos não são aceitáveis.
Script para dados de amostra e resultados esperados
-- @Src is sample data
-- @Dst is expected result
DECLARE @Src TABLE (ID int PRIMARY KEY, ContractID int, dt date, dowChar char(3), dowInt int);
INSERT INTO @Src (ID, ContractID, dt, dowChar, dowInt) VALUES
-- simple two weeks (without weekend)
(110, 1, '2016-05-02', 'Mon', 2),
(111, 1, '2016-05-03', 'Tue', 3),
(112, 1, '2016-05-04', 'Wed', 4),
(113, 1, '2016-05-05', 'Thu', 5),
(114, 1, '2016-05-06', 'Fri', 6),
(115, 1, '2016-05-09', 'Mon', 2),
(116, 1, '2016-05-10', 'Tue', 3),
(117, 1, '2016-05-11', 'Wed', 4),
(118, 1, '2016-05-12', 'Thu', 5),
(119, 1, '2016-05-13', 'Fri', 6),
-- a partial end of the week, the whole week, partial start of the week (without weekends)
(223, 2, '2016-05-05', 'Thu', 5),
(224, 2, '2016-05-06', 'Fri', 6),
(225, 2, '2016-05-09', 'Mon', 2),
(226, 2, '2016-05-10', 'Tue', 3),
(227, 2, '2016-05-11', 'Wed', 4),
(228, 2, '2016-05-12', 'Thu', 5),
(229, 2, '2016-05-13', 'Fri', 6),
(230, 2, '2016-05-16', 'Mon', 2),
(231, 2, '2016-05-17', 'Tue', 3),
-- only Mon, Wed, Fri are included across two weeks plus partial third week
(310, 3, '2016-05-02', 'Mon', 2),
(311, 3, '2016-05-04', 'Wed', 4),
(314, 3, '2016-05-06', 'Fri', 6),
(315, 3, '2016-05-09', 'Mon', 2),
(317, 3, '2016-05-11', 'Wed', 4),
(319, 3, '2016-05-13', 'Fri', 6),
(330, 3, '2016-05-16', 'Mon', 2),
-- a whole week (without weekend), in the second week Mon is not included
(410, 4, '2016-05-02', 'Mon', 2),
(411, 4, '2016-05-03', 'Tue', 3),
(412, 4, '2016-05-04', 'Wed', 4),
(413, 4, '2016-05-05', 'Thu', 5),
(414, 4, '2016-05-06', 'Fri', 6),
(416, 4, '2016-05-10', 'Tue', 3),
(417, 4, '2016-05-11', 'Wed', 4),
(418, 4, '2016-05-12', 'Thu', 5),
(419, 4, '2016-05-13', 'Fri', 6),
-- three weeks, but without Mon in the second week (no weekends)
(510, 5, '2016-05-02', 'Mon', 2),
(511, 5, '2016-05-03', 'Tue', 3),
(512, 5, '2016-05-04', 'Wed', 4),
(513, 5, '2016-05-05', 'Thu', 5),
(514, 5, '2016-05-06', 'Fri', 6),
(516, 5, '2016-05-10', 'Tue', 3),
(517, 5, '2016-05-11', 'Wed', 4),
(518, 5, '2016-05-12', 'Thu', 5),
(519, 5, '2016-05-13', 'Fri', 6),
(520, 5, '2016-05-16', 'Mon', 2),
(521, 5, '2016-05-17', 'Tue', 3),
(522, 5, '2016-05-18', 'Wed', 4),
(523, 5, '2016-05-19', 'Thu', 5),
(524, 5, '2016-05-20', 'Fri', 6),
-- long gap between two intervals
(623, 6, '2016-05-05', 'Thu', 5),
(624, 6, '2016-05-06', 'Fri', 6),
(625, 6, '2016-05-09', 'Mon', 2),
(626, 6, '2016-05-10', 'Tue', 3),
(627, 6, '2016-05-11', 'Wed', 4),
(628, 6, '2016-05-12', 'Thu', 5),
(629, 6, '2016-05-13', 'Fri', 6),
(630, 6, '2016-05-16', 'Mon', 2),
(631, 6, '2016-05-17', 'Tue', 3),
(645, 6, '2016-06-06', 'Mon', 2),
(646, 6, '2016-06-07', 'Tue', 3),
(647, 6, '2016-06-08', 'Wed', 4),
(648, 6, '2016-06-09', 'Thu', 5),
(649, 6, '2016-06-10', 'Fri', 6),
(655, 6, '2016-06-13', 'Mon', 2),
(656, 6, '2016-06-14', 'Tue', 3),
(657, 6, '2016-06-15', 'Wed', 4),
(658, 6, '2016-06-16', 'Thu', 5),
(659, 6, '2016-06-17', 'Fri', 6),
-- two weeks, no gaps between days at all, even weekends are included
(710, 7, '2016-05-02', 'Mon', 2),
(711, 7, '2016-05-03', 'Tue', 3),
(712, 7, '2016-05-04', 'Wed', 4),
(713, 7, '2016-05-05', 'Thu', 5),
(714, 7, '2016-05-06', 'Fri', 6),
(715, 7, '2016-05-07', 'Sat', 7),
(716, 7, '2016-05-08', 'Sun', 1),
(725, 7, '2016-05-09', 'Mon', 2),
(726, 7, '2016-05-10', 'Tue', 3),
(727, 7, '2016-05-11', 'Wed', 4),
(728, 7, '2016-05-12', 'Thu', 5),
(729, 7, '2016-05-13', 'Fri', 6),
-- no gaps between days at all, even weekends are included, with partial weeks
(805, 8, '2016-04-30', 'Sat', 7),
(806, 8, '2016-05-01', 'Sun', 1),
(810, 8, '2016-05-02', 'Mon', 2),
(811, 8, '2016-05-03', 'Tue', 3),
(812, 8, '2016-05-04', 'Wed', 4),
(813, 8, '2016-05-05', 'Thu', 5),
(814, 8, '2016-05-06', 'Fri', 6),
(815, 8, '2016-05-07', 'Sat', 7),
(816, 8, '2016-05-08', 'Sun', 1),
(825, 8, '2016-05-09', 'Mon', 2),
(826, 8, '2016-05-10', 'Tue', 3),
(827, 8, '2016-05-11', 'Wed', 4),
(828, 8, '2016-05-12', 'Thu', 5),
(829, 8, '2016-05-13', 'Fri', 6),
(830, 8, '2016-05-14', 'Sat', 7),
-- only Mon-Wed included, two weeks plus partial third week
(910, 9, '2016-05-02', 'Mon', 2),
(911, 9, '2016-05-03', 'Tue', 3),
(912, 9, '2016-05-04', 'Wed', 4),
(915, 9, '2016-05-09', 'Mon', 2),
(916, 9, '2016-05-10', 'Tue', 3),
(917, 9, '2016-05-11', 'Wed', 4),
(930, 9, '2016-05-16', 'Mon', 2),
(931, 9, '2016-05-17', 'Tue', 3),
-- only Thu-Sun included, three weeks
(1013,10,'2016-05-05', 'Thu', 5),
(1014,10,'2016-05-06', 'Fri', 6),
(1015,10,'2016-05-07', 'Sat', 7),
(1016,10,'2016-05-08', 'Sun', 1),
(1018,10,'2016-05-12', 'Thu', 5),
(1019,10,'2016-05-13', 'Fri', 6),
(1020,10,'2016-05-14', 'Sat', 7),
(1021,10,'2016-05-15', 'Sun', 1),
(1023,10,'2016-05-19', 'Thu', 5),
(1024,10,'2016-05-20', 'Fri', 6),
(1025,10,'2016-05-21', 'Sat', 7),
(1026,10,'2016-05-22', 'Sun', 1),
-- only Tue for first three weeks, then only Thu for the next three weeks
(1111,11,'2016-05-03', 'Tue', 3),
(1116,11,'2016-05-10', 'Tue', 3),
(1131,11,'2016-05-17', 'Tue', 3),
(1123,11,'2016-05-19', 'Thu', 5),
(1124,11,'2016-05-26', 'Thu', 5),
(1125,11,'2016-06-02', 'Thu', 5),
-- one week, then one week gap, then one week
(1210,12,'2016-05-02', 'Mon', 2),
(1211,12,'2016-05-03', 'Tue', 3),
(1212,12,'2016-05-04', 'Wed', 4),
(1213,12,'2016-05-05', 'Thu', 5),
(1214,12,'2016-05-06', 'Fri', 6),
(1215,12,'2016-05-16', 'Mon', 2),
(1216,12,'2016-05-17', 'Tue', 3),
(1217,12,'2016-05-18', 'Wed', 4),
(1218,12,'2016-05-19', 'Thu', 5),
(1219,12,'2016-05-20', 'Fri', 6);
SELECT ID, ContractID, dt, dowChar, dowInt
FROM @Src
ORDER BY ContractID, dt;
DECLARE @Dst TABLE (ContractID int, StartDT date, EndDT date, DayCount int, WeekDays varchar(255));
INSERT INTO @Dst (ContractID, StartDT, EndDT, DayCount, WeekDays) VALUES
(1, '2016-05-02', '2016-05-13', 10, 'Mon,Tue,Wed,Thu,Fri,'),
(2, '2016-05-05', '2016-05-17', 9, 'Mon,Tue,Wed,Thu,Fri,'),
(3, '2016-05-02', '2016-05-16', 7, 'Mon,Wed,Fri,'),
(4, '2016-05-02', '2016-05-06', 5, 'Mon,Tue,Wed,Thu,Fri,'),
(4, '2016-05-10', '2016-05-13', 4, 'Tue,Wed,Thu,Fri,'),
(5, '2016-05-02', '2016-05-06', 5, 'Mon,Tue,Wed,Thu,Fri,'),
(5, '2016-05-10', '2016-05-20', 9, 'Mon,Tue,Wed,Thu,Fri,'),
(6, '2016-05-05', '2016-05-17', 9, 'Mon,Tue,Wed,Thu,Fri,'),
(6, '2016-06-06', '2016-06-17', 10, 'Mon,Tue,Wed,Thu,Fri,'),
(7, '2016-05-02', '2016-05-13', 12, 'Sun,Mon,Tue,Wed,Thu,Fri,Sat,'),
(8, '2016-04-30', '2016-05-14', 15, 'Sun,Mon,Tue,Wed,Thu,Fri,Sat,'),
(9, '2016-05-02', '2016-05-17', 8, 'Mon,Tue,Wed,'),
(10,'2016-05-05', '2016-05-22', 12, 'Sun,Thu,Fri,Sat,'),
(11,'2016-05-03', '2016-05-17', 3, 'Tue,'),
(11,'2016-05-19', '2016-06-02', 3, 'Thu,'),
(12,'2016-05-02', '2016-05-06', 5, 'Mon,Tue,Wed,Thu,Fri,'),
(12,'2016-05-16', '2016-05-20', 5, 'Mon,Tue,Wed,Thu,Fri,');
SELECT ContractID, StartDT, EndDT, DayCount, WeekDays
FROM @Dst
ORDER BY ContractID, StartDT;
Comparação de respostas
A tabela real @Src
possui 403,555
linhas 15,857
com ContractIDs
. Todas as respostas produzem resultados corretos (pelo menos para meus dados) e todas elas são razoavelmente rápidas, mas diferem em otimização. Quanto menos intervalos gerados, melhor. Incluí tempos de execução apenas por curiosidade. O foco principal é o resultado correto e ideal, não a velocidade (a menos que demore muito - interrompi a consulta não recursiva de Ziggy Crueltyfree Zeitgeister após 10 minutos).
+--------------------------------------------------------+-----------+---------+
| Answer | Intervals | Seconds |
+--------------------------------------------------------+-----------+---------+
| Ziggy Crueltyfree Zeitgeister | 25751 | 7.88 |
| While loop | | |
| | | |
| Ziggy Crueltyfree Zeitgeister | 25751 | 8.27 |
| Recursive | | |
| | | |
| Michael Green | 25751 | 22.63 |
| Recursive | | |
| | | |
| Geoff Patterson | 26670 | 4.79 |
| Weekly gaps-and-islands with merging of partial weeks | | |
| | | |
| Vladimir Baranov | 34560 | 4.03 |
| Daily, then weekly gaps-and-islands | | |
| | | |
| Mikael Eriksson | 35840 | 0.65 |
| Weekly gaps-and-islands | | |
+--------------------------------------------------------+-----------+---------+
| Vladimir Baranov | 25751 | 121.51 |
| Cursor | | |
+--------------------------------------------------------+-----------+---------+
This one uses a recursive CTE. Its result is identical to the example in the question. It was a nightmare to come up with... The code includes comments to ease through its convoluted logic.
Another strategy
This one should be significantly faster than the previous one because it doesn't rely on the slow limited recursive CTE in SQL Server 2008, although it implements more or less the same strategy.
There is a
WHILE
loop (I couldn't devise a way to avoid it), but goes for a reduced number of iterations (the highest number of sequences (minus one) on any given contract).It's a simple strategy, and could be used for sequences either shorter or longer than a week (replacing any occurrence of the constant 7 for any other number, and the
dowBit
calculated from MODULUS x ofDayNo
rather thanDATEPART(wk)
) and up to 32.Não é exatamente o que você está procurando, mas talvez possa ser do seu interesse.
A consulta cria semanas com uma string separada por vírgula para os dias usados em cada semana. Em seguida, encontra as ilhas de semanas consecutivas que usam o mesmo padrão em
Weekdays
.Resultado:
ContractID = 2
shows what the difference in the result is compared to what you want. The first and last week will be treated as separate periods sinceWeekDays
is different.I ended up with an approach that yields the optimal solution in this case and I think will do well in general. The solution is quite lengthy, however, so it would be interesting to see if someone else has a different approach that is more concise.
Here is a script that contains the full solution.
And here is an outline of the algorithm:
ContractId
ContractId
and have sameWeekDays
WeekDays
of the single week matches a leading subset of theWeekDays
of the previous grouping, merge into that previous groupingWeekDays
of the single week matches a trailing subset of theWeekDays
of the next grouping, merge into that next groupingNão consegui entender a lógica por trás do agrupamento de semanas com intervalos ou semanas com finais de semana (por exemplo, quando há duas semanas consecutivas com um final de semana, para qual semana o final de semana vai?).
A consulta a seguir produz a saída desejada, exceto que agrupa apenas dias da semana consecutivos e agrupa semanas de domingo a sábado (em vez de segunda a domingo). Embora não seja exatamente o que você deseja, talvez isso possa fornecer algumas pistas para uma estratégia diferente. O agrupamento de dias vem daqui . As funções de janelamento usadas devem funcionar com o SQLServer 2008, mas não tenho essa versão para testar se realmente funciona.
Resultado
For the sake of completeness, here is a two-pass
gaps-and-islands
approach that I tried myself before asking this question.As I was testing it on the real data I found few cases when it was producing incorrect results and fixed it.
Here is the algorithm:
CTE_ContractDays
,CTE_DailyRN
,CTE_DailyIslands
) and calculate a week number for each starting and ending date of an island. Here week number is calculated assuming that Monday is the first day of the week.CTE_Weeks
).CTE_FirstResult
).WeekDays
(CTE_SecondRN
,CTE_Schedules
).It handles well cases when there is no disruption in the weekly patterns (1, 7, 8, 10, 12). It handles well cases when pattern has non-sequential days (3).
But, unfortunately, it generates extra intervals for partial weeks (2, 3, 5, 6, 9, 11).
Result
Cursor-based solution
I converted my C# code into a cursor-based algorithm, just to see how it compares to other solutions on real data. It confirms that it is much slower than other set-based or recursive approaches, but it generates an optimal result.
I was a little bit surprised that the Vladimir's cursor solution was so slow, so I also tried to optimize that version. I did confirm that using a cursor was very slow for me as well.
However, at the cost of using undocumented functionality in SQL Server by appending to a variable while processing a rowset, I was able to create a simplified version of this logic that yields the optimal result and executes much faster than both the cursor and my original solution. So use at your own risk, but I'll present the solution in case it's of interest. It would also be possible to update the solution to use a
WHILE
loop from one to the maximum row number, seeking to the next row number at each iteration of the loop. This would stick to fully documented and reliable functionality, but would violate the (somewhat artificial) stated constraint of the problem thatWHILE
loops are not allowed.Note that if using SQL 2014 was allowed, it is likely that a natively-compiled stored procedure that loops over the row numbers and access each row number in a memory-optimized table would be an implementation of this same logic that would run more quickly.
Here is the full solution, including expanding the trial data set out to about a half million rows. The new solution completes in about 3 seconds and in my opinion is a lot more concise and readable than the previous solution I offered. I'll break out the three steps involved here:
Step 1: pre-processing
We first add a row number to the data set, in the order we will process the data. While doing so, we also convert each dowInt into a power of 2 so that we can use a bitmap to represent which days have been observed in any given grouping:
Step 2: Looping through the contract days in order to identify new groupings
We next loop over the data, in order by row number. We compute only the list of row numbers that form the boundary of a new grouping, then output those row numbers into a table:
Step 3: Computing final results based on the row numbers of each grouping boundary
We then compute the final groupings by using the boundaries identified in the loop above to aggregate all dates that fall into each grouping:
Discussion will follow the code.
@Helper
is to cope with this rule:It allows me to list day names, in day number order, between any two given days. This is used when deciding if a new interval should start. I populate it with two week's worth of values to make wrapping around a weekend easier to code.
There are cleaner ways to implement this. A full "dates" table would be one. There's probably a clever way with day number and modulo arithmetic, too.
The CTE
MissingDays
is to generate a list of day names between any two given days. It is handled in this clunky way because the recursive CTE (following) does not allow aggregates, TOP(), or other operators. This is inelegant, but it works.CTE
Numbered
is to enforce a known, gap-free sequence on the data. It avoids a lot of comparisons later.CTE
Incremented
is where the action happens. In essence I use a recursive CTE to step through the data and enforce the rules. The row number generated inNumbered
(above) is used to drive the recursive processing.The seed of the recursive CTE simply gets the first date for each ContractID and initialises values that will be used to decide if a new interval is required.
Deciding if a new interval should start requires the current interval's start date, day list and the length of any gap in the calendar dates. These may be reset or carried forward, depending on the decision. Hence the recursive part is verbose and a little repetitive, as we have to decide whether to start a new interval for more than one column value.
The decision logic for columns
WeekDays
andIntervalStart
should have the same decision logic - it can be cut-and-pasted between them. If the logic for starting a new interval were to change this is the code to alter. Ideally it would be abstracted, therefore; doing this in a recursive CTE may be challenging.The
EXISTS()
clause is the outfall of not being able to use aggregate functions in a recursive CTE. All it does is see if the days falling within a gap are in the current interval already.There is nothing magic about the nesting of the logic clauses. If it is clearer in another conformation, or using nested CASEs, say, there is no reason to keep it this way.
The final
SELECT
is to give the output in the format desired.Having the PK on
Src.ID
is not useful for this method. A clustered index on(ContractID,dt)
would be nice, I think.There are a few rough edges. The days are not returned in dow sequence, but in the calendar sequence they appear in the source data. Everything to do with @Helper is klunky and could be smoothed. I like the idea of using one bit per day and using binary functions instead of
LIKE
. Separating some of the auxilliary CTEs into temp table with proper indexes would undoubtedly help.One of the challenges with this is that a "week" does not align with a standard calendar, but rather is driven by the data, and resets when it is determined that a new interval should begin. A "week", or at least an interval, can be from one day long to spanning the entire dataset.
For interests sake, here's the estimated costs against Geoff's sample data (thanks for that!) after various changes:
The estimated and actual number of rows differ wildly.
The plan has a table spoo, likely as a result of the recursive CTE. Most of the action is in a worktable coming off that:
Just the way recursive's implemented, I guess!