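I'm close to solving this problem, but I've hit a wall. I'm trying to understand an article by Aaron Bertrand and apply it to a situation of my own: because of a design mistake I inherited, I have a change table with a lot of duplicates. The sample dataset is conceptually the same as my real one, except that SortOrder would normally be a datetime value rather than an integer. The code I've tried is here: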
; with main as (
select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID, SortOrder) as "Rank"
, row_number() over (partition by ID, val order by ID, sortorder) as "s_rank"
from
(values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4), (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), (3, 'A', 1), (3, 'A', 1), (3, 'A', 2) )
as x("ID", "VAL", "SortOrder")
group by id, val, SortOrder
--order by ID, "SortOrder"
)
, cte_rest as (
select *
from main
where "s_rank" > 1
)
select *
from main left join cte_rest rest on main.id = rest.id and main.s_rank > 1 and main.SortOrder = rest.SortOrder
--where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank)
order by main.ID, main.SortOrder
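The result is almost right; however, the last row highlights a case I haven't accounted for: the date changes but the value does not. I want to exclude this record, because it is not a real value change.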
╔════╦═════╦═══════════╦══════╦════════╦══════╦══════╦═══════════╦══════╦════════╗
║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║
╠════╬═════╬═══════════╬══════╬════════╬══════╬══════╬═══════════╬══════╬════════╣
║ 1 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 1 ║ B ║ 2 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 1 ║ C ║ 3 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 1 ║ B ║ 4 ║ 1 ║ 2 ║ 1 ║ B ║ 4 ║ 1 ║ 2 ║
║ 1 ║ A ║ 5 ║ 1 ║ 2 ║ 1 ║ A ║ 5 ║ 1 ║ 2 ║
║ 2 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 2 ║ B ║ 2 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 2 ║ A ║ 3 ║ 1 ║ 2 ║ 2 ║ A ║ 3 ║ 1 ║ 2 ║
║ 3 ║ A ║ 1 ║ 1 ║ 1 ║ NULL ║ NULL ║ NULL ║ NULL ║ NULL ║
║ 3 ║ A ║ 2 ║ 1 ║ 2 ║ 3 ║ A ║ 2 ║ 1 ║ 2 ║
╚════╩═════╩═══════════╩══════╩════════╩══════╩══════╩═══════════╩══════╩════════╝
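A colleague of mine suggested the code below. While I can follow how it gets to its result, I don't understand why the first code sample doesn't work. It also seems to me that this approach needs a lot of extra parsing, and I'd worry about the performance impact on a large dataset.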
WITH cte1
AS (SELECT [id]
, [val]
, [sortorder]
, ROW_NUMBER() OVER(PARTITION BY [id]
, [val]
, [sortorder]
ORDER BY [id]
, [sortorder]) AS "rankall"
FROM (VALUES
( 1, 'A', 1 ),
( 1, 'A', 1 ),
( 1, 'B', 2 ),
( 1, 'C', 3 ),
( 1, 'B', 4 ),
( 1, 'A', 5 ),
( 1, 'A', 5 ),
( 2, 'A', 1 ),
( 2, 'B', 2 ),
( 2, 'A', 3 ),
( 3, 'A', 1 ),
( 3, 'A', 1 ),
( 3, 'A', 2 )) AS x("id", "val", "sortorder")),
ctedropped
AS (SELECT [id]
, [val]
, [sortorder]
, ROW_NUMBER() OVER(PARTITION BY [id]
, [val]
, [sortorder]
ORDER BY [id]
, [sortorder]) AS "rankall"
FROM cte1
WHERE [cte1].[rankall] > 1)
SELECT [cte1].[id]
, [cte1].[val]
, [cte1].[sortorder]
FROM cte1
WHERE NOT EXISTS
(
SELECT *
FROM [ctedropped]
WHERE [cte1].[id] = [ctedropped].[id] AND
[cte1].[val] = [ctedropped].[val] AND
[cte1].[rankall] = [ctedropped].[rankall]
)
ORDER BY [cte1].[id]
, [cte1].[sortorder];
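It isn't clear whether your dataset and expected result are the same as in the referenced question. I think you're looking to identify the most recent time an id was updated to a value different from its previous one. In that case, you can try the following: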
db<>fiddle
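One way to sketch that idea against the sample data from the question is shown below. It is not necessarily the query behind the fiddle. It assumes SQL Server 2012 or later (for `LAG`) and that `val` is never NULL; the CTE names `dedup` and `flagged` and the column alias `prev_val` are made up for illustration. Exact duplicate rows are collapsed first, then a row is kept only when its value differs from the previous value for the same id.

```sql
WITH dedup AS (
    -- Collapse exact duplicate rows (same id, val and sortorder).
    SELECT DISTINCT id, val, sortorder
    FROM (VALUES
        (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4),
        (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3),
        (3, 'A', 1), (3, 'A', 1), (3, 'A', 2)
    ) AS x(id, val, sortorder)
),
flagged AS (
    -- Look up the value on the preceding row for the same id.
    -- prev_val is NULL only on the first row per id (assuming val is NOT NULL).
    SELECT id, val, sortorder,
           LAG(val) OVER (PARTITION BY id ORDER BY sortorder) AS prev_val
    FROM dedup
)
SELECT id, val, sortorder
FROM flagged
WHERE prev_val IS NULL      -- first row for this id
   OR prev_val <> val       -- value actually changed
ORDER BY id, sortorder;
```

With the sample data this keeps the first nine rows of the result table in the question and drops the row (3, 'A', 2), where only the sort order changes. If `val` can be NULL in the real table, the comparison in the final `WHERE` clause would need adjusting.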