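I am struggling to find a faster way to update the data I am currently working with. More precisely, I have data on workers' contracts, with a start and end date for each contract. However, the time spans of contracts may overlap, meaning that a worker can start another employment contract while a previous one is still active. I would therefore like to isolate the contracts that are completely overlapped by previous contracts. To do that, I wrote a stored procedure that compares the end of each contract with the end of the previous one. If the current contract ends on or before the end of the previous contract, I flag it, and it is left out of all later iterations of the procedure. However, since my database consists of more than 10 million observations, the stored procedure I wrote (shown below) takes too long to run. If possible, I would like to replace it with a single query, but I am struggling to find a proper solution. Any suggestion would be much appreciated.
The commands below reproduce a sample of my database and the procedure I am currently using.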
-- Table replication
drop table if exists my_table;
create table my_table (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
worker_id int,
dt_start date,
dt_end date,
PRIMARY KEY (id)
);
insert into
my_table(id, worker_id, dt_start, dt_end)
values
('12', '20', '2014-05-02', '2014-07-08'),
('13', '20', '2017-01-14', '2017-01-31'),
('14', '20', '2017-04-18', '2018-01-01'),
('15', '20', '2017-11-06', '2017-11-06'),
('16', '20', '2017-11-06', '2017-12-07'),
('17', '20', '2019-12-02', '2020-05-31'),
('18', '20', '2020-06-01', '2020-07-31'),
('25', '29', '2014-11-24', '2017-02-11'),
('26', '42', '2016-01-22', '2016-05-05'),
('40', '71', '2016-12-01', '2017-05-31'),
('41', '71', '2017-06-01', '2020-12-21'),
('42', '71', '2020-07-17', '2020-08-02'),
('53', '380', '2017-02-15', '2017-07-31'),
('54', '380', '2017-09-04', '2017-12-23'),
('55', '380', '2017-12-27', '2018-12-22'),
('56', '380', '2019-05-15', '2019-09-15'),
('57', '380', '2020-03-23', '2099-01-01'),
('58', '380', '2020-09-28', '2022-09-30'),
('63', '391', '2013-07-23', '2013-11-30'),
('64', '391', '2014-06-16', '2014-12-16'),
('65', '391', '2014-11-21', '2015-01-20'),
('66', '391', '2015-04-01', '2015-04-15'),
('67', '391', '2015-06-10', '2015-06-22')
;
alter table my_table add index (id);
alter table my_table add index (worker_id);
-- Note: when the end date is '2099-01-01', it means the contract is an open-ended one and still ongoing
-- With this flag I will identify contracts completely overlapped, hence to discard
alter table my_table add column flag_del INT default 0;
-- Identify maximum number of contracts per person, I will use the maximum value for the loop
drop table if exists max_att;
create table max_att
as select worker_id, count(*) n
from my_table
group by worker_id;
-- Procedure to iteratively identify contracts whose time span is completely covered by previous contracts
-- Those specific contracts will be identified in the 'flag_del' column (= 1)
DROP PROCEDURE IF EXISTS doiterate;
delimiter //
CREATE PROCEDURE doiterate()
BEGIN
    DECLARE total INT unsigned DEFAULT 0;
    WHILE total <= (select MAX(n) from max_att) DO
        with new_table as (
            select
                *,
                lag(dt_end, 1) over (partition by worker_id order by id) dt_end_lag
            from my_table
            where flag_del = 0)
        update my_table a
        left outer join new_table b on a.id = b.id
        set a.flag_del = 1 where b.dt_end_lag >= b.dt_end;
        SET total = total + 1;
    END WHILE;
END//
delimiter ;
CALL doiterate();
select * from my_table;
The entries that end up flagged with 1 will be deleted, since they are completely overlapped by previous contracts.
Desired output.
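For reference, the flagging rule applied by the loop can probably be expressed as one set-based statement. The sketch below is not taken from the procedure above. It assumes MySQL 8 (window functions and CTEs in UPDATE) and relies on the observation that a contract is flagged exactly when its end date is not later than the latest end date among all earlier contracts (by id) of the same worker. The names running and max_prev_end are hypothetical.

```sql
-- Sketch only: assumes MySQL 8+ and the my_table / flag_del definitions above.
-- "running" and "max_prev_end" are hypothetical names introduced for illustration.
WITH running AS (
    SELECT
        id,
        dt_end,
        -- latest end date among all earlier contracts of the same worker;
        -- NULL for the worker's first contract, so that row is never flagged
        MAX(dt_end) OVER (
            PARTITION BY worker_id
            ORDER BY id
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        ) AS max_prev_end
    FROM my_table
)
UPDATE my_table a
JOIN running r ON r.id = a.id
SET a.flag_del = 1
WHERE r.max_prev_end >= r.dt_end;
```

On the sample data this should flag the same rows as the loop, in a single pass over the table instead of one pass per iteration. Like the procedure, it orders contracts by id, so it assumes ids follow the chronological order of contracts within a worker.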
https://dbfiddle.uk/7HjwGOrJ
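The last 2 columns give the date range that the current row's date range falls within. Use this technique to write the final query.
PS. The query combines overlapping ranges (dt_end_1 > dt_start_2) and adjacent ranges (dt_end_1 = dt_start_2), but it does not combine consecutive ranges (dt_start_2 - dt_end_1 = 1 day). If you only need to combine overlapping ranges, use WHEN cte2.range_end < cte1.dt_start. If you need to combine consecutive ranges as well, use WHEN cte2.range_end + INTERVAL 1 DAY <= cte1.dt_start.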
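As an illustration of the idea of attaching the enclosing range to every row, here is a minimal sketch using window functions. It is not the query from the fiddle. It assumes MySQL 8, and the names marked, grouped, is_new_range, range_no and range_start are hypothetical. A row joins the current range when the latest end date seen so far for that worker is on or after its start date, so overlapping and adjacent ranges are combined.

```sql
-- Sketch only: assumes MySQL 8+ and the my_table definition from the question.
WITH marked AS (
    SELECT
        id, worker_id, dt_start, dt_end,
        CASE
            -- latest earlier end date reaches this start date: same range
            WHEN MAX(dt_end) OVER (
                     PARTITION BY worker_id
                     ORDER BY dt_start, dt_end, id
                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                 ) >= dt_start THEN 0
            -- gap before this row, or first row of the worker: new range
            ELSE 1
        END AS is_new_range
    FROM my_table
),
grouped AS (
    SELECT
        id, worker_id, dt_start, dt_end,
        -- running count of range starts gives each merged range a number
        SUM(is_new_range) OVER (
            PARTITION BY worker_id
            ORDER BY dt_start, dt_end, id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS range_no
    FROM marked
)
SELECT
    id, worker_id, dt_start, dt_end,
    MIN(dt_start) OVER (PARTITION BY worker_id, range_no) AS range_start,
    MAX(dt_end)   OVER (PARTITION BY worker_id, range_no) AS range_end
FROM grouped
ORDER BY worker_id, dt_start, id;
```

In this sketch, changing the comparison `>= dt_start` to `> dt_start` would combine only overlapping ranges, and comparing `MAX(dt_end) OVER (...) + INTERVAL 1 DAY >= dt_start` would also combine ranges that follow each other on consecutive days. Rows whose own dates differ from range_start and range_end are the ones contained in, or merged into, a wider range.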