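I'm trying to process network logs and join sessions together if the time between them is less than 15 minutes. The relevant fields are start time, end time, MAC address, and WiFi access point.
I'm working in Greenplum 6.22 / PostgreSQL 9.4.26: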
pdap=# SELECT version();
version |
---|
PostgreSQL 9.4.26 (Greenplum Database 6.22.2) |
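Logically, what I want to do is: "If the next row's start time is less than 15 minutes after this row's end time, merge the two rows into one row with the earlier start time and the later end time."
Here is a sample table with some data: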
CREATE TABLE network_test
( start_ts TIMESTAMPTZ,
end_ts TIMESTAMPTZ,
mac_addr MACADDR,
access_point VARCHAR
);
INSERT INTO network_test
VALUES
('2023-08-14 13:21:10.289'::timestamptz, '2023-08-14 13:31:20.855'::timestamptz, '00:00:00:00:00:01'::macaddr, 'access_point_01'),
('2023-08-14 13:58:10.638'::timestamptz, '2023-08-14 13:58:22.668'::timestamptz, '00:00:00:00:00:01'::macaddr, 'access_point_01'),
('2023-08-14 13:58:22.727'::timestamptz, '2023-08-14 13:58:38.966'::timestamptz, '00:00:00:00:00:01'::macaddr, 'access_point_01'),
('2023-08-14 13:28:28.190'::timestamptz, '2023-08-14 13:28:28.190'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_02'),
('2023-08-14 13:28:44.167'::timestamptz, '2023-08-14 13:28:44.288'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_02'),
('2023-08-14 13:45:40.281'::timestamptz, '2023-08-14 13:46:02.726'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:46:02.964'::timestamptz, '2023-08-14 13:46:10.783'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:46:11.026'::timestamptz, '2023-08-14 13:46:18.803'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:46:19.037'::timestamptz, '2023-08-14 13:46:26.798'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:46:27.036'::timestamptz, '2023-08-14 13:46:34.815'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:46:35.057'::timestamptz, '2023-08-14 13:46:46.980'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:46:47.213'::timestamptz, '2023-08-14 13:46:54.946'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:46:55.189'::timestamptz, '2023-08-14 13:47:17.040'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:47:17.297'::timestamptz, '2023-08-14 13:47:25.106'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03'),
('2023-08-14 13:55:25.381'::timestamptz, '2023-08-14 13:58:33.059'::timestamptz, '00:00:00:00:00:02'::macaddr, 'access_point_03');
SELECT *
FROM network_test
ORDER BY mac_addr, access_point, start_ts
start_ts | end_ts | mac_addr | access_point |
---|---|---|---|
2023-08-14 13:21:10.289+00 | 2023-08-14 13:31:20.855+00 | 00:00:00:00:00:01 | access_point_01 |
2023-08-14 13:58:10.638+00 | 2023-08-14 13:58:22.668+00 | 00:00:00:00:00:01 | access_point_01 |
2023-08-14 13:58:22.727+00 | 2023-08-14 13:58:38.966+00 | 00:00:00:00:00:01 | access_point_01 |
2023-08-14 13:28:28.19+00 | 2023-08-14 13:28:28.19+00 | 00:00:00:00:00:02 | access_point_02 |
2023-08-14 13:28:44.167+00 | 2023-08-14 13:28:44.288+00 | 00:00:00:00:00:02 | access_point_02 |
2023-08-14 13:45:40.281+00 | 2023-08-14 13:46:02.726+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:46:02.964+00 | 2023-08-14 13:46:10.783+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:46:11.026+00 | 2023-08-14 13:46:18.803+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:46:19.037+00 | 2023-08-14 13:46:26.798+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:46:27.036+00 | 2023-08-14 13:46:34.815+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:46:35.057+00 | 2023-08-14 13:46:46.98+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:46:47.213+00 | 2023-08-14 13:46:54.946+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:46:55.189+00 | 2023-08-14 13:47:17.04+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:47:17.297+00 | 2023-08-14 13:47:25.106+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:55:25.381+00 | 2023-08-14 13:58:33.059+00 | 00:00:00:00:00:02 | access_point_03 |
This is the result I'm hoping for:
start_ts | end_ts | mac_addr | access_point |
---|---|---|---|
2023-08-14 13:21:10.289+00 | 2023-08-14 13:31:20.855+00 | 00:00:00:00:00:01 | access_point_01 |
2023-08-14 13:58:10.638+00 | 2023-08-14 13:58:38.966+00 | 00:00:00:00:00:01 | access_point_01 |
2023-08-14 13:28:28.19+00 | 2023-08-14 13:28:44.288+00 | 00:00:00:00:00:02 | access_point_02 |
2023-08-14 13:45:40.281+00 | 2023-08-14 13:58:33.059+00 | 00:00:00:00:00:02 | access_point_03 |
The first session is left as is. Sessions 2 and 3 are merged into one because they have the same MAC address and access point and less than 15 minutes between them. The same happens for sessions 4 and 5, and for sessions 6 through 15.
I can get close using window functions:
SELECT DISTINCT
MIN(start_ts) OVER (PARTITION BY mac_addr, access_point, ROUND(EXTRACT(EPOCH FROM start_ts)/900)) AS start_ts,
MAX(end_ts) OVER (PARTITION BY mac_addr, access_point, ROUND(EXTRACT(EPOCH FROM end_ts)/900)) AS end_ts,
mac_addr,
access_point
FROM network_test
ORDER BY mac_addr, access_point, start_ts
start_ts | end_ts | mac_addr | access_point |
---|---|---|---|
2023-08-14 13:21:10.289+00 | 2023-08-14 13:31:20.855+00 | 00:00:00:00:00:01 | access_point_01 |
2023-08-14 13:58:10.638+00 | 2023-08-14 13:58:38.966+00 | 00:00:00:00:00:01 | access_point_01 |
2023-08-14 13:28:28.19+00 | 2023-08-14 13:28:44.288+00 | 00:00:00:00:00:02 | access_point_02 |
2023-08-14 13:45:40.281+00 | 2023-08-14 13:47:25.106+00 | 00:00:00:00:00:02 | access_point_03 |
2023-08-14 13:55:25.381+00 | 2023-08-14 13:58:33.059+00 | 00:00:00:00:00:02 | access_point_03 |
But note that the last two data points end up in separate 15-minute buckets, even though they are only 8 minutes apart.
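To see why the buckets split those rows, it can help to print the bucket number that `ROUND(EXTRACT(EPOCH FROM ...)/900)` assigns to each timestamp. The query below is only a diagnostic sketch against the `network_test` table above; the column aliases `start_bucket` and `end_bucket` are made up for illustration.

```sql
-- Diagnostic sketch: show the fixed 15-minute bucket each timestamp lands in.
-- start_bucket / end_bucket are illustrative aliases, not part of the original query.
SELECT start_ts,
       end_ts,
       ROUND(EXTRACT(EPOCH FROM start_ts) / 900) AS start_bucket,
       ROUND(EXTRACT(EPOCH FROM end_ts)   / 900) AS end_bucket
FROM network_test
WHERE mac_addr = '00:00:00:00:00:02'
  AND access_point = 'access_point_03'
ORDER BY start_ts;
```

Because `ROUND` snaps each timestamp to the nearest quarter hour, the bucket edges sit at fixed wall-clock times such as 13:52:30. A session ending at 13:47:25 and one starting at 13:55:25 fall on opposite sides of that edge, so they are never grouped no matter how short the gap between them is. The start and end times are also bucketed independently, so a single row can contribute to two different groups.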
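Does anyone know whether there is a way to do this in SQL, or will I have to write a PL/pgSQL function that walks through the data row by row and compares them?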
This works in Postgres 9.4:
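The query is not reproduced at this point, so what follows is a minimal sketch of the common "gaps and islands" pattern that matches the stated rule, not necessarily the answerer's exact query. It uses only `LAG`, a windowed `SUM`, and `GROUP BY`, all of which are available in PostgreSQL 9.4. The names `new_session`, `session_id`, `flagged`, and `numbered` are illustrative.

```sql
-- Sketch (assumed approach): gaps-and-islands session merging.
SELECT MIN(start_ts) AS start_ts,
       MAX(end_ts)   AS end_ts,
       mac_addr,
       access_point
FROM (
    -- Step 2: a running total of the flags numbers each session
    -- within a (mac_addr, access_point) pair.
    SELECT flagged.*,
           SUM(new_session) OVER (PARTITION BY mac_addr, access_point
                                  ORDER BY start_ts) AS session_id
    FROM (
        -- Step 1: flag a row with 1 when it starts a new session, i.e. when
        -- there is no previous row or the gap since the previous row's
        -- end_ts is 15 minutes or more.
        SELECT network_test.*,
               CASE
                   WHEN start_ts - LAG(end_ts) OVER (PARTITION BY mac_addr, access_point
                                                     ORDER BY start_ts)
                        < INTERVAL '15 minutes'
                   THEN 0
                   ELSE 1
               END AS new_session
        FROM network_test
    ) AS flagged
) AS numbered
-- Step 3: collapse each numbered session into a single row.
GROUP BY mac_addr, access_point, session_id
ORDER BY mac_addr, access_point, start_ts;
```

On the sample data this should return the four rows listed as the desired result, since the 8-minute gap before the last `access_point_03` row is compared directly instead of being bucketed. One caveat of this sketch: `LAG(end_ts)` looks only at the immediately preceding row, so if sessions can overlap and an earlier row ends later than its successor, comparing against `MAX(end_ts) OVER (... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)` would be the safer choice.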
fiddle
See: