In general, I want to group by one column or another:
SELECT count(*)
FROM foo
GROUP BY column1 OR column2
In particular, I want to identify duplicates in a given table. Below is an example with minimal data.
Minimal example
| id | name | birth_date | email |
|---:|:-----------|:-------------|:-----------------------------|
| 1 | 'Gamow' | '1904-03-04' | '[email protected]' |
| 2 | 'Gamow' | NULL | NULL |
| 3 | 'Gamow' | '1904-03-04' | NULL |
| 4 | 'Gamow' | '1904-03-04' | '[email protected]' |
| 5 | 'Gamow' | NULL | '[email protected]' |
| 6 | 'Feynman' | '1918-05-11' | '[email protected]' |
| 7 | 'Feynman' | '1918-05-11' | NULL |
| 8 | 'Poincaré' | '1854-04-29' | '[email protected]' |
Two people are considered duplicates if they have the same name and (the same birth date or the same email).
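To make the rule concrete, it can be written as a pairwise predicate in a self-join. This is only a sketch of the definition, assuming the `people` table created further down; the aliases `a` and `b` are made up for illustration. Because a comparison with NULL is never true, a NULL birth date or email never counts as a match.

```sql
-- Every pair (a, b) of rows that are duplicates of each other,
-- listed once with the lower id on the left.
SELECT
    a.id AS id_a,
    b.id AS id_b
FROM people AS a
JOIN people AS b
    ON  b.name = a.name                       -- same name ...
    AND b.id > a.id                           -- ... each pair only once
    AND (   b.birth_date = a.birth_date       -- ... and same birth date
         OR b.email      = a.email)           -- ... or same email
ORDER BY a.id, b.id;
```

Note that here the OR sits in a join condition, where it is an ordinary boolean test, rather than in GROUP BY.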
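I would like a query that returns, for each set of duplicates:
- the MIN(id) among the duplicates: the row to keep
- 1+ other ids: the rows to delete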
I wrote this query:
SELECT
MIN(p.id) AS id_tokeep,
REPLACE(
GROUP_CONCAT(p.id ORDER BY p.id ASC SEPARATOR ','),
CONCAT(MIN(p.id), ','),
''
) AS ids_todelete,
MIN(p.name) AS name
FROM people AS p
WHERE p.birth_date IS NOT NULL OR p.email IS NOT NULL
GROUP BY p.name, (p.birth_date IS NOT NULL) OR (p.email IS NOT NULL)
HAVING COUNT(id) > 1
ORDER BY id_tokeep;
which works:
| id_tokeep | ids_todelete | name |
|----------:|--------------|-----------|
| 1 | '3,4,5' | 'Gamow' |
| 6 | '7' | 'Feynman' |
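Except that:
- using OR inside a GROUP BY clause feels like a hack (I have not found it used anywhere),
- it sometimes produces the warning « Truncated incorrect DOUBLE value: '[email protected]' »,
- with real data, it returns "weird" results (which seems to confirm that it is a hack).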
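One way to see what is going on is to select the grouping expression itself. The sketch below assumes the `people` table from the minimal example; `group_key` is a name made up for illustration.

```sql
-- Show the value that the second GROUP BY expression takes for each row.
-- IS NOT NULL yields 1 or 0, so the OR of two such tests is also 1 or 0.
SELECT
    p.id,
    p.name,
    (p.birth_date IS NOT NULL) OR (p.email IS NOT NULL) AS group_key
FROM people AS p
ORDER BY p.id;
-- group_key is 1 for every row except id 2 (both columns NULL), where it is 0.
```

Since the WHERE clause already removes the rows where the key would be 0, the key is constant and the query effectively groups by name alone. That would explain the "weird" results on real data: two people who share a name but neither a birth date nor an email end up in the same group. As for the warning, it would be consistent with a variant such as `GROUP BY p.name, p.birth_date OR p.email`, where MySQL has to cast the email string to a number to evaluate the OR, but that is an assumption.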
Hence my questions:
- Is it "legal" to use OR in a GROUP BY clause?
- If not, how can this query be rewritten without OR in GROUP BY?
SQL code to generate the minimal example
DROP TABLE IF EXISTS people;
CREATE TABLE people (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
birth_date DATE DEFAULT NULL,
email VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
INSERT INTO people (name, birth_date, email)
VALUES
('Gamow', '1904-03-04', '[email protected]'), -- 1
('Gamow', NULL, NULL), -- 2
('Gamow', '1904-03-04', NULL), -- 3
('Gamow', '1904-03-04', '[email protected]'), -- 4
('Gamow', NULL, '[email protected]'), -- 5
('Feynman', '1918-05-11', '[email protected]'), -- 6
('Feynman', '1918-05-11', NULL), -- 7
('Poincaré', '1854-04-29', '[email protected]') -- 8
;
Other attempted queries
I tried two self-joins, but then I had trouble grouping the duplicates:
SELECT
p1.id AS patient_id,
p2.id AS patient_match_birth_date_id,
p3.id AS patient_match_email_id
FROM people AS p1
JOIN people AS p2 ON p2.birth_date = p1.birth_date
JOIN people AS p3 ON p3.email = p1.email
ORDER BY p1.id, p2.id, p3.id;
I tried combining the duplicates on birth date with the duplicates on email, but it returns redundant data:
WITH patient_matches AS (
(
SELECT
p1.id AS patient_id,
p2.id AS patient_match_id
FROM people AS p1
JOIN people AS p2 ON p2.birth_date = p1.birth_date
ORDER BY p1.id ASC
)
UNION ALL
(
SELECT
p1.id AS patient_id,
p2.id AS patient_match_id
FROM people AS p1
JOIN people AS p2 ON p2.email = p1.email
ORDER BY p1.id ASC
)
ORDER BY patient_match_id
)
SELECT json_arrayagg(pm.patient_id)
FROM patient_matches AS pm
GROUP BY pm.patient_match_id;
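Part of the redundancy probably comes from each row matching itself (nothing excludes `p1.id = p2.id`), from pairs being listed in both directions, from UNION ALL keeping the pairs that match on both columns, and from the missing `name` condition. A possible direction, as a sketch only: map every row to the lowest id it directly matches, then aggregate by that id. The CTE name `matches` is made up for illustration, and MySQL 8 is assumed since the query above already uses WITH.

```sql
-- For each row, find the lowest-id earlier row it duplicates directly.
WITH matches AS (
    SELECT
        a.id      AS id,
        MIN(b.id) AS id_tokeep
    FROM people AS a
    JOIN people AS b
        ON  b.name = a.name
        AND b.id < a.id                       -- only look at earlier rows
        AND (   b.birth_date = a.birth_date
             OR b.email      = a.email)
    GROUP BY a.id
)
-- Collect the rows to delete under the row to keep.
SELECT
    m.id_tokeep,
    GROUP_CONCAT(m.id ORDER BY m.id ASC SEPARATOR ',') AS ids_todelete,
    MIN(p.name) AS name
FROM matches AS m
JOIN people AS p ON p.id = m.id_tokeep
GROUP BY m.id_tokeep
ORDER BY m.id_tokeep;
```

This only follows direct matches. If row C matches row B by email, and B matches row A by birth date, but C does not match A directly, C is attached to B rather than A, so `id_tokeep` can itself appear in another group's `ids_todelete`. Handling such chains would need something more, for example a recursive CTE, which is not attempted here.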