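I have a table describing my application's users, with details such as first name, surname, birth date, nationality, email, etc.
For each user category and each attribute, I would like to know the most frequent value and the percentage of rows in which it occurs.
For example: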
create table test (
    userId     int identity(1,1),
    categoryId int,
    name       varchar(50),
    surname    varchar(50)
);

insert into test (categoryId, name, surname)
values (1, 'John', 'Locke'),
       (1, 'John', 'Millar'),
       (1, 'James', 'Mill'),
       (1, 'John Stuart', 'Mill'),
       (2, 'Thomas', 'Bayes'),
       (2, 'Laurent', 'Schwartz'),
       (2, 'Herrmann Amandus', 'Schwartz'),
       (2, 'Thomas', 'Simpson'),
       (2, 'Leonhard', 'Euler');
The result should be:
+------------+-------+--------+---------+----------+------------+
| categoryId | total | name   | namePct | surname  | surnamePct |
+------------+-------+--------+---------+----------+------------+
| 1          | 4     | John   | 0.50    | Mill     | 0.50       |
| 2          | 5     | Thomas | 0.40    | Schwartz | 0.40       |
+------------+-------+--------+---------+----------+------------+
For this simple example, I worked out how to get there with the following query:
select t.categoryId,
       t.total,
       n.name,
       1. * n.total / t.total as namePct,
       sn.surname,
       1. * sn.total / t.total as surnamePct
from (
    -- number of users per category
    select categoryId, count(*) as total
    from test
    group by categoryId
) t
join (
    -- most frequent name per category
    select categoryId, name, total
    from (
        select categoryId, name, total,
               row_number() over(partition by categoryId order by total desc) as rn
        from (
            select categoryId, name, count(*) as total
            from test
            group by categoryId, name
        ) t
    ) t
    where rn = 1
) n on t.categoryId = n.categoryId
join (
    -- most frequent surname per category
    select categoryId, surname, total
    from (
        select categoryId, surname, total,
               row_number() over(partition by categoryId order by total desc) as rn
        from (
            select categoryId, surname, count(*) as total
            from test
            group by categoryId, surname
        ) t
    ) t
    where rn = 1
) sn on t.categoryId = sn.categoryId
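However, in my real use case the table has millions of rows, hundreds of categories, and a dozen or so attributes.
Is there a way to make the query simpler and more efficient (i.e. without a stack of subselects for every attribute)?
I am currently on SQL Server 2008, but answers that rely on newer versions are welcome.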
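
To make the goal concrete, here is a sketch of the kind of shape that would avoid one block of subselects per attribute. It rests on several assumptions: the CTE names `unpivoted` and `counted` and the column names `attr`, `val`, `cnt` and `rn` are made up for illustration, ties are broken alphabetically by value (the query above leaves ties undefined), and every attribute can be cast to a common type (here both are already `varchar(50)`). It relies on `cross apply (values ...)`, which should be available from SQL Server 2008 onward. Whether it is actually faster on millions of rows is not something this sketch establishes, since unpivoting multiplies the row count by the number of attributes.

```sql
-- Sketch only: unpivot the attributes, count once, then pivot the winners back.
with unpivoted as (
    -- one row per (user, attribute); attr/val are illustrative names
    select t.categoryId, a.attr, a.val
    from test t
    cross apply (values ('name', t.name),
                        ('surname', t.surname)) a(attr, val)
),
counted as (
    select categoryId, attr, val,
           count(*) as cnt,
           -- rows per category (summed over all values of the attribute)
           sum(count(*)) over(partition by categoryId, attr) as total,
           -- rank values by frequency; val is an assumed tie-breaker
           row_number() over(partition by categoryId, attr
                             order by count(*) desc, val) as rn
    from unpivoted
    group by categoryId, attr, val
)
select categoryId,
       max(total) as total,
       max(case when attr = 'name'    then val end)               as name,
       max(case when attr = 'name'    then 1. * cnt / total end)  as namePct,
       max(case when attr = 'surname' then val end)               as surname,
       max(case when attr = 'surname' then 1. * cnt / total end)  as surnamePct
from counted
where rn = 1
group by categoryId;
```

With this shape, adding an attribute means one more row in the `values` list and two more `case` expressions, rather than another nested join.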