AskOverflow.Dev

irimias
Asked: 2018-07-26 03:54:37 +0800 CST

Evaluating the most common column values in a single query


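I have a table describing my application's users, with details such as first name, surname, date of birth, nationality, email, and so on.

I would like to know, for each attribute and each user category, the most common value and the percentage of rows in which it occurs.

For example: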

create table test ( userId int identity(1,1), 
                    categoryId int, 
                    name varchar(50), 
                    surname varchar(50))

insert into test(categoryId, name, surname)
values  (1, 'John', 'Locke'),
        (1, 'John', 'Millar'),
        (1, 'James', 'Mill'),
        (1, 'John Stuart', 'Mill'),
        (2, 'Thomas', 'Bayes'),
        (2, 'Laurent', 'Schwartz'),
        (2, 'Herrmann Amandus', 'Schwartz'),
        (2, 'Thomas', 'Simpson'),
        (2, 'Leonhard', 'Euler')

The result should be:

+------------+-------+--------+---------+----------+------------+
| categoryId | total |  name  | namePct | surname  | surnamePct |
+------------+-------+--------+---------+----------+------------+
|          1 |     4 | John   |    0.50 | Mill     |       0.50 |
|          2 |     5 | Thomas |    0.40 | Schwartz |       0.40 |
+------------+-------+--------+---------+----------+------------+

For this simple example, I worked out how to do it with a query like the following:

select  t.categoryId, 
        t.total, 
        n.name, 
        1. * n.total / t.total as namePct,
        sn.surname,
        1. * sn.total / t.total as surnamePct
from (
    select categoryId, count(*) as total
    from test
    group by categoryId
    ) t
join (
        select categoryId, name, total
        from (
            select categoryId, name, total, row_number() over(partition by categoryId order by total desc) as rn
            from (
                select categoryId, name, count(*) as total
                from test
                group by categoryId, name
                ) t
            ) t
        where rn = 1
        ) n on t.categoryId = n.categoryId
join (
        select categoryId, surname, total
        from (
            select categoryId, surname, total, row_number() over(partition by categoryId order by total desc) as rn
            from (
                select categoryId, surname, count(*) as total
                from test
                group by categoryId, surname
                ) t
            ) t
        where rn = 1
        ) sn on t.categoryId = sn.categoryId

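However, in my real use case the table has millions of rows, hundreds of categories, and a dozen or so attributes.

Is there a way to make the query simpler and more efficient (that is, without a stack of subselects per attribute)?

I am currently on SQL Server 2008, but answers that rely on newer versions are welcome.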

sql-server sql-server-2008

1 Answer

  1. Best Answer
    EzLo
    2018-07-26T05:29:29+08:00

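    You can use the windowed version of the COUNT() function, splitting by each category with PARTITION BY, to get both the per-value counts and the per-category totals without subqueries (note the absence of GROUP BY):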

    SELECT
        T.categoryId,
    
        T.name,
        NameOccurencies = COUNT(T.name) OVER (PARTITION BY T.categoryId, T.name),
        NameTotals = COUNT(T.name) OVER (PARTITION BY T.categoryId),
    
        T.surname,
        SurnameOccurencies = COUNT(T.surname) OVER (PARTITION BY T.categoryId, T.surname),
        SurnameTotals = COUNT(T.surname) OVER (PARTITION BY T.categoryId)
    FROM
        #test AS T
    

    Result:

    categoryId  name                NameOccurencies NameTotals  surname             SurnameOccurencies  SurnameTotals
    1           John                2               4           Locke               1                   4
    1           John Stuart         1               4           Mill                2                   4
    1           James               1               4           Mill                2                   4
    1           John                2               4           Millar              1                   4
    2           Thomas              2               5           Bayes               1                   5
    2           Leonhard            1               5           Euler               1                   5
    2           Herrmann Amandus    1               5           Schwartz            2                   5
    2           Laurent             1               5           Schwartz            2                   5
    2           Thomas              2               5           Simpson             1                   5
    

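    You can then use this result to get each percentage by dividing the occurrences by the corresponding total. In the same step you can also use ROW_NUMBER() to rank the most common name and surname: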

    ;WITH Totals AS
    (
        SELECT
            T.categoryId,
    
            T.name,
            NameOccurencies = COUNT(T.name) OVER (PARTITION BY T.categoryId, T.name),
            NameTotals = COUNT(T.name) OVER (PARTITION BY T.categoryId),
    
            T.surname,
            SurnameOccurencies = COUNT(T.surname) OVER (PARTITION BY T.categoryId, T.surname),
            SurnameTotals = COUNT(T.surname) OVER (PARTITION BY T.categoryId)
        FROM
            #test AS T
    )
    SELECT
        T.categoryId,
    
        T.name,
        NamePercentage = T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0),
        NameMostFrequentRanking = ROW_NUMBER() OVER (
            PARTITION BY
                T.categoryId
            ORDER BY 
                T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0) DESC), -- NamePercentage
    
        T.surname,
        SurnamePercentage = T.SurnameOccurencies * 1.0 / NULLIF(T.SurnameTotals, 0),
        SurnameMostFrequentRanking = ROW_NUMBER() OVER (
            PARTITION BY
                T.categoryId
            ORDER BY 
                T.SurnameOccurencies * 1.0 / NULLIF(T.SurnameTotals, 0) DESC) -- SurnamePercentage
    FROM
        Totals AS T
    

    Result:

    categoryId  name                NamePercentage  NameMostFrequentRanking surname     SurnamePercentage   SurnameMostFrequentRanking
    1           John Stuart         0.250000000000  3                       Mill        0.500000000000      1
    1           James               0.250000000000  4                       Mill        0.500000000000      2
    1           John                0.500000000000  1                       Millar      0.250000000000      3
    1           John                0.500000000000  2                       Locke       0.250000000000      4
    2           Herrmann Amandus    0.200000000000  4                       Schwartz    0.400000000000      1
    2           Laurent             0.200000000000  5                       Schwartz    0.400000000000      2
    2           Thomas              0.400000000000  1                       Simpson     0.200000000000      3
    2           Thomas              0.400000000000  2                       Bayes       0.200000000000      4
    2           Leonhard            0.200000000000  3                       Euler       0.200000000000      5
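
One caveat about the ranking step: ROW_NUMBER() orders rows with equal percentages arbitrarily, so when two different values tie for most frequent in a category, the one ranked 1 is not guaranteed to be the same on every run. The following is a minimal sketch, assuming a repeatable pick is wanted, that adds the column itself as a secondary sort key. The table #tie_demo and its rows are hypothetical and only serve to produce a tie.

```sql
-- Hypothetical data: 'Ann' and 'Bob' tie at 2 occurrences each in category 1.
CREATE TABLE #tie_demo (categoryId INT, name VARCHAR(50));

INSERT INTO #tie_demo (categoryId, name)
VALUES (1, 'Bob'), (1, 'Ann'), (1, 'Bob'), (1, 'Ann'), (1, 'Cid');

WITH Totals AS
(
    SELECT
        T.categoryId,
        T.name,
        NameOccurencies = COUNT(T.name) OVER (PARTITION BY T.categoryId, T.name)
    FROM
        #tie_demo AS T
)
SELECT
    T.categoryId,
    T.name,
    T.NameOccurencies,
    -- The secondary key T.name breaks the tie alphabetically, so rank 1
    -- always lands on an 'Ann' row instead of an arbitrary tied name.
    NameMostFrequentRanking = ROW_NUMBER() OVER (
        PARTITION BY
            T.categoryId
        ORDER BY
            T.NameOccurencies DESC,
            T.name)
FROM
    Totals AS T;

DROP TABLE #tie_demo;
```

Any deterministic secondary key would do; alphabetical order is just an assumption made for the sketch.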
    

    Finally, for each available category...

    SELECT
        T.categoryId,
        TotalRecords = COUNT(1)
    FROM
        #test AS T
    GROUP BY
        T.categoryId
    

    we can get the most common name and surname, together with their percentages, through a couple of joins:

    ;WITH Totals AS
    (
        SELECT
            T.categoryId,
    
            T.name,
            NameOccurencies = COUNT(T.name) OVER (PARTITION BY T.categoryId, T.name),
            NameTotals = COUNT(T.name) OVER (PARTITION BY T.categoryId),
    
            T.surname,
            SurnameOccurencies = COUNT(T.surname) OVER (PARTITION BY T.categoryId, T.surname),
            SurnameTotals = COUNT(T.surname) OVER (PARTITION BY T.categoryId)
        FROM
            #test AS T
    ),
    MostFrequentRanking AS
    (
        SELECT
            T.categoryId,
    
            T.name,
            NamePercentage = T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0),
            NameMostFrequentRanking = ROW_NUMBER() OVER (
                PARTITION BY
                    T.categoryId
                ORDER BY 
                    T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0) DESC),
    
            T.surname,
            SurnamePercentage = T.SurnameOccurencies * 1.0 / NULLIF(T.SurnameTotals, 0),
            SurnameMostFrequentRanking = ROW_NUMBER() OVER (
                PARTITION BY
                    T.categoryId
                ORDER BY 
                    T.SurnameOccurencies * 1.0 / NULLIF(T.SurnameTotals, 0) DESC)
        FROM
            Totals AS T
    ),
    AvailableCategories AS
    (
        SELECT
            T.categoryId,
            TotalRecords = COUNT(1)
        FROM
            #test AS T
        GROUP BY
            T.categoryId
    )
    SELECT
        A.categoryId,
        A.TotalRecords,
        MN.name,
        NamePercentage = CONVERT(DECIMAL(3, 2), MN.NamePercentage),
        MS.surname,
        SurnamePercentage = CONVERT(DECIMAL(3, 2), MS.SurnamePercentage)
    FROM
        AvailableCategories AS A
        LEFT JOIN MostFrequentRanking AS MN ON 
            A.categoryId = MN.categoryId AND
            MN.NameMostFrequentRanking = 1
        LEFT JOIN MostFrequentRanking AS MS ON 
            A.categoryId = MS.categoryId AND
            MS.SurnameMostFrequentRanking = 1
    

    Result:

    categoryId  TotalRecords    name    NamePercentage  surname     SurnamePercentage
    1           4               John    0.50            Mill        0.50
    2           5               Thomas  0.40            Schwartz    0.40
    

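    The query is somewhat long, but you can extend it to any number of additional columns without adding new SELECT statements: repeat the same logic for each new column you want to display and add one more join at the end.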
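
As an illustration of that extension pattern, here is a sketch that adds a third attribute. The table #test3, the nationality column, and its placeholder values 'A', 'B', 'C' are assumptions made for the example; the name and surname rows are the question's sample data. Each new attribute needs two windowed counts in Totals, a percentage and a ranking in MostFrequentRanking, and one extra LEFT JOIN.

```sql
-- Hypothetical table: the question's rows plus an assumed nationality column.
CREATE TABLE #test3 (categoryId INT, name VARCHAR(50),
                     surname VARCHAR(50), nationality VARCHAR(50));

INSERT INTO #test3 (categoryId, name, surname, nationality)
VALUES (1, 'John', 'Locke', 'A'),
       (1, 'John', 'Millar', 'A'),
       (1, 'James', 'Mill', 'A'),
       (1, 'John Stuart', 'Mill', 'B'),
       (2, 'Thomas', 'Bayes', 'B'),
       (2, 'Laurent', 'Schwartz', 'C'),
       (2, 'Herrmann Amandus', 'Schwartz', 'C'),
       (2, 'Thomas', 'Simpson', 'B'),
       (2, 'Leonhard', 'Euler', 'B');

WITH Totals AS
(
    SELECT
        T.categoryId, T.name, T.surname, T.nationality,
        NameOccurencies = COUNT(T.name) OVER (PARTITION BY T.categoryId, T.name),
        NameTotals = COUNT(T.name) OVER (PARTITION BY T.categoryId),
        SurnameOccurencies = COUNT(T.surname) OVER (PARTITION BY T.categoryId, T.surname),
        SurnameTotals = COUNT(T.surname) OVER (PARTITION BY T.categoryId),
        -- New attribute: the same two windowed counts.
        NationalityOccurencies = COUNT(T.nationality) OVER (PARTITION BY T.categoryId, T.nationality),
        NationalityTotals = COUNT(T.nationality) OVER (PARTITION BY T.categoryId)
    FROM #test3 AS T
),
MostFrequentRanking AS
(
    SELECT
        T.categoryId, T.name, T.surname, T.nationality,
        NamePercentage = T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0),
        NameMostFrequentRanking = ROW_NUMBER() OVER (
            PARTITION BY T.categoryId
            ORDER BY T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0) DESC),
        SurnamePercentage = T.SurnameOccurencies * 1.0 / NULLIF(T.SurnameTotals, 0),
        SurnameMostFrequentRanking = ROW_NUMBER() OVER (
            PARTITION BY T.categoryId
            ORDER BY T.SurnameOccurencies * 1.0 / NULLIF(T.SurnameTotals, 0) DESC),
        -- New attribute: the same percentage and ranking.
        NationalityPercentage = T.NationalityOccurencies * 1.0 / NULLIF(T.NationalityTotals, 0),
        NationalityMostFrequentRanking = ROW_NUMBER() OVER (
            PARTITION BY T.categoryId
            ORDER BY T.NationalityOccurencies * 1.0 / NULLIF(T.NationalityTotals, 0) DESC)
    FROM Totals AS T
),
AvailableCategories AS
(
    SELECT T.categoryId, TotalRecords = COUNT(1)
    FROM #test3 AS T
    GROUP BY T.categoryId
)
SELECT
    A.categoryId,
    A.TotalRecords,
    MN.name,
    NamePercentage = CONVERT(DECIMAL(3, 2), MN.NamePercentage),
    MS.surname,
    SurnamePercentage = CONVERT(DECIMAL(3, 2), MS.SurnamePercentage),
    MT.nationality,
    NationalityPercentage = CONVERT(DECIMAL(3, 2), MT.NationalityPercentage)
FROM
    AvailableCategories AS A
    LEFT JOIN MostFrequentRanking AS MN ON
        A.categoryId = MN.categoryId AND MN.NameMostFrequentRanking = 1
    LEFT JOIN MostFrequentRanking AS MS ON
        A.categoryId = MS.categoryId AND MS.SurnameMostFrequentRanking = 1
    -- New attribute: one additional join on its own ranking column.
    LEFT JOIN MostFrequentRanking AS MT ON
        A.categoryId = MT.categoryId AND MT.NationalityMostFrequentRanking = 1;

DROP TABLE #test3;
```

With these placeholder rows, the added columns should come out as A at 0.75 for category 1 and B at 0.60 for category 2, while the name and surname columns match the result shown earlier.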

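    If you have millions of records and the query takes too long, you might want to split each CTE into a temporary table, using SELECT ... INTO followed by CREATE INDEX on categoryId, to speed up processing (if you are willing to spend some resources creating those tables).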
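
The temporary-table idea could look roughly like the sketch below, shown for the Totals step only. The names #Totals and IX_Totals_categoryId are assumptions, and the choice of a clustered index is one option among several; whether this actually beats the single CTE query depends on the data and is worth checking against the execution plans.

```sql
-- Start clean so the script can be re-run in the same session.
IF OBJECT_ID('tempdb..#Totals') IS NOT NULL DROP TABLE #Totals;
IF OBJECT_ID('tempdb..#test') IS NOT NULL DROP TABLE #test;

-- Sample data from the question, in the #test table the answer queries.
CREATE TABLE #test (userId INT IDENTITY(1,1), categoryId INT,
                    name VARCHAR(50), surname VARCHAR(50));

INSERT INTO #test (categoryId, name, surname)
VALUES (1, 'John', 'Locke'),
       (1, 'John', 'Millar'),
       (1, 'James', 'Mill'),
       (1, 'John Stuart', 'Mill'),
       (2, 'Thomas', 'Bayes'),
       (2, 'Laurent', 'Schwartz'),
       (2, 'Herrmann Amandus', 'Schwartz'),
       (2, 'Thomas', 'Simpson'),
       (2, 'Leonhard', 'Euler');

-- Step 1: materialize what the Totals CTE computes, once.
SELECT
    T.categoryId,
    T.name,
    NameOccurencies = COUNT(T.name) OVER (PARTITION BY T.categoryId, T.name),
    NameTotals = COUNT(T.name) OVER (PARTITION BY T.categoryId),
    T.surname,
    SurnameOccurencies = COUNT(T.surname) OVER (PARTITION BY T.categoryId, T.surname),
    SurnameTotals = COUNT(T.surname) OVER (PARTITION BY T.categoryId)
INTO
    #Totals
FROM
    #test AS T;

-- Step 2: index the temp table by categoryId, the partitioning and join key.
CREATE CLUSTERED INDEX IX_Totals_categoryId ON #Totals (categoryId);

-- Step 3: later steps read #Totals instead of re-evaluating the CTE,
-- for example the name ranking from the answer.
SELECT
    T.categoryId,
    T.name,
    NamePercentage = T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0),
    NameMostFrequentRanking = ROW_NUMBER() OVER (
        PARTITION BY
            T.categoryId
        ORDER BY
            T.NameOccurencies * 1.0 / NULLIF(T.NameTotals, 0) DESC)
FROM
    #Totals AS T;

DROP TABLE #Totals;
DROP TABLE #test;
```

The same SELECT ... INTO plus CREATE INDEX pair can be repeated for the MostFrequentRanking step, so that the final joins read from indexed temporary tables.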
