AskOverflow.Dev

Jez
Asked: 2016-12-09 10:09:11 +0800 CST

How can I make this aggregate query more efficient?


I have a query that takes a long time (7 minutes) to execute on my machine, and I would like to know whether I can make it (significantly) faster:

SELECT
    rec.[Id] AS RecordId,
    MIN(rec.[CreationDate]) AS RecordCreationDate,
    MIN(rec.[LastModified]) AS RecordLastModified,
    MIN(rec.[AssetType]) AS RecordAssetType,
    MIN(rec.[MasterFilename]) AS RecordMasterFilename,
    MIN(rec.[GameName]) AS RecordGameName,
    usr.[OrganizationName],
    COUNT(hist.[Id]) AS TimesDownloaded
FROM
(
    SELECT
        innerRec.Id,
        MIN(innerRec.CreationDate) AS CreationDate,
        MIN(innerRec.LastModified) AS LastModified,
        MIN(innerRec.AssetType) AS AssetType,
        MIN(innerRec.MasterFilename) AS MasterFilename,
        MIN(innerRec.GameName) AS GameName
    FROM
        [dbo].[Record] innerRec INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
    -- WHERE (classification ID is foo or bar)
    GROUP BY
        innerRec.Id
    -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
) rec
CROSS JOIN
    [dbo].[AdamUser] usr
LEFT JOIN
    (SELECT * FROM [dbo].[MaintenanceJobHistory] WHERE [CreatedOn] > '2016-01-01 00:00:00' AND [CreatedOn] < '2016-12-01 00:00:00') hist ON usr.Name = hist.AccessingUser AND rec.Id = hist.RecordId
GROUP BY
    rec.Id, usr.OrganizationName

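What it does is pull out data to go into an Excel spreadsheet report (whether a spreadsheet is a good way to represent this data is outside the scope of this question :-))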

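The first subquery pulls out the records, optionally filtered by a list of classification IDs. These are then cross joined with the user table, because each user row contains the information we actually need: the user's organization name. I then left join the maintenance job history table (which stores one entry per record download) so that a record accessed multiple times produces multiple rows, and finally group by record ID and organization name to get the "downloads per record per organization" count as TimesDownloaded.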
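The optional classification filter is only hinted at by the commented-out WHERE and HAVING lines in the query. A minimal sketch of how that filter might be written on its own, assuming the selected IDs are held in a table variable named @wanted (a hypothetical name, not from the original code) and that a record must carry all of the selected classifications:

```sql
-- Hypothetical holder for the classification IDs chosen for the report.
DECLARE @wanted TABLE (ClassificationId uniqueidentifier PRIMARY KEY);
-- INSERT INTO @wanted (ClassificationId) VALUES (...);  -- filled by the caller

-- Records linked to every classification in @wanted.
SELECT lnk.RecordId
FROM dbo.RecordClassificationLink AS lnk
WHERE lnk.ClassificationId IN (SELECT w.ClassificationId FROM @wanted AS w)
GROUP BY lnk.RecordId
-- DISTINCT guards against the same link being stored twice.
HAVING COUNT(DISTINCT lnk.ClassificationId) = (SELECT COUNT(*) FROM @wanted);
```

With an empty @wanted this returns no records at all, so a caller that treats the filter as optional would need to skip it entirely in that case.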

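The code that reads this output then populates an associative array whose keys are OrganizationName and whose values are TimesDownloaded. This creates the equivalent of a dynamic PIVOT, in which each record row has one column per organization, each holding the number of times that organization downloaded the record.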

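As you can imagine, and as I said above, it runs pretty slowly on a large data set. The one I'm using has ~38,000 records and ~1,000 users, which means the cross join produces ~38,000,000 rows, but that seems conceptually necessary.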

Can this be made significantly more efficient? Would it be better if I used PIVOT with dynamic SQL instead?

The DBMS I'm using is SQL Server 2014.

Here are the schema definitions for the tables:

CREATE TABLE [dbo].[AdamUser](
    [Id] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](200) NOT NULL,
    [UserGroupName] [nvarchar](50) NOT NULL,
    [OrganizationName] [nvarchar](50) NOT NULL,
PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

CREATE UNIQUE NONCLUSTERED INDEX [UIX_AdamUser_Name] ON [dbo].[AdamUser]
(
    [Name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE TABLE [dbo].[MaintenanceJobHistory](
    [Id] [uniqueidentifier] NOT NULL,
    [Data] [xml] NOT NULL,
    [CreatedOn] [datetime] NOT NULL,
    [Type] [nvarchar](512) NOT NULL,
    [RecordId] [uniqueidentifier] NOT NULL,
    [AccessingUser] [nvarchar](200) NOT NULL,
 CONSTRAINT [PK_MaintenanceJobHistory] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_MaintenanceJobHistory_CreatedOn] ON [dbo].[MaintenanceJobHistory]
(
    [CreatedOn] ASC
)
INCLUDE (     [Id],
    [RecordId],
    [AccessingUser]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE TABLE [dbo].[Record](
    [Id] [uniqueidentifier] NOT NULL,
    [CreationDate] [datetime] NOT NULL,
    [LastModified] [datetime] NOT NULL,
    [AssetType] [nvarchar](max) NULL,
    [MasterFilename] [nvarchar](max) NULL,
    [GameName] [nvarchar](max) NULL,
 CONSTRAINT [PK_Record] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

CREATE TABLE [dbo].[RecordClassificationLink](
    [Id] [uniqueidentifier] NOT NULL,
    [RecordId] [uniqueidentifier] NOT NULL,
    [ClassificationId] [uniqueidentifier] NOT NULL,
 CONSTRAINT [PK_RecordClassificationLink] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

Here is the execution plan: https://www.brentozar.com/pastetheplan/?id=Sy6LlXDXg

Typical output: (image: typical output)

However, the calling program converts this output into an Excel spreadsheet along the following lines (so it acts like a PIVOT operation):

.----------------------------------------------------------------------.
| Filename | Creation Date | #times downloaded by: CompanyA | CompanyB | ...
| foo.png  | 1/2/3         |                       0        | 2        |
| bar.png  | 1/3/4         |                       3        | 1        |
...

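Update:

I ended up making this considerably more efficient by moving the PIVOT operation into the SQL query itself. That way, SQL Server only has to output as many rows as there are in the Record table, rather than that number multiplied by the number of organizations (which isn't too bad until you reach hundreds of organizations, at which point it becomes a huge number). The operation still takes a few minutes, but it is much more bearable. Here is the query I finally settled on: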

SELECT *
FROM (
    SELECT
        rec.[Id]                         AS RecordId,
        'Org_' + usr.[OrganizationName]  AS OrganizationNamePrefixed,
        COUNT(hist.[Id])                 AS TimesDownloaded  -- To be aggregated by PIVOT
    FROM (
        SELECT
            innerRec.[Id]
        FROM
            [dbo].[Record] innerRec
        INNER JOIN
            [dbo].[RecordClassificationLink] innerLnk ON innerLnk.[RecordId] = innerRec.[Id]
        -- WHERE (classification ID is foo or bar), for optional classification filtering
        GROUP BY
            innerRec.[Id]
        -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications), for optional classification filtering
    ) rec
    CROSS JOIN [dbo].[AdamUser] usr
    LEFT JOIN (
        SELECT * FROM [dbo].[MaintenanceJobHistory] WHERE [CreatedOn] > '2016-01-01 12:00:00' AND [CreatedOn] < '2016-12-01 12:00:00'  -- example date range
    ) hist ON hist.[AccessingUser] = usr.[Name] AND hist.[RecordId] = rec.[Id]
    GROUP BY
        rec.[Id], usr.[OrganizationName]
) srcTable
PIVOT  -- Pivot around columns outside aggregation fn, eg. heading column [OrganizationNamePrefixed] & all other columns: [RecordId]
(
    MIN(srcTable.[TimesDownloaded]) FOR [OrganizationNamePrefixed] IN (...list of ~200 columns dynamically generated...)
) pivotTable
INNER JOIN [dbo].[Record] outerRec ON outerRec.[Id] = pivotTable.[RecordId]

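I added various indexes, and made the PIVOT as efficient as possible by selecting only the aggregated column, the heading column, and the one other column that is needed. Finally, I JOIN back to the Record table on its primary key, using RecordId, to get the additional record information for each row.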
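The query above leaves the pivot column list elided. A minimal sketch of one way such a list might be generated and spliced into dynamic SQL is shown here; it is an illustration under assumptions, not the code actually used. It relies on the FOR XML PATH concatenation idiom because STRING_AGG is not available in SQL Server 2014, it omits the classification filter for brevity, and the variable names (@cols, @sql, @from, @to) are made up for the example. It also assumes dbo.AdamUser contains at least one row; otherwise @cols is NULL and there is nothing to pivot.

```sql
DECLARE @from datetime = '2016-01-01T00:00:00',
        @to   datetime = '2016-12-01T00:00:00';
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Build a list such as [Org_CompanyA],[Org_CompanyB].
-- QUOTENAME brackets and escapes each name so it is safe to splice in.
SELECT @cols = STUFF((
    SELECT N',' + QUOTENAME(N'Org_' + o.OrganizationName)
    FROM (SELECT DISTINCT OrganizationName FROM dbo.AdamUser) AS o
    ORDER BY o.OrganizationName
    FOR XML PATH(''), TYPE
).value('.', 'nvarchar(max)'), 1, 1, N'');

-- Only the column list is concatenated; the dates stay real parameters.
SET @sql = N'
SELECT pivotTable.*
FROM (
    SELECT rec.Id AS RecordId,
           N''Org_'' + usr.OrganizationName AS OrganizationNamePrefixed,
           COUNT(hist.Id) AS TimesDownloaded
    FROM dbo.Record AS rec
    CROSS JOIN dbo.AdamUser AS usr
    LEFT JOIN dbo.MaintenanceJobHistory AS hist
           ON hist.AccessingUser = usr.Name
          AND hist.RecordId = rec.Id
          AND hist.CreatedOn > @from
          AND hist.CreatedOn < @to
    GROUP BY rec.Id, usr.OrganizationName
) AS srcTable
PIVOT (
    MIN(TimesDownloaded) FOR OrganizationNamePrefixed IN (' + @cols + N')
) AS pivotTable;';

EXEC sys.sp_executesql
     @sql,
     N'@from datetime, @to datetime',
     @from = @from, @to = @to;
```

Passing the date range through sp_executesql parameters rather than concatenating it keeps the only dynamic part of the statement limited to the quoted column names.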

optimization sql-server-2014

1 Answer

  1. Best Answer
    Daniel Hutmacher
    2016-12-09T13:26:12+08:00

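    What is probably taking most of the time is the large number of Sort operations in the query plan. You can pre-empt those by sorting the data yourself, in the form of indexes.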

    Here are some index suggestions that I think could get you started:

    CREATE INDEX IX_RecordId
        ON dbo.RecordClassificationLink (RecordId);
    
    CREATE INDEX IX_MaintenanceJobHistory_ByUser
        ON dbo.MaintenanceJobHistory (AccessingUser, RecordId);
    
    CREATE UNIQUE INDEX UIX_AdamUser_OrgnName
        ON dbo.AdamUser (OrganizationName, Name);
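One variant that may be worth testing, offered here as an assumption rather than as part of the answer: the history subquery also filters on CreatedOn, which a nonclustered index keyed on (AccessingUser, RecordId) does not contain. Including that column would let the index cover the subquery on its own. The index name below is hypothetical.

```sql
-- Same key as IX_MaintenanceJobHistory_ByUser, plus CreatedOn as an
-- included column so the date-range filter can be evaluated without
-- going back to the clustered index.
CREATE INDEX IX_MaintenanceJobHistory_ByUser_Covering
    ON dbo.MaintenanceJobHistory (AccessingUser, RecordId)
    INCLUDE (CreatedOn);
```

Whether this helps in practice depends on the data and the plan the optimizer picks, so compare the execution plans with and without it.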
    

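    Then you can modify your query slightly to help the optimizer make some smart choices, such as aggregating certain data streams before joining them, instead of first building an even larger product that takes more time to aggregate: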

    SELECT
        rec.[Id] AS RecordId,
        MIN(rec.[CreationDate]) AS RecordCreationDate,
        MIN(rec.[LastModified]) AS RecordLastModified,
        MIN(rec.[AssetType]) AS RecordAssetType,
        MIN(rec.[MasterFilename]) AS RecordMasterFilename,
        MIN(rec.[GameName]) AS RecordGameName,
        usr.[OrganizationName],
        SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
    FROM
    (
        SELECT
            innerRec.Id,
            MIN(innerRec.CreationDate) AS CreationDate,
            MIN(innerRec.LastModified) AS LastModified,
            MIN(innerRec.AssetType) AS AssetType,
            MIN(innerRec.MasterFilename) AS MasterFilename,
            MIN(innerRec.GameName) AS GameName
        FROM [dbo].[Record] innerRec
        INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
        -- WHERE (classification ID is foo or bar)
        GROUP BY
            innerRec.Id
        -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
    ) AS rec
    CROSS JOIN [dbo].[AdamUser] AS usr
    LEFT JOIN
        (SELECT AccessingUser, RecordId, COUNT(*) AS _count
         FROM [dbo].[MaintenanceJobHistory]
         WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
               [CreatedOn] < '2016-12-01 00:00:00'
         GROUP BY AccessingUser, RecordId
        ) AS hist ON
            usr.Name = hist.AccessingUser AND
            rec.Id = hist.RecordId
    GROUP BY
        rec.Id, usr.OrganizationName;
    

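    Here's what I did:

    • The hist subquery is now aggregated on AccessingUser, RecordId, and I added a COUNT(*) AS _count. This query executes very efficiently using the new index IX_MaintenanceJobHistory_ByUser, without any memory grant or hash table.
    • As a result, I replaced COUNT(hist.Id) with SUM(ISNULL(hist._count, 0)) AS TimesDownloaded.
    • The new index on dbo.RecordClassificationLink helps perform a smooth join with the Record table, but if you add the WHERE and HAVING clauses, that index won't help.
    • The new index on dbo.AdamUser also improves performance by eliminating a Sort operator: since you're aggregating on the OrganizationName column, you might as well have the data sorted by it from the start.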

    As far as I can see, this should give you the same results, but it's late here, so you'll have to verify the results yourself. :)
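One way to sanity-check that equivalence is to run both shapes against a tiny data set and compare them with EXCEPT. The sketch below uses table variables with made-up rows (@rec, @usr, @hist and their contents are illustrative only) in place of the real tables:

```sql
DECLARE @rec  TABLE (Id int PRIMARY KEY);
DECLARE @usr  TABLE (Name nvarchar(200), OrganizationName nvarchar(50));
DECLARE @hist TABLE (Id int IDENTITY PRIMARY KEY,
                     AccessingUser nvarchar(200), RecordId int);

INSERT INTO @rec (Id) VALUES (1), (2);
INSERT INTO @usr (Name, OrganizationName)
VALUES (N'alice', N'CompanyA'), (N'bob', N'CompanyA'), (N'carol', N'CompanyB');
INSERT INTO @hist (AccessingUser, RecordId)
VALUES (N'alice', 1), (N'alice', 1), (N'bob', 1), (N'carol', 2);

-- Original shape: count the raw history rows after the join.
SELECT r.Id, u.OrganizationName, COUNT(h.Id) AS TimesDownloaded
FROM @rec AS r
CROSS JOIN @usr AS u
LEFT JOIN @hist AS h
       ON h.AccessingUser = u.Name AND h.RecordId = r.Id
GROUP BY r.Id, u.OrganizationName

EXCEPT

-- Rewritten shape: pre-aggregate the history, then sum the counts.
SELECT r.Id, u.OrganizationName, SUM(ISNULL(h._count, 0))
FROM @rec AS r
CROSS JOIN @usr AS u
LEFT JOIN (SELECT AccessingUser, RecordId, COUNT(*) AS _count
           FROM @hist
           GROUP BY AccessingUser, RecordId) AS h
       ON h.AccessingUser = u.Name AND h.RecordId = r.Id
GROUP BY r.Id, u.OrganizationName;
```

If the two forms agree, the statement should return no rows. Swapping the two halves checks the other direction, and the same EXCEPT comparison can be pointed at the real tables once the small case passes.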

    Here's my query plan: (image: first plan)

    Edit: You can also simplify the rec part a bit, which is probably easier to read:

    SELECT
        rec.[Id] AS RecordId,
        rec.[CreationDate] AS RecordCreationDate,
        rec.[LastModified] AS RecordLastModified,
        rec.[AssetType] AS RecordAssetType,
        rec.[MasterFilename] AS RecordMasterFilename,
        rec.[GameName] AS RecordGameName,
        usr.[OrganizationName],
        SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
    FROM
    (
        SELECT
            Id,
            CreationDate,
            LastModified,
            AssetType,
            MasterFilename,
            GameName
        FROM [dbo].[Record]
        WHERE Id IN (
            SELECT RecordId
            FROM [dbo].[RecordClassificationLink]
            --- WHERE ClassificationId=...
            )
    ) AS rec
    CROSS JOIN [dbo].[AdamUser] AS usr
    LEFT JOIN
        (SELECT AccessingUser, RecordId, COUNT(*) AS _count
         FROM [dbo].[MaintenanceJobHistory]
         WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
               [CreatedOn] < '2016-12-01 00:00:00'
         GROUP BY AccessingUser, RecordId
        ) AS hist ON
            usr.Name = hist.AccessingUser AND
            rec.Id = hist.RecordId
    GROUP BY
        usr.OrganizationName,
        rec.[Id],
        rec.[CreationDate],
        rec.[LastModified],
        rec.[AssetType],
        rec.[MasterFilename],
        rec.[GameName];
    

    ...and the plan looks slightly better as well (look for the red Lazy spool at the bottom of the original plan, which is now gone).

    (image: new plan)
