I have a query that takes a long time (7 minutes) to execute on my machine, and I'm wondering whether I can make it (significantly) faster:
SELECT
rec.[Id] AS RecordId,
MIN(rec.[CreationDate]) AS RecordCreationDate,
MIN(rec.[LastModified]) AS RecordLastModified,
MIN(rec.[AssetType]) AS RecordAssetType,
MIN(rec.[MasterFilename]) AS RecordMasterFilename,
MIN(rec.[GameName]) AS RecordGameName,
usr.[OrganizationName],
COUNT(hist.[Id]) AS TimesDownloaded
FROM
(
SELECT
innerRec.Id,
MIN(innerRec.CreationDate) AS CreationDate,
MIN(innerRec.LastModified) AS LastModified,
MIN(innerRec.AssetType) AS AssetType,
MIN(innerRec.MasterFilename) AS MasterFilename,
MIN(innerRec.GameName) AS GameName
FROM
[dbo].[Record] innerRec INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
-- WHERE (classification ID is foo or bar)
GROUP BY
innerRec.Id
-- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
) rec
CROSS JOIN
[dbo].[AdamUser] usr
LEFT JOIN
(SELECT * FROM [dbo].[MaintenanceJobHistory] WHERE [CreatedOn] > '2016-01-01 00:00:00' AND [CreatedOn] < '2016-12-01 00:00:00') hist ON usr.Name = hist.AccessingUser AND rec.Id = hist.RecordId
GROUP BY
rec.Id, usr.OrganizationName
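What it does is extract data to go into an Excel spreadsheet report (whether a spreadsheet is a good representation of this data is out of scope for this question :-))
The first subquery pulls the records, optionally filtered by a list of classification IDs. These are then cross joined against the user table, because each user row contains the information we actually need: the user's organization name. I then left join the maintenance job history table (which stores one entry per record download) so that multiple rows are produced when a record has been accessed more than once, and group by record ID and organization name to get the "number of downloads per record per organization" count as TimesDownloaded.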
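The code that reads this output then populates an associative array whose keys are OrganizationName and whose values are TimesDownloaded, creating the equivalent of a dynamic PIVOT in which each record row has one column per organization, each containing the number of times the record was downloaded.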
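As you might imagine, and as I said above, it runs pretty slowly on large data sets; the one I'm working with has ~38,000 Records and ~1,000 users, which means the cross join results in ~38,000,000 rows, but that seems conceptually necessary.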
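To put numbers on that, here is a rough sketch that estimates the size of the cross join and of the grouped output from the table counts. It ignores the classification filter and the inner join to RecordClassificationLink, so it is an upper bound rather than an exact figure; the column aliases are made up for illustration.

```sql
-- Rough upper bound on the intermediate and final row counts.
-- Assumption: every Record survives the classification join/filter.
SELECT
    r.RecordCount,
    u.UserCount,
    u.OrganizationCount,
    -- Rows produced by the CROSS JOIN (~38,000 x ~1,000 = ~38,000,000 here)
    CAST(r.RecordCount AS bigint) * u.UserCount         AS CrossJoinRows,
    -- Rows left after GROUP BY rec.Id, usr.OrganizationName
    CAST(r.RecordCount AS bigint) * u.OrganizationCount AS GroupedOutputRows
FROM
    (SELECT COUNT(*) AS RecordCount FROM [dbo].[Record]) r
CROSS JOIN
    (SELECT COUNT(*) AS UserCount,
            COUNT(DISTINCT [OrganizationName]) AS OrganizationCount
     FROM [dbo].[AdamUser]) u;
```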
Can this be made significantly more efficient? Would it be better if I used a PIVOT with dynamic SQL instead?
The DBMS I'm using is SQL Server 2014.
Here are the schema definitions for the tables:
CREATE TABLE [dbo].[AdamUser](
[Id] [uniqueidentifier] NOT NULL,
[Name] [nvarchar](200) NOT NULL,
[UserGroupName] [nvarchar](50) NOT NULL,
[OrganizationName] [nvarchar](50) NOT NULL,
PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE UNIQUE NONCLUSTERED INDEX [UIX_AdamUser_Name] ON [dbo].[AdamUser]
(
[Name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
CREATE TABLE [dbo].[MaintenanceJobHistory](
[Id] [uniqueidentifier] NOT NULL,
[Data] [xml] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[Type] [nvarchar](512) NOT NULL,
[RecordId] [uniqueidentifier] NOT NULL,
[AccessingUser] [nvarchar](200) NOT NULL,
CONSTRAINT [PK_MaintenanceJobHistory] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_MaintenanceJobHistory_CreatedOn] ON [dbo].[MaintenanceJobHistory]
(
[CreatedOn] ASC
)
INCLUDE ( [Id],
[RecordId],
[AccessingUser]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
CREATE TABLE [dbo].[Record](
[Id] [uniqueidentifier] NOT NULL,
[CreationDate] [datetime] NOT NULL,
[LastModified] [datetime] NOT NULL,
[AssetType] [nvarchar](max) NULL,
[MasterFilename] [nvarchar](max) NULL,
[GameName] [nvarchar](max) NULL,
CONSTRAINT [PK_Record] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
CREATE TABLE [dbo].[RecordClassificationLink](
[Id] [uniqueidentifier] NOT NULL,
[RecordId] [uniqueidentifier] NOT NULL,
[ClassificationId] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_RecordClassificationLink] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Here is the execution plan: https://www.brentozar.com/pastetheplan/?id=Sy6LlXDXg
This output is, however, transformed by the calling program into an Excel spreadsheet along the following lines (so it acts like a PIVOT operation):
.----------------------------------------------------------------------.
| Filename | Creation Date | #times downloaded by: CompanyA | CompanyB | ...
| foo.png | 1/2/3 | 0 | 2 |
| bar.png | 1/3/4 | 3 | 1 |
...
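Update:
By moving the PIVOT operation into the SQL query itself, I ended up making this much more efficient; that way, SQL Server only has to output as many rows as there are in the Record table, rather than that number multiplied by the number of organizations (which isn't so bad until you get to hundreds of organizations, at which point it's a huge number). The operation still takes a few minutes, but it's much more bearable. This is the query I ended up using: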
SELECT *
FROM (
SELECT
rec.[Id] AS RecordId,
'Org_' + usr.[OrganizationName] AS OrganizationNamePrefixed,
COUNT(hist.[Id]) AS TimesDownloaded -- To be aggregated by PIVOT
FROM (
SELECT
innerRec.[Id]
FROM
[dbo].[Record] innerRec
INNER JOIN
[dbo].[RecordClassificationLink] innerLnk ON innerLnk.[RecordId] = innerRec.[Id]
-- WHERE (classification ID is foo or bar), for optional classification filtering
GROUP BY
innerRec.[Id]
-- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications), for optional classification filtering
) rec
CROSS JOIN [dbo].[AdamUser] usr
LEFT JOIN (
SELECT * FROM [dbo].[MaintenanceJobHistory] WHERE [CreatedOn] > 'eg. 2016-01-01 12:00:00' AND [CreatedOn] < 'eg. 2016-12-01 12:00:00'
) hist ON hist.[AccessingUser] = usr.[Name] AND hist.[RecordId] = rec.[Id]
GROUP BY
rec.[Id], usr.[OrganizationName]
) srcTable
PIVOT -- Pivot around columns outside aggregation fn, eg. heading column [OrganizationNamePrefixed] & all other columns: [RecordId]
(
MIN(srcTable.[TimesDownloaded]) FOR [OrganizationNamePrefixed] IN (...list of ~200 columns dynamically generated...)
) pivotTable
INNER JOIN [dbo].[Record] outerRec ON outerRec.[Id] = pivotTable.[RecordId]
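I added various indexes and made the PIVOT as efficient as possible by selecting only the aggregated column, the heading column, and the one other column that is necessary. Finally, I re-JOIN the Record table on its PK, using RecordId, to get the extra record information for each row.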
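The dynamically generated column list is not shown above. As a hedged sketch of one way to build it on SQL Server 2014 (which has no STRING_AGG), the batch below concatenates the organization names with FOR XML PATH, quotes them with QUOTENAME, and runs the pivot through sp_executesql. The variable names @from, @to, @cols and @sql are made up for illustration, and the classification filter is left out to keep the example short.

```sql
-- Assumed reporting window, passed to the dynamic statement as parameters.
DECLARE @from datetime = '2016-01-01T00:00:00',
        @to   datetime = '2016-12-01T00:00:00';
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Build a list such as [Org_CompanyA],[Org_CompanyB]; QUOTENAME guards
-- against organization names containing brackets or quotes.
SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME('Org_' + o.[OrganizationName])
    FROM (SELECT DISTINCT [OrganizationName] FROM [dbo].[AdamUser]) o
    ORDER BY o.[OrganizationName]
    FOR XML PATH(''), TYPE
).value('.', 'nvarchar(max)'), 1, 1, '');

-- Only the column list is concatenated into the statement text;
-- the dates stay as real parameters.
SET @sql = N'
SELECT pivotTable.*
FROM (
    SELECT rec.[Id] AS RecordId,
           ''Org_'' + usr.[OrganizationName] AS OrganizationNamePrefixed,
           COUNT(hist.[Id]) AS TimesDownloaded
    FROM [dbo].[Record] rec
    CROSS JOIN [dbo].[AdamUser] usr
    LEFT JOIN [dbo].[MaintenanceJobHistory] hist
           ON hist.[AccessingUser] = usr.[Name]
          AND hist.[RecordId] = rec.[Id]
          AND hist.[CreatedOn] > @from
          AND hist.[CreatedOn] < @to
    GROUP BY rec.[Id], usr.[OrganizationName]
) srcTable
PIVOT (
    MIN([TimesDownloaded]) FOR [OrganizationNamePrefixed] IN (' + @cols + N')
) pivotTable;';

EXEC sys.sp_executesql @sql,
     N'@from datetime, @to datetime',
     @from = @from, @to = @to;
```

If AdamUser is empty, @cols is NULL and so is @sql, so a caller would want to check for that before executing.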
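What's probably taking a lot of time is the large number of Sort operations in the query plan. You can preempt those by sorting the data yourself, in the form of indexes. Here are a few index suggestions that I think will get you started: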
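The index definitions themselves are not reproduced in this text. The following is a sketch of what they might look like, inferred from the descriptions further down; only the name IX_MaintenanceJobHistory_ByUser appears in the text, while the other two index names and all key/INCLUDE column choices are assumptions.

```sql
-- Lets the history rows be read already ordered by (AccessingUser, RecordId),
-- so grouping on those columns needs no sort or hash table.
CREATE NONCLUSTERED INDEX [IX_MaintenanceJobHistory_ByUser]
    ON [dbo].[MaintenanceJobHistory] ([AccessingUser], [RecordId])
    INCLUDE ([CreatedOn]);

-- Hypothetical name: supports the join from Record to its classification links.
CREATE NONCLUSTERED INDEX [IX_RecordClassificationLink_RecordId]
    ON [dbo].[RecordClassificationLink] ([RecordId])
    INCLUDE ([ClassificationId]);

-- Hypothetical name: delivers users pre-sorted by OrganizationName,
-- with Name available for the join to the history table.
CREATE NONCLUSTERED INDEX [IX_AdamUser_OrganizationName]
    ON [dbo].[AdamUser] ([OrganizationName], [Name]);
```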
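Then you can modify your query slightly to help the optimizer make some smart choices, such as aggregating certain data streams before joining them, rather than first creating a larger product that takes more time to aggregate: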
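The modified query is not reproduced in this text. A sketch consistent with the changes described below might look as follows: the history rows are counted per (AccessingUser, RecordId) before the join, and the outer query sums those counts. It assumes an index on MaintenanceJobHistory keyed on (AccessingUser, RecordId), and it uses ISO 8601 date literals so the dates parse the same way under any language setting.

```sql
SELECT
    rec.[Id] AS RecordId,
    MIN(rec.[CreationDate])   AS RecordCreationDate,
    MIN(rec.[LastModified])   AS RecordLastModified,
    MIN(rec.[AssetType])      AS RecordAssetType,
    MIN(rec.[MasterFilename]) AS RecordMasterFilename,
    MIN(rec.[GameName])       AS RecordGameName,
    usr.[OrganizationName],
    -- Sum the pre-aggregated counts; no match in hist counts as 0.
    SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
FROM (
    SELECT
        innerRec.[Id],
        MIN(innerRec.[CreationDate])   AS CreationDate,
        MIN(innerRec.[LastModified])   AS LastModified,
        MIN(innerRec.[AssetType])      AS AssetType,
        MIN(innerRec.[MasterFilename]) AS MasterFilename,
        MIN(innerRec.[GameName])       AS GameName
    FROM [dbo].[Record] innerRec
    INNER JOIN [dbo].[RecordClassificationLink] innerLnk
            ON innerLnk.[RecordId] = innerRec.[Id]
    GROUP BY innerRec.[Id]
) rec
CROSS JOIN [dbo].[AdamUser] usr
LEFT JOIN (
    -- One row per (user, record) instead of one row per download.
    SELECT [AccessingUser], [RecordId], COUNT(*) AS _count
    FROM [dbo].[MaintenanceJobHistory]
    WHERE [CreatedOn] > '2016-01-01T00:00:00'
      AND [CreatedOn] < '2016-12-01T00:00:00'
    GROUP BY [AccessingUser], [RecordId]
) hist ON hist.[AccessingUser] = usr.[Name]
      AND hist.[RecordId] = rec.[Id]
GROUP BY rec.[Id], usr.[OrganizationName];
```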
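Here's what I did:
- The hist query is aggregated on AccessingUser, RecordId, and I created a COUNT(*) AS _count. With the new index IX_MaintenanceJobHistory_ByUser, this query runs very efficiently, without any memory grant or hash table.
- COUNT(hist.Id) becomes SUM(ISNULL(hist._count, 0)) AS TimesDownloaded.
- The index on dbo.RecordClassificationLink helps perform a smooth join to the Record table, but if you add the WHERE and HAVING, that index won't help.
- The index on dbo.AdamUser also improves performance by eliminating a Sort operator: because you're aggregating on the OrganizationName column, you might as well have the data sorted that way from the start.
To my mind, this should give you the same results, but it's late, so you'll have to verify the results yourself. :)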
Here's my query plan:
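Edit: You could also simplify the rec part a bit, which might be easier to read: ... and the plan also looks slightly better (look for the Lazy spool in red at the bottom of the original plan, which is now gone).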
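The simplified rec part is elided above. One possible simplification, offered only as a sketch and not necessarily the one meant here: because Record.Id is the primary key, the MIN() calls inside the subquery return the row's own values, so the join and GROUP BY can be replaced by an EXISTS test against RecordClassificationLink.

```sql
-- Returns each Record that has at least one classification link,
-- the same set the INNER JOIN + GROUP BY innerRec.Id produced.
SELECT
    rec.[Id],
    rec.[CreationDate],
    rec.[LastModified],
    rec.[AssetType],
    rec.[MasterFilename],
    rec.[GameName]
FROM [dbo].[Record] rec
WHERE EXISTS (
    SELECT 1
    FROM [dbo].[RecordClassificationLink] lnk
    WHERE lnk.[RecordId] = rec.[Id]
    -- Optional classification filtering, as in the original comments:
    --   AND (classification ID is foo or bar)
    -- GROUP BY lnk.[RecordId]
    -- HAVING COUNT(lnk.[ClassificationId]) = (number of specified classifications)
);
```

This would be used as the rec derived table in the main query; whether the plan actually improves should be checked against the real data.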