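We have a SQL Server database that serves as both a data lake and a data warehouse. Every table in the database follows some standardized definitions, because we now have around 600 tables and maintenance has to be automated to some degree.
The general process for loading each table is to first load a copy of the table into a heap table in the changeLog schema (sometimes loading only the changed records, if we can determine which records changed), and then compare the changeLog table to the target table. The target tables are used for reporting, so this changeLog approach lets us persist the target table and apply only the minimum necessary UPDATE/INSERT operations.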
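Each target table has a unique key/business key, identifiable through a configuration table, and has standardized audit columns that are named the same in every table. The audit columns tell us:
- when the record was added to the data warehouse
- when the record was last updated in the data warehouse
- whether the record has been deleted in the source
- a changed-record identifier, computed with HASHBYTES('SHA2_256', CONCAT())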
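The changed-record identifier used to be CHECKSUM(), but we found that CHECKSUM() has too high a collision rate to be trusted. I just added a HASHBYTES() column to every table and populated it.
I created the HASHBYTES() column as VARBINARY(MAX). Now, every time a table is loaded, we can tell whether a record needs to be updated by comparing the new HASHBYTES() value computed from the changeLog table against the persisted value in the target table.
I noticed right away that switching from the INT CHECKSUM() to the VARBINARY(MAX) HASHBYTES() slowed the update-check process down significantly. I had NONCLUSTERED indexes on every CHECKSUM column, but none on the HASHBYTES columns I just added. Each table also has a clustered index on its unique key.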
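- What is the ideal index to add for the update check?
- Is there a standardized index I can add to every table?
- Is VARBINARY(MAX) the correct data type, or can it safely be reduced to something smaller?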
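Hopefully this is enough for the question to make sense. I need to speed this process up as soon as possible.
Edit: I am adding a large SQL script as an example. It contains sample table definitions for the changeLog and target versions of a table, along with the queries that run to update the target version.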
--OBJECT DEFINITIONS
CREATE TABLE [changeLog].[DimSalesOffice](
[Sales Office Code] [varchar](3) NOT NULL,
[CCN_Key] [uniqueidentifier] NOT NULL,
[Sales Office Name (Short)] [nvarchar](14) NOT NULL,
[Sales Office Name (Long)] [nvarchar](50) NOT NULL,
[Sales Office City] [nvarchar](50) NOT NULL,
[Sales Office StateProvince] [nvarchar](50) NOT NULL,
[Sales Office Postal Code] [nvarchar](20) NOT NULL,
[Sales Office Country Code] [varchar](3) NOT NULL,
[Sales Office Address] [nvarchar](65) NOT NULL,
[Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
[Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
[Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
[Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
[Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
[Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
[Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
[Sales Office Region] [nvarchar](100) NULL,
[Native Language Code] [varchar](2) NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[DimSalesOffice](
[SalesOffice_Key] [uniqueidentifier] NOT NULL,
[Sales Office Code] [varchar](3) NOT NULL,
[CCN_Key] [uniqueidentifier] NOT NULL,
[Sales Office Name (Short)] [nvarchar](14) NOT NULL,
[Sales Office Name (Long)] [nvarchar](50) NOT NULL,
[Sales Office City] [nvarchar](50) NOT NULL,
[Sales Office StateProvince] [nvarchar](50) NOT NULL,
[Sales Office Postal Code] [nvarchar](20) NOT NULL,
[Sales Office Country Code] [varchar](3) NOT NULL,
[Sales Office Address] [nvarchar](65) NOT NULL,
[Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
[Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
[Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
[Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
[Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
[Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
[Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
[Sales Office Region] [nvarchar](100) NULL,
[Native Language Code] [varchar](2) NOT NULL,
[DW_CreatedOn] [datetime2](7) NULL,
[DW_ModifiedOn] [datetime2](7) NULL,
[DW_IsDeleted?] [bit] NULL,
[DW_Checksum] [int] NULL,
[Source_ModifiedOn] [datetime2](7) NULL,
[DW_Hashbytes] [varbinary](max) NULL,
PRIMARY KEY NONCLUSTERED
(
[SalesOffice_Key] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
ALTER TABLE [dbo].[DimSalesOffice] ADD DEFAULT (newsequentialid()) FOR [SalesOffice_Key]
GO
CREATE UNIQUE CLUSTERED INDEX [IX_UK_DimSalesOffice] ON [dbo].[DimSalesOffice]
(
[Sales Office Code] ASC
)
GO
--MERGE QUERY
DECLARE @InsertRecordCount INT, @UpdateRecordCount INT;
/*****UPDATE*****/
UPDATE [dbo].[DimSalesOffice] SET
[CCN_Key] = [Source].[CCN_Key],
[Sales Office Name (Short)] = [Source].[Sales Office Name (Short)],
[Sales Office Name (Long)] = [Source].[Sales Office Name (Long)],
[Sales Office City] = [Source].[Sales Office City],
[Sales Office StateProvince] = [Source].[Sales Office StateProvince],
[Sales Office Postal Code] = [Source].[Sales Office Postal Code],
[Sales Office Country Code] = [Source].[Sales Office Country Code],
[Sales Office Address] = [Source].[Sales Office Address],
[Sales Office Address (Line 2)] = [Source].[Sales Office Address (Line 2)],
[Sales Office Name (Short - Native Language)] = [Source].[Sales Office Name (Short - Native Language)],
[Sales Office Name (Long - Native Language)] = [Source].[Sales Office Name (Long - Native Language)],
[Sales Office Address (Native Language)] = [Source].[Sales Office Address (Native Language)],
[Sales Office Address (Line 2 - Native Language)] = [Source].[Sales Office Address (Line 2 - Native Language)],
[Sales Office City (Native Language)] = [Source].[Sales Office City (Native Language)],
[Sales Office StateProvince (Native Language)] = [Source].[Sales Office StateProvince (Native Language)],
[Sales Office Region] = [Source].[Sales Office Region],
[Native Language Code] = [Source].[Native Language Code],
[DW_Checksum] =
CHECKSUM(
[Source].[CCN_Key],
[Source].[Sales Office Name (Short)],
[Source].[Sales Office Name (Long)],
[Source].[Sales Office City],
[Source].[Sales Office StateProvince],
[Source].[Sales Office Postal Code],
[Source].[Sales Office Country Code],
[Source].[Sales Office Address],
[Source].[Sales Office Address (Line 2)],
[Source].[Sales Office Name (Short - Native Language)],
[Source].[Sales Office Name (Long - Native Language)],
[Source].[Sales Office Address (Native Language)],
[Source].[Sales Office Address (Line 2 - Native Language)],
[Source].[Sales Office City (Native Language)],
[Source].[Sales Office StateProvince (Native Language)],
[Source].[Sales Office Region],
[Source].[Native Language Code],
0
),
[DW_Hashbytes] =
HASHBYTES(
'SHA2_256',
ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
+ '0'
),
[Source_ModifiedOn] = NULL,
[DW_ModifiedOn] = GETUTCDATE(),
[DW_IsDeleted?] = 0
FROM [changeLog].[DimSalesOffice] [Source]
JOIN [dbo].[DimSalesOffice]
ON [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
AND ISNULL([DimSalesOffice].[DW_Hashbytes], HASHBYTES('SHA2_256', '')) <> HASHBYTES(
'SHA2_256',
ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
+ '0'
)
SET @UpdateRecordCount = @@ROWCOUNT;
/*****Soft Deletes*****/
UPDATE [dbo].[DimSalesOffice] SET
[DW_Checksum] = 0,
[DW_Hashbytes] =
HASHBYTES(
'SHA2_256',
ISNULL(CAST([DimSalesOffice].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([DimSalesOffice].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
+ '1'
),
[Source_ModifiedOn] = NULL,
[DW_ModifiedOn] = GETUTCDATE(),
[DW_IsDeleted?] = 1
FROM [dbo].[DimSalesOffice]
WHERE NOT EXISTS
(
SELECT 1
FROM [changeLog].[DimSalesOffice] [Source]
WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
)
SET @UpdateRecordCount = @UpdateRecordCount + @@ROWCOUNT;
/*****INSERT*****/
INSERT INTO [dbo].[DimSalesOffice]
(
[Sales Office Code],
[CCN_Key],
[Sales Office Name (Short)],
[Sales Office Name (Long)],
[Sales Office City],
[Sales Office StateProvince],
[Sales Office Postal Code],
[Sales Office Country Code],
[Sales Office Address],
[Sales Office Address (Line 2)],
[Sales Office Name (Short - Native Language)],
[Sales Office Name (Long - Native Language)],
[Sales Office Address (Native Language)],
[Sales Office Address (Line 2 - Native Language)],
[Sales Office City (Native Language)],
[Sales Office StateProvince (Native Language)],
[Sales Office Region],
[Native Language Code],
[DW_Checksum],
[DW_Hashbytes],
[Source_ModifiedOn],
[DW_ModifiedOn],
[DW_IsDeleted?],
[DW_CreatedOn]
)
SELECT
[Source].[Sales Office Code],
[Source].[CCN_Key],
[Source].[Sales Office Name (Short)],
[Source].[Sales Office Name (Long)],
[Source].[Sales Office City],
[Source].[Sales Office StateProvince],
[Source].[Sales Office Postal Code],
[Source].[Sales Office Country Code],
[Source].[Sales Office Address],
[Source].[Sales Office Address (Line 2)],
[Source].[Sales Office Name (Short - Native Language)],
[Source].[Sales Office Name (Long - Native Language)],
[Source].[Sales Office Address (Native Language)],
[Source].[Sales Office Address (Line 2 - Native Language)],
[Source].[Sales Office City (Native Language)],
[Source].[Sales Office StateProvince (Native Language)],
[Source].[Sales Office Region],
[Source].[Native Language Code],
[DW_Checksum] =
CHECKSUM(
[Source].[CCN_Key],
[Source].[Sales Office Name (Short)],
[Source].[Sales Office Name (Long)],
[Source].[Sales Office City],
[Source].[Sales Office StateProvince],
[Source].[Sales Office Postal Code],
[Source].[Sales Office Country Code],
[Source].[Sales Office Address],
[Source].[Sales Office Address (Line 2)],
[Source].[Sales Office Name (Short - Native Language)],
[Source].[Sales Office Name (Long - Native Language)],
[Source].[Sales Office Address (Native Language)],
[Source].[Sales Office Address (Line 2 - Native Language)],
[Source].[Sales Office City (Native Language)],
[Source].[Sales Office StateProvince (Native Language)],
[Source].[Sales Office Region],
[Source].[Native Language Code],
0
),
[DW_Hashbytes] =
HASHBYTES(
'SHA2_256',
ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
+ ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
+ '0'
),
[Source_ModifiedOn] = NULL,
[DW_ModifiedOn] = GETUTCDATE(),
[DW_IsDeleted?] = 0,
[DW_CreatedOn] = GETUTCDATE()
FROM [changeLog].[DimSalesOffice] [Source]
WHERE NOT EXISTS
(
SELECT 1
FROM [dbo].[DimSalesOffice]
WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
)
SET @InsertRecordCount = @@ROWCOUNT;
SELECT [Update Record Count] = @UpdateRecordCount, [Insert Record Count] = @InsertRecordCount;
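The output length of HASHBYTES depends on the algorithm used. SHA2_256 produces 256 bits, or 32 bytes. This is in the documentation. Declaring the column as binary(32) is fine. The system I currently work on does exactly that.
Answering in reverse order, since it reads more naturally that way: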
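As I mentioned in the comments, SHA2_256 means the hash is output as 256 bits, which is 32 bytes (8 bits per byte), so the largest VARBINARY you need is VARBINARY(32). This is mentioned in the documentation for HASHBYTES(). So yes, you can safely reduce your VARBINARY column to VARBINARY(32).
Yes, once the VARBINARY column has been reduced in size, it can be added to an index on the table. I would recommend a nonclustered index that leads with the primary key field and has the HASHBYTES() computed field second in its definition. The reason to make it a nonclustered index is that you generally want to avoid hot columns in your indexes, especially in the clustered index, whose key is stored with every nonclustered index. A frequently updated hot column causes a lot of writes to the index, and in the case of the clustered index those writes also have to happen on every nonclustered index (since the clustered index key is stored along with it). A row hash over all the columns is definitely going to change often.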
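As a rough sketch of that recommendation against the question's sample table: the index name below is made up for illustration, and the question's join key [Sales Office Code] is assumed to play the role of the "primary key field", since that is the column the load queries join on. It also assumes every stored hash is already 32 bytes or shorter, otherwise the ALTER would fail with a truncation error.

```sql
-- Sanity check: SHA2_256 yields 32 bytes regardless of the input length.
SELECT DATALENGTH(HASHBYTES('SHA2_256', N'any input')) AS HashLengthInBytes; -- 32

-- Narrow the column. A VARBINARY(MAX) column cannot be an index key column,
-- so this has to happen before the index can be created.
ALTER TABLE [dbo].[DimSalesOffice]
    ALTER COLUMN [DW_Hashbytes] VARBINARY(32) NULL;
GO

-- Hypothetical index name. The join key leads and the row hash follows.
CREATE NONCLUSTERED INDEX [IX_DimSalesOffice_Hashbytes]
    ON [dbo].[DimSalesOffice] ([Sales Office Code], [DW_Hashbytes]);
GO
```

Because every table in the question shares the same audit column names, the same two statements could presumably be generated per table from the configuration table that identifies each unique key.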
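Leading with the primary key field makes sense because you need to join on that field first to match up the same rows, and then join on the HASHBYTES() computed field to check whether they differ.
Yes, you can use a computed column (it doesn't even need to be persisted, though it can be) and index it, or you could create an indexed view; I've done both before. I would go for the computed column first, since it's a bit more flexible than an indexed view, couples the actual row hash to the row itself in the table, and leaves fewer objects to manage.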
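A minimal sketch of the computed-column option, under a few assumptions: the column name [DW_RowHash] and the index name are hypothetical, only three of the table's columns are hashed to keep the example short (the rest would follow the same pattern as in the question's script), and the session uses the SET options SQL Server requires for indexes on computed columns.

```sql
-- Hypothetical column: the row hash is derived from the row's own columns,
-- so the load process no longer has to write it explicitly.
ALTER TABLE [dbo].[DimSalesOffice]
    ADD [DW_RowHash] AS CAST(
        HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Sales Office Name (Long)] AS NVARCHAR(MAX)), '')
        ) AS BINARY(32)  -- fixed 32-byte result keeps the index key narrow
    ) PERSISTED;
GO

-- Hypothetical index name. Same shape as before: join key first, hash second.
CREATE NONCLUSTERED INDEX [IX_DimSalesOffice_RowHash]
    ON [dbo].[DimSalesOffice] ([Sales Office Code], [DW_RowHash]);
GO
```

PERSISTED is optional here, as noted above. Casting the result to BINARY(32) is a design choice in this sketch, since HASHBYTES is otherwise typed as a much wider VARBINARY.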