As part of an ETL process we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.

The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm, and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.

Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I tested by changing the number of concurrent MAXDOP 8 queries from 1 to 12. Testing with MAXDOP 1 showed the same scalability bottleneck.

As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:
- The function must be able to participate in parallel queries
- The function must be deterministic
- The function must take an NVARCHAR or VARBINARY input (all relevant columns concatenated together)
- The typical input size of the string will be 100 - 20000 characters; 20000 is not a max
- The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions
- The function must scale well on large servers (throughput per thread should not decrease significantly as the number of threads increases)
For Application Reasons™, assume that I cannot save off the hash for the reporting table. It's a CCI, which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).

What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR, so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case:
DROP TABLE IF EXISTS #CHANGED_IDS;
SELECT stg.ID INTO #CHANGED_IDS
FROM (
SELECT ID,
CAST( HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)))
AS BINARY(32)) HASH1
FROM HB_TBL WITH (TABLOCK)
) stg
INNER JOIN (
SELECT ID,
CAST(HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)) )
AS BINARY(32)) HASH1
FROM HB_TBL_2 WITH (TABLOCK)
) rpt ON rpt.ID = stg.ID
WHERE rpt.HASH1 <> stg.HASH1
OPTION (MAXDOP 8);
To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post the results with HASHBYTES on Monday:
CREATE TABLE dbo.HASH_ME (
ID BIGINT NOT NULL,
FK1 BIGINT NOT NULL,
FK2 BIGINT NOT NULL,
FK3 BIGINT NOT NULL,
FK4 BIGINT NOT NULL,
FK5 BIGINT NOT NULL,
FK6 BIGINT NOT NULL,
FK7 BIGINT NOT NULL,
FK8 BIGINT NOT NULL,
FK9 BIGINT NOT NULL,
FK10 BIGINT NOT NULL,
FK11 BIGINT NOT NULL,
FK12 BIGINT NOT NULL,
FK13 BIGINT NOT NULL,
FK14 BIGINT NOT NULL,
FK15 BIGINT NOT NULL,
STR1 NVARCHAR(500) NOT NULL,
STR2 NVARCHAR(500) NOT NULL,
STR3 NVARCHAR(500) NOT NULL,
STR4 NVARCHAR(500) NOT NULL,
STR5 NVARCHAR(2000) NOT NULL,
COMP1 TINYINT NOT NULL,
COMP2 TINYINT NOT NULL,
COMP3 TINYINT NOT NULL,
COMP4 TINYINT NOT NULL,
COMP5 TINYINT NOT NULL
);
INSERT INTO dbo.HASH_ME WITH (TABLOCK)
SELECT RN,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 1000),
0,1,0,1,0
FROM (
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);
SELECT MAX(HASHBYTES('SHA2_256',
CAST(N'' AS NVARCHAR(MAX)) + N'|' +
CAST(FK1 AS NVARCHAR(19)) + N'|' +
CAST(FK2 AS NVARCHAR(19)) + N'|' +
CAST(FK3 AS NVARCHAR(19)) + N'|' +
CAST(FK4 AS NVARCHAR(19)) + N'|' +
CAST(FK5 AS NVARCHAR(19)) + N'|' +
CAST(FK6 AS NVARCHAR(19)) + N'|' +
CAST(FK7 AS NVARCHAR(19)) + N'|' +
CAST(FK8 AS NVARCHAR(19)) + N'|' +
CAST(FK9 AS NVARCHAR(19)) + N'|' +
CAST(FK10 AS NVARCHAR(19)) + N'|' +
CAST(FK11 AS NVARCHAR(19)) + N'|' +
CAST(FK12 AS NVARCHAR(19)) + N'|' +
CAST(FK13 AS NVARCHAR(19)) + N'|' +
CAST(FK14 AS NVARCHAR(19)) + N'|' +
CAST(FK15 AS NVARCHAR(19)) + N'|' +
CAST(STR1 AS NVARCHAR(500)) + N'|' +
CAST(STR2 AS NVARCHAR(500)) + N'|' +
CAST(STR3 AS NVARCHAR(500)) + N'|' +
CAST(STR4 AS NVARCHAR(500)) + N'|' +
CAST(STR5 AS NVARCHAR(2000)) + N'|' +
CAST(COMP1 AS NVARCHAR(1)) + N'|' +
CAST(COMP2 AS NVARCHAR(1)) + N'|' +
CAST(COMP3 AS NVARCHAR(1)) + N'|' +
CAST(COMP4 AS NVARCHAR(1)) + N'|' +
CAST(COMP5 AS NVARCHAR(1)) )
)
FROM dbo.HASH_ME
OPTION (MAXDOP 1);
Since you are only looking for changes, you do not need a cryptographic hash function.

You could choose from one of the faster non-cryptographic hashes in the open-source Data.HashFunction library by Brandon Dahler, licensed under the permissive and OSI approved MIT license. SpookyHash is a popular choice.

Example implementation

Source code
The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version. The non-LOB version should be faster.

You might be able to wrap a LOB binary in COMPRESS to get it under the 8000 byte limit, if that is worthwhile for performance. Alternatively, you could break the LOB up into sub-8000-byte segments, or reserve HASHBYTES for the LOB case (since longer inputs scale better).
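As a rough illustration of the COMPRESS idea only (dbo.SpookyHash is a placeholder for whatever the non-LOB function ends up being named, and this assumes the compressed value actually fits in 8000 bytes):

-- Hedged sketch: COMPRESS returns varbinary(max), so the explicit CONVERT keeps the
-- parameter non-LOB; if the compressed value exceeded 8000 bytes it would be truncated.
SELECT dbo.SpookyHash(
           CONVERT(varbinary(8000),
               COMPRESS(
                   CAST(FK1 AS binary(8)) +
                   CAST(STR1 AS varbinary(1000)) +
                   CAST(STR5 AS varbinary(4000))))) AS HASH1
FROM dbo.HASH_ME;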
Pre-built code
You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:
https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300
T-SQL functions
Usage
An example use given the sample data in the question:
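The original code sample is not reproduced here; a minimal sketch of what such a call could look like, assuming the gist exposes a scalar function named dbo.SpookyHash that accepts varbinary(8000), would be:

-- Hedged sketch only: dbo.SpookyHash stands in for the non-LOB function created by the
-- linked assemblies; the concatenation mirrors the NVARCHAR approach from the question.
SELECT ID,
       dbo.SpookyHash(
           CONVERT(varbinary(8000),
               CAST(FK1 AS NVARCHAR(19)) + N'|' +
               CAST(FK2 AS NVARCHAR(19)) + N'|' +
               CAST(STR1 AS NVARCHAR(500)) + N'|' +
               CAST(COMP1 AS NVARCHAR(1)))) AS HASH1
FROM dbo.HASH_ME;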
When using the LOB version, the first parameter should be cast or converted to varbinary(max).

Execution plan
Safe Spooky
The Data.HashFunction library uses a number of CLR language features that are considered UNSAFE by SQL Server. It is possible to write a basic Spooky Hash compatible with SAFE status. An example I wrote based on Jon Hanna's SpookilySharp is below:

https://gist.github.com/SQLKiwi/7a5bb26b0bee56f6d28a1d26669ce8f2
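For reference, an assembly built from that source could be registered with the SAFE permission set along these lines (the assembly path, class, and method names below are placeholders, not the exact names used in the gist):

-- Hedged sketch of registering a SAFE SQLCLR hash function; adjust the path and the
-- EXTERNAL NAME three-part reference to match whatever the compiled source exposes.
CREATE ASSEMBLY SpookySafe
FROM 'C:\CLR\SpookySafe.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.SpookyHashSafe (@Input varbinary(8000))
RETURNS binary(16)          -- SpookyHash produces a 128-bit value
WITH RETURNS NULL ON NULL INPUT
AS EXTERNAL NAME SpookySafe.[UserDefinedFunctions].SpookyHashSafe;
GO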
I'm not sure if the parallelism will be any better / significantly better with SQLCLR. However, it is really easy to test, since there is a hash function named Util_HashBinary in the free version of the SQL# SQLCLR library (which I wrote). Supported algorithms are: MD5, SHA1, SHA256, SHA384, and SHA512.
It takes a VARBINARY(MAX) value as input, so you can either concatenate the string version of each field (as you are currently doing) and then convert to VARBINARY(MAX), or you can go directly to VARBINARY for each column and concatenate the converted values (this might be faster since you aren't dealing with strings, or the extra conversion from string to VARBINARY). Below is an example showing both of these options. It also shows the HASHBYTES function, so you can see that the values are the same between it and SQL#.Util_HashBinary.
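The original example is not included in this post; a minimal sketch of the idea, using two of the sample columns and relying on Util_HashBinary mirroring the HASHBYTES parameter order (algorithm first, then the value to hash), might look like this:

-- Hedged sketch only: compares HASHBYTES with [SQL#].[Util_HashBinary] over the same
-- input built two ways (NVARCHAR concatenation vs direct VARBINARY concatenation).
SELECT ID,
       -- Option 1: concatenate NVARCHAR versions, then convert the result to VARBINARY
       HASHBYTES(N'SHA2_256',
           CAST(FK1 AS NVARCHAR(19)) + N'|' + CAST(STR1 AS NVARCHAR(500))) AS [HashBytes_NVC],
       [SQL#].[Util_HashBinary](N'SHA256',
           CONVERT(VARBINARY(MAX),
               CAST(FK1 AS NVARCHAR(19)) + N'|' + CAST(STR1 AS NVARCHAR(500)))) AS [SQLsharp_NVC],
       -- Option 2: convert each column directly to VARBINARY and concatenate
       HASHBYTES(N'SHA2_256',
           CONVERT(VARBINARY(8), FK1) + 0x7C + CONVERT(VARBINARY(1000), STR1)) AS [HashBytes_VB],
       [SQL#].[Util_HashBinary](N'SHA256',
           CONVERT(VARBINARY(8), FK1) + 0x7C + CONVERT(VARBINARY(1000), STR1)) AS [SQLsharp_VB]
FROM dbo.HASH_ME;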
Do note that the hash results when concatenating the VARBINARY values do not match the hash results when concatenating the NVARCHAR values. This is because the binary form of the INT value "1" is 0x00000001, while the UTF-16LE (i.e. NVARCHAR) form of the INT value "1" (in binary form, since that is what a hashing function will operate on) is 0x3100. You can test something more comparable to the non-LOB Spooky using:
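The original snippet is not shown in this post; one hedged way to keep that comparison non-LOB is simply to cap the input at varbinary(8000) before the call, for example:

-- Hedged sketch only: explicitly limiting the input to varbinary(8000) so the value
-- passed in is non-LOB, closer to what the non-LOB Spooky function receives.
SELECT [SQL#].[Util_HashBinary](
           N'SHA256',
           CONVERT(VARBINARY(8000),
               CONVERT(VARBINARY(8), FK1) + 0x7C + CONVERT(VARBINARY(1000), STR1))) AS HASH1
FROM dbo.HASH_ME;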
Note: Util_HashBinary uses the managed SHA256 algorithm that is built into .NET, and should not be using the "bcrypt" library.
Beyond that aspect of the question, there are some additional thoughts that might help this process:
Additional Thought #1 (pre-calculate the hashes, at least some of them)
You mentioned a few things in the question: that the comparison happens as part of the ETL load, that you cannot save off the hash for the reporting table, and that the table is a CCI that does not support triggers or computed columns.

It sounds like the data in this reporting table is stable for a period of time, and is only modified by this ETL process. If nothing else modifies this table, then we really don't need a trigger or indexed view after all (I had originally thought that you might).

Since you can't modify the schema of the reporting table, would it at least be possible to create a related table to contain the pre-calculated hash (and the UTC time of when it was calculated)? This would allow you to have a pre-calculated value to compare against the next time around, leaving only the incoming value that requires having its hash calculated.
That would cut the number of calls to either HASHBYTES or SQL#.Util_HashBinary in half. You would simply join to this hash table during the import process. You would also create a separate stored procedure that simply refreshes this table's hashes: it updates the hash of any related row that has changed, and updates the timestamp for those modified rows. This proc can / should be executed at the end of any other process that updates this table. It can also be scheduled to run 30 to 60 minutes prior to this ETL starting (depending on how long it takes to execute, and when any of those other processes might run). It can even be executed manually if you ever suspect there might be rows that are out of sync.
It was then noted that there are a great many of these tables. That many tables does make it more difficult to have an extra table for each one to contain the current hash, but this isn't impossible since it could be scripted, as it would be a standard schema. The scripting would just need to account for the source table name and the discovery of the source table's PK column(s).
That being said, regardless of which hash algorithm ultimately proves to be the most scalable, I would still highly recommend at least finding a few of the tables (perhaps there are some that are MUCH larger than the rest of the 500) and setting up a related table to capture the current hashes, so the "current" values can be known prior to the ETL process. Even the fastest function can't out-perform never having to call it in the first place ;-).
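A minimal sketch of what such a related "current hash" table and the comparison join could look like, with all names and shapes invented here purely for illustration:

-- Hedged sketch: one pre-calculated hash table per reporting table, refreshed by a
-- separate proc whenever the reporting table is modified outside of this ETL.
CREATE TABLE dbo.HB_TBL_2_Hash (
    ID          BIGINT       NOT NULL PRIMARY KEY,
    RowHash     BINARY(32)   NOT NULL,
    HashedAtUtc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- During the import, join staging hashes to the pre-calculated reporting-side hashes
-- instead of re-hashing the reporting rows on every load.
SELECT stg.ID
FROM #STAGING_HASHES stg            -- staging IDs plus freshly calculated hashes
INNER JOIN dbo.HB_TBL_2_Hash rpt ON rpt.ID = stg.ID
WHERE rpt.RowHash <> stg.RowHash;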
Additional Thought #2 (VARBINARY instead of NVARCHAR)

Regardless of SQLCLR vs the built-in HASHBYTES, I would still recommend converting directly to VARBINARY, as that should be faster. Concatenating strings is just not terribly efficient. And that's in addition to converting the non-string values into strings in the first place, which requires extra effort (I assume the amount of effort varies by base type: DATETIME requiring more than BIGINT), whereas converting to VARBINARY simply gives you the underlying value (in most cases).

And, in fact, testing the same data set that the other tests used, and using HASHBYTES(N'SHA2_256',...), showed a 23.415% increase in total hashes calculated in one minute. And that increase was for doing nothing more than using VARBINARY instead of NVARCHAR! (please see the community wiki answer below for details)

Additional Thought #3 (be mindful of the input parameters)
Further testing showed that one area that impacts performance (at this volume of executions) is the input parameters: how many, and of what type(s).
The Util_HashBinary SQLCLR function that is currently in my SQL# library has two input parameters: one VARBINARY (the value to hash) and one NVARCHAR (the algorithm to use). This is due to my mirroring the signature of the HASHBYTES function. However, I found that if I removed the NVARCHAR parameter and created a function that only did SHA256, performance improved quite nicely. I assume that even switching the NVARCHAR parameter to INT would help, but I also assume that not having the extra INT parameter at all would be at least slightly faster.

Also, SqlBytes.Value might perform better than SqlBinary.Value. I created two new functions for this testing: Util_HashSHA256Binary and Util_HashSHA256Binary8k. These will be included in the next release of SQL# (no date set for that yet).
I also found that the testing methodology could be improved slightly, so I updated the test harness used in the community wiki answer below. Of note from that updated testing: CHECKSUM recorded over 9k collisions, which is 9% (yikes).

Additional Thought #4 (HASHBYTES + SQLCLR together?)
Depending on where the bottleneck is, it might even help to use a combination of the built-in HASHBYTES and a SQLCLR UDF to do the same hash. If the constraints on built-in functions are different / separate from SQLCLR operations, then this approach might be able to accomplish more concurrently than either HASHBYTES or SQLCLR individually. It's definitely worth testing.
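One hedged way such a split could be arranged (not from the original post; dbo.SpookyHash is a placeholder name, and the even/odd routing is just an example of a split that stays stable per ID, so each row is always hashed by the same function on both sides of the comparison):

-- Hedged sketch: route half of the rows through HASHBYTES and half through a SQLCLR
-- hash, using a deterministic split on ID applied identically to staging and reporting.
SELECT ID,
       CASE WHEN ID % 2 = 0
            THEN HASHBYTES('SHA2_256',
                     CONVERT(VARBINARY(8), FK1) + 0x7C + CONVERT(VARBINARY(1000), STR1))
            ELSE dbo.SpookyHash(
                     CONVERT(VARBINARY(8000),
                         CONVERT(VARBINARY(8), FK1) + 0x7C + CONVERT(VARBINARY(1000), STR1)))
       END AS HASH1
FROM dbo.HASH_ME;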
Additional Thought #5 (hashing object caching?)
The caching of the hash algorithm object suggested in David Browne's answer certainly seems interesting, so I tried it out and found the following two points of interest:

- For whatever reason, it does not seem to provide much of a performance improvement, if any. I could have done something incorrectly, of course.
- The ManagedThreadId value appears to be the same for all SQLCLR references in a particular query. I tested multiple references to the same function, as well as a reference to a different function, all three being given different input values and returning different (but expected) return values. For both test functions, the output was a string that included the ManagedThreadId as well as a string representation of the hash result. The ManagedThreadId value was the same for all UDF references in the query, and across all rows; but the hash result was the same for the same input string, and different for different input strings.

While I did not see any erroneous results in my testing, doesn't this increase the chances of a race condition? If the key of the dictionary is the same for all SQLCLR objects called in a particular query, then they would be sharing the same value or object stored for that key, right? The point being, even though it seemed to work here (to a degree; again, there did not seem to be much performance gain, but functionally nothing went wrong), that does not give me confidence that this approach will work in other scenarios.
This isn't a traditional answer, but I thought it would be helpful to post benchmarks of some of the techniques mentioned so far. I'm testing on a 96 core server with SQL Server 2017 CU9.
Many scalability problems are caused by concurrent threads contending over some global state. For example, consider classic PFS page contention. This can happen if too many worker threads need to modify the same page in memory. As code becomes more efficient it may request the latch faster. That increases contention. To put it simply, efficient code is more likely to lead to scalability issues because the global state is contended over more severely. Slow code is less likely to cause scalability issues because the global state isn't accessed as frequently.
HASHBYTES scalability is partially based on the length of the input string. My theory as to why this occurs is that access to some global state is needed when the HASHBYTES function is called. The easy piece of global state to observe is that a memory page needs to be allocated per call on some versions of SQL Server. The harder one to observe is that there's some kind of OS contention. As a result, if HASHBYTES is called by the code less frequently, then contention goes down. One way to reduce the rate of HASHBYTES calls is to increase the amount of hashing work needed per call. Hashing work is partially based on the length of the input string. To reproduce the scalability problem I saw in the application, I needed to change the demo data. A reasonable worst case scenario is a table with 21 BIGINT columns. The definition of the table is included in the code at the bottom. To reduce Local Factors™, I'm using concurrent MAXDOP 1 queries that operate on relatively small tables. My quick benchmark code is at the bottom.

Note that the functions return different hash lengths. MD5 and SpookyHash are both 128 bit hashes; SHA256 is a 256 bit hash.
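The benchmark code itself is only referenced by the SETUP / TEST headings at the bottom, so the following is no more than an illustrative sketch of the shape of that workload (table and column names are invented here; the real table has 21 BIGINT columns):

-- Hedged sketch of the worst case described above: all-BIGINT columns, hashed by many
-- concurrent sessions that each run a MAXDOP 1 query like this in a loop.
CREATE TABLE dbo.HASH_DEMO_BIGINT (
    ID  BIGINT NOT NULL,
    C01 BIGINT NOT NULL,
    C02 BIGINT NOT NULL,
    C03 BIGINT NOT NULL -- ...continuing up to C20 in the real definition
);

SELECT MAX(HASHBYTES('SHA2_256',
           CAST(C01 AS BINARY(8)) + CAST(C02 AS BINARY(8)) + CAST(C03 AS BINARY(8))))
FROM dbo.HASH_DEMO_BIGINT
OPTION (MAXDOP 1);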
RESULTS (NVARCHAR vs VARBINARY conversion and concatenation)

In order to see if converting to, and concatenating, VARBINARY is truly more efficient / performant than NVARCHAR, an NVARCHAR version of the RUN_HASHBYTES_SHA2_256 stored procedure was created from the same template (see "Step 5" in the BENCHMARKING CODE section below). The only differences are:

- the stored procedure name ends in _NVC
- BINARY(8) for the CAST function was changed to NVARCHAR(15)
- 0x7C was changed to N'|'

resulting in the NVARCHAR-style per-column fragment shown first below, instead of the original BINARY form shown second:
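These fragments are reconstructed from the listed differences, using FK1 as a representative column (the full generated procedures are not reproduced in this post):

-- NVARCHAR (_NVC) version: per-column pattern (reconstructed illustration)
CAST(FK1 AS NVARCHAR(15)) + N'|' +

-- original VARBINARY version: per-column pattern (reconstructed illustration)
CAST(FK1 AS BINARY(8)) + 0x7C +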
The table below contains the number of hashes performed in 1 minute. The tests were performed on a different server than was used for the other tests noted below.
Looking at just the averages, the benefit of switching to VARBINARY can be calculated as the difference between the two averages divided by the NVARCHAR average, which comes out to roughly the 23.4% improvement quoted in the answer above.
RESULTS (hash algorithms and implementations)
The table below contains the number of hashes performed in 1 minute. For example, using CHECKSUM with 84 concurrent queries resulted in over 2 billion hashes being performed before time ran out.

If you prefer to see the same numbers measured in terms of work per thread-second:
Some quick thoughts on all of the methods:

- CHECKSUM: very good scalability, as expected
- HASHBYTES: scalability issues include one memory allocation per call and a large amount of CPU spent in the OS
- Spooky: surprisingly good scalability
- Spooky LOB: the spinlock SOS_SELIST_SIZED_SLOCK spins out of control. I suspect this is a general issue with passing LOBs through CLR functions, but I'm not sure
- Util_HashBinary: looks like it gets hit by the same spinlock. I haven't looked into this so far because there's probably not a lot that I can do about it
- Util_HashBinary 8k: very surprising results, not sure what's going on here

Final results tested on a smaller server:
BENCHMARKING CODE
SETUP 1: Tables and Data
SETUP 2: Master Execution Proc
SETUP 3: Collision Detection Proc
SETUP 4: Cleanup (DROP All Test Procs)
SETUP 5: Generate Test Procs
TEST 1: Check For Collisions
TEST 2: Run Performance Tests
VALIDATION ISSUES TO RESOLVE
While focusing on the performance testing of a singular SQLCLR UDF, two issues that were discussed early on were not incorporated into the tests, but ideally should be investigated in order to determine which approach meets all of the requirements.
In a comment that has since been deleted, Paul White had mentioned:
So that is something to consider, and it clearly requires testing. If the SQLCLR options do not provide any benefit over the built-in HASHBYTES, then that adds weight to Solomon's suggestion of capturing existing hashes (for at least the largest tables) into related tables.

You can probably improve the performance, and perhaps the scalability, of all the .NET approaches by pooling and caching any objects created in the function call. E.g. for Paul White's code above:
SQL CLR discourages and tries to prevent the use of static/shared variables, but it will let you use shared variables if you mark them as readonly. Which, of course, is meaningless, as you can just assign a single instance of some mutable type, like ConcurrentDictionary.