Joe Obbish
Asked: 2019-02-04 08:57:07 +0800 CST

What is a scalable way to simulate HASHBYTES using a SQL CLR scalar function?

  • 772

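As part of our ETL process, we compare rows from staging against the reporting database to figure out whether any of the columns have actually changed since the data was last loaded.

The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.

When testing on a 96 core server, throughput measured in hashes per second does not increase past 16 concurrent threads. I test by changing the number of concurrent MAXDOP 8 queries from 1 to 12. Testing with MAXDOP 1 showed the same scalability bottleneck.

As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements: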

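  • The function must be able to participate in parallel queries
  • The function must be deterministic
  • The function must take an NVARCHAR or VARBINARY string as input (all relevant columns are concatenated together)
  • The typical input size of the string will be 100 - 20000 characters. 20000 is not a maximum
  • The chance of a hash collision should be roughly equal to or better than that of the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.
  • The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)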

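For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI, which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).

What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR, so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case: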

DROP TABLE IF EXISTS #CHANGED_IDS;

SELECT stg.ID INTO #CHANGED_IDS
FROM (
    SELECT ID,
    CAST( HASHBYTES ('SHA2_256', 
        CAST(FK1 AS NVARCHAR(19)) + 
        CAST(FK2 AS NVARCHAR(19)) + 
        CAST(FK3 AS NVARCHAR(19)) + 
        CAST(FK4 AS NVARCHAR(19)) + 
        CAST(FK5 AS NVARCHAR(19)) + 
        CAST(FK6 AS NVARCHAR(19)) + 
        CAST(FK7 AS NVARCHAR(19)) + 
        CAST(FK8 AS NVARCHAR(19)) + 
        CAST(FK9 AS NVARCHAR(19)) + 
        CAST(FK10 AS NVARCHAR(19)) + 
        CAST(FK11 AS NVARCHAR(19)) + 
        CAST(FK12 AS NVARCHAR(19)) + 
        CAST(FK13 AS NVARCHAR(19)) + 
        CAST(FK14 AS NVARCHAR(19)) + 
        CAST(FK15 AS NVARCHAR(19)) + 
        CAST(STR1 AS NVARCHAR(500)) +
        CAST(STR2 AS NVARCHAR(500)) +
        CAST(STR3 AS NVARCHAR(500)) +
        CAST(STR4 AS NVARCHAR(500)) +
        CAST(STR5 AS NVARCHAR(500)) +
        CAST(COMP1 AS NVARCHAR(1)) + 
        CAST(COMP2 AS NVARCHAR(1)) + 
        CAST(COMP3 AS NVARCHAR(1)) + 
        CAST(COMP4 AS NVARCHAR(1)) + 
        CAST(COMP5 AS NVARCHAR(1)))
     AS BINARY(32)) HASH1
    FROM HB_TBL WITH (TABLOCK)
) stg
INNER JOIN (
    SELECT ID,
    CAST(HASHBYTES ('SHA2_256', 
        CAST(FK1 AS NVARCHAR(19)) + 
        CAST(FK2 AS NVARCHAR(19)) + 
        CAST(FK3 AS NVARCHAR(19)) + 
        CAST(FK4 AS NVARCHAR(19)) + 
        CAST(FK5 AS NVARCHAR(19)) + 
        CAST(FK6 AS NVARCHAR(19)) + 
        CAST(FK7 AS NVARCHAR(19)) + 
        CAST(FK8 AS NVARCHAR(19)) + 
        CAST(FK9 AS NVARCHAR(19)) + 
        CAST(FK10 AS NVARCHAR(19)) + 
        CAST(FK11 AS NVARCHAR(19)) + 
        CAST(FK12 AS NVARCHAR(19)) + 
        CAST(FK13 AS NVARCHAR(19)) + 
        CAST(FK14 AS NVARCHAR(19)) + 
        CAST(FK15 AS NVARCHAR(19)) + 
        CAST(STR1 AS NVARCHAR(500)) +
        CAST(STR2 AS NVARCHAR(500)) +
        CAST(STR3 AS NVARCHAR(500)) +
        CAST(STR4 AS NVARCHAR(500)) +
        CAST(STR5 AS NVARCHAR(500)) +
        CAST(COMP1 AS NVARCHAR(1)) + 
        CAST(COMP2 AS NVARCHAR(1)) + 
        CAST(COMP3 AS NVARCHAR(1)) + 
        CAST(COMP4 AS NVARCHAR(1)) + 
        CAST(COMP5 AS NVARCHAR(1)) )
 AS BINARY(32)) HASH1
    FROM HB_TBL_2 WITH (TABLOCK)
) rpt ON rpt.ID = stg.ID
WHERE rpt.HASH1 <> stg.HASH1
OPTION (MAXDOP 8);

To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post results with HASHBYTES on Monday:

CREATE TABLE dbo.HASH_ME (
    ID BIGINT NOT NULL,
    FK1 BIGINT NOT NULL,
    FK2 BIGINT NOT NULL,
    FK3 BIGINT NOT NULL,
    FK4 BIGINT NOT NULL,
    FK5 BIGINT NOT NULL,
    FK6 BIGINT NOT NULL,
    FK7 BIGINT NOT NULL,
    FK8 BIGINT NOT NULL,
    FK9 BIGINT NOT NULL,
    FK10 BIGINT NOT NULL,
    FK11 BIGINT NOT NULL,
    FK12 BIGINT NOT NULL,
    FK13 BIGINT NOT NULL,
    FK14 BIGINT NOT NULL,
    FK15 BIGINT NOT NULL,
    STR1 NVARCHAR(500) NOT NULL,
    STR2 NVARCHAR(500) NOT NULL,
    STR3 NVARCHAR(500) NOT NULL,
    STR4 NVARCHAR(500) NOT NULL,
    STR5 NVARCHAR(2000) NOT NULL,
    COMP1 TINYINT NOT NULL,
    COMP2 TINYINT NOT NULL,
    COMP3 TINYINT NOT NULL,
    COMP4 TINYINT NOT NULL,
    COMP5 TINYINT NOT NULL
);

INSERT INTO dbo.HASH_ME WITH (TABLOCK)
SELECT RN,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 1000),
0,1,0,1,0
FROM (
    SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

SELECT MAX(HASHBYTES('SHA2_256',
CAST(N'' AS NVARCHAR(MAX)) + N'|' +
CAST(FK1 AS NVARCHAR(19)) + N'|' +
CAST(FK2 AS NVARCHAR(19)) + N'|' +
CAST(FK3 AS NVARCHAR(19)) + N'|' +
CAST(FK4 AS NVARCHAR(19)) + N'|' +
CAST(FK5 AS NVARCHAR(19)) + N'|' +
CAST(FK6 AS NVARCHAR(19)) + N'|' +
CAST(FK7 AS NVARCHAR(19)) + N'|' +
CAST(FK8 AS NVARCHAR(19)) + N'|' +
CAST(FK9 AS NVARCHAR(19)) + N'|' +
CAST(FK10 AS NVARCHAR(19)) + N'|' +
CAST(FK11 AS NVARCHAR(19)) + N'|' +
CAST(FK12 AS NVARCHAR(19)) + N'|' +
CAST(FK13 AS NVARCHAR(19)) + N'|' +
CAST(FK14 AS NVARCHAR(19)) + N'|' +
CAST(FK15 AS NVARCHAR(19)) + N'|' +
CAST(STR1 AS NVARCHAR(500)) + N'|' +
CAST(STR2 AS NVARCHAR(500)) + N'|' +
CAST(STR3 AS NVARCHAR(500)) + N'|' +
CAST(STR4 AS NVARCHAR(500)) + N'|' +
CAST(STR5 AS NVARCHAR(2000)) + N'|' +
CAST(COMP1 AS NVARCHAR(1)) + N'|' +
CAST(COMP2 AS NVARCHAR(1)) + N'|' +
CAST(COMP3 AS NVARCHAR(1)) + N'|' +
CAST(COMP4 AS NVARCHAR(1)) + N'|' +
CAST(COMP5 AS NVARCHAR(1)) )
)
FROM dbo.HASH_ME
OPTION (MAXDOP 1);
sql-server sql-server-2016

4 Answers

  1. Best Answer
    Paul White
    2019-02-05T07:25:03+08:00

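    Since you're just looking for changes, you don't need a cryptographic hash function.

    You could choose one of the faster non-cryptographic hashes in the open-source Data.HashFunction library by Brandon Dahler, licensed under the permissive and OSI approved MIT license. SpookyHash is a popular choice.

    Example implementation

    Source code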

    using Microsoft.SqlServer.Server;
    using System.Data.HashFunction.SpookyHash;
    using System.Data.SqlTypes;
    
    public partial class UserDefinedFunctions
    {
        [SqlFunction
            (
                DataAccess = DataAccessKind.None,
                SystemDataAccess = SystemDataAccessKind.None,
                IsDeterministic = true,
                IsPrecise = true
            )
        ]
        public static byte[] SpookyHash
            (
                [SqlFacet (MaxSize = 8000)]
                SqlBinary Input
            )
        {
            ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
            return sh.ComputeHash(Input.Value).Hash;
        }
    
        [SqlFunction
            (
                DataAccess = DataAccessKind.None,
                IsDeterministic = true,
                IsPrecise = true,
                SystemDataAccess = SystemDataAccessKind.None
            )
        ]
        public static byte[] SpookyHashLOB
            (
                [SqlFacet (MaxSize = -1)]
                SqlBinary Input
            )
        {
            ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
            return sh.ComputeHash(Input.Value).Hash;
        }
    }
    

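    The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version. The non-LOB version should be faster.

    You may be able to wrap a LOB binary in COMPRESS to get it under the 8000 byte limit, if that turns out to be worthwhile for performance. Alternatively, you could break the LOB up into segments of less than 8000 bytes, or simply reserve HASHBYTES for the LOB case (since longer inputs scale better).

    Pre-built code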

    You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:

    https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300

    T-SQL functions

    CREATE FUNCTION dbo.SpookyHash
    (
        @Input varbinary(8000)
    )
    RETURNS binary(16)
    WITH 
        RETURNS NULL ON NULL INPUT, 
        EXECUTE AS OWNER
    AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
    GO
    CREATE FUNCTION dbo.SpookyHashLOB
    (
        @Input varbinary(max)
    )
    RETURNS binary(16)
    WITH 
        RETURNS NULL ON NULL INPUT, 
        EXECUTE AS OWNER
    AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
    GO
    

    Usage

    An example use given the sample data in the question:

    SELECT
        HT1.ID
    FROM dbo.HB_TBL AS HT1
    JOIN dbo.HB_TBL_2 AS HT2
        ON HT2.ID = HT1.ID
        AND dbo.SpookyHash
        (
            CONVERT(binary(8), HT2.FK1) + 0x7C +
            CONVERT(binary(8), HT2.FK2) + 0x7C +
            CONVERT(binary(8), HT2.FK3) + 0x7C +
            CONVERT(binary(8), HT2.FK4) + 0x7C +
            CONVERT(binary(8), HT2.FK5) + 0x7C +
            CONVERT(binary(8), HT2.FK6) + 0x7C +
            CONVERT(binary(8), HT2.FK7) + 0x7C +
            CONVERT(binary(8), HT2.FK8) + 0x7C +
            CONVERT(binary(8), HT2.FK9) + 0x7C +
            CONVERT(binary(8), HT2.FK10) + 0x7C +
            CONVERT(binary(8), HT2.FK11) + 0x7C +
            CONVERT(binary(8), HT2.FK12) + 0x7C +
            CONVERT(binary(8), HT2.FK13) + 0x7C +
            CONVERT(binary(8), HT2.FK14) + 0x7C +
            CONVERT(binary(8), HT2.FK15) + 0x7C +
            CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
            CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
            CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
            CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
            CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
            CONVERT(binary(1), HT2.COMP1) + 0x7C +
            CONVERT(binary(1), HT2.COMP2) + 0x7C +
            CONVERT(binary(1), HT2.COMP3) + 0x7C +
            CONVERT(binary(1), HT2.COMP4) + 0x7C +
            CONVERT(binary(1), HT2.COMP5)
        )
        <> dbo.SpookyHash
        (
            CONVERT(binary(8), HT1.FK1) + 0x7C +
            CONVERT(binary(8), HT1.FK2) + 0x7C +
            CONVERT(binary(8), HT1.FK3) + 0x7C +
            CONVERT(binary(8), HT1.FK4) + 0x7C +
            CONVERT(binary(8), HT1.FK5) + 0x7C +
            CONVERT(binary(8), HT1.FK6) + 0x7C +
            CONVERT(binary(8), HT1.FK7) + 0x7C +
            CONVERT(binary(8), HT1.FK8) + 0x7C +
            CONVERT(binary(8), HT1.FK9) + 0x7C +
            CONVERT(binary(8), HT1.FK10) + 0x7C +
            CONVERT(binary(8), HT1.FK11) + 0x7C +
            CONVERT(binary(8), HT1.FK12) + 0x7C +
            CONVERT(binary(8), HT1.FK13) + 0x7C +
            CONVERT(binary(8), HT1.FK14) + 0x7C +
            CONVERT(binary(8), HT1.FK15) + 0x7C +
            CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
            CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
            CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
            CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
            CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
            CONVERT(binary(1), HT1.COMP1) + 0x7C +
            CONVERT(binary(1), HT1.COMP2) + 0x7C +
            CONVERT(binary(1), HT1.COMP3) + 0x7C +
            CONVERT(binary(1), HT1.COMP4) + 0x7C +
            CONVERT(binary(1), HT1.COMP5)
        );
    

    When using the LOB version, the first parameter should be cast or converted to varbinary(max).
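
    A rough sketch of the COMPRESS idea mentioned earlier, combined with the LOB function. It assumes dbo.SpookyHash and dbo.SpookyHashLOB exist as created above and that COMPRESS is available (SQL Server 2016 or later). The variable names, the test value, and the routing rule (use the 8000-byte function only when the compressed value fits) are illustrative assumptions, not something that was benchmarked here.

    ```sql
    -- Hypothetical test value: 20,000 characters = 40,000 bytes as NVARCHAR.
    DECLARE @LongValue varbinary(max) =
        CONVERT(varbinary(max), REPLICATE(CONVERT(nvarchar(max), N'A'), 20000));

    -- COMPRESS returns varbinary(max) containing the GZip-compressed input.
    DECLARE @Compressed varbinary(max) = COMPRESS(@LongValue);

    SELECT
        DATALENGTH(@LongValue)  AS RawBytes,
        DATALENGTH(@Compressed) AS CompressedBytes,
        CASE
            -- Small enough after compression: use the non-LOB function.
            WHEN DATALENGTH(@Compressed) <= 8000
                THEN dbo.SpookyHash(CONVERT(varbinary(8000), @Compressed))
            -- Otherwise fall back to the LOB function rather than truncating.
            ELSE dbo.SpookyHashLOB(@Compressed)
        END AS HashValue;
    ```

    Because the route depends only on the value being hashed, identical inputs always take the same path, so hashes from the two tables stay comparable as long as both sides use the same expression.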

    Execution plan

    [Figure: execution plan]


    Safe Spooky

    The Data.HashFunction library uses a number of CLR language features that are considered UNSAFE by SQL Server. It is possible to write a basic Spooky Hash compatible with SAFE status. An example I wrote based on Jon Hanna's SpookilySharp is below:

    https://gist.github.com/SQLKiwi/7a5bb26b0bee56f6d28a1d26669ce8f2

    • 21
  2. Solomon Rutzky
    2019-02-04T12:41:07+08:00

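    I'm not sure whether parallelism will be any better, let alone significantly better, with SQLCLR. However, it is really easy to test, since the Free version of the SQL# SQLCLR library (which I wrote) contains a hash function called Util_HashBinary. Supported algorithms are: MD5, SHA1, SHA256, SHA384, and SHA512.

    It takes a VARBINARY(MAX) value as input, so you can either concatenate the string version of each field (as you are currently doing) and then convert to VARBINARY(MAX), or you can go directly to VARBINARY for each column and concatenate the converted values (this might be faster since you aren't dealing with strings or the extra conversion from string to VARBINARY). Below is an example showing both of these options. It also shows the HASHBYTES function so you can see that the values are the same between it and SQL#.Util_HashBinary.

    Please note that the hash results when concatenating the VARBINARY values won't match the hash results when concatenating the NVARCHAR values. This is because the binary form of the INT value "1" is 0x00000001, while the UTF-16LE (i.e. NVARCHAR) form of the INT value "1" (in binary form, since that is what a hashing function operates on) is 0x3100.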

    SELECT so.[object_id],
           SQL#.Util_HashBinary(N'SHA256',
                                CONVERT(VARBINARY(MAX),
                                        CONCAT(so.[name], so.[schema_id], so.[create_date])
                                       )
                               ) AS [SQLCLR-ConcatStrings],
           HASHBYTES(N'SHA2_256',
                     CONVERT(VARBINARY(MAX),
                             CONCAT(so.[name], so.[schema_id], so.[create_date])
                            )
                    ) AS [BuiltIn-ConcatStrings]
    FROM sys.objects so;
    
    
    SELECT so.[object_id],
           SQL#.Util_HashBinary(N'SHA256',
                                CONVERT(VARBINARY(500), so.[name]) + 
                                CONVERT(VARBINARY(500), so.[schema_id]) +
                                CONVERT(VARBINARY(500), so.[create_date])
                               ) AS [SQLCLR-ConcatVarBinaries],
           HASHBYTES(N'SHA2_256',
                     CONVERT(VARBINARY(500), so.[name]) + 
                     CONVERT(VARBINARY(500), so.[schema_id]) +
                     CONVERT(VARBINARY(500), so.[create_date])
                    ) AS [BuiltIn-ConcatVarBinaries]
    FROM sys.objects so;
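
    The byte-level difference between the two representations can be checked directly. This is a small standalone sketch using only built-in functions; the column aliases are made up for illustration.

    ```sql
    -- The same INT value, once as raw bytes and once as UTF-16LE text bytes.
    SELECT CONVERT(VARBINARY(4), 1)                         AS IntAsBinary,      -- 0x00000001
           CONVERT(VARBINARY(10), CONVERT(NVARCHAR(10), 1)) AS IntAsUtf16Bytes;  -- 0x3100

    -- Different input bytes, therefore different hashes.
    SELECT CASE
               WHEN HASHBYTES('SHA2_256', CONVERT(VARBINARY(4), 1))
                  = HASHBYTES('SHA2_256', CONVERT(NVARCHAR(10), 1))
               THEN 'same'
               ELSE 'different'
           END AS HashComparison;
    ```

    Whichever representation is chosen, both sides of the comparison (staging and reporting) have to build the input the same way.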
    

    You can test something more comparable to the non-LOB Spooky using:

    CREATE FUNCTION [SQL#].[Util_HashBinary8k]
    (@Algorithm [nvarchar](50), @BaseData [varbinary](8000))
    RETURNS [varbinary](8000) 
    WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
    AS EXTERNAL NAME [SQL#].[UTILITY].[HashBinary];
    

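    Note: Util_HashBinary uses the managed SHA256 algorithm that is built into .NET, and should not be using the "bcrypt" library.

    Beyond that aspect of the question, there are some additional thoughts that might help this process:

    Additional Thought #1 (pre-calculate hashes, at least some)

    You mentioned a few things:

    1. we compare rows from staging against the reporting database to figure out whether any of the columns have actually changed since the data was last loaded.

      and:

    2. I cannot save off the value of the hash for the reporting table. It's a CCI, which doesn't support triggers or computed columns

      and:

    3. the tables can be updated outside of the ETL process

    It sounds like the data in this reporting table is stable for a period of time, and is only modified by this ETL process.

    If nothing else modifies this table, then we really don't need a trigger or indexed view after all (I originally thought that you might).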

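    Since you can't modify the schema of the reporting table, would it at least be possible to create a related table to hold the pre-calculated hash (and the UTC time at which it was calculated)? This would give you a pre-calculated value to compare against next time, leaving only the incoming value that needs to be hashed. That would cut the number of calls to either HASHBYTES or SQL#.Util_HashBinary in half. You would simply join to this table of hashes during the import process.

    You would also create a separate stored procedure that simply refreshes the hashes in this table. It just updates the hash of any related row that has changed so that it is current, and updates the timestamp for those modified rows. This proc can / should be executed at the end of any other process that updates this table. It can also be scheduled to run 30 - 60 minutes before this ETL starts (depending on how long it takes to execute, and when any of those other processes might run). It can even be executed manually if you ever suspect that some rows are out of sync.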
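
    A minimal sketch of what such a related table and the import-time join might look like. The table name dbo.HB_TBL_2_HASHES and its column names are hypothetical, dbo.HB_TBL is the staging table from the question, and the hashed column list is shortened to three columns to keep the example readable.

    ```sql
    -- Hypothetical side table: one row per reporting row, keyed the same way.
    CREATE TABLE dbo.HB_TBL_2_HASHES
    (
        ID         BIGINT       NOT NULL PRIMARY KEY,
        ROW_HASH   BINARY(32)   NOT NULL,  -- SHA2_256 output is 32 bytes
        HASHED_UTC DATETIME2(3) NOT NULL   -- when the hash was calculated (UTC)
    );

    -- During the load, only the staged rows are hashed;
    -- the reporting side is read from the pre-calculated table.
    SELECT stg.ID
    FROM dbo.HB_TBL AS stg
    JOIN dbo.HB_TBL_2_HASHES AS h
        ON h.ID = stg.ID
    WHERE h.ROW_HASH <> HASHBYTES('SHA2_256',
            CAST(stg.FK1   AS BINARY(8)) + 0x7C +
            CAST(stg.FK2   AS BINARY(8)) + 0x7C +
            CAST(stg.COMP1 AS BINARY(1)));
    ```

    The refresh procedure described above would need to build ROW_HASH from exactly the same expression, otherwise every row would appear to have changed.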

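    It was then noted that:

    there are over 500 tables

    That many tables does make it more difficult to have an extra table for each one to hold the current hashes, but it is not impossible: since the schema would be standard, the work can be scripted. The script would just need to account for the source table name and discovery of the source table's PK column(s).

    Still, regardless of which hash algorithm ultimately proves to be the most scalable, I highly recommend finding at least a few tables (perhaps some are MUCH larger than the rest of the 500) and setting up a related table to capture current hashes, so that the "current" values are known before the ETL process runs. Even the fastest function can't out-perform never having to call it in the first place ;-).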

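    Additional Thought #2 (VARBINARY instead of NVARCHAR)

    Regardless of SQLCLR vs the built-in HASHBYTES, I would still recommend converting directly to VARBINARY, as that should be faster. Concatenating strings is just not terribly efficient. And that is on top of converting non-string values into strings in the first place, which requires extra effort (I assume the amount of effort varies by base type: DATETIME requiring more than BIGINT), whereas converting to VARBINARY simply gives you the underlying value (in most cases).

    And, in fact, testing the same dataset that the other tests used, with HASHBYTES(N'SHA2_256',...), showed a 23.415% increase in total hashes calculated in one minute. That increase came from nothing more than using VARBINARY instead of NVARCHAR! (Please see the community wiki answer for details.)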

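    Additional Thought #3 (be mindful of input parameters)

    Further testing showed that one area that impacts performance (at this volume of executions) is the input parameters: how many there are and of what type(s).

    The Util_HashBinary SQLCLR function currently in my SQL# library has two input parameters: one VARBINARY (the value to hash) and one NVARCHAR (the algorithm to use). This is because I mirrored the signature of the HASHBYTES function. However, I found that if I removed the NVARCHAR parameter and created a function that only does SHA256, performance improved quite nicely. I assume that even switching the NVARCHAR parameter to INT would have helped, but I also assume that not having the extra INT parameter at all is at least slightly faster.

    Also, SqlBytes.Value might perform better than SqlBinary.Value.

    I created two new functions for this testing: Util_HashSHA256Binary and Util_HashSHA256Binary8k. These will be included in the next release of SQL# (no date set for that yet).

    I also found that the testing methodology could be slightly improved, so I updated the test harness in the community wiki answer below to include:

    1. Pre-loading of the SQLCLR assemblies to ensure that load time overhead doesn't skew the results.
    2. A verification procedure to check for collisions. If any are found, it displays the number of unique / distinct rows and the total number of rows. This allows one to determine whether the number of collisions (if there are any) is beyond the limit for the given use case. Some use cases might allow a small number of collisions, others might require none. A super-fast function is useless if it can't detect changes to the desired level of accuracy. For example, using the test harness provided by the O.P., I increased the row count to 100k rows (it was originally 10k) and found that CHECKSUM registered over 9k collisions, which is 9% (yikes).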

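    Additional Thought #4 (HASHBYTES + SQLCLR together?)

    Depending on where the bottleneck is, it might even help to use a combination of the built-in HASHBYTES and a SQLCLR UDF to do the same hash. If built-in functions are constrained differently / separately from SQLCLR operations, then this approach might be able to accomplish more concurrently than either HASHBYTES or SQLCLR alone. It's definitely worth testing.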
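
    One way such a combination could be tried is sketched below. It is untested and rests on several assumptions: SQL#.Util_HashBinary is installed and returns the same SHA-256 bytes as HASHBYTES (as the comparison queries earlier in this answer indicate), dbo.HB_TBL is the staging table from the question, the split on ID parity is an arbitrary illustrative choice, and the column list is shortened to three columns.

    ```sql
    -- Route roughly half of the rows to each implementation of the same hash.
    SELECT stg.ID,
           CASE
               WHEN stg.ID % 2 = 0
                   THEN HASHBYTES('SHA2_256', p.Payload)            -- built-in
               ELSE SQL#.Util_HashBinary(N'SHA256', p.Payload)      -- SQLCLR
           END AS HASH1
    FROM dbo.HB_TBL AS stg
    CROSS APPLY
    (
        -- Build the binary input once so both branches hash identical bytes.
        SELECT CAST(stg.FK1   AS BINARY(8)) + 0x7C +
               CAST(stg.FK2   AS BINARY(8)) + 0x7C +
               CAST(stg.COMP1 AS BINARY(1)) AS Payload
    ) AS p;
    ```

    Whether this actually relieves the contention would have to be measured with the same benchmark used for the other functions.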

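    Additional Thought #5 (hashing object caching?)

    Caching the hashing algorithm object, as suggested in David Browne's answer, certainly seems interesting, so I tried it and found the following two points of interest:

    1. For whatever reason, it does not seem to provide much, if any, performance improvement. I could have done something incorrectly, but here is what I tried: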

      static readonly ConcurrentDictionary<int, SHA256Managed> hashers =
          new ConcurrentDictionary<int, SHA256Managed>();
      
      [return: SqlFacet(MaxSize = 100)]
      [SqlFunction(IsDeterministic = true)]
      public static SqlBinary FastHash([SqlFacet(MaxSize = 1000)] SqlBytes Input)
      {
          SHA256Managed sh = hashers.GetOrAdd(Thread.CurrentThread.ManagedThreadId,
                                              i => new SHA256Managed());
      
          return sh.ComputeHash(Input.Value);
      }
      
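    2. The ManagedThreadId value appears to be the same for all SQLCLR references in a particular query. I tested multiple references to the same function, as well as a reference to a different function, all 3 being given different input values and returning different (but expected) return values. For both test functions, the output was a string that included the ManagedThreadId as well as a string representation of the hash result. The ManagedThreadId value was the same for all UDF references in the query, and across all rows. But the hash result was the same for the same input string and different for different input strings.

      While I didn't see any erroneous results in my testing, wouldn't this increase the chances of a race condition? If the key of the dictionary is the same for all SQLCLR objects called in a particular query, then they would be sharing the same value or object stored for that key, right? The point being, even though it seemed to work here (to a degree; again, there did not seem to be much performance gain, but functionally nothing broke), that doesn't give me confidence that this approach will work in other scenarios.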

    • 17
  3. Joe Obbish
    2019-02-06T11:25:19+08:00

    This isn't a traditional answer, but I thought it would be helpful to post benchmarks of some of the techniques mentioned so far. I'm testing on a 96 core server with SQL Server 2017 CU9.

    Many scalability problems are caused by concurrent threads contending over some global state. For example, consider classic PFS page contention. This can happen if too many worker threads need to modify the same page in memory. As code becomes more efficient it may request the latch faster. That increases contention. To put it simply, efficient code is more likely to lead to scalability issues because the global state is contended over more severely. Slow code is less likely to cause scalability issues because the global state isn't accessed as frequently.

    HASHBYTES scalability is partially based on the length of the input string. My theory as to why this occurs is that access to some global state is needed when the HASHBYTES function is called. The easy global state to observe is that a memory page needs to be allocated per call on some versions of SQL Server. The harder one to observe is that there's some kind of OS contention. As a result, if HASHBYTES is called by the code less frequently then contention goes down. One way to reduce the rate of HASHBYTES calls is to increase the amount of hashing work needed per call. Hashing work is partially based on the length of the input string. To reproduce the scalability problem I saw in the application I needed to change the demo data. A reasonable worst case scenario is a table with 21 BIGINT columns. The definition of the table is included in the code at the bottom. To reduce Local Factors™, I'm using concurrent MAXDOP 1 queries that operate on relatively small tables. My quick benchmark code is at the bottom.

    Note the functions return different hash lengths. MD5 and SpookyHash are both 128 bit hashes, SHA256 is a 256 bit hash.

    RESULTS (NVARCHAR vs VARBINARY conversion and concatenation)

    In order to see if converting to, and concatenating, VARBINARY is truly more efficient / performant than NVARCHAR, an NVARCHAR version of the RUN_HASHBYTES_SHA2_256 stored procedure was created from the same template (see "Step 5" in BENCHMARKING CODE section below). The only differences are:

    1. Stored Procedure name ends in _NVC
    2. BINARY(8) for the CAST function was changed to be NVARCHAR(15)
    3. 0x7C was changed to be N'|'

    Resulting in:

    CAST(FK1 AS NVARCHAR(15)) + N'|' +
    

    instead of:

    CAST(FK1 AS BINARY(8)) + 0x7C +
    

    The table below contains the number of hashes performed in 1 minute. The tests were performed on a different server than was used for the other tests noted below.

    ╔════════════════╦══════════╦══════════════╗
    ║    Datatype    ║  Test #  ║ Total Hashes ║
    ╠════════════════╬══════════╬══════════════╣
    ║ NVARCHAR       ║        1 ║     10200000 ║
    ║ NVARCHAR       ║        2 ║     10300000 ║
    ║ NVARCHAR       ║  AVERAGE ║ * 10250000 * ║
    ║ -------------- ║ -------- ║ ------------ ║
    ║ VARBINARY      ║        1 ║     12500000 ║
    ║ VARBINARY      ║        2 ║     12800000 ║
    ║ VARBINARY      ║  AVERAGE ║ * 12650000 * ║
    ╚════════════════╩══════════╩══════════════╝
    

    Looking at just the averages, we can calculate the benefit of switching to VARBINARY:

    SELECT (12650000.0 - 10250000) AS [IncreaseAmount],
           ROUND(((12650000.0 - 10250000) / 10250000) * 100.0, 3) AS [IncreasePercentage]
    

    That returns:

    IncreaseAmount:    2400000.0
    IncreasePercentage:   23.415
    

    RESULTS (hash algorithms and implementations)

    The table below contains the number of hashes performed in 1 minute. For example, using CHECKSUM with 84 concurrent queries resulted in over 2 billion hashes being performed before time ran out.

    ╔════════════════════╦════════════╦════════════╦════════════╗
    ║      Function      ║ 12 threads ║ 48 threads ║ 84 threads ║
    ╠════════════════════╬════════════╬════════════╬════════════╣
    ║ CHECKSUM           ║  281250000 ║ 1122440000 ║ 2040100000 ║
    ║ HASHBYTES MD5      ║   75940000 ║  106190000 ║  112750000 ║
    ║ HASHBYTES SHA2_256 ║   80210000 ║  117080000 ║  124790000 ║
    ║ CLR Spooky         ║  131250000 ║  505700000 ║  786150000 ║
    ║ CLR SpookyLOB      ║   17420000 ║   27160000 ║   31380000 ║
    ║ SQL# MD5           ║   17080000 ║   26450000 ║   29080000 ║
    ║ SQL# SHA2_256      ║   18370000 ║   28860000 ║   32590000 ║
    ║ SQL# MD5 8k        ║   24440000 ║   30560000 ║   32550000 ║
    ║ SQL# SHA2_256 8k   ║   87240000 ║  159310000 ║  155760000 ║
    ╚════════════════════╩════════════╩════════════╩════════════╝
    

    If you prefer to see the same numbers measured in terms of work per thread-second (total hashes divided by the thread count and by the 60-second run time):

    ╔════════════════════╦════════════════════════════╦════════════════════════════╦════════════════════════════╗
    ║      Function      ║ 12 threads per core-second ║ 48 threads per core-second ║ 84 threads per core-second ║
    ╠════════════════════╬════════════════════════════╬════════════════════════════╬════════════════════════════╣
    ║ CHECKSUM           ║                     390625 ║                     389736 ║                     404782 ║
    ║ HASHBYTES MD5      ║                     105472 ║                      36872 ║                      22371 ║
    ║ HASHBYTES SHA2_256 ║                     111403 ║                      40653 ║                      24760 ║
    ║ CLR Spooky         ║                     182292 ║                     175590 ║                     155982 ║
    ║ CLR SpookyLOB      ║                      24194 ║                       9431 ║                       6226 ║
    ║ SQL# MD5           ║                      23722 ║                       9184 ║                       5770 ║
    ║ SQL# SHA2_256      ║                      25514 ║                      10021 ║                       6466 ║
    ║ SQL# MD5 8k        ║                      33944 ║                      10611 ║                       6458 ║
    ║ SQL# SHA2_256 8k   ║                     121167 ║                      55316 ║                      30905 ║
    ╚════════════════════╩════════════════════════════╩════════════════════════════╩════════════════════════════╝
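
    The per-thread figures can be reproduced from the totals in the first table. A small sketch, assuming each run lasted exactly 60 seconds; the three sample rows are copied from the table above, and the derived-table and column names are made up for illustration.

    ```sql
    -- Hashes per thread-second = total hashes / (threads * 60 seconds).
    SELECT v.FunctionName,
           v.Threads,
           v.TotalHashes,
           CAST(ROUND(v.TotalHashes / (v.Threads * 60.0), 0) AS BIGINT)
               AS HashesPerThreadSecond
    FROM (VALUES
            ('CHECKSUM',           12, 281250000),  -- expect 390625
            ('HASHBYTES SHA2_256', 48, 117080000),  -- expect 40653
            ('CLR Spooky',         84, 786150000)   -- expect 155982
         ) AS v (FunctionName, Threads, TotalHashes);
    ```

    A function that scaled perfectly would show the same value in every column of the per-thread table; the drop from 12 to 84 threads is the scalability loss.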
    

    Some quick thoughts on all of the methods:

    • CHECKSUM: very good scalability as expected
    • HASHBYTES: scalability issues include one memory allocation per call and a large amount of CPU spent in the OS
    • Spooky: surprisingly good scalability
    • Spooky LOB: the spinlock SOS_SELIST_SIZED_SLOCK spins out of control. I suspect this is a general issue with passing LOBs through CLR functions, but I'm not sure
    • Util_HashBinary: looks like it gets hit by the same spinlock. I haven't looked into this so far because there's probably not a lot that I can do about it:

    [Figure: "spin your lock" screenshot]

    • Util_HashBinary 8k: very surprising results, not sure what's going on here

    Final results tested on a smaller server:

    ╔═════════════════════════╦════════════════════════╦════════════════════════╗
    ║     Hash Algorithm      ║ Hashes over 11 threads ║ Hashes over 44 threads ║
    ╠═════════════════════════╬════════════════════════╬════════════════════════╣
    ║ HASHBYTES SHA2_256      ║               85220000 ║              167050000 ║
    ║ SpookyHash              ║              101200000 ║              239530000 ║
    ║ Util_HashSHA256Binary8k ║               90590000 ║              217170000 ║
    ║ SpookyHashLOB           ║               23490000 ║               38370000 ║
    ║ Util_HashSHA256Binary   ║               23430000 ║               36590000 ║
    ╚═════════════════════════╩════════════════════════╩════════════════════════╝
    

    BENCHMARKING CODE

    SETUP 1: Tables and Data

    DROP TABLE IF EXISTS dbo.HASH_SMALL;
    
    CREATE TABLE dbo.HASH_SMALL (
        ID BIGINT NOT NULL,
        FK1 BIGINT NOT NULL,
        FK2 BIGINT NOT NULL,
        FK3 BIGINT NOT NULL,
        FK4 BIGINT NOT NULL,
        FK5 BIGINT NOT NULL,
        FK6 BIGINT NOT NULL,
        FK7 BIGINT NOT NULL,
        FK8 BIGINT NOT NULL,
        FK9 BIGINT NOT NULL,
        FK10 BIGINT NOT NULL,
        FK11 BIGINT NOT NULL,
        FK12 BIGINT NOT NULL,
        FK13 BIGINT NOT NULL,
        FK14 BIGINT NOT NULL,
        FK15 BIGINT NOT NULL,
        FK16 BIGINT NOT NULL,
        FK17 BIGINT NOT NULL,
        FK18 BIGINT NOT NULL,
        FK19 BIGINT NOT NULL,
        FK20 BIGINT NOT NULL
    );
    
    INSERT INTO dbo.HASH_SMALL WITH (TABLOCK)
    SELECT RN,
    4000000 - RN, 4000000 - RN
    ,200000000 - RN, 200000000 - RN
    , RN % 500000 , RN % 500000 , RN % 500000
    , RN % 500000 , RN % 500000 , RN % 500000 
    , 100000 - RN % 100000, RN % 100000
    , 100000 - RN % 100000, RN % 100000
    , 100000 - RN % 100000, RN % 100000
    , 100000 - RN % 100000, RN % 100000
    , 100000 - RN % 100000, RN % 100000
    FROM (
        SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
        FROM master..spt_values t1
        CROSS JOIN master..spt_values t2
    ) q
    OPTION (MAXDOP 1);
    
    
    DROP TABLE IF EXISTS dbo.LOG_HASHES;
    CREATE TABLE dbo.LOG_HASHES (
    LOG_TIME DATETIME,
    HASH_ALGORITHM INT,
    SESSION_ID INT,
    NUM_HASHES BIGINT
    );
    

    SETUP 2: Master Execution Proc

    GO
    CREATE OR ALTER PROCEDURE dbo.RUN_HASHES_FOR_ONE_MINUTE (@HashAlgorithm INT)
    AS
    BEGIN
    DECLARE @target_end_time DATETIME = DATEADD(MINUTE, 1, GETDATE()),
            @query_execution_count INT = 0;
    
    SET NOCOUNT ON;
    
    DECLARE @ProcName NVARCHAR(261); -- schema_name + proc_name + '[].[]'
    
    DECLARE @RowCount INT;
    SELECT @RowCount = SUM(prtn.[row_count])
    FROM   sys.dm_db_partition_stats prtn
    WHERE  prtn.[object_id] = OBJECT_ID(N'dbo.HASH_SMALL')
    AND    prtn.[index_id] < 2;
    
    
    -- Load assembly if not loaded to prevent load time from skewing results
    DECLARE @OptionalInitSQL NVARCHAR(MAX);
    SET @OptionalInitSQL = CASE @HashAlgorithm
           WHEN 1 THEN N'SELECT @Dummy = dbo.SpookyHash(0x1234);'
           WHEN 2 THEN N'' -- HASHBYTES
           WHEN 3 THEN N'' -- HASHBYTES
           WHEN 4 THEN N'' -- CHECKSUM
           WHEN 5 THEN N'SELECT @Dummy = dbo.SpookyHashLOB(0x1234);'
           WHEN 6 THEN N'SELECT @Dummy = SQL#.Util_HashBinary(N''MD5'', 0x1234);'
           WHEN 7 THEN N'SELECT @Dummy = SQL#.Util_HashBinary(N''SHA256'', 0x1234);'
           WHEN 8 THEN N'SELECT @Dummy = SQL#.Util_HashBinary8k(N''MD5'', 0x1234);'
           WHEN 9 THEN N'SELECT @Dummy = SQL#.Util_HashBinary8k(N''SHA256'', 0x1234);'
    /* -- BETA / non-public code
           WHEN 10 THEN N'SELECT @Dummy = SQL#.Util_HashSHA256Binary8k(0x1234);'
           WHEN 11 THEN N'SELECT @Dummy = SQL#.Util_HashSHA256Binary(0x1234);'
    */
       END;
    
    
    IF (RTRIM(@OptionalInitSQL) <> N'')
    BEGIN
        SET @OptionalInitSQL = N'
    SET NOCOUNT ON;
    DECLARE @Dummy VARBINARY(100);
    ' + @OptionalInitSQL;
    
        RAISERROR(N'** Executing optional initialization code:', 10, 1) WITH NOWAIT;
        RAISERROR(@OptionalInitSQL, 10, 1) WITH NOWAIT;
        EXEC (@OptionalInitSQL);
        RAISERROR(N'-------------------------------------------', 10, 1) WITH NOWAIT;
    END;
    
    
    SET @ProcName = CASE @HashAlgorithm
                        WHEN 1 THEN N'dbo.RUN_SpookyHash'
                        WHEN 2 THEN N'dbo.RUN_HASHBYTES_MD5'
                        WHEN 3 THEN N'dbo.RUN_HASHBYTES_SHA2_256'
                        WHEN 4 THEN N'dbo.RUN_CHECKSUM'
                        WHEN 5 THEN N'dbo.RUN_SpookyHashLOB'
                        WHEN 6 THEN N'dbo.RUN_SR_MD5'
                        WHEN 7 THEN N'dbo.RUN_SR_SHA256'
                        WHEN 8 THEN N'dbo.RUN_SR_MD5_8k'
                        WHEN 9 THEN N'dbo.RUN_SR_SHA256_8k'
    /* -- BETA / non-public code
                        WHEN 10 THEN N'dbo.RUN_SR_SHA256_new'
                        WHEN 11 THEN N'dbo.RUN_SR_SHA256LOB_new'
    */
                        WHEN 13 THEN N'dbo.RUN_HASHBYTES_SHA2_256_NVC'
                    END;
    
    RAISERROR(N'** Executing proc: %s', 10, 1, @ProcName) WITH NOWAIT;
    
    WHILE GETDATE() < @target_end_time
    BEGIN
        EXEC @ProcName;
    
        SET @query_execution_count = @query_execution_count + 1;
    END;
    
    INSERT INTO dbo.LOG_HASHES
    VALUES (GETDATE(), @HashAlgorithm, @@SPID, @RowCount * @query_execution_count);
    
    END;
    GO
    

    SETUP 3: Collision Detection Proc

    GO
    CREATE OR ALTER PROCEDURE dbo.VERIFY_NO_COLLISIONS (@HashAlgorithm INT)
    AS
    SET NOCOUNT ON;
    
    DECLARE @RowCount INT;
    SELECT @RowCount = SUM(prtn.[row_count])
    FROM   sys.dm_db_partition_stats prtn
    WHERE  prtn.[object_id] = OBJECT_ID(N'dbo.HASH_SMALL')
    AND    prtn.[index_id] < 2;
    
    
    DECLARE @CollisionTestRows INT;
    DECLARE @CollisionTestSQL NVARCHAR(MAX);
    SET @CollisionTestSQL = N'
    SELECT @RowsOut = COUNT(DISTINCT '
    + CASE @HashAlgorithm
           WHEN 1 THEN N'dbo.SpookyHash('
           WHEN 2 THEN N'HASHBYTES(''MD5'','
           WHEN 3 THEN N'HASHBYTES(''SHA2_256'','
           WHEN 4 THEN N'CHECKSUM('
           WHEN 5 THEN N'dbo.SpookyHashLOB('
           WHEN 6 THEN N'SQL#.Util_HashBinary(N''MD5'','
           WHEN 7 THEN N'SQL#.Util_HashBinary(N''SHA256'','
           WHEN 8 THEN N'SQL#.[Util_HashBinary8k](N''MD5'','
           WHEN 9 THEN N'SQL#.[Util_HashBinary8k](N''SHA256'','
    --/* -- BETA / non-public code
           WHEN 10 THEN N'SQL#.[Util_HashSHA256Binary8k]('
           WHEN 11 THEN N'SQL#.[Util_HashSHA256Binary]('
    --*/
       END
    + N'
        CAST(FK1 AS BINARY(8)) + 0x7C +
        CAST(FK2 AS BINARY(8)) + 0x7C +
        CAST(FK3 AS BINARY(8)) + 0x7C +
        CAST(FK4 AS BINARY(8)) + 0x7C +
        CAST(FK5 AS BINARY(8)) + 0x7C +
        CAST(FK6 AS BINARY(8)) + 0x7C +
        CAST(FK7 AS BINARY(8)) + 0x7C +
        CAST(FK8 AS BINARY(8)) + 0x7C +
        CAST(FK9 AS BINARY(8)) + 0x7C +
        CAST(FK10 AS BINARY(8)) + 0x7C +
        CAST(FK11 AS BINARY(8)) + 0x7C +
        CAST(FK12 AS BINARY(8)) + 0x7C +
        CAST(FK13 AS BINARY(8)) + 0x7C +
        CAST(FK14 AS BINARY(8)) + 0x7C +
        CAST(FK15 AS BINARY(8)) + 0x7C +
        CAST(FK16 AS BINARY(8)) + 0x7C +
        CAST(FK17 AS BINARY(8)) + 0x7C +
        CAST(FK18 AS BINARY(8)) + 0x7C +
        CAST(FK19 AS BINARY(8)) + 0x7C +
        CAST(FK20 AS BINARY(8))  ))
    FROM dbo.HASH_SMALL;';
    
    PRINT @CollisionTestSQL;
    
    EXEC sp_executesql
      @CollisionTestSQL,
      N'@RowsOut INT OUTPUT',
      @RowsOut = @CollisionTestRows OUTPUT;
    
    
    IF (@CollisionTestRows <> @RowCount)
    BEGIN
        RAISERROR('Collisions for algorithm: %d!!!  %d unique rows out of %d.',
        16, 1, @HashAlgorithm, @CollisionTestRows, @RowCount);
    END;
    GO
    

    SETUP 4: Cleanup (DROP All Test Procs)

    DECLARE @SQL NVARCHAR(MAX) = N'';
    SELECT @SQL += N'DROP PROCEDURE [dbo].' + QUOTENAME(sp.[name])
                + N';' + NCHAR(13) + NCHAR(10)
    FROM  sys.objects sp
    WHERE sp.[name] LIKE N'RUN[_]%'
    AND   sp.[type_desc] = N'SQL_STORED_PROCEDURE'
    AND   sp.[name] <> N'RUN_HASHES_FOR_ONE_MINUTE'
    
    PRINT @SQL;
    
    EXEC (@SQL);
    

    SETUP 5: Generate Test Procs

    SET NOCOUNT ON;
    
    DECLARE @TestProcsToCreate TABLE
    (
      ProcName sysname NOT NULL,
      CodeToExec NVARCHAR(261) NOT NULL
    );
    DECLARE @ProcName sysname,
            @CodeToExec NVARCHAR(261);
    
    INSERT INTO @TestProcsToCreate VALUES
      (N'SpookyHash', N'dbo.SpookyHash('),
      (N'HASHBYTES_MD5', N'HASHBYTES(''MD5'','),
      (N'HASHBYTES_SHA2_256', N'HASHBYTES(''SHA2_256'','),
      (N'CHECKSUM', N'CHECKSUM('),
      (N'SpookyHashLOB', N'dbo.SpookyHashLOB('),
      (N'SR_MD5', N'SQL#.Util_HashBinary(N''MD5'','),
      (N'SR_SHA256', N'SQL#.Util_HashBinary(N''SHA256'','),
      (N'SR_MD5_8k', N'SQL#.[Util_HashBinary8k](N''MD5'','),
      (N'SR_SHA256_8k', N'SQL#.[Util_HashBinary8k](N''SHA256'',')
    --/* -- BETA / non-public code
      , (N'SR_SHA256_new', N'SQL#.[Util_HashSHA256Binary8k]('),
      (N'SR_SHA256LOB_new', N'SQL#.[Util_HashSHA256Binary](');
    --*/
    DECLARE @ProcTemplate NVARCHAR(MAX),
            @ProcToCreate NVARCHAR(MAX);
    
    SET @ProcTemplate = N'
    CREATE OR ALTER PROCEDURE dbo.RUN_{{ProcName}}
    AS
    BEGIN
    DECLARE @dummy INT;
    SET NOCOUNT ON;
    
    SELECT @dummy = COUNT({{CodeToExec}}
        CAST(FK1 AS BINARY(8)) + 0x7C +
        CAST(FK2 AS BINARY(8)) + 0x7C +
        CAST(FK3 AS BINARY(8)) + 0x7C +
        CAST(FK4 AS BINARY(8)) + 0x7C +
        CAST(FK5 AS BINARY(8)) + 0x7C +
        CAST(FK6 AS BINARY(8)) + 0x7C +
        CAST(FK7 AS BINARY(8)) + 0x7C +
        CAST(FK8 AS BINARY(8)) + 0x7C +
        CAST(FK9 AS BINARY(8)) + 0x7C +
        CAST(FK10 AS BINARY(8)) + 0x7C +
        CAST(FK11 AS BINARY(8)) + 0x7C +
        CAST(FK12 AS BINARY(8)) + 0x7C +
        CAST(FK13 AS BINARY(8)) + 0x7C +
        CAST(FK14 AS BINARY(8)) + 0x7C +
        CAST(FK15 AS BINARY(8)) + 0x7C +
        CAST(FK16 AS BINARY(8)) + 0x7C +
        CAST(FK17 AS BINARY(8)) + 0x7C +
        CAST(FK18 AS BINARY(8)) + 0x7C +
        CAST(FK19 AS BINARY(8)) + 0x7C +
        CAST(FK20 AS BINARY(8)) 
        )
        )
        FROM dbo.HASH_SMALL
        OPTION (MAXDOP 1);
    
    END;
    ';
    
    DECLARE CreateProcsCurs CURSOR READ_ONLY FORWARD_ONLY LOCAL FAST_FORWARD
    FOR SELECT [ProcName], [CodeToExec]
        FROM @TestProcsToCreate;
    
    OPEN [CreateProcsCurs];
    
    FETCH NEXT
    FROM  [CreateProcsCurs]
    INTO  @ProcName, @CodeToExec;
    
    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        -- First: create VARBINARY version
        SET @ProcToCreate = REPLACE(REPLACE(@ProcTemplate,
                                            N'{{ProcName}}',
                                            @ProcName),
                                    N'{{CodeToExec}}',
                                    @CodeToExec);
    
        EXEC (@ProcToCreate);
    
        -- Second: create NVARCHAR version (optional: built-ins only)
        IF (CHARINDEX(N'.', @CodeToExec) = 0)
        BEGIN
            SET @ProcToCreate = REPLACE(REPLACE(REPLACE(@ProcToCreate,
                                                        N'dbo.RUN_' + @ProcName,
                                                        N'dbo.RUN_' + @ProcName + N'_NVC'),
                                                N'BINARY(8)',
                                                N'NVARCHAR(15)'),
                                        N'0x7C',
                                        N'N''|''');
    
            EXEC (@ProcToCreate);
        END;
    
        FETCH NEXT
        FROM  [CreateProcsCurs]
        INTO  @ProcName, @CodeToExec;
    END;
    
    CLOSE [CreateProcsCurs];
    DEALLOCATE [CreateProcsCurs];
    

    TEST 1: Check For Collisions

    EXEC dbo.VERIFY_NO_COLLISIONS 1;
    EXEC dbo.VERIFY_NO_COLLISIONS 2;
    EXEC dbo.VERIFY_NO_COLLISIONS 3;
    EXEC dbo.VERIFY_NO_COLLISIONS 4;
    EXEC dbo.VERIFY_NO_COLLISIONS 5;
    EXEC dbo.VERIFY_NO_COLLISIONS 6;
    EXEC dbo.VERIFY_NO_COLLISIONS 7;
    EXEC dbo.VERIFY_NO_COLLISIONS 8;
    EXEC dbo.VERIFY_NO_COLLISIONS 9;
    EXEC dbo.VERIFY_NO_COLLISIONS 10;
    EXEC dbo.VERIFY_NO_COLLISIONS 11;
    

    TEST 2: Run Performance Tests

    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 1;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 2;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 3; -- HASHBYTES('SHA2_256'
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 4;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 5;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 6;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 7;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 8;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 9;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 10;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 11;
    EXEC dbo.RUN_HASHES_FOR_ONE_MINUTE 13; -- NVC version of #3
    
    
    SELECT *
    FROM   dbo.LOG_HASHES
    ORDER BY [LOG_TIME] DESC;
    

    VALIDATION ISSUES TO RESOLVE

    The tests above focus on the performance of a single SQLCLR UDF call. Two issues that were discussed early on were not incorporated into the tests, but ideally should be investigated to determine which approach meets all of the requirements.

    1. The function will be executed twice per query (once for the import row, and once for the current row). The tests so far reference the UDF only once in each test query. This factor might not change the ranking of the options, but it shouldn't be ignored.
    2. In a comment that has since been deleted, Paul White had mentioned:

      One downside of replacing HASHBYTES with a CLR scalar function - it appears that CLR functions cannot use batch mode whereas HASHBYTES can. That might be important, performance-wise.

      So that is something to consider, and clearly requires testing. If the SQLCLR options do not provide any benefit over the built-in HASHBYTES, then that adds weight to Solomon's suggestion of capturing existing hashes (for at least the largest tables) into related tables.
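
    As a rough sketch of how the first issue could be tested, the query below calls the hash function twice per row pair, once for the staged row and once for the current row. It assumes a staging copy named dbo.HASH_SMALL_STG with the same ID and FK columns as dbo.HASH_SMALL; that table and the join on ID are hypothetical and not part of the setup scripts above. Only two columns are concatenated to keep the example short.

    ```sql
    -- Hypothetical: dbo.HASH_SMALL_STG is an assumed staging copy of dbo.HASH_SMALL
    -- (same ID and FK1..FK20 columns). It is not created by the setup scripts.
    DECLARE @dummy INT;

    -- Count the rows whose hash differs between the staged and the current version.
    -- The UDF is referenced twice, so each joined row pair costs two hash calls.
    SELECT @dummy = COUNT(*)
    FROM   dbo.HASH_SMALL_STG stg
    INNER JOIN dbo.HASH_SMALL cur
            ON cur.ID = stg.ID
    WHERE  dbo.SpookyHash(CAST(stg.FK1 AS BINARY(8)) + 0x7C +
                          CAST(stg.FK2 AS BINARY(8)))
        <> dbo.SpookyHash(CAST(cur.FK1 AS BINARY(8)) + 0x7C +
                          CAST(cur.FK2 AS BINARY(8)))
    OPTION (MAXDOP 1);
    ```

    Swapping dbo.SpookyHash( for HASHBYTES('SHA2_256', or any of the other options in the test matrix would give the equivalent two-call variant for that algorithm.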

  4. David Browne - Microsoft
    2019-02-09T14:41:51+08:00

    You can probably improve the performance, and perhaps the scalability, of all the .NET approaches by pooling and caching any objects created in the function call. For example, for Paul White's code above:

    // Needs System.Collections.Concurrent and System.Threading.
    // One hasher instance is cached per managed thread and reused across calls.
    static readonly ConcurrentDictionary<int, ISpookyHashV2> hashers =
        new ConcurrentDictionary<int, ISpookyHashV2>();

    public static byte[] SpookyHash([SqlFacet (MaxSize = 8000)] SqlBinary Input)
    {
        ISpookyHashV2 sh = hashers.GetOrAdd(
            Thread.CurrentThread.ManagedThreadId,
            i => SpookyHashV2Factory.Instance.Create());

        return sh.ComputeHash(Input.Value).Hash;
    }
    

    SQL CLR discourages and tries to prevent using static/shared variables, but it will let you use shared variables if you mark them as readonly. That restriction is, of course, meaningless, because you can simply assign a single instance of some mutable type, like ConcurrentDictionary.
