当接收到多个并发 API 请求时,我们看到生产中出现间歇性死锁。每个请求基本上都会在同一张表中的 INSERT 语句中达到高潮,这就是我们看到死锁的地方。我编写了一个双线程控制台应用程序,它可以通过简单地同时执行两个 API 请求来可靠地重现问题,但仅限于生产环境,而不是暂存环境。(这让我相信,我们的暂存数据库——可能是数据量、SQL Server 2012 与 2005 或索引调整——与生产环境的不同之处在于避免了死锁。代码是相同,我相信是架构。)
由于我现在可以重现死锁,我能够说服我的老板暂时启用跟踪标志 1222,并捕获以下日志:
Date,Source,Severity,Message
09/05/2013 16:32:19,spid71,Unknown,DBCC TRACEOFF 1222<c/> server process ID (SPID) 71. This is an informational message only; no user action is required.
09/05/2013 16:30:55,spid17s,Unknown,waiter id=processf34868 mode=X requestType=wait
09/05/2013 16:30:55,spid17s,Unknown,waiter-list
09/05/2013 16:30:55,spid17s,Unknown,owner id=processf35c18 mode=X
09/05/2013 16:30:55,spid17s,Unknown,owner-list
09/05/2013 16:30:55,spid17s,Unknown,objectlock lockPartition=0 objid=428945000 subresource=FULL dbid=8 objectname=MyDB.DomainTransferRAR id=lock120a72c80 mode=X associatedObjectId=428945000
09/05/2013 16:30:55,spid17s,Unknown,waiter id=processf35c18 mode=X requestType=convert
09/05/2013 16:30:55,spid17s,Unknown,waiter-list
09/05/2013 16:30:55,spid17s,Unknown,owner id=processf34868 mode=IX
09/05/2013 16:30:55,spid17s,Unknown,owner-list
09/05/2013 16:30:55,spid17s,Unknown,objectlock lockPartition=0 objid=2096426938 subresource=FULL dbid=8 objectname=MyDB.DomainTransferRANT id=lock11de95480 mode=IX associatedObjectId=2096426938
09/05/2013 16:30:55,spid17s,Unknown,resource-list
09/05/2013 16:30:55,spid17s,Unknown,VALUES (@p0<c/>
09/05/2013 16:30:55,spid17s,Unknown,(@p0 uniqueidentifier<c/>@p1 int<c/>@p2 int<c/>@p3 varchar(8000)<c/>@p4 char(5)<c/>@p5 int<c/>@p6 datetime<c/>@p7 datetime<c/>@p8 varchar(8000)<c/>@p9 char(5)<c/>@p10 int<c/>@p11 datetime<c/>@p12 datetime<c/>@p13 varchar(8000)<c/>@p14 int<c/>@p15 xml<c/>@p16 datetime<c/>@p17 datetime<c/>@p18 varchar(8000)<c/>@p19 datetime<c/>@p20 datetime<c/>@p21 varchar(8000)<c/>@p22 bit<c/>@p23 varchar(8000)<c/>@p24 varchar(8000)<c/>@p25 uniqueidentifier)INSERT INTO [dbo].[DomainTransferRANT]([DomainTransferRANTGUID]<c/> [PrebookedBillPackageId]<c/> [domainID]<c/> [DomainName]<c/> [GainingWNAccountID]<c/> [GainingRegistrantID]<c/> [dtGainingRequestSent]<c/> [dtGainingResponseReceived]<c/> [DomainTransferGainingPartyRespCode]<c/> [LosingWNAccountID]<c/> [LosingRegistrantID]<c/> [dtLosingRequestSent]<c/> [dtLosingResponseReceived]<c/> [DomainTransferLosingPartyRespCode]<c/> [SubmittedBillPackageId]<c/> [ExtraInfo]<c/> [dtRequestSentToRegistry]<c/> [dtResponseFromRegistry]<c/> [DomainTransferRegistryRespCode]<c/> [dtDeleted]<c/> [dtLastChecked]<c/> [DomainTransferStatusCode]<c/> [SendConfirmationForEachDomain]<c/> [SummaryStatus]<c/> [WHOISBeforeTransfer])
09/05/2013 16:30:55,spid17s,Unknown,inputbuf
09/05/2013 16:30:55,spid17s,Unknown,unknown
09/05/2013 16:30:55,spid17s,Unknown,frame procname=unknown line=1 sqlhandle=0x000000000000000000000000000000000000000000000000
09/05/2013 16:30:55,spid17s,Unknown,VALUES (@p0<c/> @p1<c/> @p2<c/> @p3<c/> @p4<c/> @p5<c/> @p6<c/> @p7<c/> @p8<c/> @p9<c/> @p10<c/> @p11<c/> @p12<c/> @p13<c/> @p14<c/> @p15<c/> @p16<c/> @p17<c/> @p18<c/> @p19<c/> @p20<c/> @p21<c/> @p22<c/> @p23<c/> @p24)
09/05/2013 16:30:55,spid17s,Unknown,INSERT INTO [dbo].[DomainTransferRANT]([DomainTransferRANTGUID]<c/> [PrebookedBillPackageId]<c/> [domainID]<c/> [DomainName]<c/> [GainingWNAccountID]<c/> [GainingRegistrantID]<c/> [dtGainingRequestSent]<c/> [dtGainingResponseReceived]<c/> [DomainTransferGainingPartyRespCode]<c/> [LosingWNAccountID]<c/> [LosingRegistrantID]<c/> [dtLosingRequestSent]<c/> [dtLosingResponseReceived]<c/> [DomainTransferLosingPartyRespCode]<c/> [SubmittedBillPackageId]<c/> [ExtraInfo]<c/> [dtRequestSentToRegistry]<c/> [dtResponseFromRegistry]<c/> [DomainTransferRegistryRespCode]<c/> [dtDeleted]<c/> [dtLastChecked]<c/> [DomainTransferStatusCode]<c/> [SendConfirmationForEachDomain]<c/> [SummaryStatus]<c/> [WHOISBeforeTransfer])
09/05/2013 16:30:55,spid17s,Unknown,frame procname=adhoc line=1 stmtstart=738 stmtend=2322 sqlhandle=0x0200000038d7940c60b76abf51c3cf1bc774fe27ba136260
09/05/2013 16:30:55,spid17s,Unknown,select @iLockDummy = 1 from DomainTransferRANT with (TABLOCKX<c/> HOLDLOCK) OPTION(EXPAND VIEWS)
09/05/2013 16:30:55,spid17s,Unknown,frame procname=MyDB.tri_DomainTransferRANT_InsUpd line=68 stmtstart=10336 stmtend=10544 sqlhandle=0x03000800c8c31e75b259fa002fa200000000000000000000
09/05/2013 16:30:55,spid17s,Unknown,executionStack
09/05/2013 16:30:55,spid17s,Unknown,process id=processf35c18 taskpriority=0 logused=2992 waitresource=OBJECT: 8:2096426938:0 waittime=171 ownerId=26880196297 transactionname=user_transaction lasttranstarted=2013-09-05T16:30:55.710 XDES=0x1c7b18b60 lockMode=X schedulerid=2 kpid=34404 status=suspended spid=69 sbid=0 ecid=0 priority=0 transcount=2 lastbatchstarted=2013-09-05T16:30:55.763 lastbatchcompleted=2013-09-05T16:30:55.710 clientapp=.Net SqlClient Data Provider hostname=WEB1 hostpid=13216 loginname=client isolationlevel=serializable (4) xactid=26880196297 currentdb=8 lockTimeout=4294967295 clientoption1=673316896 clientoption2=128056
09/05/2013 16:30:55,spid17s,Unknown,VALUES (@p0<c/>
09/05/2013 16:30:55,spid17s,Unknown,(@p0 uniqueidentifier<c/>@p1 int<c/>@p2 int<c/>@p3 varchar(8000)<c/>@p4 char(5)<c/>@p5 int<c/>@p6 datetime<c/>@p7 datetime<c/>@p8 varchar(8000)<c/>@p9 char(5)<c/>@p10 int<c/>@p11 datetime<c/>@p12 datetime<c/>@p13 varchar(8000)<c/>@p14 int<c/>@p15 xml<c/>@p16 datetime<c/>@p17 datetime<c/>@p18 varchar(8000)<c/>@p19 datetime<c/>@p20 datetime<c/>@p21 varchar(8000)<c/>@p22 bit<c/>@p23 varchar(8000)<c/>@p24 varchar(8000)<c/>@p25 uniqueidentifier)INSERT INTO [dbo].[DomainTransferRANT]([DomainTransferRANTGUID]<c/> [PrebookedBillPackageId]<c/> [domainID]<c/> [DomainName]<c/> [GainingWNAccountID]<c/> [GainingRegistrantID]<c/> [dtGainingRequestSent]<c/> [dtGainingResponseReceived]<c/> [DomainTransferGainingPartyRespCode]<c/> [LosingWNAccountID]<c/> [LosingRegistrantID]<c/> [dtLosingRequestSent]<c/> [dtLosingResponseReceived]<c/> [DomainTransferLosingPartyRespCode]<c/> [SubmittedBillPackageId]<c/> [ExtraInfo]<c/> [dtRequestSentToRegistry]<c/> [dtResponseFromRegistry]<c/> [DomainTransferRegistryRespCode]<c/> [dtDeleted]<c/> [dtLastChecked]<c/> [DomainTransferStatusCode]<c/> [SendConfirmationForEachDomain]<c/> [SummaryStatus]<c/> [WHOISBeforeTransfer])
09/05/2013 16:30:55,spid17s,Unknown,inputbuf
09/05/2013 16:30:55,spid17s,Unknown,unknown
09/05/2013 16:30:55,spid17s,Unknown,frame procname=unknown line=1 sqlhandle=0x000000000000000000000000000000000000000000000000
09/05/2013 16:30:55,spid17s,Unknown,VALUES (@p0<c/> @p1<c/> @p2<c/> @p3<c/> @p4<c/> @p5<c/> @p6<c/> @p7<c/> @p8<c/> @p9<c/> @p10<c/> @p11<c/> @p12<c/> @p13<c/> @p14<c/> @p15<c/> @p16<c/> @p17<c/> @p18<c/> @p19<c/> @p20<c/> @p21<c/> @p22<c/> @p23<c/> @p24)
09/05/2013 16:30:55,spid17s,Unknown,INSERT INTO [dbo].[DomainTransferRANT]([DomainTransferRANTGUID]<c/> [PrebookedBillPackageId]<c/> [domainID]<c/> [DomainName]<c/> [GainingWNAccountID]<c/> [GainingRegistrantID]<c/> [dtGainingRequestSent]<c/> [dtGainingResponseReceived]<c/> [DomainTransferGainingPartyRespCode]<c/> [LosingWNAccountID]<c/> [LosingRegistrantID]<c/> [dtLosingRequestSent]<c/> [dtLosingResponseReceived]<c/> [DomainTransferLosingPartyRespCode]<c/> [SubmittedBillPackageId]<c/> [ExtraInfo]<c/> [dtRequestSentToRegistry]<c/> [dtResponseFromRegistry]<c/> [DomainTransferRegistryRespCode]<c/> [dtDeleted]<c/> [dtLastChecked]<c/> [DomainTransferStatusCode]<c/> [SendConfirmationForEachDomain]<c/> [SummaryStatus]<c/> [WHOISBeforeTransfer])
09/05/2013 16:30:55,spid17s,Unknown,frame procname=adhoc line=1 stmtstart=738 stmtend=2322 sqlhandle=0x0200000038d7940c60b76abf51c3cf1bc774fe27ba136260
09/05/2013 16:30:55,spid17s,Unknown,select @iLockDummy = 1 from DomainTransferRAR with (TABLOCKX<c/> HOLDLOCK) OPTION(EXPAND VIEWS)
09/05/2013 16:30:55,spid17s,Unknown,frame procname=MyDB.tri_DomainTransferRANT_InsUpd line=67 stmtstart=10140 stmtend=10334 sqlhandle=0x03000800c8c31e75b259fa002fa200000000000000000000
09/05/2013 16:30:55,spid17s,Unknown,executionStack
09/05/2013 16:30:55,spid17s,Unknown,process id=processf34868 taskpriority=0 logused=3000 waitresource=OBJECT: 8:428945000:0 waittime=171 ownerId=26880196295 transactionname=user_transaction lasttranstarted=2013-09-05T16:30:55.710 XDES=0x1c7b18370 lockMode=X schedulerid=2 kpid=13932 status=suspended spid=93 sbid=0 ecid=0 priority=0 transcount=2 lastbatchstarted=2013-09-05T16:30:55.770 lastbatchcompleted=2013-09-05T16:30:55.710 clientapp=.Net SqlClient Data Provider hostname=MYDB hostpid=13216 loginname=client isolationlevel=serializable (4) xactid=26880196295 currentdb=8 lockTimeout=4294967295 clientoption1=673316896 clientoption2=128056
09/05/2013 16:30:55,spid17s,Unknown,process-list
09/05/2013 16:30:55,spid17s,Unknown,deadlock victim=processf35c18
09/05/2013 16:30:55,spid17s,Unknown,deadlock-list
09/05/2013 16:30:08,spid71,Unknown,DBCC TRACEON 1222<c/> server process ID (SPID) 71. This is an informational message only; no user action is required.
需要注意的一件事是在插入相关表时有一个触发器。触发器是确定整个记录的状态代码所必需的,这可能取决于同一表中的兄弟记录。很长一段时间以来,我们认为触发器是死锁的原因,因此我们向触发器添加了越来越激进的锁定提示,最终在当前设置中达到顶峰,我们在关键部分之前对相关表执行 TABLOCKX、HOLDLOCK。我们认为这将通过有效地序列化所有插入来完全防止死锁,但会牺牲一些性能。但似乎并非如此。据我了解,在我们的独占表锁之前的其他东西必须已经持有共享或更新锁。但是什么?
其他可能对您有帮助的信息:DomainTransferRANT 表被大量索引。它的主键是一个非集群 GUID。另一个重要的 INT 列上有一个聚集索引。还有其他 7 个非聚集索引。最后,还有几个外键约束。
我没有机会重新运行跟踪并捕获格式更好的 XDL 文件,但与此同时,我能够通过从触发器中删除所有隔离级别和锁定提示并修改应用程序代码来解决问题来自可序列化事务的 INSERT 本身。
该应用程序使用 LINQ to SQL,因此 INSERT 语句是动态生成的,并在默认的 READ COMMITTED 隔离级别下运行。从死锁跟踪中,我可以看出 INSERT 占用的锁与触发器占用的锁相冲突。即使在从触发器中删除所有隔离和锁定提示之后,死锁仍在发生,因此我推测触发器的 UPDATE 语句导致的锁定比 INSERT 语句更强。当竞争的同时事务试图做同样的事情时,就会发生死锁。
我们之前尝试的解决方案绝对是在正确的轨道上:在触发器中使用更高的隔离级别和更强的锁定。问题是在触发器被调用之前,由 LINQ to SQL 生成的 INSERT 语句已经占用了较弱的锁。解决方案是在我们的 .NET 代码中使用 SEALIZABLE 隔离级别显式启动一个事务,在该事务中执行插入,让触发器发生在同一个事务中(不改变隔离级别或锁定提示),然后提交并处置该事务. 当我这样做时,我不再能够重现该问题(我什至将其放回旧方式并验证问题是否返回。)这是新代码:
通常我们会避免从客户端代码中打开事务,因为如果代码执行远程 API 调用,它们可能会运行很长时间,从而导致过度阻塞。但是在这种情况下,我们隔离了一小部分代码,并且我们围绕它进行了强大的错误/回滚处理。
另一种解决方案是实现在显式 SERIALIZABLE 事务中执行 INSERT 的存储过程,然后更改 DataContext 以调用存储过程。但是,该解决方案的可维护性较差,因为它需要在每次架构更改时更新存储过程和 DataContext 定义。
最后,另一个潜在的替代解决方案是详细检查死锁日志,并找到一种方法来调整我们的索引或键以避免死锁。但是,这更加困难,并且在我们的设计中可能无法实现。
我认为上面的解决方案对于未来的开发人员来说是最简单和最容易理解的。