由于在用户数据库上运行 DBCC CHECKDB 时缓冲区锁存超时,我们的 Ola Hallengren IntegrityCheck 作业失败。
但是,报告的缓冲区锁存超时在 TempDB(数据库 ID 2)中。
作业输出:
Date and time: 2022-01-22 09:04:15 [SQLSTATE 01000]
Database context: [master] [SQLSTATE 01000]
Command: SET LOCK_TIMEOUT 600000; DBCC CHECKDB ([SentryOne]) WITH NO_INFOMSGS, ALL_ERRORMSGS, MAXDOP = 4 [SQLSTATE 01000]
Msg 845, Sev 17, State 1, Line 1 : Time-out occurred while waiting for buffer latch type 2 for page (6:222), database ID 2. [SQLSTATE 42000]
Outcome: Failed [SQLSTATE 01000]
Duration: 12:40:32 [SQLSTATE 01000]
Date and time: 2022-01-22 21:44:47 [SQLSTATE 01000]
SQL 错误日志中的消息:
Date 1/22/2022 9:35:22 PM
Log SQL Server (Archive #1 - 1/23/2022 12:00:00 AM)
Source spid777
Message
A time-out occurred while waiting for buffer latch -- type 2, bp 0000016D7A1DE340, page 6:222, stat 0x40d, database id: 2, allocation unit Id: 536870912/281475513581568, task 0x0000016CD624E4E8 : 2, waittime 300 seconds, flags 0x1a, owning task 0x0000016CD624E4E8. Not continuing to wait.
BobMgr::GetBuf: Sort Big Output Buffer write not complete after 60 seconds.
DBCC CHECKDB (SentryOne) WITH all_errormsgs, no_infomsgs, maxdop = 4 executed by NT SERVICE\SQLSERVERAGENT terminated abnormally due to error state 6. Elapsed time: 12 hours 39 minutes 57 seconds.
[INFO] Database ID: [12]. Cleaning up StorageArray. LastClosedCheckpointEndTs: '1225176490'
[WARNING] ALTER or DROP TABLE could not clean up root row within 10 seconds.
有问题的页面是 TempDB 中的 IAM 页面。它的 DBCC PAGE 输出:
PAGE: (6:222)
BUFFER:
BUF @0x0000016D6B622C40
bpage = 0x00000161A6C52000 bPmmpage = 0x0000000000000000 bsort_r_nextbP = 0x0000016D6B622AD0
bsort_r_prevbP = 0x0000016D6B622B80 bhash = 0x0000000000000000 bpageno = (6:222)
bpart = 4 ckptGen = 0x0000000000000000 bDirtyRefCount = 0
bstat = 0x9 breferences = 3 berrcode = 0
bUse1 = 19321 bstat2 = 0x0 blog = 0x215a215a
bsampleCount = 1 bIoCount = 0 resPoolId = 0
bcputicks = 330 bReadMicroSec = 848 bDirtyContext = 0x0000000000000000
bDbPageBroker = 0x0000000000000000 bdbid = 2 bpru = 0x0000016D3A170040
PAGE HEADER:
Page @0x00000161A6C52000
m_pageId = (6:222) m_headerVersion = 1 m_type = 10
m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 1837007664 m_indexId (AllocUnitId.idInd) = 28675
Metadata: AllocUnitId = 8071415347312328704
Metadata: PartitionId = 2162691495132069888 Metadata: IndexId = 0
Metadata: ObjectId = -1083263203 m_prevPage = (0:0) m_nextPage = (0:0)
pminlen = 90 m_slotCnt = 2 m_freeCnt = 6
m_freeData = 8182 m_reservedCnt = 0 m_lsn = (324:1435184:42)
m_xactReserved = 0 m_xdesId = (0:0) m_ghostRecCnt = 0
m_tornBits = -551517712 DB Frag ID = 1
Allocation Status
GAM (6:2) = ALLOCATED SGAM (6:3) = ALLOCATED
PFS (6:1) = 0x70 IAM_PG MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (6:6) = NOT CHANGED
ML (6:7) = NOT MIN_LOGGED
IAM: Header @0x00000036C4278064 Slot 0, Offset 96
sequenceNumber = 0 status = 0x0 objectId = 0
indexId = 0 page_count = 0 start_pg = (8:511232)
IAM: Single Page Allocations @0x00000036C427808E
Slot 0 = (0:0) Slot 1 = (0:0) Slot 2 = (0:0)
Slot 3 = (0:0) Slot 4 = (0:0) Slot 5 = (0:0)
Slot 6 = (0:0) Slot 7 = (0:0)
IAM: Extent Alloc Status Slot 1 @0x00000036C42780C2
(8:511232) - (8:592656) = NOT ALLOCATED
(8:592664) - = ALLOCATED
(8:592672) - (8:770424) = NOT ALLOCATED
我发现这个似乎最相关的 DBA StackExchange 问题: 出现错误“在等待页面 (1:3564879) 的缓冲区闩锁类型 2 时发生超时,数据库 ID 7。” 但是 ID 为 7 的数据库不存在
但是,在这种情况下,报告的超时页面位于用户数据库中,而不是 TempDB 中。
此外,我们使用的是最新的 SQL Server 版本(2019-CU14 Enterprise),因此无论如何都不应该发生这种情况。
我确实注意到 TempDB 数据文件在下午 6 点和 9:40 左右的延迟非常高。不过,我不确定这是否相关,因为下午 6 点的延迟峰值更高,但这项工作仅在晚上 9:44 左右失败。
这些延迟峰值与 Azure Monitor 指标VM Cached IOPS Consumed Percentage
和VM Cached Bandwidth Consumed Percentage
.
附加信息:
- 服务器安装在 Azure VM 上:
Standard E8s v3 (8 vcpus, 64 GiB memory)
- 有问题的用户数据库 (SentryOne) 具有内存优化表。
- TempDB 内存优化元数据已禁用。
- 即时文件初始化已启用。
- 内存中的锁定页面已禁用。
任何想法为什么会发生这种情况以及将来如何防止它?
谢谢!
TempDB 与CheckDB一起使用,很可能有一组项目在等待闩锁(或单个项目,不能用提供的数据说),并且由于某些事情很慢而发生了车队(例如, IO 或内存分配)或仅发生过一次。
根据您的屏幕截图广告,作业的输出:
这相关性很好,根据可用数据,当时看起来是一些慢速磁盘。