Mark S. Rasmussen

Asked: 2012-01-20 06:17:42 +0800 CST

I/O requests taking longer than 15 seconds

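Usually our weekly full backups finish in about 35 minutes, with the daily differential backups finishing in about 5 minutes. Since Tuesday the dailies have taken almost 4 hours to complete, far more than should be required. Coincidentally, this started happening right after we got a new SAN/disk configuration.

Note that the server is running in production and we have no overall problems; it's running smoothly, except for the IO issue that primarily manifests itself in backup performance.

Looking at dm_exec_requests during the backup, the backup is constantly waiting on ASYNC_IO_COMPLETION. Aha, so we have disk contention!

However, neither the MDF drive (logs are stored on local disk) nor the backup drive shows any activity (IOPS ~= 0, and we have plenty of memory). Disk queue length ~= 0 as well. CPU hovers around 2-3%, so no issue there either.

The SAN is a Dell MD3220i, and the LUN consists of 6x10k SAS drives. The server is connected to the SAN through two physical paths, each going through a separate switch with redundant connections to the SAN: a total of four paths, two of them active at any time. I can verify through Task Manager that both connections are active, with the load split perfectly evenly. Both connections are running 1G full duplex.

We used to use jumbo frames, but I've disabled them to rule out any issues here, with no change. We have another server (same OS and config, 2008 R2) connected to other LUNs, and it shows no issues. It is not running SQL Server, though; it just shares the LUNs over CIFS. However, one of its LUNs' preferred paths is on the same SAN controller as the troublesome LUNs, so I've ruled that out as well.

Running a couple of SQLIO tests (10G test file) seems to indicate that IO is decent, despite the issues: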

sqlio -kR -t8 -o8 -s30 -frandom -b8 -BN -LS -Fparam.txt
IOs/sec:  3582.20
MBs/sec:    27.98
Min_Latency(ms): 0
Avg_Latency(ms): 3
Max_Latency(ms): 98
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 45  9  5  4  4  4  4  4  4  3  2  2  1  1  1  1  1  1  1  0  0  0  0  0  2

sqlio -kW -t8 -o8 -s30 -frandom -b8 -BN -LS -Fparam.txt
IOs/sec:  4742.16
MBs/sec:    37.04
Min_Latency(ms): 0
Avg_Latency(ms): 2
Max_Latency(ms): 880
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 46 33  2  2  2  2  2  2  2  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  1

sqlio -kR -t8 -o8 -s30 -fsequential -b64 -BN -LS -Fparam.txt
IOs/sec:  1824.60
MBs/sec:   114.03
Min_Latency(ms): 0
Avg_Latency(ms): 8
Max_Latency(ms): 421
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  1  3 14  4 14 43  4  2  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  6

sqlio -kW -t8 -o8 -s30 -fsequential -b64 -BN -LS -Fparam.txt
IOs/sec:  3238.88
MBs/sec:   202.43
Min_Latency(ms): 1
Avg_Latency(ms): 4
Max_Latency(ms): 62
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0  0  0  9 51 31  6  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

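I realize these aren't exhaustive tests by any means, but they do reassure me that it's not complete rubbish. Note that the higher write performance is caused by the two active MPIO paths, whereas reads will only use one of them.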
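As a rough sanity check on the SQLIO figures above, the reported throughput should equal IOPS times block size, and the sequential numbers can be compared against what a single 1 Gbit/s iSCSI path can carry. The sketch below is only arithmetic on the values quoted in this post. The per-link ceiling is a theoretical figure that ignores TCP/iSCSI overhead, and the helper names are made up for illustration.

```python
# SQLIO results quoted above: (label, IOs/sec, block size in KB, reported MBs/sec)
RESULTS = [
    ("random read, 8 KB", 3582.20, 8, 27.98),
    ("random write, 8 KB", 4742.16, 8, 37.04),
    ("sequential read, 64 KB", 1824.60, 64, 114.03),
    ("sequential write, 64 KB", 3238.88, 64, 202.43),
]

# Theoretical ceiling of one 1 Gbit/s link in MiB/s (assumption: no protocol overhead).
LINK_MIB_PER_SEC = 1_000_000_000 / 8 / (1024 * 1024)


def throughput_mib(iops: float, block_kb: int) -> float:
    """Throughput in MiB/s implied by an IOPS figure and a block size in KB."""
    return iops * block_kb / 1024


def links_needed(mib_per_sec: float) -> float:
    """How many 1 Gbit/s links the given throughput would fill."""
    return mib_per_sec / LINK_MIB_PER_SEC


if __name__ == "__main__":
    for label, iops, block_kb, reported in RESULTS:
        computed = throughput_mib(iops, block_kb)
        print(f"{label:24s} computed {computed:7.2f} MB/s, "
              f"reported {reported:7.2f}, {links_needed(computed):.2f} x 1GbE")
```

By this estimate the sequential read fills most of one link while the sequential write needs more than one, which is consistent with reads using a single MPIO path and writes using both.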

Checking the application event log reveals events like these scattered around:

SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\XXX.mdf] in database [XXX] (150).  The OS file handle is 0x0000000000003294.  The offset of the latest long I/O is: 0x00000033da0000
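The offset in that message is a byte offset into the data file, given in hex. Since SQL Server data pages are 8 KB, it can be turned into a page number and a rough position in the file. This is only an illustrative sketch; the function name is hypothetical.

```python
PAGE_SIZE = 8192  # SQL Server data pages are 8 KB


def describe_offset(hex_offset: str) -> tuple:
    """Return (byte offset, page number, MiB into the file) for a hex file offset."""
    offset = int(hex_offset, 16)
    return offset, offset // PAGE_SIZE, offset / (1024 * 1024)


if __name__ == "__main__":
    byte_offset, page, mib = describe_offset("0x00000033da0000")
    print(f"byte offset {byte_offset}, page {page}, about {mib:.1f} MiB into the file")
```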

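These long-I/O warnings aren't constant, but they do occur regularly (a couple per hour, more during backups). Alongside them, the system event log posts these: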

Initiator sent a task management command to reset the target. The target name is given in the dump data.
Target did not respond in time for a SCSI request. The CDB is given in the dump data.

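These also occur on the non-problematic CIFS server running on the same SAN/controller, and from my googling they seem to be non-critical.

Note that all servers use the same NICs: Broadcom 5709Cs with up-to-date drivers. The servers themselves are Dell R610s.

I'm not sure what to check next. Any suggestions?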

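Update - Running perfmon
I tried recording the Avg. Disk sec/Read and Avg. Disk sec/Write perf counters while performing a backup. The backup starts out blazingly fast, then basically stops dead at 50% and crawls slowly towards 100%, taking 20x the time it should.

Task Manager at the start of the backup: both SAN paths are utilized, then drop off.

Perfmon for the same period. The backup started at around 15:38:50; notice that everything looks good at first, and then there's a series of peaks. I'm not concerned with the writes; only the reads seem to hang.

Task Manager at the end of the backup: note the very sparse on/off activity, although it performs well at the very end.

Perfmon for the same period: note the maximum of 12 seconds, although the averages are fine overall.

Update - Backing up to the NUL device
To isolate read issues and simplify things, I ran the following: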

BACKUP DATABASE XXX TO DISK = 'NUL'

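The results were exactly the same: it starts out with a burst of reads and then stalls, resuming operations now and then.

Update - IO stalls
I ran the dm_io_virtual_file_stats query from Jonathan Kehayias and Ted Krueger's book (page 29), as recommended by Shawn. Looking at the top 25 files (one data file each; all results are data files), reads seem to be worse than writes, perhaps because writes go directly to the SAN cache whereas cold reads need to hit disk. That's just a guess, though.

(Figure: IO stalls)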

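Update - Wait stats
I did three tests to gather some wait stats, using Glenn Berry's/Paul Randal's script to query them. Just to confirm: the backups are not going to tape, but to an iSCSI LUN. Results are similar when backing up to the local disk, and resemble those of the NUL backup.

Cleared stats. Ran for 10 minutes, normal load, no backup:

Cleared stats. Ran for 10 minutes, normal load + normal backup running (didn't complete):

Cleared stats. Ran for 10 minutes, normal load + NUL backup running (didn't complete):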

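Update - Wtf, Broadcom?
Based on Mark Storey-Smith's suggestions and Kyle Brandt's previous experiences with Broadcom NICs, I decided to do some experimentation. As we've got multiple active paths, I could relatively easily change the configuration of the NICs one by one without causing any outages.

Disabling TOE and Large Send Offload yielded a near-perfect run: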

Processed 1064672 pages for database 'XXX', file 'XXX' on file 1.
Processed 21 pages for database 'XXX', file 'XXX' on file 1.
BACKUP DATABASE successfully processed 1064693 pages in 58.533 seconds (142.106 MB/sec).

So which is the culprit, TOE or LSO? TOE enabled, LSO disabled:

Didn't finish the backup as it took forever - just as the original problem!

TOE disabled, LSO enabled - looks good:

Processed 1064680 pages for database 'XXX', file 'XXX' on file 1.
Processed 29 pages for database 'XXX', file 'XXX' on file 1.
BACKUP DATABASE successfully processed 1064709 pages in 59.073 seconds (140.809 MB/sec).

And as a control, I disabled both TOE and LSO to confirm the issue was gone:

Processed 1064720 pages for database 'XXX', file 'XXX' on file 1.
Processed 13 pages for database 'XXX', file 'XXX' on file 1.
BACKUP DATABASE successfully processed 1064733 pages in 60.675 seconds (137.094 MB/sec).
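For reference, the MB/sec figures in the three BACKUP messages above can be reproduced from the page counts and durations, assuming 8 KB pages and that SQL Server reports MB as 2^20 bytes (the numbers bear this out). The sketch below is only arithmetic on the quoted output; the run labels and function name are my own.

```python
PAGE_SIZE_BYTES = 8192  # SQL Server data pages are 8 KB

# (label, pages processed, elapsed seconds, MB/sec reported by BACKUP)
RUNS = [
    ("TOE off, LSO off", 1064693, 58.533, 142.106),
    ("TOE off, LSO on", 1064709, 59.073, 140.809),
    ("TOE off, LSO off (control)", 1064733, 60.675, 137.094),
]


def backup_mb_per_sec(pages: int, seconds: float) -> float:
    """Backup throughput in MB/sec (MB = 2**20 bytes) for a page count and duration."""
    return pages * PAGE_SIZE_BYTES / (1024 * 1024) / seconds


if __name__ == "__main__":
    for label, pages, seconds, reported in RUNS:
        computed = backup_mb_per_sec(pages, seconds)
        print(f"{label:28s} computed {computed:8.3f} MB/sec, reported {reported:8.3f}")
```

All three runs with TOE disabled land within a few MB/sec of each other, so LSO appears to make no meaningful difference here.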

In conclusion, it seems the Broadcom NICs' TCP Offload Engine, when enabled, caused the problems. As soon as TOE was disabled, everything worked like a charm. Guess I won't be ordering any more Broadcom NICs going forward.

Update - Down goes the CIFS server
Today the identical, and until now functioning, CIFS server started exhibiting hanging IO requests. This server wasn't running SQL Server, just plain Windows Web Server 2008 R2 serving shares over CIFS. As soon as I disabled TOE on it as well, everything was back to running smoothly.

This just confirms that I won't ever be using TOE on Broadcom NICs again; that is, if I can't avoid Broadcom NICs altogether.

2 Answers

Mark Storey-Smith · Answer 1 · 2012-01-20T07:16:23+08:00

Best Answer

Note that all servers use the same NICs - Broadcom 5709Cs with up-to-date drivers. The servers themselves are Dell R610's.

Kyle Brandt has an opinion on Broadcom network cards which echoes my own (repeated) experience.

Broadcom, Die Mutha

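My problems have always been associated with the TCP offload features, and in 99% of cases disabling them or switching to another network card has resolved the symptoms. One client that (as in your case) uses Dell servers always orders separate Intel NICs and disables the on-board Broadcom cards at build time.

As described in this MSDN blog post, I would start by disabling it in the OS with: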

netsh int ip set chimney DISABLED

IIRC it may be necessary to disable the features at the card driver level in some circumstances; it certainly wouldn't hurt to do so.

15

user507 · Answer 2 · 2012-01-20T06:59:57+08:00

Not that I am a SAN/disk expert (there are folks on here that know more than me)...I only share what I've done a little of and mostly read :)

Jonathan Kehayias and Ted Krueger wrote a book "Troubleshooting SQL Server" that has some good info on disk performance. You can get the PDF for free from here. (I might buy the printed edition of this too for my desk.)

Anyway they have a good query that can be used to check the sys.dm_io_virtual_file_stats and check the average latency on your data files. You may find that RAID10 is not the ideal config for the data files to reside on.

4
