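How is the number of histogram steps decided in SQL Server statistics?
Why is the histogram limited to 200 steps even though my key column has more than 200 distinct values? Is there any deciding factor?
Demo
Schema definition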
CREATE TABLE histogram_step
(
id INT IDENTITY(1, 1),
name VARCHAR(50),
CONSTRAINT pk_histogram_step PRIMARY KEY (id)
)
Inserting 100 records into my table
INSERT INTO histogram_step
(name)
SELECT TOP 100 name
FROM sys.syscolumns
Updating and checking the statistics
UPDATE STATISTICS histogram_step WITH fullscan
DBCC show_statistics('histogram_step', pk_histogram_step)
Histogram steps:
+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
| 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 |
| 11 | 1 | 1 | 1 | 1 |
| 13 | 1 | 1 | 1 | 1 |
| 15 | 1 | 1 | 1 | 1 |
| 17 | 1 | 1 | 1 | 1 |
| 19 | 1 | 1 | 1 | 1 |
| 21 | 1 | 1 | 1 | 1 |
| 23 | 1 | 1 | 1 | 1 |
| 25 | 1 | 1 | 1 | 1 |
| 27 | 1 | 1 | 1 | 1 |
| 29 | 1 | 1 | 1 | 1 |
| 31 | 1 | 1 | 1 | 1 |
| 33 | 1 | 1 | 1 | 1 |
| 35 | 1 | 1 | 1 | 1 |
| 37 | 1 | 1 | 1 | 1 |
| 39 | 1 | 1 | 1 | 1 |
| 41 | 1 | 1 | 1 | 1 |
| 43 | 1 | 1 | 1 | 1 |
| 45 | 1 | 1 | 1 | 1 |
| 47 | 1 | 1 | 1 | 1 |
| 49 | 1 | 1 | 1 | 1 |
| 51 | 1 | 1 | 1 | 1 |
| 53 | 1 | 1 | 1 | 1 |
| 55 | 1 | 1 | 1 | 1 |
| 57 | 1 | 1 | 1 | 1 |
| 59 | 1 | 1 | 1 | 1 |
| 61 | 1 | 1 | 1 | 1 |
| 63 | 1 | 1 | 1 | 1 |
| 65 | 1 | 1 | 1 | 1 |
| 67 | 1 | 1 | 1 | 1 |
| 69 | 1 | 1 | 1 | 1 |
| 71 | 1 | 1 | 1 | 1 |
| 73 | 1 | 1 | 1 | 1 |
| 75 | 1 | 1 | 1 | 1 |
| 77 | 1 | 1 | 1 | 1 |
| 79 | 1 | 1 | 1 | 1 |
| 81 | 1 | 1 | 1 | 1 |
| 83 | 1 | 1 | 1 | 1 |
| 85 | 1 | 1 | 1 | 1 |
| 87 | 1 | 1 | 1 | 1 |
| 89 | 1 | 1 | 1 | 1 |
| 91 | 1 | 1 | 1 | 1 |
| 93 | 1 | 1 | 1 | 1 |
| 95 | 1 | 1 | 1 | 1 |
| 97 | 1 | 1 | 1 | 1 |
| 99 | 1 | 1 | 1 | 1 |
| 100 | 0 | 1 | 0 | 1 |
+--------------+------------+---------+---------------------+----------------+
As the output above shows, the histogram has 51 steps.
Inserting another 10,000 records
INSERT INTO histogram_step
(name)
SELECT TOP 10000 b.name
FROM sys.syscolumns a
CROSS JOIN sys.syscolumns b
Updating and checking the statistics
UPDATE STATISTICS histogram_step WITH fullscan
DBCC show_statistics('histogram_step', pk_histogram_step)
Now the histogram is reduced to 4 steps
+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
| 1 | 0 | 1 | 0 | 1 |
| 10088 | 10086 | 1 | 10086 | 1 |
| 10099 | 10 | 1 | 10 | 1 |
| 10100 | 0 | 1 | 0 | 1 |
+--------------+------------+---------+---------------------+----------------+
Inserting another 100,000 records
INSERT INTO histogram_step
(name)
SELECT TOP 100000 b.name
FROM sys.syscolumns a
CROSS JOIN sys.syscolumns b
Updating and checking the statistics
UPDATE STATISTICS histogram_step WITH fullscan
DBCC show_statistics('histogram_step', pk_histogram_step)
Now the histogram is reduced to 3 steps
+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
| 1 | 0 | 1 | 0 | 1 |
| 110099 | 110097 | 1 | 110097 | 1 |
| 110100 | 0 | 1 | 0 | 1 |
+--------------+------------+---------+---------------------+----------------+
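Can someone tell me how these steps are decided?
I'm going to limit this post to single-column statistics because it will already be pretty lengthy and you're interested in how SQL Server buckets the data into histogram steps. For multi-column statistics the histogram is created only on the leading column.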
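When SQL Server determines that a statistics update is needed, it kicks off a hidden query that reads either all of a table's data or a sample of it. You can view these queries with Extended Events. A function called StatMan within SQL Server is involved in creating the histograms. For simple statistics objects there are at least two different types of StatMan queries (there are different queries for quick stats updates, and I suspect that the incremental statistics feature on partitioned tables also uses a different query).
The first one just grabs all of the data from the table without any filtering. You can see this when the table is very small or when you gather statistics with the FULLSCAN option.
SQL Server picks the automatic sample size based on the size of the table (I think it considers both the number of rows and the number of pages in the table). If a table is too big, the automatic sample size falls below 100%. That is what I saw for the same table with 1M rows, where the statistics query used TABLESAMPLE. TABLESAMPLE is documented, but StatMan and step_direction are not. Here SQL Server sampled around 66.6% of the data in the table to create the histogram. This means that you could get a different number of histogram steps when updating statistics (without FULLSCAN) on the same data. I have never observed this in practice, but I don't see why it wouldn't be possible.
Let's run a few tests on simple data to see how the statistics change over time. I wrote some test code that inserts sequential integers into a table, gathers statistics after each insert, and saves information about the statistics into a results table. Let's start by inserting 1 row at a time, up to 10000 rows.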
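A test bed of this kind can be sketched roughly as follows. This is a minimal sketch, not the exact script behind the numbers in this answer: the names stats_test, stats_results and #hist are hypothetical, and it assumes that capturing DBCC SHOW_STATISTICS output with INSERT ... EXEC is acceptable for counting steps.

```sql
SET NOCOUNT ON;

-- Hypothetical test table and results table.
CREATE TABLE dbo.stats_test
(
    id INT NOT NULL,
    CONSTRAINT pk_stats_test PRIMARY KEY (id)
);

CREATE TABLE dbo.stats_results
(
    row_count  INT NOT NULL,
    step_count INT NOT NULL
);

-- Holds one captured histogram at a time; columns follow the
-- DBCC SHOW_STATISTICS ... WITH HISTOGRAM result set.
CREATE TABLE #hist
(
    RANGE_HI_KEY        SQL_VARIANT NULL,
    RANGE_ROWS          FLOAT,
    EQ_ROWS             FLOAT,
    DISTINCT_RANGE_ROWS BIGINT,
    AVG_RANGE_ROWS      FLOAT
);

DECLARE @i INT = 1;

WHILE @i <= 10000
BEGIN
    -- Insert the next sequential integer.
    INSERT INTO dbo.stats_test (id) VALUES (@i);

    -- Default sampling, so SQL Server picks the sample size itself.
    UPDATE STATISTICS dbo.stats_test pk_stats_test;

    -- Capture the histogram and record how many steps it has.
    TRUNCATE TABLE #hist;
    INSERT INTO #hist
    EXEC ('DBCC SHOW_STATISTICS (''dbo.stats_test'', pk_stats_test) WITH HISTOGRAM');

    INSERT INTO dbo.stats_results (row_count, step_count)
    SELECT @i, COUNT(*) FROM #hist;

    SET @i += 1;
END;

SELECT row_count, step_count
FROM dbo.stats_results
ORDER BY row_count;
```

Updating statistics 10000 times takes a while, so inserting in larger batches is a reasonable shortcut if only the overall trend matters.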
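For this data the number of histogram steps quickly increases to 200 (it first hits the maximum number of steps at 397 rows), stays at 199 or 200 until the table holds 1485 rows, then slowly decreases until the histogram has only 3 or 4 steps. Here's a graph of all of the data:
Here is what the histogram looks like for 10k rows:
Is it a problem that the histogram has only 3 steps? From our point of view the information appears to be preserved. Because the data type is INTEGER, we can work out how many rows the table holds for each integer from 1 to 10000. Typically SQL Server can work this out too, although there are some cases where this doesn't quite work. See this SE post for an example.
What do you think will happen if we delete a single row from the table and update the statistics? Ideally we would get another histogram step to show that the missing integer is no longer in the table.
That's a bit disappointing: the histogram does not gain a step. If we were building the histogram by hand, we would add a step for each missing value. SQL Server uses a general-purpose algorithm, so for some data sets we may be able to come up with a more suitable histogram than the one its code produces. Of course, the practical difference between getting 0 rows or 1 row from a table is very small. I got the same result when testing with 20000 rows, where each integer had 2 rows in the table. The histogram did not gain steps when I deleted data.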
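The delete-and-recheck experiment can be repeated against the histogram_step table from the question. This is only a sketch: the key value 5000 is an arbitrary choice and assumes that row exists.

```sql
-- Remove one key from the middle of the range (5000 is arbitrary).
DELETE FROM histogram_step
WHERE id = 5000;

-- Rebuild the statistics from all rows, then look at the histogram again.
UPDATE STATISTICS histogram_step pk_histogram_step WITH FULLSCAN;

DBCC SHOW_STATISTICS ('histogram_step', pk_histogram_step) WITH HISTOGRAM;
```

Comparing the step list before and after the delete shows whether a step was added around the missing key.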
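If I test with 1 million rows, with each integer having 100 rows in the table, I get slightly better results, but I could still construct a better histogram by hand.
Final histogram:
Let's test further with sequential integers but with more rows in the table. Note that for tables that are too small, manually specifying a sample size has no effect, so I will add 100 rows in each insert and gather statistics each time, up to 1 million rows. I see a similar pattern as before, except that once the table reaches 637300 rows I no longer sample 100% of the rows with the default sample rate. As I gain rows, the number of histogram steps increases. Perhaps this is because SQL Server ends up with more gaps in the data as the number of unsampled rows in the table increases. I do not hit 200 steps even at 1M rows, but if I kept adding rows I expect I would get there and eventually start going back down.
The X-axis is the number of rows in the table. As the number of rows increases, the number of rows sampled varies a bit and doesn't go over 650k.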
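The number of rows actually sampled can be read back after each statistics update. A minimal sketch, assuming your build exposes sys.dm_db_stats_properties and using the histogram_step table from the question as the target:

```sql
-- Compare total rows with sampled rows for every statistics object on the table.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       CAST(100.0 * sp.rows_sampled / NULLIF(sp.rows, 0) AS DECIMAL(5, 2)) AS sampled_pct,
       sp.steps
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('histogram_step');
```

A sampled_pct below 100 means the last update did not read every row, which is the situation where the step count can vary between updates on the same data.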
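Now let's do some simple tests with VARCHAR data.
Here I'm inserting 200 numbers (as strings) along with NULL.
Note that NULL always gets its own histogram step when it is found in the table. SQL Server could have given me exactly 201 steps to preserve all of the information, but it did not do that. Technically, information is lost because, for example, '1111' sorts between '1' and '2'.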
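The VARCHAR test with a NULL can be set up along these lines. This is a sketch with hypothetical names (stats_varchar_test, st_val); sys.all_columns is used only as a convenient row source.

```sql
-- Hypothetical table with a nullable VARCHAR column.
CREATE TABLE dbo.stats_varchar_test
(
    val VARCHAR(10) NULL
);

-- Insert the numbers 1 to 200 as strings.
WITH n AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_columns
)
INSERT INTO dbo.stats_varchar_test (val)
SELECT CAST(i AS VARCHAR(10))
FROM n
WHERE i <= 200;

-- Add a single NULL.
INSERT INTO dbo.stats_varchar_test (val) VALUES (NULL);

-- Build statistics from all rows and inspect the histogram.
CREATE STATISTICS st_val ON dbo.stats_varchar_test (val) WITH FULLSCAN;

DBCC SHOW_STATISTICS ('dbo.stats_varchar_test', st_val) WITH HISTOGRAM;
```

In the output, look for a step whose RANGE_HI_KEY is NULL and count the remaining steps against the 200 distinct strings.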
Now let's try inserting different characters instead of just integers:
There is no real difference from the last test.
Now let's try inserting characters, but with a different number of rows for each character. For example, CHAR(11) has 1 row, CHAR(12) has 2 rows, and so on.
As before, I still don't get exactly 200 histogram steps. However, many of the steps have RANGE_ROWS of 0.
For the final test, I'm going to insert a random string of 5 characters in each loop and gather stats each time. Here's the code for the random string:
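One way to generate such a string is sketched below. It is an assumption about how this could be done, not necessarily the approach used for the results here; it builds five uppercase letters from NEWID().

```sql
DECLARE @s VARCHAR(5) = '';
DECLARE @k INT = 0;

WHILE @k < 5
BEGIN
    -- CHECKSUM(NEWID()) % 26 lies between -25 and 25; ABS maps it to 0..25,
    -- which is then shifted into the ASCII range for 'A' to 'Z'.
    SET @s += CHAR(65 + ABS(CHECKSUM(NEWID()) % 26));
    SET @k += 1;
END;

SELECT @s AS random_string;
```

Taking the modulo before ABS avoids the overflow that ABS would hit on the smallest INT value.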
Here is the graph of rows in table vs histogram steps:
Note that the number of steps doesn't dip below 100 once it starts going up and down. I've heard from somewhere (but can't source it right now) that the SQL Server histogram building algorithm combines histogram steps as it runs out of room for them. So you can end up with drastic changes in the number of steps just by adding a little data. Here's one sample of the data that I found interesting:
Even when sampling with FULLSCAN, adding a single row can increase the number of steps by 10, keep it constant, then decrease it by 2, then decrease it by 3.
What can we summarize from all of this? I can't prove any of this, but these observations appear to hold true:
- The minimum value found in the table gets its own histogram step with RANGE_ROWS = 0.
- The maximum value found in the table is the final RANGE_HI_KEY in the histogram.
- If you look at enough histograms, you may see the same values repeat for DISTINCT_RANGE_ROWS or RANGE_ROWS. For example, 255 shows up a bunch of times for RANGE_ROWS and DISTINCT_RANGE_ROWS for the final test case here.
When is all of this a problem? It's a problem when a query performs poorly because the histogram cannot represent the data distribution well enough for the query optimizer to make good decisions. There is a tendency to think that more histogram steps are always better, and to be concerned when SQL Server generates a histogram on millions of rows or more but doesn't use exactly 200 or 201 steps. However, I have seen plenty of stats problems even when the histogram has 200 or 201 steps. We don't have any control over how many histogram steps SQL Server generates for a statistics object, so I wouldn't worry about it. However, there are some steps that you can take when you experience poorly performing queries caused by stats issues. I will give an extremely brief overview.
Gathering statistics in full can help in some cases. For very large tables the auto sample size may be less than 1% of the rows in the table. Sometimes that can lead to bad plans, depending on the data distribution in the column. Microsoft's documentation for CREATE STATISTICS and UPDATE STATISTICS says as much.
In some cases creating filtered statistics can help. You may have a column with skewed data and many distinct values. If certain values in the data are commonly filtered on, you can create a statistics histogram for just those common values. The query optimizer can use the statistics defined on a smaller range of data instead of the statistics defined on all column values. You are still not guaranteed to get 200 steps in the histogram, but if you create the filtered stats on just one value you will get a histogram step for that value.
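A filtered statistics object on a single value can be sketched like this, using the histogram_step table from the question. The statistics name and the filter value 'id' are hypothetical; substitute a value that your queries actually filter on.

```sql
-- Filtered statistics covering only rows where name = 'id'
-- ('id' is an assumed commonly filtered value).
CREATE STATISTICS st_histogram_step_name_filtered
ON histogram_step (name)
WHERE name = 'id'
WITH FULLSCAN;

-- Inspect the histogram of the filtered statistics object.
DBCC SHOW_STATISTICS ('histogram_step', st_histogram_step_name_filtered) WITH HISTOGRAM;
```

Whether the optimizer picks the filtered statistics depends on the query predicate matching the filter, so it is worth checking the estimated row counts in the plan afterwards.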
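Using a partitioned view is one way to effectively get more than 200 steps for a table. Suppose that you could easily split up a large table into one table per year. You create a UNION ALL view that combines all of the yearly tables. Each table will have its own histogram. Note that the new incremental statistics introduced in SQL Server 2014 only allow statistics updates to be more efficient. The query optimizer will not use the statistics that are created per partition.
There are many more tests that could be run here, so I encourage you to experiment. I did this testing on SQL Server 2014 Express, so there is really nothing stopping you.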
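The partitioned view idea can be sketched as follows. All table, column, and constraint names are hypothetical, and only two years are shown; GO is the batch separator used by SSMS and sqlcmd, needed because CREATE VIEW must start its own batch.

```sql
-- One table per year; the CHECK constraint tells the optimizer
-- which year each table can contain.
CREATE TABLE dbo.sales_2015
(
    sale_id   INT      NOT NULL,
    sale_year SMALLINT NOT NULL CONSTRAINT ck_sales_2015 CHECK (sale_year = 2015),
    amount    MONEY    NOT NULL,
    CONSTRAINT pk_sales_2015 PRIMARY KEY (sale_year, sale_id)
);

CREATE TABLE dbo.sales_2016
(
    sale_id   INT      NOT NULL,
    sale_year SMALLINT NOT NULL CONSTRAINT ck_sales_2016 CHECK (sale_year = 2016),
    amount    MONEY    NOT NULL,
    CONSTRAINT pk_sales_2016 PRIMARY KEY (sale_year, sale_id)
);
GO

-- The view combines the yearly tables; each underlying table keeps
-- its own statistics and therefore its own histogram.
CREATE VIEW dbo.sales_all
AS
SELECT sale_id, sale_year, amount FROM dbo.sales_2015
UNION ALL
SELECT sale_id, sale_year, amount FROM dbo.sales_2016;
GO
```

Queries against dbo.sales_all that filter on sale_year can then be estimated from the histogram of the matching yearly table.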