Question

Pரதீப்

Asked: 2017-01-03 20:13:25 +0800 CST

How is the number of histogram steps decided in statistics

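How is the number of histogram steps decided in SQL Server statistics?

Why is it restricted to 200 steps even though my key column has more than 200 distinct values? Is there any deciding factor?

Demo

Schema definition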

CREATE TABLE histogram_step
  (
     id   INT IDENTITY(1, 1),
     name VARCHAR(50),
     CONSTRAINT pk_histogram_step PRIMARY KEY (id)
  )

Inserting 100 records into my table

INSERT INTO histogram_step
            (name)
SELECT TOP 100 name
FROM   sys.syscolumns

Updating and checking the statistics

UPDATE STATISTICS histogram_step WITH fullscan

DBCC show_statistics('histogram_step', pk_histogram_step)

Histogram steps:

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
|            1 |          0 |       1 |                   0 |              1 |
|            3 |          1 |       1 |                   1 |              1 |
|            5 |          1 |       1 |                   1 |              1 |
|            7 |          1 |       1 |                   1 |              1 |
|            9 |          1 |       1 |                   1 |              1 |
|           11 |          1 |       1 |                   1 |              1 |
|           13 |          1 |       1 |                   1 |              1 |
|           15 |          1 |       1 |                   1 |              1 |
|           17 |          1 |       1 |                   1 |              1 |
|           19 |          1 |       1 |                   1 |              1 |
|           21 |          1 |       1 |                   1 |              1 |
|           23 |          1 |       1 |                   1 |              1 |
|           25 |          1 |       1 |                   1 |              1 |
|           27 |          1 |       1 |                   1 |              1 |
|           29 |          1 |       1 |                   1 |              1 |
|           31 |          1 |       1 |                   1 |              1 |
|           33 |          1 |       1 |                   1 |              1 |
|           35 |          1 |       1 |                   1 |              1 |
|           37 |          1 |       1 |                   1 |              1 |
|           39 |          1 |       1 |                   1 |              1 |
|           41 |          1 |       1 |                   1 |              1 |
|           43 |          1 |       1 |                   1 |              1 |
|           45 |          1 |       1 |                   1 |              1 |
|           47 |          1 |       1 |                   1 |              1 |
|           49 |          1 |       1 |                   1 |              1 |
|           51 |          1 |       1 |                   1 |              1 |
|           53 |          1 |       1 |                   1 |              1 |
|           55 |          1 |       1 |                   1 |              1 |
|           57 |          1 |       1 |                   1 |              1 |
|           59 |          1 |       1 |                   1 |              1 |
|           61 |          1 |       1 |                   1 |              1 |
|           63 |          1 |       1 |                   1 |              1 |
|           65 |          1 |       1 |                   1 |              1 |
|           67 |          1 |       1 |                   1 |              1 |
|           69 |          1 |       1 |                   1 |              1 |
|           71 |          1 |       1 |                   1 |              1 |
|           73 |          1 |       1 |                   1 |              1 |
|           75 |          1 |       1 |                   1 |              1 |
|           77 |          1 |       1 |                   1 |              1 |
|           79 |          1 |       1 |                   1 |              1 |
|           81 |          1 |       1 |                   1 |              1 |
|           83 |          1 |       1 |                   1 |              1 |
|           85 |          1 |       1 |                   1 |              1 |
|           87 |          1 |       1 |                   1 |              1 |
|           89 |          1 |       1 |                   1 |              1 |
|           91 |          1 |       1 |                   1 |              1 |
|           93 |          1 |       1 |                   1 |              1 |
|           95 |          1 |       1 |                   1 |              1 |
|           97 |          1 |       1 |                   1 |              1 |
|           99 |          1 |       1 |                   1 |              1 |
|          100 |          0 |       1 |                   0 |              1 |
+--------------+------------+---------+---------------------+----------------+

As we can see, the histogram shown above has 51 steps.

Inserting 10,000 more records

INSERT INTO histogram_step
            (name)
SELECT TOP 10000 b.name
FROM   sys.syscolumns a
       CROSS JOIN sys.syscolumns b

Updating and checking the statistics

UPDATE STATISTICS histogram_step WITH fullscan

DBCC show_statistics('histogram_step', pk_histogram_step)

Now the histogram is reduced to 4 steps

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
|            1 |          0 |       1 |                   0 |              1 |
|        10088 |      10086 |       1 |               10086 |              1 |
|        10099 |         10 |       1 |                  10 |              1 |
|        10100 |          0 |       1 |                   0 |              1 |
+--------------+------------+---------+---------------------+----------------+

Inserting 100,000 more records

INSERT INTO histogram_step
            (name)
SELECT TOP 100000 b.name
FROM   sys.syscolumns a
       CROSS JOIN sys.syscolumns b

Updating and checking the statistics

UPDATE STATISTICS histogram_step WITH fullscan

DBCC show_statistics('histogram_step', pk_histogram_step)

Now the histogram is reduced to 3 steps

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
|            1 |          0 |       1 |                   0 |              1 |
|       110099 |     110097 |       1 |              110097 |              1 |
|       110100 |          0 |       1 |                   0 |              1 |
+--------------+------------+---------+---------------------+----------------+

Can someone tell me how these steps are decided?

1 Answer

Joe Obbish · Answer 1 · 2017-01-04T13:21:02+08:00

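I'm going to limit this post to single-column statistics because it's already going to be pretty lengthy and you're interested in how SQL Server buckets the data into histogram steps. For multi-column statistics, the histogram is only created on the leading column.

When SQL Server determines that a statistics update is needed, it kicks off a hidden query that reads either all of a table's data or a sample of it. You can view these queries with extended events. There is a function in SQL Server called StatMan that is involved in creating the histograms. For simple statistics objects there are at least two different types of StatMan queries (there are different queries for quick stat updates, and I suspect that the incremental stats feature on partitioned tables also uses a different query).

The first one just grabs all of the data from the table without any filtering. You can see this when the table is very small or when you gather stats with the FULLSCAN option: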

CREATE TABLE X_SHOW_ME_STATMAN (N INT);
CREATE STATISTICS X_STAT_X_SHOW_ME_STATMAN ON X_SHOW_ME_STATMAN (N);

-- after gathering stats with 1 row in table
SELECT StatMan([SC0]) FROM
(
    SELECT TOP 100 PERCENT [N] AS [SC0] 
    FROM [dbo].[X_SHOW_ME_STATMAN] WITH (READUNCOMMITTED)
    ORDER BY [SC0] 
) AS _MS_UPDSTATS_TBL 
OPTION (MAXDOP 16);

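SQL Server picks the automatic sample size based on the size of the table (I think it's both the number of rows and the number of pages in the table). If a table is too big then the automatic sample size falls below 100%. Here's what I got for the same table with 1M rows: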

-- after gathering stats with 1 M rows in table
SELECT StatMan([SC0], [SB0000]) FROM 
(
    SELECT TOP 100 PERCENT [SC0], step_direction([SC0]) over (order by NULL) AS [SB0000] 
    FROM 
    (
        SELECT [N] AS [SC0] 
        FROM [dbo].[X_SHOW_ME_STATMAN] TABLESAMPLE SYSTEM (6.666667e+001 PERCENT) WITH (READUNCOMMITTED) 
    ) AS _MS_UPDSTATS_TBL_HELPER 
    ORDER BY [SC0], [SB0000] 
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1);

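TABLESAMPLE is documented but StatMan and step_direction are not. Here SQL Server samples around 66.6% of the data from the table to create the histogram. This means that you could get a different number of histogram steps when updating stats (without FULLSCAN) on the same data. I've never observed this in practice but I don't see why it wouldn't be possible.

Let's run a few tests on simple data to see how the stats change over time. Below is some test code that I wrote to insert sequential integers into a table, gather stats after each insert, and save information about the stats into a results table. Let's start by inserting 1 row at a time, up to 10000 rows. Test bed: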

DECLARE
@stats_id INT,
@table_object_id INT,
@rows_per_loop INT = 1,
@num_of_loops INT = 10000,
@loop_num INT;

BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE X_STATS_RESULTS;

    SET @table_object_id = OBJECT_ID ('X_SEQ_NUM');
    SELECT @stats_id = stats_id FROM sys.stats
    WHERE object_id = @table_object_id
    AND name = 'X_STATS_SEQ_INT_FULL';

    SET @loop_num = 0;
    WHILE @loop_num < @num_of_loops
    BEGIN
        SET @loop_num = @loop_num + 1;

        INSERT INTO X_SEQ_NUM WITH (TABLOCK)
        SELECT @rows_per_loop * (@loop_num - 1) + N FROM dbo.GetNums(@rows_per_loop);

        UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN; -- can comment out FULLSCAN as needed

        INSERT INTO X_STATS_RESULTS WITH (TABLOCK)
        SELECT 'X_STATS_SEQ_INT_FULL', @rows_per_loop * @loop_num, rows_sampled, steps 
        FROM sys.dm_db_stats_properties(@table_object_id, @stats_id);
    END;
END;

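For this data the number of histogram steps quickly increases to 200 (it first hits the maximum number of steps with 397 rows), stays at 199 or 200 until 1485 rows are in the table, then slowly decreases until the histogram only has 3 or 4 steps. Here's a graph of all of the data:

[Graph: number of histogram steps as rows are added]

Here's what the histogram looks like for 10k rows: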

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           1       0                   1
9999            9997        1       9997                1
10000           0           1       0                   1

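Is it a problem that the histogram only has 3 steps? From our point of view, the information appears to be preserved. Note that because the data type is INTEGER, we can figure out how many rows are in the table for each integer from 1 to 10000. Typically SQL Server can figure this out too, although there are some cases in which this doesn't quite work out. See this SE post for an example of this.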
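One way to check that nothing is lost, sketched in Python rather than T-SQL: if a step's DISTINCT_RANGE_ROWS equals the number of integers strictly between the previous RANGE_HI_KEY and its own, then every integer in that range must be present. The function name and tuple layout are illustrative assumptions, not something SQL Server exposes.

```python
# Histogram rows copied from the 10k-row output above:
# (RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS, AVG_RANGE_ROWS)
HISTOGRAM_10K = [
    (1, 0, 1, 0, 1),
    (9999, 9997, 1, 9997, 1),
    (10000, 0, 1, 0, 1),
]


def dense_steps(histogram):
    """For each step after the first, report whether every integer strictly
    between the previous RANGE_HI_KEY and this one is present."""
    results = []
    for prev, cur in zip(histogram, histogram[1:]):
        prev_key = prev[0]
        hi_key, _range_rows, _eq_rows, distinct, _avg = cur
        possible_values = hi_key - prev_key - 1  # integers strictly inside the range
        results.append((hi_key, possible_values == distinct))
    return results


print(dense_steps(HISTOGRAM_10K))  # [(9999, True), (10000, True)]
```

Both ranges come out dense, and AVG_RANGE_ROWS is 1, so each integer from 2 to 9998 has exactly one row. This reasoning only works for integer keys, where the set of possible values inside a range is known.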

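What do you think will happen if we delete a single row from the table and update the stats? Ideally we would get another histogram step to show that the missing integer is no longer in the table.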

DELETE FROM X_SEQ_NUM
WHERE X_NUM  = 1000;

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 3 steps

DELETE FROM X_SEQ_NUM
WHERE X_NUM  IN (2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 3 steps

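That's a bit disappointing. If we were building a histogram by hand we would add a step for each missing value. SQL Server is using a general-purpose algorithm, so for some data sets we may be able to come up with a more suitable histogram than the one its code produces. Of course, the practical difference between getting 0 rows or 1 row from a table is very small. I get the same results when testing with 20000 rows, with each integer having 2 rows in the table. The histogram does not gain steps as I delete data.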

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           2       0                   1
9999            19994       2       9997                2
10000           0           2       0                   1

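If I test with 1 million rows, with each integer having 100 rows in the table, I get slightly better results, but I can still construct a better histogram by hand.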

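TRUNCATE TABLE X_SEQ_NUM;
GO

-- Load the integers 1..10000 one hundred times (100 rows per integer).
-- GO 100 reruns only this batch. No explicit transaction is opened, because a
-- BEGIN TRANSACTION inside the repeated batch would nest 100 levels deep.
INSERT INTO X_SEQ_NUM WITH (TABLOCK)
SELECT N FROM dbo.GetNums(10000);
GO 100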

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- 4 steps

DELETE FROM X_SEQ_NUM
WHERE X_NUM  = 1000;

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- now 5 steps with a RANGE_HI_KEY of 998 (?)

DELETE FROM X_SEQ_NUM
WHERE X_NUM  IN (2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 5 steps

Final histogram:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           100     0                   1
998             99600       100     996                 100
3983            298100      100     2981                100
9999            600900      100     6009                100
10000           0           100     0                   1
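To make the cost of the missing steps concrete, here is a simplified reading of this histogram in Python. It assumes that an equality lookup on a value inside a step is estimated at AVG_RANGE_ROWS and a lookup on a RANGE_HI_KEY at EQ_ROWS; the real optimizer has more rules than this, and `estimate_equal_rows` is a hypothetical helper.

```python
import bisect

# Final histogram copied from above:
# (RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS, AVG_RANGE_ROWS)
FINAL_HISTOGRAM = [
    (1, 0, 100, 0, 1),
    (998, 99600, 100, 996, 100),
    (3983, 298100, 100, 2981, 100),
    (9999, 600900, 100, 6009, 100),
    (10000, 0, 100, 0, 1),
]


def estimate_equal_rows(histogram, value):
    """Simplified estimate for `WHERE X_NUM = value` (assumption, not SQL Server code)."""
    keys = [step[0] for step in histogram]
    i = bisect.bisect_left(keys, value)  # first step whose RANGE_HI_KEY >= value
    if i == len(histogram):
        return 0  # above the maximum key
    hi_key, _range_rows, eq_rows, distinct, avg_range_rows = histogram[i]
    if value == hi_key:
        return eq_rows
    if i == 0 or distinct == 0:
        return 0  # below the minimum key, or a range that holds no rows
    return avg_range_rows


print(estimate_equal_rows(FINAL_HISTOGRAM, 1000))  # 100, although 1000 was deleted
print(estimate_equal_rows(FINAL_HISTOGRAM, 998))   # 100

# Integers strictly between 998 and 3983, minus the distinct values recorded:
print((3983 - 998 - 1) - 2981)  # 3 (the deleted 1000, 2000 and 3000)
```

Under these assumptions the histogram still records that three values are missing from the 998 to 3983 range, but not which ones, so a lookup on the deleted value 1000 is estimated at 100 rows.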

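Let's test further with sequential integers but with more rows in the table. Note that for tables that are too small, manually specifying a sample size has no effect, so I will add 100 rows in each insert and gather stats each time, up to 1 million rows. I see a similar pattern as before, except that once I get to 637300 rows in the table I no longer sample 100% of the rows with the default sample rate. As I gain rows the number of histogram steps increases. Perhaps this is because SQL Server ends up with more gaps in the data as the number of unsampled rows in the table increases. I do not hit 200 steps even at 1M rows, but if I kept adding rows I expect I would get there and eventually start going back down.

In the accompanying graph, the X-axis is the number of rows in the table. As the number of rows increases, the number of rows sampled varies a bit and doesn't go over 650k.

Now let's do some simple tests with VARCHAR data.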

CREATE TABLE X_SEQ_STR (X_STR VARCHAR(5));
CREATE STATISTICS X_SEQ_STR ON X_SEQ_STR(X_STR);

Here I'm inserting 200 numbers (as strings) along with NULL.

INSERT INTO X_SEQ_STR
SELECT N FROM dbo.GetNums(200)
UNION ALL
SELECT NULL;

UPDATE STATISTICS X_SEQ_STR X_SEQ_STR ;

DBCC SHOW_STATISTICS ('X_SEQ_STR', 'X_SEQ_STR'); -- 111 steps, RANGE_ROWS is 0 or 1 for all steps

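Note that NULL always gets its own histogram step when it is found in the table. SQL Server could have given me exactly 201 steps to preserve all of the information, but it did not do that. Technically, information is lost because, for example, '1111' sorts between '1' and '2'.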
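The sorting point is easy to check outside the database. This Python sketch compares strings by code point, which is assumed to agree with the column's collation for digit-only strings; the sample values are illustrative.

```python
# Digit strings sort character by character, not by numeric value.
values = ["1", "2", "10", "1111", "200"]
ordered = sorted(values)
print(ordered)  # ['1', '10', '1111', '2', '200']
```

So a step that ends at '2' can cover strings such as '1111' that were never inserted, and the histogram cannot say whether they exist.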

Now let's try inserting different characters instead of just integers:

truncate table X_SEQ_STR;

INSERT INTO X_SEQ_STR
SELECT CHAR(10 + N) FROM dbo.GetNums(200)
UNION ALL
SELECT NULL;

UPDATE STATISTICS X_SEQ_STR X_SEQ_STR ;

DBCC SHOW_STATISTICS ('X_SEQ_STR', 'X_SEQ_STR'); -- 95 steps, RANGE_ROWS is 0 or 1 or 2

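No real difference from the last test.

Now let's try inserting characters but putting a different number of each character in the table. For example, CHAR(11) has 1 row, CHAR(12) has 2 rows, and so on.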

truncate table X_SEQ_STR;

DECLARE
@loop_num INT;

BEGIN
    SET NOCOUNT ON;

    SET @loop_num = 0;
    WHILE @loop_num < 200
    BEGIN
        SET @loop_num = @loop_num + 1;

        INSERT INTO X_SEQ_STR WITH (TABLOCK)
        SELECT CHAR(10 + @loop_num) FROM dbo.GetNums(@loop_num);
    END;
END;

UPDATE STATISTICS X_SEQ_STR X_SEQ_STR ;

DBCC SHOW_STATISTICS ('X_SEQ_STR', 'X_SEQ_STR'); -- 148 steps, most with RANGE_ROWS of 0

As before I still don't get exactly 200 histogram steps. However, many of the steps have RANGE_ROWS of 0.

For the final test, I'm going to insert a random string of 5 characters in each loop and gather stats each time. Here's the code for the random string:

char((rand()*25 + 65))+char((rand()*25 + 65))+char((rand()*25 + 65))+char((rand()*25 + 65))+char((rand()*25 + 65))
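The expression can be ported to Python to see what kind of strings it produces. This sketch assumes that CHAR() truncates its float argument to an integer, as Python's int() does. Under that assumption the code points run from 65 to 89, so the letters are A through Y and Z does not appear. `random_string` and the seed are illustrative.

```python
import random


def random_string(rng, length=5):
    """Python port of char(rand()*25 + 65) concatenated `length` times."""
    # int() truncates toward zero; assumed to match CHAR() on a float argument.
    return "".join(chr(int(rng.random() * 25 + 65)) for _ in range(length))


rng = random.Random(42)  # fixed seed so the sample is repeatable
sample = [random_string(rng) for _ in range(1000)]
letters = sorted(set("".join(sample)))
print(sample[:3])
print(letters[0], letters[-1])  # smallest and largest letters seen
```

With 25 possible letters in 5 positions there are 25^5 = 9,765,625 possible strings, so duplicates are rare while the table is small.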

Here is the graph of rows in table vs histogram steps:

Note that the number of steps doesn't dip below 100 once it starts going up and down. I've heard from somewhere (but can't source it right now) that the SQL Server histogram building algorithm combines histogram steps as it runs out of room for them. So you can end up with drastic changes in the number of steps just by adding a little data. Here's one sample of the data that I found interesting:

ROWS_IN_TABLE   ROWS_SAMPLED    STEPS
36661           36661           133
36662           36662           143
36663           36663           143
36664           36664           141
36665           36665           138

Even when sampling with FULLSCAN, adding a single row can increase the number of steps by 10, keep it constant, then decrease it by 2, then decrease it by 3.
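The step changes in that sample can be recomputed directly from the table above:

```python
# (ROWS_IN_TABLE, STEPS) pairs copied from the sample above.
observations = [
    (36661, 133),
    (36662, 143),
    (36663, 143),
    (36664, 141),
    (36665, 138),
]

# Change in step count caused by each single-row insert.
deltas = [b[1] - a[1] for a, b in zip(observations, observations[1:])]
print(deltas)  # [10, 0, -2, -3]
```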

What can we summarize from all of this? I can't prove any of this, but these observations appear to hold true:

SQL Server uses a general-purpose algorithm to create the histograms. For some data distributions it may be possible to create a more complete representation of the data by hand.
If there is NULL data in the table and the stats query finds it then that NULL data always gets its own histogram step.
The minimum value found in the table gets its own histogram step with RANGE_ROWS = 0.
The maximum value found in the table will be the final RANGE_HI_KEY in the table.
As SQL Server samples more data it may need to combine existing steps to make room for the new data that it finds. If you look at enough histograms you may see common values repeat for DISTINCT_RANGE_ROWS or RANGE_ROWS. For example, 255 shows up a bunch of times for RANGE_ROWS and DISTINCT_RANGE_ROWS for the final test case here.
For simple data distributions you may see SQL Server combine sequential data into one histogram step that causes no loss of information. However when adding gaps to the data the histogram may not adjust in the way you would hope.
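The idea of combining steps to make room can be illustrated with a toy in Python. This is emphatically not SQL Server's algorithm, which is undocumented. It is a made-up rule (absorb the interior step with the fewest EQ_ROWS into its right-hand neighbour until the cap is met) that happens to reproduce the shape seen for sequential integers. `merge_to_cap` and its tuple layout are hypothetical.

```python
def merge_to_cap(steps, max_steps):
    """Toy step merging. `steps` is a key-ordered list of
    (RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS) tuples."""
    if max_steps < 2:
        raise ValueError("need room for at least the minimum and maximum keys")
    steps = list(steps)
    while len(steps) > max_steps:
        # Choose the interior step with the fewest EQ_ROWS (first one on ties);
        # the first and last steps are never absorbed.
        i = min(range(1, len(steps) - 1), key=lambda k: steps[k][2])
        _key, range_rows, eq_rows = steps[i]
        next_key, next_range, next_eq = steps[i + 1]
        # All rows of the absorbed step become range rows of its neighbour.
        steps[i:i + 2] = [(next_key, range_rows + eq_rows + next_range, next_eq)]
    return steps


# Ten sequential integers, one row each, squeezed into three steps.
initial = [(key, 0, 1) for key in range(1, 11)]
print(merge_to_cap(initial, 3))  # [(1, 0, 1), (9, 7, 1), (10, 0, 1)]
```

The total row count is preserved (1 + 8 + 1 = 10), and the result has the same minimum, one wide range, and maximum layout as the 3-step histograms shown earlier.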

When is all of this a problem? It's a problem when a query performs poorly because the histogram is unable to represent the data distribution in a way that lets the query optimizer make good decisions. I think there's a tendency to assume that more histogram steps are always better, and for there to be consternation when SQL Server generates a histogram on millions of rows or more but doesn't use exactly 200 or 201 histogram steps. However, I have seen plenty of stats problems even when the histogram has 200 or 201 steps. We don't have any control over how many histogram steps SQL Server generates for a statistics object, so I wouldn't worry about it. However, there are some steps that you can take when you experience poorly performing queries caused by stats issues. I will give an extremely brief overview.

Gathering statistics in full can help in some cases. For very large tables the auto sample size may be less than 1% of the rows in the table. Sometimes that can lead to bad plans depending on the data distribution in the column. Microsoft's documentation for CREATE STATISTICS and UPDATE STATISTICS says as much:

SAMPLE is useful for special cases in which the query plan, based on default sampling, is not optimal. In most situations, it is not necessary to specify SAMPLE because the query optimizer already uses sampling and determines the statistically significant sample size by default, as required to create high-quality query plans.

For most workloads, a full scan is not required, and default sampling is adequate. However, certain workloads that are sensitive to widely varying data distributions may require an increased sample size, or even a full scan.

In some cases creating filtered statistics can help. You may have a column with skewed data and many distinct values. If there are certain values in the data that are commonly filtered on, you can create a statistics histogram for just those common values. The query optimizer can use the statistics defined on a smaller range of data instead of the statistics defined on all column values. You are still not guaranteed to get 200 steps in the histogram, but if you create the filtered stats on just one value you will get a histogram step for that value.

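Using a partitioned view is one way to effectively get more than 200 steps for a table. Suppose that you can easily split a large table into one table per year. You create a UNION ALL view that combines all of the yearly tables. Each table will have its own histogram. Note that the new incremental statistics introduced in SQL Server 2014 only allow stats updates to be more efficient. The query optimizer will not use the statistics that are created per partition.

There are many more tests that could be run here, so I encourage you to experiment. I did this testing on SQL Server 2014 Express, so really there is nothing stopping you.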
