SQL Server - 使用聚集索引时如何存储数据页

Question

Joe Obbish

Asked: 2017-06-08 07:49:11 +0800 CST2017-06-08 07:49:11 +0800 CST 2017-06-08 07:49:11 +0800 CST

为什么创建一个简单的 CCI 行组最多需要 30 秒？

772

当我注意到我的一些插入花费的时间比预期的要长时，我正在制作一个涉及 CCI 的演示。要重现的表定义：

DROP TABLE IF EXISTS dbo.STG_1048576;
CREATE TABLE dbo.STG_1048576 (ID BIGINT NOT NULL);
INSERT INTO dbo.STG_1048576
SELECT TOP (1048576) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

对于测试，我从暂存表中插入所有 1048576 行。这足以完全填充一个压缩的行组，只要它由于某种原因没有被修剪。

如果我插入所有 mod 17000 的整数，它需要不到一秒钟的时间：

TRUNCATE TABLE dbo.CCI_BIGINT;

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 17000
FROM dbo.STG_1048576
OPTION (MAXDOP 1);

SQL Server 执行时间：CPU 时间 = 359 毫秒，运行时间 = 364 毫秒。

但是，如果我插入相同的整数 mod 16000，它有时会花费 30 多秒：

TRUNCATE TABLE dbo.CCI_BIGINT;

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 16000
FROM dbo.STG_1048576
OPTION (MAXDOP 1);

SQL Server 执行时间：CPU 时间 = 32062 毫秒，运行时间 = 32511 毫秒。

这是已在多台机器上完成的可重复测试。随着 mod 值的变化，经过的时间似乎有一个清晰的模式：

MOD_NUM TIME_IN_MS
1000    2036
2000    3857
3000    5463
4000    6930
5000    8414
6000    10270
7000    12350
8000    13936
9000    17470
10000   19946
11000   21373
12000   24950
13000   28677
14000   31030
15000   34040
16000   37000
17000   563
18000   583
19000   576
20000   584

如果您想自己运行测试，请随时修改我在此处编写的测试代码。

我在 sys.dm_os_wait_stats 中找不到 mod 16000 插入的任何有趣内容：

╔════════════════════════════════════╦══════════════╗
║             wait_type              ║ diff_wait_ms ║
╠════════════════════════════════════╬══════════════╣
║ XE_DISPATCHER_WAIT                 ║       164406 ║
║ QDS_PERSIST_TASK_MAIN_LOOP_SLEEP   ║       120002 ║
║ LAZYWRITER_SLEEP                   ║        97718 ║
║ LOGMGR_QUEUE                       ║        97298 ║
║ DIRTY_PAGE_POLL                    ║        97254 ║
║ HADR_FILESTREAM_IOMGR_IOCOMPLETION ║        97111 ║
║ SQLTRACE_INCREMENTAL_FLUSH_SLEEP   ║        96008 ║
║ REQUEST_FOR_DEADLOCK_SEARCH        ║        95001 ║
║ XE_TIMER_EVENT                     ║        94689 ║
║ SLEEP_TASK                         ║        48308 ║
║ BROKER_TO_FLUSH                    ║        48264 ║
║ CHECKPOINT_QUEUE                   ║        35589 ║
║ SOS_SCHEDULER_YIELD                ║           13 ║
╚════════════════════════════════════╩══════════════╝

为什么插入 forID % 16000比插入 for 花费的时间长得多ID % 17000？

3 个回答

Voted

Paul White · Answer 1 · 2017-06-16T16:47:39+08:00

In many respects, this is expected behaviour. Any set of compression routines will have widely ranging performance depending on input data distribution. We expect to trade data loading speed for storage size and runtime querying performance.

There is a definite limit to how detailed an answer you're going to get here, since VertiPaq is a proprietary implementation, and the details are a closely-guarded secret. Even so, we do know that VertiPaq contains routines for:

Value encoding (scaling and/or translating values to fit in a small number of bits)
Dictionary encoding (integer references to unique values)
Run Length Encoding (storing runs of repeated values as [value, count] pairs)
Bit-packing (storing the stream in as few bits as possible)

Typically, data will be value or dictionary encoded, then RLE or bit-packing will be applied (or a hybrid of RLE and bit-packing used on different subsections of the segment data). The process of deciding which techniques to apply can involve generating a histogram to help determine how maximum bit savings can be achieved.

Capturing the slow case with Windows Performance Recorder and analyzing the result with Windows Performance Analyzer, we can see that the vast majority of the execution time is consumed in looking at the clustering of the data, building histograms, and deciding how to partition it for best savings:

The most expensive processing occurs for values that appear at least 64 times in the segment. This is a heuristic to determine when pure RLE is likely to be beneficial. The faster cases result in impure storage e.g. a bit-packed representation, with a larger final storage size. In the hybrid cases, values with 64 or more repetitions are RLE encoded, and the remainder are bit-packed.

The longest duration occurs when the maximum number of distinct values with 64 repetitions appear in the largest possible segment i.e. 1,048,576 rows with 16,384 sets of values with 64 entries each. Inspection of the code reveals a hard-coded time limit for the expensive processing. This can be configured in other VertiPaq implementations e.g. SSAS, but not in SQL Server as far as I can tell.

Some insight into the final storage arrangement can be acquired using the undocumented DBCC CSINDEX command. This shows the RLE header and array entries, any bookmarks into the RLE data, and a brief summary of the bit-pack data (if any).

For more information, see:

The VertiPaq Engine in DAX by Alberto Ferrari and Marco Russo
微软专利 WO2015038442：使用 DBMS 引擎处理数据集
微软专利 WO2010039898：用于查询基于列的数据编码结构的高效大规模过滤和/或排序

Joe Obbish · Answer 2 · 2017-06-11T10:50:54+08:00

我不能确切地说出为什么会出现这种行为，但我相信我已经通过蛮力测试开发了一个很好的行为模型。以下结论仅适用于将数据加载到单个列中且整数分布非常均匀的情况。

首先，我尝试使用改变插入到 CCI 中的行数TOP。我用于ID % 16000所有测试。下面是一个图表，比较插入的行与压缩的行组段大小：

下面是插入到 CPU 时间（以毫秒为单位）的行的图表。请注意，X 轴有不同的起点：

我们可以看到行组段大小以线性速率增长，并且使用少量 CPU 直到大约 1 M 行。此时，行组大小显着减小，CPU 使用率显着增加。看起来我们为这种压缩在 CPU 上付出了沉重的代价。

当插入少于 1024000 行时，我最终在 CCI 中得到了一个开放的行组。REORGANIZE但是，使用或强制压缩REBUILD对大小没有影响。顺便说一句，我发现有趣的是，当我使用一个变量时，TOP我最终得到了一个开放的行组，但最终却得到RECOMPILE了一个封闭的行组。

接下来，我通过在保持行数不变的情况下改变模数值来进行测试。这是插入 102400 行时的数据示例：

╔═══════════╦═════════╦═══════════════╦═════════════╗
║ TOP_VALUE ║ MOD_NUM ║ SIZE_IN_BYTES ║ CPU_TIME_MS ║
╠═══════════╬═════════╬═══════════════╬═════════════╣
║    102400 ║    1580 ║         13504 ║         352 ║
║    102400 ║    1590 ║         13584 ║         316 ║
║    102400 ║    1600 ║         13664 ║         317 ║
║    102400 ║    1601 ║         19624 ║         270 ║
║    102400 ║    1602 ║         25568 ║         283 ║
║    102400 ║    1603 ║         31520 ║         286 ║
║    102400 ║    1604 ║         37464 ║         288 ║
║    102400 ║    1605 ║         43408 ║         273 ║
║    102400 ║    1606 ║         49360 ║         269 ║
║    102400 ║    1607 ║         55304 ║         265 ║
║    102400 ║    1608 ║         61256 ║         262 ║
║    102400 ║    1609 ║         67200 ║         255 ║
║    102400 ║    1610 ║         73144 ║         265 ║
║    102400 ║    1620 ║        132616 ║         132 ║
║    102400 ║    1621 ║        138568 ║         100 ║
║    102400 ║    1622 ║        144512 ║          91 ║
║    102400 ║    1623 ║        150464 ║          75 ║
║    102400 ║    1624 ║        156408 ║          60 ║
║    102400 ║    1625 ║        162352 ║          47 ║
║    102400 ║    1626 ║        164712 ║          41 ║
╚═══════════╩═════════╩═══════════════╩═════════════╝

直到 mod 值为 1600，每增加 10 个唯一值，行组段大小线性增加 80 字节。这是一个有趣的巧合，BIGINT传统上 a 占用 8 个字节，并且每增加一个唯一值，段大小就会增加 8 个字节。超过 1600 的 mod 值，段大小会迅速增加，直到它稳定下来。

在保持模值不变并更改插入的行数时查看数据也很有帮助：

╔═══════════╦═════════╦═══════════════╦═════════════╗
║ TOP_VALUE ║ MOD_NUM ║ SIZE_IN_BYTES ║ CPU_TIME_MS ║
╠═══════════╬═════════╬═══════════════╬═════════════╣
║    300000 ║    5000 ║        600656 ║         131 ║
║    305000 ║    5000 ║        610664 ║         124 ║
║    310000 ║    5000 ║        620672 ║         127 ║
║    315000 ║    5000 ║        630680 ║         132 ║
║    320000 ║    5000 ║         40688 ║        2344 ║
║    325000 ║    5000 ║         40696 ║        2577 ║
║    330000 ║    5000 ║         40704 ║        2589 ║
║    335000 ║    5000 ║         40712 ║        2673 ║
║    340000 ║    5000 ║         40728 ║        2715 ║
║    345000 ║    5000 ║         40736 ║        2744 ║
║    350000 ║    5000 ║         40744 ║        2157 ║
╚═══════════╩═════════╩═══════════════╩═════════════╝

看起来当插入的行数 < ~64 * 唯一值的数量时，我们看到相对较差的压缩（mod <= 65000 每行 2 个字节）和较低的线性 CPU 使用率。当插入的行数 > ~64 * 唯一值的数量时，我们会看到更好的压缩和更高的线性 CPU 使用率。两种状态之间有一个过渡，这对我来说不容易建模，但可以在图中看到。当为每个唯一值恰好插入 64 行时，我们看到的最大 CPU 使用率似乎并不正确。相反，我们最多只能将 1048576 行插入到一个行组中，一旦每个唯一值超过 64 行，我们就会看到更高的 CPU 使用率和压缩率。

下面是 cpu 时间如何随着插入行数和唯一行数的变化而变化的等高线图。我们可以看到上面描述的模式：

下面是该段使用的空间的等高线图。在某一点之后，我们开始看到更好的压缩，如上所述：

似乎这里至少有两种不同的压缩算法在起作用。鉴于上述情况，我们可以在插入 1048576 行时看到最大 CPU 使用率。当插入大约 16000 行时，我们看到此时 CPU 使用率最高也是有道理的。1048576 / 64 = 16384。

我在这里上传了我所有的原始数据，以防有人想要分析它。

值得一提的是并行计划会发生什么。我只观察到这种行为具有均匀分布的值。在进行并行插入时，通常存在随机因素，并且线程通常是不平衡的。

将 2097152 行放入暂存表：

DROP TABLE IF EXISTS STG_2097152;
CREATE TABLE dbo.STG_2097152 (ID BIGINT NOT NULL);
INSERT INTO dbo.STG_2097152 WITH (TABLOCK)
SELECT TOP (2097152) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

此插入在不到一秒内完成，压缩效果很差：

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 16000
FROM dbo.STG_2097152 
OPTION (MAXDOP 2);

我们可以看到不平衡线程的影响：

╔════════════╦════════════╦══════════════╦═══════════════╗
║ state_desc ║ total_rows ║ deleted_rows ║ size_in_bytes ║
╠════════════╬════════════╬══════════════╬═══════════════╣
║ OPEN       ║      13540 ║            0 ║        311296 ║
║ COMPRESSED ║    1048576 ║            0 ║       2095872 ║
║ COMPRESSED ║    1035036 ║            0 ║       2070784 ║
╚════════════╩════════════╩══════════════╩═══════════════╝

我们可以使用各种技巧来强制线程平衡并具有相同的行分布。这是其中之一：

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT FLOOR(0.5 * ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))  % 15999
FROM dbo.STG_2097152
OPTION (MAXDOP 2)

在这里为模数选择一个奇数很重要。SQL Server 串行扫描暂存表，计算行号，然后使用循环分配将行放在并行线程上。这意味着我们最终会得到完美平衡的线程。

插入大约需要 40 秒，这与串行插入类似。我们得到很好压缩的行组：

╔════════════╦════════════╦══════════════╦═══════════════╗
║ state_desc ║ total_rows ║ deleted_rows ║ size_in_bytes ║
╠════════════╬════════════╬══════════════╬═══════════════╣
║ COMPRESSED ║    1048576 ║            0 ║        128568 ║
║ COMPRESSED ║    1048576 ║            0 ║        128568 ║
╚════════════╩════════════╩══════════════╩═══════════════╝

我们可以通过从原始临时表中插入数据来获得相同的结果：

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT t.ID % 16000 ID
FROM  (
    SELECT TOP (2) ID 
    FROM (SELECT 1 ID UNION ALL SELECT 2 ) r
) s
CROSS JOIN dbo.STG_1048576 t
OPTION (MAXDOP 2, NO_PERFORMANCE_SPOOL);

这里循环分配用于派生表s，因此在每个并行线程上完成一次表扫描：

In conclusion, when inserting evenly distributed integers you can see very high compression when each unique integer appears more than 64 times. This may be due to a different compression algorithm being used. There can be a high cost in CPU to achieve this compression. Small changes in the data can lead to dramatic differences in the size of the compressed rowgroup segment. I suspect that seeing the worst case (from a CPU perspective) will be uncommon in the wild, at least for this data set. It's even harder to see when doing parallel inserts.

Niko Neugebuer · Answer 3 · 2017-06-11T01:09:59+08:00

我相信，这与单列表压缩的内部优化以及字典占用的 64 KB 的幻数有关。

示例：如果使用MOD 16600运行，行组大小的最终结果将为1.683 MB，而运行MOD 17000将为您提供大小为2.001 MB的行组。

现在，查看创建的词典（您可以为此使用我的CISL 库，您将需要函数 cstore_GetDictionaries，或者去查询 sys.column_store_dictionaries DMV）：

(MOD 16600) 61 KB

(MOD 17000) 65 KB

有趣的是，如果您要在表中添加另一列，我们称之为 REALID ：

DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, REALID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);

重新加载 MOD 16600 的数据：

TRUNCATE TABLE dbo.CCI_BIGINT;

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 16600, ID
FROM dbo.STG_1048576
OPTION (MAXDOP 1);

这次执行会很快，因为优化器会决定不要过度工作并将其压缩得太远：

select column_id, segment_id, cast(sum(seg.on_disk_size) / 1024. / 1024 as Decimal(8,3) ) as SizeInMB
    from sys.column_store_segments seg
        inner join sys.partitions part
            on seg.hobt_id = part.hobt_id 
    where object_id = object_id('dbo.CCI_BIGINT')
    group by column_id, segment_id;

即使行组大小之间存在微小差异，也可以忽略不计（2.000 (MOD 16600) 与 2.001 (MOD 17000)）

对于这种情况，MOD 16000 的字典将比具有 1 列的第一种情况更大（0.63 对 0.61）。

为什么创建一个简单的 CCI 行组最多需要 30 秒？

连接到 PostgreSQL 服务器：致命：主机没有 pg_hba.conf 条目

如何让sqlplus的输出出现在一行中？

选择具有最大日期或最晚日期的日期

如何列出 PostgreSQL 中的所有模式？

列出指定表的所有列

如何在不修改我自己的 tnsnames.ora 的情况下使用 sqlplus 连接到位于另一台主机上的 Oracle 数据库

你如何mysqldump特定的表？

使用 psql 列出数据库权限

如何从 PostgreSQL 中的选择查询中将值插入表中？

如何使用 psql 列出所有数据库和表？

为什么创建一个简单的 CCI 行组最多需要 30 秒？

3 个回答

相关问题