SQL Server - 使用聚集索引时如何存储数据页

Question

Asked: 2019-02-13 14:39:06 +0800 CST2019-02-13 14:39:06 +0800 CST 2019-02-13 14:39:06 +0800 CST

为有限协作距离的行分配唯一值的解决方案

772

我有一个可以使用以下代码创建和填充的表：

CREATE TABLE dbo.Example(GroupKey int NOT NULL, RecordKey varchar(12) NOT NULL);
ALTER TABLE dbo.Example
    ADD CONSTRAINT iExample PRIMARY KEY CLUSTERED(GroupKey ASC, RecordKey ASC);
INSERT INTO dbo.Example(GroupKey, RecordKey)
VALUES (1, 'Archimedes'), (1, 'Newton'), (1, 'Euler'), (2, 'Euler'), (2, 'Gauss'),
       (3, 'Gauss'), (3, 'Poincaré'), (4, 'Ramanujan'), (5, 'Neumann'),
       (5, 'Grothendieck'), (6, 'Grothendieck'), (6, 'Tao');

对于基于另一行具有有限协作距离RecordKey的所有行，我想分配一个唯一值——我不关心唯一值是什么数据类型或数据类型。

可以使用以下查询生成符合我要求的正确结果集：

SELECT 1 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(1, 2, 3)
UNION ALL
SELECT 2 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey = 4
UNION ALL
SELECT 3 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(5, 6)
ORDER BY SupergroupKey ASC, GroupKey ASC, RecordKey ASC;

为了更好地帮助我解决问题，我将解释为什么GroupKeys 1–3 具有相同的内容SupergroupKey：

GroupKey1 包含RecordKey欧拉，而欧拉又包含在GroupKey2 中；因此GroupKeys 1 和 2 必须具有相同的SupergroupKey.
因为 Gauss 包含在GroupKeys 2 和 3 中，所以它们也必须具有相同的SupergroupKey. 这导致GroupKeys 1–3 具有相同的SupergroupKey.
由于GroupKeys 1–3 不RecordKey与其余GroupKeys 共享任何 s，因此它们是唯一分配SupergroupKey值为 1 的。

我应该补充一点，解决方案需要是通用的。上表和结果集只是一个例子。

附录

我删除了解决方案是非迭代的要求。虽然我更喜欢这样的解决方案，但我认为这是一个不合理的限制。不幸的是，我无法使用任何基于 CLR 的解决方案；但是如果你想包含这样的解决方案，请随意。不过，我可能不会接受它作为答案。

我的真实表中的行数高达 500 万，但有时行数“仅”在一万左右。平均有 8 RecordKeys perGroupKey和 4 GroupKeys per RecordKey。我想一个解决方案将具有指数级的时间复杂度，但我仍然对解决方案感兴趣。

3 个回答

Voted

Michael Green · Answer 1 · 2019-03-05T04:13:58+08:00

这个问题是关于跟踪项目之间的链接。这将其置于图形和图形处理领域。具体来说，整个数据集形成一个图，我们正在寻找该图的组成部分。这可以通过问题样本数据的图表来说明。

问题说我们可以按照 GroupKey 或 RecordKey 来查找共享该值的其他行。所以我们可以将两者都视为图中的顶点。问题继续解释 GroupKeys 1-3 如何具有相同的 SupergroupKey。这可以看作是左侧由细线连接的集群。图片还显示了由原始数据形成的另外两个组件（SupergroupKey）。

SQL Server 在 T-SQL 中内置了一些图形处理能力。然而，此时它非常微不足道，对解决这个问题没有帮助。SQL Server 还能够调出 R 和 Python，以及可供它们使用的丰富而强大的软件包套件。igraph就是其中之一。它是为“快速处理具有数百万个顶点和边（链接）的大型图形”而编写的。

^{在本地测试1}中，使用 R 和 igraph 我能够在 2 分 22 秒内处理一百万行。这是它与当前最佳解决方案的比较：

Record Keys     Paul White  R               
------------    ----------  --------
Per question    15ms        ~220ms
100             80ms        ~270ms
1,000           250ms       430ms
10,000          1.4s        1.7s
100,000         14s         14s
1M              2m29        2m22s
1M              n/a         1m40    process only, no display

The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.

当处理 1M 行时，1 分 40 秒用于加载和处理图形，以及更新表格。用输出填充 SSMS 结果表需要 42 秒。

Observation of Task Manager while 1M rows were processed suggest about 3GB of working memory were required. This was available on this system without paging.

I can confirm Ypercube's assessment of the recursive CTE approach. With a few hundred record keys it consumed 100% of CPU and all available RAM. Eventually tempdb grew to over 80GB and the SPID crashed.

I used Paul's table with the SupergroupKey column so there's a fair comparison between the solutions.

For some reason R objected to the accent on Poincaré. Changing it to a plain "e" allowed it to run. I didn't investigate since it's not germane to the problem at hand. I'm sure there's a solution.

Here is the code

-- This captures the output from R so the base table can be updated.
drop table if exists #Results;

create table #Results
(
    Component   int         not NULL,
    Vertex      varchar(12) not NULL primary key
);


truncate table #Results;    -- facilitates re-execution

declare @Start time = sysdatetimeoffset();  -- for a 'total elapsed' calculation.

insert #Results(Component, Vertex)
exec sp_execute_external_script   
    @language = N'R',
    @input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
    @script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';

-- Write SuperGroupKey to the base table, as other solutions do
update e
set
    SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
    on r.Vertex = e.RecordKey;

-- Return all rows, as other solutions do
select
    e.SupergroupKey,
    e.GroupKey,
    e.RecordKey
from dbo.Example as e;

-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);

This is what the R code does

@input_data_1 is how SQL Server transfers data from a table to R code and translates it to an R dataframe called InputDataSet.
library(igraph) imports the library into the R execution environment.
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE) load the data into an igraph object. This is an undirected graph since we can follow links from group to record or record to group. InputDataSet is SQL Server's default name for the dataset sent to R.
cpts <- components(df.g, mode = c("weak")) process the graph to find discrete sub-graphs (components) and other measures.
OutputDataSet <- data.frame(cpts$membership) SQL Server expects a data frame back from R. Its default name is OutputDataSet. The components are stored in a vector called "membership". This statement translates the vector to a data frame.
OutputDataSet$VertexName <- V(df.g)$name V() is a vector of vertices in the graph - a list of GroupKeys and RecordKeys. This copies them into the ouput data frame, creating a new column called VertexName. This is the key used to match to the source table for updating SupergroupKey.

I'm not an R expert. Likely this could be optimised.

Test Data

The OP's data was used for validation. For scale tests I used the following script.

drop table if exists Records;
drop table if exists Groups;

create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go

set nocount on;

-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount    int             = 1000000;

-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier     numeric(4, 2)   = 2.7;

-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount     int             = FLOOR(@RecordCount * @Multiplier);


-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
    ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c  -- include if needed


declare @c int = 0
while @c < @RecordCount
begin
    -- Can't use a set-based method since RAND() gives the same value for all rows.
    -- There are better ways to do this, but it works well enough.
    -- RecordKeys will be 10 letters, a-z.
    insert Records(RecordKey)
    select
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND()));

    set @c += 1;
end


-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);

truncate table dbo.Example;

select top(1) @RecordKey = RecordKey 
from Records 
where RecordKey > @RecordKey 
order by RecordKey;

while @@ROWCOUNT > 0
begin
    print @Recordkey;

    delete @Groups;

    insert @Groups(GroupKey)
    select distinct C
    from
    (
        -- Hard-code * from OP's statistics
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
    ) as T(C);

    insert dbo.Example(GroupKey, RecordKey)
    select
        GroupKey, @RecordKey
    from @Groups;

    select top(1) @RecordKey = RecordKey 
    from Records 
    where RecordKey > @RecordKey 
    order by RecordKey;
end

-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all 
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, 
      ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);


-- Check what we ended up with:
select COUNT(*) from dbo.Example;  -- Should be @RecordCount * 8
                                   -- Often a little less due to random clashes
select 
    ByGroup = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by GroupKey)) 
    from dbo.Example
) as T(C);

select
    ByRecord = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by RecordKey)) 
    from dbo.Example
) as T(C);

I've just now realised I've gotten the ratios the wrong way around from the OP's definition. I don't believe this will affect the timings. Records & Groups are symmetrical to this process. To the algorithm they're all just nodes in a graph.

In testing the data invariably formed a single component. I believe this is due to the uniform distribution of the data. If instead of the static 1:8 ratio hard-coded into the generation routine I had allowed the ratio to vary there would more likely have been further components.

¹机器规格：Microsoft SQL Server 2017 (RTM-CU12)，开发人员版（64 位），Windows 10 家庭版。16GB RAM，SSD，4 核超线程 i7，标称 2.8GHz。除了正常的系统活动（大约 4% CPU）之外，测试是当时唯一运行的项目。

Paul White · Answer 2 · 2019-02-13T22:05:14+08:00

这是用于性能比较的迭代 T-SQL 解决方案。

它假设可以向表中添加一个额外的列来存储超级组键，并且可以更改索引：

设置

DROP TABLE IF EXISTS 
    dbo.Example;

CREATE TABLE dbo.Example
(
    SupergroupKey integer NOT NULL
        DEFAULT 0, 
    GroupKey integer NOT NULL, 
    RecordKey varchar(12) NOT NULL,

    CONSTRAINT iExample 
    PRIMARY KEY CLUSTERED 
        (GroupKey ASC, RecordKey ASC),

    CONSTRAINT [IX dbo.Example RecordKey, GroupKey]
    UNIQUE NONCLUSTERED (RecordKey, GroupKey),

    INDEX [IX dbo.Example SupergroupKey, GroupKey]
        (SupergroupKey ASC, GroupKey ASC)
);

INSERT dbo.Example
    (GroupKey, RecordKey)
VALUES 
    (1, 'Archimedes'), 
    (1, 'Newton'),
    (1, 'Euler'),
    (2, 'Euler'),
    (2, 'Gauss'),
    (3, 'Gauss'),
    (3, 'Poincaré'),
    (4, 'Ramanujan'),
    (5, 'Neumann'),
    (5, 'Grothendieck'),
    (6, 'Grothendieck'),
    (6, 'Tao');

如果您能够反转当前主键的键顺序，则不需要额外的唯一索引。

大纲

该解决方案的方法是：

设置超级组id为1
查找编号最小的未处理组密钥
如果没有找到，退出
使用当前组键为所有行设置超级组
为当前组中与行相关的所有行设置超级组
重复第 5 步，直到没有行被更新
增加当前超级组 id
转到步骤 2

执行

内联评论：

-- No execution plans or rows affected messages
SET NOCOUNT ON;
SET STATISTICS XML OFF;

-- Reset all supergroups
UPDATE E
SET SupergroupKey = 0
FROM dbo.Example AS E
    WITH (TABLOCKX)
WHERE 
    SupergroupKey != 0;

DECLARE 
    @CurrentSupergroup integer = 0,
    @CurrentGroup integer = 0;

WHILE 1 = 1
BEGIN
    -- Next super group
    SET @CurrentSupergroup += 1;

    -- Find the lowest unprocessed group key
    SELECT 
        @CurrentGroup = MIN(E.GroupKey)
    FROM dbo.Example AS E
    WHERE 
        E.SupergroupKey = 0;

    -- Exit when no more unprocessed groups
    IF @CurrentGroup IS NULL BREAK;

    -- Set super group for all records in the current group
    UPDATE E
    SET E.SupergroupKey = @CurrentSupergroup
    FROM dbo.Example AS E 
    WHERE 
        E.GroupKey = @CurrentGroup;

    -- Iteratively find all groups for the super group
    WHILE 1 = 1
    BEGIN
        WITH 
            RecordKeys AS
            (
                SELECT DISTINCT
                    E.RecordKey
                FROM dbo.Example AS E
                WHERE
                    E.SupergroupKey = @CurrentSupergroup
            ),
            GroupKeys AS
            (
                SELECT DISTINCT
                    E.GroupKey
                FROM RecordKeys AS RK
                JOIN dbo.Example AS E
                    WITH (FORCESEEK)
                    ON E.RecordKey = RK.RecordKey
            )
        UPDATE E WITH (TABLOCKX)
        SET SupergroupKey = @CurrentSupergroup
        FROM GroupKeys AS GK
        JOIN dbo.Example AS E
            ON E.GroupKey = GK.GroupKey
        WHERE
            E.SupergroupKey = 0
        OPTION (RECOMPILE, QUERYTRACEON 9481); -- The original CE does better

        -- Break when no more related groups found
        IF @@ROWCOUNT = 0 BREAK;
    END;
END;

SELECT
    E.SupergroupKey,
    E.GroupKey,
    E.RecordKey
FROM dbo.Example AS E;

执行计划

对于密钥更新：

结果

表的最终状态是：

╔═══════════════╦══════════╦══════════════╗
║ SupergroupKey ║ GroupKey ║  RecordKey   ║
╠═══════════════╬══════════╬══════════════╣
║             1 ║        1 ║ Archimedes   ║
║             1 ║        1 ║ Euler        ║
║             1 ║        1 ║ Newton       ║
║             1 ║        2 ║ Euler        ║
║             1 ║        2 ║ Gauss        ║
║             1 ║        3 ║ Gauss        ║
║             1 ║        3 ║ Poincaré     ║
║             2 ║        4 ║ Ramanujan    ║
║             3 ║        5 ║ Grothendieck ║
║             3 ║        5 ║ Neumann      ║
║             3 ║        6 ║ Grothendieck ║
║             3 ║        6 ║ Tao          ║
╚═══════════════╩══════════╩══════════════╝

演示：数据库<>小提琴

性能测试

使用 Michael Green 的回答中提供的扩展测试数据集，我的笔记本电脑上的时间^*是：

╔═════════════╦════════╗
║ Record Keys ║  Time  ║
╠═════════════╬════════╣
║ 10k         ║ 2s     ║
║ 100k        ║ 12s    ║
║ 1M          ║ 2m 30s ║
╚═════════════╩════════╝

_{* Microsoft SQL Server 2017 (RTM-CU13)，开发人员版（64 位），Windows 10 Pro，16GB RAM，SSD，4 核超线程 i7，标称 2.4GHz。}

ypercubeᵀᴹ · Answer 3 · 2019-02-14T02:11:48+08:00

递归 CTE 方法——在大表中可能效率极低：

WITH rCTE AS 
(
    -- Anchor
    SELECT 
        GroupKey, RecordKey, 
        CAST('|' + CAST(GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS GroupKeys,
        CAST('|' + CAST(RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS RecordKeys,
        1 AS lvl
    FROM Example

    UNION ALL

    -- Recursive
    SELECT
        e.GroupKey, e.RecordKey, 
        CASE WHEN r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
            THEN CAST(r.GroupKeys + CAST(e.GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
            ELSE r.GroupKeys
        END,
        CASE WHEN r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
            THEN CAST(r.RecordKeys + CAST(e.RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
            ELSE r.RecordKeys
        END,
        r.lvl + 1
    FROM rCTE AS r
         JOIN Example AS e
         ON  e.RecordKey = r.RecordKey
         AND r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
         -- 
         OR e.GroupKey = r.GroupKey
         AND r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
)
SELECT 
    ROW_NUMBER() OVER (ORDER BY GroupKeys) AS SuperGroupKey,
    GroupKeys, RecordKeys
FROM rCTE AS c
WHERE NOT EXISTS
      ( SELECT 1
        FROM rCTE AS m
        WHERE m.lvl > c.lvl
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
        OR    m.lvl = c.lvl
          AND ( m.GroupKey > c.GroupKey
             OR m.GroupKey = c.GroupKey
             AND m.RecordKeys > c.RecordKeys
              )
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
          AND c.GroupKeys LIKE '%|' + CAST(m.GroupKey AS VARCHAR(10)) + '|%'
      ) 
OPTION (MAXRECURSION 0) ;

在dbfiddle.uk测试

为有限协作距离的行分配唯一值的解决方案

Test Data

设置

大纲

执行

执行计划

结果

性能测试

连接到 PostgreSQL 服务器：致命：主机没有 pg_hba.conf 条目

如何让sqlplus的输出出现在一行中？

选择具有最大日期或最晚日期的日期

如何列出 PostgreSQL 中的所有模式？

列出指定表的所有列

如何在不修改我自己的 tnsnames.ora 的情况下使用 sqlplus 连接到位于另一台主机上的 Oracle 数据库

你如何mysqldump特定的表？

使用 psql 列出数据库权限

如何从 PostgreSQL 中的选择查询中将值插入表中？

如何使用 psql 列出所有数据库和表？

为有限协作距离的行分配唯一值的解决方案

3 个回答

Test Data

设置

大纲

执行

执行计划

结果

性能测试

相关问题