
Question

Philᵀᴹ

Asked: 2017-06-01 00:39:13 +0800 CST

Query to detail differences between rows for a large amount of data

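I have a number of large tables, each with >300 columns. The application I'm using creates "archives" of changed rows by making a copy of the current row in a secondary table.

Consider a trivial example: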

CREATE TABLE dbo.bigtable
(
  UpdateDate datetime,
  PK varchar(12) PRIMARY KEY,
  col1 varchar(100),
  col2 int,
  col3 varchar(20),
  .
  .
  .
  colN datetime
);

The archive table:

CREATE TABLE dbo.bigtable_archive
(
  UpdateDate datetime,
  PK varchar(12) NOT NULL,
  col1 varchar(100),
  col2 int,
  col3 varchar(20),
  .
  .
  .
  colN datetime
);

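Before any update is executed on dbo.bigtable, a copy of the row is created in dbo.bigtable_archive, and then dbo.bigtable.UpdateDate is updated with the current date.

Therefore, UNIONing the two tables together and grouping by PK creates a timeline of changes when ordered by UpdateDate.

I wish to create a report detailing the differences between rows, ordered by UpdateDate and grouped by PK, in the following format: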

PK,   UpdateDate,  ColumnName,  Old Value,   New Value

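Old Value and New Value can be the relevant columns cast to VARCHAR(MAX) (there are no TEXT or BYTE columns involved), as I don't need to do any post-processing of the values themselves.

At the moment I can't think of a sane way to do this for a large number of columns without resorting to generating the queries programmatically, which I may have to do.

I'm open to lots of ideas, so I'll add a bounty to the question after 2 days.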

6 Answers

Andriy M · Answer 1 · 2017-06-01T04:25:16+08:00

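This is not going to look pretty, especially given the more than 300 columns and the unavailability of LAG, nor is it likely to perform very well, but as a starting point I would try the following approach:

UNION the two tables.
For each PK in the combined set, get its previous "incarnation" from the archive table (the implementation below uses OUTER APPLY + TOP (1) as a poor man's LAG).
Cast each data column to varchar(max) and unpivot them in pairs, i.e. the current and the previous value (CROSS APPLY (VALUES ...) works well for this operation).
Finally, filter the results based on whether the values in each pair differ from each other.

The Transact-SQL of the above, as I see it: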

WITH
  Combined AS
  (
    SELECT * FROM dbo.bigtable
    UNION ALL
    SELECT * FROM dbo.bigtable_archive
  ),
  OldAndNew AS
  (
    SELECT
      this.*,
      OldCol1 = last.Col1,
      OldCol2 = last.Col2,
      ...
    FROM
      Combined AS this
      OUTER APPLY
      (
        SELECT TOP (1)
          *
        FROM
          dbo.bigtable_archive
        WHERE
          PK = this.PK
          AND UpdateDate < this.UpdateDate
        ORDER BY
          UpdateDate DESC
      ) AS last
  )
SELECT
  t.PK,
  t.UpdateDate,
  x.ColumnName,
  x.OldValue,
  x.NewValue
FROM
  OldAndNew AS t
  CROSS APPLY
  (
    VALUES
    ('Col1', CAST(t.OldCol1 AS varchar(max)), CAST(t.Col1 AS varchar(max))),
    ('Col2', CAST(t.OldCol2 AS varchar(max)), CAST(t.Col2 AS varchar(max))),
    ...
  ) AS x (ColumnName, OldValue, NewValue)
WHERE
  NOT EXISTS (SELECT x.OldValue INTERSECT SELECT x.NewValue)
ORDER BY
  t.PK,
  t.UpdateDate,
  x.ColumnName
;
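
The WHERE clause above relies on INTERSECT treating two NULLs as equal, which a plain <> comparison does not do. A minimal sketch of that behaviour, using made-up inline values rather than the real tables (the column names PlainTest and IntersectTest are only for illustration):

```sql
-- Compare a plain <> test with the INTERSECT-based test on made-up value pairs.
SELECT
  v.OldValue,
  v.NewValue,
  -- <> evaluates to UNKNOWN when either side is NULL, so such pairs fall into ELSE
  PlainTest     = CASE WHEN v.OldValue <> v.NewValue
                       THEN 'changed' ELSE 'not reported' END,
  -- INTERSECT treats NULLs as equal, so only genuinely different pairs are flagged
  IntersectTest = CASE WHEN NOT EXISTS (SELECT v.OldValue INTERSECT SELECT v.NewValue)
                       THEN 'changed' ELSE 'unchanged' END
FROM
  (
    VALUES
      (CAST('a' AS varchar(max)), CAST('a' AS varchar(max))),
      ('a', 'b'),
      ('a', NULL),
      (NULL, NULL)
  ) AS v (OldValue, NewValue);
```

The two tests disagree only on the ('a', NULL) pair: the plain comparison silently skips it, while the INTERSECT form reports it as a change.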

Mikael Eriksson · Answer 2 · 2017-06-01T23:32:14+08:00

If you unpivot your data to a temp table

create table #T
(
  PK varchar(12) not null,
  UpdateDate datetime not null,
  ColumnName nvarchar(128) not null,
  Value varchar(max),
  Version int not null
);

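you can match the rows to find the new and old values with a self join on PK, ColumnName and Version = Version + 1.

The not so pretty part is, of course, unpivoting your 300 columns from the two base tables into the temp table.

XML to the rescue, to make things less awkward.

It is possible to unpivot data with XML without having to know which columns the table being unpivoted actually has. The column names must be valid as element names in XML, or it will fail.

The idea is to create one XML for each row, containing all the values of that row.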

select bt.PK,
       bt.UpdateDate,
       (select bt.* for xml path(''), elements xsinil, type) as X
from dbo.bigtable as bt;

<UpdateDate xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">2001-01-03T00:00:00</UpdateDate>
<PK xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">PK1</PK>
<col1 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">c1_1_3</col1>
<col2 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">3</col2>
<col3 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true" />
<colN xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">2001-01-03T00:00:00</colN>

elements xsinil is there to create elements for columns with NULL.
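
A small sketch of what that directive changes, run against one made-up row instead of dbo.bigtable (the aliases WithoutXsinil and WithXsinil are only for illustration):

```sql
-- One made-up row with a NULL in col3.
select (select v.* for xml path(''), type)                  as WithoutXsinil,
       (select v.* for xml path(''), elements xsinil, type) as WithXsinil
from (values ('PK1', 3, cast(null as varchar(20)))) as v(PK, col2, col3);
```

In the first column the col3 element is left out entirely, so a later shred would produce no row for it. In the second column col3 is present with xsi:nil="true", which keeps one element per column for every row.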

You can then shred the XML using nodes('*') to get one row for each column, and use local-name(.) to get the element name and text() to get the value.

  select C1.PK,
         C1.UpdateDate,
         T.X.value('local-name(.)', 'nvarchar(128)') as ColumnName,
         T.X.value('text()[1]', 'varchar(max)') as Value
  from C1
    cross apply C1.X.nodes('*') as T(X)

The full solution is below. Note that Version is reversed: 0 = latest version.

create table #X
(
  PK varchar(12) not null,
  UpdateDate datetime not null,
  Version int not null,
  RowData xml not null
);

create table #T
(
  PK varchar(12) not null,
  UpdateDate datetime not null,
  ColumnName nvarchar(128) not null,
  Value varchar(max),
  Version int not null
);


insert into #X(PK, UpdateDate, Version, RowData)
select bt.PK,
       bt.UpdateDate,
       0,
       (select bt.* for xml path(''), elements xsinil, type)
from dbo.bigtable as bt
union all
select bt.PK,
       bt.UpdateDate,
       row_number() over(partition by bt.PK order by bt.UpdateDate desc),
       (select bt.* for xml path(''), elements xsinil, type)
from dbo.bigtable_archive as bt;

with C as 
(
  select X.PK,
         X.UpdateDate,
         X.Version,
         T.C.value('local-name(.)', 'nvarchar(128)') as ColumnName,
         T.C.value('text()[1]', 'varchar(max)') as Value
  from #X as X
    cross apply X.RowData.nodes('*') as T(C)
)
insert into #T (PK, UpdateDate, ColumnName, Value, Version)
select C.PK,
       C.UpdateDate,
       C.ColumnName,
       C.Value,
       C.Version
from C 
where C.ColumnName not in (N'PK', N'UpdateDate');

/*
option (querytraceon 8649);

The above query might need some trick to go parallel.
For the test data I had on my machine, execution time is 16 seconds vs 2 seconds
https://sqlkiwi.blogspot.com/2011/12/forcing-a-parallel-query-execution-plan.html
http://dataeducation.com/next-level-parallel-plan-forcing-an-alternative-to-8649/

*/

select New.PK,
       New.UpdateDate,
       New.ColumnName,
       Old.Value as OldValue,
       New.Value as NewValue
from #T as New
  left outer join #T as Old
    on Old.PK = New.PK and
       Old.ColumnName = New.ColumnName and
       Old.Version = New.Version + 1;
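
The final query above returns every column of every version. If only actual changes are wanted, one possible variation is an inner join plus a NULL-safe comparison. This is a sketch on a made-up table variable @T with the same shape as #T, not part of the original solution:

```sql
-- Made-up sample in the same shape as #T (0 = latest version).
declare @T table
(
  PK varchar(12) not null,
  UpdateDate datetime not null,
  ColumnName nvarchar(128) not null,
  Value varchar(max),
  Version int not null
);

insert into @T (PK, UpdateDate, ColumnName, Value, Version)
values ('PK1', '20010103', N'col1', 'c1_1_3', 0),
       ('PK1', '20010102', N'col1', 'c1_1_2', 1),
       ('PK1', '20010103', N'col2', '3',      0),
       ('PK1', '20010102', N'col2', '3',      1),
       ('PK1', '20010103', N'col3', null,     0),
       ('PK1', '20010102', N'col3', null,     1);

-- Keep only pairs whose values differ; INTERSECT treats two NULLs as equal.
select New.PK,
       New.UpdateDate,
       New.ColumnName,
       Old.Value as OldValue,
       New.Value as NewValue
from @T as New
  inner join @T as Old
    on Old.PK = New.PK and
       Old.ColumnName = New.ColumnName and
       Old.Version = New.Version + 1
where not exists (select Old.Value intersect select New.Value);
```

With this sample only the col1 row comes back. The inner join also drops the oldest version of each PK, since it has no predecessor to compare against.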

McNets · Answer 3 · 2017-06-07T01:39:56+08:00

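I'd suggest another approach.

Although you cannot change the current application, you may be able to change the database behaviour.

If possible, I'd add two TRIGGERS to the current tables.

One INSTEAD OF INSERT trigger on dbo.bigtable_archive that adds the new record only if it does not already exist.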

CREATE TRIGGER dbo.IoI_BTA
ON dbo.bigtable_archive
INSTEAD OF INSERT
AS
BEGIN
    IF NOT EXISTS(SELECT 1 
                  FROM dbo.bigtable_archive bta
                  INNER JOIN inserted i
                  ON  bta.PK = i.PK
                  AND bta.UpdateDate = i.UpdateDate)
    BEGIN
        INSERT INTO dbo.bigtable_archive
        SELECT * FROM inserted;
    END
END

And an AFTER INSERT trigger on bigtable that does exactly the same job, but using the data of bigtable.

CREATE TRIGGER dbo.IoI_BT
ON dbo.bigtable
AFTER INSERT
AS
BEGIN
    IF NOT EXISTS(SELECT 1 
                  FROM dbo.bigtable_archive bta
                  INNER JOIN inserted i
                  ON  bta.PK = i.PK
                  AND bta.UpdateDate = i.UpdateDate)
    BEGIN
        INSERT INTO dbo.bigtable_archive
        SELECT * FROM inserted;
    END
END
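
One thing to watch: the IF NOT EXISTS check in both triggers looks at the inserted set as a whole, so a multi-row insert is skipped entirely as soon as one of its rows is already archived. A row-by-row filter avoids that. The following is only a sketch of how dbo.IoI_BTA could be rewritten, assuming the trigger and tables above already exist; the same change would apply to dbo.IoI_BT:

```sql
-- Hypothetical set-based rewrite of dbo.IoI_BTA (assumes it was created as shown above).
ALTER TRIGGER dbo.IoI_BTA
ON dbo.bigtable_archive
INSTEAD OF INSERT
AS
BEGIN
    -- Insert only the incoming rows that are not archived yet, so one
    -- duplicate in a multi-row insert does not block the other rows.
    INSERT INTO dbo.bigtable_archive
    SELECT i.*
    FROM inserted AS i
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.bigtable_archive AS bta
                      WHERE bta.PK = i.PK
                        AND bta.UpdateDate = i.UpdateDate);
END
```

This still does not de-duplicate rows that repeat within a single insert statement; that case would need an extra step.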

OK, I've set up a small example here with these initial values:

SELECT * FROM bigtable;
SELECT * FROM bigtable_archive;

UpdateDate          | PK  | col1 | col2 | col3
:------------------ | :-- | :--- | ---: | :---
02/01/2017 00:00:00 | ABC | C3   |    1 | C1

UpdateDate          | PK  | col1 | col2 | col3
:------------------ | :-- | :--- | ---: | :---
01/01/2017 00:00:00 | ABC | C1   |    1 | C1

Now you should insert all pending records from bigtable into bigtable_archive.

INSERT INTO bigtable_archive
SELECT *
FROM   bigtable
WHERE  UpdateDate >= '20170102';

SELECT * FROM bigtable_archive;
GO

UpdateDate          | PK  | col1 | col2 | col3
:------------------ | :-- | :--- | ---: | :---
01/01/2017 00:00:00 | ABC | C1   |    1 | C1
02/01/2017 00:00:00 | ABC | C3   |    1 | C1

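Now, the next time the application tries to insert a record into the bigtable_archive table, the trigger will detect whether it already exists and, if so, skip the insert.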

INSERT INTO dbo.bigtable_archive VALUES('20170102', 'ABC', 'C3', 1, 'C1');
GO

SELECT * FROM bigtable_archive;
GO

UpdateDate          | PK  | col1 | col2 | col3
:------------------ | :-- | :--- | ---: | :---
01/01/2017 00:00:00 | ABC | C1   |    1 | C1
02/01/2017 00:00:00 | ABC | C3   |    1 | C1

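Obviously, you can now get the timeline of changes by querying only the archive table. And the application will never realise that a trigger is quietly doing the job under the covers.

dbfiddle here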

markp-fuso · Answer 4 · 2017-06-11T07:39:56+08:00

Working proposal, w/ some sample data, can be found @ rextester: bigtable unpivot

The gist of the operation:

1 - Use syscolumns and for xml to dynamically generate our column lists for the unpivot operation; all values will be converted to varchar(max), w/ NULLs being converted to the string 'NULL' (this addresses issue with unpivot skipping NULL values)

2 - Generate a dynamic query to unpivot data into the #columns temp table

Why a temp table vs CTE (via with clause)? concerned with potential performance issue for a large volume of data and a CTE self-join with no usable index/hashing scheme; a temp table allows for creation of an index which should improve performance on the self-join [ see slow CTE self join ]
Data is written to #columns in PK+ColName+UpdateDate order, allowing us to store PK/Colname values in adjacent rows; an identity column (rid) allows us to self-join these consecutive rows via rid = rid + 1

3 - Perform a self join of the #temp table to generate the desired output

Cutting-n-pasting from rextester ...

Create some sample data and our #columns table:

CREATE TABLE dbo.bigtable
(UpdateDate datetime      not null
,PK         varchar(12)   not null
,col1       varchar(100)      null
,col2       int               null
,col3       varchar(20)       null
,col4       datetime          null
,col5       char(20)          null
,PRIMARY KEY (PK)
);

CREATE TABLE dbo.bigtable_archive
(UpdateDate datetime      not null
,PK         varchar(12)   not null
,col1       varchar(100)      null
,col2       int               null
,col3       varchar(20)       null
,col4       datetime          null
,col5       char(20)          null
,PRIMARY KEY (PK, UpdateDate)
);

insert into dbo.bigtable         values ('20170512', 'ABC', NULL, 6, 'C1', '20161223', 'closed')

insert into dbo.bigtable_archive values ('20170427', 'ABC', NULL, 6, 'C1', '20160820', 'open')
insert into dbo.bigtable_archive values ('20170315', 'ABC', NULL, 5, 'C1', '20160820', 'open')
insert into dbo.bigtable_archive values ('20170212', 'ABC', 'C1', 1, 'C1', '20160820', 'open')
insert into dbo.bigtable_archive values ('20170109', 'ABC', 'C1', 1, 'C1', '20160513', 'open')

insert into dbo.bigtable         values ('20170526', 'XYZ', 'sue', 23, 'C1', '20161223', 're-open')

insert into dbo.bigtable_archive values ('20170401', 'XYZ', 'max', 12, 'C1', '20160825', 'cancel')
insert into dbo.bigtable_archive values ('20170307', 'XYZ', 'bob', 12, 'C1', '20160825', 'cancel')
insert into dbo.bigtable_archive values ('20170223', 'XYZ', 'bob', 12, 'C1', '20160820', 'open')
insert into dbo.bigtable_archive values ('20170214', 'XYZ', 'bob', 12, 'C1', '20160513', 'open')
;

create table #columns
(rid        int           identity(1,1)
,PK         varchar(12)   not null
,UpdateDate datetime      not null
,ColName    varchar(128)  not null
,ColValue   varchar(max)      null
,PRIMARY KEY (rid, PK, UpdateDate, ColName)
);

The guts of the solution:

declare @columns_max varchar(max),
        @columns_raw varchar(max),
        @cmd         varchar(max)

select  @columns_max = stuff((select ',isnull(convert(varchar(max),'+name+'),''NULL'') as '+name
                from    syscolumns
                where   id   = object_id('dbo.bigtable')
                and     name not in ('PK','UpdateDate')
                order by name
                for xml path(''))
            ,1,1,''),
        @columns_raw = stuff((select ','+name
                from    syscolumns
                where   id   = object_id('dbo.bigtable')
                and     name not in ('PK','UpdateDate')
                order by name
                for xml path(''))
            ,1,1,'')


select @cmd = '
insert #columns (PK, UpdateDate, ColName, ColValue)
select PK,UpdateDate,ColName,ColValue
from
(select PK,UpdateDate,'+@columns_max+' from bigtable
 union all
 select PK,UpdateDate,'+@columns_max+' from bigtable_archive
) p
unpivot
  (ColValue for ColName in ('+@columns_raw+')
) as unpvt
order by PK, ColName, UpdateDate'

--select @cmd

execute(@cmd)

--select * from #columns order by rid
;

select  c2.PK, c2.UpdateDate, c2.ColName as ColumnName, c1.ColValue as 'Old Value', c2.ColValue as 'New Value'
from    #columns c1,
        #columns c2
where   c2.rid                       = c1.rid + 1
and     c2.PK                        = c1.PK
and     c2.ColName                   = c1.ColName
and     isnull(c2.ColValue,'xxx')   != isnull(c1.ColValue,'xxx')
order by c2.UpdateDate, c2.PK, c2.ColName
;

And the results:

Note: apologies ... couldn't figure out an easy way to cut-n-paste the rextester output into a code block. I'm open to suggestions.

Potential issues/concerns:

1 - conversion of data to a generic varchar(max) can lead to loss of data precision which in turn can mean we miss some data changes; consider the following datetime and float pairs which, when converted/cast to the generic 'varchar(max)', lose their precision (ie, the converted values are the same):

original value       varchar(max)
-------------------  -------------------
06/10/2017 10:27:15  Jun 10 2017 10:27AM
06/10/2017 10:27:18  Jun 10 2017 10:27AM

    234.23844444                 234.238
    234.23855555                 234.238

    29333488.888            2.93335e+007
    29333499.999            2.93335e+007

While data precision could be maintained it would require a bit more coding (eg, casting based on source column datatypes); for now I've opted to stick with the generic varchar(max) per the OP's recommendation (and assumption that the OP knows the data well enough to know that we won't run into any issues of data precision loss).
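
As a rough idea of what the datatype-aware casting could look like, here is a sketch using explicit CONVERT styles on the same kind of value pairs as in the table above. The variable names and column aliases are made up, and the style choices (121 for datetime, 2 for float) are just one option:

```sql
-- Same kind of value pairs as in the table above; variable names are made up.
declare @d1 datetime = '2017-06-10T10:27:15',
        @d2 datetime = '2017-06-10T10:27:18',
        @f1 float    = 234.23844444,
        @f2 float    = 234.23855555;

select convert(varchar(max), @d1)      as d1_default,   -- default style drops the seconds
       convert(varchar(max), @d2)      as d2_default,
       convert(varchar(max), @d1, 121) as d1_style121,  -- ODBC canonical, keeps milliseconds
       convert(varchar(max), @d2, 121) as d2_style121,
       convert(varchar(max), @f1)      as f1_default,   -- default style keeps at most 6 digits
       convert(varchar(max), @f2)      as f2_default,
       convert(varchar(max), @f1, 2)   as f1_style2,    -- style 2: 16 digits, scientific notation
       convert(varchar(max), @f2, 2)   as f2_style2;
```

In the dynamic SQL above, this would mean choosing the convert() expression per column from its datatype instead of using one generic convert(varchar(max), ...) for every column.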

2 - for really large sets of data we run the risk of blowing out some server resources, whether it be tempdb space and/or cache/memory; primary issue comes from the data explosion that occurs during an unpivot (eg, we go from 1 row and 302 pieces of data to 300 rows and 1200-1500 pieces of data, including 300 copies of the PK and UpdateDate columns, 300 column names)

Dharmendar Kumar 'DK' · Answer 5 · 2017-06-07T09:02:17+08:00

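This approach uses a dynamic query to generate the SQL that gets the changes. The SP takes a table name and a schema name and produces the output you want.

It assumes that the PK and UpdateDate columns exist in all tables, and that every archive table is named originalTableName + "_archive".

N.B.: I have not checked its performance.

N.B.: because this uses dynamic SQL, I should add a caveat about security and SQL injection. Restrict access to the SP and add other validations to prevent SQL injection.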

    CREATE proc getTableChanges
    @schemaname  varchar(255),
    @tableName varchar(255)
    as

    declare @strg nvarchar(max), @colNameStrg nvarchar(max)='', @oldValueString nvarchar(max)='', @newValueString nvarchar(max)=''

    set @strg = '
    with cte as (

    SELECT  * , ROW_NUMBER() OVER(partition by PK ORDER BY UpdateDate) as RowNbr
    FROM    (

        SELECT  *
        FROM    [' + @schemaname + '].[' + @tableName + ']

        UNION

        SELECT  *
        FROM    [' + @schemaname + '].[' + @tableName + '_archive]

        ) a

    )
    '


    SET @strg = @strg + '

    SELECT  a.pk, a.updateDate, 
    CASE '

    DECLARE @colName varchar(255)
    DECLARE cur CURSOR FOR
        SELECT  COLUMN_NAME
        FROM    INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = @schemaname
        AND TABLE_NAME = @tableName
        AND COLUMN_NAME NOT IN ('PK', 'Updatedate')

    OPEN cur
    FETCH NEXT FROM cur INTO @colName 

    WHILE @@FETCH_STATUS = 0
    BEGIN

        SET @colNameStrg  = @colNameStrg  + ' when a.' + @colName + ' <> b.' + @colName + ' then ''' + @colName + ''' '
        SET @oldValueString = @oldValueString + ' when a.' + @colName + ' <> b.' + @colName + ' then cast(a.' + @colName + ' as varchar(max))'
        SET @newValueString = @newValueString + ' when a.' + @colName + ' <> b.' + @colName + ' then cast(b.' + @colName + ' as varchar(max))'


    FETCH NEXT FROM cur INTO @colName 
    END

    CLOSE cur
    DEALLOCATE cur


    SET @colNameStrg = @colNameStrg  + '    END as ColumnChanges '
    SET @oldValueString = 'CASE ' + @oldValueString + ' END as OldValue'
    SET @newValueString = 'CASE ' + @newValueString + ' END as NewValue'

    SET @strg = @strg + @colNameStrg + ',' + @oldValueString + ',' + @newValueString

    SET @strg = @strg + '
        FROM    cte a join cte b on a.PK = b.PK and a.RowNbr + 1 = b.RowNbr 
        ORDER BY  a.pk, a.UpdateDate
    '

    print @strg

    execute sp_executesql @strg


    go

Example call:

exec getTableChanges 'dbo', 'bigTable'
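
As for the validation mentioned in the injection caveat, one possible shape is to quote the incoming names with QUOTENAME and check that the object really exists before building any dynamic SQL. A minimal sketch, in which @fullName and RowsInTable are made-up names and the COUNT(*) query merely stands in for the generated statement:

```sql
-- Hypothetical input values; in the SP these would be the parameters.
DECLARE @schemaname sysname = N'dbo',
        @tableName  sysname = N'bigtable';

-- QUOTENAME wraps each name in [] and escapes any ] inside it.
DECLARE @fullName nvarchar(517) = QUOTENAME(@schemaname) + N'.' + QUOTENAME(@tableName);

-- Refuse to continue unless the name resolves to an existing user table.
IF OBJECT_ID(@fullName, N'U') IS NULL
BEGIN
    RAISERROR(N'Table %s does not exist.', 16, 1, @fullName);
    RETURN;
END;

-- Only the validated, quoted name is concatenated into the dynamic SQL.
DECLARE @sql nvarchar(max) = N'SELECT COUNT(*) AS RowsInTable FROM ' + @fullName + N';';
EXEC sys.sp_executesql @sql;
```

The column names fetched by the cursor could be passed through QUOTENAME in the same way before they are concatenated.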

KumarHarsh · Answer 6 · 2017-06-09T03:23:57+08:00

I am using AdventureWorks2012, with Production.ProductCostHistory and Production.ProductListPriceHistory, in my example. It may not be a perfect history table example, but the script is able to put together the desired and correct output.

     DECLARE @sql NVARCHAR(MAX)
    ,@columns NVARCHAR(Max)
    ,@table VARCHAR(200) = 'ProductCostHistory'
    ,@Schema VARCHAR(200) = 'Production'
    ,@Archivecolumns NVARCHAR(Max)
    ,@ColForUnpivot NVARCHAR(Max)
    ,@ArchiveColForUnpivot NVARCHAR(Max)
    ,@PKCol VARCHAR(200) = 'ProductID'
    ,@UpdatedCol VARCHAR(200) = 'modifiedDate'
    ,@Histtable VARCHAR(200) = 'ProductListPriceHistory'
SELECT @columns = STUFF((
            SELECT ',CAST(p.' + QUOTENAME(column_name) + ' AS VARCHAR(MAX)) AS ' + QUOTENAME(column_name)
            FROM information_schema.columns
            WHERE table_name = @table
                AND column_name NOT IN (
                    @PKCol
                    ,@UpdatedCol
                    )
            ORDER BY ORDINAL_POSITION
            FOR XML PATH('')
            ), 1, 1, '')
    ,@Archivecolumns = STUFF((
            SELECT ',CAST(p1.' + QUOTENAME(column_name) + ' AS VARCHAR(MAX)) AS ' + QUOTENAME('A_' + column_name)
            FROM information_schema.columns
            WHERE table_name = @Histtable
                AND column_name NOT IN (
                    @PKCol
                    ,@UpdatedCol
                    )
            ORDER BY ORDINAL_POSITION
            FOR XML PATH('')
            ), 1, 1, '')
    ,@ColForUnpivot = STUFF((
            SELECT ',' + QUOTENAME(column_name)
            FROM information_schema.columns
            WHERE table_name = @table
                AND column_name NOT IN (
                    @PKCol
                    ,@UpdatedCol
                    )
            ORDER BY ORDINAL_POSITION
            FOR XML PATH('')
            ), 1, 1, '')
    ,@ArchiveColForUnpivot = STUFF((
            SELECT ',' + QUOTENAME('A_' + column_name)
            FROM information_schema.columns
            WHERE table_name = @Histtable
                AND column_name NOT IN (
                    @PKCol
                    ,@UpdatedCol
                    )
            ORDER BY ORDINAL_POSITION
            FOR XML PATH('')
            ), 1, 1, '')

--SELECT @columns   ,@Archivecolumns    ,@ColForUnpivot
SET @sql = N' 
    SELECT ' + @PKCol + ', ColumnName,
            OldValue,NewValue,' + @UpdatedCol + '
    FROM    (  
    SELECT p.' + @PKCol + '
        ,p.' + @UpdatedCol + '
        ,' + @columns + '
        ,' + @Archivecolumns + '
    FROM ' + @Schema + '.' + @table + ' p
    left JOIN ' + @Schema + '.' + @Histtable + ' p1 ON p.' + @PKCol + ' = p1.' + @PKCol + '

  ) t
    UNPIVOT (
        OldValue
        FOR ColumnName in (' + @ColForUnpivot + ')
    ) up

     UNPIVOT (
        NewValue
        FOR ColumnName1 in (' + @ArchiveColForUnpivot + ')
    ) up1'

--print @sql
EXEC (@sql)

In the inner SELECT query, consider p the main table and p1 the history table. In the unpivot it is important to convert the columns to the same type.

You can try it with any other table that has fewer columns to understand my script. If you need any explanation, ping me.
