Optimizing a join on a large table

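I'm trying to coax more performance out of a query that accesses a table with roughly 250 million records. From my reading of the actual (not estimated) execution plan, the first bottleneck is a query that looks like this: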

select
    b.stuff,
    a.added,
    a.value
from
    dbo.hugetable a
    inner join
    #smalltable b on a.fk = b.pk
where
    a.added between @start and @end;

See further down for the definitions of the tables and indexes involved.

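The execution plan shows a nested loop join driven by #smalltable, with the index scan over hugetable executed 480 times (once for each row in #smalltable). This seems backwards to me, so I tried to force a merge join instead: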

select
    b.stuff,
    a.added,
    a.value
from
    dbo.hugetable a with(index = ix_hugetable)
    inner merge join
    #smalltable b with(index(1)) on a.fk = b.pk
where
    a.added between @start and @end;

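The index in question (see below for the full definition) covers the columns fk (the join predicate), added (used in the where clause) and id (useless) in ascending order, and includes value.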

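When I do this, however, the query goes from 2 1/2 minutes to over 9 minutes. I had hoped the hints would force a more efficient join that makes only a single pass over each table, but clearly they do not.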
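To put numbers on the two variants beyond wall-clock time, one option is to run both in the same session with I/O and timing statistics switched on. This is only a sketch: the dates assigned to @start and @end are placeholder values, and it assumes #smalltable has already been created and populated in the current session.

```sql
-- Placeholder reporting window (assumed values, not from the original post).
declare @start datetime, @end datetime;
set @start = '20110101';
set @end   = '20110601';

-- Report logical/physical reads and CPU/elapsed time per statement.
set statistics io on;
set statistics time on;

-- Variant 1: the optimiser chooses the join strategy.
select
    b.stuff,
    a.added,
    a.value
from
    dbo.hugetable a
    inner join
    #smalltable b on a.fk = b.pk
where
    a.added between @start and @end;

-- Variant 2: the hinted merge join from the question.
select
    b.stuff,
    a.added,
    a.value
from
    dbo.hugetable a with(index = ix_hugetable)
    inner merge join
    #smalltable b with(index(1)) on a.fk = b.pk
where
    a.added between @start and @end;

set statistics io off;
set statistics time off;
```

The Messages output then lists reads per table for each statement, which makes it easier to see whether the extra minutes come from hugetable, from #smalltable, or from a sort or spill in between.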

Any guidance is welcome. I can provide additional information if required.

Update (2011/06/02)

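Having reorganised the indexing on the table, I have made significant performance gains, but I have hit a new obstacle when summarising the data in the huge table. The result is a summary by month, which currently looks like this: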

select
    b.stuff,
    datediff(month, 0, a.added),
    count(a.value),
    sum(case when a.value > 0 then 1 else 0 end) -- this triples the running time!
from
    dbo.hugetable a
    inner join
    #smalltable b on a.fk = b.pk
group by
    b.stuff,
    datediff(month, 0, a.added);
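For readers unfamiliar with the grouping expression: datediff(month, 0, a.added) counts month boundaries since the datetime value 0 (1900-01-01), so every row in the same calendar month gets the same integer. The small standalone sketch below uses a made-up sample date to show the month number and how it can be turned back into the first day of the month. The inline declare initialisation assumes SQL Server 2008 or later.

```sql
-- Sample value chosen for illustration only.
declare @d datetime = '2011-06-02T13:45:00';

select
    datediff(month, 0, @d)                    as month_number, -- 1337 (111 years * 12 + 5)
    dateadd(month, datediff(month, 0, @d), 0) as month_start;  -- 2011-06-01 00:00:00.000
```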

Currently, hugetable has a clustered index pk_hugetable (added, fk) (the primary key), and a non-clustered index going the other way, ix_hugetable (fk, added).
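As a rough sketch, the reorganised layout described here could be reached from the original definitions (listed further down) with DDL along the following lines. Several things are assumptions rather than facts from the post: that the old constraint and index were dropped first, that (added, fk) is unique enough to serve as a primary key, and that the new non-clustered index has no include(value), since the update does not mention one.

```sql
-- Drop the non-clustered index first so it is not rebuilt when the
-- clustered index underneath it goes away.
drop index ix_hugetable on dbo.hugetable;
alter table dbo.hugetable drop constraint pk_hugetable;

-- New clustered primary key, leading on the date column (assumed unique with fk).
alter table dbo.hugetable
    add constraint pk_hugetable primary key clustered (added asc, fk asc);

-- New non-clustered index going the other way; no included columns assumed.
create nonclustered index ix_hugetable
    on dbo.hugetable (fk asc, added asc);
```

On a table of this size each of these statements is a long-running, log-heavy operation, so this is illustrative of the resulting structure rather than a recommended migration script.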

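Without the fourth column in the select list above (the sum(case ...) expression), the optimiser uses a nested loop join as before, with #smalltable as the outer input and a non-clustered index seek as the inner loop (again executing 480 times). What concerns me is the disparity between the estimated rows (12,958.4) and the actual rows (74,668,468). The relative cost of these seeks is 45%. The running time, however, is under a minute.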
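One way to check whether stale or thinly sampled statistics contribute to an estimate that far from the actual row count is to look at the statistics header for the index and, if it looks out of date, refresh it. This is a diagnostic sketch only; it may well be that the misestimate has another cause, and a full-scan update on a table this size can take a long time.

```sql
-- Shows when the statistics were last updated, the row count at that
-- time, and how many rows were sampled.
dbcc show_statistics ('dbo.hugetable', ix_hugetable) with stat_header;

-- Optional and potentially slow: rebuild the statistics from every row.
update statistics dbo.hugetable ix_hugetable with fullscan;
```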

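With the fourth column, the running time jumps to 4 minutes. This time the plan seeks on the clustered index (2 executions) for the same relative cost (45%), aggregates via a hash match (30%), then does a hash join against #smalltable (0%).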

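I am unsure of my next course of action. My concern is that neither the date range search nor the join predicate is guaranteed, or even all that likely, to reduce the result set drastically. In most cases the date range will only trim 10-15% of the records, and the inner join on fk may filter out 20-30%.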

As requested by Will A, here are the results of sp_spaceused:

name      | rows      | reserved    | data        | index_size  | unused
hugetable | 261774373 | 93552920 KB | 18373816 KB | 75167432 KB | 11672 KB
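For anyone wanting to reproduce figures in this shape, a minimal call looks like the following, assuming it is run in the database that holds the table:

```sql
-- Reports row count plus reserved, data, index and unused space for one table.
exec sp_spaceused N'dbo.hugetable';
```

In the output above, index_size (75,167,432 KB) accounts for roughly 80% of the reserved space (93,552,920 KB), about four times the size of the data itself.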

#smalltable is defined as:

create table #smalltable (
    pk uniqueidentifier primary key clustered,
    stuff varchar(6) null
);

And dbo.hugetable is defined as:

create table dbo.hugetable (
    id uniqueidentifier not null,
    fk uniqueidentifier not null,
    added datetime not null,
    value decimal(13, 3) not null,

    constraint pk_hugetable primary key clustered (
        fk asc,
        added asc,
        id asc
    )
    with (
        pad_index = off, statistics_norecompute = off,
        ignore_dup_key = off, allow_row_locks = on,
        allow_page_locks = on
    )
    on [primary]
)
on [primary];

With the following index defined:

create nonclustered index ix_hugetable on dbo.hugetable (
    fk asc, added asc, id asc
) include(value) with (
    pad_index = off, statistics_norecompute = off,
    sort_in_tempdb = off, ignore_dup_key = off,
    drop_existing = off, online = off,
    allow_row_locks = on, allow_page_locks = on
)
on [primary];

The id field is redundant, an artifact of a previous DBA who insisted that every table everywhere should have a GUID, no exceptions.
