Using PostgreSQL 8.4, I have a number of tables that have a very similar structure but belong to different categories:
CREATE TABLE table_a (
    id SERIAL PRIMARY KEY,
    event_time TIMESTAMP WITH TIME ZONE NOT NULL,
    value_a REAL
);
CREATE TABLE table_b (
    id SERIAL PRIMARY KEY,
    event_time TIMESTAMP WITH TIME ZONE NOT NULL,
    value_b REAL
);
CREATE TABLE table_c (
    id SERIAL PRIMARY KEY,
    event_time TIMESTAMP WITH TIME ZONE NOT NULL,
    value_c REAL
);
I need to link these values to a central table (using joins or sub-selects, depending on the query):
CREATE TABLE periods_table (
    id SERIAL PRIMARY KEY,
    start_time TIMESTAMP WITH TIME ZONE NOT NULL,
    end_time TIMESTAMP WITH TIME ZONE NOT NULL,
    category TEXT NOT NULL
);
Here, category is one of 'Category A', 'Category B' or 'Category C'.
To abstract the similarities between the A, B and C tables, I've created a view:
CREATE VIEW table_values AS
    SELECT 'Category A' AS category, event_time, value_a AS value
    FROM table_a
    UNION
    SELECT 'Category B' AS category, event_time, value_b AS value
    FROM table_b
    UNION
    SELECT 'Category C' AS category, event_time, value_c AS value
    FROM table_c;
A typical query would look something like this:
SELECT p.start_time, p.end_time, p.category,
       (SELECT SUM(v.value) FROM table_values v
        WHERE v.category = p.category
          AND v.event_time >= p.start_time AND v.event_time < p.end_time)
FROM periods_table p;
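The problem is that the category column, which could be used to discriminate between the different tables in the view, is only used at the very end.
Even an EXPLAIN ANALYZE on SELECT * FROM table_values WHERE category='Category A' shows that all 3 tables are sub-queried, although rows matching this condition will only ever come from table_a: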
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Subquery Scan table_values (cost=176.02..295.50 rows=27 width=44) (actual time=0.135..0.135 rows=0 loops=1)
Filter: (table_values.category = 'Category A'::text)
-> HashAggregate (cost=176.02..229.12 rows=5310 width=12) (actual time=0.119..0.119 rows=0 loops=1)
-> Append (cost=0.00..136.20 rows=5310 width=12) (actual time=0.089..0.089 rows=0 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..45.40 rows=1770 width=12) (actual time=0.025..0.025 rows=0 loops=1)
-> Seq Scan on table_a (cost=0.00..27.70 rows=1770 width=12) (actual time=0.010..0.010 rows=0 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..45.40 rows=1770 width=12) (actual time=0.020..0.020 rows=0 loops=1)
-> Seq Scan on table_b (cost=0.00..27.70 rows=1770 width=12) (actual time=0.006..0.006 rows=0 loops=1)
-> Subquery Scan "*SELECT* 3" (cost=0.00..45.40 rows=1770 width=12) (actual time=0.020..0.020 rows=0 loops=1)
-> Seq Scan on table_c (cost=0.00..27.70 rows=1770 width=12) (actual time=0.006..0.006 rows=0 loops=1)
Total runtime: 0.437 ms
(11 rows)
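(There is no data in this dummy table, but the real tables have about 300000 rows. SELECT * FROM table_values WHERE category='Category A' takes about 15 times longer than SELECT * FROM table_a, although the two return more or less the same thing.)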
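There is an index on event_time on each table, but since there is no index on category in the view, this doesn't help. I've also tried replacing the view with a CTE (since it sometimes leads to a different query path), but it didn't help.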
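Considering that I can't really change the existing tables, is there a way to "merge" several tables like this that would lead to faster queries?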
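EDIT: A similar query on the real data. (There are in fact 5 similar tables here.)
Interestingly, although I'm querying for 'Category E' here, the plan shows a sort key with 'Category A' that doesn't come from anywhere specific in the query (I guess it must come from the first select in the view, or perhaps the value from the first select is only used to indicate the column name).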
EXPLAIN ANALYZE SELECT * FROM table_values WHERE category='Category E':
Subquery Scan table_values (cost=1573543.53..1755714.30 rows=40482 width=44) (actual time=221030.235..221234.162 rows=317676 loops=1)
Filter: (table_values.category = 'Category E'::text)
-> Unique (cost=1573543.53..1654508.32 rows=8096479 width=12) (actual time=212999.276..220240.297 rows=8097555 loops=1)
-> Sort (cost=1573543.53..1593784.72 rows=8096479 width=12) (actual time=212999.275..218561.085 rows=8097555 loops=1)
Sort Key: ('Category A'::text), "*SELECT* 1".event_time, "*SELECT* 1".value
Sort Method: external merge Disk: 300792kB
-> Append (cost=0.00..229411.58 rows=8096479 width=12) (actual time=0.014..4683.734 rows=8097555 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..80689.62 rows=2847831 width=12) (actual time=0.014..954.326 rows=2847951 loops=1)
-> Seq Scan on table_a (cost=0.00..52211.31 rows=2847831 width=12) (actual time=0.010..607.528 rows=2847951 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..29304.52 rows=1033976 width=12) (actual time=9.738..576.803 rows=1034928 loops=1)
-> Seq Scan on table_b (cost=0.00..18964.76 rows=1033976 width=12) (actual time=9.737..450.619 rows=1034928 loops=1)
-> Subquery Scan "*SELECT* 3" (cost=0.00..30463.22 rows=1075161 width=12) (actual time=15.100..720.983 rows=1075157 loops=1)
-> Seq Scan on table_c (cost=0.00..19711.61 rows=1075161 width=12) (actual time=15.099..592.070 rows=1075157 loops=1)
-> Subquery Scan "*SELECT* 4" (cost=0.00..79952.70 rows=2821835 width=12) (actual time=20.098..1794.739 rows=2821843 loops=1)
-> Seq Scan on table_d (cost=0.00..51734.35 rows=2821835 width=12) (actual time=20.097..1441.719 rows=2821843 loops=1)
-> Subquery Scan "*SELECT* 5" (cost=0.00..9001.52 rows=317676 width=12) (actual time=0.016..108.768 rows=317676 loops=1)
-> Seq Scan on table_e (cost=0.00..5824.76 rows=317676 width=12) (actual time=0.016..69.732 rows=317676 loops=1)
Total runtime: 221299.573 ms
EXPLAIN ANALYZE SELECT * FROM table_e:
Seq Scan on table_e (cost=0.00..5824.76 rows=317676 width=12) (actual time=0.025..54.143 rows=317676 loops=1)
Total runtime: 67.624 ms
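There are a few problems with your query. Your "wrapper" view, although it looks like an elegant solution at first, clearly kills performance by dragging in about 7.7M rows that are completely unnecessary. This is because UNION has to de-duplicate the combined result, which here means sorting all of that data, and since the data does not fit in memory (you can see this from Sort Method: external merge Disk: 300792kB), the sort spills to disk, which is a very slow process. As a first attempt, recreate the "wrapper" view using UNION ALL instead of plain UNION. The difference is that UNION removes duplicate rows while UNION ALL keeps every row; to make the rows distinct, Postgres first has to sort them, so with UNION ALL you avoid the sort. If the result is still not good enough, try joining each of the five tables to the "main" query separately and then combine the results with UNION ALL.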
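Both suggestions could look roughly like the sketch below, which uses the table and column names from the question and writes out only the three tables of the simplified example. The total alias and the explicit ::text casts are illustrative assumptions rather than anything from the original schema. Bear in mind that UNION ALL keeps rows that UNION would have collapsed, so the sums can differ if a table contains identical (event_time, value) pairs.

```sql
-- 1. Same view as in the question, but with UNION ALL: no de-duplication,
--    so no sort over the combined rows.
--    (If CREATE OR REPLACE complains about changed column types,
--    drop the view and create it again instead.)
CREATE OR REPLACE VIEW table_values AS
    SELECT 'Category A'::text AS category, event_time, value_a AS value
    FROM table_a
    UNION ALL
    SELECT 'Category B'::text AS category, event_time, value_b AS value
    FROM table_b
    UNION ALL
    SELECT 'Category C'::text AS category, event_time, value_c AS value
    FROM table_c;

-- 2. Fallback: bypass the view, join each table only to the periods of its
--    own category, and concatenate the per-table results.
--    LEFT JOIN keeps periods without events (their sum is NULL, as in the
--    original scalar subquery). All non-aggregated columns are listed in
--    GROUP BY, which 8.4 requires.
SELECT p.start_time, p.end_time, p.category, SUM(a.value_a) AS total
FROM periods_table p
LEFT JOIN table_a a
       ON a.event_time >= p.start_time AND a.event_time < p.end_time
WHERE p.category = 'Category A'
GROUP BY p.id, p.start_time, p.end_time, p.category
UNION ALL
SELECT p.start_time, p.end_time, p.category, SUM(b.value_b) AS total
FROM periods_table p
LEFT JOIN table_b b
       ON b.event_time >= p.start_time AND b.event_time < p.end_time
WHERE p.category = 'Category B'
GROUP BY p.id, p.start_time, p.end_time, p.category
UNION ALL
SELECT p.start_time, p.end_time, p.category, SUM(c.value_c) AS total
FROM periods_table p
LEFT JOIN table_c c
       ON c.event_time >= p.start_time AND c.event_time < p.end_time
WHERE p.category = 'Category C'
GROUP BY p.id, p.start_time, p.end_time, p.category;
```

Re-running EXPLAIN ANALYZE on the filtered view after the first change should show whether the Sort and Unique steps have disappeared, and whether the planner now pushes the category filter down into the individual branches; whether it does so on 8.4 is something to verify rather than assume.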