I'm trying to understand EXPLAIN on Redshift. My situation is that I have data like this:
id | user_id | created_at
---|---------|-----------------------
1  | 1       | 2017-02-08 14:32:10.96
2  | 1       | 2017-02-07 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96
4  | 2       | 2017-02-05 14:32:10.96
I want:
id | user_id | created_at
---|---------|-----------------------
1  | 1       | 2017-02-08 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96
I have these two queries:
SELECT user_id,
       created_at
FROM
  (SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id
                             ORDER BY created_at) AS rownum
   FROM my_table) x
WHERE rownum = 1;
With the EXPLAIN:
XN Subquery Scan x (cost=1000001263779.68..1000001513986.60 rows=50042 width=16)
  Filter: (rownum = 1)
  -> XN Window (cost=1000001263779.68..1000001388883.14 rows=10008277 width=16)
       Partition: user_id
       Order: created_at
       -> XN Sort (cost=1000001263779.68..1000001288800.37 rows=10008277 width=16)
            Sort Key: user_id, created_at
            -> XN Seq Scan on my_table (cost=0.00..100082.77 rows=10008277 width=16)
Then the other query:
SELECT ac1.user_id,
       ac1.created_at
FROM my_table ac1
JOIN
  (SELECT user_id,
          MAX(created_at) AS MAXDATE
   FROM my_table
   GROUP BY user_id) ac2
  ON ac1.user_id = ac2.user_id
 AND ac1.created_at = ac2.MAXDATE;
And the EXPLAIN:
XN Hash Join DS_DIST_NONE (cost=150798.74..771939079.62 rows=7257 width=16)
  Hash Cond: (("outer".created_at = "inner".maxdate) AND ("outer".user_id = "inner".user_id))
  -> XN Seq Scan on my_table ac1 (cost=0.00..100082.77 rows=10008277 width=16)
  -> XN Hash (cost=150606.01..150606.01 rows=38548 width=16)
       -> XN Subquery Scan ac2 (cost=150124.15..150606.01 rows=38548 width=16)
            -> XN HashAggregate (cost=150124.15..150220.52 rows=38548 width=16)
                 -> XN Seq Scan on my_table (cost=0.00..100082.77 rows=10008277 width=16)
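The first query is a bit slower, but when I try to understand the EXPLAIN output I get lost. The cost seems to be higher for the query that uses ROW_NUMBER(), but so is rows.
But what can I extract from these EXPLAIN outputs (sadly, I can't use ANALYZE on Redshift)?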
Fabio gave a good answer.
However, for Redshift it is worth adding that the way the data is physically laid out has a huge impact on the cost of an EXPLAIN plan.
Create some dummy data (inspired by https://stackoverflow.com/questions/38667215/redshift-how-can-i-generate-a-series-of-numbers-without-creating-a-table-called):
drop table if exists #my_table;
create table #my_table as
select (row_number() over (order by 1)) - 1 as user_id
      ,(current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24;
Reproduce an EXPLAIN plan similar to the one in the question:
explain
SELECT user_id, created_at
FROM
  (SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id ORDER BY created_at) AS rownum
   FROM #my_table) x
WHERE rownum = 1;
EXPLAIN:
XN Subquery Scan x (cost=1000000000000.79..1000000000001.39 rows=1 width=16)
  Filter: (rownum = 1)
  -> XN Window (cost=1000000000000.79..1000000000001.09 rows=24 width=16)
       Partition: user_id
       Order: created_at
       -> XN Sort (cost=1000000000000.79..1000000000000.85 rows=24 width=16)
            Sort Key: user_id, created_at
            -> XN Network (cost=0.00..0.24 rows=24 width=16)
                 Distribute
                 -> XN Seq Scan on "#my_table" (cost=0.00..0.24 rows=24 width=16)
Next:
explain
SELECT ac1.user_id, ac1.created_at
FROM #my_table ac1
JOIN
  (SELECT user_id,
          MAX(created_at) AS MAXDATE
   FROM #my_table
   GROUP BY user_id) ac2
  ON ac1.user_id = ac2.user_id
 AND ac1.created_at = ac2.MAXDATE;
EXPLAIN:
XN Hash Join DS_DIST_INNER (cost=0.72..822858.77 rows=13 width=16)
  Inner Dist Key: ac1.user_id
  Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
  -> XN Subquery Scan ac2 (cost=0.36..0.66 rows=24 width=16)
       -> XN HashAggregate (cost=0.36..0.42 rows=24 width=16)
            -> XN Seq Scan on "#my_table" (cost=0.00..0.24 rows=24 width=16)
  -> XN Hash (cost=0.24..0.24 rows=24 width=16)
       -> XN Seq Scan on "#my_table" ac1 (cost=0.00..0.24 rows=24 width=16)
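This does come out with a much lower estimated cost, but the inner table still has to be redistributed between nodes (DS_DIST_INNER).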
Now try:
drop table if exists #my_table_dist;
create table #my_table_dist
distkey(user_id)
sortkey(user_id, created_at) as
select (row_number() over (order by 1)) - 1 as user_id
      ,(current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24;
Now run the EXPLAIN:
explain
SELECT user_id, created_at
FROM
  (SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id ORDER BY created_at) AS rownum
   FROM #my_table_dist) x
WHERE rownum = 1;
EXPLAIN:
XN Subquery Scan x (cost=0.00..0.78 rows=1 width=16)
  Filter: (rownum = 1)
  -> XN Window (cost=0.00..0.48 rows=24 width=16)
       Partition: user_id
       Order: created_at
       -> XN Seq Scan on "#my_table_dist" (cost=0.00..0.24 rows=24 width=16)
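The data is already sorted and distributed, so the plan no longer contains a Sort or Network step and Redshift only has to read off the answer.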
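As a rough analogy only (this is not how Redshift executes the plan), the Python sketch below shows why pre-sorted input removes the need for a sort: when rows are already stored in (user_id, created_at) order, the row that would get rownum = 1 is simply the first row of each user_id group, and one pass finds it. The sample rows and the name first_row_per_user are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows (user_id, created_at), already stored in
# (user_id, created_at) order, as a sort key on those columns would keep them.
presorted_rows = [
    (1, "2017-02-07"),
    (1, "2017-02-08"),
    (2, "2017-02-05"),
    (2, "2017-02-06"),
]


def first_row_per_user(rows):
    """Return the first row of each user_id group in a single pass.

    Only valid when rows are already ordered by (user_id, created_at);
    groupby merges consecutive rows with the same key and never sorts.
    """
    return [next(group) for _, group in groupby(rows, key=itemgetter(0))]


result = first_row_per_user(presorted_rows)
print(result)
```

With unsorted input the same function would return wrong groups, which is why the plan on #my_table needs the extra Sort step first.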
Meanwhile:
explain
SELECT ac1.user_id, ac1.created_at
FROM #my_table ac1
JOIN
  (SELECT user_id,
          MAX(created_at) AS MAXDATE
   FROM #my_table_dist
   GROUP BY user_id) ac2
  ON ac1.user_id = ac2.user_id
 AND ac1.created_at = ac2.MAXDATE;
EXPLAIN:
XN Hash Join DS_DIST_INNER (cost=0.36..822858.77 rows=13 width=16)
  Inner Dist Key: ac1.user_id
  Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
  -> XN Subquery Scan ac2 (cost=0.00..0.66 rows=24 width=16)
       -> XN GroupAggregate (cost=0.00..0.42 rows=24 width=16)
            -> XN Seq Scan on "#my_table_dist" (cost=0.00..0.24 rows=24 width=16)
  -> XN Hash (cost=0.24..0.24 rows=24 width=16)
       -> XN Seq Scan on "#my_table" ac1 (cost=0.00..0.24 rows=24 width=16)
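Note that the total cost does not change: ac1 still reads from #my_table, so data still has to be redistributed between nodes (DS_DIST_INNER).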
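The costly step in the first query plan, and the one that explains the difference, is the sort over a large number of rows. You are sorting the entire dataset (an O(n log n) operation, where n is your partition size) just so that you can pick the first entry. The other rows (#2 - #10,000,000) still have to be sorted, even though you never look at them. max, on the other hand, is an O(n) operation, because you only need to keep track of one value as you pass over the data.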
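To make the difference concrete, here is a minimal Python sketch of the two strategies, using the four sample rows from the question. It illustrates the complexity argument only and says nothing about Redshift's internals; the function names latest_by_sort and latest_by_max are made up for this example.

```python
from datetime import datetime

# Sample rows from the question: (id, user_id, created_at)
rows = [
    (1, 1, datetime(2017, 2, 8, 14, 32, 10, 960000)),
    (2, 1, datetime(2017, 2, 7, 14, 32, 10, 960000)),
    (3, 2, datetime(2017, 2, 6, 14, 32, 10, 960000)),
    (4, 2, datetime(2017, 2, 5, 14, 32, 10, 960000)),
]


def latest_by_sort(rows):
    """Sort every row, then keep the first one seen per user: O(n log n)."""
    latest = {}
    # Newest rows come first within each user because of reverse=True.
    for row in sorted(rows, key=lambda r: (r[1], r[2]), reverse=True):
        latest.setdefault(row[1], row)  # later (older) rows are ignored
    return latest


def latest_by_max(rows):
    """Track one running maximum per user in a single pass: O(n)."""
    latest = {}
    for row in rows:
        best = latest.get(row[1])
        if best is None or row[2] > best[2]:
            latest[row[1]] = row
    return latest


by_sort = latest_by_sort(rows)
by_max = latest_by_max(rows)
print(sorted(row[0] for row in by_max.values()))
```

Both functions return the same rows (ids 1 and 3 here), but the first one orders all of the input before discarding most of it, while the second never compares more than one stored row per user.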