Evaluating different EXPLAINs on Redshift

Question

Mio

Asked: 2017-02-09 07:01:45 +0800 CST
I am trying to understand EXPLAIN on Redshift. My situation is that I have data like this:

id | user_id | created_at
---|---------------------------------
1  | 1       | 2017-02-08 14:32:10.96
2  | 1       | 2017-02-07 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96
4  | 2       | 2017-02-05 14:32:10.96

I want:

id | user_id | created_at
---|---------------------------------
1  | 1       | 2017-02-08 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96

I have two queries. The first one:

SELECT id,
       user_id,
       created_at
FROM
  ( SELECT user_id,
           created_at,
           row_number() OVER (PARTITION BY user_id
                              ORDER BY created_at) AS rownum
   FROM my_table) x
WHERE rownum = 1;

Its EXPLAIN output:

XN Subquery Scan x  (cost=1000001263779.68..1000001513986.60 rows=50042 width=16)
  Filter: (rownum = 1)
  ->  XN Window  (cost=1000001263779.68..1000001388883.14 rows=10008277 width=16)
        Partition: user_id
        Order: created_at
        ->  XN Sort  (cost=1000001263779.68..1000001288800.37 rows=10008277 width=16)
              Sort Key: user_id, created_at
              ->  XN Seq Scan on my_table  (cost=0.00..100082.77 rows=10008277 width=16)

And then the other query:

SELECT ac1.user_id, ac1.created_at FROM my_table ac1
JOIN 
(
   SELECT user_id, MAX(created_at) AS MAXDATE
   FROM my_table
   GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE;

And its EXPLAIN output:

XN Hash Join DS_DIST_NONE  (cost=150798.74..771939079.62 rows=7257 width=16)
  Hash Cond: (("outer".created_at = "inner".maxdate) AND ("outer".user_id = "inner".user_id))
  ->  XN Seq Scan on my_table ac1  (cost=0.00..100082.77 rows=10008277 width=16)
  ->  XN Hash  (cost=150606.01..150606.01 rows=38548 width=16)
        ->  XN Subquery Scan ac2  (cost=150124.15..150606.01 rows=38548 width=16)
              ->  XN HashAggregate  (cost=150124.15..150220.52 rows=38548 width=16)
                    ->  XN Seq Scan on my_table  (cost=0.00..100082.77 rows=10008277 width=16)

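The first query is somewhat slower, but I get lost when I try to make sense of the EXPLAIN output. The cost appears to be higher for the query that uses ROW_NUMBER(), and the same goes for rows.

What can I actually extract from these EXPLAINs (sadly, I cannot use ANALYZE on Redshift)?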

2 Answers

hibernado · Answer 1 · 2018-11-08T14:04:41+08:00

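Fabio gave a great answer.

For Redshift, however, it is worth adding that the physical layout of the data has a huge impact on the cost shown in the EXPLAIN plan.

Create some dummy data (inspired by https://stackoverflow.com/questions/38667215/redshift-how-can-i-generate-a-series-of-numbers-without-creating-a-table-called):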

drop table if exists #my_table;
create table #my_table as
select (row_number() over (order by 1)) - 1 as user_id
      ,(current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24;

Reproduce an EXPLAIN plan similar to the one in the question:

explain
SELECT user_id,
       created_at
FROM
  ( SELECT user_id,
           created_at,
           row_number() OVER (PARTITION BY user_id
                              ORDER BY created_at) AS rownum
   FROM #my_table) x
WHERE rownum = 1;

EXPLAIN:

XN Subquery Scan x  (cost=1000000000000.79..1000000000001.39 rows=1 width=16)
  Filter: (rownum = 1)
  ->  XN Window  (cost=1000000000000.79..1000000000001.09 rows=24 width=16)
        Partition: user_id
        Order: created_at
        ->  XN Sort  (cost=1000000000000.79..1000000000000.85 rows=24 width=16)
              Sort Key: user_id, created_at
              ->  XN Network  (cost=0.00..0.24 rows=24 width=16)
                    Distribute
                    ->  XN Seq Scan on "#my_table"  (cost=0.00..0.24 rows=24 width=16)

Next:

explain
SELECT ac1.user_id, ac1.created_at FROM #my_table ac1
JOIN
(
   SELECT user_id, MAX(created_at) AS MAXDATE
   FROM #my_table
   GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE;

EXPLAIN:

XN Hash Join DS_DIST_INNER  (cost=0.72..822858.77 rows=13 width=16)
  Inner Dist Key: ac1.user_id
  Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
  ->  XN Subquery Scan ac2  (cost=0.36..0.66 rows=24 width=16)
        ->  XN HashAggregate  (cost=0.36..0.42 rows=24 width=16)
              ->  XN Seq Scan on "#my_table"  (cost=0.00..0.24 rows=24 width=16)
  ->  XN Hash  (cost=0.24..0.24 rows=24 width=16)
        ->  XN Seq Scan on "#my_table" ac1  (cost=0.00..0.24 rows=24 width=16)

This plan is indeed much cheaper, but the data has to be redistributed between nodes (DS_DIST_INNER).

Now try:

drop table if exists #my_table_dist;
create table #my_table_dist
distkey(user_id)
sortkey(user_id, created_at) as
select (row_number() over (order by 1)) - 1 as user_id
      ,(current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24;

Now run the EXPLAIN:

explain
SELECT user_id,
       created_at
FROM
  ( SELECT user_id,
           created_at,
           row_number() OVER (PARTITION BY user_id
                              ORDER BY created_at) AS rownum
   FROM #my_table_dist) x
WHERE rownum = 1;

EXPLAIN:

XN Subquery Scan x  (cost=0.00..0.78 rows=1 width=16)
  Filter: (rownum = 1)
  ->  XN Window  (cost=0.00..0.48 rows=24 width=16)
        Partition: user_id
        Order: created_at
        ->  XN Seq Scan on "#my_table_dist"  (cost=0.00..0.24 rows=24 width=16)

The data is already sorted and distributed, so Redshift only has to read off the answer.
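Why pre-sorted data removes the sort step can be sketched outside the database. The Python below is only an illustration, not what Redshift runs internally: `SORTED_ROWS` and `first_row_per_user` are hypothetical names, and the rows are assumed to arrive already ordered by (user_id, created_at), the way a table with sortkey(user_id, created_at) would store them.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, already stored in (user_id, created_at) order.
SORTED_ROWS = [
    (1, "2017-02-07 14:32:10.96"),
    (1, "2017-02-08 14:32:10.96"),
    (2, "2017-02-05 14:32:10.96"),
    (2, "2017-02-06 14:32:10.96"),
]


def first_row_per_user(sorted_rows):
    """Return the first row of each user_id group in one pass, without sorting."""
    # groupby only merges adjacent rows with equal keys, so this relies on
    # the input already being ordered by user_id.
    return [next(group) for _key, group in groupby(sorted_rows, key=itemgetter(0))]


print(first_row_per_user(SORTED_ROWS))
```

Because the ordering is ascending, this picks the earliest created_at per user, which is what rownum = 1 with ORDER BY created_at selects in the query as written.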

Meanwhile:

explain
SELECT ac1.user_id, ac1.created_at FROM #my_table ac1
JOIN
(
   SELECT user_id, MAX(created_at) AS MAXDATE
   FROM #my_table_dist
   GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE;

EXPLAIN:

XN Hash Join DS_DIST_INNER  (cost=0.36..822858.77 rows=13 width=16)
  Inner Dist Key: ac1.user_id
  Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
  ->  XN Subquery Scan ac2  (cost=0.00..0.66 rows=24 width=16)
        ->  XN GroupAggregate  (cost=0.00..0.42 rows=24 width=16)
              ->  XN Seq Scan on "#my_table_dist"  (cost=0.00..0.24 rows=24 width=16)
  ->  XN Hash  (cost=0.24..0.24 rows=24 width=16)
        ->  XN Seq Scan on "#my_table" ac1  (cost=0.00..0.24 rows=24 width=16)

Note that the total cost does not change here, because the data still has to be redistributed between nodes (DS_DIST_INNER).
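To compare plans like these side by side, the numbers on each node line can be pulled out with a small script. This is a minimal sketch in Python, assuming the plan text has the `(cost=first..total rows=N width=W)` shape shown above; `NODE_RE` and `parse_node` are names made up for the example and are not part of any Redshift tooling.

```python
import re

# Matches the "cost=first..total rows=N width=W" part of a plan line.
NODE_RE = re.compile(
    r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?) rows=(\d+) width=(\d+)"
)


def parse_node(line):
    """Return (first_row_cost, total_cost, rows, width), or None if absent."""
    match = NODE_RE.search(line)
    if match is None:
        return None
    first, total, rows, width = match.groups()
    return float(first), float(total), int(rows), int(width)


# Top-level lines copied from the two join plans above.
join_on_my_table = parse_node(
    "XN Hash Join DS_DIST_INNER (cost=0.72..822858.77 rows=13 width=16)"
)
join_on_my_table_dist = parse_node(
    "XN Hash Join DS_DIST_INNER (cost=0.36..822858.77 rows=13 width=16)"
)

# Only the first-row cost differs; the total cost of the join is identical.
print(join_on_my_table[1] == join_on_my_table_dist[1])
```

The cost values are relative planner estimates, so they are useful for comparing plans against each other rather than for predicting run time.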

Fabio Beltramini · Answer 2 · 2017-03-16T10:14:09+08:00

Best Answer

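The step in the first query plan that is expensive, and that explains the difference, is the sort over a large number of rows. You are sorting the entire data set (an O(n log n) operation, where n is your partition size) just so that you can pick the first entry. The other rows (#2 through #10,000,000) still have to be sorted, even though you never look at them. max, on the other hand, is an O(n) operation, because you only need to keep track of a single value as you pass over the data.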
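The two strategies can be sketched outside the database. The Python below is only an illustration of the complexity argument, not a description of Redshift's executor; the function names are made up for the example, and the rows mirror the sample data in the question.

```python
from collections import defaultdict

# Sample rows mirroring the question's data: (id, user_id, created_at).
# Fixed-width timestamp strings compare correctly as plain strings.
ROWS = [
    (1, 1, "2017-02-08 14:32:10.96"),
    (2, 1, "2017-02-07 14:32:10.96"),
    (3, 2, "2017-02-06 14:32:10.96"),
    (4, 2, "2017-02-05 14:32:10.96"),
]


def latest_by_sort(rows):
    """Sort every partition, then keep only its first row: O(n log n)."""
    partitions = defaultdict(list)
    for _id, user_id, created_at in rows:
        partitions[user_id].append(created_at)
    result = {}
    for user_id, values in partitions.items():
        values.sort(reverse=True)  # orders every row, including unused ones
        result[user_id] = values[0]
    return result


def latest_by_max(rows):
    """Track one running maximum per user in a single pass: O(n)."""
    result = {}
    for _id, user_id, created_at in rows:
        if user_id not in result or created_at > result[user_id]:
            result[user_id] = created_at
    return result


print(latest_by_sort(ROWS) == latest_by_max(ROWS))
```

Both functions return the same answer; the second simply never orders the rows it is going to discard.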
