AskOverflow.Dev

AskOverflow.Dev Logo AskOverflow.Dev Logo

AskOverflow.Dev Navigation

  • 主页
  • 系统&网络
  • Ubuntu
  • Unix
  • DBA
  • Computer
  • Coding
  • LangChain

Mobile menu

Close
  • 主页
  • 系统&网络
    • 最新
    • 热门
    • 标签
  • Ubuntu
    • 最新
    • 热门
    • 标签
  • Unix
    • 最新
    • 标签
  • DBA
    • 最新
    • 标签
  • Computer
    • 最新
    • 标签
  • Coding
    • 最新
    • 标签
主页 / dba / 问题 / 157540
Accepted
jpmc26
jpmc26
Asked: 2016-12-08 18:33:36 +0800 CST2016-12-08 18:33:36 +0800 CST 2016-12-08 18:33:36 +0800 CST

在 DISTINCT ON 之后应用过滤器

  • 772

最后有一个脚本将完全创建模式并用示例数据填充它。

架构

考虑这两个表:

多边形表:

CREATE TABLE my_polygon (
    my_polygon_id SERIAL PRIMARY KEY,
    common_id INTEGER NOT NULL,
    value1 NUMERIC NOT NULL,
    value2 NUMERIC NOT NULL,
    value3 NUMERIC NOT NULL,
    geom GEOMETRY(Polygon) NOT NULL
)
;

CREATE INDEX ON my_polygon (common_id);
CREATE INDEX ON my_polygon USING GIST (common_id, geom);

多边形内包含的点表:

CREATE TABLE my_point (
    my_point_id SERIAL PRIMARY KEY,
    common_id INTEGER NOT NULL,
    pointvalue NUMERIC NOT NULL,
    geom GEOMETRY(Point) NOT NULL
);

CREATE INDEX ON my_point (common_id);
CREATE INDEX ON my_point USING GIST (common_id, geom);

我使用几何的事实与这里的问题并不严格相关。但是,我认为这使我尝试做的事情的动机更加清晰。

问题查询

问题是多边形之间有非常微小的、微不足道的重叠。(尝试清理它们确实不是一种选择。重叠来自生成它们时的某种浮点错误,尽可能接近。)但是有些点可能落在这些微小的重叠内,当我JOIN他们时会产生两行基于遏制。但实际上,每个点应该只与一个多边形相关联。当一个确实属于其中两个时,它最终与哪一个关联并不重要,因此可以让查询只选择一个,如下所示:

SELECT DISTINCT ON (my_point.my_point_id)
    my_polygon.*,
    my_point.my_point_id,
    my_point.pointvalue,
    my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
WHERE my_polygon.common_id = 1
ORDER BY my_point.my_point_id, my_polygon.my_polygon_id

和上面的查询一样,我通常想SELECT基于common_id. 此查询执行良好。它的查询计划如下所示:

好的查询计划

但是,这是我在许多不同查询中需要的逻辑,所以我想把它放在一个视图中。结果是,就查询规划器而言,查询看起来像这样:

SELECT *
FROM (
    SELECT DISTINCT ON (my_point.my_point_id)
        my_polygon.*,
        my_point.my_point_id,
        my_point.pointvalue,
        my_point.geom AS pointgeom
    FROM my_polygon
    JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
    ORDER BY my_point.my_point_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1

结果是现在 PostgreSQL在执行common_id 之后进行过滤DISTINCT ON,这意味着它必须对JOIN两个表的整体进行过滤。这是它的查询计划:

错误的查询计划

如何允许 PostgreSQL 将过滤器下推到查询的早期部分,并且仍然将一般查询放在视图中?

我现在卡在 PG 9.3 上,但升级到 9.5 可能是一种选择。

架构和示例数据脚本

需要 PostGIS。(这就是为什么没有 SQL Fiddle 的原因。)

CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- DROP FUNCTION ST_GeneratePoints(geometry, numeric);
DO $doblock$
BEGIN
    IF NOT EXISTS(SELECT * FROM pg_proc WHERE UPPER(proname) = UPPER('ST_GeneratePoints')) THEN
        -- Create naive ST_GeneratePoints if version of PostGIS is not new enough
        CREATE FUNCTION ST_GeneratePoints(g geometry, npoints numeric)
            RETURNS geometry
            VOLATILE
            RETURNS NULL ON NULL INPUT
            LANGUAGE plpgsql
            AS $$
            DECLARE
                num_to_generate INTEGER := npoints::INTEGER;
                adjustment CONSTANT FLOAT := 0.00000000001;
                x_min FLOAT := ST_XMin(g) + adjustment;
                x_max FLOAT := ST_XMax(g) - adjustment;
                y_min FLOAT := ST_YMin(g) + adjustment;
                y_max FLOAT := ST_YMax(g) - adjustment;
                temp_result GEOMETRY[];
                result_array GEOMETRY[] := ARRAY[]::GEOMETRY[];
            BEGIN
                IF ST_IsEmpty(g) THEN
                    RAISE EXCEPTION 'Cannot generate points inside an empty geometry';
                END IF;

                IF ST_Dimension(g) < 2 THEN
                    RAISE EXCEPTION 'Only polygons supported';
                END IF;

                -- Reduce number of loops to reduce slow array_cat calls
                WHILE num_to_generate > 0 LOOP
                    SELECT ARRAY_AGG(contained.point) INTO temp_result
                    FROM (
                        SELECT point
                        FROM (
                            SELECT ST_MakePoint(
                                x_min + random() * (x_max - x_min),
                                y_min + random() * (y_max - y_min)
                            ) point
                            -- Generate extras to reduce number of loops
                            --
                            -- Each point has a probability of  ST_Area(g) / ST_Area(ST_Envelope(g))  to fall within the polygon.
                            -- So on average, we expect ST_Area(g) / ST_Area(ST_Envelope(g)) of the points generated to fall within.
                            -- Generating  ST_Area(ST_Envelope(g)) / ST_Area(g) * num_to_generate  points means that on average, we'll
                            -- get
                            --
                            --     ST_Area(g) / ST_Area(ST_Envelope(g)) * ST_Area(ST_Envelope(g)) / ST_Area(g) * num_to_generate
                            --      = num_to_generate
                            --
                            -- points within the polygon. (Notice the numerators and denominators cancel out.) This means we'll 
                            -- only run one loop about half the time without generating an excessive number of points.
                            --
                            -- Generate at least 20 to avoid a lot of loops for small numbers, though.
                            FROM generate_series(1, GREATEST(20, CEIL(ST_Area(ST_Envelope(g)) / ST_Area(g) * num_to_generate)::INTEGER))
                        ) candidate
                        WHERE ST_Contains(g, candidate.point)
                        -- Filter out extras if we have too many matches
                        LIMIT num_to_generate
                    ) contained
                    ;
                    IF ARRAY_LENGTH(temp_result, 1) > 0 THEN
                        result_array := array_cat(result_array, temp_result);
                        num_to_generate := npoints - COALESCE(ARRAY_LENGTH(result_array, 1), 0);
                    END IF;
                END LOOP;
                RETURN (SELECT ST_Union(point) FROM UNNEST(result_array) result (point));
            END;
            $$;
        RAISE NOTICE 'Created ST_GeneratePoints';
    ELSE
        RAISE NOTICE 'ST_GeneratePoints exists';
    END IF;
END
$doblock$
;

DROP TABLE IF EXISTS my_polygon;

CREATE TABLE my_polygon (
    my_polygon_id SERIAL PRIMARY KEY,
    common_id INTEGER NOT NULL,
    value1 NUMERIC NOT NULL,
    value2 NUMERIC NOT NULL,
    value3 NUMERIC NOT NULL,
    geom GEOMETRY(Polygon) NOT NULL
)
;

CREATE INDEX ON my_polygon (common_id);
CREATE INDEX ON my_polygon USING GIST (common_id, geom);


WITH common AS (
    SELECT
        common_id,
        random() * 5000 AS common_x_translate,
        random() * 5000 AS common_y_translate
    FROM (
        SELECT TRUNC(random() * 1000) + 1 AS common_id
        FROM generate_series(1, 100)
        UNION 
        SELECT 1
    ) a
),
geom_set_with_small_overlaps AS (
    SELECT
        ST_MakeEnvelope(
            x.translate, 
            y.translate, 
            x.translate + 1.1, 
            y.translate + 1.1
        ) AS geom
    FROM
        generate_series(0, 9) x (translate),
        generate_series(0, 9) y (translate)
)
INSERT INTO my_polygon (common_id, value1, value2, value3, geom)
SELECT
    common_id,
    random() * 100,
    random() * 100,
    random() * 100,
    ST_Translate(geom, common_x_translate, common_y_translate)
FROM common, geom_set_with_small_overlaps
;

DROP TABLE IF EXISTS my_point;

CREATE TABLE my_point (
    my_point_id SERIAL PRIMARY KEY,
    common_id INTEGER NOT NULL,
    pointvalue NUMERIC NOT NULL,
    geom GEOMETRY(Point) NOT NULL
);

INSERT INTO my_point (common_id, pointvalue, geom)
SELECT
    common_id,
    random() * 100,
    (ST_Dump(ST_GeneratePoints(extent, FLOOR(5000 + random() * 15000)::NUMERIC))).geom
FROM (
    SELECT
        common_id,
        -- Small negative buffer prevents lying on the outer edge
        ST_Buffer(ST_Extent(geom), - 0.0001) AS extent
    FROM my_polygon
    GROUP BY common_id
) common
UNION ALL
SELECT
    common_id,
    random() * 100,
    (ST_Dump(ST_GeneratePoints(intersection, TRUNC(random() * 5)::NUMERIC))).geom
FROM (
    SELECT
        p1.common_id,
        p1.my_polygon_id AS id1,
        p2.my_polygon_id AS id2,
        ST_Intersection(p1.geom, p2.geom) AS intersection
    FROM my_polygon p1
    JOIN my_polygon p2 ON (
        p1.my_polygon_id < p2.my_polygon_id AND
        p1.common_id = p2.common_id AND
        ST_Intersects(p1.geom, p2.geom)
    )
) a
;

CREATE INDEX ON my_point (common_id);
CREATE INDEX ON my_point USING GIST (common_id, geom);

你可能想VACUUM ANALYZE在那之后。

以文本形式查询计划

WHERE子句内(表现良好):

Unique  (cost=1195.74..1207.74 rows=2400 width=216)
  ->  Sort  (cost=1195.74..1201.74 rows=2400 width=216)
        Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
        ->  Nested Loop  (cost=5.34..1060.99 rows=2400 width=216)
              ->  Bitmap Heap Scan on my_polygon  (cost=4.93..191.74 rows=100 width=164)
                    Recheck Cond: (common_id = 1)
                    ->  Bitmap Index Scan on my_polygon_common_id_geom_idx  (cost=0.00..4.90 rows=100 width=0)
                          Index Cond: (common_id = 1)
              ->  Index Scan using my_point_common_id_geom_idx on my_point  (cost=0.41..8.68 rows=1 width=52)
                    Index Cond: ((common_id = 1) AND (my_polygon.geom && geom))
                    Filter: _st_contains(my_polygon.geom, geom)

WHERE子句外(表现不佳):

Subquery Scan on a  (cost=209447.85..215842.18 rows=1827 width=212)
  Filter: (a.common_id = 1)
  ->  Unique  (cost=209447.85..211274.80 rows=365390 width=212)
        ->  Sort  (cost=209447.85..210361.33 rows=365390 width=212)
              Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
              ->  Nested Loop  (cost=0.41..63285.00 rows=365390 width=212)
                    ->  Seq Scan on my_polygon  (cost=0.00..338.00 rows=9800 width=164)
                    ->  Index Scan using my_point_common_id_geom_idx on my_point  (cost=0.41..6.41 rows=1 width=52)
                          Index Cond: ((common_id = my_polygon.common_id) AND (my_polygon.geom && geom))
                          Filter: _st_contains(my_polygon.geom, geom)
postgresql performance
  • 1 1 个回答
  • 2584 Views

1 个回答

  • Voted
  1. Best Answer
    jpmc26
    2017-01-05T14:31:32+08:002017-01-05T14:31:32+08:00

    事实证明,有一种方法可以解决这个问题并让 PG 正确优化:你必须包含common_id在DISTINCT ON子句中。

    像这样:

    SELECT *
    FROM (
        SELECT DISTINCT ON (my_point.my_point_id, my_polygon.common_id)
            my_polygon.*,
            my_point.my_point_id,
            my_point.pointvalue,
            my_point.geom AS pointgeom
        FROM my_polygon
        JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
        ORDER BY my_point.my_point_id, my_polygon.common_id, my_polygon.my_polygon_id
    ) point_with_polygon
    WHERE common_id = 1
    

    WHERE这会产生与在 :之前包含子句相同的查询计划DISTINCT ON:

    Unique  (cost=2307.77..2345.55 rows=5038 width=212)
      ->  Sort  (cost=2307.77..2320.36 rows=5038 width=212)
            Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
            ->  Nested Loop  (cost=9.36..1479.97 rows=5038 width=212)
                  ->  Bitmap Heap Scan on my_polygon  (cost=4.93..190.19 rows=100 width=164)
                        Recheck Cond: (common_id = 1)
                        ->  Bitmap Index Scan on my_polygon_common_id_geom_idx  (cost=0.00..4.90 rows=100 width=0)
                              Index Cond: (common_id = 1)
                  ->  Bitmap Heap Scan on my_point  (cost=4.43..12.89 rows=1 width=52)
                        Recheck Cond: ((common_id = 1) AND (my_polygon.geom && geom))
                        Filter: _st_contains(my_polygon.geom, geom)
                        ->  Bitmap Index Scan on my_point_common_id_geom_idx  (cost=0.00..4.43 rows=2 width=0)
                              Index Cond: ((common_id = 1) AND (my_polygon.geom && geom))
    

    包含common_id有点多余,因为JOIN无论如何它都是条件的一部分,但这一事实也意味着这不会改变查询的结果。

    警告:确保一切都匹配

    在结果中使用that 's非常重要。以这个查询为例:common_idSELECT

    SELECT *
    FROM (
        SELECT DISTINCT ON (my_point.my_point_id, my_point.common_id)
            my_polygon.*,
            my_point.my_point_id,
            my_point.pointvalue,
            my_point.geom AS pointgeom
        FROM my_polygon
        JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
        ORDER BY my_point.my_point_id, my_point.common_id, my_polygon.my_polygon_id
    ) point_with_polygon
    WHERE common_id = 1
    

    此查询my_polygon.common_id在SELECT子句中使用,但my_point.common_id在ORDER BYandDISTINCT ON子句中使用。在这种情况下, PG不会将过滤器下推到子查询中。

    • 2

相关问题

  • PostgreSQL 中 UniProt 的生物序列

  • 如何确定是否需要或需要索引

  • 我在哪里可以找到mysql慢日志?

  • 如何优化大型数据库的 mysqldump?

  • PostgreSQL 9.0 Replication 和 Slony-I 有什么区别?

Sidebar

Stats

  • 问题 205573
  • 回答 270741
  • 最佳答案 135370
  • 用户 68524
  • 热门
  • 回答
  • Marko Smith

    连接到 PostgreSQL 服务器:致命:主机没有 pg_hba.conf 条目

    • 12 个回答
  • Marko Smith

    如何让sqlplus的输出出现在一行中?

    • 3 个回答
  • Marko Smith

    选择具有最大日期或最晚日期的日期

    • 3 个回答
  • Marko Smith

    如何列出 PostgreSQL 中的所有模式?

    • 4 个回答
  • Marko Smith

    列出指定表的所有列

    • 5 个回答
  • Marko Smith

    如何在不修改我自己的 tnsnames.ora 的情况下使用 sqlplus 连接到位于另一台主机上的 Oracle 数据库

    • 4 个回答
  • Marko Smith

    你如何mysqldump特定的表?

    • 4 个回答
  • Marko Smith

    使用 psql 列出数据库权限

    • 10 个回答
  • Marko Smith

    如何从 PostgreSQL 中的选择查询中将值插入表中?

    • 4 个回答
  • Marko Smith

    如何使用 psql 列出所有数据库和表?

    • 7 个回答
  • Martin Hope
    Jin 连接到 PostgreSQL 服务器:致命:主机没有 pg_hba.conf 条目 2014-12-02 02:54:58 +0800 CST
  • Martin Hope
    Stéphane 如何列出 PostgreSQL 中的所有模式? 2013-04-16 11:19:16 +0800 CST
  • Martin Hope
    Mike Walsh 为什么事务日志不断增长或空间不足? 2012-12-05 18:11:22 +0800 CST
  • Martin Hope
    Stephane Rolland 列出指定表的所有列 2012-08-14 04:44:44 +0800 CST
  • Martin Hope
    haxney MySQL 能否合理地对数十亿行执行查询? 2012-07-03 11:36:13 +0800 CST
  • Martin Hope
    qazwsx 如何监控大型 .sql 文件的导入进度? 2012-05-03 08:54:41 +0800 CST
  • Martin Hope
    markdorison 你如何mysqldump特定的表? 2011-12-17 12:39:37 +0800 CST
  • Martin Hope
    Jonas 如何使用 psql 对 SQL 查询进行计时? 2011-06-04 02:22:54 +0800 CST
  • Martin Hope
    Jonas 如何从 PostgreSQL 中的选择查询中将值插入表中? 2011-05-28 00:33:05 +0800 CST
  • Martin Hope
    Jonas 如何使用 psql 列出所有数据库和表? 2011-02-18 00:45:49 +0800 CST

热门标签

sql-server mysql postgresql sql-server-2014 sql-server-2016 oracle sql-server-2008 database-design query-performance sql-server-2017

Explore

  • 主页
  • 问题
    • 最新
    • 热门
  • 标签
  • 帮助

Footer

AskOverflow.Dev

关于我们

  • 关于我们
  • 联系我们

Legal Stuff

  • Privacy Policy

Language

  • Pt
  • Server
  • Unix

© 2023 AskOverflow.DEV All Rights Reserve