最后有一个脚本将完全创建模式并用示例数据填充它。
架构
考虑这两个表:
多边形表:
CREATE TABLE my_polygon (
my_polygon_id SERIAL PRIMARY KEY,
common_id INTEGER NOT NULL,
value1 NUMERIC NOT NULL,
value2 NUMERIC NOT NULL,
value3 NUMERIC NOT NULL,
geom GEOMETRY(Polygon) NOT NULL
)
;
CREATE INDEX ON my_polygon (common_id);
CREATE INDEX ON my_polygon USING GIST (common_id, geom);
多边形内包含的点表:
CREATE TABLE my_point (
my_point_id SERIAL PRIMARY KEY,
common_id INTEGER NOT NULL,
pointvalue NUMERIC NOT NULL,
geom GEOMETRY(Point) NOT NULL
);
CREATE INDEX ON my_point (common_id);
CREATE INDEX ON my_point USING GIST (common_id, geom);
我使用几何的事实与这里的问题并不严格相关。但是,我认为这使我尝试做的事情的动机更加清晰。
问题查询
问题是多边形之间有非常微小的、微不足道的重叠。(尝试清理它们确实不是一种选择。重叠来自生成它们时的某种浮点错误,尽可能接近。)但是有些点可能落在这些微小的重叠内,当我JOIN
他们时会产生两行基于遏制。但实际上,每个点应该只与一个多边形相关联。当一个确实属于其中两个时,它最终与哪一个关联并不重要,因此可以让查询只选择一个,如下所示:
SELECT DISTINCT ON (my_point.my_point_id)
my_polygon.*,
my_point.my_point_id,
my_point.pointvalue,
my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
WHERE my_polygon.common_id = 1
ORDER BY my_point.my_point_id, my_polygon.my_polygon_id
和上面的查询一样,我通常想SELECT
基于common_id
. 此查询执行良好。它的查询计划如下所示:
但是,这是我在许多不同查询中需要的逻辑,所以我想把它放在一个视图中。结果是,就查询规划器而言,查询看起来像这样:
SELECT *
FROM (
SELECT DISTINCT ON (my_point.my_point_id)
my_polygon.*,
my_point.my_point_id,
my_point.pointvalue,
my_point.geom AS pointgeom
FROM my_polygon
JOIN my_point ON my_point.common_id = my_polygon.common_id AND ST_Contains(my_polygon.geom, my_point.geom)
ORDER BY my_point.my_point_id, my_polygon.my_polygon_id
) point_with_polygon
WHERE common_id = 1
结果是现在 PostgreSQL在执行common_id
之后进行过滤DISTINCT ON
,这意味着它必须对JOIN
两个表的整体进行过滤。这是它的查询计划:
如何允许 PostgreSQL 将过滤器下推到查询的早期部分,并且仍然将一般查询放在视图中?
我现在卡在 PG 9.3 上,但升级到 9.5 可能是一种选择。
架构和示例数据脚本
需要 PostGIS。(这就是为什么没有 SQL Fiddle 的原因。)
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS btree_gist;
-- DROP FUNCTION ST_GeneratePoints(geometry, numeric);
DO $doblock$
BEGIN
IF NOT EXISTS(SELECT * FROM pg_proc WHERE UPPER(proname) = UPPER('ST_GeneratePoints')) THEN
-- Create naive ST_GeneratePoints if version of PostGIS is not new enough
CREATE FUNCTION ST_GeneratePoints(g geometry, npoints numeric)
RETURNS geometry
VOLATILE
RETURNS NULL ON NULL INPUT
LANGUAGE plpgsql
AS $$
DECLARE
num_to_generate INTEGER := npoints::INTEGER;
adjustment CONSTANT FLOAT := 0.00000000001;
x_min FLOAT := ST_XMin(g) + adjustment;
x_max FLOAT := ST_XMax(g) - adjustment;
y_min FLOAT := ST_YMin(g) + adjustment;
y_max FLOAT := ST_YMax(g) - adjustment;
temp_result GEOMETRY[];
result_array GEOMETRY[] := ARRAY[]::GEOMETRY[];
BEGIN
IF ST_IsEmpty(g) THEN
RAISE EXCEPTION 'Cannot generate points inside an empty geometry';
END IF;
IF ST_Dimension(g) < 2 THEN
RAISE EXCEPTION 'Only polygons supported';
END IF;
-- Reduce number of loops to reduce slow array_cat calls
WHILE num_to_generate > 0 LOOP
SELECT ARRAY_AGG(contained.point) INTO temp_result
FROM (
SELECT point
FROM (
SELECT ST_MakePoint(
x_min + random() * (x_max - x_min),
y_min + random() * (y_max - y_min)
) point
-- Generate extras to reduce number of loops
--
-- Each point has a probability of ST_Area(g) / ST_Area(ST_Envelope(g)) to fall within the polygon.
-- So on average, we expect ST_Area(g) / ST_Area(ST_Envelope(g)) of the points generated to fall within.
-- Generating ST_Area(ST_Envelope(g)) / ST_Area(g) * num_to_generate points means that on average, we'll
-- get
--
-- ST_Area(g) / ST_Area(ST_Envelope(g)) * ST_Area(ST_Envelope(g)) / ST_Area(g) * num_to_generate
-- = num_to_generate
--
-- points within the polygon. (Notice the numerators and denominators cancel out.) This means we'll
-- only run one loop about half the time without generating an excessive number of points.
--
-- Generate at least 20 to avoid a lot of loops for small numbers, though.
FROM generate_series(1, GREATEST(20, CEIL(ST_Area(ST_Envelope(g)) / ST_Area(g) * num_to_generate)::INTEGER))
) candidate
WHERE ST_Contains(g, candidate.point)
-- Filter out extras if we have too many matches
LIMIT num_to_generate
) contained
;
IF ARRAY_LENGTH(temp_result, 1) > 0 THEN
result_array := array_cat(result_array, temp_result);
num_to_generate := npoints - COALESCE(ARRAY_LENGTH(result_array, 1), 0);
END IF;
END LOOP;
RETURN (SELECT ST_Union(point) FROM UNNEST(result_array) result (point));
END;
$$;
RAISE NOTICE 'Created ST_GeneratePoints';
ELSE
RAISE NOTICE 'ST_GeneratePoints exists';
END IF;
END
$doblock$
;
DROP TABLE IF EXISTS my_polygon;
CREATE TABLE my_polygon (
my_polygon_id SERIAL PRIMARY KEY,
common_id INTEGER NOT NULL,
value1 NUMERIC NOT NULL,
value2 NUMERIC NOT NULL,
value3 NUMERIC NOT NULL,
geom GEOMETRY(Polygon) NOT NULL
)
;
CREATE INDEX ON my_polygon (common_id);
CREATE INDEX ON my_polygon USING GIST (common_id, geom);
WITH common AS (
SELECT
common_id,
random() * 5000 AS common_x_translate,
random() * 5000 AS common_y_translate
FROM (
SELECT TRUNC(random() * 1000) + 1 AS common_id
FROM generate_series(1, 100)
UNION
SELECT 1
) a
),
geom_set_with_small_overlaps AS (
SELECT
ST_MakeEnvelope(
x.translate,
y.translate,
x.translate + 1.1,
y.translate + 1.1
) AS geom
FROM
generate_series(0, 9) x (translate),
generate_series(0, 9) y (translate)
)
INSERT INTO my_polygon (common_id, value1, value2, value3, geom)
SELECT
common_id,
random() * 100,
random() * 100,
random() * 100,
ST_Translate(geom, common_x_translate, common_y_translate)
FROM common, geom_set_with_small_overlaps
;
DROP TABLE IF EXISTS my_point;
CREATE TABLE my_point (
my_point_id SERIAL PRIMARY KEY,
common_id INTEGER NOT NULL,
pointvalue NUMERIC NOT NULL,
geom GEOMETRY(Point) NOT NULL
);
INSERT INTO my_point (common_id, pointvalue, geom)
SELECT
common_id,
random() * 100,
(ST_Dump(ST_GeneratePoints(extent, FLOOR(5000 + random() * 15000)::NUMERIC))).geom
FROM (
SELECT
common_id,
-- Small negative buffer prevents lying on the outer edge
ST_Buffer(ST_Extent(geom), - 0.0001) AS extent
FROM my_polygon
GROUP BY common_id
) common
UNION ALL
SELECT
common_id,
random() * 100,
(ST_Dump(ST_GeneratePoints(intersection, TRUNC(random() * 5)::NUMERIC))).geom
FROM (
SELECT
p1.common_id,
p1.my_polygon_id AS id1,
p2.my_polygon_id AS id2,
ST_Intersection(p1.geom, p2.geom) AS intersection
FROM my_polygon p1
JOIN my_polygon p2 ON (
p1.my_polygon_id < p2.my_polygon_id AND
p1.common_id = p2.common_id AND
ST_Intersects(p1.geom, p2.geom)
)
) a
;
CREATE INDEX ON my_point (common_id);
CREATE INDEX ON my_point USING GIST (common_id, geom);
你可能想VACUUM ANALYZE
在那之后。
以文本形式查询计划
WHERE
子句内(表现良好):
Unique (cost=1195.74..1207.74 rows=2400 width=216)
-> Sort (cost=1195.74..1201.74 rows=2400 width=216)
Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
-> Nested Loop (cost=5.34..1060.99 rows=2400 width=216)
-> Bitmap Heap Scan on my_polygon (cost=4.93..191.74 rows=100 width=164)
Recheck Cond: (common_id = 1)
-> Bitmap Index Scan on my_polygon_common_id_geom_idx (cost=0.00..4.90 rows=100 width=0)
Index Cond: (common_id = 1)
-> Index Scan using my_point_common_id_geom_idx on my_point (cost=0.41..8.68 rows=1 width=52)
Index Cond: ((common_id = 1) AND (my_polygon.geom && geom))
Filter: _st_contains(my_polygon.geom, geom)
WHERE
子句外(表现不佳):
Subquery Scan on a (cost=209447.85..215842.18 rows=1827 width=212)
Filter: (a.common_id = 1)
-> Unique (cost=209447.85..211274.80 rows=365390 width=212)
-> Sort (cost=209447.85..210361.33 rows=365390 width=212)
Sort Key: my_point.my_point_id, my_polygon.my_polygon_id
-> Nested Loop (cost=0.41..63285.00 rows=365390 width=212)
-> Seq Scan on my_polygon (cost=0.00..338.00 rows=9800 width=164)
-> Index Scan using my_point_common_id_geom_idx on my_point (cost=0.41..6.41 rows=1 width=52)
Index Cond: ((common_id = my_polygon.common_id) AND (my_polygon.geom && geom))
Filter: _st_contains(my_polygon.geom, geom)
事实证明,有一种方法可以解决这个问题并让 PG 正确优化:你必须包含
common_id
在DISTINCT ON
子句中。像这样:
WHERE
这会产生与在 :之前包含子句相同的查询计划DISTINCT ON
:包含
common_id
有点多余,因为JOIN
无论如何它都是条件的一部分,但这一事实也意味着这不会改变查询的结果。警告:确保一切都匹配
在结果中使用that 's非常重要。以这个查询为例:
common_id
SELECT
此查询
my_polygon.common_id
在SELECT
子句中使用,但my_point.common_id
在ORDER BY
andDISTINCT ON
子句中使用。在这种情况下, PG不会将过滤器下推到子查询中。