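I imported a copy of the ip2location_db11 lite database, which contains 3,319,097 rows, and I would like to optimize a numeric range query where the low and high values are in separate columns of the table (ip_from, ip_to).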
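For readers unfamiliar with the data, the bigint values in ip_from and ip_to are IPv4 addresses stored as 32-bit integers. A minimal sketch of the conversion, using Python's standard ipaddress module (the helper names ip_to_int and int_to_ip are made up for illustration and are not part of the question):

```python
import ipaddress


def ip_to_int(dotted: str) -> int:
    """Convert a dotted-quad IPv4 address to the integer form stored in ip_from/ip_to."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ip(value: int) -> str:
    """Convert a stored integer back to dotted-quad notation."""
    return str(ipaddress.IPv4Address(value))


# 16777216 appears as an ip_from value in the sample CSV rows further down.
print(int_to_ip(16777216))          # 1.0.0.0
# 2538629520 is the address used in the test queries.
print(int_to_ip(2538629520))        # 151.80.105.144
print(ip_to_int("151.80.105.144"))  # 2538629520
```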
Importing the database:
CREATE TABLE ip2location_db11
(
ip_from bigint NOT NULL, -- First IP address in netblock.
ip_to bigint NOT NULL, -- Last IP address in netblock.
country_code character(2) NOT NULL, -- Two-character country code based on ISO 3166.
country_name character varying(64) NOT NULL, -- Country name based on ISO 3166.
region_name character varying(128) NOT NULL, -- Region or state name.
city_name character varying(128) NOT NULL, -- City name.
latitude real NOT NULL, -- City latitude. Default to capital city latitude if city is unknown.
longitude real NOT NULL, -- City longitude. Default to capital city longitude if city is unknown.
zip_code character varying(30) NOT NULL, -- ZIP/Postal code.
time_zone character varying(8) NOT NULL, -- UTC time zone (with DST supported).
CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';
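My first naive indexing attempt was to create a separate index on each of these columns, which resulted in a sequential scan with a query time of about 400 ms: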
account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.ip2location_db11 (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Rows Removed by Filter: 3319096
Planning time: 0.155 ms
Execution time: 401.834 ms
(6 rows)
account=> \d ip2location_db11
Table "public.ip2location_db11"
Column | Type | Modifiers
--------------+------------------------+-----------
ip_from | bigint | not null
ip_to | bigint | not null
country_code | character(2) | not null
country_name | character varying(64) | not null
region_name | character varying(128) | not null
city_name | character varying(128) | not null
latitude | real | not null
longitude | real | not null
zip_code | character varying(30) | not null
time_zone | character varying(8) | not null
Indexes:
"ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
"ip_from_db11_idx" btree (ip_from)
"ip_to_db11_idx" btree (ip_to)
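My second attempt was to create a multi-column btree index, which resulted in an index scan with a query time of about 290 ms (note that the plan below uses ip_to_db11_idx, not the new multi-column index):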
account=> CREATE INDEX ip_range_db11_idx ON ip2location_db11 (ip_from,ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ip_to_db11_idx on public.ip2location_db11 (cost=0.43..51334.91 rows=756866 width=69) (actual time=1.109..289.143 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Index Cond: ('2538629520'::bigint <= ip2location_db11.ip_to)
Filter: ('2538629520'::bigint >= ip2location_db11.ip_from)
Rows Removed by Filter: 1160706
Planning time: 0.324 ms
Execution time: 289.172 ms
(7 rows)
account=> \d ip2location_db11
Table "public.ip2location_db11"
Column | Type | Modifiers
--------------+------------------------+-----------
ip_from | bigint | not null
ip_to | bigint | not null
country_code | character(2) | not null
country_name | character varying(64) | not null
region_name | character varying(128) | not null
city_name | character varying(128) | not null
latitude | real | not null
longitude | real | not null
zip_code | character varying(30) | not null
time_zone | character varying(8) | not null
Indexes:
"ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
"ip_from_db11_idx" btree (ip_from)
"ip_range_db11_idx" btree (ip_from, ip_to)
"ip_to_db11_idx" btree (ip_to)
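Update: As requested in the comments, I re-ran the query above. Timings of the first 15 queries after recreating the table (165ms, 65ms, 86ms, 83ms, 86ms, 64ms, 85ms, 811ms, 868ms, 845ms, 810ms, 781ms, 797ms, 890ms, 806ms):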
account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.ip2location_db11 (cost=28200.29..76843.12 rows=368789 width=842) (actual time=64.866..64.866 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Recheck Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Heap Blocks: exact=1
Buffers: shared hit=8273
-> Bitmap Index Scan on ip_range_db11_idx (cost=0.00..28108.09 rows=368789 width=0) (actual time=64.859..64.859 rows=1 loops=1)
Index Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Buffers: shared hit=8272
Planning time: 0.099 ms
Execution time: 64.907 ms
(10 rows)
account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.ip2location_db11 (cost=0.00..92906.18 rows=754776 width=69) (actual time=577.234..811.757 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Rows Removed by Filter: 3319096
Buffers: shared hit=33 read=43078
Planning time: 0.667 ms
Execution time: 811.783 ms
(7 rows)
Sample rows from the imported CSV file:
"0","16777215","-","-","-","-","0.000000","0.000000","-","-"
"16777216","16777471","AU","Australia","Queensland","Brisbane","-27.467940","153.028090","4000","+10:00"
"16777472","16778239","CN","China","Fujian","Fuzhou","26.061390","119.306110","350004","+08:00"
Is there a better way to index this table to improve the query, or is there a more efficient query that would return the same result?
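This is a bit different from the solutions already provided, which rely on spatial-index tricks.
Instead, it is worth remembering that IP address ranges cannot overlap. That is, A -> B cannot intersect X -> Y in any way. Knowing this, you can change the SELECT query slightly to take advantage of that property, and then you do not need any "clever" indexing at all. In fact, you only need to index your ip_from column.
Previously, the query being analyzed was:
SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;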
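Let's assume that the range 2538629520 falls into happens to be 2538629512 to 2538629537. From this we can assume that the next ip_from value is 2538629538. We do not actually need to worry about any records above that ip_from value. In fact, all we really care about is the range whose ip_from equals 2538629512. Knowing this, our query effectively becomes, in plain English: find the largest ip_from value that is less than or equal to the IP address, and return the record that has that ip_from.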
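The reasoning above can be sketched outside the database with a sorted list and a binary search. This is only an illustration of the idea, not the answer's code: the RANGES data and the lookup function are hypothetical, and the ip_to check is an extra guard in case the data has gaps between ranges.

```python
import bisect

# Hypothetical non-overlapping ranges, sorted by ip_from: (ip_from, ip_to, label).
RANGES = [
    (0, 16777215, "-"),
    (16777216, 16777471, "AU"),
    (16777472, 16778239, "CN"),
    (2538629512, 2538629537, "example"),
    (2538629538, 2538629600, "next"),
]
STARTS = [r[0] for r in RANGES]


def lookup(ip):
    """Return the range containing ip, or None if no range covers it."""
    # Position of the last range whose ip_from is <= ip.
    i = bisect.bisect_right(STARTS, ip) - 1
    if i < 0:
        return None
    ip_from, ip_to, label = RANGES[i]
    # Because ranges never overlap, this is the only candidate; the ip_to
    # comparison only protects against gaps between consecutive ranges.
    return RANGES[i] if ip <= ip_to else None


print(lookup(2538629520))  # (2538629512, 2538629537, 'example')
print(lookup(16778240))    # None, since this address falls in a gap of the toy data
```

A btree index on ip_from gives the database the same kind of ordered structure to search, which is why a single-column index is enough.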
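Because the ranges from ip_from to ip_to never overlap, this holds true and allows us to write the query as a lookup on ip_from alone.
Back to indexing, to take advantage of all this. All we are actually looking at is ip_from, and we are doing integer comparisons. MAX(ip_from) over the values at or below the address lets PostgreSQL find the one record we need. This is good because it can seek right to it and then not worry about any other records at all.
All we really need is an index like this: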
CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location(ip_from)
We can make the index unique because we will not have overlapping records. I would even make this column the primary key by itself.
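As a rough, runnable sketch of the index plus the rewritten query, the following uses an in-memory SQLite database as a stand-in for PostgreSQL. The table contents, the 'ZZ' country code, the find_range helper, and the exact SQL text are assumptions made for illustration and are not the answer's original query; the ip_to comparison is an added guard for data sets that might contain gaps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ip2location (
        ip_from      INTEGER NOT NULL,
        ip_to        INTEGER NOT NULL,
        country_code TEXT    NOT NULL
    )
    """
)
# Single-column unique index on ip_from, mirroring the index suggested above.
conn.execute("CREATE UNIQUE INDEX ix_ip2location_ipfrom ON ip2location (ip_from)")
conn.executemany(
    "INSERT INTO ip2location VALUES (?, ?, ?)",
    [
        (0, 16777215, "-"),
        (16777216, 16777471, "AU"),
        (16777472, 16778239, "CN"),
        (2538629512, 2538629537, "ZZ"),  # hypothetical row for the example IP
    ],
)

# Find the largest ip_from at or below the address, then fetch that one row.
LOOKUP = """
SELECT ip_from, ip_to, country_code
FROM ip2location
WHERE ip_from = (SELECT MAX(ip_from) FROM ip2location WHERE ip_from <= ?)
  AND ip_to >= ?
"""


def find_range(ip):
    """Return the (ip_from, ip_to, country_code) row covering ip, or None."""
    return conn.execute(LOOKUP, (ip, ip)).fetchone()


print(find_range(2538629520))  # (2538629512, 2538629537, 'ZZ')
print(find_range(16777300))    # (16777216, 16777471, 'AU')
```

An equivalent form that some planners handle just as well is ORDER BY ip_from DESC LIMIT 1 over the rows with ip_from <= the address; which one performs better on a given PostgreSQL version is worth checking with EXPLAIN ANALYZE.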
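With this index and this query, the explain plan is:
To give you an idea of the improvement in query performance with this approach, I tested it on my Raspberry Pi. The original approach took approximately 4 seconds. This approach takes approximately 120 ms. The big win comes from seeking to an individual row versus scanning. The original query would suffer badly for low range values, because more of the table has to be considered for the result. This query will show consistent performance across the whole range of values.
Hope this helps, and that my explanation makes sense to you all.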
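Thanks to the comments, I have a solution that reduces the query time to 0.073 ms by using a GiST spatial index and adjusting the query accordingly:
References: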
http://www.siafoo.net/article/53#comment_288
http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks#Fast_interval_.28of_time_or_ip_addresses.29_searching_with_spatial_indexes
ip4r
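First, build and add the extension from GitHub (the instructions there are better).
Let's start with almost the same thing as before, creating the IP columns with type ip4. Do not make anything the PRIMARY KEY, and do not add an index on the type. We will alter the table after loading.
Now let's upgrade them to ip4r.
Now let's index it,
and query it,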