Question

Parker

Asked: 2017-11-10 06:03:50 +0800 CST

Slow query in PostgreSQL selecting a single row from between a range defined in two columns

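I imported a copy of the ip2location_db11 lite database, which contains 3,319,097 rows, and I want to optimize a numeric range query where the low and high values are in separate columns of the table (ip_from, ip_to).

Importing the database: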

CREATE TABLE ip2location_db11
(
  ip_from bigint NOT NULL, -- First IP address in netblock.
  ip_to bigint NOT NULL, -- Last IP address in netblock.
  country_code character(2) NOT NULL, -- Two-character country code based on ISO 3166.
  country_name character varying(64) NOT NULL, -- Country name based on ISO 3166.
  region_name character varying(128) NOT NULL, -- Region or state name.
  city_name character varying(128) NOT NULL, -- City name.
  latitude real NOT NULL, -- City latitude. Default to capital city latitude if city is unknown.
  longitude real NOT NULL, -- City longitude. Default to capital city longitude if city is unknown.
  zip_code character varying(30) NOT NULL, -- ZIP/Postal code.
  time_zone character varying(8) NOT NULL, -- UTC time zone (with DST supported).
  CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';

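My first naive indexing attempt was to create a separate index on each of these columns, which resulted in a sequential scan with a query time of 400 ms: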

account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
 Planning time: 0.155 ms
 Execution time: 401.834 ms
(6 rows)

account=> \d ip2location_db11
          Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_to_db11_idx" btree (ip_to)

My second attempt was to create a multi-column btree index, which resulted in an index scan with a query time of 290 ms:

account=> CREATE INDEX ip_range_db11_idx ON ip2location_db11 (ip_from,ip_to);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ip_to_db11_idx on public.ip2location_db11 (cost=0.43..51334.91 rows=756866 width=69) (actual time=1.109..289.143 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Index Cond: ('2538629520'::bigint <= ip2location_db11.ip_to)
   Filter: ('2538629520'::bigint >= ip2location_db11.ip_from)
   Rows Removed by Filter: 1160706
 Planning time: 0.324 ms
 Execution time: 289.172 ms
(7 rows)

n4l_account=> \d ip2location_db11
          Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_range_db11_idx" btree (ip_from, ip_to)
    "ip_to_db11_idx" btree (ip_to)

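Update: As requested in the comments, I re-ran the query above. Timings of the first 15 queries after rebuilding the table (165 ms, 65 ms, 86 ms, 83 ms, 86 ms, 64 ms, 85 ms, 811 ms, 868 ms, 845 ms, 810 ms, 781 ms, 797 ms, 890 ms, 806 ms):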

account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.ip2location_db11  (cost=28200.29..76843.12 rows=368789 width=842) (actual time=64.866..64.866 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Recheck Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Heap Blocks: exact=1
   Buffers: shared hit=8273
   ->  Bitmap Index Scan on ip_range_db11_idx  (cost=0.00..28108.09 rows=368789 width=0) (actual time=64.859..64.859 rows=1 loops=1)
         Index Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
         Buffers: shared hit=8272
 Planning time: 0.099 ms
 Execution time: 64.907 ms
(10 rows)

account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..92906.18 rows=754776 width=69) (actual time=577.234..811.757 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
   Buffers: shared hit=33 read=43078
 Planning time: 0.667 ms
 Execution time: 811.783 ms
(7 rows)

Sample rows from the imported CSV file:

"0","16777215","-","-","-","-","0.000000","0.000000","-","-"
"16777216","16777471","AU","Australia","Queensland","Brisbane","-27.467940","153.028090","4000","+10:00"
"16777472","16778239","CN","China","Fujian","Fuzhou","26.061390","119.306110","350004","+08:00"

Is there a better way to index this table to improve the query, or is there a more efficient query that gets the same result?

3 Answers

Kent Chenery · Answer 1 · 2017-11-11T01:59:15+08:00

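This is a little different from the solutions already provided, which use spatial indexes to do some tricks.

Instead, it is worth remembering that with IP addresses you cannot have overlapping ranges. That is, A -> B cannot intersect X -> Y in any way. Knowing this, you can change the SELECT query slightly and take advantage of this trait. With it, you don't need any "clever" indexing at all. In fact, you only need to index your ip_from column.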
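
The no-overlap assumption can be checked before relying on it. A minimal sketch, assuming the table is named ip2location as in the rest of this answer (the alias and column names it introduces are illustrative): it compares each range with the one before it in ip_from order and returns any range that starts at or before the previous range ends. An empty result suggests the ranges do not overlap.

```sql
-- Sketch: report ranges that start at or before the previous range's end.
-- Assumes the ip2location table used in this answer; "ordered" and
-- "prev_ip_to" are illustrative names.
SELECT ip_from, ip_to, prev_ip_to
FROM (
    SELECT ip_from,
           ip_to,
           lag(ip_to) OVER (ORDER BY ip_from) AS prev_ip_to
    FROM ip2location
) AS ordered
WHERE prev_ip_to >= ip_from;  -- first row has a NULL prev_ip_to and is skipped
```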

Previously, the query being analyzed was:

SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

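Let's assume that the range 2538629520 falls into happens to be 2538629512 to 2538629537.

Note: It really doesn't matter what the range is; this is just to help demonstrate the pattern we can take advantage of.

From this we can assume that the next ip_from value is 2538629538. We don't actually need to worry about any records above this ip_from value. In fact, all we really care about is the range where ip_from equals 2538629512.

Knowing this, our query actually becomes (in plain English):

Find the largest ip_from value that is less than or equal to my IP address. Show me the record where you found this value.

Or in other words: find the ip_from value at or just before my IP address and give me that record.

Because we never have overlapping ranges of ip_from to ip_to, this holds true and allows us to write the query as: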

SELECT * 
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 2538629520
    )
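
The subquery form relies on the address always falling inside some range. As a hedged variant (not from the original answer), the same lookup can be written with ORDER BY ... LIMIT 1 plus an ip_to check, so that an address sitting in a gap between ranges returns no row instead of the preceding range. It assumes the same ip2location table and non-overlapping ranges; candidate is an illustrative alias.

```sql
-- Sketch: pick the range with the largest ip_from at or below the address,
-- then confirm the address is not past that range's end.
SELECT *
FROM (
    SELECT *
    FROM ip2location
    WHERE ip_from <= 2538629520
    ORDER BY ip_from DESC
    LIMIT 1
) AS candidate
WHERE candidate.ip_to >= 2538629520;  -- no row if the address falls in a gap
```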

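Back to the indexing to take advantage of all this. All we are actually looking at is ip_from, and we are doing integer comparisons. MAX(ip_from) lets PostgreSQL go straight to the first qualifying index entry. This is good because we can seek right to it and then not worry about any other records at all.

All we really need is an index like this: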

CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location(ip_from)

We can make the index unique because we won't have overlapping records. I would even make this column the primary key myself.
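
If the unique index above already exists, one way to promote it to a primary key is sketched below. This assumes the table has no primary key yet (the question's table already has one on (ip_from, ip_to), which would have to be dropped first), and ip2location_pkey is a hypothetical constraint name.

```sql
-- Sketch: reuse the existing unique index as the primary key.
-- PostgreSQL renames the index to match the constraint name.
ALTER TABLE public.ip2location
    ADD CONSTRAINT ip2location_pkey
    PRIMARY KEY USING INDEX ix_ip2location_ipfrom;
```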

With this index and this query, the explain plan is:

Index Scan using ix_ip2location_ipfrom on public.ip2location  (cost=0.90..8.92 rows=1 width=69) (actual time=0.530..0.533 rows=1 loops=1)
Output: ip2location.ip_from, ip2location.ip_to, ip2location.country_code, ip2location.country_name, ip2location.region_name, ip2location.city_name, ip2location.latitude, ip2location.longitude, ip2location.zip_code, ip2location.time_zone
Index Cond: (ip2location.ip_from = $1)
InitPlan 2 (returns $1)
    ->  Result  (cost=0.46..0.47 rows=1 width=8) (actual time=0.452..0.452 rows=1 loops=1)
        Output: $0
        InitPlan 1 (returns $0)
            ->  Limit  (cost=0.43..0.46 rows=1 width=8) (actual time=0.443..0.444 rows=1 loops=1)
                Output: ip2location_1.ip_from
                ->  Index Only Scan using ix_ip2location_ipfrom on public.ip2location ip2location_1  (cost=0.43..35440.79 rows=1144218 width=8) (actual time=0.438..0.438 rows=1 loops=1)
                        Output: ip2location_1.ip_from
                        Index Cond: ((ip2location_1.ip_from IS NOT NULL) AND (ip2location_1.ip_from >= '2538629520'::bigint))
                        Heap Fetches: 0

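To give you an idea of the improvement in query performance with this approach, I tested it on my Raspberry Pi. The original approach took approximately 4 seconds. This approach takes approximately 120 ms. The big win comes from individual row seeks versus scans. The original query suffers badly for low range values, because more of the table has to be considered for the result. This query performs consistently across the whole range of values.

I hope this helps and that my explanation makes sense to you all.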

Parker · Answer 2 · 2017-11-10T06:28:13+08:00

Thanks to the comments, I have a solution that reduces the query time to 0.073 ms by using a GiST spatial index and adjusting the query accordingly:

account=> DROP INDEX ip_to_db11_idx;
account=> DROP INDEX ip_from_db11_idx;
account=> DROP INDEX ip_range_db11_idx;
account=> CREATE INDEX ip2location_db11_gist ON ip2location_db11 USING gist ((box(point(ip_from,ip_from),point(ip_to,ip_to))) box_ops);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE  box(point(ip_from,ip_from),point(ip_to,ip_to)) @> box(point (2538629520,2538629520), point(2538629520,2538629520));


              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.ip2location_db11  (cost=190.14..10463.13 rows=3319 width=69) (actual time=0.032..0.033 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Recheck Cond: (box(point((ip2location_db11.ip_from)::double precision, (ip2location_db11.ip_from)::double precision),
 point((ip2location_db11.ip_to)::double precision, (ip2location_db11.ip_to)::double precision)) @> '(2538629520,2538629520),(2538629520,2538629520)'::box)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on ip2location_db11_gist  (cost=0.00..189.31 rows=3319 width=0) (actual time=0.022..0.022 rows=1 loops=1)
         Index Cond: (box(point((ip2location_db11.ip_from)::double precision, (ip2location_db11.ip_from)::double precision), point((ip2location_db11.ip_to)::double precision, (ip2location_db11.ip_to)::double precision)) @> '(2538629520,2538629520),(2538629520,2538629520)'::box)
 Planning time: 2.119 ms
 Execution time: 0.073 ms
(8 rows)
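
The same idea can be expressed without the geometric types. A sketch, assuming PostgreSQL 9.2 or later (where the built-in range types and their GiST support are available): index an int8range built from the two columns and test whether it contains the address. This is a swapped-in alternative to the box trick above, not the configuration that was timed, and the index name is illustrative.

```sql
-- Sketch: expression index over the inclusive bigint range [ip_from, ip_to].
CREATE INDEX ip2location_db11_range_gist
    ON ip2location_db11
    USING gist (int8range(ip_from, ip_to, '[]'));

-- The WHERE expression must match the indexed expression
-- for the planner to consider the index.
SELECT *
FROM ip2location_db11
WHERE int8range(ip_from, ip_to, '[]') @> 2538629520::bigint;
```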

References:

http://www.siafoo.net/article/53#comment_288

http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks#Fast_interval_.28of_time_or_ip_addresses.29_searching_with_spatial_indexes

Evan Carroll · Answer 3 · 2017-11-10T07:33:16+08:00

`ip4r`

First, build and add the extension from GitHub (better instructions).

CREATE EXTENSION ip4r;

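Let's start with almost the same thing as before, creating the columns with the IP type ip4. Don't add a PRIMARY KEY or any index on the type yet. We'll alter the table after loading.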

CREATE TABLE ip2location_db11
(
  ip_from ip4 NOT NULL,   -- First IP address in netblock.
  ip_to   ip4 NOT NULL, -- Last IP address in netblock.
  ....
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';

Now let's upgrade them to ip4r:

BEGIN;
  ALTER TABLE ip2location_db11
    ADD iploc_range ip4r;
  UPDATE ip2location_db11
    SET iploc_range = ip4r(ip_from,ip_to);
  ALTER TABLE ip2location_db11
    DROP COLUMN ip_from,
    DROP COLUMN ip_to;
COMMIT;

Now let's index it:

CREATE INDEX ON ip2location_db11
   USING gist (iploc_range);
VACUUM ANALYZE ip2location_db11;

And query it:

SELECT *
FROM ip2location_db11
WHERE iploc_range >>= '1.2.3.4';
