ypercubeᵀᴹ
Asked: 2012-05-23 13:19:38 +0800 CST
Can a spatial index help a "range - order by - limit" query?


Asking this question specifically for Postgres, since it has good support for R-tree/spatial indexes.

We have the following table with a tree structure (nested set model) of words and their frequencies:

lexikon
-------
_id   integer  PRIMARY KEY
word  text
frequency integer
lset  integer  UNIQUE KEY
rset  integer  UNIQUE KEY

and the query:

SELECT word
FROM lexikon
WHERE lset BETWEEN @Low AND @High
ORDER BY frequency DESC
LIMIT @N

I suppose a covering index on (lset, frequency, word) would be useful, but I feel it may not perform well if there are too many lset values in the (@Low, @High) range.

Sometimes a simple index on (frequency DESC) may also be sufficient, namely when a search using that index produces the @N rows that match the range condition early.

But performance seems to depend heavily on the parameter values.

Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow and regardless of whether the top-frequency words are luckily in the (narrow) selected range?

Would an R-tree/spatial index help?

Adding indexes, rewriting the query, re-designing the table: nothing is off limits.
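To illustrate why performance hinges on the parameters, here is a plain-Python sketch (made-up data and helper names, not anything from the question) of the query's semantics and of how deep a (frequency DESC) index scan must dig before it collects @N matching rows:

```python
# Toy model: each row is (word, frequency, lset); lset values are 1..10000.
import random

random.seed(1)
rows = [(f"word{i}", random.randint(1, 10000), i) for i in range(1, 10001)]

def range_top_n(rows, low, high, n):
    """The query itself: filter on lset, order by frequency DESC, limit n."""
    matching = [r for r in rows if low <= r[2] <= high]
    return sorted(matching, key=lambda r: r[1], reverse=True)[:n]

def freq_scan_cost(rows, low, high, n):
    """Rows visited when scanning in frequency DESC order until n matches."""
    visited, found = 0, 0
    for r in sorted(rows, key=lambda r: r[1], reverse=True):
        visited += 1
        if low <= r[2] <= high:
            found += 1
            if found == n:
                break
    return visited

print(freq_scan_cost(rows, 1, 10000, 5))  # wide range: 5 rows visited
print(freq_scan_cost(rows, 42, 52, 5))    # narrow range: far deeper scan
```

With a wide range the frequency-ordered scan stops almost immediately; with a narrow range it must wade through rows whose lset falls outside the range, which is exactly the parameter sensitivity described above.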

database-design postgresql
  • 4 Answers
  • 5079 Views

4 Answers

  1. Best Answer
    Jack Douglas
    2012-08-11T06:21:45+08:00

    You may be able to get better performance by searching rows with higher frequencies first. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:

    Testbed and lexikon dummy data:

    begin;
    set role dba;
    create role stack;
    grant stack to dba;
    create schema authorization stack;
    set role stack;
    --
    create table lexikon( _id serial, 
                          word text, 
                          frequency integer, 
                          lset integer, 
                          width_granule integer);
    --
    insert into lexikon(word, frequency, lset) 
    select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
    from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
    --
    update lexikon set width_granule=ln(frequency)::integer;
    --
    create index on lexikon(width_granule, lset);
    create index on lexikon(lset);
    -- the second index is not used with the function but is added to make the timings 'fair'
    

    Granule analysis (mostly for information and tuning):

    create table granule as 
    select width_granule, count(*) as freq, 
           min(frequency) as granule_start, max(frequency) as granule_end 
    from lexikon group by width_granule;
    --
    select * from granule order by 1;
    /*
     width_granule |  freq  | granule_start | granule_end
    ---------------+--------+---------------+-------------
                 0 | 500000 |             1 |           1
                 1 | 300000 |             2 |           4
                 2 | 123077 |             5 |          12
                 3 |  47512 |            13 |          33
                 4 |  18422 |            34 |          90
                 5 |   6908 |            91 |         244
                 6 |   2580 |           245 |         665
                 7 |    949 |           666 |        1808
                 8 |    349 |          1811 |        4901
                 9 |    129 |          4926 |       13333
                10 |     47 |         13513 |       35714
                11 |     17 |         37037 |       90909
                12 |      7 |        100000 |      250000
                13 |      2 |        333333 |      500000
                14 |      1 |       1000000 |     1000000
    */
    alter table granule drop column freq;
    --
    

    Function for scanning high frequencies first:

    create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
           returns setof lexikon language plpgsql set search_path to 'stack' as $$
    declare
      m integer;
      n integer := 0;
      r record;
    begin 
      for r in (select width_granule from granule order by width_granule desc) loop
        return query( select * 
                      from lexikon 
                      where width_granule=r.width_granule 
                            and lset>=p_lset_low and lset<=p_lset_high );
        get diagnostics m = row_count;
        n = n+m;
        exit when n>=p_limit;
      end loop;
    end;$$;
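    The loop above can be sketched in plain Python (hypothetical data shapes; `granule_search` is an illustrative name, not part of the answer's SQL): step through granules from highest to lowest, collect every match in the lset range, and stop as soon as at least p_limit rows have accumulated. The caller then re-sorts and trims, exactly as the final `ORDER BY frequency DESC LIMIT n` does.

```python
def granule_search(rows_by_granule, lset_low, lset_high, limit):
    """rows_by_granule: {granule: [(word, frequency, lset), ...]}"""
    out = []
    for g in sorted(rows_by_granule, reverse=True):  # high granules first
        out.extend(r for r in rows_by_granule[g]
                   if lset_low <= r[2] <= lset_high)
        if len(out) >= limit:                        # exit when n >= p_limit
            break
    # the SQL wrapper applies: ORDER BY frequency DESC LIMIT n
    return sorted(out, key=lambda r: r[1], reverse=True)[:limit]

rows_by_granule = {
    0: [("rare1", 1, 25), ("rare2", 1, 99)],
    2: [("mid1", 7, 30), ("mid2", 9, 500)],
    4: [("hot1", 60, 10), ("hot2", 80, 40)],
}
print(granule_search(rows_by_granule, 1, 50, 2))
# [('hot2', 80, 40), ('hot1', 60, 10)] -- satisfied by the top granule alone
```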
    

    Results (the timings should probably be taken with a pinch of salt, but each query is run twice to counter any caching effects).

    First using the function we have just written:

    \timing on
    --
    select * from f(20000, 30000, 5) order by frequency desc limit 5;
    /*
     _id |   word    | frequency | lset  | width_granule
    -----+-----------+-----------+-------+---------------
     141 | word23237 |      7092 | 23237 |             9
     246 | word25112 |      4065 | 25112 |             8
     275 | word23825 |      3636 | 23825 |             8
     409 | word28660 |      2444 | 28660 |             8
     418 | word29923 |      2392 | 29923 |             8
    Time: 80.452 ms
    */
    select * from f(20000, 30000, 5) order by frequency desc limit 5;
    /*
     _id |   word    | frequency | lset  | width_granule
    -----+-----------+-----------+-------+---------------
     141 | word23237 |      7092 | 23237 |             9
     246 | word25112 |      4065 | 25112 |             8
     275 | word23825 |      3636 | 23825 |             8
     409 | word28660 |      2444 | 28660 |             8
     418 | word29923 |      2392 | 29923 |             8
    Time: 0.510 ms
    */
    

    and then with a simple index scan:

    select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
    /*
     _id |   word    | frequency | lset  | width_granule
    -----+-----------+-----------+-------+---------------
     141 | word23237 |      7092 | 23237 |             9
     246 | word25112 |      4065 | 25112 |             8
     275 | word23825 |      3636 | 23825 |             8
     409 | word28660 |      2444 | 28660 |             8
     418 | word29923 |      2392 | 29923 |             8
    Time: 218.897 ms
    */
    select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
    /*
     _id |   word    | frequency | lset  | width_granule
    -----+-----------+-----------+-------+---------------
     141 | word23237 |      7092 | 23237 |             9
     246 | word25112 |      4065 | 25112 |             8
     275 | word23825 |      3636 | 23825 |             8
     409 | word28660 |      2444 | 28660 |             8
     418 | word29923 |      2392 | 29923 |             8
    Time: 51.250 ms
    */
    \timing off
    --
    rollback;
    

    Depending on your real data, you may want to vary the number of granules and the function used to place rows into them. The actual distribution of frequencies is key here, as are the expected values for the limit clause and the size of the lset ranges sought.
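    For illustration, here is a minimal sketch of how the `ln(frequency)::integer` granulation above buckets a Zipf-like frequency column (assuming Python's `round` matches the Postgres cast closely enough for these values): each bucket spans a roughly constant multiplicative range (about a factor of e) of frequencies.

```python
import math

def width_granule(frequency):
    # mirrors: update lexikon set width_granule = ln(frequency)::integer
    return round(math.log(frequency))

# boundary values taken from the granule table in the answer
for f in (1, 4, 5, 90, 244, 245, 1000000):
    print(f, "->", width_granule(f))
```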

  2. Erwin Brandstetter
    2012-08-15T19:36:39+08:00

    Setup

    I am building on @Jack's setup to make it easier for people to follow and compare. Tested with PostgreSQL 9.1.4.

    CREATE TABLE lexikon (
       lex_id    serial PRIMARY KEY
     , word      text
     , frequency int NOT NULL  -- we'd need to do more if NULL was allowed
     , lset      int
    );
    
    INSERT INTO lexikon(word, frequency, lset) 
    SELECT 'w' || g  -- shorter with just 'w'
         , (1000000 / row_number() OVER (ORDER BY random()))::int
         , g
    FROM   generate_series(1,1000000) g;
    

    From here on I take a different route:

    ANALYZE lexikon;
    

    Auxiliary table

    This solution does not add columns to the original table, it just needs a small helper table. I placed it in the schema public; use any schema of your choice.

    CREATE TABLE public.lex_freq AS
    WITH x AS (
       SELECT DISTINCT ON (f.row_min)
              f.row_min, c.row_ct, c.frequency
       FROM  (
          SELECT frequency, sum(count(*)) OVER (ORDER BY frequency DESC) AS row_ct
          FROM   lexikon
          GROUP  BY 1
          ) c
       JOIN  (                                   -- list of steps in recursive search
          VALUES (400),(1600),(6400),(25000),(100000),(200000),(400000),(600000),(800000)
          ) f(row_min) ON c.row_ct >= f.row_min  -- match next greater number
       ORDER  BY f.row_min, c.row_ct, c.frequency DESC
       )
    , y AS (   
       SELECT DISTINCT ON (frequency)
              row_min, row_ct, frequency AS freq_min
            , lag(frequency) OVER (ORDER BY row_min) AS freq_max
       FROM   x
       ORDER  BY frequency, row_min
       -- if one frequency spans multiple ranges, pick the lowest row_min
       )
    SELECT row_min, row_ct, freq_min
         , CASE freq_min <= freq_max
             WHEN TRUE  THEN 'frequency >= ' || freq_min || ' AND frequency < ' || freq_max
             WHEN FALSE THEN 'frequency  = ' || freq_min
             ELSE            'frequency >= ' || freq_min
           END AS cond
    FROM   y
    ORDER  BY row_min;
    

    The table looks like this:

    row_min | row_ct  | freq_min | cond
    --------+---------+----------+-------------
    400     | 400     | 2500     | frequency >= 2500
    1600    | 1600    | 625      | frequency >= 625 AND frequency < 2500
    6400    | 6410    | 156      | frequency >= 156 AND frequency < 625
    25000   | 25000   | 40       | frequency >= 40 AND frequency < 156
    100000  | 100000  | 10       | frequency >= 10 AND frequency < 40
    200000  | 200000  | 5        | frequency >= 5 AND frequency < 10
    400000  | 500000  | 2        | frequency >= 2 AND frequency < 5
    600000  | 1000000 | 1        | frequency  = 1
    

    Since the column cond is going to be used in dynamic SQL further down, you have to make this table secure. Always schema-qualify the table if you cannot be sure of an appropriate current search_path, and revoke write privileges from public (and any other untrusted role):

    REVOKE ALL ON public.lex_freq FROM public;
    GRANT SELECT ON public.lex_freq TO public;
    

    The table lex_freq serves three purposes:

    • Create the needed partial indexes automatically.
    • Provide the steps for the iterative function.
    • Meta information for tuning.

    Indexes

    This DO statement creates all needed indexes:

    DO
    $$
    DECLARE
       _cond text;
    BEGIN
       FOR _cond IN
          SELECT cond FROM public.lex_freq
       LOOP
          IF _cond LIKE 'frequency =%' THEN
             EXECUTE 'CREATE INDEX ON lexikon(lset) WHERE ' || _cond;
          ELSE
             EXECUTE 'CREATE INDEX ON lexikon(lset, frequency DESC) WHERE ' || _cond;
          END IF;
       END LOOP;
    END
    $$;
    

    All these partial indexes together span the table once. They are about the same size as one basic index on the whole table:

    SELECT pg_size_pretty(pg_relation_size('lexikon'));       -- 50 MB
    SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 71 MB
    

    Only 21 MB of indexes for the 50 MB table so far.

    I create most of the partial indexes on (lset, frequency DESC). The second column only helps in special cases. But as both involved columns are of type integer, due to the specifics of data alignment in combination with MAXALIGN in PostgreSQL, the second column does not make the index any bigger. It's a small win for hardly any cost.
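    A back-of-envelope sketch of that alignment claim, under the simplifying assumptions of an 8-byte tuple header and MAXALIGN = 8 (a toy model for illustration, not Postgres internals):

```python
MAXALIGN = 8
TUPLE_HEADER = 8  # simplified per-tuple index overhead

def index_tuple_size(column_bytes):
    """Pad header + key data up to the next MAXALIGN boundary."""
    total = TUPLE_HEADER + sum(column_bytes)
    return -(-total // MAXALIGN) * MAXALIGN  # ceil to a multiple of 8

print(index_tuple_size([4]))     # (lset) alone           -> 16
print(index_tuple_size([4, 4]))  # (lset, frequency DESC) -> 16, same size
```

The second int4 column rides along in padding that would be wasted anyway, which is why it comes at (nearly) no storage cost.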

    There is no point in doing that for partial indexes that only span a single frequency. Those are just on (lset). The created indexes look like this:

    CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2500;
    CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 625 AND frequency < 2500;
    -- ...
    CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2 AND frequency < 5;
    CREATE INDEX ON lexikon(lset) WHERE frequency = 1;
    

    Function

    The function is somewhat similar in style to @Jack's solution:

    CREATE OR REPLACE FUNCTION f_search(_lset_min int, _lset_max int, _limit int)
      RETURNS SETOF lexikon AS
    $func$
    DECLARE
       _n      int;
       _rest   int := _limit;   -- init with _limit param
       _cond   text;
    BEGIN 
       FOR _cond IN
          SELECT l.cond FROM public.lex_freq l ORDER BY l.row_min
       LOOP    
          --  RAISE NOTICE '_cond: %, _limit: %', _cond, _rest; -- for debugging
          RETURN QUERY EXECUTE '
             SELECT * 
             FROM   public.lexikon 
             WHERE  ' || _cond || '
             AND    lset >= $1
             AND    lset <= $2
             ORDER  BY frequency DESC
             LIMIT  $3'
          USING  _lset_min, _lset_max, _rest;
    
          GET DIAGNOSTICS _n = ROW_COUNT;
          _rest := _rest - _n;
          EXIT WHEN _rest < 1;
       END LOOP;
    END
    $func$ LANGUAGE plpgsql STABLE;
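    The control flow of f_search can be sketched in plain Python (hypothetical data; each "step" stands for one lex_freq condition, already ordered from high to low frequency): every iteration only asks for the rows still missing (_rest), so later, cheaper steps fetch less and less.

```python
def f_search(steps, lset_min, lset_max, limit):
    """steps: list of row lists, one per lex_freq condition,
    each pre-sorted by frequency DESC (as the partial index returns it)."""
    out, rest = [], limit
    for step_rows in steps:
        got = [r for r in step_rows
               if lset_min <= r[2] <= lset_max][:rest]  # dynamic LIMIT
        out.extend(got)
        rest -= len(got)
        if rest < 1:                                    # EXIT WHEN _rest < 1
            break
    return out

steps = [
    [("top", 5000, 10)],                     # frequency >= 2500
    [("mid", 900, 20), ("mid2", 700, 999)],  # 625 <= frequency < 2500
    [("low", 3, 15), ("low2", 2, 30)],       # 2 <= frequency < 5
]
print(f_search(steps, 1, 100, 3))
# [('top', 5000, 10), ('mid', 900, 20), ('low', 3, 15)]
```

Because the steps themselves descend in frequency, simply concatenating the per-step results already yields a globally frequency-ordered answer.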
    

    The main differences:

    • Dynamic SQL with RETURN QUERY EXECUTE.
      As we loop through the steps, a different query plan may be beneficial. The query plan for static SQL is generated once and then reused, which can save some overhead. But in this case the queries are simple and the values are very different. Dynamic SQL will be a big win.


    • Dynamic LIMIT for every query step.
      This helps in multiple ways: first, rows are only fetched as needed. In combination with dynamic SQL this may also generate different query plans to begin with. Second, no additional LIMIT in the function call is needed to trim the surplus.

    Benchmark

    Setup

    I picked four examples and ran three different tests with each. I took the best of five with warm cache to compare:

    1. The raw SQL query of the form:

      SELECT * 
      FROM   lexikon 
      WHERE  lset >= 20000
      AND    lset <= 30000
      ORDER  BY frequency DESC
      LIMIT  5;
      
    2. The same after creating this index

      CREATE INDEX ON lexikon(lset);
      

      It occupies about the same space as all my partial indexes together:

      SELECT pg_size_pretty(pg_total_relation_size('lexikon')) -- 93 MB
      
    3. The function

      SELECT * FROM f_search(20000, 30000, 5);
      

    Results

    SELECT * FROM f_search(20000, 30000, 5);

    1: Total runtime: 315.458 ms
    2: Total runtime: 36.458 ms
    3: Total runtime: 0.330 ms

    SELECT * FROM f_search(60000, 65000, 100);

    1: Total runtime: 294.819 ms
    2: Total runtime: 18.915 ms
    3: Total runtime: 1.414 ms

    SELECT * FROM f_search(10000, 70000, 100);

    1: Total runtime: 426.831 ms
    2: Total runtime: 217.874 ms
    3: Total runtime: 1.611 ms

    SELECT * FROM f_search(1, 1000000, 5);

    1: Total runtime: 2458.205 ms
    2: Total runtime: 2458.205 ms -- for large ranges of lset, seq scan is faster than index.
    3: Total runtime: 0.266 ms

    Conclusion

    As expected, the benefit from the function grows with bigger ranges of lset and smaller LIMIT.

    With very small ranges of lset, the raw query in combination with the index is actually faster. You'll want to test and maybe branch: raw query for small ranges of lset, else function call. You could even just build that into the function for a "best of both worlds" - that's what I would do.

    Depending on your data distribution and typical queries, more steps in lex_freq may help performance. Test to find the sweet spot. With the tools presented here, it should be easy to test.

  3. grayhemp
    2012-08-08T02:59:11+08:00

    I don't see any reason to include the word column in the index. So this index

    CREATE INDEX lexikon_lset_frequency ON lexikon (lset, frequency DESC);
    

    will make your query perform fast.

    UPD

    Currently there is no way to create a covering index in PostgreSQL. There has been a discussion about this feature on the PostgreSQL mailing list: http://archives.postgresql.org/pgsql-performance/2012-06/msg00114.php

  4. Evan Carroll
    2018-06-13T10:40:40+08:00

    Using a GIST index

    Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?

    It depends on what you mean when you say fast: you obviously have to visit every row in the range, because your query is ORDER BY frequency DESC. Short of that, the query planner already covers this, if I understand the question correctly.

    Here we create a table with 10k rows of (5::int,random()::double precision)

    CREATE EXTENSION IF NOT EXISTS btree_gist;
    CREATE TABLE t AS
      SELECT 5::int AS foo, random() AS bar
      FROM generate_series(1,1e4) AS gs(x);
    

    We index it,

    CREATE INDEX ON t USING gist (foo, bar);
    ANALYZE t;
    

    We query it,

    EXPLAIN ANALYZE
    SELECT *
    FROM t
    WHERE foo BETWEEN 1 AND 6
    ORDER BY bar DESC
    FETCH FIRST ROW ONLY;
    

    We get a Seq Scan on t. This is simply because our selectivity estimates let pg conclude heap access is faster than scanning an index and rechecking. So we make it more juicy by inserting 1,000,000 more rows of (42::int,random()::double precision) that don't fit our "range".
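    The arithmetic behind the plan flip is simple (a rough model for intuition, not the planner's actual cost math): with every row matching, a sequential scan of the heap wins; once the extra rows dilute the predicate to about 1% selectivity, the index becomes attractive.

```python
def selectivity(matching, total):
    """Fraction of the table the WHERE clause is expected to match."""
    return matching / total

before = selectivity(10_000, 10_000)      # all 10k rows match
after = selectivity(10_000, 1_010_000)    # after inserting 1M non-matching rows
print(f"{before:.2%}  {after:.2%}")       # 100.00%  0.99%
```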

    INSERT INTO t(foo,bar)
    SELECT 42::int, x
    FROM generate_series(1,1e6) AS gs(x);
    
    VACUUM ANALYZE t;
    

    And then we requery,

    EXPLAIN ANALYZE
    SELECT *
    FROM t
    WHERE foo BETWEEN 1 AND 6
    ORDER BY bar DESC
    FETCH FIRST ROW ONLY;
    

    You can see here we complete in 4.6 ms with an Index Only Scan,

                                                                     QUERY PLAN                                                                  
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=617.64..617.64 rows=1 width=12) (actual time=4.652..4.652 rows=1 loops=1)
       ->  Sort  (cost=617.64..642.97 rows=10134 width=12) (actual time=4.651..4.651 rows=1 loops=1)
             Sort Key: bar DESC
             Sort Method: top-N heapsort  Memory: 25kB
             ->  Index Only Scan using t_foo_bar_idx on t  (cost=0.29..566.97 rows=10134 width=12) (actual time=0.123..3.623 rows=10000 loops=1)
                   Index Cond: ((foo >= 1) AND (foo <= 6))
                   Heap Fetches: 0
     Planning time: 0.144 ms
     Execution time: 4.678 ms
    (9 rows)
    

    Expanding the range to include the whole table, produces another seq scan -- logically, and growing it with another billion rows would produce another Index Scan.

    So in summary,

    • It'll perform fast, for the quantity of data.
    • Fast is relative to the alternative, if the range isn't selective enough a sequential scan may be as fast as you can get.
