I have the following function, which I want to use to update the status and distance columns.
CREATE OR REPLACE FUNCTION public.check_deviation_table()
  RETURNS void
  LANGUAGE plpgsql
  COST 100
  VOLATILE
AS $BODY$
DECLARE
  status integer;
  geoms integer;
  distance float;
  selected_rotue_name text;
BEGIN
  -- Step 1: fill in the distance column with the minimum distance
  -- (in SRID 3857 units) from each ping point to the route line of its order.
  UPDATE ping_data
  SET    distance = sub.distance
  FROM  (
     SELECT min(ST_Distance(
                  ST_SetSRID(ST_Transform(rl.geom, 3857), 3857)
                , ST_SetSRID(ST_Transform(ST_SetSRID(ST_MakePoint(pd.longitude, pd.latitude), 4326), 3857), 3857)
                )) AS distance
          , pd.id
     FROM   ping_data  pd
     JOIN   order_data od ON od."orderNumber" = pd.ordernumber
     JOIN   route_line rl ON od."pickupLocName" || '-' || od."deliveryLocName" = rl."route name"
     WHERE  pd.distance IS NULL
     GROUP  BY pd.id
     ) sub
  WHERE ping_data.id = sub.id;

  -- Steps 2 and 3: classify pings as within / beyond the 3000 threshold.
  UPDATE ping_data SET status = 'No deviation'
  WHERE  ping_data.status IS NULL AND ping_data.distance <= '3000'::int::float;

  UPDATE ping_data SET status = 'Deviated'
  WHERE  ping_data.status IS NULL AND ping_data.distance > '3000'::int::float;
END
$BODY$;
It runs fine on a few thousand records, but when I run it on the whole ping_data table, which has 642,810 records, it just keeps running. I have also created the following indexes on the tables involved, but to no avail.
create index btree_ping on ping_data using btree(id);
create index gist_route on route_line using gist(geom);
Please refer to the following details:
Postgres version: PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit
PostGIS version: 3.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
Machine details: Intel i5, 16 GB RAM, Windows 10
Shared below are the schemas of the tables involved:
CREATE TABLE public.route_line
(
"route name" character varying COLLATE pg_catalog."default",
geom geometry,
routeid integer NOT NULL DEFAULT nextval('route_line_routeid_seq'::regclass),
"pickup location" text COLLATE pg_catalog."default",
"destination location" text COLLATE pg_catalog."default",
CONSTRAINT route_line_pkey PRIMARY KEY (routeid)
);
CREATE TABLE public.ping_data
(
latitude double precision,
longitude double precision,
pingdt character varying COLLATE pg_catalog."default",
shipmentid integer,
addedat timestamp with time zone DEFAULT now(),
ordernumber bigint,
id integer NOT NULL DEFAULT nextval('ping_data_id_seq'::regclass),
ping_dt timestamp with time zone,
status text COLLATE pg_catalog."default",
distance double precision
);
CREATE TABLE public.order_data
(
"assetNumber" text COLLATE pg_catalog."default",
"createDt" timestamp with time zone,
"deliveryLocName" text COLLATE pg_catalog."default",
"orderNumber" bigint,
"pickupLocName" text COLLATE pg_catalog."default",
"releaseNumber" bigint,
assigned character varying(1) COLLATE pg_catalog."default" DEFAULT '0'::character varying,
city text COLLATE pg_catalog."default",
transportername text COLLATE pg_catalog."default"
);
@ErwinBrandstetter please refer to the EXPLAIN ANALYZE result below for the updated query you shared.
Update on ping_data p (cost=2199987.43..6027475.29 rows=97076 width=162) (actual time=6014063.492..6014063.861 rows=0 loops=1)
  ->  Hash Join (cost=2199987.43..6027475.29 rows=97076 width=162) (actual time=168649.223..5998253.555 rows=545483 loops=1)
        Hash Cond: (sub.id = p.id)
        ->  Subquery Scan on sub (cost=2144323.35..5958823.04 rows=97076 width=48) (actual time=167552.663..5987759.840 rows=545483 loops=1)
              ->  GroupAggregate (cost=2144323.35..5957852.28 rows=97076 width=12) (actual time=167552.647..5984534.784 rows=545483 loops=1)
                    Group Key: pd.id
                    ->  Sort (cost=2144323.35..2144630.74 rows=122956 width=13212) (actual time=165047.817..177245.569 rows=891149 loops=1)
                          Sort Key: pd.id
                          Sort Method: external merge Disk: 2274960kB
                          ->  Hash Join (cost=3815.85..47337.28 rows=122956 width=13212) (actual time=1134.751..2381.401 rows=891149 loops=1)
                                Hash Cond: (pd.ordernumber = od."orderNumber")
                                ->  Seq Scan on ping_data pd (cost=0.00..38132.37 rows=97076 width=28) (actual time=0.122..381.017 rows=642810 loops=1)
                                      Filter: (distance IS NULL)
                                ->  Hash (cost=1451.69..1451.69 rows=1453 width=13200) (actual time=1134.316..1134.399 rows=1632 loops=1)
                                      Buckets: 512 Batches: 8 Memory Usage: 576kB
                                      ->  Hash Join (cost=46.93..1451.69 rows=1453 width=13200) (actual time=159.161..1119.843 rows=1632 loops=1)
                                            Hash Cond: ((rl."route name")::text = ((od."pickupLocName" || '-'::text) || od."deliveryLocName"))
                                            ->  Seq Scan on route_line rl (cost=0.00..958.54 rows=954 width=13210) (actual time=25.129..121.613 rows=954 loops=1)
                                            ->  Hash (cost=31.97..31.97 rows=1197 width=51) (actual time=11.256..11.259 rows=1197 loops=1)
                                                  Buckets: 2048 Batches: 1 Memory Usage: 114kB
                                                  ->  Seq Scan on order_data od (cost=0.00..31.97 rows=1197 width=51) (actual time=9.379..10.686 rows=1197 loops=1)
        ->  Hash (cost=38132.37..38132.37 rows=645737 width=89) (actual time=1042.800..1042.801 rows=642810 loops=1)
              Buckets: 32768 Batches: 32 Memory Usage: 2704kB
              ->  Seq Scan on ping_data p (cost=0.00..38132.37 rows=645737 width=89) (actual time=0.135..674.748 rows=642810 loops=1)
Planning Time: 155.156 ms
Execution Time: 6016315.903 ms
Can someone help me optimize this function?
This does the same as your original:
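A minimal sketch of such a rewrite (not the answer's literal code; it keeps the function name, the 3000 threshold, and the join logic from the question, and assumes a single UPDATE inside a plain SQL function):

CREATE OR REPLACE FUNCTION public.check_deviation_table()
  RETURNS void
  LANGUAGE sql AS
$func$
   -- One UPDATE sets both columns, so each qualifying row is written only once.
   UPDATE ping_data p
   SET    distance = sub.distance
        , status   = CASE WHEN sub.distance > 3000 THEN 'Deviated' ELSE 'No deviation' END
   FROM  (
      -- Minimum ping-to-route distance in SRID 3857 units.
      SELECT pd.id
           , min(ST_Distance(
                   ST_Transform(rl.geom, 3857)
                 , ST_Transform(ST_SetSRID(ST_MakePoint(pd.longitude, pd.latitude), 4326), 3857)
                 )) AS distance
      FROM   ping_data  pd
      JOIN   order_data od ON od."orderNumber" = pd.ordernumber
      JOIN   route_line rl ON rl."route name" = od."pickupLocName" || '-' || od."deliveryLocName"
      WHERE  pd.distance IS NULL
      GROUP  BY pd.id
      ) sub
   WHERE  p.id = sub.id;
   -- Note: unlike the original three-step version, this also (re)sets status for
   -- any row whose distance was still NULL, regardless of its current status.
$func$;

Computing distance and status together means each affected row is rewritten once instead of twice.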
It should already be considerably faster.
Remove all the useless declared variables. Then there is nothing left that requires PL/pgSQL. Use a simpler SQL function, or just the plain query.
Most importantly, use a single UPDATE instead of three. Your original writes two new row versions per row instead of just one, and runs several sequential scans instead of one. (Indexes are probably useless for this, or even make it more expensive, but that depends ...) There is probably more potential, but I'll stop here without more details.
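One more hedged note, taken from the plan above rather than from the answer: the big sort spills roughly 2.2 GB to disk (Sort Method: external merge Disk: 2274960kB), so giving the session that runs the update more sort memory may help. A sketch; the value is only an illustrative assumption for a 16 GB machine:

SET work_mem = '256MB';                 -- illustrative value, applies to this session only
SELECT public.check_deviation_table();  -- run the rewritten single-UPDATE function
RESET work_mem;                         -- return to the server default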