Re: Most efficient way to insert without duplicates

Поиск
Список
Период
Сортировка
От François Beausoleil
Тема Re: Most efficient way to insert without duplicates
Дата
Msg-id 82853183-91B9-4301-B4A9-603D8047FBBE@teksol.info
обсуждение исходный текст
Ответ на Most efficient way to insert without duplicates  (François Beausoleil <francois@teksol.info>)
Ответы Re: Most efficient way to insert without duplicates
Список pgsql-general
Le 2013-04-16 à 22:51, François Beausoleil a écrit :

> Hi all!
>
> I track Twitter followers in my database. I have the following table:
>
> # \d persona_followers
>           Table "public.persona_followers"
>   Column    |            Type             | Modifiers
> -------------+-----------------------------+-----------
> service_id  | bigint                      | not null
> follower_id | bigint                      | not null
> valid_at    | timestamp without time zone |
> Indexes:
>    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)
>
> The table IS NOT partitioned.

A run with the following query:

INSERT INTO "persona_followers"
  SELECT "service_id", "follower_id", now()
  FROM (
    SELECT *
    FROM (
        SELECT DISTINCT "service_id", "follower_id"
        FROM "persona_followers_import"
        WHERE "service_id" IS NOT NULL OR "follower_id" IS NOT NULL
      EXCEPT
         SELECT "service_id", "follower_id" FROM "persona_followers") AS "t1") AS "t1"

results in http://explain.depesz.com/s/Y1c

 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual
time=4729255.535..4729255.535rows=0 loops=1) 
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999
rows=6819520loops=1) 
         Output: t1.service_id, t1.follower_id, now()
         Buffers: shared hit=36891 read=3572263
         ->  HashSetOp Except  (cost=139261.12..20389650.18 rows=6256498 width=16) (actual time=562265.127..566513.759
rows=6819520loops=1) 
               Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, (0)
               Buffers: shared hit=36891 read=3572263
               ->  Append  (cost=139261.12..17054024.97 rows=667125042 width=16) (actual time=4090.462..320879.545
rows=667689321loops=1) 
                     Buffers: shared hit=36891 read=3572263
                     ->  Subquery Scan on "*SELECT* 1"  (cost=139261.12..264391.09 rows=6256498 width=16) (actual
time=4090.461..7798.334rows=6820784 loops=1) 
                           Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, 0
                           Buffers: shared hit=36891
                           ->  HashAggregate  (cost=139261.12..201826.11 rows=6256498 width=16) (actual
time=4090.459..6795.009rows=6820784 loops=1) 
                                 Output: persona_followers_import.service_id, persona_followers_import.follower_id
                                 Buffers: shared hit=36891
                                 ->  Seq Scan on francois.persona_followers_import  (cost=0.00..105137.75 rows=6824675
width=16)(actual time=0.017..1344.916 rows=6824700 loops=1) 
                                       Output: persona_followers_import.service_id,
persona_followers_import.follower_id
                                       Filter: ((persona_followers_import.service_id IS NOT NULL) OR
(persona_followers_import.follower_idIS NOT NULL)) 
                                       Buffers: shared hit=36891
                     ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..16789633.88 rows=660868544 width=16) (actual
time=6.694..238761.499rows=660868537 loops=1) 
                           Output: "*SELECT* 2".service_id, "*SELECT* 2".follower_id, 1
                           Buffers: shared read=3572263
                           ->  Seq Scan on public.persona_followers  (cost=0.00..10180948.44 rows=660868544 width=16)
(actualtime=6.693..137929.808 rows=660868537 loops=1) 
                                 Output: public.persona_followers.service_id, public.persona_followers.follower_id
                                 Buffers: shared read=3572263
 Total runtime: 4729338.157 ms

1h20m for 6.8 million rows inserted. The estimates are spot on, since I had just run VACUUM ANALYZE on both tables
priorto doing this test. Running the original query now, SELECT * FROM ... WHERE NOT EXISTS(...). 

Bye,
François
Вложения

В списке pgsql-general по дате отправления:

Предыдущее
От: Albe Laurenz
Дата:
Сообщение: Re: dataset lock
Следующее
От: Aleksey Tsalolikhin
Дата:
Сообщение: How large can a PostgreSQL database get?