Update Perrformance with nested loop vs. merge-join

Поиск
Список
Период
Сортировка
От Nathan Boley
Тема Update Perrformance with nested loop vs. merge-join
Дата
Msg-id f943525a0612051701j48b9b37s8b38062366ae4b64@mail.gmail.com
обсуждение исходный текст
Ответы Re: Update Perrformance with nested loop vs. merge-join  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-general
Hi,

I just created a new column on a relatively large table (~ 2 mil rows over 400 mb in tablesize) and am trying to populate it from another much smaller table by using an update query .
The purpose of the query to set a bool flag if the user for the action on table1 is the last user, as determined from table 2.

The large table has a two column primary key and one of the columns is the primary key on the smaller table.

My original update query is as follows:

update table1
    set lastuser = (table1.user = table2.lastuser)::bool from table2
    where table1.pkeycolumn1 = table2.pkey

which has an explain output of

Merge Join  (cost=883182.60..1024622.03 rows=6873573 width=89)
  Merge Cond: ("outer".pkeycolumn1 = "inner".pkey)
  ->  Sort  (cost=93387.50..94711.01 rows=529404 width=18)
        Sort Key: table2.pkey
        ->  Seq Scan on table2 (cost=0.00..15037.04 rows=529404 width=18)
  ->  Sort  (cost=789795.10..795605.39 rows=2324116 width=81)
        Sort Key: table1.pkeycolumn1
        ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)

This seems like it *should* be the best query to me, but when I try and run it it takes forever.
However, when I add a subselect clause to the end to force the planner to use a nested loop
select on table2, the query runs MUCH faster (1 hour instead of > 10, I never let it finish)

Here is the query:

update table1
    set lastuser = (table1.user = table2.lastuser)::bool from table2
    where table1.pkeycolumn1 = table2.pkey
    and table1.pkeycolumn1 is in ( select pkey from table3 limit 1000000)

where table1.pkeycolumn1 is a foreign key in table3.pkey. Also, note that the num of rows in table 3 is *much* less than 1000000.
Also, the limit clause is necessary to force the planner into using the nested loop. (Aside: Is there a better way to do this inside of a query?)
Here is the explain for the above query:

Hash Join  (cost=13863.09..109298.79 rows=51388 width=89)
  Hash Cond: ("outer".pkeycolumn1 = "inner".pkey)
  ->  Seq Scan on table1 (cost=0.00..71900.16 rows=2324116 width=81)
  ->  Hash  (cost=13854.99..13854.99 rows=1621 width=26)
        ->  Nested Loop  (cost=8205.72..13854.99 rows=1621 width=26)
              ->  HashAggregate  (cost=8205.72..8207.72 rows=200 width=8)
                    ->  Limit  (cost=0.00..5891.43 rows=185143 width=8)
                          ->  Seq Scan on table3 (cost=0.00..5891.43 rows=185143 width=8)
              ->  Index Scan using table2-pkey-index" on table2  (cost= 0.00..28.14 rows=8 width=18)
                    Index Cond: ("outer".pkey = table2.pkey)


My only theory up to this point is that it has something to do with how the tablespace is being cached.  I notice that when I use plan 1, my
computer goes through long periods of io and with bursts of processor activity every minute or so. When I use plan 2, the io sits at about 90%
of my total resource usage while my normal processor usage sits at about 10%. Maybe it keeps trying to cache and resort table2? I don't really
have any idea, but that is my only guess. 

If anyone knows why this may be happenning, I would really appreciate it.

Thanks,
Nathan


В списке pgsql-general по дате отправления:

Предыдущее
От: km
Дата:
Сообщение: dynamic SQL - variable substitution in plpgsql
Следующее
От: "Schwenker, Stephen"
Дата:
Сообщение: Re: HELP: Urgent, Vacuum problem