Re: left outer join vs subplan

From: Simon Riggs
Subject: Re: left outer join vs subplan
Date:
Msg-id: 1189066797.4175.300.camel@ebony.site
In reply to: left outer join vs subplan  (Teodor Sigaev <teodor@sigaev.ru>)
Responses: Re: left outer join vs subplan  ("Filip Rembiałkowski" <plk.zuber@gmail.com>)
           Re: left outer join vs subplan  (Teodor Sigaev <teodor@sigaev.ru>)
List: pgsql-hackers
On Wed, 2007-09-05 at 20:11 +0400, Teodor Sigaev wrote:

> I found two queries which do the same thing, but they differ greatly in
> execution time: about 10^3 times for the test suite, and it can be 10^5
> times on real data. This is observed on 8.1-current, 8.2-current and CVS
> HEAD. Interestingly, even without the LIMIT clause the two take
> approximately the same time, yet the estimated costs differ by a factor
> of 30. Is there any way to tweak pgsql to produce a more reasonable plan
> for the first query?

Times I get are:
Q1: ~950ms
Q2: ~5ms
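
Q2 isn't quoted in this reply; going by the thread subject, it was
presumably the subplan formulation, roughly like this (a hypothetical
reconstruction; the output column name val_sum is mine):

select a.*,
       (select sum(b.val) from b where b.id = a.id) as val_sum  -- correlated subplan
from a
where a.id > 10000
order by a.addon, a.id
limit 100;

Here the aggregate is computed only for the a rows actually examined, as
index probes on b.id, instead of pre-aggregating every group in b.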

> This query is auto-generated, so real queries may be more complex; I
> chose the simplest example.

I think we need to build up a library of autogenerated queries, so we
can do things that address multiple use cases at once. Can you tell us
more about who/what generated it, so we can research it?

The query formulation does seem a fairly common one.

> First query:
> explain analyze
> select *
> from
>      a
>      left outer join (
>          select b.id, sum(b.val)
>          from b
>          group by b.id
>      ) bagg
>          on bagg.id = a.id
> where
>      a.id > 10000
> order by a.addon, a.id
> limit 100;
>   Limit  (cost=9923.36..9923.61 rows=100 width=20) (actual time=2232.437..2233.273 rows=100 loops=1)
>     ->  Sort  (cost=9923.36..10031.41 rows=43221 width=20) (actual time=2232.428..2232.709 rows=100 loops=1)
>           Sort Key: a.addon, a.id
>           Sort Method:  top-N heapsort  Memory: 24kB
>           ->  Merge Right Join  (cost=0.00..8271.48 rows=43221 width=20) (actual time=313.198..2052.559 rows=40000 loops=1)
>                 Merge Cond: (b.id = a.id)
>                 ->  GroupAggregate  (cost=0.00..5725.41 rows=53292 width=12) (actual time=0.266..1422.522 rows=50000 loops=1)
>                       ->  Index Scan using bidx on b  (cost=0.00..4309.26 rows=150000 width=12) (actual time=0.217..547.402 rows=150000 loops=1)
>                 ->  Index Scan using a1idx on a  (cost=0.00..1256.90 rows=40551 width=8) (actual time=0.171..155.073 rows=40000 loops=1)
>                       Index Cond: (a.id > 10000)
>   Total runtime: 2233.940 ms

The value of sum(b.val) is never used in the query, so the aggregate
itself could be discarded. I suspect there are other conditions you
aren't showing us that would make this impossible?
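
If the sum really can be dropped, the derived table collapses to just the
distinct join keys. A sketch of that rewrite (hypothetical, not from the
thread):

select *
from
     a
     left outer join (
         select distinct b.id     -- aggregate discarded; only the join key survives
         from b
     ) bagg
         on bagg.id = a.id
where
     a.id > 10000
order by a.addon, a.id
limit 100;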

The aggregate prevents the condition bagg.id = a.id from being pushed
down so that we know b.id = a.id. If we knew that then we could use b.id
= ? as an index condition to retrieve the rows.
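
Until the planner can derive that, the restriction can be applied by
hand: since bagg.id = a.id and a.id > 10000, only groups with
b.id > 10000 can ever match. This isn't the parameterised b.id = ? scan
described above, but it shrinks the aggregate's input the same way (a
sketch; valid here because the join is the subquery's only consumer):

select *
from
     a
     left outer join (
         select b.id, sum(b.val)
         from b
         where b.id > 10000       -- derived by hand from bagg.id = a.id
         group by b.id
     ) bagg
         on bagg.id = a.id
where
     a.id > 10000
order by a.addon, a.id
limit 100;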

Since we can't use the best technique, we use another. That then hits a
third optimization problem. When an IndexScan is used to enforce order,
we don't estimate how much of the table needs to be scanned before we
start hitting rows. In the example you give we need to scan 65% of the
table using an IndexScan before we hit any rows. So we would probably be
better off doing a Sort<-SeqScan to apply the condition.
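
A quick way to test that hypothesis is to steer the planner off the
IndexScan for one statement and compare the two plans (a diagnostic
sketch using the standard enable_* settings; not something to leave set):

begin;
set local enable_indexscan = off;   -- disallow plain index scans for this transaction
explain analyze
select *
from
     a
     left outer join (
         select b.id, sum(b.val)
         from b
         group by b.id
     ) bagg
         on bagg.id = a.id
where
     a.id > 10000
order by a.addon, a.id
limit 100;
rollback;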

I think we need to do all 3 eventually.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


