Any idea on how to improve the statistics estimates for this plan?

From: Guillaume Smet
Date:
Hi,

I've been struggling with a query for some time, and the major problem
with it is that the estimates are way off for a particular operation:
 ->  Nested Loop  (cost=3177.72..19172.84 rows=*2* width=112) (actual
time=139.221..603.929 rows=*355331* loops=1)
     Join Filter: (l.location_id = r.location_id)
     ->  Hash Join  (cost=3177.71..7847.52 rows=*33914* width=108)
(actual time=138.343..221.852 rows=*36664* loops=1)
             Hash Cond: (el.location_id = l.location_id)
             ...
     ->  Index Scan using idx_test1 on representations r
(cost=0.01..0.32 rows=*1* width=12) (actual time=0.002..0.008
rows=*10* loops=36664)
             ...
(extracted from the original plan which is quite massive)

I tried increasing the statistics target on l.location_id,
el.location_id, r.location_id and idx_test1.location_id (up to 5000),
but the estimates don't get any better.
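
(For reference, a sketch of how those targets were raised. Only
"representations" is a real table name from the plan; "locations" and
"event_locations" are placeholders standing in for the tables behind
the l and el aliases.)

    -- raise the per-column statistics target, then re-gather stats
    ALTER TABLE locations ALTER COLUMN location_id SET STATISTICS 5000;
    ALTER TABLE event_locations ALTER COLUMN location_id SET STATISTICS 5000;
    ALTER TABLE representations ALTER COLUMN location_id SET STATISTICS 5000;
    ANALYZE locations;
    ANALYZE event_locations;
    ANALYZE representations;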

Any idea on how I could get better estimates in this particular
example, and why the estimate of the nested loop is so wrong while the
ones for each individual operation are quite good?

This is with PostgreSQL 9.2.1.

Thanks.

--
Guillaume


Re: Any idea on how to improve the statistics estimates for this plan?

From: Jeff Janes
Date:
On Wed, Dec 5, 2012 at 11:39 AM, Guillaume Smet
<guillaume.smet@gmail.com> wrote:
> Hi,
>
> I've been struggling with a query for some time, and the major problem
> with it is that the estimates are way off for a particular operation:
>  ->  Nested Loop  (cost=3177.72..19172.84 rows=*2* width=112) (actual
> time=139.221..603.929 rows=*355331* loops=1)
>      Join Filter: (l.location_id = r.location_id)
>      ->  Hash Join  (cost=3177.71..7847.52 rows=*33914* width=108)
> (actual time=138.343..221.852 rows=*36664* loops=1)
>              Hash Cond: (el.location_id = l.location_id)
>              ...
>      ->  Index Scan using idx_test1 on representations r
> (cost=0.01..0.32 rows=*1* width=12) (actual time=0.002..0.008
> rows=*10* loops=36664)
>              ...
> (extracted from the original plan which is quite massive)

Could you reduce the plan size by removing joins that are extraneous
to this specific problem?

> I tried increasing the statistics target on l.location_id,
> el.location_id, r.location_id and idx_test1.location_id (up to 5000),
> but the estimates don't get any better.

If there is a correlation there that PostgreSQL is incapable of
understanding, then no amount of increase in the statistics target is
going to help.

>
> Any idea on how I could get better estimates in this particular
> example, and why the estimate of the nested loop is so wrong while the
> ones for each individual operation are quite good?

The trivial answer to "why" is that it thinks the vast majority of the
33914 rows from the hash join will find no partners in r, but in fact
each has about 10 partners in r.  Why does it think that?  Without
seeing all the join conditions and filter conditions on those tables,
plus the size of each unfiltered pair-wise join, it is hard to
speculate.

If you remove all filters (all members of the "where" which are not
join criteria), then what does the plan look like?
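
Something along these lines, as a sketch (only "representations" is a
real table name from your plan; "event_locations" and "locations" stand
in for the tables behind the el and l aliases):

    -- the bare joins, with every non-join WHERE clause removed
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM event_locations el
    JOIN locations l ON el.location_id = l.location_id
    JOIN representations r ON l.location_id = r.location_id;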

If those estimates are better, it probably means that your filter
condition is picking a part of the "el JOIN l" that has much different
selectivity to r than the full set does, and PostgreSQL has no way of
knowing that.

Cheers,

Jeff


Re: Any idea on how to improve the statistics estimates for this plan?

From: Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> The trivial answer to "why" is that it thinks the vast majority of the
> 33914 rows from the hash join will find no partners in r, but in fact
> each has about 10 partners in r.  Why does it think that?

I'm wondering if maybe the vast majority of the rows indeed have no join
partners, but there are a small number with a large number of partners.
The statistics might miss these, if so.
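
A rough way to check that hypothesis ("representations" is the only
table name visible in the plan; adjust to the real schema):

    -- do a handful of location_ids account for most of the rows?
    SELECT location_id, count(*) AS n
    FROM representations
    GROUP BY location_id
    ORDER BY n DESC
    LIMIT 20;

    -- and what the gathered statistics actually captured
    SELECT most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'representations' AND attname = 'location_id';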

            regards, tom lane


Re: Any idea on how to improve the statistics estimates for this plan?

From: Guillaume Smet
Date:
(cough cough, missed the Reply to all button)

Hi Jeff,

On Sat, Dec 8, 2012 at 3:32 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> If those estimates are better, it probably means that your filter
> condition is picking a part of the "el JOIN l" that has much different
> selectivity to r than the full set does, and PostgreSQL has no way of
> knowing that.

It's certainly that. The fact is that this query is OK on most of the
French territory, but it doesn't go well when you're looking at the
Paris area in particular. As the query is supposed to return the shows
you can book, the selectivity is quite different there: Paris has a lot
of places, AND the places organize a lot more shows in Paris than in
the rest of France. I was hoping that the high number of places would
be enough to compensate for the second fact, which is much harder for
PostgreSQL to account for, but it looks like it's not.

Is there any way I could mitigate this issue by playing with planner
knobs? I don't remember having seen anything I could use for
selectivity (along the lines of the n_distinct override). It's not
that big a deal if the plan gets a little worse elsewhere, as there
are far fewer places outside Paris, so the effects of a bad plan are
more contained.
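
(The n_distinct override I have in mind is the per-column one, shown
below with an arbitrary illustrative value; it only adjusts the
distinct-value estimate, though, not join selectivity:)

    -- a negative n_distinct means a fraction of the row count
    ALTER TABLE representations ALTER COLUMN location_id SET (n_distinct = -0.05);
    ANALYZE representations;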

--
Guillaume


Re: Any idea on how to improve the statistics estimates for this plan?

From: Jeff Janes
Date:
On Sat, Dec 8, 2012 at 5:19 AM, Guillaume Smet <guillaume.smet@gmail.com> wrote:
> Hi Jeff,
>
> On Sat, Dec 8, 2012 at 3:32 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> If those estimates are better, it probably means that your filter
>> condition is picking a part of the "el JOIN l" that has much different
>> selectivity to r than the full set does, and PostgreSQL has no way of
>> knowing that.
>
> It's certainly that. The fact is that this query is OK on most of the
> French territory, but it doesn't go well when you're looking at the
> Paris area in particular. As the query is supposed to return the shows
> you can book, the selectivity is quite different there: Paris has a lot
> of places, AND the places organize a lot more shows in Paris than in
> the rest of France. I was hoping that the high number of places would
> be enough to compensate for the second fact, which is much harder for
> PostgreSQL to account for, but it looks like it's not.
>
> Is there any way I could mitigate this issue by playing with planner
> knobs?

I don't know the answer to that.  But does it matter?  If it knew you
were going to get 300,000 rows rather than 2, would it pick a better
plan?
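
One rough experiment, if you want to find out: steer the planner away
from the nested loop for a single transaction and compare the actual
runtimes (a coarse diagnostic, not a production setting):

    BEGIN;
    SET LOCAL enable_nestloop = off;
    -- run EXPLAIN ANALYZE on the original query here and compare its
    -- runtime against that of the nested-loop plan
    COMMIT;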

Cheers,

Jeff