Re: Optimization idea

From: Robert Haas
Subject: Re: Optimization idea
Date:
Msg-id: u2n603c8f071004231506v1a46e2e1z45e81a4a56eaacc1@mail.gmail.com
In reply to: Re: Optimization idea  (Cédric Villemain <cedric.villemain.debian@gmail.com>)
Responses: Re: Optimization idea  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
On Fri, Apr 23, 2010 at 3:22 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2010/4/23 Robert Haas <robertmhaas@gmail.com>:
>> On Fri, Apr 23, 2010 at 9:09 AM, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>> 2010/4/23 Robert Haas <robertmhaas@gmail.com>:
>>>> On Thu, Apr 22, 2010 at 10:37 PM, Vlad Arkhipov <arhipov@dc.baikal.ru> wrote:
>>>>> I don't think this is just an issue with statistics, because the same
>>>>> problem arises when I try executing a query like this:
>>>>
>>>> I'm not sure how you think this proves that it isn't a problem with
>>>> statistics, but I think what you should be focusing on here, looking
>>>> back to your original email, is that the plans that are actually much
>>>> faster have almost as much estimated cost as the slower one.  Since
>>>> all your data is probably fully cached, at a first cut, I might try
>>>> setting random_page_cost and seq_page_cost to 0.005 or so, and
>>>> adjusting effective_cache_size to something appropriate.
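
(For reference, the suggestion above as session-level SETs; the cost
values come from the message itself, while the effective_cache_size
figure is only a placeholder to be sized to the machine's actual cache:)

    SET random_page_cost = 0.005;
    SET seq_page_cost = 0.005;
    SET effective_cache_size = '6GB';  -- placeholder; size to available RAM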
>>>
>>> that will help correct the situation, but I think the planner is losing here.
>>
>> Well, what do you think the planner should do differently?
>
> Here the planner just divides the number of rows in the t2 table by
> the number of distinct values of t1.t; that is the rows=20200 we can
> see in the EXPLAIN output.
> That seems to be the expected behavior, but it also looks to me like
> it can be improved. When it estimates the row count as just
> num_rows/n_distinct, the planner *knows* that this is wrong, because
> the most_common_freqs of t2.t say that 99600 of the rows have the
> value 1, and fewer than 200 rows have any other value.
> So in every case the planner makes a (perhaps good) choice while
> being sure its estimates are wrong.
> I wonder if we can improve the planner here.
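
For concreteness, the statistics being discussed live in pg_stats, and
the estimate is plain division. The 101000-row total below is inferred
from rows=20200 times 5 distinct values, so treat it as an assumption:

    -- Inspect the stats the planner uses for t2.t (real pg_stats columns):
    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 't2' AND attname = 't';

    -- The join estimate is just total rows divided by n_distinct:
    SELECT 101000 / 5;  -- = 20200, the rows=20200 seen in EXPLAIN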
>
> In this case, where the number of rows is lower than the stats
> target (in t1.t), perhaps the planner can improve its decision by
> going a bit further and trying a plan for each of the n_distinct
> values in t2.t.
>
> I don't have a very clear idea of how to do that, but it may be
> better if the planner estimates whether its plan is 100% (or lower,
> just an idea) sure to happen; if so, fine, else it tries another
> plan.
>
> in this test case, if the query is:
> select *
> from t2
> join t1 on t1.t = t2.t
> where t1.id = X;
>
> if X=1 then the planner has a 20% chance that the row count is 99600
> and an 80% chance that it is 200 or less; by providing rows=20200,
> how can it find the good plan anyway? Is it better to start with a
> bad estimate and perhaps find a good plan, or to start with an
> estimate which may be bad but leads to a good plan in more than XX%
> of the cases?
>
> So, currently, the planner does as expected, but can we try another
> approach for those corner cases?
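To spell out the arithmetic behind that 20%/80% point (the counts for
the four rare values are assumptions, chosen to sum to the inferred
101000-row total):

    -- Expected row count when each of the 5 values of t1.t is equally likely:
    SELECT (99600 + 4 * 350) / 5.0;  -- = 20200, exactly the planner's estimate
    -- Yet no single execution ever returns ~20200 rows: it is either
    -- ~99600 rows (20% of the time) or a few hundred (80% of the time).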

Hmm.  We currently have a heuristic that we don't record a value as an
MCV unless it's more frequent than the average frequency.  When the
number of MCVs is substantially smaller than the number of distinct
values in the table this is probably a good heuristic, since it
prevents us from bothering with the recording of some values that are
probably only marginally more interesting than other values we don't
have space to record.  But if ndistinct is less than the stats target
we could in theory record every value we find in the MCVs table and
leave the histogram empty.  Not sure if that would be better in
general, or not, but it's a thought.
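
A quick way to watch the current heuristic on this test case (the
commands and pg_stats columns are real; the table and column names are
taken from the thread):

    -- Raise the stats target well above ndistinct (= 5 here) and re-analyze:
    ALTER TABLE t2 ALTER COLUMN t SET STATISTICS 100;
    ANALYZE t2;

    SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds
    FROM pg_stats
    WHERE tablename = 't2' AND attname = 't';
    -- Under the current rule, values rarer than the average frequency
    -- (everything except t = 1) can still be left out of the MCV list,
    -- even though ndistinct is far below the stats target.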

...Robert
