Re: benchmarking the query planner

From: Robert Haas
Subject: Re: benchmarking the query planner
Date:
Msg-id: 603c8f070812120414g2daba35egb10cc5d7cc2d7dee@mail.gmail.com
In reply to: Re: benchmarking the query planner  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: benchmarking the query planner  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>> I had this idle thought too, but I didn't write it down because...
>
>>> ought to be, but it seems like it ought to be possible to determine
>>> that given a desired maximum error in the overall estimate.  I'm also
>>> not very clear on what the "total frequency" computations (matchfreq2
>>> and unmatchfreq2 in the current code) ought to look like if we are using
>>> a variable subset of the inner list.
>
>> ...of this exact concern, which I think is an insurmountable problem.
>
> Maybe so.  If we stick to the other design (end both lists at a preset
> frequency threshold) then the math clearly goes through the same as
> before, just with num_mcvs that are determined differently.  But can
> we prove anything about the maximum error added from that?

I don't think so, because in that design, it's entirely possible that
you'll throw away the entire MCV list if all of the entries are below
the threshold (as in the example we were just benchmarking, supposing
a threshold of 0.001).

An alternative is to pick a threshold T for the maximum number of
equality probes that you're willing to suffer through.  Then given two
MCV lists of lengths M1 and M2 such that M1 * M2 > T, pick the largest
N such that MIN(M1, N) * MIN(M2, N) <= T.  This guarantees that you
always use at least T^(1/2) MCVs from each list (or the whole list,
when it has fewer entries than that).
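
A minimal sketch in C of how that selection could work
(choose_mcv_limit is a hypothetical name, not actual planner code):

    #include <stdio.h>

    static long imin(long a, long b) { return a < b ? a : b; }

    /*
     * Hypothetical sketch: find the largest n such that
     * MIN(m1, n) * MIN(m2, n) <= t, i.e. the per-list cap that keeps
     * the number of equality probes within the budget t.
     */
    static long choose_mcv_limit(long m1, long m2, long t)
    {
        long n = 1;

        while (n < m1 || n < m2)
        {
            /* stop before n + 1 would exceed the probe budget */
            if (imin(m1, n + 1) * imin(m2, n + 1) > t)
                break;
            n++;
        }
        return n;
    }

    int main(void)
    {
        /* two 1000-entry MCV lists, as in the benchmark case */
        printf("%ld\n", choose_mcv_limit(1000, 1000, 1000000)); /* 1000 */
        printf("%ld\n", choose_mcv_limit(1000, 1000, 100000));  /*  316 */
        return 0;
    }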

Compared with simply chopping off the MCV list at p = 0.001, this
approach with T = 10^6 will be more accurate at the cost of more
comparisons.  For example, in our test case where all the comparisons
fail, chopping off the MCV list at p = 0.001 results in ignoring it
completely, whereas with this approach we use all 1000 entries just as
before.  So it might be appropriate to choose a lower threshold, say
T = 10^5, since otherwise we're not preventing any computation in
cases like this one.  (I suppose you could even make this a GUC, since
any choice of value is going to be pretty arbitrary...)
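
To make that concrete: with two 1000-entry MCV lists, T = 10^6 gives
N = 1000 (all 10^6 probes, nothing trimmed), while T = 10^5 gives
N = 316, i.e. 316 * 316 = 99856 probes, roughly a tenfold reduction.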

I'm not sure to what extent we can bound the amount of error with this
approach, but it's definitely better.  As we've seen, a frequency
cutoff can throw away the entire MCV list; this approach always
retains at least T^(1/2) entries, and more if the other list happens
to be shorter than T^(1/2).  So we can say that the result will never
be worse than it would have been had you set the statistics target to
T^(1/2).

...Robert

