Re: benchmarking the query planner

Поиск
Список
Период
Сортировка
От Nathan Boley
Тема Re: benchmarking the query planner
Дата
Msg-id 6fa3b6e20812111735s193b4d70lb2864abf279ede9b@mail.gmail.com
обсуждение исходный текст
Ответ на Re: benchmarking the query planner  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: benchmarking the query planner  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Thanks for the response.

>> Well, ISTM there is a profound difference. For scalarineqsel we care
>> about the total number of values in a bucket. For eqsel  we care about
>> the total number of *distinct* values in each bucket
>
> Really?
>

Well, we would obviously also care about the total number of values in
the buckets if we were trying to use the histogram in eqsel.

Isn't a selectivity estimate of x = v as ( the number of values in v's
histogram bucket ) / ( number of distinct values in v's histogram
bucket ) pretty rational? Thats currently what we do for non-mcv
values, except that we look at ndistinct over the whole table instead
of individual histogram buckets.

>> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use
>> the limited storage in pg_statistic to try and estimate the
>> selectivity for ranges of values rather than a single value?
>
> MCVs are useful for questions that are not related to ranges of values
> --- an example of something highly useful we do with them is to try to
> estimate the fraction of a column that satisfies a LIKE or regex
> pattern.
>

Good point. I guess I was responding to the eqsel benchmarks. I should
remember that I don't necessarily know all the places that mcv's are
used.

> In fact, as I was just pointing out to Bruce, we can compute them and
> do useful things with them for datatypes that don't have a defined sort
> order and so the whole concept of "range" is meaningless.
>

Another good point. But don't we have bigger stat problems for
datatypes without an ordering relation? IIRC, analyze doesn't even
fully compute the mcv list, as that would be N^2 in the sample size.

> Now, if your point is that it'd be more flexible to not tie the MCV list
> length to the histogram length, you're right.

No, my point is just the opposite. I think the two should be *more*
tightly linked. It seems that ( at least for eqsel and scalarineqsel )
mcv's should be the values that the histogram can't deal with
effectively. ie, there's no ordering relation, there are too many
values to fit into a histogram bucket, the histogram eqsel estimate
does an especially bad job for a relatively common value, etc.  Even
now, if there are 100 histogram buckets then any values that occupy
more than 1% of the table will be mcv's regardless - why force a value
to be an mcv if it only occupies 0.1% of the table? That just bloats
pg_statistic and slows down joins unnecessarily.

I'll submit a patch for 8.5 and then, hopefully, some simple
benchmarks can make my point..

-Nathan


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Mostly Harmless: Welcoming our C++ friends
Следующее
От: Tom Lane
Дата:
Сообщение: Re: benchmarking the query planner