Re: benchmarking the query planner

Поиск
Список
Период
Сортировка
От Greg Stark
Тема Re: benchmarking the query planner
Дата
Msg-id 4136ffa0812121033o105f8d6bie0f2a87082c77df7@mail.gmail.com
обсуждение исходный текст
Ответ на Re: benchmarking the query planner  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On Fri, Dec 12, 2008 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I seem to recall Greg suggesting that there were ways to estimate
> ndistinct without sorting, but short of a fundamental algorithm change
> there's not going to be a win here.

Not sure if you meant me, but I can say the approach I saw documented
before involved keeping a hash of all values seen in a full table
scan. When the hash reached a maximum size you discard half the hash
key space and from then on ignore any values which hash into that
space. When you reach the end you have a hash table with a random
sample of 1/2^n distinct values (where n is the number of times the
hash table overflowed) and the counts for those values.

If you're just interested in counting distinct values in the sample I
suppose you could do it using a Bloom filter just as we've talked
about for other hash tables. You scan through the list and increment
an ndistinct counter for every value you find where the bits aren't
all already set (then set those bits). That would undercount slightly
and I'm not sure how much faster than qsort it would actually be. The
log(n) factor in qsort's average case is pretty small when we're
talking about small samples.



-- 
greg


В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Robert Haas"
Дата:
Сообщение: Re: benchmarking the query planner
Следующее
От: "David E. Wheeler"
Дата:
Сообщение: Re: WIP: default values for function parameters