Re: Tuple sampling

From Tom Lane
Subject Re: Tuple sampling
Date
Msg-id 28169.1085347956@sss.pgh.pa.us
In response to Tuple sampling  (Manfred Koizar <mkoi-pg@aon.at>)
Responses Re: Tuple sampling  (Bruno Wolff III <bruno@wolff.to>)
Re: Tuple sampling  (Manfred Koizar <mkoi-pg@aon.at>)
List pgsql-patches
Manfred Koizar <mkoi-pg@aon.at> writes:
> This patch implements the new tuple sampling method as discussed on
> -hackers and -performance a few weeks ago.

Applied with minor editorializations.  AFAICS get_next_S() needs to be
called with the number of tuples already processed, which means you were
off-by-one --- this surely makes only a trivial difference in the
probabilities, but if we are going to use Vitter's algorithm then we may
as well get it right.  Also, I took out the TupleCount typedef and went
back to using doubles for the tuple counts; this is more consistent with
the coding style used elsewhere, and I really doubt that it's any
slower.  (The datatype conversions induced inside get_next_S are likely
to outweigh any savings from counting by ints, on most modern hardware.)
Plus the justification for assuming it couldn't overflow seems weak to
me; the current limitation to 300000 requested sample rows is very
arbitrary and could change anytime.
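
[Editor's illustration, not the applied patch.] The off-by-one point is easier to see in a small sketch of reservoir sampling with a skip function in the style of Vitter's Algorithm X. The names next_skip and reservoir_sample are hypothetical, and Python is used only for brevity; the sketch assumes the skip generator takes t, the number of records already processed, as its argument, and it omits the faster method Vitter describes for large t.

```python
import random


def next_skip(t, n, rng):
    """Return how many records to skip before the next replacement.

    t is the number of records ALREADY processed (t >= n) and n is the
    reservoir size. Passing t + 1 or t - 1 here is the kind of
    off-by-one that shifts the selection probabilities slightly.
    """
    v = rng.random()
    s = 0
    t += 1                      # index of the next candidate record
    quot = (t - n) / t          # probability that this record is skipped
    while quot > v:
        s += 1
        t += 1
        quot *= (t - n) / t
    return s


def reservoir_sample(stream, n, rng=None):
    """Keep a uniform random sample of up to n items from an iterable."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if rng is None:
        rng = random.Random()
    reservoir = []
    t = 0                       # records processed so far
    skip = 0
    for item in stream:
        if t < n:
            # The first n records fill the reservoir unconditionally.
            reservoir.append(item)
            t += 1
            if t == n:
                skip = next_skip(t, n, rng)
        elif skip > 0:
            skip -= 1
            t += 1
        else:
            # This record replaces a randomly chosen reservoir slot.
            reservoir[rng.randrange(n)] = item
            t += 1
            skip = next_skip(t, n, rng)
    return reservoir
```

With t equal to the number of records already seen, the first candidate after the reservoir fills is record t + 1, which is kept with probability n / (t + 1).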

I was initially convinced that your implementation of Knuth's algorithm
S was all wet, so now there's a bunch of comments explaining why it's
actually correct...
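
[Editor's illustration, not the applied patch.] For reference, a textbook form of Knuth's Algorithm S (selection sampling) can be sketched as follows. The function name select_sample and its parameters are hypothetical; the sketch assumes the population size is known in advance, which is what lets the method return exactly the requested number of items.

```python
import random


def select_sample(population_size, sample_size, rng=None):
    """Pick exactly sample_size indexes from range(population_size).

    Each index is examined once, in order. It is selected with
    probability (items still needed) / (items still remaining).
    """
    if not 0 <= sample_size <= population_size:
        raise ValueError("need 0 <= sample_size <= population_size")
    if rng is None:
        rng = random.Random()
    chosen = []
    for t in range(population_size):
        remaining = population_size - t      # unexamined items, this one included
        needed = sample_size - len(chosen)   # selections still to be made
        # random() lies in [0, 1), so when needed == remaining the test
        # always passes, and when needed == 0 it never passes.
        if rng.random() * remaining < needed:
            chosen.append(t)
    return chosen
```

The selected indexes come out in ascending order, so the items can be visited in a single sequential pass.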

            regards, tom lane
