Re: O(samplesize) tuple sampling, proof of concept

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: O(samplesize) tuple sampling, proof of concept
Дата
Msg-id 14645.1081193827@sss.pgh.pa.us
обсуждение исходный текст
Ответ на O(samplesize) tuple sampling, proof of concept  (Manfred Koizar <mkoi-pg@aon.at>)
Ответы Re: O(samplesize) tuple sampling, proof of concept  (Manfred Koizar <mkoi-pg@aon.at>)
Список pgsql-patches
Manfred Koizar <mkoi-pg@aon.at> writes:
> The old sampling method is still there.  I have added a GUC variable
> sampling_method to make testing and benchmarking easier.

I wouldn't bother with a GUC variable for the production patch.

> Once a block is selected for inspection, all tuples of this
> block are accessed to get a good estimation of the live : dead row
> ratio.

Why are you bothering to compute the live:dead ratio?  The statistics
model doesn't have any place for a dead-tuples count, so there's no need
to think about anything but live tuples.  (This is a historical
artifact, perhaps, arising from the fact that originally analysis
happened after VACUUM FULL and so the number of dead tuples was
guaranteed to be zero at the time.  But still, I'm not sure how we'd
factor a dead-tuples count into the estimates if we had it.)

> Because I was afraid that method 1 might be too expensive in terms of
> CPU cycles, I implemented a small variation that skips tuples without
> checking them for visibility; this is sampling_method 2.

And?  Does it matter?  (I'd guess not in practice, as checking a tuple
for visibility is cheap if someone's already marked it good.)

> At 1000 pages there is only a difference of approximately 1%,
> at 3000 pages the difference has its maximum of ca. 15%.  We can sell
> this by pointing to the better quality of the statistics.

If that's as bad as it gets I think we are OK.  You should redo the test
with larger sample size though (try stats target = 100) to see if the
answer changes.

> Are there any spots in the documentation that
> should be updated?

AFAIR there is noplace that specifically needs to be changed.

> diff -ruN ../base/src/backend/commands/analyze.c src/backend/commands/analyze.c

I find -u diffs close to unreadable for reviewing purposes.  Please
submit diffs in -c format in future.

> +    /*
> +     * If we ran out of tuples then we're done.  No sort is needed,
> +     * since they're already in order.
> +     *
> +     * Otherwise we need to sort the collected tuples by position
> +     * (itempointer).  I don't care for corner cases where the tuples
> +     * are already sorted.
> +     */
> +    if (numrows == targrows)
> +        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

AFAICS the rows will *always* be sorted already, and so this qsort is an
utter waste of cycles.  No?

            regards, tom lane

В списке pgsql-patches по дате отправления:

Предыдущее
От: Fabien COELHO
Дата:
Сообщение: Re: hint infrastructure setup (v3)
Следующее
От: Tom Lane
Дата:
Сообщение: Re: O(samplesize) tuple sampling, proof of concept