Re: Improving N-Distinct estimation by ANALYZE

Поиск
Список
Период
Сортировка
От Manfred Koizar
Тема Re: Improving N-Distinct estimation by ANALYZE
Дата
Msg-id 50dns19bc4cs20tgg7aq4d3l80ph1f6vjq@4ax.com
обсуждение исходный текст
Ответ на Re: Improving N-Distinct estimation by ANALYZE  (Simon Riggs <simon@2ndquadrant.com>)
Ответы Re: Improving N-Distinct estimation by ANALYZE  (Simon Riggs <simon@2ndquadrant.com>)
Список pgsql-hackers
On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs
<simon@2ndquadrant.com> wrote:
>I enclose a patch for checking out block sampling.

Can't comment on the merits of block sampling and your implementation
thereof.  Just some nitpicking:

|!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite
appropriate.  We introduced two stage sampling at that time.

|   * a random sample of targrows rows (or less, if there are less in the
|!  * sample of blocks). In this case, targblocks is always the same as
|!  * targrows, so we always read one row per block.

This is just wrong, unless you add "on average".  Even then it is a
bit misleading, because in most cases we *read* more tuples than we
use.

|   * Although every row has an equal chance of ending up in the final
|   * sample, this sampling method is not perfect: not every possible
|   * sample has an equal chance of being selected.  For large relations
|   * the number of different blocks represented by the sample tends to be
|!  * too small.  In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

|   * block.  The previous sampling method put too much credence in the row
|   * density near the start of the table.

FYI, "previous" refers to the date mentioned above:
previous == before May 2004 == before two stage sampling.

|+         /* Assume that we will have rows at least 64 bytes wide 
|+          * Currently we're very unlikely to overflow availMem here 
|+          */
|+         if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of sayingif (allocrows * (sizeof(Heaptuple) + 60) > allowedMem)

It doesn't match the comment above; and it is something different if
the size of a pointer is not four bytes.

|-     if (bs.m > 0)
|+     if (bs.m > 0 )

Oops.

|+         ereport(DEBUG2,
|+             (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u", 
|+                         stats->tupattnum,samplerows, nmultiple, f1, d)));
^                          missing space here and in some more places
 

I haven't been following the discussion too closely but didn't you say
that a block sampling algorithm somehow compensates for the fact that
the sample is not random?
ServusManfred


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Re: Improving N-Distinct estimation by ANALYZE
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Improving N-Distinct estimation by ANALYZE