Re: ANALYZE sampling is too good

Поиск
Список
Период
Сортировка
От Gavin Flower
Тема Re: ANALYZE sampling is too good
Дата
Msg-id 52A8B998.5040602@archidevsys.co.nz
обсуждение исходный текст
Ответ на Re: ANALYZE sampling is too good  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Ответы Re: ANALYZE sampling is too good
Re: ANALYZE sampling is too good
Список pgsql-hackers
On 12/12/13 07:22, Gavin Flower wrote:
> On 12/12/13 06:22, Tom Lane wrote:
>> I wrote:
>>> Hm.  You can only take N rows from a block if there actually are at 
>>> least
>>> N rows in the block.  So the sampling rule I suppose you are using is
>>> "select up to N rows from each sampled block" --- and that is going to
>>> favor the contents of blocks containing narrower-than-average rows.
>> Oh, no, wait: that's backwards.  (I plead insufficient caffeine.)
>> Actually, this sampling rule discriminates *against* blocks with
>> narrower rows.  You previously argued, correctly I think, that
>> sampling all rows on each page introduces no new bias because row
>> width cancels out across all sampled pages.  However, if you just
>> include up to N rows from each page, then rows on pages with more
>> than N rows have a lower probability of being selected, but there's
>> no such bias against wider rows.  This explains why you saw smaller
>> values of "i" being undersampled.
>>
>> Had you run the test series all the way up to the max number of
>> tuples per block, which is probably a couple hundred in this test,
>> I think you'd have seen the bias go away again.  But the takeaway
>> point is that we have to sample all tuples per page, not just a
>> limited number of them, if we want to change it like this.
>>
>>             regards, tom lane
>>
>>
> Surely we want to sample a 'constant fraction' (obviously, in practice 
> you have to sample an integral number of rows in a page!) of rows per 
> page? The simplest way, as Tom suggests, is to use all the rows in a 
> page.
>
> However, if you wanted the same number of rows from a greater number 
> of pages, you could (for example) select a quarter of the rows from 
> each page.  In which case, when this is a fractional number: take the 
> integral number of rows, plus on extra row with a probability equal to 
> the fraction (here 0.25).
>
> Either way, if it is determined that you need N rows, then keep 
> selecting pages at random (but never use the same page more than once) 
> until you have at least N rows.
>
>
> Cheers,
> Gavin
>
>
>
Yes the fraction/probability, could actually be one of: 0.25, 0.50, 0.75.

But there is a bias introduced by the arithmetic average size of the 
rows in a page. This results in block sampling favouring large rows, as 
they are in a larger proportion of pages.

For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes, 
using 400 byte pages.  In the pathologically worst case, assuming 
maximum packing density and no page has both types: the large rows would 
occupy  500 pages and the smaller rows 50 pages. So if one selected 11 
pages at random, you get about 10 pages of large rows and about one for 
small rows!  In practice, it would be much less extreme - for a start, 
not all blocks will be fully packed, most blocks would have both types 
of rows, and there is usually greater variation in row size - but still 
a bias towards sampling larger rows.  So somehow, this bias needs to be 
counteracted.


Cheers,
Gavin





В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: preserving forensic information when we freeze
Следующее
От: Robert Haas
Дата:
Сообщение: Re: -d option for pg_isready is broken