Re: ANALYZE sampling is too good

From: Gavin Flower
Subject: Re: ANALYZE sampling is too good
Date:
Msg-id: 52A8BDF4.4090408@archidevsys.co.nz
In reply to: Re: ANALYZE sampling is too good  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
List: pgsql-hackers
<div class="moz-cite-prefix">On 12/12/13 08:14, Gavin Flower wrote:<br /></div><blockquote
cite="mid:52A8B998.5040602@archidevsys.co.nz"type="cite">On 12/12/13 07:22, Gavin Flower wrote: <br /><blockquote
type="cite">On12/12/13 06:22, Tom Lane wrote: <br /><blockquote type="cite">I wrote: <br /><blockquote type="cite">Hm. 
Youcan only take N rows from a block if there actually are at least <br /> N rows in the block.  So the sampling rule I
supposeyou are using is <br /> "select up to N rows from each sampled block" --- and that is going to <br /> favor the
contentsof blocks containing narrower-than-average rows. <br /></blockquote> Oh, no, wait: that's backwards.  (I plead
insufficientcaffeine.) <br /> Actually, this sampling rule discriminates *against* blocks with <br /> narrower rows. 
Youpreviously argued, correctly I think, that <br /> sampling all rows on each page introduces no new bias because row
<br/> width cancels out across all sampled pages.  However, if you just <br /> include up to N rows from each page,
thenrows on pages with more <br /> than N rows have a lower probability of being selected, but there's <br /> no such
biasagainst wider rows.  This explains why you saw smaller <br /> values of "i" being undersampled. <br /><br /> Had
yourun the test series all the way up to the max number of <br /> tuples per block, which is probably a couple hundred
inthis test, <br /> I think you'd have seen the bias go away again.  But the takeaway <br /> point is that we have to
sampleall tuples per page, not just a <br /> limited number of them, if we want to change it like this. <br /><br />
           regards, tom lane <br /><br /><br /></blockquote> Surely we want to sample a 'constant fraction' (obviously,
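
To make Tom's point above concrete, here is a rough Python sketch (the
page layout and tuple counts below are invented for illustration, not
taken from the test under discussion).  Capping the per-page sample
gives every page the same weight regardless of how many tuples it holds:

    import random

    # Invented table: 500 pages hold 10 'wide' tuples each, 500 pages hold
    # 100 'narrow' tuples each, so ~91% of all tuples are 'narrow'.
    pages = [['wide'] * 10 for _ in range(500)] + \
            [['narrow'] * 100 for _ in range(500)]

    def narrow_fraction(cap, n_pages=100, trials=100):
        """Average fraction of 'narrow' tuples in the sample when at most
        `cap` tuples are taken from each randomly chosen page (cap=None
        means take every tuple on the page)."""
        fracs = []
        for _ in range(trials):
            picked = []
            for page in random.sample(pages, n_pages):
                k = len(page) if cap is None else min(cap, len(page))
                picked.extend(random.sample(page, k))
            fracs.append(picked.count('narrow') / len(picked))
        return sum(fracs) / trials

    print("up to 5 rows/page:", narrow_fraction(5))     # ~0.50: narrow rows undersampled
    print("all rows/page:    ", narrow_fraction(None))  # ~0.91: matches the table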
>> Surely we want to sample a 'constant fraction' (obviously, in
>> practice you have to sample an integral number of rows in a page!) of
>> rows per page?  The simplest way, as Tom suggests, is to use all the
>> rows in a page.
>>
>> However, if you wanted the same number of rows from a greater number
>> of pages, you could (for example) select a quarter of the rows from
>> each page.  In which case, when this is a fractional number: take the
>> integral number of rows, plus one extra row with a probability equal
>> to the fraction (here 0.25).
>>
>> Either way, if it is determined that you need N rows, then keep
>> selecting pages at random (but never use the same page more than
>> once) until you have at least N rows.
>>
>> Cheers,
>> Gavin
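
A minimal sketch of that "integral number of rows plus one probabilistic
extra" rule (my reading: the extra row is taken with probability equal
to the fractional remainder, so the expected per-page sample size is
exactly fraction * rows):

    import math
    import random

    def rows_to_take(rows_on_page, fraction=0.25):
        """Integer part of fraction * rows_on_page, plus one extra row
        with probability equal to the leftover fraction (randomised
        rounding, so the expected count is exactly fraction * rows_on_page)."""
        target = fraction * rows_on_page
        n = math.floor(target)
        if random.random() < target - n:
            n += 1
        return n

    # A page of 10 rows yields 2 or 3 rows, averaging 2.5 over many pages.
    print([rows_to_take(10) for _ in range(10)])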
> Yes, the fraction/probability could actually be one of: 0.25, 0.50,
> 0.75.
>
> But there is a bias introduced by the arithmetic average size of the
> rows in a page.  This results in block sampling favouring large rows,
> as they are in a larger proportion of pages.
>
> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
> using 400 byte pages.  In the pathologically worst case, assuming
> maximum packing density and no page having both types: the large rows
> would occupy 500 pages and the smaller rows 50 pages.  So if one
> selected 11 pages at random, you would get about 10 pages of large
> rows and about one page of small rows!  In practice, it would be much
> less extreme - for a start, not all blocks will be fully packed, most
> blocks would have both types of rows, and there is usually greater
> variation in row size - but still a bias towards sampling larger rows.
> So somehow, this bias needs to be counteracted.
>
> Cheers,
> Gavin
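
To put rough numbers on that worked example, a quick Python sketch under
the same idealised assumptions (fully packed 400-byte pages, no page
mixing the two row sizes):

    import random

    PAGE_SIZE = 400
    large_pages = 1000 // (PAGE_SIZE // 200)   # 1000 rows of 200 bytes -> 500 pages
    small_pages = 1000 // (PAGE_SIZE // 20)    # 1000 rows of 20 bytes  ->  50 pages
    pages = ['large'] * large_pages + ['small'] * small_pages

    # Expected split when 11 pages are drawn at random: ~10 large, ~1 small.
    print("expected:", 11 * large_pages / len(pages), "large,",
          11 * small_pages / len(pages), "small")

    # The same thing empirically:
    draws = [random.sample(pages, 11).count('large') for _ in range(2000)]
    print(sum(draws) / len(draws), "pages of large rows per 11 drawn, on average")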
Actually, I just thought of a possible way to overcome the bias towards
large rows:

1. Calculate the size of the smallest row (a rough estimate may be
   sufficient, if not too 'rough').

2. Select a page at random (never selecting the same page twice).

3. Then select rows at random within the page (never selecting the same
   row twice).  For each row selected, accept it with probability equal
   to (size of smallest row)/(size of selected row) - see the sketch
   after this list.  I think you'll find that will almost completely
   offset the bias towards larger rows!

4. If you do not have sufficient rows, and you still have pages not yet
   selected, go to 2.
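
A literal Python sketch of steps 1-4, purely to make the procedure
concrete (the row sizes and the stopping target below are made up):

    import random

    def sample_rows(pages, target, smallest_row_size):
        """pages: list of pages, each page a list of row sizes in bytes.
        Visits pages in random order without repeats (step 2), walks the
        rows of each page in random order without repeats (step 3),
        accepting each row with probability smallest_row_size / row_size,
        and stops once `target` rows have been accepted (step 4)."""
        accepted = []
        for page in random.sample(pages, len(pages)):  # random page order, no repeats
            rows = list(page)
            random.shuffle(rows)                       # random row order within the page
            for row_size in rows:
                if random.random() < smallest_row_size / row_size:
                    accepted.append(row_size)
            if len(accepted) >= target:                # step 4: stop once we have enough
                break
        return accepted

    # Made-up demo: 500 pages of two 200-byte rows, 50 pages of twenty 20-byte rows.
    pages = [[200, 200]] * 500 + [[20] * 20] * 50
    sample = sample_rows(pages, target=100, smallest_row_size=20)
    print(len(sample), "rows accepted;",
          sum(1 for r in sample if r == 200), "of them are 200-byte rows")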
Note that it will be normal for some pages not to have any rows
selected, especially for large tables!

Cheers,
Gavin

P.S.
I really need to stop thinking about this problem, and get on with my
assigned project!!!

In pgsql-hackers by date sent:

Previous
From: Simon Riggs
Date:
Message: Re: autovacuum_work_mem
Next
From: Peter Geoghegan
Date:
Message: Re: ANALYZE sampling is too good