Re: Improving N-Distinct estimation by ANALYZE
| От | Simon Riggs | 
|---|---|
| Тема | Re: Improving N-Distinct estimation by ANALYZE | 
| Дата | |
| Msg-id | 1137458470.3180.228.camel@localhost.localdomain обсуждение исходный текст | 
| Ответ на | Re: Improving N-Distinct estimation by ANALYZE (Manfred Koizar <mkoi-pg@aon.at>) | 
| Список | pgsql-hackers | 
On Mon, 2006-01-16 at 21:24 +0100, Manfred Koizar wrote: > On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs > <simon@2ndquadrant.com> wrote: > >I enclose a patch for checking out block sampling. > > Can't comment on the merits of block sampling and your implementation > thereof. But your thoughts are welcome... > |! * Row Sampling: As of May 2004, we use the Vitter algorithm to create > > Linking the use of the Vitter algorithm to May 2004 is not quite > appropriate. We introduced two stage sampling at that time. I'll redo some of the comments as you suggest and review coding points. > | * a random sample of targrows rows (or less, if there are less in the > |! * sample of blocks). In this case, targblocks is always the same as > |! * targrows, so we always read one row per block. > > This is just wrong, unless you add "on average". Even then it is a > bit misleading, because in most cases we *read* more tuples than we > use. The current code reads a number of blocks == targrows, hence "one per block" but I'll change the comment. > | * Although every row has an equal chance of ending up in the final > | * sample, this sampling method is not perfect: not every possible > | * sample has an equal chance of being selected. For large relations > | * the number of different blocks represented by the sample tends to be > |! * too small. In that case, block sampling should be used. > > Is the last sentence a fact or personal opinion? That is my contention, based upon research. The existing code clearly identifies that case as requiring a solution. We use Chaudhuri et al's 1998 result for the sufficiency of sample size, yet overlook his estimators and his ideas for random block selection. My first point is that we need proportional sample size, not fixed size. [I show that for large tables we currently use a sample that is 2-5 times too small, even based on Chaudhuri's work.] ANALYZE is very I/O bound: random row sampling would need to scan the whole table to take a 2-3% sample in most cases. Block sampling is a realistic way of getting a larger sample size without loss of performance, and also a way of getting a reasonable sample when accessing very few blocks. We would keep random row sampling for smaller relations where the performance effects of sample collection are not noticeable. > I haven't been following the discussion too closely but didn't you say > that a block sampling algorithm somehow compensates for the fact that > the sample is not random? I haven't said that block sampling compensates for the sample being non-random. There is a danger of bias in the sample and I made that clear. I've tried to mitigate that by the use of random blocks, reusing your code and taking note of the earlier problems you solved. However, block sampling allows us to spot cases where rows are naturally grouped together, which row sampling doesn't spot until very high sample sizes. I explored in detail a common case where this causes the current method to fall down very badly. Brutlag & Richardson [2000] give a good run down on block sampling and I'm looking to implement some of his block estimators to complement the current patch. (Credit to Andrew Dunstan for locating the paper...) http://www.stat.washington.edu/www/research/reports/1999/tr355.ps [I did briefly explore the possibility of hybrid row/block sampling, using row sampling as well some block sampling to identify grouping, with an attempt to avoid swamping the sample with biased rows; but I don't have a strong basis for that, so I've ignored that for now.] This patch allows us to explore the possible bias that block sampling gives, allowing us to make a later decision to include it, or not. Best Regards, Simon Riggs
В списке pgsql-hackers по дате отправления: