Re: Estimating costs (was Functional Indices)
| From | Martijn van Oosterhout |
|---|---|
| Subject | Re: Estimating costs (was Functional Indices) |
| Date | |
| Msg-id | 20010524015500.A26920@kleptog.nth |
| In reply to | Re: Estimating costs (was Functional Indices) (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: Estimating costs (was Functional Indices) (Tom Lane <tgl@sss.pgh.pa.us>) |
| List | pgsql-general |
On Wed, May 23, 2001 at 10:48:35AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > But I know something that postgres doesn't. The data is clustered
> > somewhat around the id we're comparing on. There is a "runlength"
> > involved. Thus, when doing an index scan, once it has read in the
> > first tuple of a run there will be more just sitting in the cache at
> > basically zero cost.
>
> Hm. There is code in development sources that will notice data
> clustering --- actually what it looks at is the correlation between
> physical tuple order and logical order of the values. I'm not sure
> whether it would do the right thing if you have runs of identical keys
> but no overall ordering of the keys, however.

That wouldn't help much in this case, as outside of the actual grouping
the keys are not sorted against each other.

> > Currently I work around this by fiddling enable_seqscan in strategic
> > places, but that's a blunt instrument.
>
> Yup, it sure is. Can you propose a statistical measurement that would
> cope with this situation?

I was thinking "average runlength". If this were 10, for example, then
when it came to calculating the cost of the index scan, the planner
would divide the per-tuple cost by 10.

You could use a simple calculation: count the number of times the value
in a column differs from the previous value, then divide that into the
total number of tuples. That's not difficult to implement. I was
actually thinking of editing the ANALYZE code to calculate this value,
but after looking at the code I decided I would need to study the
codebase a bit first.

There is another way, which gives you a value you can tune. It works
something like this (for a single column):

    counter = 1
    lastval = tuple[1].value
    foreach remaining tuple
        if tuple.value != lastval
            histogram[counter]++
            lastval = tuple.value
            counter = 1
        else
            counter++
    histogram[counter]++        # don't forget to count the final run

You then have a histogram of runlengths. From the highest runlength down
to the lowest, take a cumulative sum of the number of tuples and compare
it against the total. You can choose to take the value at the 50% mark,
the 80% mark, or even 95% (see the sketch after this message).

Note that I have absolutely no theory to base this on, only that it
seems silly not to take advantage of any clustering present in the data.

For a perl script to calculate the histogram, go to:
http://svana.org/kleptog/getcoherency.pl

Use it like: ./getcoherency dbname tablename fieldnames... > output

For example output, go to:
http://svana.org/kleptog/output.txt

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
- Every night you must rediscover the secret to sleep,
- only to forget it moments later...
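[Editor's sketch] Below is a minimal Python version of the ideas in the
message above, for experimenting outside the database: it builds the
runlength histogram from the posted pseudocode, derives the "simple
calculation" average runlength, and picks the runlength at a chosen
cumulative-tuple mark. The function names and toy data are illustrative
only; this is one interpretation of the posting, not code from the
referenced perl script.

    from collections import defaultdict

    def runlength_histogram(values):
        # Map runlength -> number of runs of that length, scanning the
        # column values in physical (heap) order.
        histogram = defaultdict(int)
        it = iter(values)
        try:
            lastval = next(it)
        except StopIteration:
            return histogram            # empty column: no runs at all
        counter = 1
        for val in it:
            if val != lastval:
                histogram[counter] += 1
                lastval = val
                counter = 1
            else:
                counter += 1
        histogram[counter] += 1         # close out the final run
        return histogram

    def average_runlength(histogram):
        # The "simple calculation": total tuples divided by number of runs.
        runs = sum(histogram.values())
        tuples = sum(length * count for length, count in histogram.items())
        return tuples / runs if runs else 0.0

    def runlength_at_mark(histogram, fraction):
        # Walk from the highest runlength down, accumulating tuples, and
        # return the runlength where the cumulative share of all tuples
        # reaches `fraction` (0.5 for the 50% mark, 0.8 for 80%, ...).
        total = sum(length * count for length, count in histogram.items())
        cumulative = 0
        for length in sorted(histogram, reverse=True):
            cumulative += length * histogram[length]
            if cumulative >= fraction * total:
                return length
        return 1

    # Toy column with runs of lengths 3, 2, 1 and 4:
    hist = runlength_histogram([7, 7, 7, 2, 2, 9, 5, 5, 5, 5])
    print(dict(hist))                    # {3: 1, 2: 1, 1: 1, 4: 1}
    print(average_runlength(hist))       # 10 tuples / 4 runs = 2.5
    print(runlength_at_mark(hist, 0.5))  # 3: runs of 4 and 3 cover 7 of 10

Whichever of the two numbers you trust is the tunable value the posting
suggests dividing the per-tuple index scan cost by.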