Re: Estimating costs (was Functional Indices)

From: Martijn van Oosterhout
Subject: Re: Estimating costs (was Functional Indices)
Date:
Msg-id: 20010524015500.A26920@kleptog.nth
In reply to: Re: Estimating costs (was Functional Indices)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Estimating costs (was Functional Indices)  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
On Wed, May 23, 2001 at 10:48:35AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > But I know something that postgres doesn't. The data is clustered somewhat
> > around the id we're comparing on. There is a "runlength" involved. Thus,
> > when doing an index scan, once it has read in the first tuple of a run there
> > will be more just sitting in the cache at basically zero cost.
>
> Hm.  There is code in development sources that will notice data
> clustering --- actually what it looks at is the correlation between
> physical tuple order and logical order of the values.  I'm not sure
> whether it would do the right thing if you have runs of identical keys
> but no overall ordering of the keys, however.

That wouldn't help much in this case, since outside of the actual grouping
the keys are not sorted against each other.

> > Currently I work around this by fiddling enable_seqscan in strategic places
> > but that's a blunt instrument.
>
> Yup, it sure is.  Can you propose a statistical measurement that would
> cope with this situation?

I was thinking of an "average runlength". If this were 10, for example, the
planner would divide the per-tuple cost by 10 when calculating the cost of
the index scan.
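
As a rough sketch (invented numbers and function names, just to show the
shape of the adjustment, not how the planner actually computes it):

# Hypothetical: only the first tuple of each run pays the full fetch
# cost; the rest of the run is assumed to already be in the cache.
def index_scan_cost(ntuples, per_tuple_cost, avg_runlength):
    return ntuples * per_tuple_cost / avg_runlength

# With 100000 matching tuples and an average runlength of 10, the
# estimated cost drops tenfold:
print(index_scan_cost(100000, 1.0, 10))   # 10000.0 instead of 100000.0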

You could use a simple calculation method: count the number of times the
value in a column differs from the previous value, then divide that into the
total number of tuples. That's not difficult to implement. I was actually
thinking of editing the ANALYZE code to calculate this value, but after
looking at the code I decided I would need to study the codebase a bit
first.
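
A minimal sketch of that counting method, in Python (assuming the column's
values have already been fetched in physical tuple order; this is not what
ANALYZE actually does):

def average_runlength(values):
    # Each change from one value to the next starts a new run;
    # the first tuple starts the first run.
    if not values:
        return 0.0
    runs = 1
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            runs += 1
    return len(values) / runs

# e.g. average_runlength([7, 7, 7, 7, 3, 3, 9, 9, 9, 9]) -> 10/3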

There is another way which would allow you to calculate a value that you can
tune. It works something like this (this is for only one column):

counter = 1
lastval = tuple[1].value

foreach remaining tuple
  if tuple.value != lastval
    histogram[counter]++
    lastval = tuple.value
    counter = 1
  else
    counter++

histogram[counter]++    # don't forget to record the final run

You then have a histogram of runlengths. Then, from the highest runlength
down to the lowest, compute a cumulative sum of the number of tuples covered
and compare it against the total.

You can choose where to take the value: at the 50% mark, the 80% mark, or
even 95%.
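
In Python rather than pseudocode, the cumulative-sum step might look like
this (a sketch only; histogram maps runlength -> number of runs of that
length, as built above):

def runlength_at_mark(histogram, mark=0.80):
    # Total tuples = sum over runlengths of (runlength * run count).
    total = sum(length * count for length, count in histogram.items())
    covered = 0
    # Walk from the highest runlength down, accumulating tuples covered,
    # and return the runlength at which `mark` of all tuples is reached.
    for length in sorted(histogram, reverse=True):
        covered += length * histogram[length]
        if covered >= mark * total:
            return length
    return 1

# e.g. runlength_at_mark({1: 50, 10: 20}, 0.80) -> 10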

Note that I have absolutely no theory to base this on, only that it seems
silly not to take advantage of any clustering present in the data.

For a Perl script that calculates the histogram, go to:
http://svana.org/kleptog/getcoherency.pl

Use like: ./getcoherency dbname tablename fieldnames...  > output

For example output, go to:
http://svana.org/kleptog/output.txt

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
- Every night you must rediscover the secret to sleep,
- only to forget it moments later...
