Re: new correlation metric

From Heikki Linnakangas
Subject Re: new correlation metric
Date
Msg-id 4904937B.9080009@enterprisedb.com
In response to Re: new correlation metric  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
Martijn van Oosterhout wrote:
> On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:
>> I worked with Nathan Boley to come up with what we think is a better
>> metric for measuring this cost. It is based on the number of times in
>> the ordered sample that you have to physically backtrack (i.e. the data
>> value increases, but the physical position is earlier).
>>
>> For example, if the table's physical order is
>>
>> 6 7 8 9 10 1 2 3 4 5
> 
> How does it deal with a case like the following:
> 
> 1 6 2 7 3 8 4 9 5 10  (interleaving)
> 
> ISTM that your code will overestimate the cost whereas the old code
> wouldn't have done too badly.

Another similar case is where the table is almost in order, but not 
quite. For example, take an ordered table, but swap every other value:

2 1 4 3 6 5 7 8 10 9

Distributions like that would occur naturally if you have a logging 
application that receives timestamped events from elsewhere. The events 
would come in roughly timestamp order, but not exactly, because of 
varying delays, for example. For all practical purposes, such a table 
is just as good as a perfectly sorted one.
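Under a naive backtrack-counting reading of the metric (again my own sketch, not the patch's code), that almost-sorted table scores quite badly:

```python
def count_backtracks(physical_order):
    # Order the (value, position) pairs by value and count
    # how often the physical position decreases.
    positions = [pos for _, pos in
                 sorted((v, i) for i, v in enumerate(physical_order))]
    return sum(1 for a, b in zip(positions, positions[1:]) if b < a)

# Every swapped pair forces one backtrack, even though each jump
# is only one position wide.
print(count_backtracks([2, 1, 4, 3, 6, 5, 7, 8, 10, 9]))  # 4
```

Four backtracks out of nine steps, for a table that a sequential scan would read essentially in order.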

However, the patch actually operates on the *sample*, not the real 
physical order. So I think it would actually work well for my example, 
because if you take just a few values from that sequence at random, the 
sample is likely to be in good order. Not sure about the interleaved case.

I wonder how the proposed measurement behaves wrt. the sample size vs. 
table size ratio. Intuitively, it feels like it becomes less sensitive 
to disorder as the table size grows and (sample size)/(table size) 
becomes smaller. Which is not good, I believe.
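A quick simulation illustrates both points (my own sketch; the table size and sample sizes are arbitrary). On a large pair-swapped table, the full table has N/2 backtracks, but a small random sample registers almost none:

```python
import random

def count_backtracks(pairs):
    # pairs of (value, physical position): order by value,
    # count how often the physical position decreases
    positions = [pos for _, pos in sorted(pairs)]
    return sum(1 for a, b in zip(positions, positions[1:]) if b < a)

N = 100_000
# pair-swapped "almost sorted" table: 1 0 3 2 5 4 ... (0-based values)
table = [i + 1 if i % 2 == 0 else i - 1 for i in range(N)]

full = [(v, i) for i, v in enumerate(table)]
print(count_backtracks(full))           # N/2 backtracks in the whole table

random.seed(42)
for k in (100, 10_000):
    sample = [(table[i], i) for i in random.sample(range(N), k)]
    # a backtrack survives sampling only if both rows of a swapped
    # pair happen to be sampled, which is rare for small samples
    print(k, count_backtracks(sample))
```

So the measured disorder shrinks as the sample fraction shrinks, which matches the intuition above.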

> I think the code is in the right direction, but I think what you want
> is some kind of estimate of "given I've looked for tuple X, how many
> tuples in the next k pages are near this one". Unfortunately I don't see
> a way of calculating it other than a full simulation.

Yeah, it would be nice to figure out something, as the current method 
sure isn't perfect.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

