Re: proposal : cross-column stats

From: Tomas Vondra
Subject: Re: proposal : cross-column stats
Date:
Msg-id: 4D149BBE.5000405@fuzzy.cz
In response to: Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
List: pgsql-hackers
On 24.12.2010 13:37, Florian Pflug wrote:
> On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
> 
>> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>>   I guess we could use the highest possible value (equal to the number
>>>>   of tuples) - according to wiki you need about 10 bits per element
>>>>   with 1% error, i.e. about 10MB of memory for each million of
>>>>   elements.
>>>
>>> Drat. I had expected these numbers to come out quite a bit lower than
>>> that, at least for a higher error target. But even with 10% false
>>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>>> assume the filter will always fit into memory, I fear :-(
>>
>> I have the impression that both of you are forgetting that there are 8
>> bits in a byte. 10 bits per element = 1.25MB per million elements.
> 
> Uh, of course. So in the real universe, the numbers are
> 
> ~1.2MB per 1e6 elements for a false positive rate of 1%
> ~0.5MB per 1e6 elements for a false positive rate of 10%
> 
> Hm. So for a table with a billion distinct elements, we'd need half
> a gigabyte per column for the filter. A tuple with two int columns
> takes at least 24+2*4 = 32bytes to store I think, making such a table
> at least 32GB in size. The filter size would thus be 1/64 of the table
> size in the worst case. 

Yes, but in reality you need three such filters - one for each column and
one for the combination. So that is 1.5GB (with a 10% error rate) or 3.6GB
(with a 1% error rate).
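
Just to make the arithmetic explicit, here's a quick back-of-the-envelope
sketch (illustrative C only, not proposed code) using the standard Bloom
filter sizing formula m = -n * ln(p) / (ln 2)^2 bits for n elements at
false positive rate p:

#include <math.h>
#include <stdio.h>

/* bytes needed for a Bloom filter over n elements at false positive rate p */
static double
bloom_filter_bytes(double n, double p)
{
    double ln2 = log(2.0);

    return (-n * log(p) / (ln2 * ln2)) / 8.0;
}

int
main(void)
{
    /* per-column sizes for 1e6 distinct elements */
    printf("1e6 elements, p = 0.01: %.2f MB\n",
           bloom_filter_bytes(1e6, 0.01) / 1e6);
    printf("1e6 elements, p = 0.10: %.2f MB\n",
           bloom_filter_bytes(1e6, 0.10) / 1e6);

    /* two columns + the combination = three filters, 1e9 distinct elements */
    printf("3 filters, 1e9 elements, p = 0.01: %.2f GB\n",
           3.0 * bloom_filter_bytes(1e9, 0.01) / 1e9);
    printf("3 filters, 1e9 elements, p = 0.10: %.2f GB\n",
           3.0 * bloom_filter_bytes(1e9, 0.10) / 1e9);

    return 0;
}

The 1% figures match the ~1.2MB per 1e6 elements above; the 10% figures
come out a bit higher than the ~0.5MB ballpark because the exact formula
gives ~4.8 bits per element rather than 4.5.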

But this is severely excessive compared to the real needs, as there are
usually far fewer distinct values than tuples (these computations assume
the number of distinct values equals the number of tuples).

I was thinking about a simple heuristic to scale the filter properly,
something like this:

1) sample a small portion of the table and count the distinct values
2) compute "number of distinct values" / "number of sampled tuples"
3) scale this ratio to the whole table and size the filter accordingly

Say the sample contains 50 distinct values: 1,000 rows are sampled, but 20
of the table's distinct values do not appear in the sample at all. Step (2)
then gives 5%, and if the table has 1,000,000 tuples, step (3) gives an
estimate of 50,000 distinct values. So the filter needs just about 60kB,
which is a huge improvement over the previous approach (1.2MB).
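
For the record, here's a tiny sketch of the arithmetic behind this heuristic
(hypothetical code, not PostgreSQL API; assumes ~10 bits per element for a
1% false positive rate, as above):

#include <stdio.h>

/* steps (2) and (3): scale the distinct/sampled ratio to the whole table */
static double
estimate_ndistinct(double sampled_rows, double sampled_distinct,
                   double total_rows)
{
    return (sampled_distinct / sampled_rows) * total_rows;
}

int
main(void)
{
    double total_rows       = 1000000.0; /* the table has 1,000,000 tuples */
    double sampled_rows     = 1000.0;    /* step (1): sample 1,000 rows */
    double sampled_distinct = 50.0;      /* distinct values seen in the sample */

    double ndistinct = estimate_ndistinct(sampled_rows, sampled_distinct,
                                          total_rows);

    /* ~10 bits per element for a 1% false positive rate */
    double filter_kb   = ndistinct * 10.0 / 8.0 / 1000.0;
    double original_kb = total_rows * 10.0 / 8.0 / 1000.0;

    /* prints: ndistinct = 50000, filter ~ 62.5 kB (instead of 1250.0 kB) */
    printf("ndistinct = %.0f, filter ~ %.1f kB (instead of %.1f kB)\n",
           ndistinct, filter_kb, original_kb);

    return 0;
}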

Obviously this will still lead to overestimates in most cases, and there
are probably some other failure cases, but I think it's a reasonable
solution. I don't think it can result in an underestimate (which is the
case where you lose precision).

And in case we want to build this incrementally (from a VACUUM) we really
need to use a somewhat larger filter, because AFAIK a Bloom filter cannot
be rescaled without rebuilding it from scratch.
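
So for the incremental case the sizing would have to include some headroom
up front, something like this (the 2x factor is just a made-up example, not
a recommendation):

#include <stdio.h>

/* size the filter bigger than the current estimate suggests, because the
 * bit array cannot be grown later without rebuilding the whole filter */
static double
incremental_filter_bits(double estimated_ndistinct, double bits_per_element,
                        double headroom)
{
    return estimated_ndistinct * bits_per_element * headroom;
}

int
main(void)
{
    /* 50,000 estimated distinct values, ~10 bits/element, 2x headroom
     * => ~125 kB instead of the ~60 kB computed earlier */
    printf("%.1f kB\n",
           incremental_filter_bits(50000.0, 10.0, 2.0) / 8.0 / 1000.0);
    return 0;
}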

regards
Tomas

