two dimensional statistics in Postgres

Поиск
Список
Период
Сортировка
От Katharina Büchse
Тема two dimensional statistics in Postgres
Дата
Msg-id 545B4A3E.7090407@uni-jena.de
обсуждение исходный текст
Ответы Re: two dimensional statistics in Postgres  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Re: two dimensional statistics in Postgres  ("Tomas Vondra" <tv@fuzzy.cz>)
Список pgsql-hackers
Hi,

I'm a phd-student at the university of Jena, Thüringen, Germany, in the 
field of data bases, more accurate query optimization.
I want to implement a system in PostgreSQL that detects column 
correlations and creates statistical data about correlated columns for 
the optimizer. Therefore I need to store two dimensional statistics 
(especially two dimensional histograms) in PostgreSQL.
I had a look at the description of "WIP: multivariate statistics / proof 
of concept", which looks really promising, I guess these statistics are 
based on scans of the data? For my system I need both -- statistical 
data based on table scans (actually, samples are enough) and those based 
on query feedback. Query feedback (tuple counts and, speaking a little 
inaccurately, the where-part of the query itself) needs to be extracted 
and there needs to be a decision for the optimizer, when to take 
multivariate statistics and when to use the one dimensional ones. Oracle 
in this case just disables one dimensional histograms if there is 
already a multidimensional histogram, but this is not always useful, 
especially in the case of a feedback based histogram (which might not 
cover the whole data space). I want to use both kinds of histograms 
because correlations might occur only in parts of the data. In this case 
a histogram based on a sample of the whole table might not get the point 
and wouldn't help for the part of the data the user seems to be 
interested in.
There are special data structures for storing multidimensional 
histograms based on feedback and I already tried to implement one of 
these in C. In the case of two dimensions they are of course not "for 
free" (one dimensional would be much cheaper), but based on the 
principle of maximum entropy they deliver really good results. I decided 
for only two dimensions because in this case we have the best proportion 
of cost and benefit when searching for correlation (here I'm relying on 
tests that were made in DB2 within a project called CORDS which detects 
correlations even between different tables).

I'd be grateful for any advices and discussions.
Regards,

Katharina



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrea Suisani
Дата:
Сообщение: Re: Amazon Redshift
Следующее
От: Craig Ringer
Дата:
Сообщение: Re: Sequence Access Method WIP