Re: proposal : cross-column stats

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: proposal : cross-column stats
Дата
Msg-id 4D0BCD1A.4080001@fuzzy.cz
обсуждение исходный текст
Ответ на Re: proposal : cross-column stats  (Tomas Vondra <tv@fuzzy.cz>)
Список pgsql-hackers
Hi,

I've read about 10 papers about estimation this week, and I have gained
some insight. I'm not going to describe each of the papers here, I plan
to set up a wiki page about cross column statistics
 http://wiki.postgresql.org/wiki/Cross_Columns_Stats

and I'll put the list of papers and some basic info there. There was a
lot of discussion in this thread, and it's difficult to follow it, so
I'll put there some details about the proposal etc.

So I'll just briefly describe the interesting things I've learned from
those papers.

---------------------- instances of the problem ----------------------

Generally, there are four quite different cases of the problem.

First, the columns may be either real-valued or discrete. And by
'discrete' I mean the value is rather a label than a value. It does not
make sense to add or subtract them, etc. So for example city names or
zip codes are discrete values (although zip codes are numbers etc).

Second, the queries are consist of equality or inequality (range)
conditions.

So actually there are four instances of the problem:
                 | equality conditions | inequality conditions
|================================================================discrete values |          A          |           D
      | numeric values  |          C          |           B
|----------------------------------------------------------------

I think those four problems should be treated separately, although some
of the techniques may be common.

A) discrete values and inequality conditions
  One of the papers (A Bayesian Approach to Estimating The Selectivity  of Conjuctive Predicates) describes a quite
interestingsolution to  this problem - I've already posted a description on how to apply it  to the ZIP code / city
nameproblem - see this
 
  http://archives.postgresql.org/pgsql-hackers/2010-12/msg01576.php

B) numeric values and inequality conditions
  Most of the papers dealing with this problem are based on  discretization and multi-dimensional histograms to
approximatethe  joint distribution. So I believe my initial proposal was not a  complete nonsense.
 
  Once we have a working histogram-based solution, we can work on  precision and efficiency (how much space is needed
tostore it, how  long does it take to compute an estimate, etc.). There are two  ways to do that.
 
  First, most of the papers offer an enhanced type of histogram  (although the histogram proposed in the paper is
alwaysevaluated as  the most efficient one, which is a bit suspicious). Anyway there are  advanced quite-promising ways
tobuild multi-dimensional histograms.
 
  Second, the paper "Selectivity estimation using probabilistic  models") describes a solution based on bayesian
networks.That means  really really promising (actually it addresses join selectivity  estimation too).
 
  So yeas, I'm quite confident we can solve this issue and improve the  efficiency - either by an advanced histogram or
usingbayesian  networks.
 

C) numeric values and equality conditions
  OK, I'm not sure how to solve this case. But the queries against  numeric queries are range queries in most cases I
guess,so maybe  that's not that big deal.
 

D) discrete values and inequality conditions
  Basically, this can be handled just like numeric values after  discretization, i.e. using a histogram. But this is
usually

E) a combination of discrete / numeric columns
  I'm not sure how to deal with this. Obviously it's possible to  build multi-dimensional histogram, and estimate as
manyqueries as  possible.
 

-----------------------------------------------------------------------

The papers describe some interesting alternative approaches, e.g. based
on SVD (singular value decomposition) or random sampling (or kernel
estimators, which an enhanced version of sampling).

But there are various issues associated with those solutions. SVD is
applicable to 2D only, so we'd be stuck with 2 columns, and random
sampling sounds a bit suspicious to me).

regards
Tomas


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bill Moran
Дата:
Сообщение: Re: Why don't we accept exponential format for integers?
Следующее
От: Robert Haas
Дата:
Сообщение: Re: unlogged tables vs. GIST