Обсуждение: [HACKERS] Statistics "dependency"

Поиск
Список
Период
Сортировка

[HACKERS] Statistics "dependency"

От
Bruce Momjian
Дата:
In looking at the new multi-column statistics "dependency" option in
Postgres 10, I am quite confused by the term "dependency".  Wouldn't
"correlation" be clearer and less confusing as "column dependency"
already means something else.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Statistics "dependency"

От
Dean Rasheed
Дата:
On 23 April 2017 at 03:37, Bruce Momjian <bruce@momjian.us> wrote:
> In looking at the new multi-column statistics "dependency" option in
> Postgres 10, I am quite confused by the term "dependency".  Wouldn't
> "correlation" be clearer and less confusing as "column dependency"
> already means something else.
>

Actually, the terms "dependency" and "correlation" are both quite
broad terms that cover a whole range of other different things, and
hence could be misleading. The precise term for this is "functional
dependency" [1], so if anything, the option name should be
"functional_dependencies" or some shortening of that, keeping a part
of each of those words.

Regards,
Dean

[1] https://en.wikipedia.org/wiki/Functional_dependency



Re: [HACKERS] Statistics "dependency"

От
Simon Riggs
Дата:
On 23 April 2017 at 09:17, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On 23 April 2017 at 03:37, Bruce Momjian <bruce@momjian.us> wrote:
>> In looking at the new multi-column statistics "dependency" option in
>> Postgres 10, I am quite confused by the term "dependency".  Wouldn't
>> "correlation" be clearer and less confusing as "column dependency"
>> already means something else.
>>

I also asked that exactly that question...

> Actually, the terms "dependency" and "correlation" are both quite
> broad terms that cover a whole range of other different things, and
> hence could be misleading. The precise term for this is "functional
> dependency" [1], so if anything, the option name should be
> "functional_dependencies" or some shortening of that, keeping a part
> of each of those words.

...and got that answer also.

For us "functional dependency" would sound like something to do with
functions (e.g. CREATE FUNCTION), so just "dependency" appears to me
to be the best term for this.

There are multiple statistics for dependency stored, hence
"dependencies". I don't like it, but its the best term I can see at
present.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Statistics "dependency"

От
Bruce Momjian
Дата:
On Sun, Apr 23, 2017 at 11:44:12AM +0100, Simon Riggs wrote:
> For us "functional dependency" would sound like something to do with
> functions (e.g. CREATE FUNCTION), so just "dependency" appears to me
> to be the best term for this.
> 
> There are multiple statistics for dependency stored, hence
> "dependencies". I don't like it, but its the best term I can see at
> present.

OK, thank you for the reply, and I am sorry I forgot the previous
discussion.  I just wanted to re-check we had research this.  Thanks.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Do we need multi-column frequency/histogram stats? WAS Re:[HACKERS] Statistics "dependency"

От
Bruce Momjian
Дата:
On Sun, Apr 23, 2017 at 10:01:16AM -0400, Bruce Momjian wrote:
> On Sun, Apr 23, 2017 at 11:44:12AM +0100, Simon Riggs wrote:
> > For us "functional dependency" would sound like something to do with
> > functions (e.g. CREATE FUNCTION), so just "dependency" appears to me
> > to be the best term for this.
> > 
> > There are multiple statistics for dependency stored, hence
> > "dependencies". I don't like it, but its the best term I can see at
> > present.
> 
> OK, thank you for the reply, and I am sorry I forgot the previous
> discussion.  I just wanted to re-check we had research this.  Thanks.

(Email subject updated.)

Actually, I have a larger question that I was thinking about.  Because
we already have lots of per-column stats, and now the dependency score,
is it possible to mix the per-column stats and dependency score in a way
that multi-column frequency/histogram stats are not necessary?  That
might be a less costly approach I had not considered.

Or did I miss that discussion too?  ;-)

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Statistics "dependency"

От
Tomas Vondra
Дата:
On 04/23/2017 12:44 PM, Simon Riggs wrote:
> On 23 April 2017 at 09:17, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> On 23 April 2017 at 03:37, Bruce Momjian <bruce@momjian.us> wrote:
>>> In looking at the new multi-column statistics "dependency" option in
>>> Postgres 10, I am quite confused by the term "dependency".  Wouldn't
>>> "correlation" be clearer and less confusing as "column dependency"
>>> already means something else.
>>>
>
> I also asked that exactly that question...
>
>> Actually, the terms "dependency" and "correlation" are both quite
>> broad terms that cover a whole range of other different things, and
>> hence could be misleading. The precise term for this is "functional
>> dependency" [1], so if anything, the option name should be
>> "functional_dependencies" or some shortening of that, keeping a part
>> of each of those words.
>
> ...and got that answer also.
>
> For us "functional dependency" would sound like something to do with
> functions (e.g. CREATE FUNCTION), so just "dependency" appears to me
> to be the best term for this.
>

Not really. Functional dependency is a term well-defined in relational 
algebra, particularly in definition of normal forms. It has nothing to 
do with functions, and I'm sure it's not the only possible "ambiguity".

I actually considered functional_dependency when working on this, and I 
think it might be a better choice, but seemed a bit longish.

> There are multiple statistics for dependency stored, hence
> "dependencies". I don't like it, but its the best term I can see at
> present.

That is a good point, actually. It should probably be plural.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Do we need multi-column frequency/histogram stats? WAS Re:[HACKERS] Statistics "dependency"

От
Tomas Vondra
Дата:
On 04/23/2017 04:16 PM, Bruce Momjian wrote:
> On Sun, Apr 23, 2017 at 10:01:16AM -0400, Bruce Momjian wrote:
>> On Sun, Apr 23, 2017 at 11:44:12AM +0100, Simon Riggs wrote:
>>> For us "functional dependency" would sound like something to do with
>>> functions (e.g. CREATE FUNCTION), so just "dependency" appears to me
>>> to be the best term for this.
>>>
>>> There are multiple statistics for dependency stored, hence
>>> "dependencies". I don't like it, but its the best term I can see at
>>> present.
>>
>> OK, thank you for the reply, and I am sorry I forgot the previous
>> discussion.  I just wanted to re-check we had research this.  Thanks.
>
> (Email subject updated.)
>
> Actually, I have a larger question that I was thinking about.  Because
> we already have lots of per-column stats, and now the dependency score,
> is it possible to mix the per-column stats and dependency score in a way
> that multi-column frequency/histogram stats are not necessary?  That
> might be a less costly approach I had not considered.

Certainly not. Functional dependencies are "global" statistics, and only 
a very specific type of it. It only tells you that a particular column 
"implies" another column, i.e. knowledge of a value in A means there's 
only a single possible value in "B". That has a number of implications:

* It only works for equality conditions. No inequalities or so.

* It assumes the queries are "consistent" with the functional dependencies.

* There are dependencies/correlations that are not functional 
dependencies, while MCV/histograms would help.

Functional dependencies was the simplest type of extended statistics, 
and so it was the first one to implement (and introduce all the 
infrastructure). But we still need the other types.

The other types of statistics actually track correlation between values 
in the columns, not just "column A implies column B".


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services