Обсуждение: Improve docs for n_distinct_inherited

Поиск
Список
Период
Сортировка

Improve docs for n_distinct_inherited

От
David Rowley
Дата:
In [1] there's a bug report about ALTER TABLE ... ALTER COLUMN SET
(n_distinct = N) not working for partitioned tables. Of course, you
need to use n_distinct_inherited for partitioned tables, but the docs
don't say that.

I went through a few iterations of the wording to make this clearer
and landed on the attached.

David

[1] https://postgr.es/m/18916-985c04e9b5bed12b@postgresql.org

Вложения

Re: Improve docs for n_distinct_inherited

От
"David G. Johnston"
Дата:
On Wed, May 7, 2025 at 4:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
In [1] there's a bug report about ALTER TABLE ... ALTER COLUMN SET
(n_distinct = N) not working for partitioned tables. Of course, you
need to use n_distinct_inherited for partitioned tables, but the docs
don't say that.

I went through a few iterations of the wording to make this clearer
and landed on the attached.


Not liking the proposal, not sure it is even correct.  Somehow "children of inheritance parent tables" are omitted.  In both cases the only statistics affected are those of the table upon which the option is being read.  That can go implied.  Leaving "affects the statistics for the table itself, while" is causing most of the harm.

In terms of wording - n_distinct must be used for all tables except partitioned tables and inheritance parent tables.  Those two acquire their override value from n_distinct_inherited.  So something like:

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 24316bb816a..657e4c835ac 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
       <literal>n_distinct_inherited</literal>, which override the
       number-of-distinct-values estimates made by subsequent
       <link linkend="sql-analyze"><command>ANALYZE</command></link>
-      operations.  <literal>n_distinct</literal> affects the statistics for the table
-      itself, while <literal>n_distinct_inherited</literal> affects the statistics
-      gathered for the table plus its inheritance children.  When set to a
+      operations.
+      For partitioned tables and inheritance parents <command>ANALYZE</command>
+      consults the <literal>n_distinct_inherited</literal> option and, if applicable,
+      includes all descendants when calculating the table size for use in the
+      number-of-distinct-values estimate.  For all other tables, <literal>n_distinct</literal>
+      is used and the table size estimate is based on the table itself.
+      When set to a
       positive value, <command>ANALYZE</command> will assume that the column contains
       exactly the specified number of distinct nonnull values.  When set to a
       negative value, which must be greater

It has a bit of forward-referencing to the following paragraph, but nothing surprising or novel.  The existing wording is already doing the same by alluding to descendants/children - but its just one data point that is involved so lets call it out.

David J.

Re: Improve docs for n_distinct_inherited

От
David Rowley
Дата:
On Thu, 8 May 2025 at 15:23, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Not liking the proposal, not sure it is even correct.  Somehow "children of inheritance parent tables" are omitted.

I don't see the quoted text anywhere in this area, so I'm not sure I
follow what you mean with the omission.

>  In both cases the only statistics affected are those of the table upon which the option is being read.  That can go
implied. Leaving "affects the statistics for the table itself, while" is causing most of the harm.
 

So the text, when it says "affects the statistics for the table
itself" is talking about stainherit==false statistics and the "the
statistics gathered which include the descendant tables of partitioned
tables and inheritance parent tables" which I added is trying to
describe the stainherits==true statistics.  You may have
misinterpreted that to think it was talking which table the option was
being applied to rather than where the sample rows were being gathered
from.

> In terms of wording - n_distinct must be used for all tables except partitioned tables and inheritance parent tables.
Those two acquire their override value from n_distinct_inherited.  So something like:
 
>
> diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
> index 24316bb816a..657e4c835ac 100644
> --- a/doc/src/sgml/ref/alter_table.sgml
> +++ b/doc/src/sgml/ref/alter_table.sgml
> @@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
>        <literal>n_distinct_inherited</literal>, which override the
>        number-of-distinct-values estimates made by subsequent
>        <link linkend="sql-analyze"><command>ANALYZE</command></link>
> -      operations.  <literal>n_distinct</literal> affects the statistics for the table
> -      itself, while <literal>n_distinct_inherited</literal> affects the statistics
> -      gathered for the table plus its inheritance children.  When set to a
> +      operations.
> +      For partitioned tables and inheritance parents <command>ANALYZE</command>
> +      consults the <literal>n_distinct_inherited</literal> option and, if applicable,
> +      includes all descendants when calculating the table size for use in the
> +      number-of-distinct-values estimate.  For all other tables, <literal>n_distinct</literal>
> +      is used and the table size estimate is based on the table itself.
> +      When set to a
>        positive value, <command>ANALYZE</command> will assume that the column contains
>        exactly the specified number of distinct nonnull values.  When set to a
>        negative value, which must be greater
>
> It has a bit of forward-referencing to the following paragraph, but nothing surprising or novel.  The existing
wordingis already doing the same by alluding to descendants/children - but its just one data point that is involved so
letscall it out.
 

I can't follow your proposed text. It seems to indicate that ANALYZE
uses n_distinct_inherits to calculate the table size. ANALYZE does
nothing of the sort. I think you might have been confused by the
following text which talks about how *the planner* proportions
*negative* stadistinct values over the "table size", which really
should say estimated row count and not table size. table size has
nothing to do with this, but that seems like another issue.

As for my proposed patch and the text it changes, what this text is
meant to be talking about is which of ANALYZE's sample sets the
stadistinct estimate is overridden for. The two options for sample
sets are 1) stainherits==true and 2) stainherits==false. With
inheritance parent tables, ANALYZE gathers both sets and records both
sets in pg_statistic. Which one the planner uses depends on what it's
doing and whether you specify ONLY or not in your query.  For
partitioned tables, since they're empty, we don't bother with
stainherits==false stats.

I've attached another patch which adjusts the text to detail that
there are two sample row sets gathered for some cases, and then
mentions the two options that can be used to selectively overwrite
each sample's n_distinct estimate. I'd rather find a shorter set of
words to better explain this than what I'm proposing in v2, but the
current text has presumably confused one person (in the bug report)
and the v1 patch still seems to be a source of confusion.

David

Вложения

Re: Improve docs for n_distinct_inherited

От
"David G. Johnston"
Дата:
On Wed, May 7, 2025 at 11:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 8 May 2025 at 15:23, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Not liking the proposal, not sure it is even correct.  Somehow "children of inheritance parent tables" are omitted.

I don't see the quoted text anywhere in this area, so I'm not sure I
follow what you mean with the omission.

Omitted from your v1 patch.

The two options for sample
sets are 1) stainherits==true and 2) stainherits==false. With
inheritance parent tables, ANALYZE gathers both sets and records both
sets in pg_statistic.

I was missing this key piece of knowledge which invalidated my entire attempt.  

Here's an attempt at shortening this now that I understand the mechanics better.

      Separate options exist because an inheritance parent table has two
      different sets of statistics: one considering only itself and one which
      also includes its children (<literal>n_distinct_inherited</literal>).
      Partitioned tables, which only have rows in the children, likewise uses
      the inherited option while everyone else uses <literal>n_distinct</literal>.

David J.

Re: Improve docs for n_distinct_inherited

От
David Rowley
Дата:
Just picking this one up again. I forgot to come back to this after PGConf.dev.

On Fri, 9 May 2025 at 02:50, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I was missing this key piece of knowledge which invalidated my entire attempt.
>
> Here's an attempt at shortening this now that I understand the mechanics better.
>
>       Separate options exist because an inheritance parent table has two
>       different sets of statistics: one considering only itself and one which
>       also includes its children (<literal>n_distinct_inherited</literal>).
>       Partitioned tables, which only have rows in the children, likewise uses
>       the inherited option while everyone else uses <literal>n_distinct</literal>.

I wasn't quite happy with that as the text indicates that
n_distinct_inherited is the statistics.  But, it's not, it's just the
option that allows some modification of the gathered statistics.

I came up with:

      Ordinarily <literal>n_distinct</literal> is used.
      <literal>n_distinct_inherited</literal> exists to allow the distinct
      estimate to be overwritten for the statistics gathered for inheritance
      parent tables and for partitioned tables.

I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things. That's not true. It's the query
planner that uses this value. ANALYZE just stores whatever this is set
to into pg_statistic. I also adjusted the text that was talking about
"the size of the table", which, as I mentioned earlier isn't correct.
It's all related to the estimated number rows in the table, per
"ntuples = vardata->rel->tuples;" in get_variable_numdistinct().

Also fixed a typo; "twice on the average" shouldn't contain "the".

I wonder if ", since the multiplication by the number of rows in the
table is not performed until query planning time" should be deleted
since I modified the text earlier to talk about "the query planner".

David

Вложения

Re: Improve docs for n_distinct_inherited

От
"David G. Johnston"
Дата:
On Sun, Oct 12, 2025 at 7:42 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just picking this one up again. I forgot to come back to this after PGConf.dev.

On Fri, 9 May 2025 at 02:50, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I was missing this key piece of knowledge which invalidated my entire attempt.
>
> Here's an attempt at shortening this now that I understand the mechanics better.
>
>       Separate options exist because an inheritance parent table has two
>       different sets of statistics: one considering only itself and one which
>       also includes its children (<literal>n_distinct_inherited</literal>).
>       Partitioned tables, which only have rows in the children, likewise uses
>       the inherited option while everyone else uses <literal>n_distinct</literal>.

I wasn't quite happy with that as the text indicates that
n_distinct_inherited is the statistics.  But, it's not, it's just the
option that allows some modification of the gathered statistics.

I came up with:

      Ordinarily <literal>n_distinct</literal> is used.
      <literal>n_distinct_inherited</literal> exists to allow the distinct
      estimate to be overwritten for the statistics gathered for inheritance
      parent tables and for partitioned tables.

How about:

"n_disinct is used for normal tables while n_distinct_inherited is used for partitioned tables.  Both are usable (selected via the ONLY modifier) for an inheritance parent table since it has both storage and children."

The use of both "Ordinarily" and "overwritten" is bothering me here.  And it implies that n_distinct doesn't work for inheritance parent tables or, conversely, that n_distinct does work for partitioned tables.


I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things.

values in the column is linear with the estimated number
>      of rows in the table; the exact count is to be computed by multiplying the estimated
>      rows in the table by the absolute value of the given number.

"...is proportional to the estimated number of rows in the table at planning time."

(The final "at planning time" substitutes for the sentence you pondered removing.)

(The rest, including the examples, seem a bit self-explanatory given the definition, though I do get reader inexperience with the terminology.  But proportional implies a linear relationship, and the positive/negative bifurcation seems straight-forward here.)

I'm thinking everything else below is better incorporated into 14.2 which should be linking back to this section.  That way the crux of the usage is defined in syntax while the details about setting a specific value are located in the section covering the overall topic.

 the exact count is to be computed by multiplying the estimated
+      rows in the table by the absolute value of the given number.  For example,
       a value of -1 implies that all values in the column are distinct, while
-      a value of -0.5 implies that each value appears twice on the average.
+      a value of -0.5 implies that each value appears twice on average.
       This can be useful when the size of the table changes over time, since
       the multiplication by the number of rows in the table is not performed
       until query planning time.  

(Leave: Specify a value of 0 to revert to estimating...)

That said, this rework would be OK as-is.

Also, looking at stadistinct, the multiplier stored there accounts for the presence of null.  This attribute-option does not.  Is that difference worth noting?


David J.

Re: Improve docs for n_distinct_inherited

От
Chao Li
Дата:
Hi David,

I think your revision is good and accurate.

On Oct 13, 2025, at 07:42, David Rowley <dgrowleyml@gmail.com> wrote:

Just picking this one up again. I forgot to come back to this after PGConf.dev.

I came up with:

     Ordinarily <literal>n_distinct</literal> is used.
     <literal>n_distinct_inherited</literal> exists to allow the distinct
     estimate to be overwritten for the statistics gathered for inheritance
     parent tables and for partitioned tables.

This clarifies that n_distinct_inherited applies to both inheritance parents and partitioned tabled, that’s accurate and better than the original wording.


I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things. That's not true. It's the query
planner that uses this value. ANALYZE just stores whatever this is set
to into pg_statistic. I also adjusted the text that was talking about
"the size of the table", which, as I mentioned earlier isn't correct.
It's all related to the estimated number rows in the table, per
"ntuples = vardata->rel->tuples;" in get_variable_numdistinct().


When I read the diff, I thought “table size” is usually used in PG docs, so maybe “estimated table size”. But as you explicitly explained why you made the change, I am fine with your change.

Also fixed a typo; "twice on the average" shouldn't contain "the".


Correct grammer fix.

I wonder if ", since the multiplication by the number of rows in the
table is not performed until query planning time" should be deleted
since I modified the text earlier to talk about "the query planner”.


Yeah, with your rewrite, that clause now feels a little redundant. I think it can be removed entirely.

The other thing that doesn’t belong to your change but as you are touching here:

“When set to a negative value, which must be greater than or equal to -1"

When I first time read the doc, I was confused. Because no easier sentence indicated “n_distinct” is of float type. I thought “greater than” was a typo. When I read through, the later example (0.5) resolved my confusion. To avoid the same confusion to other readers, maybe change to “when set to a negative value between -1 and 0 (inclusive of -1)” or “when set to a negative value, which must be in the range -1<= value < 0”. 

But as I said, this doesn’t belong to your change. If you don’t want to enhance this place, or you don’t consider this is a problem, I am absolutely fine.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Improve docs for n_distinct_inherited

От
David Rowley
Дата:
On Mon, 13 Oct 2025 at 14:28, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> How about:
>
> "n_disinct is used for normal tables while n_distinct_inherited is used for partitioned tables.  Both are usable
(selectedvia the ONLY modifier) for an inheritance parent table since it has both storage and children." 

I specifically was trying my best to avoid using the word "normal"
here as it's very indescriptive.  Rhetorical question: would you
expect a setting which applies to "normal" tables to be applied to a
materialized view? Instead, I described the cases where
n_distinct_inherited applies, and left it to the reader to assume that
n_distinct applies in all other cases, which IMO is correct. I don't
see the need to talk about ONLY. I'm only discussing stored statistics
in this part, i.e, what's in pg_statistic.

> The use of both "Ordinarily" and "overwritten" is bothering me here.  And it implies that n_distinct doesn't work for
inheritanceparent tables or, conversely, that n_distinct does work for partitioned tables. 

I don't know why you think that implies that n_distinct works for
partitioned tables. I've specifically said that "n_distinct_inherited
allows the n_distinct estimate to be overwritten for partitioned
tables". Why would the reader then assume n_distinct is for doing
that?

As for "overwritten", this is exactly what changing this setting does
to the pg_statistic.stadistinct when ANALYZE runs for the table, so as
far as I'm concerned, it's a good and accurate way to describe what
the settings do.

I'm open to finding something better for "Ordinarily
<literal>n_distinct</literal> is used."

>> I also fixed what I thought was some misleading text about ANALYZE
>> using this value to calculate things.
>
>
> > values in the column is linear with the estimated number
> >      of rows in the table; the exact count is to be computed by multiplying the estimated
> >      rows in the table by the absolute value of the given number.
>
> "...is proportional to the estimated number of rows in the table at planning time."
>
> (The final "at planning time" substitutes for the sentence you pondered removing.)

I've adjusted the "proportional" part, but the "add planning time" at
the end is surplus because we've already mentioned that this is
happening in "the query planner". I doubt anyone needs the additional
reassurance that things that the query planner does happen at planning
time.

> (The rest, including the examples, seem a bit self-explanatory given the definition, though I do get reader
inexperiencewith the terminology.  But proportional implies a linear relationship, and the positive/negative
bifurcationseems straight-forward here.) 

I don't feel the need to remove the examples. I think they're good to
reinforce how the negative values are utilised during planning.

> I'm thinking everything else below is better incorporated into 14.2 which should be linking back to this section.
Thatway the crux of the usage is defined in syntax while the details about setting a specific value are located in the
sectioncovering the overall topic. 
>
>  the exact count is to be computed by multiplying the estimated
> +      rows in the table by the absolute value of the given number.  For example,
>        a value of -1 implies that all values in the column are distinct, while
> -      a value of -0.5 implies that each value appears twice on the average.
> +      a value of -0.5 implies that each value appears twice on average.
>        This can be useful when the size of the table changes over time, since
>        the multiplication by the number of rows in the table is not performed
>        until query planning time.
>
> (Leave: Specify a value of 0 to revert to estimating...)

Do you mean remove that part?  I did consider that as I didn't think
the reader needed to know that as they could use RESET (n_distinct);

> That said, this rework would be OK as-is.
>
> Also, looking at stadistinct, the multiplier stored there accounts for the presence of null.  This attribute-option
doesnot.  Is that difference worth noting? 

It's already becoming more painful to get this changed than I would
have anticipated. Feel free to suggest something on a new thread if
you think it's worthwhile to mention it.

David



Re: Improve docs for n_distinct_inherited

От
David Rowley
Дата:
On Mon, 13 Oct 2025 at 14:47, Chao Li <li.evan.chao@gmail.com> wrote:
>
> Hi David,
>
> I think your revision is good and accurate.
>
> On Oct 13, 2025, at 07:42, David Rowley <dgrowleyml@gmail.com> wrote:
> I wonder if ", since the multiplication by the number of rows in the
> table is not performed until query planning time" should be deleted
> since I modified the text earlier to talk about "the query planner”.
>
>
> Yeah, with your rewrite, that clause now feels a little redundant. I think it can be removed entirely.

I've now removed that part.

> The other thing that doesn’t belong to your change but as you are touching here:
>
> “When set to a negative value, which must be greater than or equal to -1"
>
> When I first time read the doc, I was confused. Because no easier sentence indicated “n_distinct” is of float type. I
thought“greater than” was a typo. When I read through, the later example (0.5) resolved my confusion. To avoid the same
confusionto other readers, maybe change to “when set to a negative value between -1 and 0 (inclusive of -1)” or “when
setto a negative value, which must be in the range -1<= value < 0”. 

I agree that part is a bit clumsy. Starting at this cold again today I
also thought it must mean less than -1, so rewording that seems like a
good idea.

It now reads:

      Fractional values may also be specified by using values below 0 and above
      or equal to -1.  This instructs the query planner to estimate the number
      of distinct values by multiplying the absolute value of the specified
      number by the estimated number of rows in the table.

Updated patch attached.

David

Вложения

Re: Improve docs for n_distinct_inherited

От
Chao Li
Дата:


On Oct 13, 2025, at 11:48, David Rowley <dgrowleyml@gmail.com> wrote:

Updated patch attached.

David
<doc_n_distinct_inherited_v4.patch>

The doc change of v4 looks good to me now.

But v4 contains a lot of code changes, are they incidentally included?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Improve docs for n_distinct_inherited

От
"David G. Johnston"
Дата:
On Sunday, October 12, 2025, David Rowley <dgrowleyml@gmail.com> wrote:

Updated patch attached.

The patch picked up a whole other set of changes as well.

How about:

n_distinct_inherited exists to allow the distinct value estimate to be overwritten for the children-included statistics gathered for inheritance parent tables, and for partitioned tables.

I can live with “ordinarily” with that wording, it is just the negative of “children-included”.

David J.

Re: Improve docs for n_distinct_inherited

От
David Rowley
Дата:
On Mon, 13 Oct 2025 at 17:09, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> How about:
>
> n_distinct_inherited exists to allow the distinct value estimate to be overwritten for the children-included
statisticsgathered for inheritance parent tables, and for partitioned tables. 
>
> I can live with “ordinarily” with that wording, it is just the negative of “children-included”.

I've ended up reverting the text back to become more similar to how it
was but explicitly adding ", and for the statistics gathered for
partitioned tables.", since that's what the original bug report was
confused about.

I've kept the other document changes as they were in the previous patch.

David

Вложения

Re: Improve docs for n_distinct_inherited

От
"David G. Johnston"
Дата:
On Monday, October 13, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 13 Oct 2025 at 17:09, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> How about:
>
> n_distinct_inherited exists to allow the distinct value estimate to be overwritten for the children-included statistics gathered for inheritance parent tables, and for partitioned tables.
>
> I can live with “ordinarily” with that wording, it is just the negative of “children-included”.

I've ended up reverting the text back to become more similar to how it
was but explicitly adding ", and for the statistics gathered for
partitioned tables.", since that's what the original bug report was
confused about.

I've kept the other document changes as they were in the previous patch.

WFM.  Since we allow 0 as an input I’m not convinced we should remove that sentence, though I concur that using reset is preferred.

David J.

Re: Improve docs for n_distinct_inherited

От
David Rowley
Дата:
On Tue, 14 Oct 2025 at 00:55, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> WFM.  Since we allow 0 as an input I’m not convinced we should remove that sentence, though I concur that using reset
ispreferred. 

I've pushed the v5 version. Thanks to both of you for looking.

David