Обсуждение: Improve docs for n_distinct_inherited
In [1] there's a bug report about ALTER TABLE ... ALTER COLUMN SET (n_distinct = N) not working for partitioned tables. Of course, you need to use n_distinct_inherited for partitioned tables, but the docs don't say that. I went through a few iterations of the wording to make this clearer and landed on the attached. David [1] https://postgr.es/m/18916-985c04e9b5bed12b@postgresql.org
Вложения
On Wed, May 7, 2025 at 4:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
In [1] there's a bug report about ALTER TABLE ... ALTER COLUMN SET
(n_distinct = N) not working for partitioned tables. Of course, you
need to use n_distinct_inherited for partitioned tables, but the docs
don't say that.
I went through a few iterations of the wording to make this clearer
and landed on the attached.
Not liking the proposal, not sure it is even correct. Somehow "children of inheritance parent tables" are omitted. In both cases the only statistics affected are those of the table upon which the option is being read. That can go implied. Leaving "affects the statistics for the table itself, while" is causing most of the harm.
In terms of wording - n_distinct must be used for all tables except partitioned tables and inheritance parent tables. Those two acquire their override value from n_distinct_inherited. So something like:
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 24316bb816a..657e4c835ac 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
<literal>n_distinct_inherited</literal>, which override the
number-of-distinct-values estimates made by subsequent
<link linkend="sql-analyze"><command>ANALYZE</command></link>
- operations. <literal>n_distinct</literal> affects the statistics for the table
- itself, while <literal>n_distinct_inherited</literal> affects the statistics
- gathered for the table plus its inheritance children. When set to a
+ operations.
+ For partitioned tables and inheritance parents <command>ANALYZE</command>
+ consults the <literal>n_distinct_inherited</literal> option and, if applicable,
+ includes all descendants when calculating the table size for use in the
+ number-of-distinct-values estimate. For all other tables, <literal>n_distinct</literal>
+ is used and the table size estimate is based on the table itself.
+ When set to a
positive value, <command>ANALYZE</command> will assume that the column contains
exactly the specified number of distinct nonnull values. When set to a
negative value, which must be greater
index 24316bb816a..657e4c835ac 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
<literal>n_distinct_inherited</literal>, which override the
number-of-distinct-values estimates made by subsequent
<link linkend="sql-analyze"><command>ANALYZE</command></link>
- operations. <literal>n_distinct</literal> affects the statistics for the table
- itself, while <literal>n_distinct_inherited</literal> affects the statistics
- gathered for the table plus its inheritance children. When set to a
+ operations.
+ For partitioned tables and inheritance parents <command>ANALYZE</command>
+ consults the <literal>n_distinct_inherited</literal> option and, if applicable,
+ includes all descendants when calculating the table size for use in the
+ number-of-distinct-values estimate. For all other tables, <literal>n_distinct</literal>
+ is used and the table size estimate is based on the table itself.
+ When set to a
positive value, <command>ANALYZE</command> will assume that the column contains
exactly the specified number of distinct nonnull values. When set to a
negative value, which must be greater
It has a bit of forward-referencing to the following paragraph, but nothing surprising or novel. The existing wording is already doing the same by alluding to descendants/children - but its just one data point that is involved so lets call it out.
David J.
On Thu, 8 May 2025 at 15:23, David G. Johnston <david.g.johnston@gmail.com> wrote: > Not liking the proposal, not sure it is even correct. Somehow "children of inheritance parent tables" are omitted. I don't see the quoted text anywhere in this area, so I'm not sure I follow what you mean with the omission. > In both cases the only statistics affected are those of the table upon which the option is being read. That can go implied. Leaving "affects the statistics for the table itself, while" is causing most of the harm. So the text, when it says "affects the statistics for the table itself" is talking about stainherit==false statistics and the "the statistics gathered which include the descendant tables of partitioned tables and inheritance parent tables" which I added is trying to describe the stainherits==true statistics. You may have misinterpreted that to think it was talking which table the option was being applied to rather than where the sample rows were being gathered from. > In terms of wording - n_distinct must be used for all tables except partitioned tables and inheritance parent tables. Those two acquire their override value from n_distinct_inherited. So something like: > > diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml > index 24316bb816a..657e4c835ac 100644 > --- a/doc/src/sgml/ref/alter_table.sgml > +++ b/doc/src/sgml/ref/alter_table.sgml > @@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM > <literal>n_distinct_inherited</literal>, which override the > number-of-distinct-values estimates made by subsequent > <link linkend="sql-analyze"><command>ANALYZE</command></link> > - operations. <literal>n_distinct</literal> affects the statistics for the table > - itself, while <literal>n_distinct_inherited</literal> affects the statistics > - gathered for the table plus its inheritance children. When set to a > + operations. > + For partitioned tables and inheritance parents <command>ANALYZE</command> > + consults the <literal>n_distinct_inherited</literal> option and, if applicable, > + includes all descendants when calculating the table size for use in the > + number-of-distinct-values estimate. For all other tables, <literal>n_distinct</literal> > + is used and the table size estimate is based on the table itself. > + When set to a > positive value, <command>ANALYZE</command> will assume that the column contains > exactly the specified number of distinct nonnull values. When set to a > negative value, which must be greater > > It has a bit of forward-referencing to the following paragraph, but nothing surprising or novel. The existing wordingis already doing the same by alluding to descendants/children - but its just one data point that is involved so letscall it out. I can't follow your proposed text. It seems to indicate that ANALYZE uses n_distinct_inherits to calculate the table size. ANALYZE does nothing of the sort. I think you might have been confused by the following text which talks about how *the planner* proportions *negative* stadistinct values over the "table size", which really should say estimated row count and not table size. table size has nothing to do with this, but that seems like another issue. As for my proposed patch and the text it changes, what this text is meant to be talking about is which of ANALYZE's sample sets the stadistinct estimate is overridden for. The two options for sample sets are 1) stainherits==true and 2) stainherits==false. With inheritance parent tables, ANALYZE gathers both sets and records both sets in pg_statistic. Which one the planner uses depends on what it's doing and whether you specify ONLY or not in your query. For partitioned tables, since they're empty, we don't bother with stainherits==false stats. I've attached another patch which adjusts the text to detail that there are two sample row sets gathered for some cases, and then mentions the two options that can be used to selectively overwrite each sample's n_distinct estimate. I'd rather find a shorter set of words to better explain this than what I'm proposing in v2, but the current text has presumably confused one person (in the bug report) and the v1 patch still seems to be a source of confusion. David
Вложения
On Wed, May 7, 2025 at 11:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 8 May 2025 at 15:23, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Not liking the proposal, not sure it is even correct. Somehow "children of inheritance parent tables" are omitted.
I don't see the quoted text anywhere in this area, so I'm not sure I
follow what you mean with the omission.
Omitted from your v1 patch.
The two options for sample
sets are 1) stainherits==true and 2) stainherits==false. With
inheritance parent tables, ANALYZE gathers both sets and records both
sets in pg_statistic.
I was missing this key piece of knowledge which invalidated my entire attempt.
Here's an attempt at shortening this now that I understand the mechanics better.
Separate options exist because an inheritance parent table has two
different sets of statistics: one considering only itself and one which
also includes its children (<literal>n_distinct_inherited</literal>).
Partitioned tables, which only have rows in the children, likewise uses
the inherited option while everyone else uses <literal>n_distinct</literal>.
different sets of statistics: one considering only itself and one which
also includes its children (<literal>n_distinct_inherited</literal>).
Partitioned tables, which only have rows in the children, likewise uses
the inherited option while everyone else uses <literal>n_distinct</literal>.
David J.
Just picking this one up again. I forgot to come back to this after PGConf.dev.
On Fri, 9 May 2025 at 02:50, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I was missing this key piece of knowledge which invalidated my entire attempt.
>
> Here's an attempt at shortening this now that I understand the mechanics better.
>
> Separate options exist because an inheritance parent table has two
> different sets of statistics: one considering only itself and one which
> also includes its children (<literal>n_distinct_inherited</literal>).
> Partitioned tables, which only have rows in the children, likewise uses
> the inherited option while everyone else uses <literal>n_distinct</literal>.
I wasn't quite happy with that as the text indicates that
n_distinct_inherited is the statistics. But, it's not, it's just the
option that allows some modification of the gathered statistics.
I came up with:
Ordinarily <literal>n_distinct</literal> is used.
<literal>n_distinct_inherited</literal> exists to allow the distinct
estimate to be overwritten for the statistics gathered for inheritance
parent tables and for partitioned tables.
I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things. That's not true. It's the query
planner that uses this value. ANALYZE just stores whatever this is set
to into pg_statistic. I also adjusted the text that was talking about
"the size of the table", which, as I mentioned earlier isn't correct.
It's all related to the estimated number rows in the table, per
"ntuples = vardata->rel->tuples;" in get_variable_numdistinct().
Also fixed a typo; "twice on the average" shouldn't contain "the".
I wonder if ", since the multiplication by the number of rows in the
table is not performed until query planning time" should be deleted
since I modified the text earlier to talk about "the query planner".
David
Вложения
On Sun, Oct 12, 2025 at 7:42 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just picking this one up again. I forgot to come back to this after PGConf.dev.
On Fri, 9 May 2025 at 02:50, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> I was missing this key piece of knowledge which invalidated my entire attempt.
>
> Here's an attempt at shortening this now that I understand the mechanics better.
>
> Separate options exist because an inheritance parent table has two
> different sets of statistics: one considering only itself and one which
> also includes its children (<literal>n_distinct_inherited</literal>).
> Partitioned tables, which only have rows in the children, likewise uses
> the inherited option while everyone else uses <literal>n_distinct</literal>.
I wasn't quite happy with that as the text indicates that
n_distinct_inherited is the statistics. But, it's not, it's just the
option that allows some modification of the gathered statistics.
I came up with:
Ordinarily <literal>n_distinct</literal> is used.
<literal>n_distinct_inherited</literal> exists to allow the distinct
estimate to be overwritten for the statistics gathered for inheritance
parent tables and for partitioned tables.
How about:
"n_disinct is used for normal tables while n_distinct_inherited is used for partitioned tables. Both are usable (selected via the ONLY modifier) for an inheritance parent table since it has both storage and children."
The use of both "Ordinarily" and "overwritten" is bothering me here. And it implies that n_distinct doesn't work for inheritance parent tables or, conversely, that n_distinct does work for partitioned tables.
I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things.
> values in the column is linear with the estimated number
> of rows in the table; the exact count is to be computed by multiplying the estimated
> rows in the table by the absolute value of the given number.
> of rows in the table; the exact count is to be computed by multiplying the estimated
> rows in the table by the absolute value of the given number.
"...is proportional to the estimated number of rows in the table at planning time."
(The final "at planning time" substitutes for the sentence you pondered removing.)
(The rest, including the examples, seem a bit self-explanatory given the definition, though I do get reader inexperience with the terminology. But proportional implies a linear relationship, and the positive/negative bifurcation seems straight-forward here.)
I'm thinking everything else below is better incorporated into 14.2 which should be linking back to this section. That way the crux of the usage is defined in syntax while the details about setting a specific value are located in the section covering the overall topic.
the exact count is to be computed by multiplying the estimated
+ rows in the table by the absolute value of the given number. For example,
a value of -1 implies that all values in the column are distinct, while
- a value of -0.5 implies that each value appears twice on the average.
+ a value of -0.5 implies that each value appears twice on average.
This can be useful when the size of the table changes over time, since
the multiplication by the number of rows in the table is not performed
until query planning time.
a value of -1 implies that all values in the column are distinct, while
- a value of -0.5 implies that each value appears twice on the average.
+ a value of -0.5 implies that each value appears twice on average.
This can be useful when the size of the table changes over time, since
the multiplication by the number of rows in the table is not performed
until query planning time.
(Leave: Specify a value of 0 to revert to estimating...)
That said, this rework would be OK as-is.
Also, looking at stadistinct, the multiplier stored there accounts for the presence of null. This attribute-option does not. Is that difference worth noting?
David J.
Hi David,
I think your revision is good and accurate.
On Oct 13, 2025, at 07:42, David Rowley <dgrowleyml@gmail.com> wrote:Just picking this one up again. I forgot to come back to this after PGConf.dev.
I came up with:
Ordinarily <literal>n_distinct</literal> is used.
<literal>n_distinct_inherited</literal> exists to allow the distinct
estimate to be overwritten for the statistics gathered for inheritance
parent tables and for partitioned tables.
This clarifies that n_distinct_inherited applies to both inheritance parents and partitioned tabled, that’s accurate and better than the original wording.
I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things. That's not true. It's the query
planner that uses this value. ANALYZE just stores whatever this is set
to into pg_statistic. I also adjusted the text that was talking about
"the size of the table", which, as I mentioned earlier isn't correct.
It's all related to the estimated number rows in the table, per
"ntuples = vardata->rel->tuples;" in get_variable_numdistinct().
When I read the diff, I thought “table size” is usually used in PG docs, so maybe “estimated table size”. But as you explicitly explained why you made the change, I am fine with your change.
Also fixed a typo; "twice on the average" shouldn't contain "the".
Correct grammer fix.
I wonder if ", since the multiplication by the number of rows in the
table is not performed until query planning time" should be deleted
since I modified the text earlier to talk about "the query planner”.
Yeah, with your rewrite, that clause now feels a little redundant. I think it can be removed entirely.
The other thing that doesn’t belong to your change but as you are touching here:
“When set to a negative value, which must be greater than or equal to -1"
When I first time read the doc, I was confused. Because no easier sentence indicated “n_distinct” is of float type. I thought “greater than” was a typo. When I read through, the later example (0.5) resolved my confusion. To avoid the same confusion to other readers, maybe change to “when set to a negative value between -1 and 0 (inclusive of -1)” or “when set to a negative value, which must be in the range -1<= value < 0”.
But as I said, this doesn’t belong to your change. If you don’t want to enhance this place, or you don’t consider this is a problem, I am absolutely fine.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, 13 Oct 2025 at 14:28, David G. Johnston <david.g.johnston@gmail.com> wrote: > How about: > > "n_disinct is used for normal tables while n_distinct_inherited is used for partitioned tables. Both are usable (selectedvia the ONLY modifier) for an inheritance parent table since it has both storage and children." I specifically was trying my best to avoid using the word "normal" here as it's very indescriptive. Rhetorical question: would you expect a setting which applies to "normal" tables to be applied to a materialized view? Instead, I described the cases where n_distinct_inherited applies, and left it to the reader to assume that n_distinct applies in all other cases, which IMO is correct. I don't see the need to talk about ONLY. I'm only discussing stored statistics in this part, i.e, what's in pg_statistic. > The use of both "Ordinarily" and "overwritten" is bothering me here. And it implies that n_distinct doesn't work for inheritanceparent tables or, conversely, that n_distinct does work for partitioned tables. I don't know why you think that implies that n_distinct works for partitioned tables. I've specifically said that "n_distinct_inherited allows the n_distinct estimate to be overwritten for partitioned tables". Why would the reader then assume n_distinct is for doing that? As for "overwritten", this is exactly what changing this setting does to the pg_statistic.stadistinct when ANALYZE runs for the table, so as far as I'm concerned, it's a good and accurate way to describe what the settings do. I'm open to finding something better for "Ordinarily <literal>n_distinct</literal> is used." >> I also fixed what I thought was some misleading text about ANALYZE >> using this value to calculate things. > > > > values in the column is linear with the estimated number > > of rows in the table; the exact count is to be computed by multiplying the estimated > > rows in the table by the absolute value of the given number. > > "...is proportional to the estimated number of rows in the table at planning time." > > (The final "at planning time" substitutes for the sentence you pondered removing.) I've adjusted the "proportional" part, but the "add planning time" at the end is surplus because we've already mentioned that this is happening in "the query planner". I doubt anyone needs the additional reassurance that things that the query planner does happen at planning time. > (The rest, including the examples, seem a bit self-explanatory given the definition, though I do get reader inexperiencewith the terminology. But proportional implies a linear relationship, and the positive/negative bifurcationseems straight-forward here.) I don't feel the need to remove the examples. I think they're good to reinforce how the negative values are utilised during planning. > I'm thinking everything else below is better incorporated into 14.2 which should be linking back to this section. Thatway the crux of the usage is defined in syntax while the details about setting a specific value are located in the sectioncovering the overall topic. > > the exact count is to be computed by multiplying the estimated > + rows in the table by the absolute value of the given number. For example, > a value of -1 implies that all values in the column are distinct, while > - a value of -0.5 implies that each value appears twice on the average. > + a value of -0.5 implies that each value appears twice on average. > This can be useful when the size of the table changes over time, since > the multiplication by the number of rows in the table is not performed > until query planning time. > > (Leave: Specify a value of 0 to revert to estimating...) Do you mean remove that part? I did consider that as I didn't think the reader needed to know that as they could use RESET (n_distinct); > That said, this rework would be OK as-is. > > Also, looking at stadistinct, the multiplier stored there accounts for the presence of null. This attribute-option doesnot. Is that difference worth noting? It's already becoming more painful to get this changed than I would have anticipated. Feel free to suggest something on a new thread if you think it's worthwhile to mention it. David
On Mon, 13 Oct 2025 at 14:47, Chao Li <li.evan.chao@gmail.com> wrote:
>
> Hi David,
>
> I think your revision is good and accurate.
>
> On Oct 13, 2025, at 07:42, David Rowley <dgrowleyml@gmail.com> wrote:
> I wonder if ", since the multiplication by the number of rows in the
> table is not performed until query planning time" should be deleted
> since I modified the text earlier to talk about "the query planner”.
>
>
> Yeah, with your rewrite, that clause now feels a little redundant. I think it can be removed entirely.
I've now removed that part.
> The other thing that doesn’t belong to your change but as you are touching here:
>
> “When set to a negative value, which must be greater than or equal to -1"
>
> When I first time read the doc, I was confused. Because no easier sentence indicated “n_distinct” is of float type. I
thought“greater than” was a typo. When I read through, the later example (0.5) resolved my confusion. To avoid the same
confusionto other readers, maybe change to “when set to a negative value between -1 and 0 (inclusive of -1)” or “when
setto a negative value, which must be in the range -1<= value < 0”.
I agree that part is a bit clumsy. Starting at this cold again today I
also thought it must mean less than -1, so rewording that seems like a
good idea.
It now reads:
Fractional values may also be specified by using values below 0 and above
or equal to -1. This instructs the query planner to estimate the number
of distinct values by multiplying the absolute value of the specified
number by the estimated number of rows in the table.
Updated patch attached.
David
Вложения
On Oct 13, 2025, at 11:48, David Rowley <dgrowleyml@gmail.com> wrote:
Updated patch attached.
David
<doc_n_distinct_inherited_v4.patch>
The doc change of v4 looks good to me now.
But v4 contains a lot of code changes, are they incidentally included?
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sunday, October 12, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
Updated patch attached.
The patch picked up a whole other set of changes as well.
How about:
n_distinct_inherited exists to allow the distinct value estimate to be overwritten for the children-included statistics gathered for inheritance parent tables, and for partitioned tables.
I can live with “ordinarily” with that wording, it is just the negative of “children-included”.
David J.
On Mon, 13 Oct 2025 at 17:09, David G. Johnston <david.g.johnston@gmail.com> wrote: > How about: > > n_distinct_inherited exists to allow the distinct value estimate to be overwritten for the children-included statisticsgathered for inheritance parent tables, and for partitioned tables. > > I can live with “ordinarily” with that wording, it is just the negative of “children-included”. I've ended up reverting the text back to become more similar to how it was but explicitly adding ", and for the statistics gathered for partitioned tables.", since that's what the original bug report was confused about. I've kept the other document changes as they were in the previous patch. David
Вложения
On Monday, October 13, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 13 Oct 2025 at 17:09, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> How about:
>
> n_distinct_inherited exists to allow the distinct value estimate to be overwritten for the children-included statistics gathered for inheritance parent tables, and for partitioned tables.
>
> I can live with “ordinarily” with that wording, it is just the negative of “children-included”.
I've ended up reverting the text back to become more similar to how it
was but explicitly adding ", and for the statistics gathered for
partitioned tables.", since that's what the original bug report was
confused about.
I've kept the other document changes as they were in the previous patch.
WFM. Since we allow 0 as an input I’m not convinced we should remove that sentence, though I concur that using reset is preferred.
David J.
On Tue, 14 Oct 2025 at 00:55, David G. Johnston <david.g.johnston@gmail.com> wrote: > WFM. Since we allow 0 as an input I’m not convinced we should remove that sentence, though I concur that using reset ispreferred. I've pushed the v5 version. Thanks to both of you for looking. David