Discussion: Re: Extended Statistics set/restore/clear functions.
hi.
I reviewed 0001 only.
in src/backend/statistics/mvdistinct.c
there is no need for #include "nodes/pg_list.h", since
src/include/statistics/statistics.h already includes "nodes/pg_list.h" indirectly.
there is no need for #include "utils/palloc.h",
since #include "postgres.h"
already includes it.
select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
^
DETAIL: All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.
select '{}'::pg_ndistinct ;
segfault
select '{"1,":"1"}'::pg_ndistinct ;
ERROR: malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
^
DETAIL: All ndistinct attnum lists must be a comma separated list of attnums.
imho, this errdetail message is not good either. it would be better to say that
"length of list of attnums must be larger than 1".
select pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname
from pg_type pt1
where pt1.typname ~*'distinct';
typnamespace | typarray | typcategory | typname
--------------+----------+-------------+--------------
11 | 0 | Z | pg_ndistinct
typcategory (Z) is marked as "Internal-use types", and there is no
pg_ndistinct array type;
not sure whether this is fine.
in all of these error reports, one pair of the parentheses is unnecessary.
for example
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
errdetail("Must begin with \"{\"")));
can change to
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
errdetail("Must begin with \"{\""));
see https://www.postgresql.org/docs/current/error-message-reporting.html
hi.
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)
this is not what we expect, is it?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN.
For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
here \t is not a tab character (ASCII 9); it's two characters: a backslash
and the character "t".
so it should error out here?
(the same applies to \n, \r, \b)
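A quick way to see the difference between the two literal forms, independent of pg_ndistinct (assuming the default standard_conforming_strings = on):

select octet_length('\t') as plain, octet_length(E'\t') as escaped;
 plain | escaped
-------+---------
     2 |       1
(1 row)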
pg_ndistinct_in(PG_FUNCTION_ARGS)
ending part should be:
freeJsonLexContext(lex);
if (result == JSON_SUCCESS)
{
......
}
else
{
ereturn(parse_state.escontext, (Datum) 0,
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", str),
errdetail("Must be valid JSON."));
PG_RETURN_NULL();
}
result will be either JSON_SUCCESS or some error, so the branch only needs to distinguish JSON_SUCCESS from everything else.
all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;
do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?
ndistinctParseState needs to be added to src/tools/pgindent/typedefs.list;
probably change the declaration to "typedef struct ndistinctParseState".
also, struct ndistinctParseState should be placed at the top of
src/backend/statistics/mvdistinct.c,
not at line 340?
On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com> wrote:
hi.
I reviewed 0001 only.
in src/backend/statistics/mvdistinct.c
there is no need for #include "nodes/pg_list.h", since
src/include/statistics/statistics.h already includes "nodes/pg_list.h" indirectly.
there is no need for #include "utils/palloc.h",
since #include "postgres.h"
already includes it.
Noted.
select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
^
DETAIL: All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.
What error message do you think is appropriate in that situation?
select '{}'::pg_ndistinct ;
segfault
Mmm, gotta look into that!
select '{"1,":"1"}'::pg_ndistinct ;
ERROR: malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
^
DETAIL: All ndistinct attnum lists must be a comma separated list of attnums.
imho, this errdetail message is not good either. it would be better to say that
"length of list of attnums must be larger than 1".
That sounds better.
typcategory (Z) is marked as "Internal-use types", and there is no
pg_ndistinct array type;
not sure whether this is fine.
I think it's probably ok for now. The datatype currently has no utility other than extended statistics, and I'm doubtful that it ever will.
On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote:
hi.
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)
I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.
this is not what we expect, is it?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN.
For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
here \t is not a tab character (ASCII 9); it's two characters: a backslash
and the character "t".
so it should error out here?
(the same applies to \n, \r, \b)
I don't have a good answer as to what should happen here. Special cases like this make Tomas' suggestion to change the in/out format more attractive.
pg_ndistinct_in(PG_FUNCTION_ARGS)
ending part should be:
freeJsonLexContext(lex);
if (result == JSON_SUCCESS)
{
......
}
else
{
ereturn(parse_state.escontext, (Datum) 0,
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", str),
errdetail("Must be valid JSON."));
PG_RETURN_NULL();
}
result will be either JSON_SUCCESS or some error, so the branch only needs to distinguish JSON_SUCCESS from everything else.
all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;
do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?
The compiler isn't complaining so far, but I see no harm in it.
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)

I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.
I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it with %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below.
I've updated and rebased the patches.
The existing pg_ndistinct and pg_dependencies formats were kept as-is. The formats are clumsy, and more processing-friendly formats would be easier to work with, but the need for such processing is minimal bordering on theoretical, so there is little impact in keeping the historical format.
There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that, for every attnum cited, the positive attnums are ones found in the stxkeys of the pg_statistic_ext tuple, and the negative attnums do not exceed the number of expressions in the statistics object. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted; if it has 2 expressions, then any value of -3 or lower will be rejected, etc.
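As a concrete (hypothetical) example, given a statistics object like

CREATE TABLE t (a int, b int, c int);
CREATE STATISTICS s (ndistinct) ON a, b, (c % 10) FROM t;

stxkeys is {1, 2} and there is one expression, so an imported value may reference attnums 1, 2, and -1, while 3 (not in stxkeys) or -2 (there is only one expression) would be rejected.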
All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
A rebasing, and a few changes
* regnamespace and name parameters changed to statistics_schemaname as text and statistics_name as text, so that there's one less thing that can potentially fail in an upgrade
* schema lookup and stat name lookup failures now issue a warning and return false, rather than ERROR
* elevel replaced with hardcoded WARNING most everywhere, as has been done with relation/attribute stats
On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Just rebasing.
At pgconf.dev this year, the subject of changing the formats of pg_ndistinct and pg_dependencies came up again.
To recap: presently these datatypes have no working input function, but would need one for statistics import to work on extended statistics. The existing input formats are technically JSON, but the keys themselves are a comma-separated list of attnums, so they require additional parsing. That parsing is already done in the patches in this thread, but overall the format is terrible for any sort of manipulation, such as translating the values for a table with a different column order (say, after a restore of a table that had dropped columns), or doing query planner experiments.
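To illustrate the sort of extra parsing the current format forces on you, the attnums have to be split back out of the object keys; a rough sketch with a made-up value:

select string_to_array(key, ', ')::int2[] as attnums, value::float8 as ndistinct
from jsonb_each_text('{"2, 3": 4, "2, -1": 4}'::jsonb);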
Because the old formats don't have a corresponding input function, there is no risk of the output not matching required inputs, but there will be once we add new input functions, so this is our last chance to change the format to something we like better.
The old format can be trivially translated via functions posted earlier in this thread back in January (pg_xlat_ndistinct_to_attnames, pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/), so dumping values from older versions will not be difficult.
I believe that we should take this opportunity to make the change. While we don't have a pressing need to manipulate these structures now, we might in the future and failing to do so now makes a later change much harder.
With that in mind, I'd like people to have a look at the proposed format change for pg_ndistinct (the changes to pg_dependencies are similar), to see if they want to make any improvements or comments. As you can see, the new format is much less compact (about 3x as large), which could get bad if the number of elements grew by a lot, but the number of elements is tied to the number of attributes in the extended statistics object (N choose N, then N choose N-1, etc., excluding choose 1), so this can't get too out of hand.
Existing format (newlines/formatting added by me to make head-to-head comparison easier):
'{"2, 3": 4,
'{"2, 3": 4,
"2, -1": 4,
"2, -2": 4,
"3, -1": 4,
"3, -2": 4,
"-1, -2": 3,
"2, 3, -1": 4,
"2, 3, -2": 4,
"2, -1, -2": 4,
"3, -1, -2": 4}'::pg_ndistinct
Proposed new format (again, all formatting here is just for ease of human reading):
' [ {"attributes" : [2,3], "ndistinct" : 4},
{"attributes" : [2,-1], "ndistinct" : 4},
{"attributes" : [2,-2], "ndistinct" : 4},
{"attributes" : [3,-1], "ndistinct" : 4},
{"attributes" : [3,-2], "ndistinct" : 4},
{"attributes" : [-1,-2], "ndistinct" : 3},
{"attributes" : [2,3,-1], "ndistinct" : 4},
{"attributes" : [2,3,-2], "ndistinct" : 4},
{"attributes" : [2,-1,-2], "ndistinct" : 4},
{"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct
The pg_dependencies structure is only slightly more complex:
An abbreviated example:
{"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3, -1 => 2": 1.000000},
Becomes:
[ {"attributes": [2], "dependency": 1, "degree": 1.000000},
{"attributes": [2], "dependency": -1, "degree": 1.000000},
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
...,
...,
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
{"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]
{"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]
Any thoughts on using/improving these structures?
Hearing no objections, here is the latest patchset.
0001 - Changes input/output functions of pg_ndistinct to the format described earlier.
0002 - Changes input/output functions of pg_dependencies to the format described earlier.
0003 - Makes some previously internal/static attribute stats functions visible to extended_stats.c, because the exprs attribute is basically an array of partially filled-out pg_statistic rows.
0004 - Adds pg_restore_extended_stats(), pg_clear_extended_stats(), in the pattern of their relation/attribute brethren.
0005 - Adds the dumping and restoring of extended statistics back to v10. No command line flag changes needed.
Attachments
- v4-0001-Refactor-output-format-of-pg_ndistinct-and-add-wo.patch
- v4-0002-Refactor-output-format-of-pg_dependencies-and-add.patch
- v4-0003-Expose-attribute-statistics-functions-for-use-in-.patch
- v4-0004-Add-extended-statistics-support-functions.patch
- v4-0005-Include-Extended-Statistics-in-pg_dump.patch
Rebased. Enjoy.
Attachments
- v5-0001-Refactor-output-format-of-pg_ndistinct-and-add-wo.patch
- v5-0002-Refactor-output-format-of-pg_dependencies-and-add.patch
- v5-0003-Expose-attribute-statistics-functions-for-use-in-.patch
- v5-0004-Add-extended-statistics-support-functions.patch
- v5-0005-Include-Extended-Statistics-in-pg_dump.patch
Rebased.
Attachments
- v6-0001-Refactor-output-format-of-pg_ndistinct-and-add-wo.patch
- v6-0002-Refactor-output-format-of-pg_dependencies-and-add.patch
- v6-0003-Expose-attribute-statistics-functions-for-use-in-.patch
- v6-0004-Add-extended-statistics-support-functions.patch
- v6-0005-Include-Extended-Statistics-in-pg_dump.patch
And rebased again to conform to 688dc6299 and 4bd919129.
Attachments
- v7-0001-Refactor-output-format-of-pg_ndistinct-and-add-wo.patch
- v7-0002-Refactor-output-format-of-pg_dependencies-and-add.patch
- v7-0003-Expose-attribute-statistics-functions-for-use-in-.patch
- v7-0004-Add-extended-statistics-support-functions.patch
- v7-0005-Include-Extended-Statistics-in-pg_dump.patch
On Sat, Oct 18, 2025 at 08:27:58PM -0400, Corey Huinker wrote:
> And rebased again to conform to 688dc6299 and 4bd919129.

The format redesign for extended stats is pretty nice, as done in
0001. I think that this patch should be split in two, actually, as it
tackles two issues:
- One patch for the format change.
- Second patch for the introduction of the input function, useful on
its own to allow more regression tests paths for the format generated.

Similar split comment for 0002 regarding pg_dependencies. The format
change is one thing. The input function is cool to have for input
validation, still not absolutely mandatory for the core part of the
format change.

The functions exposed in 0003 should be renamed to match more with the
style of the rest, aka it is a bit hard to figure out what they do at
first sight. Presumably, these should be prefixed with some
"statext_", except text_to_stavalues() which could still be named the
same.

Do you have some numbers regarding the increase in size this generates
for the catalogs?

0004 has been designed following the same model as the relation and
attribute stats. That sounds OK here.

+enum extended_stats_argnum
[...]
+enum extended_stats_exprs_element

It would be nice to document why such things are around. That would
be less guessing for somebody reading the code.

Reusing this small sequence from your pg_dump patch, executed on a v14
backend:
create schema dump_test;
CREATE TABLE dump_test.has_ext_stats
  AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g);
CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats;
ANALYZE dump_test.has_ext_stats;

Then pg_dump fails:
pg_dump: error: query failed: ERROR: column e.inherited does not exist
LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite...

+ * TODO: Until v18 is released the master branch has a
+ * server_version_num of 180000. We will update this to 190000 as soon
+ * as the master branch updates.

This part has not been updated.

+ Assert(item.nattributes > 0); /* TODO: elog? */
[...]
+ Assert(dependency->nattributes > 1); /* TODO: elog? */

Yes and yes. It seems like it should be possible to craft some input
that triggers these..

+void
+free_pg_dependencies(MVDependencies *dependencies);

Double declaration of this routine in dependencies.c.

Perhaps some of the regression tests could use some jsonb_pretty() in
the outputs generated. Some of the results generated are very hard to
parse, something that would become harder in the buildfarm. This
comment starts with 0001 for stxdndistinct.

I have mixed feelings about 0005, FWIW. I am wondering if we should
not lift the needle a bit here and only support the dump of extended
statistics when dealing with a backend of at least v19. This would
mean that we would only get the full benefit of this feature once
people upgrade to v20 or dump from a pg_dump with --statistics from at
least v19, but with the long-term picture in mind this would also make
the dump/restore picture of the patch dead simple (spoiler: I like
simple).

Tomas, what is your take about the format changes and my argument
about the backward requirements of pg_dump (about not dumping these
stats if connecting to a server older than v18, included)?
--
Michael
On Tue, Oct 21, 2025 at 02:48:37PM +0900, Michael Paquier wrote:
> Tomas, what is your take about the format changes and my argument
> about the backward requirements of pg_dump (about not dumping these
> stats if connecting to a server older than v18, included)?

By the way, the patch is still needed in the previous CF of September:
https://commitfest.postgresql.org/patch/5517/
You may want to move it.
--
Michael
The functions exposed in 0003 should be renamed to match more with the
style of the rest, aka it is a bit hard to figure out what they do at
first sight. Presumably, these should be prefixed with some
"statext_", except text_to_stavalues() which could still be named the
same.
Do you have some numbers regarding the increase in size this generates
for the catalogs?

Sorry, I don't understand. There shouldn't be any increase inside the catalogs as the internal storage of the datatypes hasn't changed, so I can only conclude that you're referring to something else.
0004 has been designed following the same model as the relation and
attribute stats. That sounds OK here.
+enum extended_stats_argnum
[...]
+enum extended_stats_exprs_element
It would be nice to document why such things are around. That would
be less guessing for somebody reading the code.

The equivalent structures in attribute_stats.c will need documenting too.
Reusing this small sequence from your pg_dump patch, executed on a v14
backend:
create schema dump_test;
CREATE TABLE dump_test.has_ext_stats
AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g);
CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats;
ANALYZE dump_test.has_ext_stats;
Then pg_dump fails:
pg_dump: error: query failed: ERROR: column e.inherited does not exist
LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite...
Noted.
+ * TODO: Until v18 is released the master branch has a
+ * server_version_num of 180000. We will update this to 190000
as soon
+ * as the master branch updates.
This part has not been updated.
+ Assert(item.nattributes > 0); /* TODO: elog? */
[...]
+ Assert(dependency->nattributes > 1); /* TODO: elog? */
Yes and yes. It seems like it should be possible to craft some input
that triggers these..
+1
+void
+free_pg_dependencies(MVDependencies *dependencies);
Double declaration of this routine in dependencies.c.
Perhaps some of the regression tests could use some jsonb_pretty() in
the outputs generated. Some of the results generated are very hard to
parse, something that would become harder in the buildfarm. This
comment starts with 0001 for stxdndistinct.
I have mixed feelings about 0005, FWIW. I am wondering if we should
not lift the needle a bit here and only support the dump of extended
statistics when dealing with a backend of at least v19. This would
mean that we would only get the full benefit of this feature once
people upgrade to v20 or dump from a pg_dump with --statistics from at
least v19, but with the long-term picture in mind this would also make
the dump/restore picture of the patch dead simple (spoiler: I like
simple).
I also like simple.
Right now we have a situation where the vast majority of databases can carry forward all of their stats via pg_upgrade, except for those databases that have extended stats. The trouble is, most customers don't know if their database uses extended statistics or not, and those that do are in for some bad query plans if they haven't run vacuumdb --missing-stats-only. Explaining that to customers is complicated, especially when most of them do not know what extended stats are, let alone whether they have them. It would be a lot simpler to just say "all stats are carried over on upgrade", and vacuumdb becomes unnecessary, making upgrades one step simpler as well.
Given that, I think that the admittedly ugly transformation is worth it, and sequestering it inside pg_dump is the smallest footprint it can have. Earlier in this thread I posted some functions that did the translation from the existing formats to the proposed new formats. We could include those as new system functions, and that would make the dump code very simple. Having said that, I don't know that there would be use for those functions except inside pg_dump, hence the decision to do the transforms right in the dump query.
If the format translation is a barrier to fetching existing extended stats, then I'd be more inclined to keep the existing pg_ndistinct and pg_dependencies data formats as they are now.
On Wed, Oct 22, 2025 at 02:55:31PM +0300, Corey Huinker wrote:
>> Do you have some numbers regarding the increase in size this generates
>> for the catalogs?
>
> Sorry, I don't understand. There shouldn't be any increase inside the
> catalogs as the internal storage of the datatypes hasn't changed, so I can
> only conclude that you're referring to something else.

The new format meant more characters, perhaps I've just missed
something while quickly testing the patch.. Anyway, that's OK at this
stage.

> The equivalent structures in attribute_stats.c will need documenting too.

Right. This sounds like a separate patch to me, impacting HEAD.

> Right now we have a situation where the vast majority of databases can
> carry forward all of their stats via pg_upgrade, except for those databases
> that have extended stats. The trouble is, most customers don't know if
> their database uses extended statistics or not, and those that do are in
> for some bad query plans if they haven't run vacuumdb --missing-stats-only.
> Explaining that to customers is complicated, especially when most of them
> do not know what extended stats are, let alone whether they have them. It
> would be a lot simpler to just say "all stats are carried over on upgrade",
> and vacuumdb becomes unnecessary, making upgrades one step simpler as well.

Okay.

> Given that, I think that the admittedly ugly transformation is worth it,
> and sequestering it inside pg_dump is the smallest footprint it can have.
> Earlier in this thread I posted some functions that did the translation
> from the existing formats to the proposed new formats. We could include
> those as new system functions, and that would make the dump code very
> simple. Having said that, I don't know that there would be use for those
> functions except inside pg_dump, hence the decision to do the transforms
> right in the dump query.

I'd prefer the new format. One killer pushing in favor of the new
format that you are making upthread in favor of is that it makes much
easier the viewing, editing and injecting of these stats. It's the
part of the patch where we would need Tomas' input on the matter
before deciding anything, I guess, as primary author of the original
facilities. My view of the problem is just one opinion.

> If the format translation is a barrier to fetching existing extended stats,
> then I'd be more inclined to keep the existing pg_ndistinct and
> pg_dependencies data formats as they are now.

Not necessarily, it can be possible to also take that in multiple
steps rather than a single one:
- First do the basics in v19 with the new format.
- Raise the bar to older versions.
--
Michael
On Thu, Oct 23, 2025 at 08:46:27AM +0900, Michael Paquier wrote:
> I'd prefer the new format. One killer pushing in favor of the new
> format that you are making upthread in favor of is that it makes much
> easier the viewing, editing and injecting of these stats.

This is missing an "argument", as in "One killer argument pushing in favor.." :D
--
Michael