Обсуждение: Defects with invalid stats data for expressions in extended stats
Hi all,
(Adding Tomas in CC., as primary author of the feature dealt with
here, and Corey, due to his work for the restore of extstats.)
While looking at the proposed patch for the restore of extended
statistics on expressions, I have bumped into two defects that exist
since this feature has been introduced in v15. First, I have thought
that this was a problem only related to the proposed patch, but after
more analysis, I have found that this issue is independent, and can be
triggered without restoring any stats.
First, one thing that is important to know is that when defining an
extstat object with N expressions, what we finish by storing in the
catalogs is an array of N pg_statistics tuples. Some expressions can
have invalid data, symbolized by this code in extended_stats.c
if (!stats->stats_valid)
{
astate = accumArrayResult(astate,
(Datum) 0,
true,
typOid,
CurrentMemoryContext);
continue;
}
There is nothing wrong with that. Having N elements in the array of
pg_statistics tuples is a requirement, and the code clearly intends
that. There should be no more and no less elements, and this is used
as a marker to let the code that loads this catalog data that nothing
could be computed. This data is inserted when we run ANALYZE.
Some code paths are unfortunately not water-proof with this NULL-ness
handling, and I have found two of them as fixed by 0001.
1) When building statistics, lookup_var_attr_stats() has missed the
fact that computing stats for an expression could lead to invalid
stats being generated. examine_attribute() deals with this case by
returning NULL if either:
1-1) the typanalyze callback returns false,
1-2) The number of rows returned is negative.
1-3) For whatever reason in a custom type implementation, the
compute_stats callback is not set.
lookup_var_attr_stats() has been dealing with the case of invalid
stats for attributes, but it has missed the mark after calling
examine_attribute() for each expression. Note that
examine_attribute() exists in both extended_stats.c and analyze.c,
they are very close in shape, and need to rely on the same assumptions
in terms of what the typanalyze callback can return.
lookup_var_attr_stats() has two callers, both are able to deal with
NULL data (aka invalid stats). A consequence of this issue is a set
of NULL pointer dereferences for MCV, ndistinct and dependencies, as
all these code paths expect something to exist for each expression.
As there is no stats data, each of them would crash. At least one
needs to be specified when creating an extstat object.
2) When loading statistics, statext_expressions_load() missed that
some elements in the pg_statistics array could be NULL. This issue
can only be triggered if we have some invalid data stored in the
catalogs. This issue can be triggered on any branches with a
typanalyze callback that returns true, where stats_valid is set to
false when computing the stats (all the in-core callbacks set this
flag to true, nobody in their right mind would do that except me here,
I suspect). The restore of extended stats for expressions makes
this defect more easily reachable by *bypassing* the build, but as the
previous sentence describes it is *not* a mandatory requirement
depending on what a typanalyze callback does. Hence, we have patch it
in all the stable branches anyway. The code allows NULL data to exist
for some expressions, but the load does not cope with it. This is
reflected by fixing two code paths:
2-1) statext_expressions_load() needs to be able to load NULL data.
2-2) examine_variable() in selfuncs.c needs to lead with this possible
consequence.
All the callers of examine_variable() have an exit path in selfuncs.c
if there is no stats data available, assuming some defaults, in the
event where statsTuple is NULL (aka invalid). Note that it is
possible to reach this path with a typanalyze that returns true,
meaning that some data is available and that we store NULL data to be
stored, but the compute_stats callback has the idea to set stats_valid
to false. We never set stats_valid to false in any of the in-core
callbacks.
In order to demonstrate these two bugs, I have implemented a test
module called test_custom_types, as of 0002 (btree operator bloats the
module, it is required for extended stats), that includes a simple
custom type with two buggy typanalyze callbacks (one for the build
defect, and one for the load defect). I think that it would be a good
thing to backpatch this module with the main fix, so as we can cover
these fancy scenarios for all released branches. This could be
enlarged for more cases if we have more defects detected in the future
in this area of the code. This affects versions down to v15.
In order to finish the business with the restore of extended stats for
this release, these defects have to be addressed first. It does not
impact what has been already committed for the restore of extended
stats, fortunately: we have not touched the expressions yet and the
patch is still floating around in this CF.
I am registering this item in the final CF, as a bug fix. The
typanalyze requirements may sound it like something worth only fixing
on HEAD, but I don't really see a reason why back-branches should not
be fixed as well.
So, thoughts or comments?
--
Michael
Вложения
On Thu, Feb 26, 2026 at 7:54 PM Michael Paquier <michael@paquier.xyz> wrote:
Hi all,
(Adding Tomas in CC., as primary author of the feature dealt with
here, and Corey, due to his work for the restore of extstats.)
While looking at the proposed patch for the restore of extended
statistics on expressions, I have bumped into two defects that exist
since this feature has been introduced in v15. First, I have thought
that this was a problem only related to the proposed patch, but after
more analysis, I have found that this issue is independent, and can be
triggered without restoring any stats.
First, one thing that is important to know is that when defining an
extstat object with N expressions, what we finish by storing in the
catalogs is an array of N pg_statistics tuples. Some expressions can
have invalid data, symbolized by this code in extended_stats.c
if (!stats->stats_valid)
{
astate = accumArrayResult(astate,
(Datum) 0,
true,
typOid,
CurrentMemoryContext);
continue;
}
There is nothing wrong with that. Having N elements in the array of
pg_statistics tuples is a requirement, and the code clearly intends
that. There should be no more and no less elements, and this is used
as a marker to let the code that loads this catalog data that nothing
could be computed. This data is inserted when we run ANALYZE.
Some code paths are unfortunately not water-proof with this NULL-ness
handling, and I have found two of them as fixed by 0001.
1) When building statistics, lookup_var_attr_stats() has missed the
fact that computing stats for an expression could lead to invalid
stats being generated. examine_attribute() deals with this case by
returning NULL if either:
1-1) the typanalyze callback returns false,
1-2) The number of rows returned is negative.
1-3) For whatever reason in a custom type implementation, the
compute_stats callback is not set.
lookup_var_attr_stats() has been dealing with the case of invalid
stats for attributes, but it has missed the mark after calling
examine_attribute() for each expression. Note that
examine_attribute() exists in both extended_stats.c and analyze.c,
they are very close in shape, and need to rely on the same assumptions
in terms of what the typanalyze callback can return.
lookup_var_attr_stats() has two callers, both are able to deal with
NULL data (aka invalid stats). A consequence of this issue is a set
of NULL pointer dereferences for MCV, ndistinct and dependencies, as
all these code paths expect something to exist for each expression.
As there is no stats data, each of them would crash. At least one
needs to be specified when creating an extstat object.
2) When loading statistics, statext_expressions_load() missed that
some elements in the pg_statistics array could be NULL. This issue
can only be triggered if we have some invalid data stored in the
catalogs. This issue can be triggered on any branches with a
typanalyze callback that returns true, where stats_valid is set to
false when computing the stats (all the in-core callbacks set this
flag to true, nobody in their right mind would do that except me here,
I suspect). The restore of extended stats for expressions makes
this defect more easily reachable by *bypassing* the build, but as the
previous sentence describes it is *not* a mandatory requirement
depending on what a typanalyze callback does. Hence, we have patch it
in all the stable branches anyway. The code allows NULL data to exist
for some expressions, but the load does not cope with it. This is
reflected by fixing two code paths:
2-1) statext_expressions_load() needs to be able to load NULL data.
2-2) examine_variable() in selfuncs.c needs to lead with this possible
consequence.
All the callers of examine_variable() have an exit path in selfuncs.c
if there is no stats data available, assuming some defaults, in the
event where statsTuple is NULL (aka invalid). Note that it is
possible to reach this path with a typanalyze that returns true,
meaning that some data is available and that we store NULL data to be
stored, but the compute_stats callback has the idea to set stats_valid
to false. We never set stats_valid to false in any of the in-core
callbacks.
In order to demonstrate these two bugs, I have implemented a test
module called test_custom_types, as of 0002 (btree operator bloats the
module, it is required for extended stats), that includes a simple
custom type with two buggy typanalyze callbacks (one for the build
defect, and one for the load defect). I think that it would be a good
thing to backpatch this module with the main fix, so as we can cover
these fancy scenarios for all released branches. This could be
enlarged for more cases if we have more defects detected in the future
in this area of the code. This affects versions down to v15.
In order to finish the business with the restore of extended stats for
this release, these defects have to be addressed first. It does not
impact what has been already committed for the restore of extended
stats, fortunately: we have not touched the expressions yet and the
patch is still floating around in this CF.
I am registering this item in the final CF, as a bug fix. The
typanalyze requirements may sound it like something worth only fixing
on HEAD, but I don't really see a reason why back-branches should not
be fixed as well.
So, thoughts or comments?
Great detective work.
Patch applies for me, but there seems to be some user-specific stuff in the test, which causes it to fail:
diff -U3 /home/corey/src/postgres/src/test/modules/test_custom_types/expected/test_custom_types.out /home/corey/src/postgres/build/testrun/test_custom_types/regress/results/test_custom_types.out
--- /home/corey/src/postgres/src/test/modules/test_custom_types/expected/test_custom_types.out 2026-02-26 22:34:30.928378989 -0500
+++ /home/corey/src/postgres/build/testrun/test_custom_types/regress/results/test_custom_types.out 2026-02-26 22:35:53.005346284 -0500
@@ -127,7 +127,7 @@
tablename | test_table
statistics_schemaname | public
statistics_name | test_stats
-statistics_owner | ioltas
+statistics_owner | corey
expr | func_int_custom(data)
inherited | f
null_frac |
--- /home/corey/src/postgres/src/test/modules/test_custom_types/expected/test_custom_types.out 2026-02-26 22:34:30.928378989 -0500
+++ /home/corey/src/postgres/build/testrun/test_custom_types/regress/results/test_custom_types.out 2026-02-26 22:35:53.005346284 -0500
@@ -127,7 +127,7 @@
tablename | test_table
statistics_schemaname | public
statistics_name | test_stats
-statistics_owner | ioltas
+statistics_owner | corey
expr | func_int_custom(data)
inherited | f
null_frac |
A nitpick about the test - it uses a plpgsql function when we've been moving such trivial functions to SQL standard function bodies for a while now, and they were introduced back in v14 so there's no backporting concern. Blah blah, eat our own dogfood blah.
On Thu, Feb 26, 2026 at 10:52:48PM -0500, Corey Huinker wrote: > Patch applies for me, but there seems to be some user-specific stuff in the > test, which causes it to fail: Yep. I've noticed that in the CI a few minutes ago. I have switched the tests to use a query where the owner does not show up, leading to the same coverage without the user-dependency blip. I have checked that this version cools down the CI. > A nitpick about the test - it uses a plpgsql function when we've been > moving such trivial functions to SQL standard function bodies for a while > now, and they were introduced back in v14 so there's no backporting > concern. No, that's on purpose. Using a SQL function with a body would not trigger the problem with the stats loaded at the end of the SQL test as we would skip the fatal call of statext_expressions_load(). Based on your confusion, I guess that a note to document that is in order, at least, so as nobody comes with the idea of changing the definition of this function.. -- Michael
Вложения
No, that's on purpose. Using a SQL function with a body would not
trigger the problem with the stats loaded at the end of the SQL test
as we would skip the fatal call of statext_expressions_load(). Based
on your confusion, I guess that a note to document that is in order,
at least, so as nobody comes with the idea of changing the definition
of this function..
Thanks for the explanation. Changes look good.
> On Feb 27, 2026, at 12:25, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Feb 26, 2026 at 10:52:48PM -0500, Corey Huinker wrote:
>> Patch applies for me, but there seems to be some user-specific stuff in the
>> test, which causes it to fail:
>
> Yep. I've noticed that in the CI a few minutes ago. I have switched
> the tests to use a query where the owner does not show up, leading to
> the same coverage without the user-dependency blip. I have checked
> that this version cools down the CI.
>
>> A nitpick about the test - it uses a plpgsql function when we've been
>> moving such trivial functions to SQL standard function bodies for a while
>> now, and they were introduced back in v14 so there's no backporting
>> concern.
>
> No, that's on purpose. Using a SQL function with a body would not
> trigger the problem with the stats loaded at the end of the SQL test
> as we would skip the fatal call of statext_expressions_load(). Based
> on your confusion, I guess that a note to document that is in order,
> at least, so as nobody comes with the idea of changing the definition
> of this function..
> --
> Michael
>
<v2-0001-Fix-two-defects-with-extended-statistics-for-expr.patch><v2-0002-test_custom_types-Test-module-for-custom-data-typ.patch>
A few small comments from an eyeball review:
1 - 0001
```
stats[i] = examine_attribute(expr);
+ /*
+ * If the expression has been found as non-analyzable, give up. We
+ * will not be able to build extended stats with it.
+ */
+ if (stats[i] == NULL)
+ {
+ pfree(stats);
+ return NULL;
+ }
```
Here stats itself is destroyed, but memory pointed by stats[0]~stats[i-1] are not free-ed, those memory are returned
fromexamine_attribute() by palloc0_object().
2 - 0002
```
/*
* int_custom_typanalyze_invalid
*
* This function returns sets some invalid stats data, letting the caller know
* that we are safe for an analyze, returning true.
```
“This function returns sets …”, is “returns” a typo and not needed?
3 - 0002
```
+-- Dummy function used for expression evaluations.
+-- Note that this function does not use a function body on purpose, so as
+-- external statistics can be loaded from it.
+CREATE OR REPLACE FUNCTION func_int_custom (p_value int_custom)
+ RETURNS int_custom LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN p_value;
+ END; $$;
```
The comment says “this function does not use a function body”, but the function has a body. This appears in two places.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Feb 27, 2026 at 02:51:40PM +0800, Chao Li wrote: > Here stats itself is destroyed, but memory pointed by > stats[0]~stats[i-1] are not free-ed, those memory are returned from > examine_attribute() by palloc0_object(). I am aware of that. This is not done on simplicity ground, keeping the cleanup of the memory context to ANALYZE in this case. > “This function returns sets …”, is “returns” a typo and not needed? That's a typo, thanks for re-reading. > The comment says “this function does not use a function body”, but > the function has a body. This appears in two places. Let's just append a "SQL-standard" here, then. -- Michael
Вложения
On Fri, Feb 27, 2026 at 04:33:24PM +0900, Michael Paquier wrote:
> On Fri, Feb 27, 2026 at 02:51:40PM +0800, Chao Li wrote:
>> Here stats itself is destroyed, but memory pointed by
>> stats[0]~stats[i-1] are not free-ed, those memory are returned from
>> examine_attribute() by palloc0_object().
>
> I am aware of that. This is not done on simplicity ground, keeping
> the cleanup of the memory context to ANALYZE in this case.
About this one, something worth noting is the beginning of
do_analyze_rel(), which does the following:
/*
* Set up a working context so that we can easily free whatever junk gets
* created.
*/
anl_context = AllocSetContextCreate(CurrentMemoryContext,
"Analyze",
ALLOCSET_DEFAULT_SIZES);
caller_context = MemoryContextSwitchTo(anl_context);
So these extra allocations would just be freed under this memory
context umbrella once we are done processing a single relation. This
works even if we begin to repeat ANALYZE commands that fail to build
some of the stats in a repeated fashion in a single transaction block.
Code simplicity and readability is just a better choice for this path.
--
Michael
Вложения
> On Mar 2, 2026, at 07:05, Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Feb 27, 2026 at 04:33:24PM +0900, Michael Paquier wrote: >> On Fri, Feb 27, 2026 at 02:51:40PM +0800, Chao Li wrote: >>> Here stats itself is destroyed, but memory pointed by >>> stats[0]~stats[i-1] are not free-ed, those memory are returned from >>> examine_attribute() by palloc0_object(). >> >> I am aware of that. This is not done on simplicity ground, keeping >> the cleanup of the memory context to ANALYZE in this case. > > About this one, something worth noting is the beginning of > do_analyze_rel(), which does the following: > /* > * Set up a working context so that we can easily free whatever junk gets > * created. > */ > anl_context = AllocSetContextCreate(CurrentMemoryContext, > "Analyze", > ALLOCSET_DEFAULT_SIZES); > caller_context = MemoryContextSwitchTo(anl_context); > > So these extra allocations would just be freed under this memory > context umbrella once we are done processing a single relation. This > works even if we begin to repeat ANALYZE commands that fail to build > some of the stats in a repeated fashion in a single transaction block. > Code simplicity and readability is just a better choice for this path. > -- > Michael Yeah, memory context is a great mechanism. But why do we still explicitly free the stats array? My concern is mostly about “partial free”: either free everything, orfree nothing and let the memory context clean it up. “Partial free” tends to confuse code readers, and future readers maykeep running into the same question. This feels similar to what we discussed in [1]. If we don’t free the container at all, that’s fine too. But freeing the containerwhile leaving its members behind is also a kind of “partial free”, and it’s hard to reason about. [1] https://www.postgresql.org/message-id/5DED17B4-D948-4C0B-9DE1-A915D0BFAA54%40gmail.com Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
On Mon, Mar 02, 2026 at 07:39:45AM +0800, Chao Li wrote: > But why do we still explicitly free the stats array? My concern is > mostly about “partial free”: either free everything, or free nothing > and let the memory context clean it up. “Partial free” tends to > confuse code readers, and future readers may keep running into the > same question. Not necessarity. We have plenty of these patterns in the code with such partial frees. At the end, two things tend to take priority: the readability of the code as well as its efficiency because repeated pfree() calls can be more expensive than a memory context cleanup, particularly in hot loops, though doing pfree() calls may be better in some cases. In this case, I see a better argument with the code clarity, as we use the same array for the stats built with attributes and expressions. We also cap the number of expressions to 8, which puts some control. But at the end it does not really matter: in this context ANALYZE makes sure that nothing crosses the processing of a single relation, even with a transaction analyzes a lot of relations (same or different). -- Michael
Вложения
On Thu, Feb 26, 2026 at 11:50:19PM -0500, Corey Huinker wrote: > Thanks for the explanation. Changes look good. Thanks. I have spent a few more hours head down on it, and applied the whole as two commits down to v14: one for the fix, one for the module. I have adjusted a couple of things in the module, with a few things to make it work in back-branches. A scan of pg_stats_ext_exprs felt in order after we fail to build the stats entirely, to make sure that there is nothing detected in the catalogs, so I have added a shorter version of the SELECT. -- Michael