Обсуждение: aggregate crash

Поиск
Список
Период
Сортировка

aggregate crash

От
Teodor Sigaev
Дата:
Hi!

Found crash on production instance, assert-enabled build crashes in pfree() 
call, with default config. v11, v12 and head are affected, but, seems, you need 
to be a bit lucky.

The bug is comparing old and new aggregate pass-by-ref values only by pointer 
value itself, despite on null flag. Any function which returns null doesn't 
worry about actual returned Datum value, so that comparison isn't enough. Test 
case shows bug with ExecInterpExpr() but there several similar places (thanks 
Nikita Glukhov for help).
Attached patch adds check of null flag.

How to reproduce:
http://sigaev.ru/misc/xdump.sql.bz2
bzcat xdump.sql.bz2 | psql postgres && psql postgres < x.sql


Backtrace from v12 (note, newValue and oldValue are differ on current call, but 
oldValue points into pfreed memory) :
#0  0x0000000000c8405a in GetMemoryChunkContext (pointer=0x80a808250) at 
../../../../src/include/utils/memutils.h:130
130             AssertArg(MemoryContextIsValid(context));
(gdb) bt
#0  0x0000000000c8405a in GetMemoryChunkContext (pointer=0x80a808250) at 
../../../../src/include/utils/memutils.h:130
#1  0x0000000000c85ae5 in pfree (pointer=0x80a808250) at mcxt.c:1058
#2  0x000000000080475e in ExecAggTransReparent (aggstate=0x80a806370, 
pertrans=0x80a87e830, newValue=34535940744, newValueIsNull=false, 
oldValue=34535932496, oldValueIsNull=false)
     at execExprInterp.c:4209
#3  0x00000000007ff51f in ExecInterpExpr (state=0x80a87f4d8, 
econtext=0x80a8065a8, isnull=0x7fffffffd7b7) at execExprInterp.c:1747
#4  0x000000000082c12b in ExecEvalExprSwitchContext (state=0x80a87f4d8, 
econtext=0x80a8065a8, isNull=0x7fffffffd7b7) at 
../../../src/include/executor/executor.h:308
#5  0x000000000082bc0f in advance_aggregates (aggstate=0x80a806370) at nodeAgg.c:679
#6  0x000000000082b8a6 in agg_retrieve_direct (aggstate=0x80a806370) at 
nodeAgg.c:1847
#7  0x0000000000828782 in ExecAgg (pstate=0x80a806370) at nodeAgg.c:1572
#8  0x000000000080e712 in ExecProcNode (node=0x80a806370) at 
../../../src/include/executor/executor.h:240
#9  0x000000000080a4a1 in ExecutePlan (estate=0x80a806120, 
planstate=0x80a806370, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true, numberTuples=0,
     direction=ForwardScanDirection, dest=0x80a851cc0, execute_once=true) at 
execMain.c:1646
#10 0x000000000080a362 in standard_ExecutorRun (queryDesc=0x80a853120, 
direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#11 0x000000000080a114 in ExecutorRun (queryDesc=0x80a853120, 
direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308
#12 0x0000000000a79d6f in PortalRunSelect (portal=0x80a70d120, forward=true, 
count=0, dest=0x80a851cc0) at pquery.c:929
#13 0x0000000000a79807 in PortalRun (portal=0x80a70d120, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x80a851cc0, 
altdest=0x80a851cc0, completionTag=0x7fffffffdc30 "")
     at pquery.c:770
#14 0x0000000000a74e49 in exec_simple_query (
     query_string=0x800d02950 

"SELECT\nT1._Q_001_F_000,\nT1._Q_001_F_001,\nT1._Q_001_F_002RRef,\nT1._Q_001_F_003RRef,\nT1._Q_001_F_004RRef,\nT1._Q_001_F_005RRef,\nMAX(CASE

WHEN (T1._Q_001_F_010 > CAST(0 AS NUMERIC)) THEN T2._Q_001_F_009RR"...) at 
postgres.c:1227
#15 0x0000000000a74123 in PostgresMain (argc=1, argv=0x80a6ef8f0, 
dbname=0x80a6ef850 "postgres", username=0x80a6ef830 "teodor") at postgres.c:4291
#16 0x00000000009a4c3b in BackendRun (port=0x80a6e6000) at postmaster.c:4498
#17 0x00000000009a403a in BackendStartup (port=0x80a6e6000) at postmaster.c:4189
#18 0x00000000009a2f63 in ServerLoop () at postmaster.c:1727
#19 0x00000000009a0a0a in PostmasterMain (argc=3, argv=0x7fffffffe3c8) at 
postmaster.c:1400
#20 0x000000000088deef in main (argc=3, argv=0x7fffffffe3c8) at main.c:210

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Вложения

Re: aggregate crash

От
Andres Freund
Дата:
Hi,

On 2019-12-27 20:13:26 +0300, Teodor Sigaev wrote:
> Found crash on production instance, assert-enabled build crashes in pfree()
> call, with default config. v11, v12 and head are affected, but, seems, you
> need to be a bit lucky.
> 
> The bug is comparing old and new aggregate pass-by-ref values only by
> pointer value itself, despite on null flag. Any function which returns null
> doesn't worry about actual returned Datum value, so that comparison isn't
> enough. Test case shows bug with ExecInterpExpr() but there several similar
> places (thanks Nikita Glukhov for help).
> Attached patch adds check of null flag.

Hm. I don't understand the problem here. Why do we need to reparent in
that case? What freed the relevant value?

Nor do I really understand why v10 wouldn't be affected if this actually
is a problem. The relevant code is also only guarded by
        DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))


> 
> Backtrace from v12 (note, newValue and oldValue are differ on current call,
> but oldValue points into pfreed memory) :
> #0  0x0000000000c8405a in GetMemoryChunkContext (pointer=0x80a808250) at
> ../../../../src/include/utils/memutils.h:130
> 130             AssertArg(MemoryContextIsValid(context));
> (gdb) bt
> #0  0x0000000000c8405a in GetMemoryChunkContext (pointer=0x80a808250) at
> ../../../../src/include/utils/memutils.h:130
> #1  0x0000000000c85ae5 in pfree (pointer=0x80a808250) at mcxt.c:1058
> #2  0x000000000080475e in ExecAggTransReparent (aggstate=0x80a806370,
> pertrans=0x80a87e830, newValue=34535940744, newValueIsNull=false,
> oldValue=34535932496, oldValueIsNull=false)
>     at execExprInterp.c:4209
> #3  0x00000000007ff51f in ExecInterpExpr (state=0x80a87f4d8,
> econtext=0x80a8065a8, isnull=0x7fffffffd7b7) at execExprInterp.c:1747
> #4  0x000000000082c12b in ExecEvalExprSwitchContext (state=0x80a87f4d8,
> econtext=0x80a8065a8, isNull=0x7fffffffd7b7) at
> ../../../src/include/executor/executor.h:308
> #5  0x000000000082bc0f in advance_aggregates (aggstate=0x80a806370) at nodeAgg.c:679
> #6  0x000000000082b8a6 in agg_retrieve_direct (aggstate=0x80a806370) at
> nodeAgg.c:1847
> #7  0x0000000000828782 in ExecAgg (pstate=0x80a806370) at nodeAgg.c:1572
> #8  0x000000000080e712 in ExecProcNode (node=0x80a806370) at
> ../../../src/include/executor/executor.h:240



> How to reproduce:
> http://sigaev.ru/misc/xdump.sql.bz2
> bzcat xdump.sql.bz2 | psql postgres && psql postgres < x.sql

It should be possible to create a smaller reproducer... It'd be good if
a bug fix for this were committed with a regression test.


> diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
> index 034970648f3..3b5333716d4 100644
> --- a/src/backend/executor/execExprInterp.c
> +++ b/src/backend/executor/execExprInterp.c
> @@ -1743,7 +1743,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
>               * expanded object that is already a child of the aggcontext,
>               * assume we can adopt that value without copying it.
>               */
> -            if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
> +            if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue) ||
> +                fcinfo->isnull != pergroup->transValueIsNull)
>                  newVal = ExecAggTransReparent(aggstate, pertrans,
>                                                newVal, fcinfo->isnull,
>                                                pergroup->transValue,

I'd really like to avoid adding additional branches to these paths.

Greetings,

Andres Freund



Re: aggregate crash

От
Alvaro Herrera
Дата:
On 2019-Dec-27, Teodor Sigaev wrote:

> Hi!
> 
> Found crash on production instance, assert-enabled build crashes in pfree()
> call, with default config. v11, v12 and head are affected, but, seems, you
> need to be a bit lucky.

Is this bug being considered for the next set of minors?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: aggregate crash

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Dec-27, Teodor Sigaev wrote:
>> Found crash on production instance, assert-enabled build crashes in pfree()
>> call, with default config. v11, v12 and head are affected, but, seems, you
>> need to be a bit lucky.

> Is this bug being considered for the next set of minors?

I think Andres last touched that code, so I was sort of expecting
him to have an opinion on this.  But I agree that not checking null-ness
explicitly is kind of unsafe.  We've never before had any expectation
that the Datum value of a null is anything in particular.

            regards, tom lane



Re: aggregate crash

От
Andres Freund
Дата:
Hi,

On 2020-01-14 17:01:01 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Dec-27, Teodor Sigaev wrote:
> >> Found crash on production instance, assert-enabled build crashes in pfree()
> >> call, with default config. v11, v12 and head are affected, but, seems, you
> >> need to be a bit lucky.
> 
> > Is this bug being considered for the next set of minors?
> 
> I think Andres last touched that code, so I was sort of expecting
> him to have an opinion on this.

Well, I commented a few days ago, also asking for further input...

To me it looks like that code has effectively been the same for quite a
while.  While today the code is:

            newVal = FunctionCallInvoke(fcinfo);

            /*
             * For pass-by-ref datatype, must copy the new value into
             * aggcontext and free the prior transValue.  But if transfn
             * returned a pointer to its first input, we don't need to do
             * anything.  Also, if transfn returned a pointer to a R/W
             * expanded object that is already a child of the aggcontext,
             * assume we can adopt that value without copying it.
             */
            if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
                newVal = ExecAggTransReparent(aggstate, pertrans,
                                              newVal, fcinfo->isnull,
                                              pergroup->transValue,
                                              pergroup->transValueIsNull);
...
ExecAggTransReparent(AggState *aggstate, AggStatePerTrans pertrans,
                     Datum newValue, bool newValueIsNull,
                     Datum oldValue, bool oldValueIsNull)
...
    if (!newValueIsNull)
    {
        MemoryContextSwitchTo(aggstate->curaggcontext->ecxt_per_tuple_memory);
        if (DatumIsReadWriteExpandedObject(newValue,
                                           false,
                                           pertrans->transtypeLen) &&
            MemoryContextGetParent(DatumGetEOHP(newValue)->eoh_context) == CurrentMemoryContext)
             /* do nothing */ ;
        else
            newValue = datumCopy(newValue,
                                 pertrans->transtypeByVal,
                                 pertrans->transtypeLen);
    }
    if (!oldValueIsNull)
    {
        if (DatumIsReadWriteExpandedObject(oldValue,
                                           false,
                                           pertrans->transtypeLen))
            DeleteExpandedObject(oldValue);
        else
            pfree(DatumGetPointer(oldValue));
    }

before it was (in v10):

    if (!pertrans->transtypeByVal &&
        DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
    {
        if (!fcinfo->isnull)
        {
            MemoryContextSwitchTo(aggstate->curaggcontext->ecxt_per_tuple_memory);
            if (DatumIsReadWriteExpandedObject(newVal,
                                               false,
                                               pertrans->transtypeLen) &&
                MemoryContextGetParent(DatumGetEOHP(newVal)->eoh_context) == CurrentMemoryContext)
                 /* do nothing */ ;
            else
                newVal = datumCopy(newVal,
                                   pertrans->transtypeByVal,
                                   pertrans->transtypeLen);
        }
        if (!pergroupstate->transValueIsNull)
        {
            if (DatumIsReadWriteExpandedObject(pergroupstate->transValue,
                                               false,
                                               pertrans->transtypeLen))
                DeleteExpandedObject(pergroupstate->transValue);
            else
                pfree(DatumGetPointer(pergroupstate->transValue));
        }
    }

there's no need in the current code to check !pertrans->transtypeByVal,
as byval has a separate expression opcode.  So I don't think things have
changed?

As far as I can tell, comparing the values by pointer goes back a *long*
while. We didn't use to handle expanded objects, but otherwise it looked
pretty similar back to 7.4 (oldest version I've checked out):

    newVal = FunctionCallInvoke(&fcinfo);

    /*
     * If pass-by-ref datatype, must copy the new value into aggcontext
     * and pfree the prior transValue.    But if transfn returned a pointer
     * to its first input, we don't need to do anything.
     */
    if (!peraggstate->transtypeByVal &&
    DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
    {
        if (!fcinfo.isnull)
        {
            MemoryContextSwitchTo(aggstate->aggcontext);
            newVal = datumCopy(newVal,
                               peraggstate->transtypeByVal,
                               peraggstate->transtypeLen);
        }
        if (!pergroupstate->transValueIsNull)
            pfree(DatumGetPointer(pergroupstate->transValue));
    }


> But I agree that not checking null-ness
> explicitly is kind of unsafe.  We've never before had any expectation
> that the Datum value of a null is anything in particular.

I'm still not sure I actually fully understand the bug. It's obvious how
returning the input value again could lead to memory not being freed (so
that leak seems to go all the way back). And similarly, since the
introduction of expanded objects, it can also lead to the expanded
object not being deleted.

But that's not the problem causing the crash here. What I think must
instead be the problem is that pergroupstate->transValueIsNull, but
pergroupstate->transValue is set to something looking like a
pointer. Which caused us not to datumCopy() a new transition value into
a long lived context. and then a later transition causes us to free the
short-lived value?

Greetings,

Andres Freund



Re: aggregate crash

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2020-01-14 17:01:01 -0500, Tom Lane wrote:
>> But I agree that not checking null-ness
>> explicitly is kind of unsafe.  We've never before had any expectation
>> that the Datum value of a null is anything in particular.

> I'm still not sure I actually fully understand the bug. It's obvious how
> returning the input value again could lead to memory not being freed (so
> that leak seems to go all the way back). And similarly, since the
> introduction of expanded objects, it can also lead to the expanded
> object not being deleted.
> But that's not the problem causing the crash here. What I think must
> instead be the problem is that pergroupstate->transValueIsNull, but
> pergroupstate->transValue is set to something looking like a
> pointer. Which caused us not to datumCopy() a new transition value into
> a long lived context. and then a later transition causes us to free the
> short-lived value?

Yeah, I was kind of wondering that too.  While formally the Datum value
for a null is undefined, I'm not aware offhand of any functions that
wouldn't return zero --- and this would have to be an aggregate transition
function doing so, which reduces the universe of candidates quite a lot.
Plus there's the question of how often a transition function would return
null for non-null input at all.

Could we see a test case that provokes this crash, even if it doesn't
do so reliably?

            regards, tom lane



Re: aggregate crash

От
Andres Freund
Дата:
Hi,

On 2020-01-14 17:54:16 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2020-01-14 17:01:01 -0500, Tom Lane wrote:
> >> But I agree that not checking null-ness
> >> explicitly is kind of unsafe.  We've never before had any expectation
> >> that the Datum value of a null is anything in particular.
> 
> > I'm still not sure I actually fully understand the bug. It's obvious how
> > returning the input value again could lead to memory not being freed (so
> > that leak seems to go all the way back). And similarly, since the
> > introduction of expanded objects, it can also lead to the expanded
> > object not being deleted.
> > But that's not the problem causing the crash here. What I think must
> > instead be the problem is that pergroupstate->transValueIsNull, but
> > pergroupstate->transValue is set to something looking like a
> > pointer. Which caused us not to datumCopy() a new transition value into
> > a long lived context. and then a later transition causes us to free the
> > short-lived value?
> 
> Yeah, I was kind of wondering that too.  While formally the Datum value
> for a null is undefined, I'm not aware offhand of any functions that
> wouldn't return zero --- and this would have to be an aggregate transition
> function doing so, which reduces the universe of candidates quite a lot.
> Plus there's the question of how often a transition function would return
> null for non-null input at all.
> 
> Could we see a test case that provokes this crash, even if it doesn't
> do so reliably?

There's a larger reproducer referenced in the first message. I had hoped
that Teodor could narrow it down - I guess I'll try to do that tomorrow...

Greetings,

Andres Freund



Re: aggregate crash

От
Andres Freund
Дата:
Hi,

On 2020-01-14 23:27:02 -0800, Andres Freund wrote:
> On 2020-01-14 17:54:16 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > I'm still not sure I actually fully understand the bug. It's obvious how
> > > returning the input value again could lead to memory not being freed (so
> > > that leak seems to go all the way back). And similarly, since the
> > > introduction of expanded objects, it can also lead to the expanded
> > > object not being deleted.
> > > But that's not the problem causing the crash here. What I think must
> > > instead be the problem is that pergroupstate->transValueIsNull, but
> > > pergroupstate->transValue is set to something looking like a
> > > pointer. Which caused us not to datumCopy() a new transition value into
> > > a long lived context. and then a later transition causes us to free the
> > > short-lived value?
> >
> > Yeah, I was kind of wondering that too.  While formally the Datum value
> > for a null is undefined, I'm not aware offhand of any functions that
> > wouldn't return zero --- and this would have to be an aggregate transition
> > function doing so, which reduces the universe of candidates quite a lot.
> > Plus there's the question of how often a transition function would return
> > null for non-null input at all.
> >
> > Could we see a test case that provokes this crash, even if it doesn't
> > do so reliably?
>
> There's a larger reproducer referenced in the first message. I had hoped
> that Teodor could narrow it down - I guess I'll try to do that tomorrow...

FWIW, I'm working on narrowing it down to something small. I can
reliably trigger the bug, and I understand the mechanics, I
think. Interestingly enough the reproducer currently only triggers on
v12, not on v11 and before.

As you say, this requires a transition function returning a NULL that
has the datum part set - the reproducer here defines a non-strict
aggregate transition function that can indirectly do so:

CREATE FUNCTION public.state_max_bytea(st bytea, inp bytea) RETURNS bytea
    LANGUAGE plpgsql
    AS $$
    BEGIN
         if st is null
         then
            return inp;
         elseif st<inp then
            return inp;
         else
            return st;
         end if;
    END;$$;

CREATE AGGREGATE public.max(bytea) (
    SFUNC = public.state_max_bytea,
    STYPE = bytea
);

I.e. when the current transition is null (e.g. for the first tuple), the
transition is always set to new input value. Even if that is null.

Then the question in turn is, how the input datum is != 0, but has
isnull set. And that's caused by:


        EEO_CASE(EEOP_FUNCEXPR_STRICT)
        {
            FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
            NullableDatum *args = fcinfo->args;
            int            argno;
            Datum        d;

            /* strict function, so check for NULL args */
            for (argno = 0; argno < op->d.func.nargs; argno++)
            {
                if (args[argno].isnull)
                {
                    *op->resnull = true;
                    goto strictfail;
                }
            }
            fcinfo->isnull = false;
            d = op->d.func.fn_addr(fcinfo);
            *op->resvalue = d;
            *op->resnull = fcinfo->isnull;

    strictfail:
            EEO_NEXT();
        }


I.e. if the transitions argument is a strict function, and that strict
function is not evaluated because of a NULL input, we set op->resnull =
true, but do *not* touch op->resvalue. If there was a previous row that
actually set resvalue to something meaningful, we get an input to the
transition function consisting out of the old resvalue (!= 0), but the
new resnull = true.  If the transition function returns that unchanged,
ExecAggTransReparent() doesn't do anything, because the new value is
null. Afterwards pergroup->transValue is set != 0, even though
transValueIsNull = true.

The somewhat tricky bit is arranging this to happen with pointers that
are the same. I think I'm on the way to narrow that down, but it'll take
me a bit longer.

To fix this I think we should set newVal = 0 in
ExecAggTransReparent()'s, as a new else to !newValueIsNull. That should
not add any additional branches, I think. I contrast to always doing so
when checking whether ExecAggTransReparent() ought to be called.

Greetings,

Andres Freund



Re: aggregate crash

От
Andres Freund
Дата:
Hi,

On 2020-01-15 12:47:47 -0800, Andres Freund wrote:
> FWIW, I'm working on narrowing it down to something small. I can
> reliably trigger the bug, and I understand the mechanics, I
> think. Interestingly enough the reproducer currently only triggers on
> v12, not on v11 and before.

That's just happenstance due to allocation changes in plpgsql,
though. The attached small reproducer, for me, reliably triggers crashes
on 10 - master.

It's hard to hit intentionally, because plpgsql does a datumCopy() to
its non-null return value, which means that to hit the bug, one needs
different numbers of allocations between setting up the transition value
with transvalueisnull = true, transvalue = 0xsomepointer (because
plpgsql doesn't copy NULLs), and the transition output with
transvalueisnull = false, transvalue = 0xsomepointer. Which is necessary
to trigger the bug, as it's then not reparented into a long lived enough
context. To be then freed/accessed for the next group input value.

I think this is too finnicky to actually keep as a regression test.

The bug, in a way, exists all the way back, but it's a bit harder to
create NULL values where the datum component isn't 0.


To fix I suggest we, in all branches, do the equivalent of adding
something like:
diff --git i/src/backend/executor/execExprInterp.c w/src/backend/executor/execExprInterp.c
index 790380051be..3260a63ac6b 100644
--- i/src/backend/executor/execExprInterp.c
+++ w/src/backend/executor/execExprInterp.c
@@ -4199,6 +4199,12 @@ ExecAggTransReparent(AggState *aggstate, AggStatePerTrans pertrans,
                                  pertrans->transtypeByVal,
                                  pertrans->transtypeLen);
     }
+    else
+    {
+        /* ensure datum component is 0 for NULL transition values */
+        newValue = (Datum) 0;
+    }
+
     if (!oldValueIsNull)
     {
         if (DatumIsReadWriteExpandedObject(oldValue,

and a comment explaining why it's (now) safe to rely on datum
comparisons for
    if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))


I don't think it makes sense to add something like it to the byval case
- there's plenty other ways a function returning != 0 with
fcinfo->isnull == true can cause such values to exist. And that's
longstanding.


A separate question is whether it's worth adding code to
e.g. EEO_CASE(EEOP_FUNCEXPR_STRICT) also resetting *op->resvalue to
(Datum) 0.  I don't personally don't think ensuring the datum is always
0 when isnull true is all that helpful, if we can't guarantee it
everywhere. So I'm a bit loathe to add cycles to places that don't need
it, and are hot.

Regards,

Andres

Вложения

Re: aggregate crash

От
Andres Freund
Дата:
Hi,

On 2020-01-15 19:16:30 -0800, Andres Freund wrote:
> The bug, in a way, exists all the way back, but it's a bit harder to
> create NULL values where the datum component isn't 0.

> To fix I suggest we, in all branches, do the equivalent of adding
> something like:
> diff --git i/src/backend/executor/execExprInterp.c w/src/backend/executor/execExprInterp.c
> index 790380051be..3260a63ac6b 100644
> --- i/src/backend/executor/execExprInterp.c
> +++ w/src/backend/executor/execExprInterp.c
> @@ -4199,6 +4199,12 @@ ExecAggTransReparent(AggState *aggstate, AggStatePerTrans pertrans,
>                                   pertrans->transtypeByVal,
>                                   pertrans->transtypeLen);
>      }
> +    else
> +    {
> +        /* ensure datum component is 0 for NULL transition values */
> +        newValue = (Datum) 0;
> +    }
> +
>      if (!oldValueIsNull)
>      {
>          if (DatumIsReadWriteExpandedObject(oldValue,
> 
> and a comment explaining why it's (now) safe to rely on datum
> comparisons for
>     if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))

Pushed something along those lines.


> A separate question is whether it's worth adding code to
> e.g. EEO_CASE(EEOP_FUNCEXPR_STRICT) also resetting *op->resvalue to
> (Datum) 0.  I don't personally don't think ensuring the datum is always
> 0 when isnull true is all that helpful, if we can't guarantee it
> everywhere. So I'm a bit loathe to add cycles to places that don't need
> it, and are hot.

I wonder if its worth adding a few valgrind annotations marking values
as undefined when null? Would make it easier to catch such cases in the
future.

Greetings,

Andres Freund