Обсуждение: strange valgrind reports about wrapper_handler on 64-bit arm
Hi,
while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a
couple reports like this:
==64550== Use of uninitialised value of size 8
==64550==    at 0xA62FE0: wrapper_handler (pqsignal.c:107)
==64550==    by 0x580BB9E7: ??? (in
/usr/libexec/valgrind/memcheck-arm64-linux)
==64550==  Uninitialised value was created by a stack allocation
==64550==    at 0x4F94660: strcoll_l (strcoll_l.c:258)
==64550==
{
   <insert_a_suppression_name_here>
   Memcheck:Value8
   fun:wrapper_handler
   obj:/usr/libexec/valgrind/memcheck-arm64-linux
}
**64550** Valgrind detected 1 error(s) during execution of "ANALYZE
mcv_lists;"
The exact command varies, I don't think it's necessarily about analyze
or extended stats.
The line the report refers to is this:
    (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
so I guess it can't be about postgres_signal_arg (as that's an int). But
that leaves just pqsignal_handlers, and why would that be uninitialized?
The closest thing I found in archives is [1] from about a year ago, but
we haven't found any clear explanation there either :-(
[1]
https://www.postgresql.org/message-id/f1a022e5-9bec-42c5-badd-cfc00b60515c@enterprisedb.com
regards
-- 
Tomas Vondra
			
		On Fri, Mar 07, 2025 at 12:03:47AM +0100, Tomas Vondra wrote:
> while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a
> couple reports like this:
> 
> ==64550== Use of uninitialised value of size 8
> ==64550==    at 0xA62FE0: wrapper_handler (pqsignal.c:107)
> ==64550==    by 0x580BB9E7: ??? (in
> /usr/libexec/valgrind/memcheck-arm64-linux)
> ==64550==  Uninitialised value was created by a stack allocation
> ==64550==    at 0x4F94660: strcoll_l (strcoll_l.c:258)
> ==64550==
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Value8
>    fun:wrapper_handler
>    obj:/usr/libexec/valgrind/memcheck-arm64-linux
> }
> **64550** Valgrind detected 1 error(s) during execution of "ANALYZE
> mcv_lists;"
> 
> The exact command varies, I don't think it's necessarily about analyze
> or extended stats.
> 
> The line the report refers to is this:
> 
>     (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
> 
> so I guess it can't be about postgres_signal_arg (as that's an int). But
> that leaves just pqsignal_handlers, and why would that be uninitialized?
> 
> The closest thing I found in archives is [1] from about a year ago, but
> we haven't found any clear explanation there either :-(
Hm.  The pointer to strcoll_l makes me wonder if there might be an issue in
one of the handler functions that wrapper_handler calls, and
wrapper_handler is getting the blame.  I don't see how pqsignal_handlers
could be uninitialized.  It's static, and we are careful to set it
appropriately before we call sigaction().
-- 
nathan
			
		Hi,
On 2025-03-07 00:03:47 +0100, Tomas Vondra wrote:
> while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a
> couple reports like this:
> 
> ==64550== Use of uninitialised value of size 8
> ==64550==    at 0xA62FE0: wrapper_handler (pqsignal.c:107)
> ==64550==    by 0x580BB9E7: ??? (in
> /usr/libexec/valgrind/memcheck-arm64-linux)
> ==64550==  Uninitialised value was created by a stack allocation
> ==64550==    at 0x4F94660: strcoll_l (strcoll_l.c:258)
> ==64550==
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Value8
>    fun:wrapper_handler
>    obj:/usr/libexec/valgrind/memcheck-arm64-linux
> }
> **64550** Valgrind detected 1 error(s) during execution of "ANALYZE
> mcv_lists;"
> The exact command varies, I don't think it's necessarily about analyze
> or extended stats.
Do you have a few other examples from where it was triggered?
Is the source of the uninitialized value always strcoll_l?
Can you reliably reproduce it in certain scenarios or is it probabilistic in
some form?
Do you know what signal was delivered (I think that could be detected using
valgrinds --vgdb)?
> The line the report refers to is this:
> 
>     (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
> 
> so I guess it can't be about postgres_signal_arg (as that's an int). But
> that leaves just pqsignal_handlers, and why would that be uninitialized?
Is it possible that the signal number we're getting called for is above
PG_NSIG? That'd explain why the source value is something fairly random?
ISTM that we should add an Assert() to wrapper_handler() that ensures that the
signal arg is below PG_NSIG.
Might also be worth trying to run without valgrind but with address and
undefined behaviour sanitizers enabled.  I don't currently have access to an
armv8 machine that's not busy doing other stuff...
Greetings,
Andres Freund
			
		On Fri, Mar 07, 2025 at 11:32:28AM -0500, Andres Freund wrote: > Is it possible that the signal number we're getting called for is above > PG_NSIG? That'd explain why the source value is something fairly random? > > ISTM that we should add an Assert() to wrapper_handler() that ensures that the > signal arg is below PG_NSIG. We have such an assertion in pqsignal() before we install wrapper_handler for anything. Is there another way it could be getting called with a different signo? -- nathan
Hi, On 2025-03-07 10:36:35 -0600, Nathan Bossart wrote: > On Fri, Mar 07, 2025 at 11:32:28AM -0500, Andres Freund wrote: > > Is it possible that the signal number we're getting called for is above > > PG_NSIG? That'd explain why the source value is something fairly random? > > > > ISTM that we should add an Assert() to wrapper_handler() that ensures that the > > signal arg is below PG_NSIG. > > We have such an assertion in pqsignal() before we install wrapper_handler > for anything. Is there another way it could be getting called with a > different signo? Who the hell knows :). One potential way would be that we got SIGNAL_ARGS wrong for the platform and are interpreting some random thing as the signal number. Or something went wrong in the windows signal emulation code. Or ... It seems cheap insurance to add it both places. Greetings, Andres Freund
On Fri, Mar 07, 2025 at 11:41:38AM -0500, Andres Freund wrote: > On 2025-03-07 10:36:35 -0600, Nathan Bossart wrote: >> On Fri, Mar 07, 2025 at 11:32:28AM -0500, Andres Freund wrote: >> > Is it possible that the signal number we're getting called for is above >> > PG_NSIG? That'd explain why the source value is something fairly random? >> > >> > ISTM that we should add an Assert() to wrapper_handler() that ensures that the >> > signal arg is below PG_NSIG. >> >> We have such an assertion in pqsignal() before we install wrapper_handler >> for anything. Is there another way it could be getting called with a >> different signo? > > Who the hell knows :). > > One potential way would be that we got SIGNAL_ARGS wrong for the platform and > are interpreting some random thing as the signal number. Or something went > wrong in the windows signal emulation code. Or ... > > It seems cheap insurance to add it both places. Good enough for me. I'll commit/back-patch to v17 the attached soon. -- nathan
Вложения
On Fri, Mar 07, 2025 at 10:52:10AM -0600, Nathan Bossart wrote: > Good enough for me. I'll commit/back-patch to v17 the attached soon. On second thought, since the signal number is a signed integer, I think we also ought to check that it's > 0. I'm running the attached patch through the CI tests to make sure that's correct for the common platforms. If that looks good, I'm planning to commit it. -- nathan
Вложения
On Fri, Mar 07, 2025 at 02:38:47PM -0600, Nathan Bossart wrote: > If that looks good, I'm planning to commit it. And committed. -- nathan
On 3/7/25 17:32, Andres Freund wrote:
> Hi,
> 
> On 2025-03-07 00:03:47 +0100, Tomas Vondra wrote:
>> while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a
>> couple reports like this:
>>
>> ==64550== Use of uninitialised value of size 8
>> ==64550==    at 0xA62FE0: wrapper_handler (pqsignal.c:107)
>> ==64550==    by 0x580BB9E7: ??? (in
>> /usr/libexec/valgrind/memcheck-arm64-linux)
>> ==64550==  Uninitialised value was created by a stack allocation
>> ==64550==    at 0x4F94660: strcoll_l (strcoll_l.c:258)
>> ==64550==
>> {
>>    <insert_a_suppression_name_here>
>>    Memcheck:Value8
>>    fun:wrapper_handler
>>    obj:/usr/libexec/valgrind/memcheck-arm64-linux
>> }
>> **64550** Valgrind detected 1 error(s) during execution of "ANALYZE
>> mcv_lists;"
> 
>> The exact command varies, I don't think it's necessarily about analyze
>> or extended stats.
> 
> Do you have a few other examples from where it was triggered?
> 
> Is the source of the uninitialized value always strcoll_l?
> 
I've seen a couple reports, but only a single one had info about source
of the allocation (and that was strcoll).
> Can you reliably reproduce it in certain scenarios or is it probabilistic in
> some form?
> 
I believe it's probabilistic, I certainly don't know how to trigger or
reproduce it.
> Do you know what signal was delivered (I think that could be detected using
> valgrinds --vgdb)?
> 
No idea.
> 
>> The line the report refers to is this:
>>
>>     (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
>>
>> so I guess it can't be about postgres_signal_arg (as that's an int). But
>> that leaves just pqsignal_handlers, and why would that be uninitialized?
> 
> Is it possible that the signal number we're getting called for is above
> PG_NSIG? That'd explain why the source value is something fairly random?
> 
No idea.
> ISTM that we should add an Assert() to wrapper_handler() that ensures that the
> signal arg is below PG_NSIG.
> 
No idea.
> 
> Might also be worth trying to run without valgrind but with address and
> undefined behaviour sanitizers enabled.  I don't currently have access to an
> armv8 machine that's not busy doing other stuff...
> 
I've restarted check-world with valgrind on my rpi5 machines, with
current master. I can try running other stuff once that finishes in a
couple hours.
regards
-- 
Tomas Vondra
			
		On 3/8/25 21:38, Tomas Vondra wrote: > > I've restarted check-world with valgrind on my rpi5 machines, with > current master. I can try running other stuff once that finishes in a > couple hours. > Shortly after restarting this I got three more reports - all of them are related to strcoll_l. This is on c472a18296e4, i.e. with the asserts added in this thread etc. But none of those seem to fail. regards -- Tomas Vondra
Вложения
On Sat, Mar 08, 2025 at 11:48:22PM +0100, Tomas Vondra wrote:
> Shortly after restarting this I got three more reports - all of them are
> related to strcoll_l. This is on c472a18296e4, i.e. with the asserts
> added in this thread etc. But none of those seem to fail.
> ==189168==    at 0xA683CC: wrapper_handler (pqsignal.c:90)
> ==189168==    at 0xA683F0: wrapper_handler (pqsignal.c:91)
> ==189168==    at 0xA684D4: wrapper_handler (pqsignal.c:110)
This appears to refer to the following lines:
    Assert(postgres_signal_arg > 0);
    Assert(postgres_signal_arg < PG_NSIG);
    (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
The common ingredient seems to be postgres_signal_arg.  I haven't found
anything else that seems helpful.
-- 
nathan
			
		On 3/9/25 03:16, Nathan Bossart wrote:
> On Sat, Mar 08, 2025 at 11:48:22PM +0100, Tomas Vondra wrote:
>> Shortly after restarting this I got three more reports - all of them are
>> related to strcoll_l. This is on c472a18296e4, i.e. with the asserts
>> added in this thread etc. But none of those seem to fail.
> 
>> ==189168==    at 0xA683CC: wrapper_handler (pqsignal.c:90)
>> ==189168==    at 0xA683F0: wrapper_handler (pqsignal.c:91)
>> ==189168==    at 0xA684D4: wrapper_handler (pqsignal.c:110)
> 
> This appears to refer to the following lines:
> 
>     Assert(postgres_signal_arg > 0);
>     Assert(postgres_signal_arg < PG_NSIG);
>     (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
> 
> The common ingredient seems to be postgres_signal_arg.  I haven't found
> anything else that seems helpful.
> 
Yeah, it's a bit weird. I got another report on the 64-bit rpi5 machine
(except for the command it's exactly the same), and then also this
report on the 32-bit machine:
---------------------------------------------------------------
==3482== Use of uninitialised value of size 4
==3482==    at 0x991498: wrapper_handler (pqsignal.c:113)
==3482==    by 0xFFFFFFFF: ???
==3482==  Uninitialised value was created by a heap allocation
==3482==    at 0x940EB8: palloc (mcxt.c:1341)
==3482==    by 0x980037: initStringInfoInternal (stringinfo.c:45)
==3482==    by 0x980103: initStringInfo (stringinfo.c:99)
==3482==    by 0x7194B7: ReadArrayStr (arrayfuncs.c:613)
==3482==    by 0x718803: array_in (arrayfuncs.c:289)
==3482==    by 0x90024F: InputFunctionCallSafe (fmgr.c:1607)
==3482==    by 0x2E346B: CopyFromTextLikeOneRow (copyfromparse.c:1029)
==3482==    by 0x2E304B: CopyFromTextOneRow (copyfromparse.c:919)
==3482==    by 0x2E2EEF: NextCopyFrom (copyfromparse.c:890)
==3482==    by 0x2DF5C7: CopyFrom (copyfrom.c:1149)
==3482==    by 0x2DAE33: DoCopy (copy.c:306)
==3482==    by 0x6CC38F: standard_ProcessUtility (utility.c:738)
==3482==    by 0x4B654C3: pgss_ProcessUtility (pg_stat_statements.c:1179)
==3482==    by 0x6CBBF3: ProcessUtility (utility.c:519)
==3482==    by 0x6CA24B: PortalRunUtility (pquery.c:1184)
==3482==    by 0x6CA537: PortalRunMulti (pquery.c:1348)
==3482==    by 0x6C9837: PortalRun (pquery.c:819)
==3482==    by 0x6C1BEB: exec_simple_query (postgres.c:1272)
==3482==    by 0x6C74FF: PostgresMain (postgres.c:4693)
==3482==    by 0x6BD297: BackendMain (backend_startup.c:107)
==3482==
{
   <insert_a_suppression_name_here>
   Memcheck:Value4
   fun:wrapper_handler
   obj:*
}
**3482** Valgrind detected 1 error(s) during execution of "COPY
array_op_test FROM
'/home/debian/postgres/src/test/regress/data/array.data';"
---------------------------------------------------------------
This all seems very strange ... I'm starting to wonder if maybe this is
a valgrind issue. Both machines have valgrind 3.19, I'll try with a
custom build of 3.24 (the latest release).
regards
-- 
Tomas Vondra