windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From Andres Freund
Subject windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Msg-id 20230208012852.bvkn2am4h4iqjogq@awork3.anarazel.de
Responses Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
List pgsql-hackers
Hi,

A recent cfbot run caused CI on windows to crash - on a patch that could not
conceivably cause this issue:
  https://cirrus-ci.com/task/5646021133336576
the patch is just:
  https://github.com/postgresql-cfbot/postgresql/commit/dbd4afa6e7583c036b86abe2e3d27b508d335c2b

regression.diffs:
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/regression.diffs
postmaster.log:
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/log/postmaster.log
crash info:
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/crashlog/crashlog-postgres.exe_1af0_2023-02-08_00-53-23-997.txt

00000085`f03ffa40 00007ff6`fd89faa8     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
00000085`f03ffa80 00007ff6`fd6474dc     postgres!ExceptionalCondition(
            char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED",
            char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c",
            int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
00000085`f03ffac0 00007ff6`fd676eff     postgres!MarkPostmasterChildActive(void)+0x7c [c:\cirrus\src\backend\storage\ipc\pmsignal.c @ 329]
00000085`f03ffb00 00007ff6`fd59aa3a     postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @ 375]
00000085`f03ffb60 00007ff6`fd467689     postgres!SubPostmasterMain(
            int argc = 0n3,
            char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962]
00000085`f03ffd90 00007ff6`fda0e1c9     postgres!main(
            int argc = 0n3,
            char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192]

So, somehow we ended up with a pmsignal slot for a new backend that's not
currently in PM_CHILD_ASSIGNED state.


Obviously the first idea is to wonder whether this is a problem introduced as
part of the recent postmaster-latchification work.


At first I thought we were failing to terminate running processes, due to the
following output:

parallel group (20 tests):  name char txid text varchar enum float8 regproc int2 boolean bit oid pg_lsn int8 int4 float4 uuid rangetypes numeric money
 
     boolean                      ... ok          684 ms
     char                         ... ok          517 ms
     name                         ... ok          354 ms
     varchar                      ... ok          604 ms
     text                         ... ok          603 ms
     int2                         ... ok          676 ms
     int4                         ... ok          818 ms
     int8                         ... ok          779 ms
     oid                          ... ok          720 ms
     float4                       ... ok          823 ms
     float8                       ... ok          628 ms
     bit                          ... ok          666 ms
     numeric                      ... ok         1132 ms
     txid                         ... ok          497 ms
     uuid                         ... ok          818 ms
     enum                         ... ok          619 ms
     money                        ... FAILED (test process exited with exit code 2)     7337 ms
     rangetypes                   ... ok          813 ms
     pg_lsn                       ... ok          762 ms
     regproc                      ... ok          632 ms


But now I realize that the reason none of the other tests failed is that the
crash took a long time, presumably because the debugger was generating the
crash information above.


2023-02-08 00:53:20.257 GMT client backend[4584] pg_regress/rangetypes STATEMENT:  select '-[a,z)'::textrange;
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 5948
 
[ quite a few lines ]
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  server process (PID 5948) was terminated by exception 0xC0000354
2023-02-08 00:53:27.420 GMT postmaster[872] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  terminating any other active server processes
2023-02-08 00:53:27.434 GMT postmaster[872] LOG:  all server processes terminated; reinitializing
2023-02-08 00:53:27.459 GMT startup[5800] LOG:  database system was interrupted; last known up at 2023-02-08 00:53:19 GMT
2023-02-08 00:53:27.459 GMT startup[5800] LOG:  database system was not properly shut down; automatic recovery in progress
2023-02-08 00:53:27.462 GMT startup[5800] LOG:  redo starts at 0/20DCF08
2023-02-08 00:53:27.484 GMT startup[5800] LOG:  could not stat file "pg_tblspc/16502": No such file or directory
2023-02-08 00:53:27.484 GMT startup[5800] CONTEXT:  WAL redo at 0/20DCFB8 for Tablespace/DROP: 16502
2023-02-08 00:53:27.614 GMT startup[5800] LOG:  invalid record length at 0/25353E8: wanted 24, got 0
2023-02-08 00:53:27.614 GMT startup[5800] LOG:  redo done at 0/2534FE0 system usage: CPU: user: 0.04 s, system: 0.04 s, elapsed: 0.15 s
 


Nevertheless, clearly this should never be reached.

Greetings,

Andres Freund


