Discussion: fairywren failures
My new msys2 animal fairywren has had 3 recent failures when checking pg_upgrade. The failures have been while running the regression tests, specifically the interval test, and they all look like this:

2019-10-03 05:36:00.373 UTC [24272:43] LOG: server process (PID 23756) was terminated by exception 0xC0000028
2019-10-03 05:36:00.373 UTC [24272:44] DETAIL: Failed process was running: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');

That error is "bad stack".

The failures have been on REL_12_STABLE (twice) and master (once). However, they are not consistent (REL_12_STABLE is currently green).

The interval test itself hasn't changed for more than 2 years, and I haven't found any obvious recent change that might cause the problem. I guess it could be a compiler bug ... this is gcc 9.2.0, which is the current release.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
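For readers decoding the log line above: the exception value is a Windows NTSTATUS code, and 0xC0000028 corresponds to STATUS_BAD_STACK in the Windows SDK header ntstatus.h. The following is only a minimal lookup sketch; the helper name `exception_name` and the choice of neighbouring codes are illustrative assumptions and not part of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the lookup table: an NTSTATUS value and its symbolic name. */
typedef struct
{
    uint32_t    code;
    const char *name;
} StatusName;

/* A few NTSTATUS values as named in the Windows SDK header ntstatus.h. */
static const StatusName known_statuses[] = {
    {0xC0000005u, "STATUS_ACCESS_VIOLATION"},
    {0xC0000028u, "STATUS_BAD_STACK"},
    {0xC0000029u, "STATUS_INVALID_UNWIND_TARGET"},
    {0xC00000FDu, "STATUS_STACK_OVERFLOW"},
};

/*
 * Hypothetical helper: map an exception code from the server log to its
 * symbolic name, or "unknown" if the code is not in the small table above.
 */
const char *
exception_name(uint32_t code)
{
    size_t      n = sizeof(known_statuses) / sizeof(known_statuses[0]);

    assert(code != 0);          /* 0 is STATUS_SUCCESS, not an exception */

    for (size_t i = 0; i < n; i++)
    {
        if (known_statuses[i].code == code)
            return known_statuses[i].name;
    }
    return "unknown";
}
```

Calling `exception_name(0xC0000028u)` yields the name matching the "bad stack" reading given in the message.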
Hi,

On 2019-10-03 10:21:13 -0400, Andrew Dunstan wrote:
> My new msys2 animal fairywren has had 3 recent failures when checking
> pg_upgrade. The failures have been while running the regression tests,
> specifically the interval test, and they all look like this:
>
> 2019-10-03 05:36:00.373 UTC [24272:43] LOG: server process (PID 23756) was terminated by exception 0xC0000028
> 2019-10-03 05:36:00.373 UTC [24272:44] DETAIL: Failed process was running: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
>
> That error is "bad stack"
>
> The failures have been on REL_12_STABLE (twice) and master (once).
> However, they are not consistent (REL_12_STABLE is currently green).
>
> The interval test itself hasn't changed for more than 2 years, and I
> haven't found any obvious recent change that might cause the problem. I
> guess it could be a compiler bug ... this is gcc 9.2.0, which is the
> current release.

This is around where an error is thrown:

-- badly formatted interval
INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
-ERROR: invalid input syntax for type interval: "badly formatted interval"
-LINE 1: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted inter...
- ^

and the error is stack related. So I suspect that setjmp/longjmp might be to blame here, and somehow don't save/restore the stack into a proper state. I don't know enough about mingw/msys/windows to know whether that uses a self-written setjmp or relies on the MS implementation.

If you could gather a backtrace it might help us. It's possible that the stack is "just" misaligned or something, we had problems with that before (IIRC valgrind didn't always align stacks correctly for processes that forked from within a signal handler, which then crashed when using instructions with alignment requirements, but only sometimes, because the stack could be aligned).

Greetings,

Andres Freund
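To make the suspicion above concrete, here is a minimal sketch of the general pattern in which an error deep in a call chain is reported by jumping back to a saved recovery point. It uses the standard setjmp/longjmp pair and is not PostgreSQL's actual error-handling code; `recovery_point`, `raise_error`, `parse_interval`, `try_parse_interval` and the error code 22 are all hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

#define ERR_INVALID_SYNTAX 22   /* hypothetical error code */

/* Hypothetical stand-ins for a server's "current recovery point" state. */
static jmp_buf *recovery_point = NULL;
static int      last_error = 0;

/* Report an error by jumping back to the active recovery point. */
static void
raise_error(int code)
{
    assert(recovery_point != NULL);
    last_error = code;
    longjmp(*recovery_point, 1);    /* unwinds the stack; never returns */
}

/* Pretend parser: any input starting with 'b' counts as badly formatted. */
static void
parse_interval(const char *text)
{
    if (text[0] == 'b')
        raise_error(ERR_INVALID_SYNTAX);
}

/* Returns 0 on success, or the error code recorded by raise_error(). */
int
try_parse_interval(const char *text)
{
    jmp_buf     local_buf;
    jmp_buf    *saved = recovery_point;

    if (setjmp(local_buf) != 0)
    {
        /* Arrived here via longjmp: restore the outer recovery point. */
        recovery_point = saved;
        return last_error;
    }

    recovery_point = &local_buf;
    parse_interval(text);
    recovery_point = saved;
    return 0;
}
```

If the saved stack state were restored incorrectly, the failure would surface in the branch reached via longjmp, which matches the point in the interval test where the crash is seen.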
Hi,

On 2019-10-03 08:18:42 -0700, Andres Freund wrote:
> On 2019-10-03 10:21:13 -0400, Andrew Dunstan wrote:
> > My new msys2 animal fairywren has had 3 recent failures when checking
> > pg_upgrade. The failures have been while running the regression tests,
> > specifically the interval test, and they all look like this:
> >
> > 2019-10-03 05:36:00.373 UTC [24272:43] LOG: server process (PID 23756) was terminated by exception 0xC0000028
> > 2019-10-03 05:36:00.373 UTC [24272:44] DETAIL: Failed process was running: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
> >
> > That error is "bad stack"
> >
> > The failures have been on REL_12_STABLE (twice) and master (once).
> > However, they are not consistent (REL_12_STABLE is currently green).
> >
> > The interval test itself hasn't changed for more than 2 years, and I
> > haven't found any obvious recent change that might cause the problem. I
> > guess it could be a compiler bug ... this is gcc 9.2.0, which is the
> > current release.
>
> This is around where an error is thrown:
> -- badly formatted interval
> INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
> -ERROR: invalid input syntax for type interval: "badly formatted interval"
> -LINE 1: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted inter...
> - ^
>
> and the error is stack related. So I suspect that setjmp/longjmp might
> be to blame here, and somehow don't save/restore the stack into a proper
> state. I don't know enough about mingw/msys/windows to know whether that
> uses a self-written setjmp or relies on the MS implementation.
>
> If you could gather a backtrace it might help us. It's possible that the
> stack is "just" misaligned or something, we had problems with that
> before (IIRC valgrind didn't always align stacks correctly for processes
> that forked from within a signal handler, which then crashed when using
> instructions with alignment requirements, but only sometimes, because
> the stack could be aligned).

It seems we're not the only ones hitting this:
https://rt.perl.org/Public/Bug/Display.html?id=133603

Doesn't look like they've really narrowed it down that much yet.

- Andres
Hi,

On 2019-10-03 08:23:49 -0700, Andres Freund wrote:
> On 2019-10-03 08:18:42 -0700, Andres Freund wrote:
> > This is around where an error is thrown:
> > -- badly formatted interval
> > INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
> > -ERROR: invalid input syntax for type interval: "badly formatted interval"
> > -LINE 1: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted inter...
> > - ^
> >
> > and the error is stack related. So I suspect that setjmp/longjmp might
> > be to blame here, and somehow don't save/restore the stack into a proper
> > state. I don't know enough about mingw/msys/windows to know whether that
> > uses a self-written setjmp or relies on the MS implementation.
> >
> > If you could gather a backtrace it might help us. It's possible that the
> > stack is "just" misaligned or something, we had problems with that
> > before (IIRC valgrind didn't always align stacks correctly for processes
> > that forked from within a signal handler, which then crashed when using
> > instructions with alignment requirements, but only sometimes, because
> > the stack could be aligned).
>
> It seems we're not the only ones hitting this:
> https://rt.perl.org/Public/Bug/Display.html?id=133603
>
> Doesn't look like they've really narrowed it down that much yet.

A few notes:

* As an experiment, it could be worthwhile to try to redefine sigsetjmp/longjmp/sigjmp_buf with what https://gcc.gnu.org/onlinedocs/gcc/Nonlocal-Gotos.html provides, it's apparently a separate implementation from the MS CRT one.

* Arguably "Do not use longjmp to transfer control from a callback routine invoked directly or indirectly by Windows code." and "Do not use longjmp to transfer control out of an interrupt-handling routine unless the interrupt is caused by a floating-point exception. In this case, a program may return from an interrupt handler via longjmp if it first reinitializes the floating-point math package by calling _fpreset." from https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/longjmp?view=vs-2019 might be violated by our signal emulation on Windows. But I've not looked into that in detail.

* Any chance you could get the pre-processed source for postgres.c or such? I'm kinda wondering if the definition of setjmp() that we get includes the returns_twice attribute that gcc wants to see, and whether we're picking up the mingw version of longjmp, or the Windows one. https://sourceforge.net/p/mingw-w64/mingw-w64/ci/844cb490ab2cc32ac3df5914700564b2e40739d8/tree/mingw-w64-headers/crt/setjmp.h#l31

* It's certainly curious that the failures so far only have happened as part of pg_upgradeCheck, rather than the plain regression tests.

Greetings,

Andres Freund
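The first note suggests trying the compiler's own nonlocal-jump builtins instead of the C runtime's setjmp/longjmp. A rough sketch of what such an experiment might look like follows. It assumes GCC (or a compatible compiler) and uses the documented `__builtin_setjmp` and `__builtin_longjmp`; the `demo_*` names are hypothetical and this is not a patch from the thread.

```c
#include <assert.h>

/*
 * GCC's builtin jump buffer is five pointer-sized words, per the
 * "Nonlocal Gotos" documentation. The demo_* names are hypothetical.
 */
typedef void *demo_jmp_buf[5];

#define demo_setjmp(buf)    __builtin_setjmp(buf)
/* The second argument of __builtin_longjmp must be the constant 1. */
#define demo_longjmp(buf)   __builtin_longjmp((buf), 1)

static demo_jmp_buf recovery_point;

/*
 * __builtin_longjmp may not be called from the function that called
 * __builtin_setjmp, so the jump lives in a separate, non-inlined function.
 */
static void __attribute__((noinline))
raise_error(void)
{
    demo_longjmp(recovery_point);
}

/* Returns 1 if control came back through the builtin jump, 0 otherwise. */
int
run_with_builtin_jump(void)
{
    if (demo_setjmp(recovery_point) == 0)
    {
        raise_error();
        assert(0 && "not reached");
        return 0;
    }
    return 1;
}
```

The builtins save less state than the library functions (they do not touch the signal mask, for instance), so whether they are a drop-in substitute for sigsetjmp in a given code base is something the experiment itself would have to establish.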
Andres Freund <andres@anarazel.de> writes:
> * It's certainly curious that the failures so far only have happened as
> part of pg_upgradeCheck, rather than the plain regression tests.

Isn't it though. We spent a long time wondering why we saw parallel plan instability mostly in pg_upgradeCheck, too [1]. We eventually decided that the cause of that instability was chance timing collisions with bgwriter/checkpointer, but nobody ever really explained why pg_upgradeCheck should be more prone to hit those windows than the plain tests are. I feel like there's something still to be understood there.

Whether this is related, who's to say. But given your thought about stack alignment, I'm half thinking that the crash is seen when we get a signal (e.g. SIGUSR1 from sinval processing) at the wrong time, allowing the stack to become unaligned, and that the still-unexplained timing difference in pg_upgradeCheck accounts for that test being more prone to show it.

regards, tom lane

[1] https://www.postgresql.org/message-id/20190605050037.GA33985@rfd.leadboat.com
On 10/3/19 4:13 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> * It's certainly curious that the failures so far only have happened as
>> part of pg_upgradeCheck, rather than the plain regression tests.
> Isn't it though. We spent a long time wondering why we saw parallel
> plan instability mostly in pg_upgradeCheck, too [1]. We eventually
> decided that the cause of that instability was chance timing collisions
> with bgwriter/checkpointer, but nobody ever really explained why
> pg_upgradeCheck should be more prone to hit those windows than the plain
> tests are. I feel like there's something still to be understood there.
>
> Whether this is related, who's to say. But given your thought about
> stack alignment, I'm half thinking that the crash is seen when we get a
> signal (e.g. SIGUSR1 from sinval processing) at the wrong time, allowing
> the stack to become unaligned, and that the still-unexplained timing
> difference in pg_upgradeCheck accounts for that test being more prone to
> show it.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/20190605050037.GA33985@rfd.leadboat.com

Yes, that's very puzzling. But what do we actually do differently in the pg_upgrade checks that might account for it? Nothing that is at all obvious to me.

Another data point: the new Visual Studio 2019 instance drongo running on the same machine is not exhibiting these problems. Yes, it's not running test.sh, but vcregress.pl does pretty much the same thing. So that does seem to point to the toolset.

I'll see if I can get the same toolset jacana is using installed and try that.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
On 2019-10-03 16:21, Andrew Dunstan wrote:
> My new msys2 animal fairywren

Could you please check how this animal is labeled? AFAICT, this is not an msys2 build but a mingw build (x86_64-w64-mingw32).

> has had 3 recent failures when checking
> pg_upgrade. The failures have been while running the regression tests,
> specifically the interval test, and they all look like this:

I've also seen this randomly, but only under 64-bit mingw, never 32-bit mingw.

--
Peter Eisentraut http://www.2ndQuadrant.com/
On Sat, Oct 12, 2019 at 3:56 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2019-10-03 16:21, Andrew Dunstan wrote:
> > My new msys2 animal fairywren
>
> Could you please check how this animal is labeled? AFAICT, this is not
> an msys2 build but a mingw build (x86_64-w64-mingw32).

It is indeed an msys2 system. However, when we set MSYSTEM=MINGW64, as we do in fairywren's config environment so that the compiler is properly detected by configure (using Msys2's /etc/config.site), 'uname -a' reports MINGW64... instead of MSYS... This is a bit confusing.

The compiler currently being used on the animal is the gcc 7.3.0 from the Mingw64 project, the same one that's being used on jacana (which runs Msys1). Notwithstanding the "mingw32" in the compiler name, these are 64 bit builds. I think the "32" there is somewhat vestigial.

> > has had 3 recent failures when checking
> > pg_upgrade. The failures have been while running the regression tests,
> > specifically the interval test, and they all look like this:
>
> I've also seen this randomly, but only under 64-bit mingw, never 32-bit
> mingw.

Since I downgraded the compiler from gcc 9.0 about a week ago these errors seem to have stopped.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
On 2019-10-16 13:34, Andrew Dunstan wrote:
>> Could you please check how this animal is labeled? AFAICT, this is not
>> an msys2 build but a mingw build (x86_64-w64-mingw32).
>
> It is indeed an msys2 system. However, when we set MSYSTEM=MINGW64 as
> we do in fairywren's config environment so that the compiler is
> properly detected by configure (using Msys2's /etc/config.site)
> 'uname -a' reports MINGW64... instead of MSYS...

When you install MSYS2 from msys2.org, you get three possible build targets, depending on what you set MSYSTEM to:

MSYSTEM=MINGW32
MSYSTEM=MINGW64
MSYSTEM=MSYS

When a buildfarm member identifies itself as "msys ... 2", then I would expect the third variant, but that's not what it's doing. A MSYSTEM=MSYS build is similar to a Cygwin build (since MSYS2 is a fork of Cygwin), which is also a valid thing to do, but it's obviously quite different from a mingw build.

--
Peter Eisentraut http://www.2ndQuadrant.com/
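One way to see which of these targets a given compiler is actually building for is to look at its predefined macros. The sketch below is only illustrative: `build_target_name` and `targets_native_windows` are hypothetical helpers, and the assumption that the MSYSTEM=MSYS compiler defines `__MSYS__` alongside `__CYGWIN__` should be checked against the toolchain in question (for example with `gcc -dM -E - < /dev/null`).

```c
#include <stddef.h>

/*
 * Hypothetical helper: name the build target based on predefined macros.
 * A 64-bit MinGW-w64 compiler defines both __MINGW32__ and __MINGW64__,
 * so the 64-bit check has to come first.
 */
const char *
build_target_name(void)
{
#if defined(__MSYS__)
    return "msys";              /* assumed macro for MSYSTEM=MSYS builds */
#elif defined(__CYGWIN__)
    return "cygwin";
#elif defined(__MINGW64__)
    return "mingw64";
#elif defined(__MINGW32__)
    return "mingw32";
#elif defined(_WIN32)
    return "windows-other";
#else
    return "not-windows";
#endif
}

/* Nonzero for a native Windows target (such as MinGW), zero otherwise. */
int
targets_native_windows(void)
{
#if defined(_WIN32) && !defined(__CYGWIN__)
    return 1;
#else
    return 0;
#endif
}
```

The fact that `__MINGW32__` is defined even for 64-bit MinGW-w64 builds is in the same spirit as the "32" in x86_64-w64-mingw32: it does not indicate a 32-bit build.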
On 10/25/19 3:09 PM, Peter Eisentraut wrote:
> On 2019-10-16 13:34, Andrew Dunstan wrote:
>>> Could you please check how this animal is labeled? AFAICT, this is not
>>> an msys2 build but a mingw build (x86_64-w64-mingw32).
>> It is indeed an msys2 system. However, when we set MSYSTEM=MINGW64 as
>> we do in fairywren's config environment so that the compiler is
>> properly detected by configure (using Msys2's /etc/config.site)
>> 'uname -a' reports MINGW64... instead of MSYS...
> When you install MSYS2 from msys2.org, you get three possible build
> targets, depending on what you set MSYSTEM to:
>
> MSYSTEM=MINGW32
> MSYSTEM=MINGW64
> MSYSTEM=MSYS
>
> When a buildfarm member identifies itself as "msys ... 2", then I would
> expect the third variant, but that's not what it's doing. A
> MSYSTEM=MSYS build is similar to a Cygwin build (since MSYS2 is a fork
> of Cygwin), which is also a valid thing to do, but it's obviously quite
> different from a mingw build.

If it helps you I can change the compiler name in the animal metainfo to mingw64-gcc. Msys2 is the build environment, but not the target, which is native Windows.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com