Thread: fairywren failures

fairywren failures

From
Andrew Dunstan
Date:
My new msys2 animal fairywren has had 3 recent failures when checking
pg_upgrade. The failures have been while running the regression tests,
specifically the interval test, and they all look like this:


2019-10-03 05:36:00.373 UTC [24272:43] LOG:  server process (PID 23756) was terminated by exception 0xC0000028
2019-10-03 05:36:00.373 UTC [24272:44] DETAIL:  Failed process was running: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');


That error is "bad stack"
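
For reference, 0xC0000028 is the NTSTATUS value that Windows names STATUS_BAD_STACK. Below is a minimal sketch of a lookup helper for decoding such codes from a server log. The function names ntstatus_name and is_stack_related are hypothetical (they are not part of PostgreSQL), and only a few well-known codes are listed.

```c
#include <stdint.h>

/* Hypothetical helper: map a few well-known NTSTATUS codes to their names. */
const char *
ntstatus_name(uint32_t code)
{
    switch (code)
    {
        case 0xC0000005u:
            return "STATUS_ACCESS_VIOLATION";
        case 0xC0000028u:
            return "STATUS_BAD_STACK";
        case 0xC00000FDu:
            return "STATUS_STACK_OVERFLOW";
        default:
            return "unknown";
    }
}

/* Hypothetical helper: nonzero for the listed codes that concern the stack. */
int
is_stack_related(uint32_t code)
{
    return code == 0xC0000028u || code == 0xC00000FDu;
}
```

Calling ntstatus_name(0xC0000028u) returns the string "STATUS_BAD_STACK", which matches the "bad stack" reading of the log line above.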

The failures have been on REL_12_STABLE (twice) and master (once).
However, they are not consistent (REL_12_STABLE is currently green).


The interval test itself hasn't changed for more than 2 years, and I
haven't found any obvious recent change that might cause the problem. I
guess it could be a compiler bug ... this is gcc 9.2.0, which is the
current release.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: fairywren failures

From
Andres Freund
Date:
Hi,

On 2019-10-03 10:21:13 -0400, Andrew Dunstan wrote:
> My new msys2 animal fairywren has had 3 recent failures when checking
> pg_upgrade. The failures have been while running the regression tests,
> specifically the interval test, and they all look like this:
> 
> 
> 2019-10-03 05:36:00.373 UTC [24272:43] LOG:  server process (PID 23756) was terminated by exception 0xC0000028
> 2019-10-03 05:36:00.373 UTC [24272:44] DETAIL:  Failed process was running: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
> 
> 
> That error is "bad stack"

> The failures have been on REL_12_STABLE (twice) and master (once).
> However, they are not consistent (REL_12_STABLE is currently green).
> 
> 
> The interval test itself hasn't changed for more than 2 years, and I
> haven't found any obvious recent change that might cause the problem. I
> guess it could be a compiler bug ... this is gcc 9.2.0, which is the
> current release.

This is around where an error is thrown:
 -- badly formatted interval
 INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
-ERROR:  invalid input syntax for type interval: "badly formatted interval"
-LINE 1: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted inter...
-                                              ^

and the error is stack related. So I suspect that setjmp/longjmp might
be to blame here, and somehow don't save/restore the stack into a proper
state. I don't know enough about mingw/msys/windows to know whether that
uses a self-written setjmp or relies on the MS implementation.

If you could gather a backtrace it might help us. It's possible that the
stack is "just" misaligned or something, we had problems with that
before (IIRC valgrind didn't always align stacks correctly for processes
that forked from within a signal handler, which then crashed when using
instructions with alignment requirements, but only sometimes, because
the stack could be aligned).
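
To make the suspected mechanism concrete, here is a minimal sketch of setjmp/longjmp-based error recovery, loosely in the spirit of how a backend abandons a failed statement. It is not PostgreSQL's actual code: current_handler, raise_error, parse_interval and try_parse_interval are hypothetical names, and the toy parser accepts only "1 day".

```c
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-process "current error handler" pointer. */
static jmp_buf *current_handler = NULL;

/* Raise an error: unwind to the most recently installed handler. */
static void
raise_error(void)
{
    if (current_handler != NULL)
        longjmp(*current_handler, 1);
}

/* Toy parser: accepts only "1 day" and raises an error for anything else. */
static void
parse_interval(const char *text)
{
    if (strcmp(text, "1 day") != 0)
        raise_error();
}

/* Returns 0 on success, -1 if parse_interval raised an error. */
int
try_parse_interval(const char *text)
{
    jmp_buf     local_env;
    jmp_buf    *saved = current_handler;    /* not modified after setjmp */
    int         result;

    if (setjmp(local_env) == 0)
    {
        current_handler = &local_env;
        parse_interval(text);
        result = 0;
    }
    else
    {
        /* Reached via longjmp: the deeper stack frames have been discarded. */
        result = -1;
    }
    current_handler = saved;
    return result;
}
```

With this sketch, try_parse_interval("badly formatted interval") takes the longjmp path. If the runtime's longjmp restored the stack incorrectly, that is the point where a crash like the one reported would be expected to surface.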

Greetings,

Andres Freund



Re: fairywren failures

From
Andres Freund
Date:
Hi,

On 2019-10-03 08:18:42 -0700, Andres Freund wrote:
> On 2019-10-03 10:21:13 -0400, Andrew Dunstan wrote:
> > My new msys2 animal fairywren has had 3 recent failures when checking
> > pg_upgrade. The failures have been while running the regression tests,
> > specifically the interval test, and they all look like this:
> > 
> > 
> > 2019-10-03 05:36:00.373 UTC [24272:43] LOG:  server process (PID 23756) was terminated by exception 0xC0000028
> > 2019-10-03 05:36:00.373 UTC [24272:44] DETAIL:  Failed process was running: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
> > 
> > 
> > That error is "bad stack"
> 
> > The failures have been on REL_12_STABLE (twice) and master (once).
> > However, they are not consistent (REL_12_STABLE is currently green).
> > 
> > 
> > The interval test itself hasn't changed for more than 2 years, and I
> > haven't found any obvious recent change that might cause the problem. I
> > guess it could be a compiler bug ... this is gcc 9.2.0, which is the
> > current release.
> 
> This is around where an error is thrown:
>  -- badly formatted interval
>  INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
> -ERROR:  invalid input syntax for type interval: "badly formatted interval"
> -LINE 1: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted inter...
> -                                              ^
> 
> and the error is stack related. So I suspect that setjmp/longjmp might
> be to blame here, and somehow don't save/restore the stack into a proper
> state. I don't know enough about mingw/msys/windows to know whether that
> uses a self-written setjmp or relies on the MS implementation.
> 
> If you could gather a backtrace it might help us. It's possible that the
> stack is "just" misaligned or something, we had problems with that
> before (IIRC valgrind didn't always align stacks correctly for processes
> that forked from within a signal handler, which then crashed when using
> instructions with alignment requirements, but only sometimes, because
> the stack could be aligned).

It seems we're not the only ones hitting this:
https://rt.perl.org/Public/Bug/Display.html?id=133603

Doesn't look like they've really narrowed it down that much yet.

- Andres



Re: fairywren failures

From
Andres Freund
Date:
Hi,

On 2019-10-03 08:23:49 -0700, Andres Freund wrote:
> On 2019-10-03 08:18:42 -0700, Andres Freund wrote:
> > This is around where an error is thrown:
> >  -- badly formatted interval
> >  INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted interval');
> > -ERROR:  invalid input syntax for type interval: "badly formatted interval"
> > -LINE 1: INSERT INTO INTERVAL_TBL (f1) VALUES ('badly formatted inter...
> > -                                              ^
> >
> > and the error is stack related. So I suspect that setjmp/longjmp might
> > be to blame here, and somehow don't save/restore the stack into a proper
> > state. I don't know enough about mingw/msys/windows to know whether that
> > uses a self-written setjmp or relies on the MS implementation.
> >
> > If you could gather a backtrace it might help us. It's possible that the
> > stack is "just" misaligned or something, we had problems with that
> > before (IIRC valgrind didn't always align stacks correctly for processes
> > that forked from within a signal handler, which then crashed when using
> > instructions with alignment requirements, but only sometimes, because
> > the stack could be aligned).
>
> It seems we're not the only ones hitting this:
> https://rt.perl.org/Public/Bug/Display.html?id=133603
>
> Doesn't look like they've really narrowed it down that much yet.

A few notes:

* As an experiment, it could be worthwhile to try to redefine
  sigsetjmp/longjmp/sigjmp_buf with what
  https://gcc.gnu.org/onlinedocs/gcc/Nonlocal-Gotos.html
  provides; it's apparently a separate implementation from the MS CRT one.

* Arguably
  "Do not use longjmp to transfer control from a callback routine
  invoked directly or indirectly by Windows code."
  and
  "Do not use longjmp to transfer control out of an interrupt-handling
  routine unless the interrupt is caused by a floating-point
  exception. In this case, a program may return from an interrupt
  handler via longjmp if it first reinitializes the floating-point math
  package by calling _fpreset."

  from https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/longjmp?view=vs-2019

  might be violated by our signal emulation on Windows. But I've
  not looked into that in detail.

* Any chance you could get the pre-processed source for postgres.c or
  such? I'm kinda wondering if the definition of setjmp() that we get
  includes the returns_twice attribute that gcc wants to see, and
  whether we're picking up the mingw version of longjmp, or the windows
  one.


  https://sourceforge.net/p/mingw-w64/mingw-w64/ci/844cb490ab2cc32ac3df5914700564b2e40739d8/tree/mingw-w64-headers/crt/setjmp.h#l31

* It's certainly curious that the failures so far have only happened as
  part of pg_upgradeCheck, rather than the plain regression tests.
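
The first experiment in the notes above, swapping the CRT's setjmp/longjmp for GCC's builtins, could be sketched roughly as follows. This assumes GCC or a compatible compiler; the demo_* names are hypothetical, and a real experiment would map sigsetjmp/siglongjmp/sigjmp_buf onto the builtins instead. Whether this avoids the crash is not established here.

```c
/*
 * Assumption: a compiler that provides __builtin_setjmp/__builtin_longjmp
 * (GCC does). The jump buffer is five words long.
 */
typedef void *demo_jmp_buf[5];

#define demo_setjmp(buf)    __builtin_setjmp(buf)
#define demo_longjmp(buf)   __builtin_longjmp((buf), 1)    /* value must be 1 */

static demo_jmp_buf demo_env;

/*
 * __builtin_longjmp may not be used in the same function as the matching
 * __builtin_setjmp, so keep it in its own non-inlined function.
 */
static __attribute__((noinline)) void
demo_fail(void)
{
    demo_longjmp(demo_env);
}

/* Returns 0 if no jump happened, 1 if demo_fail jumped back here. */
int
run_with_builtin_jump(int should_fail)
{
    if (demo_setjmp(demo_env) == 0)
    {
        if (should_fail)
            demo_fail();
        return 0;
    }
    return 1;
}
```

The builtins are implemented by the compiler itself and do not go through the Windows unwinding machinery, which is why they make a useful comparison point for this kind of failure.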

Greetings,

Andres Freund



Re: fairywren failures

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> * It's certainly curious that the failures so far have only happened as
>   part of pg_upgradeCheck, rather than the plain regression tests.

Isn't it though.  We spent a long time wondering why we saw parallel
plan instability mostly in pg_upgradeCheck, too [1].  We eventually
decided that the cause of that instability was chance timing collisions
with bgwriter/checkpointer, but nobody ever really explained why
pg_upgradeCheck should be more prone to hit those windows than the plain
tests are.  I feel like there's something still to be understood there.

Whether this is related, who's to say.  But given your thought about
stack alignment, I'm half thinking that the crash is seen when we get a
signal (e.g. SIGUSR1 from sinval processing) at the wrong time, allowing
the stack to become unaligned, and that the still-unexplained timing
difference in pg_upgradeCheck accounts for that test being more prone to
show it.
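
As a small illustration of what "unaligned" means in this hypothesis, the following sketch checks whether an address is a multiple of a given alignment. The function name is_aligned is hypothetical, and the 16-byte figure in the usage note is the usual x86-64 stack alignment requirement, not something measured on fairywren.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of alignment (alignment must be nonzero). */
int
is_aligned(const void *addr, size_t alignment)
{
    return ((uintptr_t) addr % alignment) == 0;
}
```

In a debugger session or an instrumented build, one could apply a check like is_aligned(p, 16) to the stack pointer value captured at the point of failure to see whether the alignment theory holds.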

            regards, tom lane

[1] https://www.postgresql.org/message-id/20190605050037.GA33985@rfd.leadboat.com



Re: fairywren failures

From
Andrew Dunstan
Date:
On 10/3/19 4:13 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> * It's certainly curious that the failures so far have only happened as
>>   part of pg_upgradeCheck, rather than the plain regression tests.
> Isn't it though.  We spent a long time wondering why we saw parallel
> plan instability mostly in pg_upgradeCheck, too [1].  We eventually
> decided that the cause of that instability was chance timing collisions
> with bgwriter/checkpointer, but nobody ever really explained why
> pg_upgradeCheck should be more prone to hit those windows than the plain
> tests are.  I feel like there's something still to be understood there.
>
> Whether this is related, who's to say.  But given your thought about
> stack alignment, I'm half thinking that the crash is seen when we get a
> signal (e.g. SIGUSR1 from sinval processing) at the wrong time, allowing
> the stack to become unaligned, and that the still-unexplained timing
> difference in pg_upgradeCheck accounts for that test being more prone to
> show it.
>
>             regards, tom lane
>
> [1] https://www.postgresql.org/message-id/20190605050037.GA33985@rfd.leadboat.com



Yes, that's very puzzling. But what do we actually do differently in the
pg_upgrade checks that might account for it? Nothing that is at all
obvious to me.


Another data point: the new Visual Studio 2019 instance drongo running
on the same machine is not exhibiting these problems. Yes, it's not
running test.sh, but vcregress.pl does pretty much the same thing. So
that does seem to point to the toolset. I'll see if I can get the same
toolset jacana is using installed and try that.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: fairywren failures

From
Peter Eisentraut
Date:
On 2019-10-03 16:21, Andrew Dunstan wrote:
> My new msys2 animal fairywren

Could you please check how this animal is labeled?  AFAICT, this is not
an msys2 build but a mingw build (x86_64-w64-mingw32).

> has had 3 recent failures when checking
> pg_upgrade. The failures have been while running the regression tests,
> specifically the interval test, and they all look like this:

I've also seen this randomly, but only under 64-bit mingw, never 32-bit
mingw.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: fairywren failures

From
Andrew Dunstan
Date:
On Sat, Oct 12, 2019 at 3:56 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2019-10-03 16:21, Andrew Dunstan wrote:
> > My new msys2 animal fairywren
>
> Could you please check how this animal is labeled?  AFAICT, this is not
> an msys2 build but a mingw build (x86_64-w64-mingw32).


It is indeed an msys2 system. However, when we set MSYSTEM=MINGW64, as
we do in fairywren's config environment so that the compiler is
properly detected by configure (using Msys2's /etc/config.site),
'uname -a' reports MINGW64... instead of MSYS...

This is a bit confusing.

The compiler currently being used on the animal is the gcc 7.3.0 from
the Mingw64 project, the same one that's being usied on jacana (which
runs Msys1). Notwithstanding the "mingw32" in the compiler name, these
are 64 bit builds. I think the "32" there is somewhat vestigial.


>
> > has had 3 recent failures when checking
> > pg_upgrade. The failures have been while running the regression tests,
> > specifically the interval test, and they all look like this:
>
> I've also seen this randomly, but only under 64-bit mingw, never 32-bit
> mingw.
>


Since I downgraded the compiler from gcc 9.0 about a week ago these
errors seem to have stopped.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: fairywren failures

From
Peter Eisentraut
Date:
On 2019-10-16 13:34, Andrew Dunstan wrote:
>> Could you please check how this animal is labeled?  AFAICT, this is not
>> an msys2 build but a mingw build (x86_64-w64-mingw32).
> 
> It is indeed an msys2 system. However, when we set MSYSTEM=MINGW64, as
> we do in fairywren's config environment so that the compiler is
> properly detected by configure (using Msys2's /etc/config.site),
> 'uname -a' reports MINGW64... instead of MSYS...

When you install MSYS2 from msys2.org, you get three possible build
targets, depending on what you set MSYSTEM to:

MSYSTEM=MINGW32
MSYSTEM=MINGW64
MSYSTEM=MSYS

When a buildfarm member identifies itself as "msys ... 2", then I would
expect the third variant, but that's not what it's doing.  A
MSYSTEM=MSYS build is similar to a Cygwin build (since MSYS2 is a fork
of Cygwin), which is also a valid thing to do, but it's obviously quite
different from a mingw build.
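
One way to see which of these targets a given compiler is actually building for is to look at its predefined macros. The sketch below is an illustration only: build_target_name is a hypothetical function, and it relies on the macros that mingw-w64 and the MSYS/Cygwin toolchains are generally understood to define (__MINGW64__, __MINGW32__, __MSYS__, __CYGWIN__).

```c
/*
 * Hypothetical helper: report the build target based on predefined macros.
 * 64-bit mingw-w64 defines both __MINGW32__ and __MINGW64__, so the 64-bit
 * check has to come first.
 */
const char *
build_target_name(void)
{
#if defined(__MINGW64__)
    return "mingw64";
#elif defined(__MINGW32__)
    return "mingw32";
#elif defined(__MSYS__)
    return "msys";
#elif defined(__CYGWIN__)
    return "cygwin";
#elif defined(_WIN32)
    return "native windows (non-mingw)";
#else
    return "other";
#endif
}
```

Compiled with the x86_64-w64-mingw32 compiler this would be expected to return "mingw64", regardless of whether the surrounding shell environment is MSYS2.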

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: fairywren failures

From
Andrew Dunstan
Date:
On 10/25/19 3:09 PM, Peter Eisentraut wrote:
> On 2019-10-16 13:34, Andrew Dunstan wrote:
>>> Could you please check how this animal is labeled?  AFAICT, this is not
>>> an msys2 build but a mingw build (x86_64-w64-mingw32).
>> It is indeed an msys2 system. However, when we set MSYSTEM=MINGW64, as
>> we do in fairywren's config environment so that the compiler is
>> properly detected by configure (using Msys2's /etc/config.site),
>> 'uname -a' reports MINGW64... instead of MSYS...
> When you install MSYS2 from msys2.org, you get three possible build
> targets, depending on what you set MSYSTEM to:
>
> MSYSTEM=MINGW32
> MSYSTEM=MINGW64
> MSYSTEM=MSYS
>
> When a buildfarm member identifies itself as "msys ... 2", then I would
> expect the third variant, but that's not what it's doing.  A
> MSYSTEM=MSYS build is similar to a Cygwin build (since MSYS2 is a fork
> of Cygwin), which is also a valid thing to do, but it's obviously quite
> different from a mingw build.




If it helps you I can change the compiler name in the animal metainfo to
mingw64-gcc. Msys2 is the build environment, but not the target, which
is native Windows.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services