Обсуждение: buildfarm instance bichir stuck

Поиск
Список
Период
Сортировка

buildfarm instance bichir stuck

От
Robins Tharakan
Дата:
Hi,

Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.

It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I can see that it's been unable to even start regression tests after that commit went in.

Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't really production use-case. The only run that actually got submitted to Buildfarm was from a few days back when I killed it after a long wait - see [1].

Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below) and although pstack not working may be a limitation of the architecture / installation (unsure), a trace shows it is stuck at poll.

Tracing commits, it seems that the commit 6a2a70a02018d6362f9841cc2f499cc45405e86b broke things and I can confirm that 'make check' works if I rollback to the preceding commit ( 83709a0d5a46559db016c50ded1a95fd3b0d3be6 ).

Not sure if many agree but 2 things stood out here:
1) Buildfarm never got the message that a commit broke an instance. Ideally I'd have expected buildfarm to have an optimistic timeout that could have helped - for e.g. right now, the CREATE DATABASE is still stuck since 18 hrs.

2) bichir is clearly not a production use-case (it takes 5 hrs to complete a HEAD run!), so let me know if this change is intentional (I guess I'll stop maintaining it if so) but thought I'd still put this out in case it interests someone.

-
thanks
robins

Reference:
1) Last run that I had to kill - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bichir&dt=2021-03-31%2012%3A00%3A05

#####################################################
The current run is running since yesterday.


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ tail -2 lastcommand.log
running on port 5678 with PID 8715
============== creating database "regression"         ==============


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ date
Wed Apr  7 12:48:26 AEST 2021


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ ls -la
total 840
drwxrwxr-x 1 postgres postgres   4096 Apr  6 09:00 .
drwxrwxr-x 1 postgres postgres   4096 Apr  6 08:55 ..
-rw-rw-r-- 1 postgres postgres   1358 Apr  6 08:55 SCM-checkout.log
-rw-rw-r-- 1 postgres postgres  91546 Apr  6 08:56 configure.log
-rw-rw-r-- 1 postgres postgres     40 Apr  6 08:55 githead.log
-rw-rw-r-- 1 postgres postgres   2890 Apr  6 09:01 lastcommand.log
-rw-rw-r-- 1 postgres postgres 712306 Apr  6 09:00 make.log


root@WSLv1:~# pstack 8729
8729: psql -X -c CREATE DATABASE "regression" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C' postgres
pstack: Bad address
failed to read target.


root@WSLv1:~# gdb -batch -ex bt -p 8729
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0  0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f41a9bc8eb1 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffe13d7be8) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  pqSocketPoll (end_time=-1, forWrite=0, forRead=1, sock=<optimized out>) at fe-misc.c:1133
#3  pqSocketCheck (conn=0x7fffd979a0b0, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1075
#4  0x00007f41a9bc8ff0 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x7fffd979a0b0, finish_time=<optimized out>) at fe-misc.c:1007
#5  0x00007f41a9bc5ac9 in PQgetResult (conn=0x7fffd979a0b0) at fe-exec.c:1963
#6  0x00007f41a9bc5ea3 in PQexecFinish (conn=0x7fffd979a0b0) at fe-exec.c:2306
#7  0x00007f41a9bc5ef2 in PQexec (conn=<optimized out>, query=query@entry=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at fe-exec.c:2148
#8  0x00007f41aa21e7a0 in SendQuery (query=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at common.c:1303
#9  0x00007f41aa2160a6 in main (argc=<optimized out>, argv=<optimized out>) at startup.c:369



#####################################################



Here we can see that 83709a0d5a46559db016c50ded1a95fd3b0d3be6 goes past 'CREATE DATABASE'
=======================
robins@WSLv1:~/proj/postgres/postgres$ git checkout 83709a0d5a46559db016c50ded1a95fd3b0d3be6
Previous HEAD position was 6a2a70a020 Use signalfd(2) for epoll latches.
HEAD is now at 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.

robins@WSLv1:~/proj/postgres/postgres$ cd src/test/regress/

robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1  checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/li b"  ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir=   --no-locale  --dlpath=. --max-concurrent-tests=20  --schedule=./parallel_sched ule
============== removing existing temp instance        ==============
============== creating temporary instance            ==============
============== initializing database system           ==============
============== starting postmaster                    ==============
running on port 58080 with PID 25879
============== creating database "regression"         ==============
CREATE DATABASE
ALTER DATABASE
============== running regression test queries        ==============
test tablespace                   ... ok         1239 ms
parallel group (20 tests):  boolean char varchar name text int2 int4 int8 oid float4 float8 bit^CGNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt



But checking out 6a2a70a02018d6362f9841cc2f499cc45405e86b we can see that it hangs at 'CREATE DATABASE'
=======================================
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ git checkout 6a2a70a02018d6362f9841cc2f499cc45405e86b
Previous HEAD position was 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.
HEAD is now at 6a2a70a020 Use signalfd(2) for epoll latches.
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1  checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/lib"  ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir=   --no-locale  --dlpath=. --max-concurrent-tests=20  --schedule=./parallel_schedule
============== removing existing temp instance        ==============
============== creating temporary instance            ==============
============== initializing database system           ==============
============== starting postmaster                    ==============
running on port 58080 with PID 26702
============== creating database "regression"         ==============
stuck here ^^^
^CCancel request sent
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
command failed: "psql" -X -c "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'" "postgres"
pg_ctl: PID file "/home/robins/proj/postgres/postgres/src/test/regress/./tmp_check/data/postmaster.pid" does not exist
Is server running?

pg_regress: could not stop postmaster: exit code was 256
GNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt

Re: buildfarm instance bichir stuck

От
Thomas Munro
Дата:
On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> Bichir's been stuck for the past month and is unable to run regression tests since
6a2a70a02018d6362f9841cc2f499cc45405e86b.

Hrmph.  That's "Use signalfd(2) for epoll latches."  I had a similar
report from an illumos user (but it was intermittent).  I have never
seen such a failure on Linux.  My first guess is that these two
systems that are doing Linux system call emulation have implemented
subtly different semantics, and something is going wrong like this: a
SIGUSR1 arrives to tell you some important news about a procsignal and
the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid,
SIGURG), but somehow that fails to wake up the epoll() you are
sleeping in which contains the signalfd that should receive the signal
and report it by being readable, due to some internal race.  Or
something like that.  But I haven't been able to verify that theory
because I don't have any of those computers.  If it is indeed
something like that and not a bug in my code, then I was thinking that
the main tool available to deal with it would be to set WAIT_USE_POLL
in the relevant template file, so that we don't use the combination of
epoll + signalfd on illlumos, but then WSL1 thows a spanner in the
works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
from an Ubuntu package with a freaky kernel.  Hmm.

> It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I
cansee that it's been unable to even start regression tests after that commit went in. 

Oh, well at least it's easily reproducible then, that's something!

> Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't
reallyproduction use-case. The only run that actually got submitted to Buildfarm was from a few days back when I killed
itafter a long wait - see [1]. 
>
> Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below) and although pstack not
workingmay be a limitation of the architecture / installation (unsure), a trace shows it is stuck at poll. 

That's actually the client.  I guess there is also a backend process
stuck somewhere in epoll_wait()?



Re: buildfarm instance bichir stuck

От
Robins Tharakan
Дата:
Hi Thomas,

Thanks for taking a look at this promptly.


On Wed, 7 Apr 2021 at 16:17, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> > It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I can see that it's been unable to even start regression tests after that commit went in.
>
> Oh, well at least it's easily reproducible then, that's something!

Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch.

 
>
> That's actually the client.  I guess there is also a backend process
> stuck somewhere in epoll_wait()?

You're right (and yes my bad, I was looking at the client). The server process is stuck in epoll_wait(). Let me know if you need me to give any other info that may be helpful.


root@WSLv1:~# gdb -batch -ex bt -p 29887
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
#0  0x00007fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fa088c355dc in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7fffd2d4c090, cur_timeout=-1, set=0x7fffcbcc56e8) at latch.c:1428
#2  WaitEventSetWait (set=0x7fffcbcc56e8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7fffd2d4c090, nevents=nevents@entry=1, wait_event_info=wait_ev
#3  0x00007fa088c35a14 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217733) at
#4  0x00007fa088c43ed8 in ConditionVariableTimedSleep (cv=0x7fa0873cc498, timeout=-1, wait_event_info=134217733) at condition_variable.c:163
#5  0x00007fa088bba8bc in RequestCheckpoint (flags=flags@entry=44) at checkpointer.c:1017
#6  0x00007fa088a46315 in createdb (pstate=pstate@entry=0x7fffcbcebbc0, stmt=stmt@entry=0x7fffcbcca558) at dbcommands.c:711
.
.
.

-
robins

Re: buildfarm instance bichir stuck

От
Andrew Dunstan
Дата:
On 4/7/21 2:16 AM, Thomas Munro wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
>> Bichir's been stuck for the past month and is unable to run regression tests since
6a2a70a02018d6362f9841cc2f499cc45405e86b.
> Hrmph.  That's "Use signalfd(2) for epoll latches."  I had a similar
> report from an illumos user (but it was intermittent).  I have never
> seen such a failure on Linux.  My first guess is that these two
> systems that are doing Linux system call emulation have implemented
> subtly different semantics, and something is going wrong like this: a
> SIGUSR1 arrives to tell you some important news about a procsignal and
> the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid,
> SIGURG), but somehow that fails to wake up the epoll() you are
> sleeping in which contains the signalfd that should receive the signal
> and report it by being readable, due to some internal race.  Or
> something like that.  But I haven't been able to verify that theory
> because I don't have any of those computers.  If it is indeed
> something like that and not a bug in my code, then I was thinking that
> the main tool available to deal with it would be to set WAIT_USE_POLL
> in the relevant template file, so that we don't use the combination of
> epoll + signalfd on illlumos, but then WSL1 thows a spanner in the
> works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> from an Ubuntu package with a freaky kernel.  Hmm.
>

To test this the OP could just add


    CPPFLAGS => '-DWAIT_USE_POLL',


to his animal's config's config_env stanza.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: buildfarm instance bichir stuck

От
Robins Tharakan
Дата:
Thanks Andrew.

The build's still running but the CPPFLAGS hint does seem to have helped (see below).

Unless advised otherwise, I intend to let that option be, so as to get bichir back online. If a future commit 'fixes' things, I could rollback this flag to test things out (or try out other options if required).


On Wed, 7 Apr 2021 at 21:49, Andrew Dunstan <andrew@dunslane.net> wrote:
> On 4/7/21 2:16 AM, Thomas Munro wrote:
> > On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> >> Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.
> > ...If it is indeed
> > something like that and not a bug in my code, then I was thinking that
> > the main tool available to deal with it would be to set WAIT_USE_POLL
> > in the relevant template file, so that we don't use the combination of
> > epoll + signalfd on illlumos, but then WSL1 thows a spanner in the
> > works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> > from an Ubuntu package with a freaky kernel.  Hmm.
> To test this the OP could just add
>     CPPFLAGS => '-DWAIT_USE_POLL',
> to his animal's config's config_env stanza.

This did help in getting past the previous hurdle.

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep CPPFLAGS configure.log| grep using
configure: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2
configure:19511: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep -A2 "creating database" lastcommand.log
============== creating database "regression"         ==============
CREATE DATABASE
ALTER DATABASE

-
thanks
robins

Re: buildfarm instance bichir stuck

От
Tom Lane
Дата:
Robins Tharakan <tharakan@gmail.com> writes:
> Not sure if many agree but 2 things stood out here:
> 1) Buildfarm never got the message that a commit broke an instance. Ideally
> I'd have expected buildfarm to have an optimistic timeout that could have
> helped - for e.g. right now, the CREATE DATABASE is still stuck since 18
> hrs.

As far as that goes, you can set wait_timeout in the animal's config
to something comfortably more than the longest run time you expect.
It doesn't default to enabled though, possibly because picking a
one-size-fits-all value would be impossible.

I do use it on some of my flakier dinosaurs, and I've noticed that
when it does kick in, the buildfarm run just stops dead and no report
is sent to the BF server.  That has advantages in not cluttering the
BF status with run-failed-because-of-$weird_problem issues, but it
doesn't help from the standpoint of noticing when your animal is stuck.
Maybe it'd be better to change that behavior.

(I can also attest from personal experience that what had been a
comfortable amount of slop when you picked it tends to become less
so over time.  Consider yourself warned.)

            regards, tom lane



Re: buildfarm instance bichir stuck

От
Andrew Dunstan
Дата:
On 4/7/21 1:07 PM, Tom Lane wrote:
> Robins Tharakan <tharakan@gmail.com> writes:
>> Not sure if many agree but 2 things stood out here:
>> 1) Buildfarm never got the message that a commit broke an instance. Ideally
>> I'd have expected buildfarm to have an optimistic timeout that could have
>> helped - for e.g. right now, the CREATE DATABASE is still stuck since 18
>> hrs.
> As far as that goes, you can set wait_timeout in the animal's config
> to something comfortably more than the longest run time you expect.
> It doesn't default to enabled though, possibly because picking a
> one-size-fits-all value would be impossible.
>
> I do use it on some of my flakier dinosaurs, and I've noticed that
> when it does kick in, the buildfarm run just stops dead and no report
> is sent to the BF server.  That has advantages in not cluttering the
> BF status with run-failed-because-of-$weird_problem issues, but it
> doesn't help from the standpoint of noticing when your animal is stuck.
> Maybe it'd be better to change that behavior.
>

Yeah, I'll have a look. It's not simple for a bunch of reasons.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: buildfarm instance bichir stuck

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 4/7/21 1:07 PM, Tom Lane wrote:
>> I do use it on some of my flakier dinosaurs, and I've noticed that
>> when it does kick in, the buildfarm run just stops dead and no report
>> is sent to the BF server.  That has advantages in not cluttering the
>> BF status with run-failed-because-of-$weird_problem issues, but it
>> doesn't help from the standpoint of noticing when your animal is stuck.
>> Maybe it'd be better to change that behavior.

> Yeah, I'll have a look. It's not simple for a bunch of reasons.

On further thought, that doesn't seem like the place to fix it.
I'd rather be able to ask the buildfarm server to send me nagmail
if my animal hasn't sent a report in N days (where N had better
be owner-configurable).  This would catch not only animal-is-hung,
but also other classes of problems like whole-machine-is-hung or
you-broke-your-firewall-configuration-so-it-cant-contact-the-server.
I've had issues of those sorts before ...

            regards, tom lane



Re: buildfarm instance bichir stuck

От
Andrew Dunstan
Дата:
On 4/7/21 4:02 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 4/7/21 1:07 PM, Tom Lane wrote:
>>> I do use it on some of my flakier dinosaurs, and I've noticed that
>>> when it does kick in, the buildfarm run just stops dead and no report
>>> is sent to the BF server.  That has advantages in not cluttering the
>>> BF status with run-failed-because-of-$weird_problem issues, but it
>>> doesn't help from the standpoint of noticing when your animal is stuck.
>>> Maybe it'd be better to change that behavior.
>> Yeah, I'll have a look. It's not simple for a bunch of reasons.
> On further thought, that doesn't seem like the place to fix it.
> I'd rather be able to ask the buildfarm server to send me nagmail
> if my animal hasn't sent a report in N days (where N had better
> be owner-configurable).  This would catch not only animal-is-hung,
> but also other classes of problems like whole-machine-is-hung or
> you-broke-your-firewall-configuration-so-it-cant-contact-the-server.
> I've had issues of those sorts before ...
>
>             



That already exists, and has for a long time. See the 'alerts' stanza of
your config file.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: buildfarm instance bichir stuck

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 4/7/21 4:02 PM, Tom Lane wrote:
>> On further thought, that doesn't seem like the place to fix it.
>> I'd rather be able to ask the buildfarm server to send me nagmail
>> if my animal hasn't sent a report in N days (where N had better
>> be owner-configurable).

> That already exists, and has for a long time. See the 'alerts' stanza of
> your config file.

Oh!  In that case, I don't think we need anything else.

            regards, tom lane



Re: buildfarm instance bichir stuck

От
Thomas Munro
Дата:
On Wed, Apr 7, 2021 at 7:31 PM Robins Tharakan <tharakan@gmail.com> wrote:
> Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch.

From your description it sounds like signals are not arriving at all,
rather than some more complicated race.  Let's go back to basics...
what does the attached program print for you?  I see:

tmunro@x1:~/junk$ cc test-signalfd.c
tmunro@x1:~/junk$ ./a.out
blocking SIGURG...
creating a signalfd to receive SIGURG...
creating an epoll set...
adding signalfd to epoll set...
polling the epoll set... 0
sending a signal...
polling the epoll set... 1

Вложения

Re: buildfarm instance bichir stuck

От
Robins Tharakan
Дата:

On Fri, 9 Apr 2021 at 16:12, Thomas Munro <thomas.munro@gmail.com> wrote:
> From your description it sounds like signals are not arriving at all,
> rather than some more complicated race.  Let's go back to basics...
> what does the attached program print for you?  I see:
>
> tmunro@x1:~/junk$ cc test-signalfd.c
> tmunro@x1:~/junk$ ./a.out
> blocking SIGURG...
> creating a signalfd to receive SIGURG...
> creating an epoll set...
> adding signalfd to epoll set...
> polling the epoll set... 0
> sending a signal...
> polling the epoll set... 1


I get pretty much the same. Some additional info below, although not sure if it'd be of any help here.

robins@WSLv1:~/proj/hackers$ cc test-signalfd.c

robins@WSLv1:~/proj/hackers$ ./a.out
blocking SIGURG...
creating a signalfd to receive SIGURG...
creating an epoll set...
adding signalfd to epoll set...
polling the epoll set... 0
sending a signal...
polling the epoll set... 1

robins@WSLv1:~/proj/hackers$ cat /proc/cpuinfo | egrep 'flags|model' | sort -u
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave osxsave avx f16c rdrand lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt ibrs ibpb stibp ssbd
model           : 142
model name      : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz

robins@WSLv1:~/proj/hackers$ uname -a
Linux WSLv1 4.4.0-19041-Microsoft #488-Microsoft Mon Sep 01 13:43:00 PST 2020 x86_64 x86_64 x86_64 GNU/Linux

C:>wsl -l -v
  NAME            STATE           VERSION
* Ubuntu-18.04    Running         1

-
robins

Re: buildfarm instance bichir stuck

От
Thomas Munro
Дата:
On Fri, Apr 9, 2021 at 6:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 7, 2021 at 7:31 PM Robins Tharakan <tharakan@gmail.com> wrote:
> > Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch.
>
> From your description it sounds like signals are not arriving at all,
> rather than some more complicated race.  Let's go back to basics...

I was looking into the portability of SIGURG and OOB socket data for
something totally different (hallway track discussion from PGCon,
could we use that for query cancel, like FTP does, instead of opening
another socket?), and lo and behold, someone has figured out a
workaround for this latch problem:

https://github.com/microsoft/WSL/issues/8619

I don't really want to add code to scrape uname() ouput detect
different kernels at runtime as shown there, but it doesn't seem to
make a difference on Linux if we just always do what was suggested.  I
didn't look too hard into whether that is the right place to put the
call, or really understand *why* it works, and since I am not a
Windows user and we don't have a WSL1 CI, I can't confirm that it
works or explore whether there is some other ordering of operations
that would be better but still work, but if that does the trick then
maybe we should just do something like the attached.

Thoughts?

Вложения