Обсуждение: Failed test 'psql query died successfully after SIGQUIT'

Поиск
Список
Период
Сортировка

Failed test 'psql query died successfully after SIGQUIT'

От
Thomas Munro
Дата:
Hi,

Does anyone have any thoughts on why Travis CI might be failing
occasionally as shown below, but the build farm apparently doesn't?
It's not alien technology, it's just Ubuntu Linux running on a virtual
machine.  One thing that it does slightly differently is "make
-Otarget -j3 check-world" but I don't see why -j3 would break
something like that specifically, except perhaps by overloading the
system in a way that affects timing.  And yet we have all many of fast
and slow machines in the BF, so I don't know.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/334253917

t/012_subtransactions.pl ............. ok
# aborting wait: program timed out
# stream contents: >><<
# pattern searched for: (?^m:psql-connected)
#   Failed test 'monitor connected'
#   at t/013_crash_restart.pl line 95.
# aborting wait: program timed out
# stream contents: >><<
# pattern searched for: (?^m:WARNING:  terminating connection because
of crash of another server process|server closed the connection
unexpectedly)
#   Failed test 'psql query died successfully after SIGQUIT'
#   at t/013_crash_restart.pl line 110.
# aborting wait: program timed out
# stream contents: >>psql:<stdin>:3: WARNING:  terminating connection
because of crash of another server process
# DETAIL:  The postmaster has commanded this server process to roll
back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
# HINT:  In a moment you should be able to reconnect to the database
and repeat your command.
# psql:<stdin>:3: server closed the connection unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# psql:<stdin>:3: connection to server was lost
# <<
# pattern searched for: (?^m:WARNING:  terminating connection because
of crash of another server process|server closed the connection
unexpectedly)
#   Failed test 'psql monitor died successfully after SIGQUIT'
#   at t/013_crash_restart.pl line 119.
# Looks like you failed 3 tests of 18.
t/013_crash_restart.pl ...............
Dubious, test returned 3 (wstat 768, 0x300)
Failed 3/18 subtests
Test Summary Report
-------------------
t/013_crash_restart.pl             (Wstat: 768 Tests: 18 Failed: 3)
  Failed tests:  3, 5-6
  Non-zero exit status: 3
Files=13, Tests=134, 198 wallclock secs ( 0.10 usr  0.01 sys + 35.98
cusr 10.35 csys = 46.44 CPU)
Result: FAIL
Makefile:19: recipe for target 'check' failed

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Failed test 'psql query died successfully after SIGQUIT'

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Does anyone have any thoughts on why Travis CI might be failing
> occasionally as shown below, but the build farm apparently doesn't?
> It's not alien technology, it's just Ubuntu Linux running on a virtual
> machine.  One thing that it does slightly differently is "make
> -Otarget -j3 check-world" but I don't see why -j3 would break
> something like that specifically, except perhaps by overloading the
> system in a way that affects timing.

Quite a few of us use -jsomething with make check-world, so I doubt
that that's an issue per se.  I have no idea what -Otarget means in
this context though, and neither does "info make".

> # aborting wait: program timed out

Hmm -- looking at src/test/recovery/t/013_crash_restart.pl, that
message comes out if this timer expires:

    # To avoid hanging while expecting some specific input from a psql
    # instance being driven by us, add a timeout high enough that it
    # should never trigger even on very slow machines, unless something
    # is really wrong.
    my $psql_timeout = IPC::Run::timer(60);

so in other words, all you need for this failure is for
013_crash_restart.pl to take more than 60 seconds (per-command, I think,
not total).  I find the comment's optimism about that being impossible
rather naive :-(, even if the test does take only a couple of seconds on
typical modern hardware.  I see in the buildfarm that skink takes >240
seconds ...

If you can easily experiment with changing that limit to 120 or 180
seconds, or can measure how long the test takes when it passes, that'd
be useful info.

            regards, tom lane


Re: Failed test 'psql query died successfully after SIGQUIT'

От
Thomas Munro
Дата:
On Mon, Jan 29, 2018 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Does anyone have any thoughts on why Travis CI might be failing
>> occasionally as shown below, but the build farm apparently doesn't?
>> It's not alien technology, it's just Ubuntu Linux running on a virtual
>> machine.  One thing that it does slightly differently is "make
>> -Otarget -j3 check-world" but I don't see why -j3 would break
>> something like that specifically, except perhaps by overloading the
>> system in a way that affects timing.
>
> Quite a few of us use -jsomething with make check-world, so I doubt
> that that's an issue per se.  I have no idea what -Otarget means in
> this context though, and neither does "info make".

-Otarget means that it should buffer the output from all the parallel
jobs and then serialise them into per-target chunks, so that you can
read the output instead of getting intermingled lines from concurrent
stuff.  I learned this trick from Andres... I think it landed in GNU
make 4.something, so it's quite new.  The latest web documentation
only says that it exists, and doesn't say what it does!

https://www.gnu.org/software/make/manual/make.html

>> # aborting wait: program timed out
>
> Hmm -- looking at src/test/recovery/t/013_crash_restart.pl, that
> message comes out if this timer expires:
>
>         # To avoid hanging while expecting some specific input from a psql
>         # instance being driven by us, add a timeout high enough that it
>         # should never trigger even on very slow machines, unless something
>         # is really wrong.
>         my $psql_timeout = IPC::Run::timer(60);
>
> so in other words, all you need for this failure is for
> 013_crash_restart.pl to take more than 60 seconds (per-command, I think,
> not total).  I find the comment's optimism about that being impossible
> rather naive :-(, even if the test does take only a couple of seconds on
> typical modern hardware.  I see in the buildfarm that skink takes >240
> seconds ...

Ahh.

> If you can easily experiment with changing that limit to 120 or 180
> seconds, or can measure how long the test takes when it passes, that'd
> be useful info.

Will look into that.  Thanks!

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Failed test 'psql query died successfully after SIGQUIT'

От
Thomas Munro
Дата:
On Mon, Jan 29, 2018 at 5:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Mon, Jan 29, 2018 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>>> Does anyone have any thoughts on why Travis CI might be failing
>>> occasionally as shown below, but the build farm apparently doesn't?
>>> It's not alien technology, it's just Ubuntu Linux running on a virtual
>>> machine.  One thing that it does slightly differently is "make
>>> -Otarget -j3 check-world" but I don't see why -j3 would break
>>> something like that specifically, except perhaps by overloading the
>>> system in a way that affects timing.
>>
>> Quite a few of us use -jsomething with make check-world, so I doubt
>> that that's an issue per se.  I have no idea what -Otarget means in
>> this context though, and neither does "info make".
>
> -Otarget means that it should buffer the output from all the parallel
> jobs and then serialise them into per-target chunks, so that you can
> read the output instead of getting intermingled lines from concurrent
> stuff.  I learned this trick from Andres... I think it landed in GNU
> make 4.something, so it's quite new.  The latest web documentation
> only says that it exists, and doesn't say what it does!
>
> https://www.gnu.org/software/make/manual/make.html

Oh, actually it is now documented:

‘-O[type]’

‘--output-sync[=type]’

Ensure that the complete output from each recipe is printed in one
uninterrupted sequence. This option is only useful when using the
--jobs option to run multiple recipes simultaneously (see Parallel
Execution) Without this option output will be displayed as it is
generated by the recipes.

With no type or the type ‘target’, output from the entire recipe of
each target is grouped together. With the type ‘line’, output from
each line in the recipe is grouped together. With the type ‘recurse’,
the output from an entire recursive make is grouped together. With the
type ‘none’, no output synchronization is performed. See Output During
Parallel Execution.

--
Thomas Munro
http://www.enterprisedb.com