Обсуждение: Failed test 'psql query died successfully after SIGQUIT'
Hi, Does anyone have any thoughts on why Travis CI might be failing occasionally as shown below, but the build farm apparently doesn't? It's not alien technology, it's just Ubuntu Linux running on a virtual machine. One thing that it does slightly differently is "make -Otarget -j3 check-world" but I don't see why -j3 would break something like that specifically, except perhaps by overloading the system in a way that affects timing. And yet we have all many of fast and slow machines in the BF, so I don't know. https://travis-ci.org/postgresql-cfbot/postgresql/builds/334253917 t/012_subtransactions.pl ............. ok # aborting wait: program timed out # stream contents: >><< # pattern searched for: (?^m:psql-connected) # Failed test 'monitor connected' # at t/013_crash_restart.pl line 95. # aborting wait: program timed out # stream contents: >><< # pattern searched for: (?^m:WARNING: terminating connection because of crash of another server process|server closed the connection unexpectedly) # Failed test 'psql query died successfully after SIGQUIT' # at t/013_crash_restart.pl line 110. # aborting wait: program timed out # stream contents: >>psql:<stdin>:3: WARNING: terminating connection because of crash of another server process # DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. # HINT: In a moment you should be able to reconnect to the database and repeat your command. # psql:<stdin>:3: server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # psql:<stdin>:3: connection to server was lost # << # pattern searched for: (?^m:WARNING: terminating connection because of crash of another server process|server closed the connection unexpectedly) # Failed test 'psql monitor died successfully after SIGQUIT' # at t/013_crash_restart.pl line 119. # Looks like you failed 3 tests of 18. t/013_crash_restart.pl ............... Dubious, test returned 3 (wstat 768, 0x300) Failed 3/18 subtests Test Summary Report ------------------- t/013_crash_restart.pl (Wstat: 768 Tests: 18 Failed: 3) Failed tests: 3, 5-6 Non-zero exit status: 3 Files=13, Tests=134, 198 wallclock secs ( 0.10 usr 0.01 sys + 35.98 cusr 10.35 csys = 46.44 CPU) Result: FAIL Makefile:19: recipe for target 'check' failed -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > Does anyone have any thoughts on why Travis CI might be failing > occasionally as shown below, but the build farm apparently doesn't? > It's not alien technology, it's just Ubuntu Linux running on a virtual > machine. One thing that it does slightly differently is "make > -Otarget -j3 check-world" but I don't see why -j3 would break > something like that specifically, except perhaps by overloading the > system in a way that affects timing. Quite a few of us use -jsomething with make check-world, so I doubt that that's an issue per se. I have no idea what -Otarget means in this context though, and neither does "info make". > # aborting wait: program timed out Hmm -- looking at src/test/recovery/t/013_crash_restart.pl, that message comes out if this timer expires: # To avoid hanging while expecting some specific input from a psql # instance being driven by us, add a timeout high enough that it # should never trigger even on very slow machines, unless something # is really wrong. my $psql_timeout = IPC::Run::timer(60); so in other words, all you need for this failure is for 013_crash_restart.pl to take more than 60 seconds (per-command, I think, not total). I find the comment's optimism about that being impossible rather naive :-(, even if the test does take only a couple of seconds on typical modern hardware. I see in the buildfarm that skink takes >240 seconds ... If you can easily experiment with changing that limit to 120 or 180 seconds, or can measure how long the test takes when it passes, that'd be useful info. regards, tom lane
On Mon, Jan 29, 2018 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: >> Does anyone have any thoughts on why Travis CI might be failing >> occasionally as shown below, but the build farm apparently doesn't? >> It's not alien technology, it's just Ubuntu Linux running on a virtual >> machine. One thing that it does slightly differently is "make >> -Otarget -j3 check-world" but I don't see why -j3 would break >> something like that specifically, except perhaps by overloading the >> system in a way that affects timing. > > Quite a few of us use -jsomething with make check-world, so I doubt > that that's an issue per se. I have no idea what -Otarget means in > this context though, and neither does "info make". -Otarget means that it should buffer the output from all the parallel jobs and then serialise them into per-target chunks, so that you can read the output instead of getting intermingled lines from concurrent stuff. I learned this trick from Andres... I think it landed in GNU make 4.something, so it's quite new. The latest web documentation only says that it exists, and doesn't say what it does! https://www.gnu.org/software/make/manual/make.html >> # aborting wait: program timed out > > Hmm -- looking at src/test/recovery/t/013_crash_restart.pl, that > message comes out if this timer expires: > > # To avoid hanging while expecting some specific input from a psql > # instance being driven by us, add a timeout high enough that it > # should never trigger even on very slow machines, unless something > # is really wrong. > my $psql_timeout = IPC::Run::timer(60); > > so in other words, all you need for this failure is for > 013_crash_restart.pl to take more than 60 seconds (per-command, I think, > not total). I find the comment's optimism about that being impossible > rather naive :-(, even if the test does take only a couple of seconds on > typical modern hardware. I see in the buildfarm that skink takes >240 > seconds ... Ahh. > If you can easily experiment with changing that limit to 120 or 180 > seconds, or can measure how long the test takes when it passes, that'd > be useful info. Will look into that. Thanks! -- Thomas Munro http://www.enterprisedb.com
On Mon, Jan 29, 2018 at 5:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Mon, Jan 29, 2018 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Thomas Munro <thomas.munro@enterprisedb.com> writes: >>> Does anyone have any thoughts on why Travis CI might be failing >>> occasionally as shown below, but the build farm apparently doesn't? >>> It's not alien technology, it's just Ubuntu Linux running on a virtual >>> machine. One thing that it does slightly differently is "make >>> -Otarget -j3 check-world" but I don't see why -j3 would break >>> something like that specifically, except perhaps by overloading the >>> system in a way that affects timing. >> >> Quite a few of us use -jsomething with make check-world, so I doubt >> that that's an issue per se. I have no idea what -Otarget means in >> this context though, and neither does "info make". > > -Otarget means that it should buffer the output from all the parallel > jobs and then serialise them into per-target chunks, so that you can > read the output instead of getting intermingled lines from concurrent > stuff. I learned this trick from Andres... I think it landed in GNU > make 4.something, so it's quite new. The latest web documentation > only says that it exists, and doesn't say what it does! > > https://www.gnu.org/software/make/manual/make.html Oh, actually it is now documented: ‘-O[type]’ ‘--output-sync[=type]’ Ensure that the complete output from each recipe is printed in one uninterrupted sequence. This option is only useful when using the --jobs option to run multiple recipes simultaneously (see Parallel Execution) Without this option output will be displayed as it is generated by the recipes. With no type or the type ‘target’, output from the entire recipe of each target is grouped together. With the type ‘line’, output from each line in the recipe is grouped together. With the type ‘recurse’, the output from an entire recursive make is grouped together. With the type ‘none’, no output synchronization is performed. See Output During Parallel Execution. -- Thomas Munro http://www.enterprisedb.com