Обсуждение: Maybe BF "timedout" failures are the client script's fault?

Поиск
Список
Период
Сортировка

Maybe BF "timedout" failures are the client script's fault?

От
Tom Lane
Дата:
We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.

Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal.  So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.

I propose the attached patch to the BF client to try to make this
more robust.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37

--- run_build.pl.orig    2025-11-25 07:47:25
+++ run_build.pl    2026-01-09 15:02:23
@@ -3415,7 +3415,13 @@ sub wait_timeout
         $SIG{$sig} = 'DEFAULT';
     }
     $SIG{'TERM'} = \&silent_terminate;
-    sleep($wait_time);
+    # loop to absorb any unexpected signals without dying early
+    my $end_time = time + $wait_time;
+    while (time < $end_time)
+    {
+        my $delay = $end_time - time;
+        sleep($delay);
+    }
     print STDERR "Run timed out, terminating.\n";
     my $sig = $usig[0] || 'TERM';
     kill $sig, $main_pid;

Re: Maybe BF "timedout" failures are the client script's fault?

От
Michael Banck
Дата:
Hi,

On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:
> We've been assuming that all the "timedout" failures on BF member
> fruitcrow were due to some wonkiness in the GNU/Hurd platform.
> I got suspicious about that though after noticing that there are
> a small number of such failures on other animals, eg [1][2][3].
> In each case, the failure message claims it waited a good long
> time, which is at variance with the actually observed runtime.
> For instance [1] says "timed out after 14400 secs", but the
> actual total test runtime is only 01:24:28 according to the
> summary at the top of the page.
>
> Looking into the buildfarm client, I realized that it's assuming that
> "sleep($wait_time)" is sufficient to wait for $wait_time seconds.
> However, the Perl docs point out that sleep() can be interrupted by a
> signal.  So now I'm suspicious that many of these failures are caused
> by a stray signal waking up the wait_timeout thread prematurely.
> GNU/Hurd might just be more prone to that than other platforms.

That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.

I've ran that test manually quite a lot and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public builfarm logs (at least I can't find it on a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.

> I propose the attached patch to the BF client to try to make this
> more robust.

Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?


Michael



Re: Maybe BF "timedout" failures are the client script's fault?

От
Tom Lane
Дата:
Michael Banck <mbanck@gmx.net> writes:
> On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:
>> Looking into the buildfarm client, I realized that it's assuming that
>> "sleep($wait_time)" is sufficient to wait for $wait_time seconds.
>> However, the Perl docs point out that sleep() can be interrupted by a
>> signal.  So now I'm suspicious that many of these failures are caused
>> by a stray signal waking up the wait_timeout thread prematurely.

> That might be the case for those other failures, but unfortunately, I
> think the fruitcrow failures are really because it gets stuck endlessly
> in the test_shm_mq test (it is always that one) and only the test
> timeout kicks it out.

If it's always the same test, then yeah that's evidence against
my theory (at least for fruitcrow's failures).

> I've ran that test manually quite a lot and either it finishes in 10-15
> seconds, or (presumably) never. This is not really easy to see in the
> public builfarm logs (at least I can't find it on a quick glance), but
> I've routinely checked the log timestamps of the runs, and they really
> take one hour (wait_timeout) in the case of a hang.

Hmm.  Then why is the BF report showing that the total runtime is
nowhere near that?  I wonder how those times are gathered ...

            regards, tom lane



Re: Maybe BF "timedout" failures are the client script's fault?

От
Michael Banck
Дата:
Hi,

On Fri, Jan 09, 2026 at 04:42:22PM -0500, Tom Lane wrote:
> Michael Banck <mbanck@gmx.net> writes:
> > I've ran that test manually quite a lot and either it finishes in 10-15
> > seconds, or (presumably) never. This is not really easy to see in the
> > public builfarm logs (at least I can't find it on a quick glance), but
> > I've routinely checked the log timestamps of the runs, and they really
> > take one hour (wait_timeout) in the case of a hang.
>
> Hmm.  Then why is the BF report showing that the total runtime is
> nowhere near that?  I wonder how those times are gathered ...

Looks like the server takes the timestamp of the logfile as an educated
guess when the particular stage finished:

https://github.com/PGBuildFarm/server-code/blob/main/cgi-bin/pgstatus.pl#L496

But this does not work when the stage hangs and gets terminated
externally, it seems no output is appended to the stage log in this case
by the buildfarm client and (at least for fruitcrow) neither by the
stage itself, so the server thinks the stage duration was whatever time
it took until the last piece of output before the hang.


Michael



Re: Maybe BF "timedout" failures are the client script's fault?

От
Andrew Dunstan
Дата:
On 2026-01-09 Fr 3:41 PM, Tom Lane wrote:
> We've been assuming that all the "timedout" failures on BF member
> fruitcrow were due to some wonkiness in the GNU/Hurd platform.
> I got suspicious about that though after noticing that there are
> a small number of such failures on other animals, eg [1][2][3].
> In each case, the failure message claims it waited a good long
> time, which is at variance with the actually observed runtime.
> For instance [1] says "timed out after 14400 secs", but the
> actual total test runtime is only 01:24:28 according to the
> summary at the top of the page.
>
> Looking into the buildfarm client, I realized that it's assuming that
> "sleep($wait_time)" is sufficient to wait for $wait_time seconds.
> However, the Perl docs point out that sleep() can be interrupted by a
> signal.  So now I'm suspicious that many of these failures are caused
> by a stray signal waking up the wait_timeout thread prematurely.
> GNU/Hurd might just be more prone to that than other platforms.
>
> I propose the attached patch to the BF client to try to make this
> more robust.
>
>             regards, tom lane
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05
> [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07
> [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37
>


The patch seems reasonable on its face, but I doubt it's the issue. 
Rather I think what's happening here is that a test is hanging silently 
and lastcommand.log's mtime doesn't get updated, causing a misreporting 
of the run duration. So in  addition to the above I have added some code 
to update that timestamp if the file exists (which should only be the 
case with a timeout).

See 
https://github.com/PGBuildFarm/client-code/commit/e5d67a35a0136a53e441fccf0ecc9b1b6322526c


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Maybe BF "timedout" failures are the client script's fault?

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> The patch seems reasonable on its face, but I doubt it's the issue. 
> Rather I think what's happening here is that a test is hanging silently 
> and lastcommand.log's mtime doesn't get updated, causing a misreporting 
> of the run duration. So in addition to the above I have added some code 
> to update that timestamp if the file exists (which should only be the 
> case with a timeout).

Sounds reasonable.  Thanks!

            regards, tom lane