Thread: hyrax versus isolationtester.c's hard-wired timeouts

hyrax versus isolationtester.c's hard-wired timeouts

From: Tom Lane

Buildfarm member hyrax has been intermittently failing the
deadlock-parallel isolation test ever since that went in.
I finally got around to looking at this closely, and what
seems to be happening is simply that isolationtester.c's
hard-wired three-minute timeout for the completion of any
one test step is triggering.  hyrax uses CLOBBER_CACHE_ALWAYS
and it seems to be a little slower than other animals using
CLOBBER_CACHE_ALWAYS, so it's unsurprising that it's showing
the symptom and nobody else is.

There are two things we could do about this:

1. Knock the hard-wired setting up a tad, maybe to 5 minutes.
Easy but doesn't seem terribly future-proof.

2. Make the limit configurable somehow, probably from an
environment variable.  There's precedent for that (PGCTLTIMEOUT),
and it would provide a way for owners of especially slow buildfarm
members to adjust things ... but it would require owners of
especially slow buildfarm animals to adjust things.
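
As an illustration of option 2 (a sketch only, not the code that was committed), the limit could be read from the environment with a compiled-in fallback. The helper name step_timeout_secs, the macro DEFAULT_STEP_TIMEOUT_SECS, and the variable name EXAMPLE_STEP_TIMEOUT used below are all hypothetical; the 300-second default simply mirrors the "5 minutes" floated in option 1.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical default per-step limit, in seconds (5 minutes). */
#define DEFAULT_STEP_TIMEOUT_SECS 300

/*
 * Return the per-step timeout in seconds.
 *
 * envvar names an environment variable (hypothetical here) that may
 * override the default.  If it is unset, empty, not a plain decimal
 * integer, out of range, or not positive, the default is used instead.
 */
long
step_timeout_secs(const char *envvar)
{
    const char *val = getenv(envvar);
    char       *end;
    long        secs;

    if (val == NULL || *val == '\0')
        return DEFAULT_STEP_TIMEOUT_SECS;

    errno = 0;
    secs = strtol(val, &end, 10);

    /* Reject overflow, no digits, trailing junk, and non-positive values. */
    if (errno != 0 || end == val || *end != '\0' || secs <= 0)
        return DEFAULT_STEP_TIMEOUT_SECS;

    return secs;
}
```

With this shape, owners of ordinary animals need to do nothing, while the owner of an especially slow one could export a larger value before running the tests.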

Any preferences?  (Actually, it wouldn't be unreasonable to do
both things, I suppose.)

BTW, I notice that isolationtester.c fails to print any sort of warning
notice when it decides it's waited too long.  This seems like a
spectacularly bad idea in hindsight: it's not that obvious why the test
case failed.  Plus there's no way to tell exactly which connection it
decided to send a PQcancel to.  So independently of the timeout-length
issue, I think we ought to also make it print something like
"isolationtester: waited too long for something to happen, canceling
step thus-and-so".
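
A minimal sketch of what such a notice might look like follows. It assumes the tester knows the name of the step it is about to cancel; the function names and the step_name parameter are hypothetical and do not come from isolationtester.c. The message text follows the wording proposed above.

```c
#include <stdio.h>

/*
 * Format the proposed timeout notice into buf.
 * step_name is the (hypothetical) name of the step being canceled.
 * Returns the value of snprintf(), i.e. the length the full message
 * would have, not counting the terminating NUL.
 */
int
format_timeout_notice(char *buf, size_t buflen, const char *step_name)
{
    return snprintf(buf, buflen,
                    "isolationtester: waited too long for something "
                    "to happen, canceling step %s",
                    step_name);
}

/*
 * Emit the notice on stderr just before the cancel request is sent,
 * so a failure log shows both why the test failed and which step
 * was canceled.
 */
void
report_step_timeout(const char *step_name)
{
    char        msg[256];

    format_timeout_notice(msg, sizeof(msg), step_name);
    fprintf(stderr, "%s\n", msg);
}
```

Keeping the formatting separate from the printing is only a convenience for testing the message text; a single fprintf call at the point of cancellation would serve equally well.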

            regards, tom lane



Re: hyrax versus isolationtester.c's hard-wired timeouts

From: Tom Lane

I wrote:
> There are two things we could do about this:
> 1. Knock the hard-wired setting up a tad, maybe to 5 minutes.
> Easy but doesn't seem terribly future-proof.
> 2. Make the limit configurable somehow, probably from an
> environment variable.  There's precedent for that (PGCTLTIMEOUT),
> and it would provide a way for owners of especially slow buildfarm
> members to adjust things ... but it would require owners of
> especially slow buildfarm animals to adjust things.
> Any preferences?  (Actually, it wouldn't be unreasonable to do
> both things, I suppose.)
> BTW, I notice that isolationtester.c fails to print any sort of warning
> notice when it decides it's waited too long.  This seems like a
> spectacularly bad idea in hindsight: it's not that obvious why the test
> case failed.  Plus there's no way to tell exactly which connection it
> decided to send a PQcancel to.  So independently of the timeout-length
> issue, I think we ought to also make it print something like
> "isolationtester: waited too long for something to happen, canceling
> step thus-and-so".

I pushed a patch doing all of the above.  This should be enough to
fix hyrax's problem without any manual adjustments of the animal's
configuration ... unless I've misdiagnosed what's happening.
We shall see.

            regards, tom lane