Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

From: Tom Lane
Subject: Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Msg-id 24435.1496773316@sss.pgh.pa.us
In reply to: Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> Buildfarm member lorikeet is failing occasionally with a failed
>> assertion during the select_parallel regression tests like this:

> I don't *think* we've made any relevant code changes lately.  The only
> thing that I can see as looking at all relevant is
> b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem
> like it can be to blame.

Yeah, I don't believe that either.  That could have introduced a hard
failure (if something were relying on initializing a field before where
I put the memsets) but it's hard to see how it could produce an
intermittent and platform-specific one.

> One thought is that the only places where shm_mq_set_sender() should
> be getting invoked during the main regression tests are
> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
> places use ParallelWorkerNumber to figure out what address to pass.
> So if ParallelWorkerNumber were getting set to the same value in two
> different parallel workers - e.g. because the postmaster went nuts and
> launched two processes instead of only one - or if
> ParallelWorkerNumber were not getting initialized at all or were
> getting initialized to some completely bogus value, it could cause
> this symptom.

Hmm.  With some generous assumptions it'd be possible to think that
aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
present in 20 successful lorikeet runs before the first of these failures,
which is a bit more than the MTBF after that, but not a huge amount more.
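
To put a rough number on "a bit more than the MTBF", here is a minimal sketch under a simplifying assumption that is not from the thread: each buildfarm run fails independently with probability 1/MTBF, with MTBF measured in runs. The function name and the MTBF values mentioned afterwards are hypothetical; the actual failure rate on lorikeet is not stated here.

```c
#include <assert.h>

/*
 * Probability of seeing n_runs consecutive passing runs, ASSUMING each run
 * fails independently with probability 1/mtbf_runs.  This is a toy model
 * for reasoning about the streak, not a measurement.
 */
static double
clean_streak_probability(double mtbf_runs, int n_runs)
{
    double p_pass;
    double result = 1.0;
    int    i;

    assert(mtbf_runs >= 1.0 && n_runs >= 0);

    p_pass = 1.0 - 1.0 / mtbf_runs;     /* per-run pass probability */
    for (i = 0; i < n_runs; i++)
        result *= p_pass;               /* independent runs multiply */
    return result;
}
```

Under that model, a hypothetical MTBF of 10 runs gives roughly a 0.12 chance of 20 clean runs in a row, and a hypothetical MTBF of 20 runs gives roughly 0.36. Neither is small enough to rule the commit out on timing alone, which fits "not a huge amount more".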

That commit in itself looks innocent enough, but could it have exposed
some latent bug in bgworker launching?
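
The failure mode Robert describes, two workers ending up with the same ParallelWorkerNumber, can be illustrated with a toy model. This is only a sketch: ToyQueue, toy_set_sender and toy_worker_attach are hypothetical stand-ins, not PostgreSQL's shm_mq code, and it assumes the failing assertion is a check that a queue has no sender attached yet. Where the real code would presumably trip an assertion, the sketch returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NWORKERS 4

/* Toy stand-in for a shared-memory message queue; NOT PostgreSQL's shm_mq. */
typedef struct ToyQueue
{
    int sender_pid;             /* 0 means "no sender attached yet" */
} ToyQueue;

/* One queue per worker slot, zero-initialized. */
static ToyQueue queues[NWORKERS];

/* Attach a sender; returns false if the queue already has one. */
static bool
toy_set_sender(ToyQueue *mq, int pid)
{
    assert(mq != NULL);
    if (mq->sender_pid != 0)
        return false;           /* second sender on the same queue */
    mq->sender_pid = pid;
    return true;
}

/*
 * A worker picks its queue purely from its worker number, so a duplicated
 * or bogus number sends it to the wrong place.
 */
static bool
toy_worker_attach(int worker_number, int pid)
{
    if (worker_number < 0 || worker_number >= NWORKERS)
        return false;           /* uninitialized or bogus worker number */
    return toy_set_sender(&queues[worker_number], pid);
}
```

Calling toy_worker_attach(1, 102) and then toy_worker_attach(1, 103) models a launcher that started two processes for the same slot: the first call succeeds and the second fails, whichever process happens to arrive second. That ordering dependence is the kind of thing that would make such a bug intermittent.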
        regards, tom lane


