Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

From	Amit Kapila
Subject	Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date	7 June 2017, 16:36:31
Msg-id	CAA4eK1+KK0kBM0OOTmUpbqbZPFBb0Um_2HxRE0wKDgLKwwYRAw@mail.gmail.com
In reply to	Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Robert Haas <robertmhaas@gmail.com>)
Replies	Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers

On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> One thought is that the only places where shm_mq_set_sender() should
>>> be getting invoked during the main regression tests are
>>> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
>>> places use ParallelWorkerNumber to figure out what address to pass.
>>> So if ParallelWorkerNumber were getting set to the same value in two
>>> different parallel workers - e.g. because the postmaster went nuts and
>>> launched two processes instead of only one - or if
>>> ParallelWorkerNumber were not getting initialized at all or were
>>> getting initialized to some completely bogus value, it could cause
>>> this symptom.
>>
>> Hmm.  With some generous assumptions it'd be possible to think that
>> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
>> present in 20 successful lorikeet runs before the first of these failures,
>> which is a bit more than the MTBF after that, but not a huge amount more.
>>
>> That commit in itself looks innocent enough, but could it have exposed
>> some latent bug in bgworker launching?
>
> Hmm, that's a really interesting idea, but I can't quite put together
> a plausible theory around it.  I mean, it seems like that commit could
> make launching bgworkers faster, which could conceivably tickle some
> heretofore-latent timing-related bug.  But it wouldn't, IIUC, make the
> first worker start any faster than before - it would just make them
> more closely-spaced thereafter, and it's not very obvious how that
> would cause a problem.
>
> Another idea is that the commit in question is managing to corrupt
> BackgroundWorkerList somehow.
>

I don't think so, because this problem was also reported previously
[1][2], before the commit in question existed.


[1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net
[2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com