Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

From	Tom Lane
Subject	Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date	16 June 2017, 00:34:50
Msg-id	29733.1497551690@sss.pgh.pa.us
In reply to	Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

Robert Haas <robertmhaas@gmail.com> writes:
> I think you're right.  So here's a theory:

> 1. The ERROR mapping the DSM segment is just a case of the worker
> losing a race, and isn't a bug.

I concur that this is a possibility, but if we expect this to happen,
seems like there should be other occurrences in the buildfarm logs.
I trolled the last three months' worth of check/installcheck logs (all
runs, not just failures), and could find exactly two cases of "could not
map dynamic shared memory segment":
 sysname  |    branch     |      snapshot       |     stage      |                                                 l
----------+---------------+---------------------+----------------+---------------------------------------------------------------------------------------------------
 lorikeet | REL9_6_STABLE | 2017-05-03 10:21:31 | Check          | 2017-05-03 06:27:32.626 EDT [5909b094.1e28:1] ERROR:  could not map dynamic shared memory segment
 lorikeet | HEAD          | 2017-06-13 20:28:33 | InstallCheck-C | 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
(2 rows)

Now maybe this can be explained away by saying that the worker never loses
the race unless it's subject to Cygwin's unusually slow fork() emulation,
but somehow I doubt that.  For one thing, it's not clear why that path
would be slower than EXEC_BACKEND, which would also involve populating
a new process image from scratch.

BTW, that 9.6 failure is worth studying because it looks quite a bit
different from the one on HEAD.  It looks like the worker failed to
launch and then the leader got hung up waiting for the worker.
Eventually other stuff started failing because the select_parallel
test is holding an exclusive lock on tenk1 throughout its session.
(Does it really need to do that ALTER TABLE?)

> 2. But when that happens, parallel_terminate_count is getting bumped
> twice for some reason.
> 3. So then the leader process fails that assertion when it tries to
> launch the parallel workers for the next query.

It seems like this has to trace to some sort of logic error in the
postmaster that's allowing it to mess up parallel_terminate_count,
but I'm not managing to construct a plausible flow of control that
would cause that.
        regards, tom lane
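
To make points 2 and 3 of the quoted theory concrete, here is a minimal sketch of how a single extra bump of a terminate counter could trip a sanity check the next time workers are launched. It is not the postmaster's code. It assumes the two counters are unsigned 32-bit values and that the check compares their difference against a fixed limit; the class, the function names, and the limit of 1024 are all hypothetical.

```python
MASK32 = 0xFFFFFFFF   # emulate unsigned 32-bit arithmetic (assumption)
WORKER_LIMIT = 1024   # hypothetical cap on concurrently active workers


class WorkerCounters:
    """Hypothetical stand-in for a pair of shared register/terminate counters."""

    def __init__(self):
        self.register_count = 0    # bumped when a parallel worker is registered
        self.terminate_count = 0   # bumped when a parallel worker is cleaned up

    def active(self):
        # Unsigned subtraction wraps around instead of going negative.
        return (self.register_count - self.terminate_count) & MASK32

    def sane(self):
        # The kind of check a leader could assert before launching workers.
        return self.active() <= WORKER_LIMIT


def simulate(extra_terminate_bumps):
    """Register one worker, clean it up once, then apply any extra bumps."""
    c = WorkerCounters()
    c.register_count += 1
    c.terminate_count += 1 + extra_terminate_bumps
    return c.active(), c.sane()
```

Under these assumptions, simulate(0) leaves the difference at zero and the check passes, while simulate(1) wraps the difference around to a very large value and the check fails. That would match a failure that only shows up on the next query rather than at the moment of the double bump.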
