Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

From Tom Lane
Subject Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date
Msg-id 29733.1497551690@sss.pgh.pa.us
In response to Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> I think you're right.  So here's a theory:

> 1. The ERROR mapping the DSM segment is just a case of the worker
> losing a race, and isn't a bug.

I concur that this is a possibility, but if we expect this to happen,
seems like there should be other occurrences in the buildfarm logs.
I trolled the last three months' worth of check/installcheck logs (all
runs, not just failures), and could find exactly two cases of "could not
map dynamic shared memory segment":
 sysname  |    branch     |      snapshot       |     stage      | l
----------+---------------+---------------------+----------------+----------------------------------------------------------------------------------------------------
 lorikeet | REL9_6_STABLE | 2017-05-03 10:21:31 | Check          | 2017-05-03 06:27:32.626 EDT [5909b094.1e28:1] ERROR:  could not map dynamic shared memory segment
 lorikeet | HEAD          | 2017-06-13 20:28:33 | InstallCheck-C | 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
(2 rows)
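
For anyone wanting to repeat this kind of search over locally saved logs, here is a minimal sketch. It is not the query that produced the table above; the function name and the second sample line are placeholders, and it assumes the logs are available as plain text lines.

```python
import re

# The message being searched for, as quoted in the email.
NEEDLE = "could not map dynamic shared memory segment"


def find_matches(lines):
    """Return the lines containing NEEDLE, tolerating runs of whitespace."""
    # Join the words with \s+ so that double spaces or tabs between
    # words do not hide a match.
    pattern = re.compile(r"\s+".join(re.escape(w) for w in NEEDLE.split()))
    return [line for line in lines if pattern.search(line)]


# First line is taken from the table above; the second is a placeholder.
sample = [
    "2017-05-03 06:27:32.626 EDT [5909b094.1e28:1] ERROR:  "
    "could not map dynamic shared memory segment",
    "placeholder line with no error in it",
]

hits = find_matches(sample)
print(len(hits))
```

The same function can be applied per file, with the file name standing in for the sysname/branch/snapshot columns.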

Now maybe this can be explained away by saying that the worker never loses
the race unless it's subject to cygwin's unusually slow fork() emulation,
but somehow I doubt that.  For one thing, it's not clear why that path
would be slower than EXEC_BACKEND, which would also involve populating
a new process image from scratch.

BTW, that 9.6 failure is worth studying because it looks quite a bit
different from the one on HEAD.  It looks like the worker failed to
launch and then the leader got hung up waiting for the worker.
Eventually other stuff started failing because the select_parallel
test is holding an exclusive lock on tenk1 throughout its session.
(Does it really need to do that ALTER TABLE?)

> 2. But when that happens, parallel_terminate_count is getting bumped
> twice for some reason.
> 3. So then the leader process fails that assertion when it tries to
> launch the parallel workers for the next query.

It seems like this has to trace to some sort of logic error in the
postmaster that's allowing it to mess up parallel_terminate_count,
but I'm not managing to construct a plausible flow of control that
would cause that.
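
To make points 2 and 3 concrete, here is a toy model of the bookkeeping, written as a sketch under stated assumptions rather than a copy of the postmaster code. It assumes there is a companion counter named parallel_register_count, that both counters are unsigned 32-bit values, and that the check before launching a worker compares their difference against a limit (the value 1024 and the class and method names are illustrative).

```python
UINT32_MASK = 0xFFFFFFFF
MAX_PARALLEL_WORKER_LIMIT = 1024  # assumed limit, for illustration only


class Counters:
    """Toy stand-in for the shared parallel-worker counters."""

    def __init__(self):
        self.parallel_register_count = 0   # assumed companion counter
        self.parallel_terminate_count = 0  # the counter named in the thread

    def active(self):
        # Unsigned 32-bit subtraction: wraps around if terminate > register.
        return (self.parallel_register_count
                - self.parallel_terminate_count) & UINT32_MASK

    def register(self):
        # The sanity check made before a new parallel worker is registered.
        if self.active() > MAX_PARALLEL_WORKER_LIMIT:
            raise AssertionError("active worker count out of range")
        self.parallel_register_count = (
            self.parallel_register_count + 1) & UINT32_MASK

    def terminate(self):
        self.parallel_terminate_count = (
            self.parallel_terminate_count + 1) & UINT32_MASK


c = Counters()
c.register()   # one worker registered
c.terminate()  # its termination is counted...
c.terminate()  # ...and, by hypothesis, counted a second time
print(hex(c.active()))
```

Under these assumptions a single extra bump is enough: the difference wraps to a huge value, so the next call to `register()` trips the check, which matches the leader failing only when it launches workers for the following query.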
        regards, tom lane


