Re: [HACKERS] SERIALIZABLE with parallel query

From Thomas Munro
Subject Re: [HACKERS] SERIALIZABLE with parallel query
Date
Msg-id CAEepm=2DwnQOfogvBRFm_dYi53qcttoPb=TcxQgaLAKnPONnsg@mail.gmail.com
In reply to Re: [HACKERS] SERIALIZABLE with parallel query  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [HACKERS] SERIALIZABLE with parallel query
List pgsql-hackers
On Fri, Feb 23, 2018 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Feb 22, 2018 at 10:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 22, 2018 at 7:54 AM, Thomas Munro
>>> PS  I noticed that for BecomeLockGroupMember() we say "If we can't
>>> join the lock group, the leader has gone away, so just exit quietly"
>>> but for various other similar things we spew errors (most commonly
>>> seen one being "ERROR:  could not map dynamic shared memory segment").
>>> Intentional?
>>
>> I suppose I thought that if we failed to map the dynamic shared memory
>> segment, it might be down to any one of several causes; whereas if we
>> fail to join the lock group, it must be because the leader has already
>> exited.  There might be a flaw in that thinking, though.
>>
>
> By the way, in which case leader can exit early?  As of now, we do
> wait for workers to end both before the query is finished or in error
> cases.

create table foo as select generate_series(1, 10)::int a;
alter table foo set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
select count(a / 0) from foo;

That reliably gives me:
ERROR:  division by zero [from leader]
ERROR:  could not map dynamic shared memory segment [from workers]

I thought this was coming from resource manager cleanup, but you're
right: that happens after we wait for all workers to finish.  Perhaps
this is a race within DestroyParallelContext() itself: when it is
called by AtEOXact_Parallel() during an abort, it asks the postmaster
to SIGTERM the workers, then it immediately detaches from the DSM
segment, and then it waits for the workers to exit.  The workers
unblock signals before they try to attach to the DSM segment, but
they don't CHECK_FOR_INTERRUPTS before they try to attach (and even if
they did, it wouldn't solve the problem).

I don't like the error much, though at least the root cause error is
logged first.

I don't immediately see how BecomeLockGroupMember() could have the
same kind of problem though, for the reason you said: the leader waits
for the workers to finish, so I'm not sure in which circumstances it
would cease to be the lock group leader while the workers are still
running.

-- 
Thomas Munro
http://www.enterprisedb.com

