Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Дата
Msg-id 20170803034740.GA2641942@rfd.leadboat.com
обсуждение исходный текст
Ответ на Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
> 
> This one is even more exciting than the last one, because it sure looks
> like the crashing bgworker took the postmaster down with it.  That is
> Not Supposed To Happen.
> 
> Wondering if we broke something here recently, I tried to reproduce it
> on a Linux machine by adding a randomized Assert failure in
> shm_mq_set_sender.  I don't see any such problem, even with EXEC_BACKEND;
> we recover from the crash as-expected.
> 
> So I'm starting to get a distinct feeling that there's something wrong
> with the cygwin port.  But I dunno what.

I think signal blocking broke on Cygwin.

On a system (gcc 5.4.0, CYGWIN_NT-10.0 2.7.0(0.306/5/3) 2017-02-12 13:18
x86_64) that reproduces lorikeet's symptoms, I instrumented the postmaster as
attached.  The patch's small_parallel.sql is a subset of select_parallel.sql
sufficient to reproduce the mq_sender Assert failure and the postmaster silent
exit.  (It occasionally needed hundreds of iterations to do so.)  The parallel
query normally starts four bgworkers; when the mq_sender Assert fired, the
test had started five workers in response to four registrations.

The postmaster.c instrumentation regularly detects sigusr1_handler() calls
while another sigusr1_handler() is already on the stack:

6328 2017-08-02 07:25:42.788 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 0
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 1
6328 2017-08-02 07:25:42.788 GMT DEBUG:  slot 1 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 1)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 2 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 2)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 3 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 3)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-3 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 4 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 4)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  starting background worker process "parallel worker for PID 4776"
6328 2017-08-02 07:25:42.790 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT WARNING:  signals already forbidden @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT LOG:  permit signals @ sigusr1_handler

postmaster algorithms rely on the PG_SETMASK() calls preventing that.  Without
such protection, duplicate bgworkers are an understandable result.  I caught
several other assertions; the PMChildFlags failure is another case of
duplicate postmaster children:

      6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
      3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
     20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
     21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)
     Also, got a few "select() failed in postmaster: Bad address"

I suspect a Cygwin signals bug.  I'll try to distill a self-contained test
case for the Cygwin hackers.  The lack of failures on buildfarm member brolga
argues that older Cygwin is not affected.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: [HACKERS] fixing pg_upgrade strings (was Re: pgsql: Add new files to nls.mkand add translation)
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] [COMMITTERS] pgsql: Fix pg_dump's errno checking for zlib I/O