[GENERAL] postmaster deadlock while logging after syslogger exited

From: David Pacheco
Subject: [GENERAL] postmaster deadlock while logging after syslogger exited
Date:
Msg-id: CACukRjOaR2_DY2tLzPe-ZJUActf8RcZ=r1HimQxFgYBy++aR8Q@mail.gmail.com
Responses: Re: [GENERAL] postmaster deadlock while logging after syslogger exited  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Hello,

I ran into what appears to be a deadlock in the logging subsystem.  It looks like what happened was that the syslogger process exited because it ran out of memory.  But before the postmaster got a chance to handle the SIGCHLD and restart it, it handled a SIGUSR1 to start an autovacuum worker.  That also failed, and the postmaster went to log a message about it, but it blocked writing to the pipe that's normally connected to the syslogger, presumably because the pipe is full and the syslogger is no longer around to drain it.  The net result is that new connections to PostgreSQL hang, and any query that causes PostgreSQL to log hangs as well; queries on existing connections keep working as long as they don't cause the backend to log anything.  I don't believe the system would ever get out of this state.
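To illustrate the blocking behavior outside of PostgreSQL, here's a minimal standalone C sketch (my own demo, not code from the tree): a blocking write() to a full pipe hangs indefinitely as long as some process still holds the read end open, which is exactly the postmaster's situation once the syslogger has exited:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

int
main(void)
{
    int fds[2];
    char buf[4096] = { 0 };

    if (pipe(fds) != 0)
        return (1);

    /* Fill the pipe's buffer without blocking. */
    (void) fcntl(fds[1], F_SETFL, O_NONBLOCK);
    while (write(fds[1], buf, sizeof (buf)) > 0)
        ;
    if (errno != EAGAIN)
        return (1);

    /*
     * Back to blocking mode.  The read end (fds[0]) is still open in
     * this process -- analogous to the postmaster keeping its copy
     * open so it can restart the syslogger later -- so this write
     * neither completes nor fails with EPIPE; it just hangs.
     */
    (void) fcntl(fds[1], F_SETFL, 0);
    (void) printf("about to block\n");
    (void) write(fds[1], buf, sizeof (buf));
    return (0);
}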

I ran into this with version 9.2.4 on SmartOS (illumos), but it doesn't look to me like any of the relevant code has changed in 9.6 or 10, nor is it platform-specific.  Version 10.0 still:

- restarts the syslogger in response to a SIGCHLD signal (in "reaper")
- starts autovacuum workers in response to SIGUSR1 (in "sigusr1_handler")
- masks SIGCHLD at the beginning of the SIGUSR1 handler
- potentially logs from StartAutoVacWorker (and, of course, many other places in the postmaster)

which I think means the problem is still possible.  Am I missing something?  I couldn't find anything about this issue.  Is this just a rarely seen bug?
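To make the ordering concrete, here's a standalone sketch (again my own code, only loosely modeled on the postmaster -- the handler and pipe here are stand-ins, not the real ones): a SIGUSR1 handler masks SIGCHLD and then blocks writing to a full pipe, so the SIGCHLD that would let the equivalent of "reaper" run stays pending forever:

#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>

static int log_fd;              /* write end of an already-full pipe */

static void
chld_handler(int sig)
{
    /* stands in for the postmaster's reaper(); it never gets to run */
}

static void
usr1_handler(int sig)
{
    sigset_t chld;

    /* sigusr1_handler() similarly masks SIGCHLD on entry */
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    sigprocmask(SIG_BLOCK, &chld, NULL);

    /* stands in for StartAutoVacWorker() logging via ereport() */
    (void) write(log_fd, "LOG: fork failed\n", 17);

    /* never reached: the write above blocks forever */
    sigprocmask(SIG_UNBLOCK, &chld, NULL);
}

int
main(void)
{
    int fds[2];
    char junk[4096];

    memset(junk, 0, sizeof (junk));
    (void) pipe(fds);
    log_fd = fds[1];

    /* fill the pipe so the handler's write must block */
    (void) fcntl(log_fd, F_SETFL, O_NONBLOCK);
    while (write(log_fd, junk, sizeof (junk)) > 0)
        ;
    (void) fcntl(log_fd, F_SETFL, 0);

    (void) signal(SIGCHLD, chld_handler);
    (void) signal(SIGUSR1, usr1_handler);

    if (fork() == 0) {
        sleep(1);               /* let the parent block in write() first */
        _exit(0);               /* generates SIGCHLD while it's masked */
    }

    (void) raise(SIGUSR1);      /* enter the handler; never returns */
    (void) printf("not reached\n");
    return (0);
}

Running this leaves the parent blocked in write() with SIGCHLD pending and masked, which matches the process state I describe below.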

--------------------

Here's the raw data: my initial observation was that when I ran "psql" to connect to the database, it would hang (for hours, it turned out).  My postmaster process was blocked here, writing to its own stderr pipe:

$ mdb core.41349
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> $C
fffffd7fffdf9df0 libc.so.1`__write+0xa()
fffffd7fffdfb220 write_pipe_chunks+0x142()
fffffd7fffdfb280 send_message_to_server_log+0x55b()
fffffd7fffdfb2a0 EmitErrorReport+0xe6()
fffffd7fffdfb390 errfinish+0x1ac()
fffffd7fffdfb3b0 StartAutoVacWorker+0x5c()
fffffd7fffdfb3e0 StartAutovacuumWorker+0x6f()
fffffd7fffdfb410 sigusr1_handler+0x185()
fffffd7fffdfb420 libc.so.1`__sighndlr+6()
fffffd7fffdfb4b0 libc.so.1`call_user_handler+0x1db(10, 0, fffffd7fffdfb520)
fffffd7fffdfb500 libc.so.1`sigacthandler+0x116(10, 0, fffffd7fffdfb520)
fffffd7fffdfb970 libc.so.1`__pollsys+0xa()
fffffd7fffdfba90 libc.so.1`pselect+0x1cb(5, fffffd7fffdfbaf0, 0, 0, fffffd7fffdfbaa0, 0)
fffffd7fffdfbae0 libc.so.1`select+0x5a(5, fffffd7fffdfbaf0, 0, 0, fffffd7fffdfdaf0)
fffffd7fffdffb40 ServerLoop+0xb5()
fffffd7fffdffbe0 PostmasterMain+0xec2()
fffffd7fffdffc00 main+0x23a()
fffffd7fffdffc10 _start+0x6c()

From inspecting kernel state, I found that the only other file descriptor on the system that corresponds to the other side of this pipe is in the postmaster itself, which it keeps open in case it needs to restart the syslogger.  Based on the code, I expected the syslogger subprocess to have the pipe open, but that process appears to have exited due to a fatal error (out of memory).  (I know it exited because the process still exists in the kernel -- it hasn't been reaped yet -- and I think it ran out of memory based on a log message I found from around the time when the process exited.)  Then I expected the postmaster would have restarted it, but I found that it currently has SIGCHLD both pending and masked:

$ pflags 41349
41349:  /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data
        data model = _LP64  flags = ORPHAN|MSACCT|MSFORK
        sigpend = 0x00028000,0x00000000,0x00000000
 /1:    flags = ASLEEP  write(0x2,0xfffffd7fffdf9e10,0x65)
        sigmask = 0xfebff047,0xfffffff7,0x000003ff

The signal mask may be system-specific, but I've decoded it and confirmed that SIGCHLD is in "sigpend" as well as "sigmask".  And that makes sense because the stack above indicates we're in "sigusr1_handler", which masks SIGCHLD when it starts.
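For anyone who wants to check the arithmetic, a quick sketch of the decoding (assuming, as appears to be the case on illumos, that bit (n - 1) of word 0 corresponds to signal n, and that SIGUSR1 is 16 and SIGCHLD is 18):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
    /* word 0 of pflags's sigpend output, overridable on the command line */
    unsigned long word = 0x00028000;
    int sig;

    if (argc > 1)
        word = strtoul(argv[1], NULL, 16);

    for (sig = 1; sig <= 32; sig++) {
        if (word & (1UL << (sig - 1)))
            (void) printf("signal %d (%s)\n", sig, strsignal(sig));
    }
    return (0);
}

On 0x00028000 this reports signals 16 and 18 (SIGUSR1 and SIGCHLD here), and SIGCHLD's bit is likewise set in the sigmask word 0xfebff047.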

If it wasn't clear from the above, the logging configuration looks like this:

log_destination = 'stderr'
logging_collector = on
log_directory = '/var/pg/'
log_filename = 'postgresql.log'

Thanks,
Dave
