[BUGS] signal 11 segfaults with parallel workers

Поиск
Список
Период
Сортировка
От Rick Otten
Тема [BUGS] signal 11 segfaults with parallel workers
Дата
Msg-id CAMAYy4J0AkJacQjcPYE4o+k-Sx=4vA36eEQj3e-h+cpNe+LAzw@mail.gmail.com
обсуждение исходный текст
Ответы Re: [BUGS] signal 11 segfaults with parallel workers  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-bugs
Starting a couple of weeks ago, our PostgreSQL database has been crashing, almost daily, with a signal 11 seg fault on a query as the triggering event:

2017-07-11 23:00:29.984 UTC     LOG:  worker process: parallel worker for PID 1055 (PID 12405) was terminated by signal 11: Segmentation fault
2017-07-12 23:01:56.432 UTC     LOG:  worker process: parallel worker for PID 5752 (PID 32552) was terminated by signal 11: Segmentation fault
2017-07-14 23:00:46.856 UTC     LOG:  worker process: parallel worker for PID 24280 (PID 9639) was terminated by signal 11: Segmentation fault
2017-07-15 23:01:24.317 UTC     LOG:  worker process: parallel worker for PID 1561 (PID 15153) was terminated by signal 11: Segmentation fault
2017-07-16 23:00:26.722 UTC     LOG:  worker process: parallel worker for PID 5776 (PID 7912) was terminated by signal 11: Segmentation fault
2017-07-17 18:58:14.155 UTC     LOG:  worker process: parallel worker for PID 11427 (PID 9998) was terminated by signal 11: Segmentation fault
2017-07-17 19:08:04.103 UTC     LOG:  worker process: parallel worker for PID 10190 (PID 11907) was terminated by signal 11: Segmentation fault
2017-07-18 23:01:09.775 UTC     LOG:  worker process: parallel worker for PID 29445 (PID 360) was terminated by signal 11: Segmentation fault
2017-07-19 18:46:58.676 UTC     LOG:  worker process: parallel worker for PID 7080 (PID 27710) was terminated by signal 11: Segmentation fault
2017-07-20 23:00:35.270 UTC     LOG:  worker process: parallel worker for PID 19153 (PID 21218) was terminated by signal 11: Segmentation fault
2017-07-21 23:00:41.085 UTC     LOG:  worker process: parallel worker for PID 19161 (PID 30720) was terminated by signal 11: Segmentation fault
2017-07-22 23:00:22.169 UTC     LOG:  worker process: parallel worker for PID 4903 (PID 6931) was terminated by signal 11: Segmentation fault
2017-07-25 23:02:03.688 UTC     LOG:  worker process: parallel worker for PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

 As near as I can tell there were no specific changes preceding this pattern which might be a root cause.  Since then I've tried patching the Linux instance and bounced the database server, and bumped up the number of connections (because we were running low sometimes).  None of those changes impacted the regular crashing pattern.

On Sunday (2017-07-23) I set DEBUG5 on all log events, and set it to also log all queries, so I could try to learn more about what was happening.  I found the culprit query last night, from one of our daily jobs.  It was doing a 5 thread parallel sequence scan on a moderately sized table (maybe 70 columns, by 2.5M rows).

I was not able to force the database to crash by running this query by hand.  I tried a number of times.  Although it did happen to someone else on the 17th.

For now, I've put an index on the relevant columns to avoid the parallel sequence scan for that query.  I also repacked the table.  Hopefully we won't crash tonight too.

There wasn't much extra in the logs to share about the crash.  This is from when it crashed:

2017-07-25 23:02:03.688 UTC     DEBUG:  reaping dead processes
2017-07-25 23:02:03.688 UTC     LOG:  worker process: parallel worker for PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

And this is when it spun out those parallel workers, just prior to the segfault, that let me identify the query in question:

2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"


Here is what /var/log/kern.log had to say about the one from last night:

Jul 25 23:02:01 core-gce kernel: [738031.417934] postgres[11279]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417953] postgres[11278]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417967] postgres[11280]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417989] postgres[11276]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]


I'm running on Ubuntu 16.04.02 in Google Compute Environment, on a 16 core VM with 104G RAM, using the Ubuntu Postgresql 9.6.3 package.

$  pg_config --configure
'--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--with-tclconfig=/usr/lib/x86_64-linux-gnu/tcl8.6' '--with-includes=/usr/include/tcl8.6' 'PYTHON=/usr/bin/python' '--mandir=/usr/share/postgresql/9.6/man' '--docdir=/usr/share/doc/postgresql-doc-9.6' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.6' '--bindir=/usr/lib/postgresql/9.6/bin' '--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests' '--enable-debug' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' '--with-systemd' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie -fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi' '--with-ldap' '--with-selinux' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'

$  pg_config --ldflags
-L../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5 -Wl,--as-needed

$  pg_config --cflags
-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie -fno-omit-frame-pointer

I have these two most relevant settings enabled in my configuration:
max_worker_processes = 16
max_parallel_workers_per_gather = 16

If you need anything else, please let me know.  I wish I could reproduce the error every time I ran the query, but it doesn't seem to work that way, and of course now the query plan is completely different, but I'm sure I can run other queries that would induce parallel sequence scans on my tables.


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Feike Steenbergen
Дата:
Сообщение: Re: [BUGS] BUG #14746: sub query's plan not right?
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: [BUGS] signal 11 segfaults with parallel workers