Funny hang on PostgreSQL 10 during parallel index scan on slave

From: Chris Travers
Subject: Funny hang on PostgreSQL 10 during parallel index scan on slave
Msg-id: CAN-RpxBV0-EZhHSEMrZ3eTZGWH-tK40ZFEm4f3oiGavCEoX3nw@mail.gmail.com
Replies: Re: Funny hang on PostgreSQL 10 during parallel index scan on slave  (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers

Hi all;

For the last few months we have been facing a funny problem on a slave: queries go to 100% CPU usage and never finish, causing the recovery process to hang and the replica to fall behind.  Over time we ruled out a lot of causes and kept banging our heads against this one.  Today we got a break when we attached a debugger to various processes, even without debugging symbols.  Not only did we get useful stack traces from the hung query, but attaching a debugger to the startup process caused the query to finish.  This has shown up in 10.2 and 10.5.

Based on the stack traces, we have concluded that the following seems to happen:

1.  The query is in a parallel index scan or similar.
2.  A process is executing a parallel plan and allocating a significant chunk of memory (2MB, for example) in dynamic shared memory.
3.  The startup process goes into a loop where it sends a SIGUSR1, sleeps 5ms, sends another SIGUSR1, and so on.
4.  The SIGUSR1 aborts the system call, which is then retried.
5.  Because the system call takes more than 5ms, we end up in an endless loop.
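
To make the loop in the steps above concrete, here is a toy model in Python. It is only a sketch, not PostgreSQL code: the function name and parameters are made up for illustration, and it assumes that a retried call starts over from the beginning and that signals arrive at a fixed interval.

```python
def simulate_fixed_interval(call_ms, signal_interval_ms, max_signals=1000):
    """Toy model of the retry loop (hypothetical helper, not PostgreSQL code).

    Assumptions: a signal arrives every signal_interval_ms, each signal
    aborts the in-flight call, and the retried call starts from scratch.

    Returns the number of aborted attempts before the call completes,
    or None if it is still being aborted after max_signals signals.
    """
    aborted = 0
    while aborted < max_signals:
        # The call can only finish if it fits between two consecutive signals.
        if call_ms <= signal_interval_ms:
            return aborted
        aborted += 1  # the signal landed first, so the call is retried
    return None


print(simulate_fixed_interval(call_ms=3, signal_interval_ms=5))  # 0
print(simulate_fixed_interval(call_ms=8, signal_interval_ms=5))  # None
```

With a fixed interval the outcome never changes from one attempt to the next, so a call that needs longer than the gap between signals makes no progress however many times it is retried.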

I see two possible solutions here:
1.  Exponential backoff between signals, up to a maximum of maybe 1s.
2.  Check for pending signals before retrying the system call, if there is a way to do that (I am not yet 100% sure where the retry happens, but I am on my way to finding out).
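
Both ideas can be sketched with a toy model in Python. This is not a patch and not PostgreSQL code: every name below is hypothetical, the 5ms start and 1s cap are just the figures mentioned in this mail, and the model assumes an interrupted call restarts from scratch. For the second idea it also assumes that noticing a pending interrupt means leaving the retry loop so the interrupt can be handled.

```python
def simulate_backoff(call_ms, initial_interval_ms=5, max_interval_ms=1000,
                     max_signals=1000):
    """Idea 1: double the gap between signals after each one, up to a cap.

    Returns (aborted_attempts, elapsed_ms), or None if the call still cannot
    complete after max_signals signals.
    """
    interval = initial_interval_ms
    aborted = 0
    elapsed = 0
    while aborted < max_signals:
        if call_ms <= interval:
            # The call fits into the current gap and completes.
            return aborted, elapsed + call_ms
        elapsed += interval          # the signal arrived first
        aborted += 1
        interval = min(interval * 2, max_interval_ms)
    return None


def retry_until_done(try_call, interrupt_pending, max_attempts=100):
    """Idea 2: look for a pending interrupt before each retry.

    try_call() returns True when the call completed, False when it was
    aborted. Returns "done", "interrupted", or "gave up".
    """
    for _ in range(max_attempts):
        if interrupt_pending():
            return "interrupted"     # stop retrying and handle the interrupt
        if try_call():
            return "done"
    return "gave up"


print(simulate_backoff(call_ms=8))     # (1, 13)
print(simulate_backoff(call_ms=100))   # (5, 255)
print(simulate_backoff(call_ms=2000))  # None
```

In this model the backoff only helps when the call fits under the cap: a call that needs more than the maximum interval is still aborted every time, which is worth keeping in mind when choosing the 1s limit.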

Any feedback or thoughts before we look at implementing a patch?
--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin
