Обсуждение: Funny hang on PostgreSQL 10 during parallel index scan on slave

Поиск
Список
Период
Сортировка

Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:
Hi all;

For the last few months we have been facing a funny problem on a slave where queries go to 100% cpu usage and never finish, causing the recovery process to hang and the replica to fall behind,  Over time we ruled out a lot of causes and were banging our heads against this one.  Today we got a break in it when we attached a debugger to various processes even without debugging symbols.  Not only did we get useful stack traces from the hung query but attaching a debugger to the startup process caused the query to finish.  This has shown up in 10.2 and 10.5.

Based on the stack traces we have concluded the following seems to happen:

1.  The query is in a parallel index scan or similar
2.  A process is executing a parallel plan and allocating a significant chunk of memory (2MB for example) in dynamic shared memory.
3.  The startup process goes into a loop where it sends a sigusr1, sleeps 5m, and sends another sigusr1 etc.
4.  The sigusr1 aborts the system call, which is then retried.
5.  Because the system call takes more than 5ms, we end up in an endless loop

I see one of two possible solutions here.
1.  Exponential backoff in sending signals to maybe 1s max.
2.  If there is any way to check for signals before retrying the system call (which I am not 100% sure where it is yet but on my way).

Any feedback or thoughts before we look at implementing a patch?
--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Thomas Munro
Дата:
On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com> wrote:
> 1.  The query is in a parallel index scan or similar
> 2.  A process is executing a parallel plan and allocating a significant chunk of memory (2MB for example) in dynamic
sharedmemory.
 
> 3.  The startup process goes into a loop where it sends a sigusr1, sleeps 5m, and sends another sigusr1 etc.
> 4.  The sigusr1 aborts the system call, which is then retried.
> 5.  Because the system call takes more than 5ms, we end up in an endless loop

Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?

                /* We may get interrupted, if so just retry. */
                do
                {
                        rc = posix_fallocate(fd, 0, size);
                } while (rc == EINTR);

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:
Yep,  Maybe we should check for signals there.

On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com> wrote:
> 1.  The query is in a parallel index scan or similar
> 2.  A process is executing a parallel plan and allocating a significant chunk of memory (2MB for example) in dynamic shared memory.
> 3.  The startup process goes into a loop where it sends a sigusr1, sleeps 5m, and sends another sigusr1 etc.
> 4.  The sigusr1 aborts the system call, which is then retried.
> 5.  Because the system call takes more than 5ms, we end up in an endless loop

Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?

                /* We may get interrupted, if so just retry. */
                do
                {
                        rc = posix_fallocate(fd, 0, size);
                } while (rc == EINTR);

--
Thomas Munro
http://www.enterprisedb.com


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:
Will submit a patch here shortly.  Thanks!  Should we do for master and 10?  Or 9.6 too?

On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com> wrote:
Yep,  Maybe we should check for signals there.

On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com> wrote:
> 1.  The query is in a parallel index scan or similar
> 2.  A process is executing a parallel plan and allocating a significant chunk of memory (2MB for example) in dynamic shared memory.
> 3.  The startup process goes into a loop where it sends a sigusr1, sleeps 5m, and sends another sigusr1 etc.
> 4.  The sigusr1 aborts the system call, which is then retried.
> 5.  Because the system call takes more than 5ms, we end up in an endless loop

Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?

                /* We may get interrupted, if so just retry. */
                do
                {
                        rc = posix_fallocate(fd, 0, size);
                } while (rc == EINTR);

--
Thomas Munro
http://www.enterprisedb.com


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin



--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Thomas Munro
Дата:
On Wed, Sep 5, 2018 at 9:49 AM Chris Travers <chris.travers@adjust.com> wrote:
>
> Will submit a patch here shortly.  Thanks!  Should we do for master and 10?  Or 9.6 too?
>
> On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com> wrote:
>>
>> Yep,  Maybe we should check for signals there.

Yeah, it seems reasonable to check for interrupts here.  It would need
to be back-patched as far as 9.4.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Andres Freund
Дата:
Hi,

On 2018-09-05 18:48:44 +0200, Chris Travers wrote:
> Will submit a patch here shortly.  Thanks!  Should we do for master and
> 10?  Or 9.6 too?

Please don't top-post on this list.  This needs to be done in all
branches where the posix_fallocate call is present.

> > Yep,  Maybe we should check for signals there.
> >
> > On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.munro@enterprisedb.com>
> > wrote:
> >
> >> On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com>
> >> wrote:
> >> > 1.  The query is in a parallel index scan or similar
> >> > 2.  A process is executing a parallel plan and allocating a significant
> >> chunk of memory (2MB for example) in dynamic shared memory.
> >> > 3.  The startup process goes into a loop where it sends a sigusr1,
> >> sleeps 5m, and sends another sigusr1 etc.
> >> > 4.  The sigusr1 aborts the system call, which is then retried.
> >> > 5.  Because the system call takes more than 5ms, we end up in an
> >> endless loop

What you're presumably encountering here is a recovery conflict.


> On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com>
> wrote:
> >> Do you mean this loop in dsm_impl_posix_resize() is getting
> >> interrupted constantly and never completing?
> >>
> >>                 /* We may get interrupted, if so just retry. */
> >>                 do
> >>                 {
> >>                         rc = posix_fallocate(fd, 0, size);
> >>                 } while (rc == EINTR);
> >>

Probably worthwile to check that the dsm code is properly robust if
errors are thrown from within here.


Greetings,

Andres Freund


Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:


On Wed, Sep 5, 2018 at 6:55 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-09-05 18:48:44 +0200, Chris Travers wrote:
> Will submit a patch here shortly.  Thanks!  Should we do for master and
> 10?  Or 9.6 too?

Please don't top-post on this list.  This needs to be done in all
branches where the posix_fallocate call is present.

> > Yep,  Maybe we should check for signals there.
> >
> > On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.munro@enterprisedb.com>
> > wrote:
> >
> >> On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com>
> >> wrote:
> >> > 1.  The query is in a parallel index scan or similar
> >> > 2.  A process is executing a parallel plan and allocating a significant
> >> chunk of memory (2MB for example) in dynamic shared memory.
> >> > 3.  The startup process goes into a loop where it sends a sigusr1,
> >> sleeps 5m, and sends another sigusr1 etc.
> >> > 4.  The sigusr1 aborts the system call, which is then retried.
> >> > 5.  Because the system call takes more than 5ms, we end up in an
> >> endless loop

What you're presumably encountering here is a recovery conflict.

Agreed but the question is how to correct what is a fairly interesting race condition. 


> On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com>
> wrote:
> >> Do you mean this loop in dsm_impl_posix_resize() is getting
> >> interrupted constantly and never completing?
> >>
> >>                 /* We may get interrupted, if so just retry. */
> >>                 do
> >>                 {
> >>                         rc = posix_fallocate(fd, 0, size);
> >>                 } while (rc == EINTR);
> >>

Probably worthwile to check that the dsm code is properly robust if
errors are thrown from within here.

Will check that too.  Thanks! 


Greetings,

Andres Freund


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Thomas Munro
Дата:
On Wed, Sep 5, 2018 at 10:13 AM Chris Travers <chris.travers@adjust.com> wrote:
> On Wed, Sep 5, 2018 at 6:55 PM Andres Freund <andres@anarazel.de> wrote:
>> > On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com>
>> > wrote:
>> > >> Do you mean this loop in dsm_impl_posix_resize() is getting
>> > >> interrupted constantly and never completing?
>> > >>
>> > >>                 /* We may get interrupted, if so just retry. */
>> > >>                 do
>> > >>                 {
>> > >>                         rc = posix_fallocate(fd, 0, size);
>> > >>                 } while (rc == EINTR);
>> > >>
>>
>> Probably worthwile to check that the dsm code is properly robust if
>> errors are thrown from within here.

Yeah, currently dsm_impl_posix_resize() returns and lets
dsm_impl_posix() clean up (close(), shm_unlink()) before raising
errors.  We can't just let CHECK_FOR_INTERRUPTS() take a non-local
exit.  Some refactoring involving PG_TRY()/PG_CATCH() may be the
simplest way forward.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:
Ok, so here's my current patch (work in progress).  I have not run tests yet, and finding a way to add a test case is virtually impossible though I expect we will find ways of using gdb to confirm local fix later.

After careful code review, I settled on the following approach which seemed to be the least intrusive.

1.  Change the retry logic so that it does not retry of QueryCancelPending or ProcDiePending are set.  In these cases EINTR is passed back to the caller
2.  Change the error handling logic of the caller so that EINTR is handled by the next CHECK_FOR_INTERRUPTS() after cleanup.

This means that the file descriptor is now already closed and that we handle this the same way we would with a ENOSPC.  The reason for this is that there are many places in the code where it is not clear what degree of safety is present for throwing errors using ereport, and the patch needs to be as unobtrusive as possible to facilitate back porting.

At this point the patch is for discussion.  I have not even compiled it or tested it but maybe there is something wrong with my approach so figured I would send it out early.

The major assumptions are:
1.  close() will never take longer than the interrupt interval that caused the loop to break.
2.  close() does not get interrupted in a way that will not cause cleanup problems later.
3. We will reach the next interrupt check at a reasonable interval and safe spot.

Any initial arguments against?

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Вложения

Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:


On Thu, Sep 6, 2018 at 11:08 AM Chris Travers <chris.travers@adjust.com> wrote:
Ok, so here's my current patch (work in progress).  I have not run tests yet, and finding a way to add a test case is virtually impossible though I expect we will find ways of using gdb to confirm local fix later.

After careful code review, I settled on the following approach which seemed to be the least intrusive.

1.  Change the retry logic so that it does not retry of QueryCancelPending or ProcDiePending are set.  In these cases EINTR is passed back to the caller
2.  Change the error handling logic of the caller so that EINTR is handled by the next CHECK_FOR_INTERRUPTS() after cleanup.

This means that the file descriptor is now already closed and that we handle this the same way we would with a ENOSPC.  The reason for this is that there are many places in the code where it is not clear what degree of safety is present for throwing errors using ereport, and the patch needs to be as unobtrusive as possible to facilitate back porting.

At this point the patch is for discussion.  I have not even compiled it or tested it but maybe there is something wrong with my approach so figured I would send it out early.

The major assumptions are:
1.  close() will never take longer than the interrupt interval that caused the loop to break.
2.  close() does not get interrupted in a way that will not cause cleanup problems later.
3. We will reach the next interrupt check at a reasonable interval and safe spot.

Any initial arguments against?

The previous patch had two issues found on internal code review here.  I am sending a new patch.

This patch compiles.  make check passes
make check-world fails but the errors are the same as found on master and involve ecpg.

We will be doing further internal review and verification and then will run through pg_indent before final submission.

Changes in this patch:
1.  Previous patch had backwards check for EINTR in calling function.  This was fixed.
2.  In cases where ERROR elevel was passed in, function would return instead of error out on case of error.

So in this case we check if we expect to error on error and if so, check for interrupts.  If not, we go through the normal error handling/logging path which might result in an warning that shared memory segment could not be allocated followed by an actual meaningful error.  I could put that in an else if, if that is seen as a good idea, but figured I would raise it as a discussion point.
 

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin



--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Вложения

Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:


On Thu, Sep 6, 2018 at 1:31 PM Chris Travers <chris.travers@adjust.com> wrote:


On Thu, Sep 6, 2018 at 11:08 AM Chris Travers <chris.travers@adjust.com> wrote:
Ok, so here's my current patch (work in progress).  I have not run tests yet, and finding a way to add a test case is virtually impossible though I expect we will find ways of using gdb to confirm local fix later.

After careful code review, I settled on the following approach which seemed to be the least intrusive.

1.  Change the retry logic so that it does not retry of QueryCancelPending or ProcDiePending are set.  In these cases EINTR is passed back to the caller
2.  Change the error handling logic of the caller so that EINTR is handled by the next CHECK_FOR_INTERRUPTS() after cleanup.

This means that the file descriptor is now already closed and that we handle this the same way we would with a ENOSPC.  The reason for this is that there are many places in the code where it is not clear what degree of safety is present for throwing errors using ereport, and the patch needs to be as unobtrusive as possible to facilitate back porting.

At this point the patch is for discussion.  I have not even compiled it or tested it but maybe there is something wrong with my approach so figured I would send it out early.

The major assumptions are:
1.  close() will never take longer than the interrupt interval that caused the loop to break.
2.  close() does not get interrupted in a way that will not cause cleanup problems later.
3. We will reach the next interrupt check at a reasonable interval and safe spot.

Any initial arguments against?

The previous patch had two issues found on internal code review here.  I am sending a new patch.

This patch compiles.  make check passes
make check-world fails but the errors are the same as found on master and involve ecpg.

We will be doing further internal review and verification and then will run through pg_indent before final submission.

Changes in this patch:
1.  Previous patch had backwards check for EINTR in calling function.  This was fixed.
2.  In cases where ERROR elevel was passed in, function would return instead of error out on case of error.

So in this case we check if we expect to error on error and if so, check for interrupts.  If not, we go through the normal error handling/logging path which might result in an warning that shared memory segment could not be allocated followed by an actual meaningful error.  I could put that in an else if, if that is seen as a good idea, but figured I would raise it as a discussion point.

Attaching correct patch.... 
 

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin



--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin



--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Вложения

Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

От
Chris Travers
Дата:
As a status note, the above patch has not been run through pg_indent and while I have run make check-world on linux (passed) and a non-Linux system (which failed both with and without my patch) I will be making one more small revision before final submission unless anyone thinks I need to change approaches.

Currently, all code paths that hit the resize call the relevant functions with an ERROR elevel.  So running CHECK_FOR_INTERRUPTS after the initial cleanup is currently safe, but the surrounding function doesn't guarantee safety so I don't think it is a good idea to raise exceptions there.

In above patch I test for whether the level is ERROR and check for interrupts when that is true.  However as a colleague pointed out to me, if anyone ever calls this with FATAL or PANIC as the desired error level, that would not be safe so I will be checking for whether the level is equal to or greater than ERROR.

We will also go through a manual test using a debugger to try to test the behavior in these cases and make sure the patch actually resolves our problem.

Assuming no objections here and that the manual test works, you can expect a formal submission in the next couple of days.

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin



--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin