Thread: [PATCH] Fix for infinite signal loop in parallel scan


[PATCH] Fix for infinite signal loop in parallel scan

From: Chris Travers
Date:
Hi;

Attached is the patch we are fully testing at Adjust.  There are a few non-obvious aspects of the code around where the patch hits.  I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).  Automatic tests are difficult because the bug is tied to a rare race condition which is difficult (or possibly impossible) to create automatically.  Our current approach to testing is to force the issue using a debugger.  Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).

The main problem is that under certain rare circumstances we see recovery conflicts going into loops sending SIGUSR1 to the parallel query, which retries a posix_fallocate call.  The call is perpetually interrupted by the signal, and replication cannot proceed.

The patch attempts to handle cases where we are trying to cancel the query or terminate the process in the same way we would handle an ENOSPC in the resize operation, namely to break out of the loop, do the relevant cleanup, and then throw the relevant errors.

There is a serious complication in this area, however: dsm_impl_op takes an elevel parameter which determines whether or not it is safe to throw errors.  This elevel is then passed to ereport inside the function, and this function also calls the resize operation itself.  Since this is not safe to combine with CHECK_FOR_INTERRUPTS(), I conditionally do this only if elevel is greater than or equal to ERROR.

Currently all code paths here use elevel ERROR when reaching this point, but the calling function supports semantics where this could be lower, and a caller might have additional cleanup to do.  So I don't think one can simply add CHECK_FOR_INTERRUPTS() there: even though that would work for now, it could create all kinds of subtle bugs in the future.

So what the patch does is check whether we are trying to end the query or the backend, and skip retrying the resize operation if either of those conditions is true.  In that case it sets errno to EINTR and returns the same.

The only caller then, if the resize operation failed, checks whether elevel is greater than or equal to ERROR and whether errno is set to EINTR.  If so, it checks for signals.  If these do not abort the query, we fall through, issue the ereport with the supplied elevel as we would have otherwise, and return false to the caller.

In current calls to this function, this means that interrupts are handled right after cleanup of the dynamic shared memory, and these then abort the query or exit the backend.  Future callers could specify a WARNING elevel if they have extra cleanup to do, in which case signals would not be checked, and the same warning found today would be found in the log.  At the next CHECK_FOR_INTERRUPTS() call, the appropriate errors would be raised.

In the past it had been suggested we use PG_TRY() and PG_CATCH(), but given that it is not consistent whether we can raise an error or whether we MUST raise an error, I don't see how that approach can work.  As far as I can see, we MUST raise an error in the appropriate spot if and only if elevel is set to a sufficient level.

Backporting this is pretty trivial.  We expect to confirm this ourselves on both master and 10.  I can prepare backports fairly quickly.

Is there any feedback on this approach before I add it to the next commitfest?

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Attachments

Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Andres Freund
Date:
Hi,

On 2018-09-07 17:57:05 +0200, Chris Travers wrote:
> Attached is the patch we are fully testing at Adjust.

For the future, please don't start a separate thread from the bug report
/ original discussion.  It makes it much harder to follow.

Greetings,

Andres Freund


Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Thomas Munro
Date:
On Sat, Sep 8, 2018 at 3:57 AM Chris Travers <chris.travers@adjust.com> wrote:
> Attached is the patch we are fully testing at Adjust.

Thanks!

> I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).

FWIW it's entirely possible to get make check-world passing on a Mac.
Maybe post the problem you're seeing to a new thread?

> ...

> In the past it had been suggested we do PG_TRY(); and PG_CATCH(), but given that it is not consistent whether we can raise an error or whether we MUST raise an error, I don't see how this approach can work.  As far as I can see, we MUST raise an error in the appropriate spot if and only if elevel is set to a sufficient level.

Yeah, your way looks a bit nicer than something involving PG_TRY().

> Is there any feedback on this approach before I add it to the next commitfest?

Please go ahead and add it.  Being a bug fix, we'll commit it sooner
than the open commitfest anyway, but it's useful to have it in there.

+ if (errno == EINTR && elevel >= ERROR)
+ CHECK_FOR_INTERRUPTS();

I think we might as well just CHECK_FOR_INTERRUPTS() unconditionally.
In this branch elevel is always ERROR as you noted, and the code
around there is confusing enough already.

+ } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));

There seems to be a precedent for checking QueryCancelPending directly
to break out of a loop in regcomp.c and syncrep.c.  So this seems OK.
Hmm, I wonder if there should be an INTERRUPT_PENDING() macro that
hides those details, but allows you to break out of a loop and then do
some cleanup before CHECK_FOR_INTERRUPT().

--
Thomas Munro
http://www.enterprisedb.com


Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Oleksii Kliukin
Date:

> On 7. Sep 2018, at 17:57, Chris Travers <chris.travers@adjust.com> wrote:
>
> Hi;
>
> Attached is the patch we are fully testing at Adjust.  There are a few non-obvious aspects of the code around where the patch hits.  I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).  Automatic tests are difficult because the bug is tied to a rare race condition which is difficult (or possibly impossible) to create automatically.  Our current approach to testing is to force the issue using a debugger.  Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).


I did some manual testing on it, putting breakpoints on
ResolveRecoveryConflictWithLock in the startup process on a streaming replica
(configured with a very low max_standby_streaming_delay, i.e. 100ms) and on the
posix_fallocate call in the normal backend on the same replica. At that point I
also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nonstop)
and resumed execution on both the backend and the startup process.

Then I simulated a conflict by creating a rather big (3GB) table on the master,
doing some updates on it and then running an aggregate on the replica backend
(i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I
set the breakpoint. The aggregate and force_parallel_mode ensured that
the query was executed as a parallel one, leading to allocation of the DSM segment.

Once the backend reached the posix_fallocate call and was waiting on the
breakpoint, I called 'vacuum full test' on the master, which (in the vast
majority of cases) led to a conflict on the replica running 'select from test',
triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup
process, since the startup process tried to cancel the conflicting backend. At
that point I continued execution of the startup process (which would loop in
CancelVirtualTransaction, sending SIGUSR1 to the backend while the backend
waited to be resumed from the breakpoint). Right after that, I changed the size
parameter on the backend to something that would make posix_fallocate run for a
bit longer, typically ten to a hundred MB. Once the backend process was resumed,
it started receiving SIGUSR1 from the startup process, resulting in
posix_fallocate exiting with EINTR.

With the patch applied, the posix_fallocate loop terminated right away (because
the QueryCancelPending flag was set to true) and the backend went through the
cleanup, showing an ERROR about cancellation due to a conflict with recovery.
Without the patch, it looped indefinitely in dsm_impl_posix_resize, while the
startup process looped forever, trying to send SIGUSR1.

One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
for the duration of posix_fallocate?

Cheers,
Oleksii Kliukin

Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Chris Travers
Date:
First, I have attached a cleaned-up revision (pgindent, removing a dangling comment, etc.).

On Fri, Sep 14, 2018 at 1:16 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sat, Sep 8, 2018 at 3:57 AM Chris Travers <chris.travers@adjust.com> wrote:
>> Attached is the patch we are fully testing at Adjust.
>
> Thanks!
>
>> I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).
>
> FWIW it's entirely possible to get make check-world passing on a Mac.
> Maybe post the problem you're seeing to a new thread?

Will do. 

>> ...
>
>> In the past it had been suggested we do PG_TRY(); and PG_CATCH(), but given that it is not consistent whether we can raise an error or whether we MUST raise an error, I don't see how this approach can work.  As far as I can see, we MUST raise an error in the appropriate spot if and only if elevel is set to a sufficient level.
>
> Yeah, your way looks a bit nicer than something involving PG_TRY().
>
>> Is there any feedback on this approach before I add it to the next commitfest?
>
> Please go ahead and add it.  Being a bug fix, we'll commit it sooner
> than the open commitfest anyway, but it's useful to have it in there.
>
> + if (errno == EINTR && elevel >= ERROR)
> + CHECK_FOR_INTERRUPTS();
>
> I think we might as well just CHECK_FOR_INTERRUPTS() unconditionally.
> In this branch elevel is always ERROR as you noted, and the code
> around there is confusing enough already.

The reason I didn't do that is future safety, plus the off chance that someone could already be doing something strange with extensions today that uses this in an unsafe way.  While I cannot think of any use case for calling this in an unsafe way, I know that one could write extensions, background workers, etc. that do things with this API that are not in our current code.  For something that could be backported, I really want to work as much as possible in a way that does not break even exotic extensions.

Longer term, I would like to see if I can help refactor this code so that responsibility for error handling is clearer and such problems cannot exist.  But that's not something to backport.

> + } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
>
> There seems to be a precedent for checking QueryCancelPending directly
> to break out of a loop in regcomp.c and syncrep.c.  So this seems OK.

Yeah, I checked.
 
> Hmm, I wonder if there should be an INTERRUPT_PENDING() macro that
> hides those details, but allows you to break out of a loop and then do
> some cleanup before CHECK_FOR_INTERRUPT().

That is a good idea. 

> --
> Thomas Munro
> http://www.enterprisedb.com


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Attachments

Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Chris Travers
Date:


On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <alexk@hintbits.com> wrote:


>> On 7. Sep 2018, at 17:57, Chris Travers <chris.travers@adjust.com> wrote:
>>
>> Hi;
>>
>> Attached is the patch we are fully testing at Adjust.  There are a few non-obvious aspects of the code around where the patch hits.  I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).  Automatic tests are difficult because the bug is tied to a rare race condition which is difficult (or possibly impossible) to create automatically.  Our current approach to testing is to force the issue using a debugger.  Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).
>
> I did some manual testing on it, putting breakpoints on
> ResolveRecoveryConflictWithLock in the startup process on a streaming replica
> (configured with a very low max_standby_streaming_delay, i.e. 100ms) and on the
> posix_fallocate call in the normal backend on the same replica. At that point I
> also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nonstop)
> and resumed execution on both the backend and the startup process.
>
> Then I simulated a conflict by creating a rather big (3GB) table on the master,
> doing some updates on it and then running an aggregate on the replica backend
> (i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I
> set the breakpoint. The aggregate and force_parallel_mode ensured that
> the query was executed as a parallel one, leading to allocation of the DSM segment.
>
> Once the backend reached the posix_fallocate call and was waiting on the
> breakpoint, I called 'vacuum full test' on the master, which (in the vast
> majority of cases) led to a conflict on the replica running 'select from test',
> triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup
> process, since the startup process tried to cancel the conflicting backend. At
> that point I continued execution of the startup process (which would loop in
> CancelVirtualTransaction, sending SIGUSR1 to the backend while the backend
> waited to be resumed from the breakpoint). Right after that, I changed the size
> parameter on the backend to something that would make posix_fallocate run for a
> bit longer, typically ten to a hundred MB. Once the backend process was resumed,
> it started receiving SIGUSR1 from the startup process, resulting in
> posix_fallocate exiting with EINTR.
>
> With the patch applied, the posix_fallocate loop terminated right away (because
> the QueryCancelPending flag was set to true) and the backend went through the
> cleanup, showing an ERROR about cancellation due to a conflict with recovery.
> Without the patch, it looped indefinitely in dsm_impl_posix_resize, while the
> startup process looped forever, trying to send SIGUSR1.
>
> One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
> for the duration of posix_fallocate?

Many thanks!

If we were to do that, I would say we should mask all signals we can mask during the call.

I don't have a problem going down that road instead if people think it is better.

As a note, we have a patched version of PostgreSQL 10.5 ready to deploy to a system affected by this and expect that to be done this week.
 

> Cheers,
> Oleksii Kliukin


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Thomas Munro
Date:
On Tue, Sep 18, 2018 at 1:15 AM Chris Travers <chris.travers@adjust.com> wrote:
> On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <alexk@hintbits.com> wrote:
>> With the patch applied, the posix_fallocate loop terminated right away (because
>> of QueryCancelPending flag set to true) and the backend went through the
>> cleanup, showing an ERROR of cancelling due to the conflict with recovery.
>> Without the patch, it looped indefinitely in the dsm_impl_posix_resize, while
>> the startup process were looping forever, trying to send SIGUSR1.

Thanks for testing!

>> One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
>> for the duration of posix_fallocate?
>
> If we were to do that, I would say we should mask all signals we can mask during the call.
>
> I don't have a problem going down that road instead if people think it is better.

We discussed that when adding posix_fallocate() and decided that
retrying is better:

https://www.postgresql.org/message-id/20170628230458.n5ehizmvhoerr5yq%40alap3.anarazel.de

Here is a patch that I propose to commit and back-patch to 9.4.  I
just wrote a suitable commit message, edited the comments lightly and
fixed some whitespace.

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Oleksii Kliukin
Date:

> On 18. Sep 2018, at 03:18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Tue, Sep 18, 2018 at 1:15 AM Chris Travers <chris.travers@adjust.com> wrote:
>> On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <alexk@hintbits.com> wrote:
>>> With the patch applied, the posix_fallocate loop terminated right away (because
>>> of QueryCancelPending flag set to true) and the backend went through the
>>> cleanup, showing an ERROR of cancelling due to the conflict with recovery.
>>> Without the patch, it looped indefinitely in the dsm_impl_posix_resize, while
>>> the startup process were looping forever, trying to send SIGUSR1.
>
> Thanks for testing!
>
>>> One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
>>> for the duration of posix_fallocate?
>>
>> If we were to do that, I would say we should mask all signals we can mask during the call.
>>
>> I don't have a problem going down that road instead if people think it is better.
>
> We discussed that when adding posix_fallocate() and decided that
> retrying is better:
>
> https://www.postgresql.org/message-id/20170628230458.n5ehizmvhoerr5yq%40alap3.anarazel.de
>
> Here is a patch that I propose to commit and back-patch to 9.4.  I
> just wrote a suitable commit message, edited the comments lightly and
> fixed some whitespace.

Thanks!

Apart from the fact that the reviewer's name is “Murat Kabilov” and
not “Murak Kabilov”, the back-patch looks good to me.

Cheers,
Oleksii Kliukin

Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Thomas Munro
Date:
On Tue, Sep 18, 2018 at 9:25 PM Oleksii Kliukin <alexk@hintbits.com> wrote:
> > On 18. Sep 2018, at 03:18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > Here is a patch that I propose to commit and back-patch to 9.4.  I
> > just wrote a suitable commit message, edited the comments lightly and
> > fixed some whitespace.
>
> Thanks!
>
> Apart from the fact that the reviewer's name is “Murat Kabilov” and
> not “Murak Kabilov” the back-patch looks good to me.

Oops, fixed.  Pushed.  Thanks all for the report, patch and reviews.

--
Thomas Munro
http://www.enterprisedb.com