Обсуждение: [HACKERS] libpqrcv_PQexec() seems to violate latch protocol

Поиск
Список
Период
Сортировка

[HACKERS] libpqrcv_PQexec() seems to violate latch protocol

От
Andres Freund
Дата:
Hi,

The function  in $subject does:       while (PQisBusy(streamConn))       {           int         rc;
           /*            * We don't need to break down the sleep into smaller increments,            * since we'll get
interruptedby signals and can either handle            * interrupts here or elog(FATAL) within SIGTERM signal handler
if           * the signal arrives in the middle of establishment of            * replication connection.            */
        ResetLatch(&MyProc->procLatch);           rc = WaitLatchOrSocket(&MyProc->procLatch,
     WL_POSTMASTER_DEATH | WL_SOCKET_READABLE |                                  WL_LATCH_SET,
       PQsocket(streamConn),                                  0,
WAIT_EVENT_LIBPQWALRECEIVER);          if (rc & WL_POSTMASTER_DEATH)               exit(1);           /* interrupted */
         if (rc & WL_LATCH_SET)           {               CHECK_FOR_INTERRUPTS();               continue;           }
 

Doing ResetLatch();WaitLatch() like that makes it possible to miss a the
latch being set, e.g. if it happens just after WaitLatchOrSocket()
returns.

Afaict, the ResetLatch() really should just instead be in the if (rc & WL_LATCH_SET)
block.

Unless somebody protests, I'll make it so.

Greetings,

Andres Freund



Re: [HACKERS] libpqrcv_PQexec() seems to violate latch protocol

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> The function  in $subject does:

>             ResetLatch(&MyProc->procLatch);
>             rc = WaitLatchOrSocket(&MyProc->procLatch,
>                                    WL_POSTMASTER_DEATH | WL_SOCKET_READABLE |
>                                    WL_LATCH_SET,
>                                    PQsocket(streamConn),
>                                    0,
>                                    WAIT_EVENT_LIBPQWALRECEIVER);

Yeah, this is certainly broken.

> Afaict, the ResetLatch() really should just instead be in the if (rc & WL_LATCH_SET) block.

And, to be specific, it should be before the CHECK_FOR_INTERRUPTS call,
since that is the useful work that we want to be sure occurs after
any latch-setting event.
        regards, tom lane



Re: [HACKERS] libpqrcv_PQexec() seems to violate latch protocol

От
Andres Freund
Дата:
On 2017-06-06 17:14:59 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The function  in $subject does:
> 
> >             ResetLatch(&MyProc->procLatch);
> >             rc = WaitLatchOrSocket(&MyProc->procLatch,
> >                                    WL_POSTMASTER_DEATH | WL_SOCKET_READABLE |
> >                                    WL_LATCH_SET,
> >                                    PQsocket(streamConn),
> >                                    0,
> >                                    WAIT_EVENT_LIBPQWALRECEIVER);
> 
> Yeah, this is certainly broken.
> 
> > Afaict, the ResetLatch() really should just instead be in the if (rc & WL_LATCH_SET) block.
> 
> And, to be specific, it should be before the CHECK_FOR_INTERRUPTS call,
> since that is the useful work that we want to be sure occurs after
> any latch-setting event.

Right.  I found a couple more instance of similarly iffy, although not
quite as broken, patterns in launcher.c.  It's easy to get this wrong,
but it's a lot easy if you do it differently everywhere you use a
latch.  It's not good if code in the same file, by the same author(s),
has different ways of using latches.

- Andres



Re: [HACKERS] libpqrcv_PQexec() seems to violate latch protocol

От
Petr Jelinek
Дата:
On 06/06/17 23:17, Andres Freund wrote:
> On 2017-06-06 17:14:59 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> The function  in $subject does:
>>
>>>             ResetLatch(&MyProc->procLatch);
>>>             rc = WaitLatchOrSocket(&MyProc->procLatch,
>>>                                    WL_POSTMASTER_DEATH | WL_SOCKET_READABLE |
>>>                                    WL_LATCH_SET,
>>>                                    PQsocket(streamConn),
>>>                                    0,
>>>                                    WAIT_EVENT_LIBPQWALRECEIVER);
>>
>> Yeah, this is certainly broken.
>>
>>> Afaict, the ResetLatch() really should just instead be in the if (rc & WL_LATCH_SET) block.
>>
>> And, to be specific, it should be before the CHECK_FOR_INTERRUPTS call,
>> since that is the useful work that we want to be sure occurs after
>> any latch-setting event.
> 
> Right.  I found a couple more instance of similarly iffy, although not
> quite as broken, patterns in launcher.c.  It's easy to get this wrong,
> but it's a lot easy if you do it differently everywhere you use a
> latch.  It's not good if code in the same file, by the same author(s),
> has different ways of using latches.

Huh? I see same pattern everywhere in launcher.c, what am I missing?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] libpqrcv_PQexec() seems to violate latch protocol

От
Andres Freund
Дата:
On 2017-06-06 23:24:50 +0200, Petr Jelinek wrote:
> On 06/06/17 23:17, Andres Freund wrote:
> > Right.  I found a couple more instance of similarly iffy, although not
> > quite as broken, patterns in launcher.c.  It's easy to get this wrong,
> > but it's a lot easy if you do it differently everywhere you use a
> > latch.  It's not good if code in the same file, by the same author(s),
> > has different ways of using latches.
> 
> Huh? I see same pattern everywhere in launcher.c, what am I missing?

WaitForReplicationWorkerAttach:
while (...)   CHECK_FOR_INTERRUPTS();   /* other stuff including returns */   WaitLatch()   WL_POSTMASTER_DEATH
ResetLatch()

logicalrep_worker_stop loop 1:
while (...)   /* other stuff */   CHECK_FOR_INTERRUPTS()   WaitLatch()   POSTMASTER_DEATH   ResetLatch()   /* other
stuffincluding returns */
 
logicalrep_worker_stop loop 1:
while (...)   /* other stuff including returns */   CHECK_FOR_INTERRUPTS();   WaitLatch()   WL_POSTMASTER_DEATH
ResetLatch()

ApplyLauncherMain:
while (!got_SIGTERM)   /* lots other stuff */   WaitLatch()   WL_POSTMASTER_DEATH   /* some other stuff */
ResetLatch()
(note no CFI)

they're not hugely different, but subtely there are differences.
Sometimes you're guaranteed to check for interrupts after resetting the
latch, in other cases not. Sometimes expensive-ish things happen before
a CFI...

Greetings,

Andres Freund



Re: [HACKERS] libpqrcv_PQexec() seems to violate latch protocol

От
Petr Jelinek
Дата:
On 06/06/17 23:42, Andres Freund wrote:
> On 2017-06-06 23:24:50 +0200, Petr Jelinek wrote:
>> On 06/06/17 23:17, Andres Freund wrote:
>>> Right.  I found a couple more instance of similarly iffy, although not
>>> quite as broken, patterns in launcher.c.  It's easy to get this wrong,
>>> but it's a lot easy if you do it differently everywhere you use a
>>> latch.  It's not good if code in the same file, by the same author(s),
>>> has different ways of using latches.
>>
>> Huh? I see same pattern everywhere in launcher.c, what am I missing?
> 
> WaitForReplicationWorkerAttach:
> while (...)
>     CHECK_FOR_INTERRUPTS();
>     /* other stuff including returns */
>     WaitLatch()
>     WL_POSTMASTER_DEATH
>     ResetLatch()
> 
> logicalrep_worker_stop loop 1:
> while (...)
>     /* other stuff */
>     CHECK_FOR_INTERRUPTS()
>     WaitLatch()
>     POSTMASTER_DEATH
>     ResetLatch()
>     /* other stuff including returns */
> logicalrep_worker_stop loop 1:
> while (...)
>     /* other stuff including returns */
>     CHECK_FOR_INTERRUPTS();
>     WaitLatch()
>     WL_POSTMASTER_DEATH
>     ResetLatch()
> 
> ApplyLauncherMain:
> while (!got_SIGTERM)
>     /* lots other stuff */
>     WaitLatch()
>     WL_POSTMASTER_DEATH
>     /* some other stuff */
>     ResetLatch()
> (note no CFI)
> 
> they're not hugely different, but subtely there are differences.
> Sometimes you're guaranteed to check for interrupts after resetting the
> latch, in other cases not. Sometimes expensive-ish things happen before
> a CFI...
> 

Ah that's because signals in launcher are broken, see
https://www.postgresql.org/message-id/fe072153-babd-3b5d-8052-73527a6eb657@2ndquadrant.com
which also includes patch to fix it.

We originally had custom signal handling everywhere, then I realized it
was mistake for workers because of locking interaction but missed same
issue with launcher (the CFI in current launcher doesn't work).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services