Обсуждение: BUG #9557: Row not visible after receiving notification

Поиск
Список
Период
Сортировка

BUG #9557: Row not visible after receiving notification

От
marko@joh.to
Дата:
The following bug has been logged on the website:

Bug reference:      9557
Logged by:          Marko Tiikkaja
Email address:      marko@joh.to
PostgreSQL version: 9.1.12
Operating system:   Linux
Description:

Hi,

We went live with an application a few days ago, and it's been having
problems with seeing rows after receiving a notification from the
transaction that committed it.  I can't say for sure how often it happens,
but I've seen evidence suggesting at least three separate occasions.

The general design is that we have two independent "RPC Workers" LISTENing
for notifications.  Upon receiving a notification, they call a function
Claim_Request() which does this:

  BEGIN
    SELECT
      Requests.RequestID INTO _RequestID
    FROM rpcworker.Requests
    WHERE
      state = 'UNCLAIMED' AND
      -- sanity check
      Requests.Claimed IS NULL
    ORDER BY Requests.RequestID
    LIMIT 1
    FOR UPDATE ;
    IF NOT FOUND THEN
      RETURN;
    END IF;

    -- otherwise UPDATE to set claimed=now()

I suspected it might be something wonky with FOR UPDATE, so I added an extra
query sent right before Claim_Request():

    SELECT Requests.RequestID
    FROM rpcworker.Requests
    WHERE state = 'UNCLAIMED' AND
                Requests.Claimed IS NULL
    ORDER BY Requests.RequestID
    LIMIT 1

Below I have attached the relevant portions of the log when this happened.
In the example,  the SELECT without the FOR UPDATE did not see the row, but
the subsequent Claim_Request() did (supposedly due to the small latency
added by the extra round-trip by the SELECT before it).

RPCWorker B appeared to be working "normally" in this case (though it was a
bit slower to react than A), but I've attached its logs for posterity as
well.  In previous cases, where both of them called Claim_Request()
directly, we saw a couple of instances where *neither* of the RPC workers
saw the new row, which suggests something's terribly wrong.


## RPCWORKER B (UTC): ##


received notification with payload 8839
2014-03-12 21:11:27.250 +0000 LOG:  received notification on channel "RPC
Worker"
2014-03-12 21:11:27.253 +0000 LOG:  saw requestid 8839 without FOR UPDATE


## RPCWORKER A (UTC): ##


received notification with payload 8839
2014-03-12 21:11:27.240 +0000 LOG:  received notification on channel "RPC
Worker"
2014-03-12 21:11:27.245 +0000 LOG:  plain SELECT: sql: no rows in result
set
2014-03-12 21:11:27.251 +0000 WARNING:  SNAKE: saw 0, claimed 8839
2014-03-12 21:11:27.251 +0000 LOG:  received requestid 8839
2014-03-12 21:11:27.253 +0000 LOG:  plain SELECT: sql: no rows in result
set
2014-03-12 21:11:27.531 +0000 LOG:  request 8839 (attempt 1) succeeded
2014-03-12 21:11:27.562 +0000 LOG:  finished request 8839 (attempt 1)


## DBWORKER (UTC + 1): ##


2014-03-12 22:11:27.237
CET,"dbworker","db",34682,"10.1.1.71:42307",531f39d8.877a,211904,"SELECT",2014-03-11
17:29:12 CET,583/727697,1268656677,WARNING,01000,"
# the above line is the last piece of information we see from the
transaction sending the notification

# committed somewhere around here

2014-03-12 22:11:27.246
CET,"dbworker","db",34682,"10.1.1.71:42307",531f39d8.877a,211905,"SELECT",2014-03-11
17:29:12 CET,583/0,0,LOG,00000,"duration: 42.826
ms",,,,,,,,"exec_simple_query, postgres.c:1090",""

# known committed at this point (but a query has already been executed!)



## RPCWORKER A IN DB (UTC + 1): ##


2014-03-12 22:11:21.052
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12480,"SELECT",2014-03-12
16:28:07 CET,34/0,0,LOG,00000,"duration: 0.741
ms",,,,,,,,"exec_simple_query, postgres.c:1090","rpcworker

# idle for a good while here

2014-03-12 22:11:27.245
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12481,"idle",2014-03-12
16:28:07 CET,34/2000571,0,LOG,00000,"statement: -- NO FOR UPDATE
2014-03-12 22:11:27.245
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12482,"SELECT",2014-03-12
16:28:07 CET,34/0,0,LOG,00000,"duration: 0.983
ms",,,,,,,,"exec_simple_query, postgres.c:1090","rpcworker"

# didn't see the row

2014-03-12 22:11:27.252
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12483,"idle",2014-03-12
16:28:07 CET,34/2000572,0,LOG,00000,"statement:  -- Claim_Request()

# sees the row!



## RPCWORKER B IN DB (UTC + 1): ##


2014-03-12 22:11:27.253
CET,"rpcworker","db",32865,"10.2.1.72:33236",53207cee.8061,14392,"idle",2014-03-12
16:27:42 CET,95/1580775,0,LOG,00000,"statement: -- NO FOR UPDATE

# is a bit sluggish, so sees the row!

2014-03-12 22:11:27.254
CET,"rpcworker","db",32865,"10.2.1.72:33236",53207cee.8061,14394,"idle",2014-03-12
16:27:42 CET,95/1580776,0,LOG,00000,"statement:  -- Claim_Request()

# would likely see the row, but already claimed by worker A

Re: BUG #9557: Row not visible after receiving notification

От
Marko Tiikkaja
Дата:
On 3/13/14 10:39 AM, marko@joh.to wrote:
> We went live with an application a few days ago, and it's been having
> problems with seeing rows after receiving a notification from the
> transaction that committed it.  I can't say for sure how often it happens,
> but I've seen evidence suggesting at least three separate occasions.

I can reproduce this locally with some trickery.  Here's what I did:

   1) Stop the listener at HandleNotifyInterrupt() (suppose an unrelated
notification comes in in the real world).
   2) In another backend, do what the application would normally do, but
stop it before ProcArrayEndTransaction().
   3) Let the notification go through to the listener.  It sees *both*
notifications (since the new one has been committed to CLOG).
   4) SELECT from the frontend after seeing the notification (#2 is
still stuck right before ProcArrayEndTransaction).

This would explain it, as I don't see the notification code checking
whether the transaction is still in PGPROC, just calling
TransactionIdDidCommit().

Any thoughts?


Regards,
Marko Tiikkaja

Re: BUG #9557: Row not visible after receiving notification

От
Marko Tiikkaja
Дата:
On 3/13/14 1:55 PM, I wrote:
> This would explain it, as I don't see the notification code checking
> whether the transaction is still in PGPROC, just calling
> TransactionIdDidCommit().

And as predicted, adding a call to TransactionIdIsInProgress() in
asyncQueueProcessPageEntries makes the problem go away.


Regards,
Marko Tiikkaja

Re: BUG #9557: Row not visible after receiving notification

От
Heikki Linnakangas
Дата:
On 03/13/2014 03:26 PM, Marko Tiikkaja wrote:
> On 3/13/14 1:55 PM, I wrote:
>> This would explain it, as I don't see the notification code checking
>> whether the transaction is still in PGPROC, just calling
>> TransactionIdDidCommit().
>
> And as predicted, adding a call to TransactionIdIsInProgress() in
> asyncQueueProcessPageEntries makes the problem go away.

Good catch! Want to write that into a patch?

- Heikki

Re: BUG #9557: Row not visible after receiving notification

От
Marko Tiikkaja
Дата:
On 3/13/14 2:45 PM, Heikki Linnakangas wrote:
> On 03/13/2014 03:26 PM, Marko Tiikkaja wrote:
>> On 3/13/14 1:55 PM, I wrote:
>>> This would explain it, as I don't see the notification code checking
>>> whether the transaction is still in PGPROC, just calling
>>> TransactionIdDidCommit().
>>
>> And as predicted, adding a call to TransactionIdIsInProgress() in
>> asyncQueueProcessPageEntries makes the problem go away.
>
> Good catch! Want to write that into a patch?

I can do it if it helps, but I figured that it's a one-line change,
perhaps four with a nice comment..


Regards,
Marko Tiikkaja

Re: BUG #9557: Row not visible after receiving notification

От
Marko Tiikkaja
Дата:
On 3/13/14 2:55 PM, I wrote:
> On 3/13/14 2:45 PM, Heikki Linnakangas wrote:
>> Good catch! Want to write that into a patch?
>
> I can do it if it helps, but I figured that it's a one-line change,
> perhaps four with a nice comment..

How does the attached look?  I've again confirmed that I can't reproduce
the problem locally after applying this patch.


Regards,
Marko Tiikkaja

Вложения

Re: BUG #9557: Row not visible after receiving notification

От
Marko Tiikkaja
Дата:
On 3/13/14 3:47 PM, I wrote:
> On 3/13/14 2:55 PM, I wrote:
>> On 3/13/14 2:45 PM, Heikki Linnakangas wrote:
>>> Good catch! Want to write that into a patch?
>>
>> I can do it if it helps, but I figured that it's a one-line change,
>> perhaps four with a nice comment..
>
> How does the attached look?  I've again confirmed that I can't reproduce
> the problem locally after applying this patch.

Ugh.  Here's one with tabs instead of spaces.


Regards,
Marko Tiikkaja

Вложения

Re: BUG #9557: Row not visible after receiving notification

От
Tom Lane
Дата:
Marko Tiikkaja <marko@joh.to> writes:
> On 3/13/14 3:47 PM, I wrote:
>> How does the attached look?  I've again confirmed that I can't reproduce
>> the problem locally after applying this patch.

> Ugh.  Here's one with tabs instead of spaces.

Don't like the comment much, but I agree this is a bug.  Will take care of
it.

            regards, tom lane

Re: BUG #9557: Row not visible after receiving notification

От
Marko Tiikkaja
Дата:
On 3/13/14 4:16 PM, Tom Lane wrote:
> Don't like the comment much, but I agree this is a bug.  Will take care of
> it.

Thanks.


Regards,
Marko Tiikkaja