Обсуждение: and waiting

Поиск
Список
Период
Сортировка

and waiting

От
"Gurjeet Singh"
Дата:
Hi guys,<br /><br />    I saw a strange behaviour on one of the production boxes. The pg_stat_activity shows a process
as<IDLE> and yet 'waiting' !!! On top of it (understandably, since its IDLE), there are no entries for this pid
inpg_locks!<br /><br />    Following are the snapshots of the two system views.<br /><br /><span style="font-family:
couriernew,monospace;"> procpid |     current_query     | waiting |     duration     |         backend_start</span><br
style="font-family:courier new,monospace;" /><span style="font-family: courier
new,monospace;">---------+-----------------------+---------+------------------+-------------------------------</span><br
style="font-family:courier new,monospace;" /><span style="font-family: courier new,monospace;">   20762 |
<IDLE>               | f       |                  | 2008-01-31 13:38:30.848898-08</span><br style="font-family:
couriernew,monospace;" /><span style="font-family: courier new,monospace;">   19776 | <IDLE>                |
t      | 00:38:34.76833   | 2008-01-31 12:51:29.005744-08</span><br style="font-family: courier new,monospace;" /><span
style="font-family:courier new,monospace;">   20356 | <IDLE>                | f       | 00:38:29.971425  |
2008-01-3113:17:37.617497-08</span><br style="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">  19775 | <IDLE>                | f       | 00:38:27.187201  | 2008-01-31
12:51:28.999242-08</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">  19774 | <IDLE>                | f       | 00:38:27.187068  | 2008-01-31
12:51:28.90554-08</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">  20728 | <IDLE>                | f       | 00:14:03.913027  | 2008-01-31
13:36:11.345822-08</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">   9727 | <IDLE>                | f       | 00:03:07.444273  | 2008-01-24
22:25:00.289931-08</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">   9684 | <IDLE>                | f       | 00:00:07.704656  | 2008-01-24
22:22:00.007377-08</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">  19390 | <IDLE> in transaction | f       | 00:00:00.027585  | 2008-01-31
12:30:07.999246-08</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">  19389 | <IDLE> in transaction | t       | -00:00:00.000255 | 2008-01-31
12:30:07.973868-08</span><brstyle="font-family: courier new,monospace;" /><br style="font-family: courier
new,monospace;"/><span style="font-family: courier new,monospace;">select * from pg_locks where pid in ( 19776, 19389
);</span><brstyle="font-family: courier new,monospace;" /><br style="font-family: courier new,monospace;" /><span
style="font-family:courier new,monospace;">   locktype    | database | relation | page | tuple | transactionid |
classid| objid | objsubid | transaction |  pid  |       mode       | granted</span><br style="font-family: courier
new,monospace;"/><span style="font-family: courier
new,monospace;">---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+-------+------------------+---------</span><br
style="font-family:courier new,monospace;" /><span style="font-family: courier new,monospace;"> relation      |   
16584|    17070 |      |       |               |         |       |          |  3700350056 | 19389 | RowExclusiveLock |
t</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;"> relation     |    16584 |    17106 |      |       |               |         |       |          | 
3700350056| 19389 | RowExclusiveLock | t</span><br style="font-family: courier new,monospace;" /><span
style="font-family:courier new,monospace;"> relation      |    16584 |    17068 |      |       |              
|        |       |          |  3700350056 | 19389 | RowExclusiveLock | t</span><br style="font-family: courier
new,monospace;"/><span style="font-family: courier new,monospace;"> transactionid |          |          |      |      
|   3700350056 |         |       |          |  3700350056 | 19389 | ExclusiveLock    | t</span><br style="font-family:
couriernew,monospace;" /><span style="font-family: courier new,monospace;"> relation      |    16584 |    17108 |     
|      |               |         |       |          |  3700350056 | 19389 | RowExclusiveLock | t</span><br
style="font-family:courier new,monospace;" /><span style="font-family: courier new,monospace;">(5 rows)</span><br
style="font-family:courier new,monospace;" /><br clear="all" style="font-family: courier new,monospace;" /><br />   
The'duration' column above is just now()-query_start. These are not just two instant snapshots, but we could see this
outputconsistently for quite long.<br /><br />    I tracked the 'waiting' column a little bit in the source code, and
sawthat it is actually generated from PgBackendStatus.st_waiting . Is it possible that, for some reason, postgres
forgotto update this for a backend?<br /><br /><span style="font-family: courier new,monospace;">select
version();</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier
new,monospace;">                                         version</span><br style="font-family: courier new,monospace;"
/><spanstyle="font-family: courier
new,monospace;">--------------------------------------------------------------------------------------------</span><br
style="font-family:courier new,monospace;" /><span style="font-family: courier new,monospace;"> PostgreSQL 8.2.4 on
x86_64-unknown-linux-gnu,compiled by GCC gcc (GCC) 4.1.0 (SUSE Linux)</span><br style="font-family: courier
new,monospace;"/><br />    This issue has been seen  twice now.<br /><br />-- <br />gurjeet[.singh]@EnterpriseDB.com<br
/>singh.gurjeet@{gmail | hotmail | indiatimes | yahoo }.com<br /><br />EnterpriseDB      <a
href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/><br />17° 29' 34.37"N,   78° 30' 59.76"E -
Hyderabad<br/>18° 32' 57.25"N,   73° 56' 25.42"E - Pune<br />37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *<br
/><br/><a href="http://gurjeet.frihost.net">http://gurjeet.frihost.net</a><br /><br />Mail sent from my BlackLaptop
device 

Re: and waiting

От
Tom Lane
Дата:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
>     I saw a strange behaviour on one of the production boxes. The
> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
> it (understandably, since its IDLE), there are no entries for this pid in
> pg_locks!

Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.
        regards, tom lane


Re: and waiting

От
"Gurjeet Singh"
Дата:
The situation seems pretty bad!!

Here are the steps to reproduce in 'PostgreSQL 8.3beta2 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 (SuSE Linux)':

session 1: begin;
session 1: update test set a = 112 where a = 112;
session 2: update test set a = 113 where a = 112; --waits
session 1: select * from pg_stat_activity; -- executed this a few times before executing 'select version()' and then following:
session 1: <stat query1> -- see end of mail for this query

 procpid |             current_query              | waiting |     duration     |         backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
   12577 | update test set a = 113 where a = 112; | t       | -00:01:35.782881 | 2008-02-01 13:36:15.31027-08
   11975 | select * from pg_stat_activity ;       | f       | -00:01:52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

session 1: select * from pg_locks

   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |       mode       | granted

---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+---------------
-----+-------+------------------+---------
 transactionid |          |          |      |       |            |           390 |         |       |          | 2/14     | 12577 | ShareLock        | f

 transactionid |          |          |      |       |            |           390 |         |       |          | 1/9     | 11975 | ExclusiveLock    | t
<snip irrelevant>

Then,
session 2: ^C
Cancel request sent
ERROR:  canceling statement due to user request

session1: <stat query1>

 procpid |             current_query              | waiting |     duration     |         backend_start

---------+----------------------------------------+---------+------------------+-------------------------------
   12577 | update test set a = 113 where a = 112; | t       | -00:01:35.782881 | 2008-02-01 13:36:15.31027-08
   11975 | select * from pg_stat_activity ;       | f       | -00:01:52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

session 1: select * from pg_locks ;

<no traces of pid 12577>

session 1: select pg_backend_pid();

 pg_backend_pid
----------------
          11975

The last mentioned output of <stat query> shows session 1 executing 'select * from p_s_a', whereas the <stat query> _is_ being executed in session 1!!! This result is consistently returned for a while, and later...

session 2: select pg_backend_pid();

  pg_backend_pid

----------------
          12577

session 1: <stat query1>

 procpid |     current_query     | waiting |    duration     |         backend_start
---------+-----------------------+---------+-----------------+-------------------------------
   11975 | <IDLE> in transaction | f       | 00:06:08.671029 | 2008-02-01 13:30:40.396392-08
(1 row)

After a while again:

session 1: <stat query2> -- notice 2 not 1; 'select *' comes back to haunt!!!

 procpid |             current_query              | waiting |     duration     |         backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
   12577 | update test set a = 113 where a = 112; | t       | -00:01:35.782881 | 2008-02-01 13:36:15.31027-08
   11975 | select * from pg_stat_activity ;       | f       | -00:01:52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

session 1: <stat query1> -- 1 back in action

 procpid |             current_query              | waiting |     duration     |         backend_start

---------+----------------------------------------+---------+------------------+-------------------------------
   12577 | update test set a = 113 where a = 112; | t       | -00:01:35.782881 | 2008-02-01 13:36:15.31027-08
   11975 | select * from pg_stat_activity ;       | f       | -00:01:52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

The <stat query1> is:
select
procpid, current_query::varchar(50), waiting, now() - query_start as duration, backend_start
from pg_stat_activity
where current_query <> '<IDLE>'
and current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;

The <stat query2> is:
select
    procpid, current_query::varchar(50), waiting, now() - query_start as duration, backend_start
from pg_stat_activity
where current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;

    Found more bugs than I was looking for, to reproduce!!!

    The reporter also made an observation (on 8.2.4) that there were deadlocks detected at around the same time. Looked at WaitOnLock(), and clearly there's a problem, but is it at the same/only place we are suspecting it to be?

Best regards,

PS: Ran the <stat query>ies 1 and 2 again, just before hitting 'send', and the result is the same:
 procpid |             current_query              | waiting |     duration     |         backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
   12577 | update test set a = 113 where a = 112; | t       | -00:01:35.782881 | 2008-02-01 13:36:15.31027-08
   11975 | select * from pg_stat_activity ;       | f       | -00:01:52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)


Clearly, there's something wrong

On Feb 1, 2008 8:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
>     I saw a strange behaviour on one of the production boxes. The
> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
> it (understandably, since its IDLE), there are no entries for this pid in
> pg_locks!

Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.

                       regards, tom lane



--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB      http://www.enterprisedb.com

17° 29' 34.37"N,   78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N,   73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Re: and waiting

От
Tom Lane
Дата:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
> The situation seems pretty bad!!

I think at least part of your problem is not understanding that a single
transaction sees a frozen snapshot of pg_stat_activity.
        regards, tom lane


Re: and waiting

От
"Gurjeet Singh"
Дата:
On Feb 1, 2008 3:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
> The situation seems pretty bad!!

I think at least part of your problem is not understanding that a single
transaction sees a frozen snapshot of pg_stat_activity.


It does! I assumed that pg_stat_activity produced the transaction-independent snapshot of internal memory structures! Is that the case with pg_locks too!? I hope not.

BTW, we cannot say that the pg_stat_activity behaves in a consistent manner (transactions-wise). From what I could infer, this view's results are frozen when you first query the view, not when the transaction started (which is how other (normal) relations behave). It's a bit confusing, and should be documented if this is the way it is intended to work; Something along the lines of : "In a transaction, this view will repeatedly show the same results that were returned by it's first invocation in the transaction." in a less confusing way :)

So we are back to the original problem... Canceling a 'waiting' transaction does not revert the session's 'waiting' state back to 'false' (consistently reproducible).

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB      http://www.enterprisedb.com

17° 29' 34.37"N,   78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N,   73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Re: and waiting

От
Tom Lane
Дата:
I wrote:
> "Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
>> I saw a strange behaviour on one of the production boxes. The
>> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
>> it (understandably, since its IDLE), there are no entries for this pid in
>> pg_locks!

> Hmm, I can reproduce something like this by aborting a wait for lock.
> It seems the problem is that WaitOnLock() is ignoring its own good
> advice, assuming that it can do cleanup work after waiting.

I've committed a fix for this.  (Too late for 8.3.0, unfortunately.)
        regards, tom lane


Re: and waiting

От
"Gurjeet Singh"
Дата:
On Feb 2, 2008 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> "Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
>> I saw a strange behaviour on one of the production boxes. The
>> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
>> it (understandably, since its IDLE), there are no entries for this pid in
>> pg_locks!

> Hmm, I can reproduce something like this by aborting a wait for lock.
> It seems the problem is that WaitOnLock() is ignoring its own good
> advice, assuming that it can do cleanup work after waiting.

I've committed a fix for this.  (Too late for 8.3.0, unfortunately.)

Thanks. Like 8.2, can it not be back-patched on 8.3 too?

I just looked at the patch... Isn't PG_TRY() an expensive call to make in the lock.c code? I was thinking of registering a Xact callback using RegisterXactCallback() and performing 'waiting' reset in that callback if the Xact event is XACT_EVENT_ABORT.

That would have been compliant with the previous comments ('if we fail, any cleanup must happen in xact abort processing, not here').

Comments.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB      http://www.enterprisedb.com

17° 29' 34.37"N,   78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N,   73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Re: and waiting

От
Alvaro Herrera
Дата:
Gurjeet Singh escribió:

> I just looked at the patch... Isn't PG_TRY() an expensive call to make in
> the lock.c code? I was thinking of registering a Xact callback using
> RegisterXactCallback() and performing 'waiting' reset in that callback if
> the Xact event is XACT_EVENT_ABORT.

PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: and waiting

От
"Gurjeet Singh"
Дата:
On Feb 2, 2008 3:27 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Gurjeet Singh escribió:

> I just looked at the patch... Isn't PG_TRY() an expensive call to make in
> the lock.c code? I was thinking of registering a Xact callback using
> RegisterXactCallback() and performing 'waiting' reset in that callback if
> the Xact event is XACT_EVENT_ABORT.

PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.


Thats why I asked. I assumed that creating stacks was expensive. Isn't this the reason the compilers came up with the function inline capability; to avoid stacks, because they can be expensive. Or am I confusing two different types of stacks!

Moreover, calling a callback, once in a while (only upon XACT abort), may prove to be much cheaper than setting up an additional stack on every lock-acquire call.

Really, my 2 cents.

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB      http://www.enterprisedb.com

17° 29' 34.37"N,   78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N,   73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Re: and waiting

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Gurjeet Singh escribi�:
>> I just looked at the patch... Isn't PG_TRY() an expensive call to make in
>> the lock.c code? I was thinking of registering a Xact callback using
>> RegisterXactCallback() and performing 'waiting' reset in that callback if
>> the Xact event is XACT_EVENT_ABORT.

> PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
> another stack frame.

Also, since we're about to block here, shaving microseconds is not all
that important.

The reason I did it that way was to avoid having to export the saved
ps-display string out to someplace LockWaitCancel could find it.
        regards, tom lane