Discussion: Hung Vacuum in 8.3

Hung Vacuum in 8.3

From:
Mark Kirkwood
Date:
This is 8.3.14 on Debian Lenny x86-64.

I'm seeing a hung vacuum:

postgres=# select procpid, query_start, waiting, current_query from
pg_stat_activity where current_query like '%VACUUM%';
 procpid |          query_start          | waiting |  current_query
---------+-------------------------------+---------+-----------------
    7347 | 2011-02-22 06:02:02.400247+13 | f       | VACUUM ANALYZE;

This was at approx 13:00, so it has been sitting there for approx 7 hours
(normal database vacuum time is 10 minutes). Now according to pg_stat_activity
and pg_locks it is *not* waiting for a lock, but no vacuuming appears to
be going on. strace says:

$ strace -p 7347

Process 7347 attached - interrupt to quit
semop(33456157, 0x7ffff512bad0, 1

Ok, so we are waiting on a semaphore - hmm, why is it not showing up as
waiting on a lock of some kind?

Of interest is this:

postgres=# select procpid, query_start, current_query from
pg_stat_activity order by query_start limit 1;
 procpid |          query_start          | current_query
---------+-------------------------------+---------------------------------------------
   25953 | 2011-02-22 04:24:07.417138+13 | SELECT n.node_id, n.node_ --
long query, joining several large tables - text snipped

So this query has been running since before the vacuum started, so probably
the vacuum is wanting to do lazy_truncate_heap() on one of the tables in the
join (no gdb on this box, unfortunately). I am, however, still puzzled
about why no locks are being waited on.
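
A pg_locks check along these lines shows whether the vacuum backend is
waiting on any heavyweight lock (an illustrative sketch using procpid 7347
from the output above, not necessarily the exact query that was run):

-- Illustrative sketch: list any heavyweight locks the vacuum backend is
-- still waiting to be granted. An empty result is consistent with waiting
-- on a buffer pin rather than on a lock.
select locktype, relation, mode, granted
from pg_locks
where pid = 7347
  and not granted;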

I have canceled the vacuum, but any suggestions for getting more diagnostic
info for next time?

regards

Mark

Re: Hung Vacuum in 8.3

From:
Heikki Linnakangas
Date:
On 22.02.2011 05:43, Mark Kirkwood wrote:
> This is 8.3.14 on Debian Lenny x86-64.
>
> I'm seeing a hung vacuum:
>
> postgres=# select procpid, query_start,waiting, current_query from
> pg_stat_activity where current_query like '%VACUUM%';
> procpid | query_start | waiting | current_query
>
> --------+-------------------------------+---------+---------------------------------------------------------------------------------------------
>
> 7347 | 2011-02-22 06:02:02.400247+13 | f | VACUUM ANALYZE;
>
> at approx 1300 (so it has been sitting there for approx 7 hours, normal
> database vacuum time is 10 minutes). Now according to pg_stat_activity
> and pg_locks it is *not* waiting for a lock, but no vacuuming appears to
> be going on. strace says:
>
> $ strace -p 7347
>
> Process 7347 attached - interrupt to quit
> semop(33456157, 0x7ffff512bad0, 1
>
> Ok, so we are waiting on a semaphore - hmm, why is it not showing up as
> waiting on a lock of some kind?
>
> Of interest is this:
>
> postgres=# select procpid, query_start, current_query from
> pg_stat_activity order by query_start limit 1;
> procpid | query_start | current_query
>
> ---------+-------------------------------+-------------------------------------------------------------------------------------
>
> 25953 | 2011-02-22 04:24:07.417138+13 | SELECT n.node_id, n.node_ --
> long query, joining several large tables - text snipped
>
> So this guy had been running from before the vacuum started, so probably
> vacuum is wanting to do lazy_truncate_heap() on one of the tables in the
> join (no gdb on this box unfortunately). I am however still puzzled
> about why no locks are being waited on.

A long query on the same table can block vacuum. Vacuum needs to take a
so-called "cleanup lock" on each page, which means that it has to wait
until no other backend holds a pin on the page. A long-running query can
keep a page pinned for a long time.
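
As a rough way to spot such a query, one can look for backends whose current
statement started before the stuck VACUUM did. A minimal sketch against
8.3's pg_stat_activity, using procpid 7347 from the report above purely for
illustration:

-- Sketch: backends whose current statement predates the stuck VACUUM and
-- which are therefore candidates for holding the buffer pin it waits on.
select procpid, query_start, current_query
from pg_stat_activity
where query_start < (select query_start
                     from pg_stat_activity
                     where procpid = 7347)
order by query_start;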

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Hung Vacuum in 8.3

From:
Mark Kirkwood
Date:
On 22/02/11 19:47, Heikki Linnakangas wrote:
>
> A long query on the same table can block vacuum. Vacuum needs to take
> a so-called "cleanup lock" on each page, which means that it has to
> wait until no other backend holds a pin on the page. A long-running
> query can keep a page pinned for a long time.
>

Ah - this is LockBufferForCleanup? So we are waiting for an exclusive
LockBuffer operation, i.e. an LWLock, not a (heavyweight) lock, urg... no
wonder.

To mitigate this I'm encouraging the use of a cleanup script for
long-running queries - since the app concerned is a web site, there is no
call for queries that run this long (i.e. way longer than the timeout for
the respective page rendering).
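
For illustration, such a cleanup could be as simple as the following sketch
(the 10-minute threshold and the exclusions are assumptions, not the actual
script):

-- Hypothetical cleanup sketch: cancel ordinary statements that have been
-- running for more than 10 minutes, leaving idle sessions and the vacuum
-- itself alone. pg_cancel_backend() requires superuser here.
select procpid, pg_cancel_backend(procpid)
from pg_stat_activity
where current_query not in ('<IDLE>', '<IDLE> in transaction')
  and current_query not like '%VACUUM%'
  and now() - query_start > interval '10 minutes';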

Thanks for the clarification (assuming I've understood correctly of
course...).

Cheers

Mark

Re: Hung Vacuum in 8.3

From:
Greg Stark
Date:
On Tue, Feb 22, 2011 at 7:14 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> On 22/02/11 19:47, Heikki Linnakangas wrote:
>>
>> A long query on the same table can block vacuum. Vacuum needs to take a
>> so-called "cleanup lock" on each page, which means that it has to wait until
>> no other backend holds a pin on the page. A long-running query can keep a
>> page pinned for a long time.
>>
>
> Ah, - this is LockBufferForCleanup? So we are waiting for an exclusive
> LockBuffer operation i.e a LWLock not a (heavyweight) lock, urg... no
> wonder.

Actually it's not waiting for the LockBuffer LWLock; it's waiting
until your query unpins the buffer it wants. Vacuum tries to get an
exclusive lock on the buffer; if it gets it, it then checks whether anyone
is using that buffer. If someone is, it unlocks the buffer and
waits until nobody has it pinned.

Only certain plan types will keep a buffer pinned for a long time -
things like the outer table of a nested loop join where the inner side
is another large table or a slow subquery, for example. This isn't
terribly common; usually it's caused by a large mistaken cartesian
join or something, but occasionally Postgres will generate a plan that
could do it. For instance, joining a single row against a large table
will sometimes do a nested loop from the single row to the large
table. It's also possible vacuum has been making progress but the
query keeps getting in its way and stopping it on new blocks.

It's also possible there's a bug of course. If someone was using that
buffer and somehow failed to notify the vacuum that they were done it
would wait for a very long time (forever?). However if vacuum
eventually continued when the query was canceled then it seems likely
it was working as intended.


--
greg

Re: Hung Vacuum in 8.3

From:
Robert Haas
Date:
On Tue, Feb 22, 2011 at 6:26 AM, Greg Stark <gsstark@mit.edu> wrote:
> Actually it's not waiting for the LockBuffer LWLock. it's waiting
> until your query unpins the buffer it wants. Vacuum tries to get an
> exclusive lock on the buffer, if it gets it then it checks if anyone
> is using that buffer. If someone is then it unlocks the buffer and
> waits until nobody has it pinned.

How bad would it be if we made LockBufferForCleanup() not wait? If we
can't obtain the buffer cleanup lock immediately, we just skip that
page and continue on. That would prevent us from updating
relfrozenxid, I guess, but we already can't do that if there are any
bits set in the visibility map. It could also leave some bloat in
the table, but probably not much (he says hopefully).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Hung Vacuum in 8.3

From:
Mark Kirkwood
Date:
On 23/02/11 00:26, Greg Stark wrote:
>
> It's also possible there's a bug of course. If someone was using that
> buffer and somehow failed to notify the vacuum that they were done it
> would wait for a very long time (forever?). However if vacuum
> eventually continued when the query was canceled then it seems likely
> it was working as intended.
>

Greg, thanks for clarifying this.

Unfortunately this time around I canceled the vacuum and then the query.
However *next* time I'll get rid of the query 1st and see what happens.

Cheers

Mark

Re: Hung Vacuum in 8.3

From:
Mark Kirkwood
Date:
On 23/02/11 03:27, Robert Haas wrote:
> On Tue, Feb 22, 2011 at 6:26 AM, Greg Stark<gsstark@mit.edu>  wrote:
>> Actually it's not waiting for the LockBuffer LWLock. it's waiting
>> until your query unpins the buffer it wants. Vacuum tries to get an
>> exclusive lock on the buffer, if it gets it then it checks if anyone
>> is using that buffer. If someone is then it unlocks the buffer and
>> waits until nobody has it pinned.
> How bad it would be if we made LockBufferForCleanup() not wait?  If we
> can't obtain the buffer cleanup lock immediately, we just skip that
> page and continue on.  That would prevent us from updating
> relfrozenxid, I guess, but we already can't do that if there are any
> bits set in the visibility map.   It could also leave some bloat in
> the table, but probably not much (he says hopefully).
>

Seems like a good suggestion, and may leave less bloat than having the
vacuum hung for potentially quite some time.

Mark

Re: Hung Vacuum in 8.3

From:
Mark Kirkwood
Date:
On 23/02/11 10:18, Mark Kirkwood wrote:
> On 23/02/11 00:26, Greg Stark wrote:
>>
>> It's also possible there's a bug of course. If someone was using that
>> buffer and somehow failed to notify the vacuum that they were done it
>> would wait for a very long time (forever?). However if vacuum
>> eventually continued when the query was canceled then it seems likely
>> it was working as intended.
>>
>
> Greg, thanks for clarifying this.
>
> Unfortunately this time around I canceled the vacuum and then the
> query. However *next* time I'll get rid of the query 1st and see what
> happens.
>

I have another one here:

postgres=# select datname,procpid, query_start, current_query
                   from pg_stat_activity
                   where now() - query_start > '2 hours'::interval order
by query_start;
  datname | procpid |          query_start          |  current_query
---------+---------+-------------------------------+-----------------
  stuff   |   23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;

postgres=# select datname,procpid, query_start, current_query
                   from pg_stat_activity
                   where now() - backend_start > '2 hours'::interval
order by query_start;
  datname | procpid |          query_start          |  current_query
---------+---------+-------------------------------+-----------------
  stuff   |   23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
                                                    :
  stuff   |   30018 | 2011-03-08 11:57:03.214177+13 | <IDLE>
  stuff   |   11390 | 2011-03-08 12:27:20.390122+13 | <IDLE>
  stuff   |   10178 | 2011-03-08 12:28:59.328265+13 | <IDLE>
  stuff   |    5100 | 2011-03-08 12:30:10.318204+13 | <IDLE>
  stuff   |   10255 | 2011-03-08 12:32:21.520814+13 | <IDLE>
  stuff   |   29993 | 2011-03-08 12:32:26.562835+13 | <IDLE>
  stuff   |   10204 | 2011-03-08 12:40:57.059958+13 | <IDLE>
  stuff   |   10206 | 2011-03-08 12:41:11.708924+13 | <IDLE>
(9 rows)


It looks to me like whatever query held the buffer pins the vacuum *was*
waiting on should be well and truly over by now. The strace for the vacuum
looks the same as before:

$ strace -p 23957
Process 23957 attached - interrupt to quit
semop(39747613, 0x7fff53d3dec0, 1


It certainly looks like the vacuum is not able to resume after setting
itself to sleep for some reason.

Re: Hung Vacuum in 8.3

From:
Mark Kirkwood
Date:
On 08/03/11 12:55, Mark Kirkwood wrote:
> On 23/02/11 10:18, Mark Kirkwood wrote:
>> On 23/02/11 00:26, Greg Stark wrote:
>>>
>>> It's also possible there's a bug of course. If someone was using that
>>> buffer and somehow failed to notify the vacuum that they were done it
>>> would wait for a very long time (forever?). However if vacuum
>>> eventually continued when the query was canceled then it seems likely
>>> it was working as intended.
>>>
>>
>> Greg, thanks for clarifying this.
>>
>> Unfortunately this time around I canceled the vacuum and then the
>> query. However *next* time I'll get rid of the query 1st and see what
>> happens.
>>
>
> I have another one here:
>
> postgres=# select datname,procpid, query_start, current_query
>                   from pg_stat_activity
>                   where now() - query_start > '2 hours'::interval
> order by query_start;
>  datname | procpid |          query_start          |  current_query
> ---------+---------+-------------------------------+-----------------
>  stuff   |   23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
>
> postgres=# select datname,procpid, query_start, current_query
>                   from pg_stat_activity
>                   where now() - backend_start > '2 hours'::interval
> order by query_start;
>  datname | procpid |          query_start          |  current_query
> ---------+---------+-------------------------------+-----------------
>  stuff   |   23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
>                                                    :
>  stuff   |   30018 | 2011-03-08 11:57:03.214177+13 | <IDLE>
>  stuff   |   11390 | 2011-03-08 12:27:20.390122+13 | <IDLE>
>  stuff   |   10178 | 2011-03-08 12:28:59.328265+13 | <IDLE>
>  stuff   |    5100 | 2011-03-08 12:30:10.318204+13 | <IDLE>
>  stuff   |   10255 | 2011-03-08 12:32:21.520814+13 | <IDLE>
>  stuff   |   29993 | 2011-03-08 12:32:26.562835+13 | <IDLE>
>  stuff   |   10204 | 2011-03-08 12:40:57.059958+13 | <IDLE>
>  stuff   |   10206 | 2011-03-08 12:41:11.708924+13 | <IDLE>
> (9 rows)
>
>
> It looks to me like whatever query buffer pins the vacuum *was*
> waiting on should be well and truly over by now. The strace for the
> vacuum looks the same as before:
>
> $ strace -p 23957
> Process 23957 attached - interrupt to quit
> semop(39747613, 0x7fff53d3dec0, 1
>
>
> It certainly looks like the vacuum is not able to resume after setting
> itself to sleep for some reason.
>
>
>

Immediately after I sent this, I saw the vacuum unlock itself and
continue... so there *are* some queries still blocking it. I'm seeing
some highish (1-2 minute) lock waits for a certain table:

postgres=# select datname,procpid, query_start, current_query
            from pg_stat_activity where now() - query_start > '1 minute'
            and procpid in (select distinct pid from pg_locks where not
granted) order by query_start;
  datname | procpid |          query_start          |        current_query
 ---------+---------+-------------------------------+-------------------------------
  stuff   |     390 | 2011-03-08 12:57:01.752234+13 | UPDATE content.twitter_queue
                                                     : SET error = $1
                                                     : WHERE twitter_queue_id = $2
 (30 or so)

which are probably not helping.

So false alarm, no bug here it would seem, sorry.

Mark

Re: Hung Vacuum in 8.3

From:
Greg Stark
Date:
On Mon, Mar 7, 2011 at 11:55 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
>  stuff   |   23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
>

For what it's worth, VACUUM ANALYZE does have to take an additional
lock to update the statistics. It should rarely cause any problems, but
it is a possibility.

When you do get a good example you might want to take the contents of
pg_locks at the same time as pg_stat_activity. That would at least be
able to help rule out any real locks and make it clear it's really
stuck on waiting for a block to become unpinned.
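
Something along these lines would capture both views in one snapshot (a
sketch only; the column names are as they appear in 8.3):

-- Sketch: every lock held or awaited, with the owning backend's activity,
-- to make it easy to rule out heavyweight-lock waits after the fact.
select l.pid, l.locktype, l.relation, l.mode, l.granted,
       a.query_start, a.waiting, a.current_query
from pg_locks l
join pg_stat_activity a on a.procpid = l.pid
order by l.granted, a.query_start;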

--
greg

Re: Hung Vacuum in 8.3

From:
Mark Kirkwood
Date:
On 08/03/11 13:03, Mark Kirkwood wrote:
> On 08/03/11 12:55, Mark Kirkwood wrote:
>> On 23/02/11 10:18, Mark Kirkwood wrote:
>>> On 23/02/11 00:26, Greg Stark wrote:
>>>>
>>>> It's also possible there's a bug of course. If someone was using that
>>>> buffer and somehow failed to notify the vacuum that they were done it
>>>> would wait for a very long time (forever?). However if vacuum
>>>> eventually continued when the query was canceled then it seems likely
>>>> it was working as intended.
>>>>
>>>
>>> Greg, thanks for clarifying this.
>>>
>>> Unfortunately this time around I canceled the vacuum and then the
>>> query. However *next* time I'll get rid of the query 1st and see
>>> what happens.
>>>
>>
>> I have another one here:
>>
>> postgres=# select datname,procpid, query_start, current_query
>>                   from pg_stat_activity
>>                   where now() - query_start > '2 hours'::interval
>> order by query_start;
>>  datname | procpid |          query_start          |  current_query
>> ---------+---------+-------------------------------+-----------------
>>  stuff   |   23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
>>
>> postgres=# select datname,procpid, query_start, current_query
>>                   from pg_stat_activity
>>                   where now() - backend_start > '2 hours'::interval
>> order by query_start;
>>  datname | procpid |          query_start          |  current_query
>> ---------+---------+-------------------------------+-----------------
>>  stuff   |   23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
>>                                                    :
>>  stuff   |   30018 | 2011-03-08 11:57:03.214177+13 | <IDLE>
>>  stuff   |   11390 | 2011-03-08 12:27:20.390122+13 | <IDLE>
>>  stuff   |   10178 | 2011-03-08 12:28:59.328265+13 | <IDLE>
>>  stuff   |    5100 | 2011-03-08 12:30:10.318204+13 | <IDLE>
>>  stuff   |   10255 | 2011-03-08 12:32:21.520814+13 | <IDLE>
>>  stuff   |   29993 | 2011-03-08 12:32:26.562835+13 | <IDLE>
>>  stuff   |   10204 | 2011-03-08 12:40:57.059958+13 | <IDLE>
>>  stuff   |   10206 | 2011-03-08 12:41:11.708924+13 | <IDLE>
>> (9 rows)
>>
>>
>> It looks to me like whatever query buffer pins the vacuum *was*
>> waiting on should be well and truly over by now. The strace for the
>> vacuum looks the same as before:
>>
>> $ strace -p 23957
>> Process 23957 attached - interrupt to quit
>> semop(39747613, 0x7fff53d3dec0, 1
>>
>>
>> It certainly looks like the vacuum is not able to resume after
>> setting itself to sleep for some reason.
>>
>>
>>
>
> Immediately after I sent this, I saw the vacuum unlock itself and
> continue... so there *are* some queries still blocking it. I'm seeing
> some highish (1 -2 minute) lock waits for a certain table:
>
> postgres=# select datname,procpid, query_start, current_query
>            from pg_stat_activity where now() - query_start > '1 minute'
>            and procpid in (select distinct pid from pg_locks where not
> granted) order by query_start;
>  datname | procpid |          query_start          |        current_query
> ---------+---------+-------------------------------+------------------------------
>
>  stuff   |     390 | 2011-03-08 12:57:01.752234+13 | UPDATE
> content.twitter_queue
>                                                    : SET error = $1
>                                                    : WHERE
> twitter_queue_id = $2
> (30 or so)
>
> which are probably not helping.
>
>

Make that definitely not helping:

stuff=# select relname, last_vacuum
         from pg_stat_all_tables
         where last_vacuum is not null
         and last_vacuum>  '2011-03-08 06:03:06'::timestamp
         order by last_vacuum limit 3;
     relname     |          last_vacuum
----------------+-------------------------------
  headline_alert | 2011-03-08 06:03:06.182866+13
  stock_static   | 2011-03-08 06:03:06.610739+13
  twitter_queue  | 2011-03-08 12:58:03.02451+13
(3 rows)



So we were trying to vacuum 'twitter_queue' (a small 20MB table) for
approx 6 hrs.

Note that the vacuum itself was not waiting on any locks according to
pg_locks, so this looks to be the pinning issue. Clearly we need to amend the
application design that results in this quite long (1-2 minute!) wait
for the lock on this table.

regards

Mark

P.S: for the interested, the process(es) that the above UPDATEs are
waiting for are sitting '<IDLE> in transaction' - i.e. the application end is
doing some *non-database* blocking call whilst holding a lock on the
'twitter_queue' table :-/
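
A query along these lines will list such sessions, oldest first (an
illustrative sketch against the same pg_stat_activity columns used above):

-- Sketch: idle-in-transaction backends. query_start is the start of the
-- backend's most recent statement, so a large age suggests the transaction
-- has been idle (and holding its locks) for at least that long.
select procpid, usename, query_start, now() - query_start as idle_for
from pg_stat_activity
where current_query = '<IDLE> in transaction'
order by query_start;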