Обсуждение: canceling autovacuum task woes

Поиск
Список
Период
Сортировка

canceling autovacuum task woes

От
Robert Haas
Дата:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.

Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any.  What should I do?

It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.

Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involve here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize.  But whether that's true or not, the
current logging is wholly inadequate.

Thoughts?  Anybody else have this problem?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: canceling autovacuum task woes

От
Andres Freund
Дата:
Hi,

On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
> I am running into a lot of customer situations where the customer
> reports that "canceling autovacuum task" shows up in the logs, and
> it's unclear whether this is happening often enough to matter, and
> even more unclear what's causing it.
> 
> Me: So, do you know what table it's getting cancelled on?
> Customer: Nope.
> Me: Are you running any DDL commands anywhere in the cluster?
> Customer: Nope, absolutely none.
> Me: Well you've got to be running something somewhere or it wouldn't
> be having a lock conflict.
> Customer: OK, well I don't know of any.  What should I do?
> 
> It would be awfully nice if the process that does the cancelling would
> provide the same kind of reporting that we do for a deadlock: the
> relevant lock tag, the PID of the process sending the cancel, and the
> query string.
> 
> Personally, I'm starting to have a sneaky suspicion that there is
> something actually broken here - that is, that there are lock
> conflicts involve here other than the obvious one (SHARE UPDATE
> EXCLUSIVE on the table) that are allowing autovac to get cancelled
> more often than we realize.  But whether that's true or not, the
> current logging is wholly inadequate.
Very, very, very quick guess: The relation extension lock?

> Thoughts?  Anybody else have this problem?
I have seen spuriously high occurances of that message before, but I never 
really investigated it.

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: canceling autovacuum task woes

От
Steve Singer
Дата:
On 12-07-24 01:48 PM, Robert Haas wrote:
> I am running into a lot of customer situations where the customer
> reports that "canceling autovacuum task" shows up in the logs, and
> it's unclear whether this is happening often enough to matter, and
> even more unclear what's causing it.

Could autovacuum be compacting a lot of space at the end of the table.  
This is described
in the thread 
http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com


> Me: So, do you know what table it's getting cancelled on?
> Customer: Nope.
> Me: Are you running any DDL commands anywhere in the cluster?
> Customer: Nope, absolutely none.
> Me: Well you've got to be running something somewhere or it wouldn't
> be having a lock conflict.
> Customer: OK, well I don't know of any.  What should I do?
>
> It would be awfully nice if the process that does the cancelling would
> provide the same kind of reporting that we do for a deadlock: the
> relevant lock tag, the PID of the process sending the cancel, and the
> query string.
>
> Personally, I'm starting to have a sneaky suspicion that there is
> something actually broken here - that is, that there are lock
> conflicts involve here other than the obvious one (SHARE UPDATE
> EXCLUSIVE on the table) that are allowing autovac to get cancelled
> more often than we realize.  But whether that's true or not, the
> current logging is wholly inadequate.
>
> Thoughts?  Anybody else have this problem?
>



Re: canceling autovacuum task woes

От
Robert Haas
Дата:
On Tue, Jul 24, 2012 at 2:11 PM, Steve Singer <ssinger@ca.afilias.info> wrote:
> On 12-07-24 01:48 PM, Robert Haas wrote:
>> I am running into a lot of customer situations where the customer
>> reports that "canceling autovacuum task" shows up in the logs, and
>> it's unclear whether this is happening often enough to matter, and
>> even more unclear what's causing it.
>
> Could autovacuum be compacting a lot of space at the end of the table.  This
> is described
> in the thread
> http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com

You (and Andres) may well be right, but I think the way we find out is
to add some better logging.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: canceling autovacuum task woes

От
Andrew Dunstan
Дата:
On 07/24/2012 01:48 PM, Robert Haas wrote:
> I am running into a lot of customer situations where the customer
> reports that "canceling autovacuum task" shows up in the logs, and
> it's unclear whether this is happening often enough to matter, and
> even more unclear what's causing it.
>
> Me: So, do you know what table it's getting cancelled on?
> Customer: Nope.
> Me: Are you running any DDL commands anywhere in the cluster?
> Customer: Nope, absolutely none.
> Me: Well you've got to be running something somewhere or it wouldn't
> be having a lock conflict.
> Customer: OK, well I don't know of any.  What should I do?
>
> It would be awfully nice if the process that does the cancelling would
> provide the same kind of reporting that we do for a deadlock: the
> relevant lock tag, the PID of the process sending the cancel, and the
> query string.
>

+1 for more information on this.

cheers

andrew




Re: canceling autovacuum task woes

От
Alvaro Herrera
Дата:

> On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
> > I am running into a lot of customer situations where the customer
> > reports that "canceling autovacuum task" shows up in the logs, and
> > it's unclear whether this is happening often enough to matter, and
> > even more unclear what's causing it.
> >
> > Me: So, do you know what table it's getting cancelled on?
> > Customer: Nope.
> > Me: Are you running any DDL commands anywhere in the cluster?
> > Customer: Nope, absolutely none.
> > Me: Well you've got to be running something somewhere or it wouldn't
> > be having a lock conflict.
> > Customer: OK, well I don't know of any.  What should I do?
> >
> > It would be awfully nice if the process that does the cancelling would
> > provide the same kind of reporting that we do for a deadlock: the
> > relevant lock tag, the PID of the process sending the cancel, and the
> > query string.

Hm, autovacuum is supposed to set an errcontext callback that would tell
you the table name it's working on at the time of the cancel.  So if
even that is missing, something strange is going on.

No objections to the general idea of adding more info about the process
blocked on autovacuum.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: canceling autovacuum task woes

От
Alvaro Herrera
Дата:
Excerpts from Alvaro Herrera's message of mar jul 24 15:30:49 -0400 2012:
>
> > On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
> > > I am running into a lot of customer situations where the customer
> > > reports that "canceling autovacuum task" shows up in the logs, and
> > > it's unclear whether this is happening often enough to matter, and
> > > even more unclear what's causing it.
> > >
> > > Me: So, do you know what table it's getting cancelled on?
> > > Customer: Nope.

> Hm, autovacuum is supposed to set an errcontext callback that would tell
> you the table name it's working on at the time of the cancel.  So if
> even that is missing, something strange is going on.

Yep, it says:

ERROR:  canceling autovacuum task
CONTEXT:  automatic vacuum of table "alvherre.public.foo"

So at least that part seems pilot error more than anything else.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: canceling autovacuum task woes

От
Robert Haas
Дата:
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Yep, it says:
>
> ERROR:  canceling autovacuum task
> CONTEXT:  automatic vacuum of table "alvherre.public.foo"
>
> So at least that part seems pilot error more than anything else.

Yeah, you're right.  So you do get the table name.  But you don't get
the cause, which is what you really need to understand why it's
happening.  Attached is a patch that adds some more detail.  Here's an
example of what the output looks like:

LOG:  sending cancel to blocking autovacuum PID 21595
DETAIL:  Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT:  drop table if exists pgbench_accounts
ERROR:  canceling autovacuum task
CONTEXT:  automatic vacuum of table "rhaas.public.pgbench_accounts"

I think that's a lot more useful than just getting those last two lines...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

Re: canceling autovacuum task woes

От
Alvaro Herrera
Дата:
Excerpts from Robert Haas's message of mar jul 24 15:52:23 -0400 2012:
> On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Yep, it says:
> >
> > ERROR:  canceling autovacuum task
> > CONTEXT:  automatic vacuum of table "alvherre.public.foo"
> >
> > So at least that part seems pilot error more than anything else.
>
> Yeah, you're right.  So you do get the table name.  But you don't get
> the cause, which is what you really need to understand why it's
> happening.  Attached is a patch that adds some more detail.  Here's an
> example of what the output looks like:
>
> LOG:  sending cancel to blocking autovacuum PID 21595
> DETAIL:  Process 21618 waits for AccessExclusiveLock on relation 27863
> of database 16384
> STATEMENT:  drop table if exists pgbench_accounts
> ERROR:  canceling autovacuum task
> CONTEXT:  automatic vacuum of table "rhaas.public.pgbench_accounts"

Looks great.  Are you considering backpatching this?


--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: canceling autovacuum task woes

От
Robert Haas
Дата:
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Robert Haas's message of mar jul 24 15:52:23 -0400 2012:
>> On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>> > Yep, it says:
>> >
>> > ERROR:  canceling autovacuum task
>> > CONTEXT:  automatic vacuum of table "alvherre.public.foo"
>> >
>> > So at least that part seems pilot error more than anything else.
>>
>> Yeah, you're right.  So you do get the table name.  But you don't get
>> the cause, which is what you really need to understand why it's
>> happening.  Attached is a patch that adds some more detail.  Here's an
>> example of what the output looks like:
>>
>> LOG:  sending cancel to blocking autovacuum PID 21595
>> DETAIL:  Process 21618 waits for AccessExclusiveLock on relation 27863
>> of database 16384
>> STATEMENT:  drop table if exists pgbench_accounts
>> ERROR:  canceling autovacuum task
>> CONTEXT:  automatic vacuum of table "rhaas.public.pgbench_accounts"
>
> Looks great.  Are you considering backpatching this?

Well, that would certainly make MY life easier.  I am not sure whether
it would be in line with project policy, however.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: canceling autovacuum task woes

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, you're right.  So you do get the table name.  But you don't get
> the cause, which is what you really need to understand why it's
> happening.  Attached is a patch that adds some more detail.

Uh, what's the added dependency on pgstat.h for?  Looks sane to the
eyeball otherwise.
        regards, tom lane


Re: canceling autovacuum task woes

От
Robert Haas
Дата:
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, you're right.  So you do get the table name.  But you don't get
>> the cause, which is what you really need to understand why it's
>> happening.  Attached is a patch that adds some more detail.
>
> Uh, what's the added dependency on pgstat.h for?  Looks sane to the
> eyeball otherwise.

Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.

New version attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

Re: canceling autovacuum task woes

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Looks great.  Are you considering backpatching this?

> Well, that would certainly make MY life easier.  I am not sure whether
> it would be in line with project policy, however.

+1 for a backpatch.  Otherwise it'll be years before we gain any
information about the unexpected cancels that you think exist.

However, after looking some more at deadlock.c, I wonder whether
(a) this patch gives sufficient detail, and (b) whether there isn't a
problem that's obvious by inspection.  It appears to me that as the
blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
needing to be killed even though it may be several graph edges out from
the current proc.  This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done.  I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.
        regards, tom lane


Re: canceling autovacuum task woes

От
Robert Haas
Дата:
On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>>> Looks great.  Are you considering backpatching this?
>
>> Well, that would certainly make MY life easier.  I am not sure whether
>> it would be in line with project policy, however.
>
> +1 for a backpatch.  Otherwise it'll be years before we gain any
> information about the unexpected cancels that you think exist

OK, great.

> However, after looking some more at deadlock.c, I wonder whether
> (a) this patch gives sufficient detail, and (b) whether there isn't a
> problem that's obvious by inspection.  It appears to me that as the
> blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
> needing to be killed even though it may be several graph edges out from
> the current proc.  This means that with respect to (a), the connection
> from the process doing the kill to the AV proc may be inadequately
> documented by this patch, and with respect to (b), there might well be
> cases where we found an AV proc somewhere in the graph traversal but
> it's not actually guilty of blocking the current process ... especially
> not after the queue reorderings that we may have done.  I think I'd be
> happier with that code if it restricted its AV targets to procs that
> *directly* block the current process, which not incidentally would make
> this amount of log detail sufficient.

Uggh.  Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since
presumablyif we limit it to things that directly block the current process, you'll always allow the full
deadlock_timeoutbefore nuking the autovac worker.  So +1 for changing that. 

Does an edge in this context mean any lock, or just an ungranted one?  I assume the latter, which still leaves the
questionof where the edges are coming from in the first place. 

...Robert

Re: canceling autovacuum task woes

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... This means that with respect to (a), the connection
>> from the process doing the kill to the AV proc may be inadequately
>> documented by this patch, and with respect to (b), there might well be
>> cases where we found an AV proc somewhere in the graph traversal but
>> it's not actually guilty of blocking the current process ... especially
>> not after the queue reorderings that we may have done.  I think I'd be
>> happier with that code if it restricted its AV targets to procs that
>> *directly* block the current process, which not incidentally would make
>> this amount of log detail sufficient.

> Uggh.  Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since
presumablyif we limit it to things that directly block the current process, you'll always allow the full
deadlock_timeoutbefore nuking the autovac worker.  So +1 for changing that. 

I think something as simple as the attached would do the trick.  I've
verified that this still allows the expected cancel cases, though of
course it's harder to prove that it gives any benefit.

> Does an edge in this context mean any lock, or just an ungranted one?  I assume the latter, which still leaves the
questionof where the edges are coming from in the first place. 

The deadlock code follows "hard" edges, which mean "A wants a lock that
B has already got", as well as "soft" edges, which mean "A wants a lock
that B also wants, and B is ahead of A in the queue for it".  We don't
kill autovacs that are the end of soft edges, which I think is good,
but the fact that we follow them at all makes it a little unclear what
we might reach recursively.  Your point about determinacy of the timeout
is probably even a better argument for only allowing the direct blockee
to wield the knife.

            regards, tom lane

diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 288186a6ceebf40d5a43638c7be5be35c12dd3b3..6620a3d571d769f75e6ae3fbbe4a0ed2cf271720 100644
*** a/src/backend/storage/lmgr/deadlock.c
--- b/src/backend/storage/lmgr/deadlock.c
*************** FindLockCycleRecurse(PGPROC *checkProc,
*** 527,551 ****
                  if ((proclock->holdMask & LOCKBIT_ON(lm)) &&
                      (conflictMask & LOCKBIT_ON(lm)))
                  {
-                     /*
-                      * Look for a blocking autovacuum. There can be more than
-                      * one in the deadlock cycle, in which case we just pick a
-                      * random one.    We stash the autovacuum worker's PGPROC so
-                      * that the caller can send a cancel signal to it, if
-                      * appropriate.
-                      *
-                      * Note we read vacuumFlags without any locking.  This is
-                      * OK only for checking the PROC_IS_AUTOVACUUM flag,
-                      * because that flag is set at process start and never
-                      * reset; there is logic elsewhere to avoid canceling an
-                      * autovacuum that is working for preventing Xid
-                      * wraparound problems (which needs to read a different
-                      * vacuumFlag bit), but we don't do that here to avoid
-                      * grabbing ProcArrayLock.
-                      */
-                     if (pgxact->vacuumFlags & PROC_IS_AUTOVACUUM)
-                         blocking_autovacuum_proc = proc;
-
                      /* This proc hard-blocks checkProc */
                      if (FindLockCycleRecurse(proc, depth + 1,
                                               softEdges, nSoftEdges))
--- 527,532 ----
*************** FindLockCycleRecurse(PGPROC *checkProc,
*** 559,565 ****

                          return true;
                      }
!                     /* If no deadlock, we're done looking at this proclock */
                      break;
                  }
              }
--- 540,575 ----

                          return true;
                      }
!
!                     /*
!                      * No deadlock here, but see if this proc is an autovacuum
!                      * that is directly hard-blocking our own proc.  If so,
!                      * report it so that the caller can send a cancel signal
!                      * to it, if appropriate.  If there's more than one such
!                      * proc (probably not possible given that autovacuums all
!                      * take similar lock types), it's indeterminate which one
!                      * will be reported.
!                      *
!                      * We don't touch autovacuums that are indirectly blocking
!                      * us; it's up to the direct blockee to take action.  This
!                      * rule simplifies understanding the behavior and ensures
!                      * that an autovacuum won't be canceled with less than
!                      * deadlock_timeout grace period.
!                      *
!                      * Note we read vacuumFlags without any locking.  This is
!                      * OK only for checking the PROC_IS_AUTOVACUUM flag,
!                      * because that flag is set at process start and never
!                      * reset.  There is logic elsewhere to avoid canceling an
!                      * autovacuum that is working to prevent XID wraparound
!                      * problems (which needs to read a different vacuumFlag
!                      * bit), but we don't do that here to avoid grabbing
!                      * ProcArrayLock.
!                      */
!                     if (checkProc == MyProc &&
!                         pgxact->vacuumFlags & PROC_IS_AUTOVACUUM)
!                         blocking_autovacuum_proc = proc;
!
!                     /* We're done looking at this proclock */
                      break;
                  }
              }

Re: canceling autovacuum task woes

От
Peter Eisentraut
Дата:
On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
> On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Yeah, you're right.  So you do get the table name.  But you don't get
> >> the cause, which is what you really need to understand why it's
> >> happening.  Attached is a patch that adds some more detail.
> >
> > Uh, what's the added dependency on pgstat.h for?  Looks sane to the
> > eyeball otherwise.
> 
> Woops, that was leftovers from some earlier silliness that I (mostly)
> removed before posting.
> 
> New version attached.
> 

The detail message should have a period at the end.




Re: canceling autovacuum task woes

От
Robert Haas
Дата:
On Mon, Aug 13, 2012 at 11:59 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
>> On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Robert Haas <robertmhaas@gmail.com> writes:
>> >> Yeah, you're right.  So you do get the table name.  But you don't get
>> >> the cause, which is what you really need to understand why it's
>> >> happening.  Attached is a patch that adds some more detail.
>> >
>> > Uh, what's the added dependency on pgstat.h for?  Looks sane to the
>> > eyeball otherwise.
>>
>> Woops, that was leftovers from some earlier silliness that I (mostly)
>> removed before posting.
>>
>> New version attached.
>>
>
> The detail message should have a period at the end.

Oops.  Fixed, sorry it took me so long to get to this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company