Thread: Invulnerable VACUUM process thrashing everything


Invulnerable VACUUM process thrashing everything

From:
"Jeffrey W. Baker"
Date:
05:09 PM 12/29/2005
A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
decided to VACUUM a table which has not been updated in over a year and
is more than one terabyte on the disk.  Because of the very high
transaction load on this database, this VACUUM has been ruining
performance for almost a month.  Unfortunately it seems invulnerable to
killing by signals:

# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -HUP 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -INT 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -PIPE 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
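
(For what it's worth, the D in the ps STAT column means the process is
in uninterruptible disk sleep; a signal is not acted on until the
backend returns from the kernel and reaches one of its internal
interrupt checks.  One way to watch the state, using the PID above:)

    # D = uninterruptible disk sleep (I/O); signals are queued but not
    # handled while the process stays in this state
    ps -o pid,stat,wchan -p 15308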

o/~ But the cat came back, the very next day ...

I assume that if I kill this with SIGKILL, that will bring down every
other postgres process, so that should be avoided.  But surely there is
a way to interrupt this.  If I had some reason to shut down the
instance, I'd be screwed, it seems.

-jwb

Re: Invulnerable VACUUM process thrashing everything

From:
Ron
Date:
Ick.  Can you get users and foreign connections off that machine,
lock them out for some period, and renice the VACUUM?

Shedding load and keeping it off while the VACUUM runs at high priority
might allow it to finish in a reasonable amount of time.
Or
Shedding load and dropping the VACUUM's priority might allow a kill
signal to get through.
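
A minimal sketch of the renice idea, using the backend PID from the
original post:

    # drop the VACUUM backend to the lowest CPU priority ...
    renice +19 -p 15308
    # ... or, once load is shed, raise its priority so it finishes
    # sooner (raising priority requires root)
    renice -10 -p 15308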

Hope this helps,
Ron


At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote:
>A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
>decided to VACUUM a table which has not been updated in over a year and
>is more than one terabyte on the disk.
>[...]




Re: Invulnerable VACUUM process thrashing everything

From:
"Jeffrey W. Baker"
Date:
On Thu, 2005-12-29 at 22:53 +0000, Russ Garrett wrote:
> In my experience a kill -9 has never resulted in any data loss in this
> situation (it will cause postgres to detect that the process died, shut
> down, then recover), and most of the time it only causes a 5-10sec
> outage. I'd definitely hesitate to recommend it in a production context
> though, especially since I think there are some known race-condition
> bugs in 7.4.
>
> VACUUM *will* respond to a SIGTERM, but it doesn't check very often -
> I've often had to wait hours for it to determine that it's been killed,
> and my tables aren't anywhere near 1TB. Maybe this is a place where
> things could be improved...

FWIW, I murdered this process with SIGKILL, and the recovery was very
short.


> Incidentally, I have to kill -9 some of our MySQL instances quite
> regularly because they do odd things. Not something you want to be
> doing, especially when MySQL takes 30mins to recover.

Agreed.  After a mysql shutdown with MyISAM, all tables must be checked
and usually many need to be repaired.  This takes a really long time.
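
(For reference, MyISAM repair after an unclean shutdown is typically
done with myisamchk while mysqld is stopped; the data directory path
here is only illustrative:)

    # check and rebuild damaged MyISAM tables; path is illustrative
    myisamchk --recover /var/lib/mysql/mydb/*.MYI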

-jwb


Re: Invulnerable VACUUM process thrashing everything

From:
Russ Garrett
Date:
Thu, 2005-12-29 22:53 +0000
In my experience a kill -9 has never resulted in any data loss in this
situation (it will cause postgres to detect that the process died, shut
down, then recover), and most of the time it only causes a 5-10sec
outage. I'd definitely hesitate to recommend it in a production context
though, especially since I think there are some known race-condition
bugs in 7.4.

VACUUM *will* respond to a SIGTERM, but it doesn't check very often -
I've often had to wait hours for it to determine that it's been killed,
and my tables aren't anywhere near 1TB. Maybe this is a place where
things could be improved...
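
In practice the escalation looks like this, using the backend PID from
the original post:

    # polite: ask the backend to abort its current command; a VACUUM
    # may take a long time to notice, as described above
    kill -TERM 15308
    # last resort: forces the postmaster to restart all backends and
    # run WAL recovery, as discussed above
    kill -KILL 15308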

Incidentally, I have to kill -9 some of our MySQL instances quite
regularly because they do odd things. Not something you want to be
doing, especially when MySQL takes 30mins to recover.

Russ Garrett
Last.fm Ltd.
russ@last.fm

Ron wrote:

> Ick.  Can you get users and foreign connections off that machine, lock
> them out for some period, and renice the VACUUM?
> [...]

Re: Invulnerable VACUUM process thrashing everything

From:
Tom Lane
Date:
Russ Garrett <russ@garrett.co.uk> writes:
> VACUUM *will* respond to a SIGTERM, but it doesn't check very often -
> I've often had to wait hours for it to determine that it's been killed,
> and my tables aren't anywhere near 1TB. Maybe this is a place where
> things could be improved...

Hmm, there are CHECK_FOR_INTERRUPTS calls in all the loops that seem
significant to me.  Is there anything odd about your database schema?
Unusual index types or data types maybe?  Also, what PG version are
you using?

If you notice a VACUUM not responding to SIGTERM promptly, it'd be
useful to attach to the backend process with gdb and get a stack trace
to find out what it's doing.
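
For example, non-interactively (PID from the original post; run as a
user allowed to ptrace the backend):

    # attach to the stuck backend, print a backtrace, and detach
    gdb --batch -ex bt -p 15308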

            regards, tom lane

Re: Invulnerable VACUUM process thrashing everything

From:
Markus Schaber
Date:
Hi, Jeffrey,

Jeffrey W. Baker wrote:
> A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
> decided to VACUUM a table which has not been updated in over a year and
> is more than one terabyte on the disk.

Hmm, maybe this is Transaction ID wraparound protection kicking in, and
VACUUM is freezing the rows.

Did you VACUUM FREEZE the table after the last modifications?
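
A minimal sketch, using the database name from the ps output above and
a hypothetical table name:

    # freeze a table that will no longer be modified, so later
    # wraparound vacuums have nothing left to do on it
    # ("mytable" is hypothetical)
    psql -d skunk -c 'VACUUM FREEZE mytable;'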

> # kill -HUP 15308
> # kill -INT 15308
> # kill -PIPE 15308

Did you try kill -TERM?

This always cleanly ended VACUUMing backends on our machines within seconds.

> I assume that if I kill this with SIGKILL, that will bring down every
> other postgres process, so that should be avoided.  But surely there is
> a way to interrupt this.  If I had some reason to shut down the
> instance, I'd be screwed, it seems.

Yes, SIGKILLing a backend will make the postmaster shut down all running
backend instances, the same as SIGSEGV and possibly a few others.

The reason is that the postmaster assumes the internal data structures
in the shared memory pages may be corrupted after an "unclean" backend
abort, and thus shuts everything down immediately to minimize the chance
of that corruption propagating to the disks.
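
(If the whole instance has to come down anyway, a clean shutdown avoids
the unclean-abort path entirely; the data directory is illustrative:)

    # "fast" mode rolls back active transactions, including a running
    # VACUUM, and shuts down cleanly
    pg_ctl stop -D /var/lib/pgsql/data -m fast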

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org