Обсуждение: Invulnerable VACUUM process thrashing everything
A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally decided to VACUUM a table which has not been updated in over a year and is more than one terabyte on the disk. Because of the very high transaction load on this database, this VACUUM has been ruining performance for almost a month. Unfortunately is seems invulnerable to killing by signals: # ps ax | grep VACUUM 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM # kill -HUP 15308 # ps ax | grep VACUUM 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM # kill -INT 15308 # ps ax | grep VACUUM 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM # kill -PIPE 15308 # ps ax | grep VACUUM 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM o/~ But the cat came back, the very next day ... I assume that if I kill this with SIGKILL, that will bring down every other postgres process, so that should be avoided. But surely there is a way to interrupt this. If I had some reason to shut down the instance, I'd be screwed, it seems. -jwb
Ick. Can you get users and foreign connections off that machine, lock them out for some period, and renice the VACUUM? Shedding load and keeping it off while VACUUM runs high priority might allow it to finish in a reasonable amount of time. Or Shedding load and dropping the VACUUM priority might allow a kill signal to get through. Hope this helps, Ron At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote: >A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally >decided to VACUUM a table which has not been updated in over a year and >is more than one terabyte on the disk. Because of the very high >transaction load on this database, this VACUUM has been ruining >performance for almost a month. Unfortunately is seems invulnerable to >killing by signals: > ># ps ax | grep VACUUM >15308 ? D 588:00 postgres: postgres skunk [local] VACUUM ># kill -HUP 15308 ># ps ax | grep VACUUM >15308 ? D 588:00 postgres: postgres skunk [local] VACUUM ># kill -INT 15308 ># ps ax | grep VACUUM >15308 ? D 588:00 postgres: postgres skunk [local] VACUUM ># kill -PIPE 15308 ># ps ax | grep VACUUM >15308 ? D 588:00 postgres: postgres skunk [local] VACUUM > >o/~ But the cat came back, the very next day ... > >I assume that if I kill this with SIGKILL, that will bring down every >other postgres process, so that should be avoided. But surely there is >a way to interrupt this. If I had some reason to shut down the >instance, I'd be screwed, it seems.
On Thu, 2005-12-29 at 22:53 +0000, Russ Garrett wrote: > In my experience a kill -9 has never resulted in any data loss in this > situation (it will cause postgres to detect that the process died, shut > down, then recover), and most of the time it only causes a 5-10sec > outage. I'd definitely hesitate to recommend it in a production context > though, especially since I think there are some known race-condition > bugs in 7.4. > > VACUUM *will* respond to a SIGTERM, but it doesn't check very often - > I've often had to wait hours for it to determine that it's been killed, > and my tables aren't anywhere near 1TB. Maybe this is a place where > things could be improved... FWIW, I murdered this process with SIGKILL, and the recovery was very short. > Incidentally, I have to kill -9 some of our MySQL instances quite > regularly because they do odd things. Not something you want to be > doing, especially when MySQL takes 30mins to recover. Agreed. After mysql shutdown with MyISAM, all tables must be checked and usually many need to be repaired. This takes a reallllllly long time. -jwb > Russ Garrett > Last.fm Ltd. > russ@last.fm > > Ron wrote: > > > Ick. Can you get users and foreign connections off that machine, lock > > them out for some period, and renice the VACUUM? > > > > Shedding load and keeping it off while VACUUM runs high priority might > > allow it to finish in a reasonable amount of time. > > Or > > Shedding load and dropping the VACUUM priority might allow a kill > > signal to get through. > > > > Hope this helps, > > Ron > > > > > > At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote: > > > >> A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally > >> decided to VACUUM a table which has not been updated in over a year and > >> is more than one terabyte on the disk. Because of the very high > >> transaction load on this database, this VACUUM has been ruining > >> performance for almost a month. Unfortunately is seems invulnerable to > >> killing by signals: > >> > >> # ps ax | grep VACUUM > >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM > >> # kill -HUP 15308 > >> # ps ax | grep VACUUM > >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM > >> # kill -INT 15308 > >> # ps ax | grep VACUUM > >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM > >> # kill -PIPE 15308 > >> # ps ax | grep VACUUM > >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM > >> > >> o/~ But the cat came back, the very next day ... > >> > >> I assume that if I kill this with SIGKILL, that will bring down every > >> other postgres process, so that should be avoided. But surely there is > >> a way to interrupt this. If I had some reason to shut down the > >> instance, I'd be screwed, it seems. > > > > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 2: Don't 'kill -9' the postmaster > > >
In my experience a kill -9 has never resulted in any data loss in this situation (it will cause postgres to detect that the process died, shut down, then recover), and most of the time it only causes a 5-10sec outage. I'd definitely hesitate to recommend it in a production context though, especially since I think there are some known race-condition bugs in 7.4. VACUUM *will* respond to a SIGTERM, but it doesn't check very often - I've often had to wait hours for it to determine that it's been killed, and my tables aren't anywhere near 1TB. Maybe this is a place where things could be improved... Incidentally, I have to kill -9 some of our MySQL instances quite regularly because they do odd things. Not something you want to be doing, especially when MySQL takes 30mins to recover. Russ Garrett Last.fm Ltd. russ@last.fm Ron wrote: > Ick. Can you get users and foreign connections off that machine, lock > them out for some period, and renice the VACUUM? > > Shedding load and keeping it off while VACUUM runs high priority might > allow it to finish in a reasonable amount of time. > Or > Shedding load and dropping the VACUUM priority might allow a kill > signal to get through. > > Hope this helps, > Ron > > > At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote: > >> A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally >> decided to VACUUM a table which has not been updated in over a year and >> is more than one terabyte on the disk. Because of the very high >> transaction load on this database, this VACUUM has been ruining >> performance for almost a month. Unfortunately is seems invulnerable to >> killing by signals: >> >> # ps ax | grep VACUUM >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM >> # kill -HUP 15308 >> # ps ax | grep VACUUM >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM >> # kill -INT 15308 >> # ps ax | grep VACUUM >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM >> # kill -PIPE 15308 >> # ps ax | grep VACUUM >> 15308 ? D 588:00 postgres: postgres skunk [local] VACUUM >> >> o/~ But the cat came back, the very next day ... >> >> I assume that if I kill this with SIGKILL, that will bring down every >> other postgres process, so that should be avoided. But surely there is >> a way to interrupt this. If I had some reason to shut down the >> instance, I'd be screwed, it seems. > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster >
Russ Garrett <russ@garrett.co.uk> writes: > VACUUM *will* respond to a SIGTERM, but it doesn't check very often - > I've often had to wait hours for it to determine that it's been killed, > and my tables aren't anywhere near 1TB. Maybe this is a place where > things could be improved... Hmm, there are CHECK_FOR_INTERRUPTS calls in all the loops that seem significant to me. Is there anything odd about your database schema? Unusual index types or data types maybe? Also, what PG version are you using? If you notice a VACUUM not responding to SIGTERM promptly, it'd be useful to attach to the backend process with gdb and get a stack trace to find out what it's doing. regards, tom lane
Hi, Jeffrey, Jeffrey W. Baker wrote: > A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally > decided to VACUUM a table which has not been updated in over a year and > is more than one terabyte on the disk. Hmm, maybe this is the Transaction ID wraparound emerging, and VACUUM is freezing the rows. Did you VACUUM FREEZE the table after the last modifications? > # kill -HUP 15308 > # kill -INT 15308 > # kill -PIPE 15308 Did you try kill -TERM? This always cleanly ended VACUUMing backends on our machines within seconds. > I assume that if I kill this with SIGKILL, that will bring down every > other postgres process, so that should be avoided. But surely there is > a way to interrupt this. If I had some reason to shut down the > instance, I'd be screwed, it seems. Yes, SIGKILL will make the postmaster shut down all running backend instances, the same as SIGSEGV and possibly a few others. The reason is that the postmaster assumes some internal data structure corruption in the shared memory pages is possible on an "unclean" backend abort, and thus quits immediately to minimize the possibility of those corruptions to propagate to the disks. HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org