Re: Startup process thrashing

Поиск
Список
Период
Сортировка
От Phillip Berry
Тема Re: Startup process thrashing
Дата
Msg-id 200812111548.22986.pberry@stellaconcepts.com
обсуждение исходный текст
Ответ на Re: Startup process thrashing  (Greg Smith <gsmith@gregsmith.com>)
Ответы Re: Startup process thrashing
Список pgsql-general
Hi Greg,

I appreciate the reply.  Fortunately within the last 10 minutes it has finished the recovery...and
then promptly shut itself down again.

The exact error is in fact:

FATAL:  database is not accepting commands to avoid wraparound data loss in database "aim"
2008-12-10 06:00:02 CST [21392]: [4-1] HINT:  Stop the postmaster and use a standalone backend to
vacuum database "aim".

So I'm back to where I was last night which is a slight relief.  The difference being that the
pg_xlog is still full and when I attempt to start Postgres in single user mode it fails becasue of
that.

I'm not running PITR and checkpoint_segments is set to 100 as this is home to a very write intensive
app.  The db has been running for over a year and in that time has been vacuumed many times without
issue.

So now I guess I have two questions:

1. Will copying the pg_xlog dir onto a larger partition cause any problems?
2. Is there any reason that vacuuming would create so many WAL files? I've looked around and can't
find any explaination, though it's a problem that seems to have happened before.
Actually make that three questions...
3. I installed pgbouncer yesterday for connection pooling, does the fact that it holds connections
open have any bearing on how rapidly the transaction IDs are used up? As in are transaction IDs in
anyway related to connections? It's a pretty interesting coincidence.

Cheers
Phil










On Thursday 11 December 2008 15:20:21 you wrote:
> On Thu, 11 Dec 2008, Phillip Berry wrote:
> > I've got a bit of a problem.  It started last night when postgres
> > (8.1.9) went down citing the need for a vacuum full to be done due to
> > the transaction log needing to wraparound.
>
> Not exactly.  What it said was "To avoid a database shutdown, execute a
> full-database VACUUM".  In that context, "full" means you vacuum
> everything in the database, but only regular VACUUM is needed.  VACUUM
> FULL, as you learned the hard way, is a more intensive operation, and it's
> not needed to resolve the problem you started with.  It's a somewhat
> unfortunately worded HINT.
>
> > During the vacuum of the larger of the databases a few hours in it
> > failed, it's filled up the 18GB pg_xlog partition with over 1000 wal
> > files.  Due to running out of space the vacuum failed.
>
> Ouch.  Are you running PITR recovery by setting archive_command?  Did you
> set checkpoint_segments to some very high value?  1000 WAL files is not
> normal, curious how you ended up with so many of them.
>
> > When I came in this morning I attempted to start postgres using the
> > normal init script, and now it's stuck.  The startup process is thrashing
> > the disks and working hard, pg_controldata says it's in recovery, but
> > it's been going for over two hours.
>
> It takes a long time to sort through 1000 WAL files and figure out if the
> database is consistent for every transactions mentioned there.  If your
> VACUUM FULL ran for several hours and kicked out 1000 of them, it would be
> reasonable to expect the cleanup to also take many hours.
>
> > My question is where I should go from here?  Should i kill the startup
> > script, clear out the excess wal files, start the standalone server and
> > try vacuum again?
>
> Deleting the WAL files like that will leave your database completely
> trashed.  The utility that's provided to do the job you're considering is
> pg_resetxlog:
>
> http://www.postgresql.org/docs/8.1/static/app-pgresetxlog.html
>
> Which is slightly safer, but note the dire warnings there.  You are very
> likely to get some sort of data corruption if you do that, and you won't
> know where it is.  You'll be facing a dump and restore to sort that out,
> and if you think the server startup is taking a while on a 156GB database
> you're really not going to be happy with how long a restore takes.
>
> The safest thing you can do here is just wait for the server to finish
> recovery so it starts up.  Watch the system activity with something like
> vmstat.  If the server process is busy using the CPU and it's doing stuff
> with the disks, if you have evidence it's making progress, you'll be hard
> pressed to execute any manual recovery that's any safer or more efficient
> than that is.
>
> Someone else may be able to point you toward better estimating how far
> it's got left to go, I haven't ever been stuck in your position for long
> enough before to figure that out myself.  Good luck.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD



В списке pgsql-general по дате отправления:

Предыдущее
От: "Scott Marlowe"
Дата:
Сообщение: Re: Data Replication
Следующее
От: "Adam Rich"
Дата:
Сообщение: Re: Multi-table CHECK constraint