Re: Hard limit on WAL space used (because PANIC sucks)

Поиск
Список
Период
Сортировка
От Josh Berkus
Тема Re: Hard limit on WAL space used (because PANIC sucks)
Дата
Msg-id 51B11A69.4050909@agliodbs.com
обсуждение исходный текст
Ответ на Hard limit on WAL space used (because PANIC sucks)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Ответы Re: Hard limit on WAL space used (because PANIC sucks)  (Jeff Janes <jeff.janes@gmail.com>)
Re: Hard limit on WAL space used (because PANIC sucks)  (Bernd Helmle <mailings@oopsware.de>)
Re: Hard limit on WAL space used (because PANIC sucks)  ("MauMau" <maumau307@gmail.com>)
Список pgsql-hackers
Let's talk failure cases.

There's actually three potential failure cases here:

- One Volume: WAL is on the same volume as PGDATA, and that volume is
completely out of space.

- XLog Partition: WAL is on its own partition/volume, and fills it up.

- Archiving: archiving is failing or too slow, causing the disk to fill
up with waiting log segments.

I'll argue that these three cases need to be dealt with in three
different ways, and no single solution is going to work for all three.

Archiving
---------

In some ways, this is the simplest case.  Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage.  Once we stop attempting to archive, we can
clean up the unneeded log segments.

What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind.  Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.

xLog Partition
--------------

As Heikki pointed, out, a full dedicated WAL drive is hard to fix once
it gets full, since there's nothing you can safely delete to clear
space, even enough for a checkpoint record.

On the other hand, it should be easy to prevent full status; we could
simply force a non-spread checkpoint whenever the available WAL space
gets 90% full.  We'd also probably want to be prepared to switch to a
read-only mode if we get full enough that there's only room for the
checkpoint records.

One Volume
----------

This is the most complicated case, because we wouldn't necessarily run
out of space because of WAL using it up.  Anything could cause us to run
out of disk space, including activity logs, swapping, pgsql_tmp files,
database growth, or some other process which writes files.

This means that the DBA getting out of disk-full manually is in some
ways easier; there's usually stuff she can delete.  However, it's much
harder -- maybe impossible -- for PostgreSQL to prevent this kind of
space outage.  There should be things we can do to make it easier for
the DBA to troubleshoot this, but I'm not sure what.

We could use a hard limit for WAL to prevent WAL from contributing to
out-of-space, but that'll only prevent a minority of cases.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Josh Berkus
Дата:
Сообщение: Re: Redesigning checkpoint_segments
Следующее
От: Jaime Casanova
Дата:
Сообщение: Re: Hard limit on WAL space used (because PANIC sucks)