Re: Does PostgreSQL check database integrity at startup?

Поиск
Список
Период
Сортировка
От Stephen Frost
Тема Re: Does PostgreSQL check database integrity at startup?
Дата
Msg-id 20171227213314.GY4628@tamriel.snowman.net
обсуждение исходный текст
Ответ на Re: Does PostgreSQL check database integrity at startup?  (Edson Carlos Ericksson Richter <richter@simkorp.com.br>)
Ответы Re: Does PostgreSQL check database integrity at startup?  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Список pgsql-general
Edson,

* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> Em 27/12/2017 15:02, Stephen Frost escreveu:
> >* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> >>Stephen Frost wrote:
> >>
> >>>* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> >>>>Anyway, instead digging into rsync functionality (or bugs - I doubt,
> >>>>but who knows?), I do prefer to have a script I can run to check if
> >>>>there is obvious failures in standby servers.
> >>>As mentioned, zero-byte files can be perfectly valid.  PostgreSQL does
> >>>have page-level CRCs, if you initialized your database with them (which
> >>>I would strongly recommend).
> >>Page-level checksums would not detect the problem being complained in
> >>this thread, however.
> >It's entirely unclear to me what the problem being complained about in
> >this thread actually is.  The complaint so far was about zero-byte
> >files, but those are entirely valid, so that isn't a problem that anyone
> >can solve..
> >
> >Given the thread subject, if someone actually wanted to do a database
> >integrity check before startup, they could use pgBackRest to perform a
> >backup with a CRC-enabled database and at least verify that all of the
> >checksums are valid.
> >
> >We could possibly look into adding some set of additional checks for
> >files which can't actually be zero-byte, perhaps..  I know we have some
> >other one-off checks already.
>
> Actually, the problem is:
>
> Master => Slave => Backup
>
> In the master server everything is fine.

Alright..

> But at some point in time, slave became corrupt (one of the base
> files are zero size where it should be 16Mb in size), and IMHO a
> "red alert" should arise - Slave server shall not even startup at
> all.

How do you know it should be 16Mb in size...?  That sounds like you're
describing a WAL file, but you should be archiving your WAL files during
a backup, not just using whatever is in pg_xlog/pg_wal..

> Since backups are taken from slave server, all backups are also corrupt.

If you aren't following the appropriate process to perform a backup
then, yes, you're going to end up with corrupt and useless/bad backups.

Backing up from a replica has only been officially supported using the
pg_start/stop_backup methods as of 9.6 and only when doing a
non-exclusive backup.

Note that the wiki page you're talking about (I think, anyway...) is
describing *exclusive* backup, not non-exclusive, and the two are not
the same.

> I've detected the problem just because I've restored a backup
> (excellent practice perhaps - nobody should take backups if not
> testing it with the restore procedure).

Yes, restoring a backup is excellent practice and something that
everyone really should be doing.  In my view, at least, everyone should
also be using well tested backup tools instead of trying to write their
own.

> In slave server there is no indication that the database is corrupt
> (not in logs, it starts normally and show it is applying stream
> changes regularly).

This sounds like what's happening is that you're ending up with PG
thinking that a crash happened because backup_label is missing, which
will happen if you do pg_start/stop_backup on the replica and don't make
sure to take the results from pg_stop_backup and create the backup_label
file before starting PostgreSQL after the restore.

This isn't something that would happen if you used a backup tool that
knew about how to perform a non-exclusive backup or how to properly
backup using a replica instead of trying to write your own without
understanding how all these pieces play together.

> So that is the point: how to detect that a database is corrupt so
> cluster doesn't even start...

I'm not sure that the database is actually corrupt in this specific
case- it sounds like everything is actually fine, but you didn't include
the backup_label file when restoring and therefore PG thinks there is
crash recovery happening when it should be replaying WAL from the start
of the backup, but how is PG going to know that?  Well, it'd know that
from the backup_label file, if it was there..  Otherwise, it seems
pretty difficult for us to know that we're not doing crash recovery.
I'm certainly open to ideas on how to detect that, but nothing springs
to mind off-hand.

Again, using a backup tool instead of trying to roll your own would make
this much less likely to happen.  This isn't something which should
happen when you're using pg_basebackup or the other PG-specific backup
tools, rather it's only happening because you're trying to do your own
with pg_start/stop_backup and rsync and didn't completely read the
documentation on non-exclusive backups from replicas (that said, those
docs could certainly use improvment...).

Of course, perhaps I'm misunderstanding exactly what you're doing or
what file you're referring to, but this is my best guess based on the
information you've provided so far.

Thanks!

Stephen

Вложения

В списке pgsql-general по дате отправления:

Предыдущее
От: Stephen Frost
Дата:
Сообщение: Re: Does PostgreSQL check database integrity at startup?
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: Does PostgreSQL check database integrity at startup?