Re: Offline enabling/disabling of data checksums

Поиск
Список
Период
Сортировка
От Michael Banck
Тема Re: Offline enabling/disabling of data checksums
Дата
Msg-id 1553013016.9697.57.camel@credativ.de
обсуждение исходный текст
Ответ на Re: Offline enabling/disabling of data checksums  (Andres Freund <andres@anarazel.de>)
Ответы Re: Offline enabling/disabling of data checksums
Список pgsql-hackers
Hi,

Am Dienstag, den 19.03.2019, 09:13 -0700 schrieb Andres Freund:
> On 2019-03-19 17:08:17 +0100, Michael Banck wrote:
> > Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund:
> > > On 2019-03-19 16:55:12 +0100, Michael Banck wrote:
> > > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> > > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > > > > > +/*
> > > > > > + * Locations of persistent and temporary control files.  The control
> > > > > > + * file gets renamed into a temporary location when enabling checksums
> > > > > > + * to prevent a parallel startup of Postgres.
> > > > > > + */
> > > > > > +#define CONTROL_FILE_PATH        "global/pg_control"
> > > > > > +#define CONTROL_FILE_PATH_TEMP    CONTROL_FILE_PATH ".pg_checksums_in_progress"
> > > > > 
> > > > > I think this should be outright rejected. Again, you're making the
> > > > > control file into something it isn't. And there's no buyin for this as
> > > > > far as I can tell outside of Fabien and you. For crying out loud, if the
> > > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER.
> > > > 
> > > > The cluster is supposed to be offline during this.  This is just an
> > > > additional precaution so that nobody starts it during the operation -
> > > > similar to how pg_upgrade disables the old data directory.
> > > 
> > > I don't see how that matters. Afterwards the cluster needs low level
> > > surgery to be recovered. That's a) undocumented b) likely to be done
> > > wrongly.  This is completely unacceptable *AND UNNECESSARY*.
> > 
> > Can you explain why low level surgery is needed and how that would look
> > like?
> > 
> > If pg_checksums successfully enables checksums, it will move back the
> > control file and update the checksum version - the cluster is ready to
> > be started again unless I am missing something?
> > 
> > If pg_checksums is interrupted by the admin, it will move back the
> > control file and the cluster is ready to be started again as well.
> > 
> > If pg_checksums aborts with a failure, the admin will have to move back
> > the control file before starting up the instance again, but I don't
> > think that counts?
> 
> That absolutely counts. Even a short period would imo be unacceptable,
> but this will take *hours* in many clusters. It's completely possible
> that the machine crashes while the enabling is in progress.
> 
> And after restarting postgres or even pg_checksums, you'll just get a
> message that there's no control file. How on earth is a normal user
> supposed to recover from that?  Now, you could have a check for the
> control file under the temporary name, and emit a hint about renaming,
> but that has its own angers (like people renaming it just to start
> postgres).

Ok, thanks for explaining. 

I guess if we check for the temporary name in postmaster during startup
if pg_control isn't there then a more generally useful name like
"pg_control.maintenance" should be picked. We could then spit out a nice
error message or hint like "the cluster has been disabled for
maintenance. In order to start it up anyway, rename
pg_control.maintenance to pg_control" or so.

In any case, this would be more of a operational or availability issue
and not a data-loss issue, as I feared from your previous mails.

> And you're basically adding it because Fabien doesn't like
> postmaster.pid and wants to invent another lockout mechanism in this
> thread.

I think the hazard of another DBA (or some automated configuration
management or HA tool for that matter) looking at postmaster.pid,
deciding that it is not a legit file from a running instance, deleting
it and then starting up Postgres while pg_checksums is still at work is
worse than the above scenario, but maybe if we make the content of
postmaster.pid clear enough (like "maintenance in progress"?) it would
be enough of a hint? Or do you have concrete suggestions on how this
should work?

I had the feeling (ab)using postmaster.pid for this would fly less than
using the same scheme as pg_upgrade does, but I'm fine doing it either
way.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Offline enabling/disabling of data checksums
Следующее
От: David Fetter
Дата:
Сообщение: Re: Willing to fix a PQexec() in libpq module