Discussion: Add a log message on recovery startup before syncing datadir


Add a log message on recovery startup before syncing datadir

From: Michael Banck
Date:
Hi,

we had a customer incident recently where they needed to do a PITR.
Their data directory is on a NetApp NFS and they have several hundred
databases in their instance. The startup sync (i.e. before the message
"starting archive recovery" appears) took 20 minutes and during the
first try[1] they were wondering what's going on because there is just
one log message ("database system was interrupted; last known up at
...") and the postmaster process is in state 'D'. Attaching strace
revealed that it was syncing files and due to the NFS performance that
took a long time.

I propose to add a message "syncing data directory" before running
SyncDataDirectory() in StartupXLOG() to make that more apparent to the
user; see attached. That should make it clear to users that Postgres is
going to do some work which, depending on their hardware, could take a
while.
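
For illustration, here is a minimal sketch of the kind of change I
mean (the attached patch is what counts; the surrounding StartupXLOG()
code is simplified here and the exact condition may differ):

    /* xlog.c, StartupXLOG(): announce the potentially slow sync up front */
    if (ControlFile->state != DB_SHUTDOWNED &&
        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        ereport(LOG,
                (errmsg("syncing data directory")));
        SyncDataDirectory();
    }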

Thoughts?


Michael

[1] don't ask

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz

Attachments

Re: Add a log message on recovery startup before syncing datadir

From: Thomas Munro
Date:
On Wed, Oct 7, 2020 at 8:58 PM Michael Banck <michael.banck@credativ.de> wrote:
> we had a customer incident recently where they needed to do a PITR.
> Their data directory is on a NetApp NFS and they have several hundred
> databases in their instance. The startup sync (i.e. before the message
> "starting archive recovery" appears) took 20 minutes and during the

Nice data point.

> first try[1] they were wondering what's going on because there is just
> one log message ("database system was interrupted; last known up at
> ...") and the postmaster process is in state 'D'. Attaching strace
> revealed that it was syncing files and due to the NFS performance that
> took a long time.

No objection to adding a message, but see also this other thread,
about potential ways to get rid of that sync completely, or at least
the phase where you have to open all the files one by one:

https://www.postgresql.org/message-id/flat/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com

Also, maybe of interest for PITR use cases, see this other thread
about relaxing the end-of-recovery checkpoint (well the patch doesn't
do that yet but it'd be a small step to not wait for it, based on a
GUC, once the checkpointer is running):

https://commitfest.postgresql.org/30/2706/



Re: Add a log message on recovery startup before syncing datadir

From: Michael Banck
Date:
Hi,

On Wednesday, 2020-10-07 at 21:11 +1300, Thomas Munro wrote:
> On Wed, Oct 7, 2020 at 8:58 PM Michael Banck <michael.banck@credativ.de> wrote:
> > we had a customer incident recently where they needed to do a PITR.
> > Their data directory is on a NetApp NFS and they have several hundred
> > databases in their instance. The startup sync (i.e. before the message
> > "starting archive recovery" appears) took 20 minutes and during the
> 
> Nice data point.

In the thread you pointed to below, Robert also mentions "tens of
minutes".

> > first try[1] they were wondering what's going on because there is just
> > one log message ("database system was interrupted; last known up at
> > ...") and the postmaster process is in state 'D'. Attaching strace
> > revealed that it was syncing files and due to the NFS performance that
> > took a long time.
> 
> No objection to adding a message, but see also this other thread,
> about potential ways to get rid of that sync completely, or at least
> the phase where you have to open all the files one by one:
> 
> https://www.postgresql.org/message-id/flat/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com

Thanks, I've re-read that now. As a summary, Tom said that the syncs
of the relation files/tablespaces can't go away. Robert suggested a
periodic status update showing that we're still doing work. You
proposed using syncfs. There was a discussion about omitting some
files/directories, but I think those don't matter much for people who
see extreme delays here, because those delays are very likely due to
relation files.

I had a quick look yesterday and it seems to me the code that actually
does the syncing has no notion of how many files we have synced already
or will sync in total, and adding that along with the necessary
book-keeping would complicate the calling pattern considerably. So I
don't think a status update like "finished syncing 10000 out of 23054
files" would (i) be back-patchable or (ii) avoid being even slower than
now, because we would first have to figure out how much work there is
to do.

So I would suggest going with a minimal message before starting to
sync, as proposed, which I think should be back-patchable, and also
trying to make it faster (e.g. via syncfs) for v14.


Michael





Re: Add a log message on recovery startup before syncing datadir

From: Thomas Munro
Date:
On Wed, Oct 7, 2020 at 10:18 PM Michael Banck <michael.banck@credativ.de> wrote:
> On Wednesday, 2020-10-07 at 21:11 +1300, Thomas Munro wrote:
> > On Wed, Oct 7, 2020 at 8:58 PM Michael Banck <michael.banck@credativ.de> wrote:
> > > "starting archive recovery" appears) took 20 minutes and during the
> > Nice data point.
> In the thread you pointed to below, Robert also mentions "tens of
> minutes".

I've also heard anecdata about much worse cases running for hours,
leading people to go and comment that code out in a special build.

> > https://www.postgresql.org/message-id/flat/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
>
> Thanks, I've re-read that now. As a summary, Tom said that the syncs
> of the relation files/tablespaces can't go away. Robert suggested a
> periodic status update showing that we're still doing work. You
> proposed using syncfs. There was a discussion about omitting some
> files/directories, but I think those don't matter much for people who
> see extreme delays here, because those delays are very likely due to
> relation files.

Actually I proposed two different alternatives, and am hoping to get
thoughts, support, or objections to help me figure out which way to go.
In summary:

1.  syncfs() is simple, and doesn't require much analysis to see that
it is about as good/bad at the job as the current coding while being
vastly more efficient.  It only works on Linux (a minimal standalone
sketch of the call follows after this list).
2.  Precise fsync() based on WAL contents is a more complicated
rethink from first principles, and requires careful analysis of the
arguments I've made about its safety.  It works on all systems, and is
more efficient than the big-syncfs-hammer, because it delays writing
dirty data that'll be redirtied by redo, so it's possible that it only
goes out to disk once.
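
For concreteness, here is a rough standalone sketch of what option 1
boils down to at the system call level (Linux-only, not PostgreSQL
code, and not an actual patch): a single syncfs() call flushes the
whole filesystem containing a given path, so a syncfs()-based
SyncDataDirectory() would presumably call it once for the data
directory and once for each tablespace or pg_wal that lives on a
different filesystem, instead of opening and fsync()ing every file.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : ".";
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* One call flushes all dirty data of the containing filesystem. */
        if (syncfs(fd) < 0)
        {
            perror("syncfs");
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }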

> I had a quick look yesterday and it seems to me the code that actually
> does the syncing has no notion of how many files we have synced already
> or will sync in total, and adding that along with the necessary
> book-keeping would complicate the calling pattern considerably. So I
> don't think a status update like "finished syncing 10000 out of 23054
> files" would (i) be back-patchable or (ii) avoid being even slower than
> now, because we would first have to figure out how much work there is
> to do.

Actually I think you could find out how many files will be synced
quite quickly!  Reading directory contents is fast, because directory
entries hopefully live in sequential blocks and are potentially
prefetchable by the kernel (well, details vary from filesystem to
filesystem, but the point is that you don't need to examine the
potentially randomly scattered inodes just to count the files).  For
example, ResetUnloggedRelations() already does a full scan of relation
directory entries on crash startup, and I know from experiments with 1
million relations that that is fast, whereas SyncDataDirectory() is
very very slow even with no dirty data, only because it has to open
every file.  But this will be a moot question if I can get rid of the
problem completely, so I'm not personally going to investigate making
a progress counter.
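
As a throwaway illustration of that cost model (standalone,
non-recursive, not PostgreSQL code): counting entries needs only
readdir(), which reads the directory blocks themselves, whereas a
per-file fsync() additionally needs an open(), and therefore a visit
to every inode.

    #include <dirent.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : ".";
        DIR        *dir = opendir(path);
        struct dirent *de;
        long        count = 0;

        if (dir == NULL)
        {
            perror("opendir");
            return 1;
        }

        /* No open()/stat() per entry: the file inodes are never touched. */
        while ((de = readdir(dir)) != NULL)
            count++;

        closedir(dir);
        printf("%ld directory entries in %s\n", count, path);
        return 0;
    }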

> So I would suggest going with a minimal message before starting to
> sync, as proposed, which I think should be back-patchable, and also
> trying to make it faster (e.g. via syncfs) for v14.

Yeah, I am only proposing to change the behaviour in PG14.