Re: [HACKERS] Unlogged tables cleanup

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: [HACKERS] Unlogged tables cleanup
Дата
Msg-id 20190521081729.GJ1921@paquier.xyz
обсуждение исходный текст
Ответ на Re: [HACKERS] Unlogged tables cleanup  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: [HACKERS] Unlogged tables cleanup  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Mon, May 20, 2019 at 09:16:32AM -0400, Robert Haas wrote:
> Based on the discussion so far, I think there are a number of related
> problems here:

Thanks for the summary.

> 1. A base backup might copy the main fork but not the init fork. I
> think that's a problem that nobody's thought about before Andres
> raised it on this thread.  I may be wrong. Anyway, if that happens,
> the start-of-recovery code isn't going to do the right thing, because
> it won't know that it's dealing with an unlogged relation.

Hmm.  I got to think harder about this one, and I find the argument
of Andres from upthread to use (CLEANUP | INIT) for
ResetUnloggedRelations() at the end of recovery quite sensible:
https://www.postgresql.org/message-id/20190513173735.3znpqfkb3gasig6v@alap3.anarazel.de

It seems that the main risk is to finish with partially-present index
main forks.  For heap, the main would still be empty.

> 2. Suppose the system reaches the end of
> heapam_relation_set_new_filenode and then the system crashes.  Because
> of the smgrimmedsync(), and only because of the smgrimmedsync(), the
> init fork would exist at the start of recovery.  However, unlike the
> heap, some of the index AMs don't actually have a call to
> smgrimmedsync() in their *buildempty() functions.  If I'm not
> mistaken, the ones that write pages through shared_buffers do not do
> smgrimmedsync, and the ones that use private buffers do perform
> smgrimmedsync.

Yes, that maps with what I can see in the code for the various empty()
callbacks.

> Therefore, the ones that write pages through
> shared_buffers are vulnerable to a crash right after the unlogged
> table is created.  The init fork could fail to survive the crash, and
> therefore the start-of-recovery code would again fail to know that
> it's dealign with an unlogged relation.

Taking the example of gist which uses shared buffers for the init
fork logging, we take an exclusive lock on the buffer involved while
logging the init fork, meaning that the checkpoint is not able to
complete until the lock is released and the buffer is flushed.  Do you
have a specific sequence in mind?

> 3. So, why is it like that, anyway?  Why do index AMs that write pages
> via shared_buffers not have smgrimmedsync?  The answer is that we did
> it like that because we were worrying about a different problem -
> specifically, checkpointing.  If I dirty a buffer and write a WAL
> record for the changes that I made, it is guaranteed that the dirty
> data will make it to disk before the next checkpoint is written; we've
> got all sorts of interlocking in BufferSync, SyncOneBuffer, etc. to
> make sure that happens.  But if I create a file in the filesystem and
> write a WAL record for that change, none of that interlocking does
> anything.  So unless I not only create that file but smgrimmedsync()
> it before the next checkpoint's redo location is fixed, a crash could
> make the file disappear.
>
> On restart, I think we'll could potentially be missing the file
> created by smgrcreate(), and we won't replay the WAL record generated
> by log_smgrcreate() either because it's before the checkpoint.

Right, that's possible.  If you take the case of btree, the problem is
the window between the sync and the page logging which is unsafe.

> I believe that it is one aspect of this third problem that we
> previously fixed on this thread, but I think we failed to understand
> how general the issue actually is.  It affects _init forks of unlogged
> tables, but I think it also affects essentially every other use of
> smgrcreate(), whether it's got anything to do with unlogged tables or
> not. For an index AM, which has a non-empty initial representation,
> the consequences are pretty limited, because the transaction can't
> commit having created only an empty fork.  It's got to put some data
> into either the main fork or the init fork, as the case may be.  If it
> aborts, then we may leave some orphaned files behind, but if we lose
> one, we just have fewer orphaned files, so nobody cares.  And if it
> commits, then either everything's before the checkpoint (and the file
> will have been fsync'd because we fsync'd the buffers), or
> everything's after the checkpoint (so we will replay the WAL), or only
> the smgrcreate() is before the checkpoint and the rest is after (in
> which case XLogReadBufferExtended will do the smgrcreate over again
> for us and we'll be fine).  But since the heap uses an empty initial
> representation, it enjoys no similar protection.

Not sure what to think about this paragraph.  I need to ponder it
more.

> So, IIUC, the reason why we were talking about delayCkpt before is
> because it would allow us to remove the smgrimmedsync() of the
> unlogged fork while still avoiding problem #3.  However it does
> nothing to protect against problem #1 or #2.

Right.  I am not completely sure why #2 is a problem though, please
see my comments above.  For #1 the point raised upthread to do a
cleanup + init at the end of recovery makes sense.
--
Michael

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Matwey V. Kornilov"
Дата:
Сообщение: Re: [PATCH v2] Introduce spgist quadtree @<(point,circle) operator
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: with oids option not removed in pg_dumpall