Discussion: smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section

smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section

From: Andres Freund
Date:
Hi,

[1] made me notice these issues. The issues here are mostly independent,
but it's still worthwhile to read that thread - in particular because my
proposed solution for the problem is possibly somewhat related to this
issue. And, if we were to go for my more extreme proposal below, the fix
for this issue would probably also fix [1].


There are several smgr operations (e.g. smgrtruncate(), smgrdounlinkall())
that first do a DropRelFileNodesAllBuffers() and then perform the
filesystem operation. Without a critical section.
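
For illustration, a condensed paraphrase of the current smgrtruncate()
sequence (not the exact source; checks and comments are trimmed):

    void
    smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
    {
        /* throw away buffers for the to-be-removed blocks, dirty ones included */
        DropRelFileNodeBuffers(reln->smgr_rnode, forknum, nblocks);

        /* tell other backends to close their smgr references */
        CacheInvalidateSmgr(reln->smgr_rnode);

        /*
         * Only now truncate at the filesystem level. If this errors out, the
         * buffer contents discarded above are already gone - and nothing
         * escalates the failure to a PANIC.
         */
        smgrsw[reln->smgr_which].smgr_truncate(reln, forknum, nblocks);
    }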

As far as I can tell that's not acceptable.

Using smgrtruncate() as an example (because it's executed from backends
/ autovacuum rather than e.g. checkpointer, and because it's targeted at
relations that live past the current transaction):

DropRelFileNodeBuffers() throws away valid buffer contents. Therefore,
we'll be in a corrupt state should smgr_truncate fail. The on-disk state
will be the old file contents, but the in-memory state will not reflect
that. Consider e.g. the case that autovacuum pruned away all tuples from
a relation. That will result in pg_class.relfrozenxid being updated,
reflecting the new horizon. If we then do a DropRelFileNodeBuffers()
throwing away the modified buffer contents, but subsequently fail to
truncate the underlying file, we'll potentially have a lot of pages full
of tuples referencing xids older than relfrozenxid - tuples that scans
will try to return, as the underlying file still has its original size.

As far as I can tell it is, as things stand, simply never ok to do
DropRelFileNodeBuffers() - when there potentially are dirty pages -
outside of a critical section.


RelationTruncate() notes:
    /*
     * We WAL-log the truncation before actually truncating, which means
     * trouble if the truncation fails. If we then crash, the WAL replay
     * likely isn't going to succeed in the truncation either, and cause a
     * PANIC. It's tempting to put a critical section here, but that cure
     * would be worse than the disease. It would turn a usually harmless
     * failure to truncate, that might spell trouble at WAL replay, into a
     * certain PANIC.
     */

but I think this analysis is quite insufficient. As far as I can tell it
doesn't consider the issue outlined above, nor does it consider what
will happen when standbys/PITR replay the WAL records while the primary
considers the relation to be of a different length.


It seems to me that we either need to

a) write out all dirty buffers during DropRelFileNodeBuffers(), thereby
   preventing the issue of old page contents "coming back to life" after
   a failed truncation.
b) accept using a critical section, with the obvious consequence that
   failing to truncate would lead to a PANIC (a minimal sketch of this
   follows the list)
c) use a more complex protocol to invalidate buffers, ensuring there's
   no inconsistency between fs and shared_buffers.
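
For b), a minimal sketch, assuming it simply wraps the existing sequence
(START_CRIT_SECTION()/END_CRIT_SECTION() are the existing macros; any
ERROR in between then escalates to a PANIC):

    START_CRIT_SECTION();
    DropRelFileNodeBuffers(reln->smgr_rnode, forknum, nblocks);
    CacheInvalidateSmgr(reln->smgr_rnode);
    /* a failure here now PANICs instead of leaving buffers and file out of sync */
    smgrsw[reln->smgr_which].smgr_truncate(reln, forknum, nblocks);
    END_CRIT_SECTION();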


c) could e.g. be something like (a rough code sketch follows the list):

1) mark all buffers as about-to-be-dropped
2) CacheInvalidateSmgr()
3) truncate on filesystem level
4a) if that fails, remove the about-to-be-dropped flags, in a PG_CATCH block
4b) if that succeeds, fully remove the buffers to be dropped
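
A rough sketch of that control flow - all of the "about-to-be-dropped"
helpers below are hypothetical and just stand in for the steps above;
only CacheInvalidateSmgr() and the PG_TRY() machinery are existing APIs:

    /* step 1: mark the relation's buffers (hypothetical helper) */
    MarkRelBuffersDropPending(reln->smgr_rnode, forknum, nblocks);

    /* step 2 */
    CacheInvalidateSmgr(reln->smgr_rnode);

    PG_TRY();
    {
        /*
         * step 3: filesystem-level truncate only; in this scheme it would
         * no longer drop buffers itself (hypothetical low-level call)
         */
        smgr_filesystem_truncate(reln, forknum, nblocks);
    }
    PG_CATCH();
    {
        /* step 4a: truncation failed, make the buffers usable again */
        ClearRelBuffersDropPending(reln->smgr_rnode, forknum, nblocks);
        PG_RE_THROW();
    }
    PG_END_TRY();

    /* step 4b: truncation succeeded, actually evict the marked buffers */
    DropPendingBuffers(reln->smgr_rnode, forknum, nblocks);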

When another backend wants to use a buffer marked as about-to-be-dropped,
it would need to wait for that operation to finish (afaict this could
work neither the way StartBufferIO() does, nor the way
LockBufferForCleanup() does).

Such a scheme would have the advantage that truncation etc would behave
a lot more like normal buffer modifications. There'd e.g. afterwards be
an interlock between truncations and BufferSync() / checkpoints, which
imo is quite attractive.

On the other hand, the cost of DropRelFileNodeBuffers() needing to
perform an exhaustive search of shared_buffers would be quite
painful. The single DropRelFileNodeBuffers() call is already bad
enough. I guess we could amortize that slightly, by caching the buffer
locations for a limited number of buffers - in many cases where the
search cost is problematic there won't be too many pages for individual
relations present.


For me alternative a) is prohibitively expensive, and not worth
discussing much. I think I can live with b), but I know that I'm much
less concerned with PANICs in these types of situations than others. c)
seems worth investigating, but presumably would end up being too
complicated to backpatch.


I think several DropRelFileNodeBuffers() callers besides vacuum are a
bit less concerning. E.g. when dropping a relation, the actual file
removal happens as part of a checkpoint, and a failure there will IIRC
trigger a PANIC. And the in-place truncation case for TRUNCATE IIRC only
applies to relations created in the same subtransaction, which makes the
failure scenario above largely moot.


Tom, I seem to recall a recent thread of yours discussing a different
approach to truncation. I wonder if there's some intersection with
that. But unfortunately my search somehow has come up with nothing so
far - do you remember enough to find the thread?

Comments?

Andres Freund

[1] https://www.postgresql.org/message-id/20191206230640.2dvdjpcgn46q3ks2%40alap3.anarazel.de



Re: smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section

From: Amit Kapila
Date:
On Sat, Dec 7, 2019 at 5:42 AM Andres Freund <andres@anarazel.de> wrote:
>
> Tom, I seem to recall a recent thread of yours discussing a different
> approach to truncation. I wonder if there's some intersection with
> that. But unfortunately my search somehow has come up with nothing so
> far - do you remember enough to find the thread?
>

IIUC, a similar problem is discussed in the thread [1] where Tom
proposed a few solutions which are close to what you are proposing.

[1] - https://www.postgresql.org/message-id/1880.1281020817%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section

From: Andres Freund
Date:
Hi,

On 2019-12-07 11:07:04 +0530, Amit Kapila wrote:
> On Sat, Dec 7, 2019 at 5:42 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Tom, I seem to recall a recent thread of yours discussing a different
> > approach to truncation. I wonder if there's some intersection with
> > that. But unfortunately my search somehow has come up with nothing so
> > far - do you remember enough to find the thread?
> >
> 
> IIUC, a similar problem is discussed in the thread [1] where Tom
> proposed a few solutions which are close to what you are proposing.
> 
> [1] - https://www.postgresql.org/message-id/1880.1281020817%40sss.pgh.pa.us

It does indeed look like basically the same problem. I was actually
remembering a different thread, one that was more about truncating with
less heavyweight locking.

Having pondered this and some related problems (truncation, flushing
multiple buffers at once using asynchronous IO, PrefetchBuffer()
directly into buffers, cleanup lock implementation), I think we
basically need the ability to set something like BM_IO_IN_PROGRESS on
multiple buffers (perhaps signalling different forms of IO with
different flags, but I'm not sure it's needed).

I think we basically ought to replace the current IO buffer locking with
condition variables (as Robert has suggested at [1]). Instead of having
an array of LWLocks (BufferIOLWLockArray), we'd allocate one condition
variable for each buffer. I think we have enough space directly in
BufferDesc these days, due to the spinlock removal, but that seems like
a detail.
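
As a rough illustration - the io_cv field and the wait event name below
are made up; only the ConditionVariable API itself exists today - waiting
for in-progress IO on a buffer could then look roughly like:

    /*
     * Sketch only: assumes BufferDesc grew a "ConditionVariable io_cv"
     * member in place of its BufferIOLWLockArray entry, and that whoever
     * clears BM_IO_IN_PROGRESS does ConditionVariableBroadcast(&buf->io_cv).
     */
    static void
    WaitForBufferIO(BufferDesc *buf)
    {
        ConditionVariablePrepareToSleep(&buf->io_cv);
        while (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS)
            ConditionVariableSleep(&buf->io_cv, WAIT_EVENT_BUFFER_IO); /* illustrative wait event */
        ConditionVariableCancelSleep();
    }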

For truncation, we'd first iterate over all buffers once to mark them as
BM_IO_IN_PROGRESS, then we would truncate the underlying file. If the
truncation succeeds, we can use a local palloc'd array of IO_IN_PROGRESS
buffers to actually evict them. If the truncation fails, the same array
would be used to reset IO_IN_PROGRESS (basically AbortBufferIO, except
not setting BM_IO_ERROR for the truncation case).  This would solve the
problem of truncations leading to diverging fs/buffer state, would not
require a PANIC, and would allow concurrent buffer eviction to
efficiently wait for the IO to finish.
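
In code the control flow might look roughly like this - all helper names
are invented placeholders for the steps above; only palloc()/pfree() and
the BM_IO_IN_PROGRESS flag exist as such:

    int        *bufids;
    int         nbufs;

    /* pass over shared_buffers: set BM_IO_IN_PROGRESS, remember the buffers */
    bufids = MarkRelBuffersIOInProgress(reln->smgr_rnode, forknum, nblocks, &nbufs);

    if (FileTruncateRel(reln, forknum, nblocks))
    {
        /* truncation succeeded: evict the marked buffers for real */
        for (int i = 0; i < nbufs; i++)
            InvalidateMarkedBuffer(bufids[i]);
    }
    else
    {
        /*
         * truncation failed: clear BM_IO_IN_PROGRESS again, much like
         * AbortBufferIO() but without setting BM_IO_ERROR, and wake waiters
         */
        for (int i = 0; i < nbufs; i++)
            UnmarkBufferIO(bufids[i]);
    }
    pfree(bufids);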

This would pretty directly replace the current cleanup locks, which
would just need the existing flag bit to indicate that refcount=0 should
cause a condition variable wakeup.

Does somebody see a fundamental hole in this approach?

Obviously there's lots of different details to fill in, but most of them
seem likely to only be found by actually writing the patch...

Greetings,

Andres Freund

[1] https://postgr.es/m/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com