Re: smgr vs DropRelFileNodeBuffers() vs filesystem state vs nocritical section

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: smgr vs DropRelFileNodeBuffers() vs filesystem state vs nocritical section
Дата
Msg-id 20191223005430.yhf4n3zr4ojwbcn2@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: smgr vs DropRelFileNodeBuffers() vs filesystem state vs nocritical section  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
Hi,

On 2019-12-07 11:07:04 +0530, Amit Kapila wrote:
> On Sat, Dec 7, 2019 at 5:42 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Tom, I seem to recall a recent thread of yours discussing a different
> > approach to truncation. I wonder if there's some intersection with
> > that. But unfortunately my search somehow has come up with nothing so
> > far - do you remember enough to find the thread?
> >
> 
> IIUC, a similar problem is discussed in the thread [1] where Tom
> proposed a few solutions which are close to what you are proposing.
> 
> [1] - https://www.postgresql.org/message-id/1880.1281020817%40sss.pgh.pa.us

It does indeed look like basically the same problem. I was actually
remembering a different thread, that was more about truncating with not
quite as heavyweight locking.

Having pondered this and some related problems (truncation, flushing
multiple buffers at once using asynchronous IO, PrefetchBuffer()
directly into buffers, cleanup lock implementation), I think we
basically need the ability to set something like BM_IO_IN_PROGRESS on
multiple buffers (perhaps signalling different forms of IO with
different flags, but I'm not sure it's needed).

I think we basically ought to replace the current IO buffer locking with
condition variables (as Robert has suggested at [1]). Instead of having
an array of LWLocks (BufferIOLWLockArray), we'd allocate one condition
variable for each buffer. I think we have enough space directly in
BufferDesc these days, due to the spinlock removal, but that seems like
a detail.

For truncation, we'd first iterate over all buffers once to mark them as
BM_IO_IN_PROGRESS, then we would truncate the underlying file. If the
truncation succeeds, we can use a local palloc'd array of IO_IN_PROGRESS
buffers to actually evict them. If the truncation fails, the same array
would be used to reset IO_IN_PROGRESS (basically AbortBufferIO, except
not setting BM_IO_ERROR for the truncation case).  This would solve the
problem of truncations leading to diverging fs/buffer state, would not
require a PANIC, and would allow to have concurrent buffer eviction to
efficiently wait for IO to finish.

This would pretty directly replace the current cleanup locks, which
would just need the existing flagbit to indicate that refcount=0 should
cause a condition variable wakeup.

Does somebody see a fundamental hole in this approach?

Obviously there's lots of different details to fill in, but most of them
seem likely to only be found by actually writing the patch...

Greetings,

Andres Freund

[1] https://postgr.es/m/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Drongo vs. 9.4 initdb TAP test
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Drongo vs. 9.4 initdb TAP test