Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

From: Andres Freund
Subject: Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync
Date:
Msg-id: 20231110173957.brc4bokf4whqpq54@awork3.anarazel.de
In reply to: Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync  (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
Hi,

On 2023-11-10 17:16:35 +0200, Heikki Linnakangas wrote:
> On 10/11/2023 05:54, Andres Freund wrote:
> > In this case I had used wal_sync_method=open_datasync - it's often faster and
> > if we want to scale WAL writes more we'll have to use it more widely (you
> > can't have multiple fdatasyncs in progress and reason about which one affects
> > what, but you can have multiple DSYNC writes in progress at the same time).
>
> Not sure I understand that. If you issue an fdatasync, it will sync all
> writes that were complete before the fdatasync started. Right? If you have
> multiple fdatasyncs in progress, that's true for each fdatasync. Or is there
> a bottleneck in the kernel with multiple in-progress fdatasyncs or
> something?

Many filesystems only allow a single fdatasync to really be in progress at a
time - they eventually acquire an inode-specific lock.  More problematic
cases include a write followed by an fdatasync, followed by a write of the
same block in another process/thread: there is very little guarantee about
which contents of that block are now durable.

But more importantly, using fdatasync doesn't scale, because it effectively
has to flush the device's entire write cache - which often contains plenty of
other dirty data. Whereas O_DSYNC can use FUA writes, which make just the
individual WAL writes go through the cache, while leaving the rest of the
cache unaffected.


> > After a bit of confused staring and debugging I figured out that the problem
> > is that the RequestXLogSwitch() within the code for starting a basebackup was
> > triggering writing back the WAL in individual 8kB writes via
> > GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these
> > writes is durable - on this drive each take about 1ms.
>
> I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() is
> relatively fast. But with open_datasync, it's not.

I'm not sure that was an explicit assumption rather than just how it worked
out.


> > To fix this, I suspect we need to make
> > GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
> > specific case, we even know for sure that we are going to fill a lot more
> > buffers, so no heuristic would be needed. In other cases however we need some
> > heuristic to know how much to write out.
>
> +1. Maybe use the same logic as in XLogFlush().

I've actually been wondering about moving all the handling of WALWriteLock to
XLogWrite() and/or a new function called from all the places calling
XLogWrite().

I suspect we can't quite use the same logic in AdvanceXLInsertBuffer() as we
do in XLogFlush() - e.g. we never want to trigger flushing out a partially
filled page, or unnecessarily wait for a WAL insertion to complete when we
don't have to.


> I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It
> would be nice to pass a hard minimum XLogRecPtr that it must write up to,
> but still allow it to write more than that if it's convenient.

Yes, I've also thought that.  In the AIOified WAL code I ended up tracking
"minimum" and "optimal" write/flush locations.

Greetings,

Andres Freund


