Re: WAL Re-Writes

From: Amit Kapila
Subject: Re: WAL Re-Writes
Date:
Msg-id: CAA4eK1Ko-jaPa_0ug5S+a2WCOb33mWpAniQrfRKWpb6Hb_8jog@mail.gmail.com
In reply to: Re: WAL Re-Writes  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Responses: Re: WAL Re-Writes  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 1/31/16 3:26 PM, Jan Wieck wrote:
On 01/27/2016 08:30 AM, Amit Kapila wrote:
operation.  Now, the reason the OS couldn't find the corresponding block in
memory is that, while closing the WAL file, we use
POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
leads to this problem.  So with this experiment, the conclusion is that
though we can avoid re-writes of WAL data by doing exact writes,
it could lead to a significant reduction in TPS.

POSIX_FADV_DONTNEED isn't the only way those blocks can vanish
from the OS buffers. If I am not mistaken, we recycle WAL segments in a
round-robin fashion. In a properly configured system, where the reason for a
checkpoint is usually "time" rather than "xlog", a recycled WAL file being
written to has been closed and untouched for about a complete
checkpoint_timeout or longer. You must have a really large amount of spare
RAM in the machine to still find those blocks in memory. Basically we
are talking about the active portion of your database, shared buffers,
the sum of all process-local memory, and the complete pg_xlog directory
content fitting into RAM.


I think that could only be a problem if the reads were happening at the
write or fsync call, but that is not the case here.  Further investigation
on this point reveals that the reads are not for the fsync operation;
rather, they happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
Although this behaviour (writing in chunks that are not OS-page-cache-sized
can lead to reads if followed by a call to posix_fadvise
(,,POSIX_FADV_DONTNEED)) is not very clearly documented, the reason for
it is that the fadvise() call maps the specified data range (in our case,
the whole file) onto its list of pages and then invalidates them, removing
them from the OS cache.  Any writes that are misaligned with respect to
the OS page size and done while writing/fsyncing the file can then cause
additional reads, because not everything we write will fall on an OS page
boundary.  This theory is based on the code of fadvise [1] and some
googling [2], which suggest that misaligned writes followed by
POSIX_FADV_DONTNEED can cause this kind of problem.  A colleague of mine,
Dilip Kumar, has verified it as well, by writing a simple program that does
open/write/fsync/fadvise/close.
 

But that's only going to matter when the segment is newly recycled. My impression from Amit's email is that the OS was repeatedly reading even in the same segment?


As explained above, the reads happen only during file close.
 
Either way, I would think it wouldn't be hard to work around this by spewing out a bunch of zeros to the OS in advance of where we actually need to write, preventing the need for reading back from disk.


I think we can simply prohibit setting wal_chunk_size to a value other
than the OS page-cache size or XLOG_BLCKSZ (whichever is smaller) if
wal_level is less than 'archive'.  This can avoid the problem of extra
reads for misaligned writes (when wal_level is 'archive' or higher we
won't call fadvise(), so any chunk size is safe there).
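
A hedged sketch of how that restriction could look (wal_chunk_size and the
helper name below are illustrative, not from an actual patch):

#include <stdbool.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192    /* PostgreSQL's default WAL block size */

/*
 * Accept a wal_chunk_size only if it cannot interact badly with
 * posix_fadvise(POSIX_FADV_DONTNEED).  When wal_level is 'archive' or
 * higher, fadvise() is never called at file close, so any size is fine;
 * otherwise require exactly Min(OS page size, XLOG_BLCKSZ).
 */
static bool
wal_chunk_size_acceptable(long chunk_size, bool wal_level_below_archive)
{
    long    os_page = sysconf(_SC_PAGESIZE);    /* typically 4096 */
    long    aligned = (os_page < XLOG_BLCKSZ) ? os_page : XLOG_BLCKSZ;

    if (!wal_level_below_archive)
        return true;
    return chunk_size == aligned;
}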

We could even choose to always write in units of the OS page-cache size
or XLOG_BLCKSZ (whichever is smaller); in many cases the OS page size
is 4K, which can also save significant re-writes.
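
In code form, always writing on such a boundary is just a matter of
rounding the pending byte count up (a sketch; the alignment must be a
power of two):

#include <stddef.h>

/*
 * Round nbytes up to the next multiple of 'alignment', where alignment is
 * Min(OS page size, XLOG_BLCKSZ).  For example, with a 4K page,
 * round_up_to_boundary(6500, 4096) == 8192, so at most 4K - 1 trailing
 * bytes are ever re-written, instead of a full 8K WAL block.
 */
static size_t
round_up_to_boundary(size_t nbytes, size_t alignment)
{
    return (nbytes + alignment - 1) & ~(alignment - 1);
}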

 
Amit, did you do performance testing with archiving enabled and a no-op archive_command?

No, but what kind of advantage are you expecting from such
tests?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
