Re: WAL Re-Writes

From: Amit Kapila
Subject: Re: WAL Re-Writes
Date:
Msg-id: CAA4eK1Ko-jaPa_0ug5S+a2WCOb33mWpAniQrfRKWpb6Hb_8jog@mail.gmail.com
In reply to: Re: WAL Re-Writes  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Responses: Re: WAL Re-Writes  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 1/31/16 3:26 PM, Jan Wieck wrote:
On 01/27/2016 08:30 AM, Amit Kapila wrote:
operation.  Now, the reason the OS couldn't find the corresponding block in
memory is that, while closing the WAL file, we use
POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
leads to this problem.  So with this experiment, the conclusion is that
though we can avoid re-writes of WAL data by doing exact writes,
it could lead to a significant reduction in TPS.

POSIX_FADV_DONTNEED isn't the only way those blocks can vanish
from the OS buffers. If I am not mistaken, we recycle WAL segments in a
round-robin fashion. In a properly configured system, where the reason for a
checkpoint is usually "time" rather than "xlog", a recycled WAL file being
written to has been closed and untouched for about a complete
checkpoint_timeout or longer. You must have a really large amount of spare
RAM in the machine to still find those blocks in memory. Basically we
are talking about the active portion of your database, shared buffers,
the sum of all process-local memory, and the complete pg_xlog directory
content fitting into RAM.


I think that could only be a problem if the reads were happening at the
write or fsync call, but that is not the case here.  Further investigation
on this point reveals that the reads are not for the fsync operation;
rather, they happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
Although this behaviour (writing in chunks that are not OS-page-cache-sized
can lead to reads if followed by a call to posix_fadvise
(,,POSIX_FADV_DONTNEED)) is not very clearly documented, the reason for
it is that the fadvise() call maps the specified data range (in our case,
the whole file) onto its list of pages and then invalidates them, removing
them from the OS cache.  Any writes that are misaligned with respect to
the OS page size and done while writing/fsyncing the file can then cause
additional reads, because not everything we write will fall on an OS page
boundary.  This theory is based on the code of fadvise [1] and some
googling [2], which suggest that misaligned writes followed by
POSIX_FADV_DONTNEED can cause this kind of problem.  A colleague of mine,
Dilip Kumar, has verified it as well, by writing a simple program that does
open/write/fsync/fadvise/close.
 

But that's only going to matter when the segment is newly recycled. My impression from Amit's email is that the OS was repeatedly reading even in the same segment?


As explained above, the reads happen only during file close.
 
Either way, I would think it wouldn't be hard to work around this by spewing out a bunch of zeros to the OS in advance of where we actually need to write, preventing the need for reading back from disk.


I think we can simply prohibit setting wal_chunk_size to a value other
than the OS page-cache size or XLOG_BLCKSZ (whichever is smaller) if
wal_level is less than 'archive'.  This can avoid the problem of extra
reads for misaligned writes (when wal_level is 'archive' or higher we
won't call fadvise(), so any chunk size is safe there).
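
A hedged sketch of how that restriction could look (wal_chunk_size and the
helper name below are illustrative, not from an actual patch):

#include <stdbool.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192    /* PostgreSQL's default WAL block size */

/*
 * Accept a wal_chunk_size only if it cannot interact badly with
 * posix_fadvise(POSIX_FADV_DONTNEED).  When wal_level is 'archive' or
 * higher, fadvise() is never called at file close, so any size is fine;
 * otherwise require exactly Min(OS page size, XLOG_BLCKSZ).
 */
static bool
wal_chunk_size_acceptable(long chunk_size, bool wal_level_below_archive)
{
    long    os_page = sysconf(_SC_PAGESIZE);    /* typically 4096 */
    long    aligned = (os_page < XLOG_BLCKSZ) ? os_page : XLOG_BLCKSZ;

    if (!wal_level_below_archive)
        return true;
    return chunk_size == aligned;
}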

We could even choose to always write in units of the OS page-cache size
or XLOG_BLCKSZ (whichever is smaller); in many cases the OS page size
is 4K, which can also save significant re-writes.
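
In code form, always writing on such a boundary is just a matter of
rounding the pending byte count up (a sketch; the alignment must be a
power of two):

#include <stddef.h>

/*
 * Round nbytes up to the next multiple of 'alignment', where alignment is
 * Min(OS page size, XLOG_BLCKSZ).  For example, with a 4K page,
 * round_up_to_boundary(6500, 4096) == 8192, so at most 4K - 1 trailing
 * bytes are ever re-written, instead of a full 8K WAL block.
 */
static size_t
round_up_to_boundary(size_t nbytes, size_t alignment)
{
    return (nbytes + alignment - 1) & ~(alignment - 1);
}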

 
Amit, did you do performance testing with archiving enabled and a no-op archive_command?

No, but what kind of advantage are you expecting from such
tests?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
