Re: patch to allow disable of WAL recycling

From: Jerry Jelinek
Subject: Re: patch to allow disable of WAL recycling
Date:
Msg-id: CACPQ5FruwJx+x_oLt-vVjJoKvBVepYqJW++CJ9-aywBAbPrhFg@mail.gmail.com
In reply to: Re: patch to allow disable of WAL recycling (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
Thanks to everyone who has taken the time to look at this patch and provide all of the feedback.

I'm going to wait another day to see if there are any more comments. If not, then first thing next week, I will send out a revised patch with improvements to the man page change as requested. If anyone has specific things they want to be sure are covered, please just let me know.

Thanks again,
Jerry


On Thu, Jul 12, 2018 at 7:06 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I don't follow Alvaro's reasoning, TBH. There's a couple of things that
> confuse me ...
>
> I don't quite see how reusing WAL segments actually protects against a full
> filesystem? On "traditional" filesystems I would not expect any difference
> between "unlink+create" and reusing an existing file. On CoW filesystems
> (like ZFS or btrfs) the space management works very differently and reusing
> an existing file is unlikely to save anything.

Yeah, I had the same thoughts.
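
For reference, a minimal sketch of the two strategies under discussion
(illustrative only, with made-up helper names; this is not PostgreSQL's
actual segment-handling code):

    /* Recycle: overwrite an existing, already-allocated segment in
     * place. On a traditional filesystem the blocks already exist, so
     * no new allocation is needed; on a CoW filesystem (ZFS, btrfs)
     * every overwrite allocates fresh blocks anyway, so nothing is
     * saved. */
    #include <fcntl.h>
    #include <unistd.h>

    static int reuse_segment(const char *path, const char *buf, size_t len)
    {
        int     fd = open(path, O_WRONLY);  /* file exists at full size */
        ssize_t rc;

        if (fd < 0)
            return -1;
        rc = write(fd, buf, len);
        close(fd);
        return rc < 0 ? -1 : 0;
    }

    /* Create: unlink the old segment and write a brand-new file,
     * allocating blocks as the writes land. */
    static int create_segment(const char *path, const char *buf, size_t len)
    {
        int     fd;
        ssize_t rc;

        unlink(path);
        fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;
        rc = write(fd, buf, len);
        close(fd);
        return rc < 0 ? -1 : 0;
    }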

> But even if it reduces the likelihood of ENOSPC, it does not eliminate it
> entirely. max_wal_size is not a hard limit, and the disk may be filled by
> something else (when WAL is not on a separate device, when there is thin
> provisioning, etc.). So it's not a protection against data corruption we
> could rely on. (And as was discussed in the recent fsync thread, ENOSPC is a
> likely source of past data corruption issues on NFS and possibly other
> filesystems.)

Right.  That ENOSPC discussion was about checkpointing though, not
WAL.  IIUC the hypothesis was that there may be stacks (possibly
involving NFS or thin provisioning, or perhaps historical versions of
certain local filesystems that had reservation accounting bugs, on a
certain kernel) that could let you write() a buffer, and then later
when the checkpointer calls fsync() the filesystem says ENOSPC, the
kernel reports that and throws away the dirty page, and then at next
checkpoint fsync() succeeds but the checkpoint is a lie and the data
is smoke.
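
To illustrate that hypothesized failure mode, here is a minimal
standalone sketch (not PostgreSQL source; the retry loop is
deliberately wrong, to show the hazard):

    /* On the stacks in question, a failed fsync() may cause the kernel
     * to drop the dirty page, so a later fsync() can return success
     * even though the data was never written out. Retrying fsync()
     * until it succeeds is therefore NOT a safe recovery strategy. */
    #include <stdio.h>
    #include <unistd.h>

    static void unsafe_checkpoint_fsync(int fd)
    {
        while (fsync(fd) < 0)
        {
            /* First call reports ENOSPC (or EIO); the kernel may now
             * discard the dirty page... */
            perror("fsync");
            sleep(1);
            /* ...so a retry can succeed although the data is gone. */
        }
        /* "Success": but the checkpoint may be a lie. */
    }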

We already PANIC on any errno except EINTR in XLogWrite(), as seen
in Jerry's nearby stack trace, so that failure mode seems to be
covered already for WAL, no?
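
The pattern looks roughly like this, as a standalone sketch of the same
retry-on-EINTR, die-on-anything-else logic (the real code is in
XLogWrite() in src/backend/access/transam/xlog.c and uses
ereport(PANIC, ...) where this sketch calls abort()):

    /* Retry the write only on EINTR; treat any other errno
     * (ENOSPC, EIO, ...) as fatal. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void write_all_or_die(int fd, const char *buf, size_t nleft)
    {
        while (nleft > 0)
        {
            ssize_t written = write(fd, buf, nleft);

            if (written <= 0)
            {
                if (errno == EINTR)
                    continue;       /* interrupted, just retry */
                perror("write");    /* ENOSPC lands here... */
                abort();            /* ...and the server PANICs */
            }
            buf += written;
            nleft -= written;
        }
    }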

> AFAICS the original reason for reusing WAL segments was the belief that
> overwriting an existing file is faster than writing a new file. That might
> have been true in the past, but the question is if it's still true on
> current filesystems. The results posted here suggest it's not true on ZFS,
> at least.

Yeah.

The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
comments about the documentation; we should make sure that the 'off'
setting isn't accidentally recommended to the wrong audience) and I
vote we take it.

Just by the way, if I'm not mistaken ZFS does avoid faulting when
overwriting whole blocks, just like other filesystems:

https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034

So then where are those faults coming from?  Perhaps the tree page
that holds the block pointer, of which there must be many when the
recordsize is small?
