Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: CAMsr+YFjFrv2SH1=W-Z2OL3-87bTN5NBwQbnOkyUdPAFjQ78nA@mail.gmail.com
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 9 April 2018 at 10:06, Andres Freund <andres@anarazel.de> wrote:
 

>> And in many failure modes there's no reason to expect any data loss at all,
>> like:
>>
>> * Local disk fills up (seems to be safe already due to space reservation at
>>   write() time)
>
> That definitely should be treated separately.

It is, because all the FSes I looked at reserve space before returning from write(), even if they do delayed allocation. So they won't fail with ENOSPC at fsync() time or silently due to lost errors on background writeback. Otherwise we'd be hearing a LOT more noise about this.
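
To spell that out with a trivial sketch (plain C, not PostgreSQL code, file name arbitrary): the local-disk case fails loudly at write() time, which is exactly why it's the easy one.

/*
 * Minimal sketch: on the local filesystems discussed here (ext4, xfs,
 * btrfs), block reservation happens inside write() even with delayed
 * allocation, so a full local disk surfaces as ENOSPC from write()
 * itself rather than being deferred to background writeback.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char    buf[8192] = {0};
    int     fd = open("datafile", O_WRONLY | O_CREAT | O_APPEND, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    if (write(fd, buf, sizeof(buf)) < 0)
    {
        /* Local FS full: we hear about it synchronously, right here. */
        perror("write");
        return 1;
    }
    if (fsync(fd) < 0)
    {
        /* The troublesome cases below are the ones that only fail here. */
        perror("fsync");
        return 1;
    }
    return 0;
}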
 
>> * Thin-provisioned storage backing local volume iSCSI or paravirt block
>>   device fills up
>> * NFS volume fills up
>
> Those should be the same as the above.

Unfortunately, they aren't.

AFAICS NFS doesn't reserve space with the other end before returning from write(), even if mounted with the sync option. So we can get ENOSPC lazily when the buffer writeback fails due to a full backing file system. This then travels the same path as EIO: we fsync(), get an ERROR, retry, appear to succeed, and carry on with life, losing the data. Or we never hear about the error in the first place.

(There's a proposed extension that'd allow this, see https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5, but I see no mention of it in fs/nfs. All the reserve_space / xdr_reserve_space stuff seems to be related to space in protocol messages at a quick read.)
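
The dangerous part of that sequence, as a sketch (plain C, not PostgreSQL source, assuming a file descriptor whose dirty pages have already failed writeback, e.g. NFS backed by a full filesystem):

/*
 * Linux reports the writeback error to at most one fsync() caller and
 * marks the failed pages clean, so the "retry" below returns 0 with
 * nothing left to write: it looks durable, but the data is gone.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static bool
flush_with_naive_retry(int fd)
{
    if (fsync(fd) == 0)
        return true;            /* genuinely flushed */

    fprintf(stderr, "fsync failed: %s, retrying...\n", strerror(errno));

    /*
     * The trap: after the first failure the kernel has already dropped
     * the dirty data, so this reports success while the write is lost.
     */
    return fsync(fd) == 0;
}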

Thin provisioned storage could vary a fair bit depending on the implementation. But the specific failure case I saw, prompting this thread, was on a volume using the stack:

xfs -> lvm2 -> multipath -> ??? -> SAN

(By the looks of it the HBA/iSCSI/whatever layer wasn't recorded, but IIRC it was iSCSI. I'm checking.)

The SAN ran out of space. Due to use of thin provisioning, Linux *thought* there was plenty of space on the volume; LVM thought it had plenty of physical extents free and unallocated, XFS thought there was tons of free space, etc. The space exhaustion manifested as I/O errors on flushes of writeback buffers.

The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]   
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]   
kernel: Sense Key : Data Protect [current] 
kernel: sd 2:0:0:1: [sdd]   
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB: 
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786 
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787 

The immediate cause was that Linux's multipath driver didn't seem to recognise the sense code as retryable, so it gave up and reported it to the next layer up (LVM). LVM and XFS both seem to think that the lower layer is responsible for retries, so they toss the write away, and tell any interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on checkpoints, but it only reported an error once per relfilenode. Once it ran out of failed relfilenodes to make the checkpoint ERROR, it "completed" a "successful" checkpoint and kept on running until the resulting corruption started to manifest itself and it segfaulted some time later. As we've now learned, there's no guarantee we'd even get the news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we didn't PANIC on WAL write failure and prevent the issue.

In this case if Pg had PANIC'd (and been able to guarantee to get the news of write failures reliably), there'd have been no corruption and no data loss despite the underlying storage issue.

If, prior to seeing this, you'd asked me "will my PostgreSQL database be corrupted if my thin-provisioned volume runs out of space" I'd have said "Surely not. PostgreSQL won't be corrupted by running out of disk space, it orders writes carefully and forces flushes so that it will recover gracefully from write failures."

Except not. I was very surprised.

BTW, it also turns out that the *default* for multipath is to give up on errors anyway; see the queue_if_no_path and no_path_retry options. (Hint: run PostgreSQL with no_path_retry set to queue.) That's a sane default if you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.
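
For reference, the relevant stanza looks roughly like this (a sketch only, option spelling from memory, so verify against your multipath-tools documentation):

# /etc/multipath.conf
defaults {
    # Queue I/O while no path is usable instead of failing it up the
    # stack (the features "1 queue_if_no_path" equivalent).
    no_path_retry    queue
}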


I regularly see rather a lot of multipath, iSCSI, and SAN-backed systems. I think we need to be pretty clear that we expect them to retry indefinitely, and that if they report an I/O error we cannot reliably handle it. We need to patch Pg to PANIC on any fsync() failure, and document that Pg won't notice some storage failure modes that might otherwise be considered nonfatal or transient, so very specific storage configuration and testing is required. (Not that anyone will do it.) We should also warn against running on NFS even with "hard,sync,nointr".
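
In code terms the policy is roughly this sketch (plain C, not the actual md.c / checkpointer code, with abort() standing in for ereport(PANIC, ...)):

/*
 * Never retry, never downgrade to a per-checkpoint ERROR: a later
 * "successful" fsync() can mask data the kernel already threw away.
 * Crash and let WAL replay rebuild it.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) < 0)
    {
        fprintf(stderr, "PANIC: could not fsync \"%s\": %s\n",
                path, strerror(errno));
        abort();
    }
}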

It'd be interesting to have a tool that tested error handling, allowing people to do iSCSI plug-pull tests, that sort of thing. But as far as I can tell nobody ever tests their storage stack anyway, so I don't plan on writing something that'll never get used.
  
>> I think we need to think about a more robust path in future. But it's
>> certainly not "stop the world" territory.
>
> I think you're underestimating the complexity of doing that by at least
> two orders of magnitude.

Oh, it's just a minor total rewrite of half Pg, no big deal ;) 

I'm sure that no matter how big I think it is, I'm still underestimating it.

The most workable option IMO would be some sort of fnotify/dnotify/whatever interface that reports all I/O errors on a volume: some kind of error-reporting handle we could keep open at the volume level and check for each volume/tablespace after we fsync() everything, to see if it all really worked. If we PANIC when that gives us a bad answer, and PANIC on fsync() errors, we guard against the great majority of these sorts of should-be-transient-if-the-kernel-didn't-give-up-and-throw-away-our-data errors.
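
Purely to illustrate the shape of what I mean (every name below is invented; as I say next, no such interface exists):

/*
 * Hypothetical sketch only. The idea: hold one error-reporting handle
 * per volume/tablespace, fsync everything as usual, then ask whether
 * the volume swallowed any writeback errors since the last check, and
 * PANIC if it did.
 */
#include <stdbool.h>

typedef struct VolumeErrorHandle VolumeErrorHandle;     /* opaque, imaginary */

/* Imaginary kernel-provided calls. */
extern VolumeErrorHandle *volume_error_open(const char *mountpoint);
extern bool volume_error_check_and_clear(VolumeErrorHandle *h);

static void
post_checkpoint_check(VolumeErrorHandle *h)
{
    if (volume_error_check_and_clear(h))
    {
        /* Some write on this volume failed since the last check: PANIC. */
    }
}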

Even then, good luck getting those events from an NFS volume in which the backing volume experiences an issue.

And it's kind of moot because AFAICS no such interface exists.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
