Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Bruce Momjian
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: 20180420204908.GA30655@momjian.us
In reply to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Craig Ringer <craig@2ndquadrant.com>)
List: pgsql-hackers
On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:
> On 18 April 2018 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:
> 
> > So, if sync mode passes the write to NFS, and NFS pre-reserves write
> > space, and throws an error on reservation failure, that means that NFS
> > will not corrupt a cluster on out-of-space errors.
> 
> Yeah. I need to verify in a concrete test case.

Thanks.

> The thing is that write() is allowed to be asynchronous anyway. Most
> file systems choose to implement eager reservation of space, but it's
> not mandated. AFAICS that's largely a historical accident to keep
> applications happy, because FSes used to *allocate* the space at
> write() time too, and when they moved to delayed allocations, apps
> tended to break too easily unless they at least reserved space. NFS
> would have to do a round-trip on write() to reserve space.
> 
> The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:
> 
> "
>        A successful return from write() does not make any guarantee that
>        data has been committed to disk.  On some filesystems, including NFS,
>        it does not even guarantee that space has successfully been reserved
>        for the data.  In this case, some errors might be delayed until a
>        future write(2), fsync(2), or even close(2).  The only way to be sure
>        is to call fsync(2) after you are done writing all your data.
> "
> 
> ... and I'm inclined to believe it when it refuses to make guarantees.
> Especially lately.

Uh, even calling fsync after write isn't 100% safe, since the kernel
could have flushed the dirty pages to storage, failed, and a later
fsync would still succeed.  I realize newer kernels fix that for files
held open during the failure, but those are a minority of installs.
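The write-then-fsync pattern under discussion can be sketched in Python (a minimal illustration, assuming POSIX semantics; `os.fsync` maps to fsync(2), and the function name is mine, not anything in PostgreSQL):

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and push it to stable storage, surfacing delayed errors."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # A successful write() only means the kernel accepted the pages;
        # ENOSPC or EIO from the storage layer may not appear until fsync().
        os.write(fd, data)
        # fsync is where delayed write-back errors are (usually) reported.
        # On older Linux kernels an error flushed before this call could be
        # cleared, letting the fsync still return success.
        os.fsync(fd)
    finally:
        os.close(fd)

fd, path = tempfile.mkstemp()
os.close(fd)
durable_write(path, b"hello")
```

Note that catching an OSError here tells you the data may not be durable; it does not tell you which writes were lost.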

> The idea is that when the SAN's actual physically allocated storage
> gets to 40TB, it starts telling you to go buy another rack of storage
> so you don't run out. You don't have to resize volumes, resize file
> systems, etc. All the storage space admin is centralized on the SAN
> and storage team, and your sysadmins, DBAs and app devs are none the
> wiser. You buy storage when you need it, not when the DBA demands they
> need a 200% free space margin just in case. Whether or not you agree
> with this philosophy or think it's sensible is kind of moot, because
> it's an extremely widespread model, and servers you work on may well
> be backed by thin provisioned storage _even if you don't know it_.


> Most FSes only touch the blocks on dirty writeback, or sometimes
> lazily as part of delayed allocation. So if your SAN is running out of
> space and there's 100MB free, each of your 100 FSes may have
> decremented its freelist by 2MB and be happily promising more space to
> apps on write() because, well, as far as they know they're only 50%
> full. When they all do dirty writeback and flush to storage, kaboom,
> there's nowhere to put some of the data.

I see what you are saying --- that the kernel is reserving the write
space from its free space, but the free space doesn't all exist.  I am
not sure how we can tell people to make sure the file system free space
is real.
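To illustrate why that is hard: the free space a filesystem reports, e.g. via statvfs(), is purely its own accounting, so under thin provisioning a sketch like this (Python; the "/" mount point is an arbitrary choice) can show plenty of room even when the backing store is nearly full:

```python
import os

def reported_free_bytes(path):
    """Free space as the filesystem accounts it -- NOT what the
    thin-provisioned backend actually has available."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

fs_free = reported_free_bytes("/")
```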

> You'd have to actually force writes to each page through to the
> backing storage to know for sure the space existed. Yes, the docs say
> 
> "
>        After a
>        successful call to posix_fallocate(), subsequent writes to bytes in
>        the specified range are guaranteed not to fail because of lack of
>        disk space.
> "
> 
> ... but they're speaking from the filesystem's perspective. If the FS
> doesn't dirty and flush the actual blocks, a thin provisioned storage
> system won't know.
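A sketch of what "force writes to each page through to the backing storage" could look like (Python on Linux; the 4 kB block size and the function name are assumptions of mine, and a real implementation would query the filesystem's block size):

```python
import os
import tempfile

BLOCK = 4096  # assumed block size; the real value varies per filesystem

def force_allocate(path, size):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # FS-level reservation: after success, writes into the range cannot
        # fail with ENOSPC as far as the filesystem is concerned.
        os.posix_fallocate(fd, 0, size)
        # But a thin-provisioned SAN only sees blocks that are actually
        # dirtied and flushed, so touch one byte per block...
        for off in range(0, size, BLOCK):
            os.pwrite(fd, b"\0", off)
        # ...and flush; a thin-provisioning ENOSPC would surface here.
        os.fsync(fd)
    finally:
        os.close(fd)

fd, path = tempfile.mkstemp()
os.close(fd)
force_allocate(path, 8192)
```

The cost is obvious: every reserved block becomes a real write, which defeats much of the point of delayed allocation.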

Frankly, in what cases will a write fail _for_ lack of free space?  It
could be a new WAL file (not recycled), or pages added to the end of a
table.

Is that it?  It doesn't sound too terrible.  If we can eliminate the
corruption due to free space exhaustion, it would be a big step
forward.

The next most common failure would be temporary storage failure or
storage communication failure.

Permanent storage failure is "game over" so we don't need to worry about
that.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

