Re: PostgreSQL's handling of fsync() errors is unsafe and risks dataloss at least on XFS

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: PostgreSQL's handling of fsync() errors is unsafe and risks dataloss at least on XFS
Дата
Msg-id CAMsr+YEeTrvp9wVnQTpOszaiecCfy6iA6SS0ADyPcyMOSA8HXQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Thomas Munro <thomas.munro@enterprisedb.com>)
Ответы Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Peter Geoghegan <pg@bowt.ie>)
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Christophe Pettus <xof@thebuild.com>)
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andreas Karlsson <andreas@proxel.se>)
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Greg Stark <stark@mit.edu>)
Список pgsql-hackers
On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
So, what can we actually do about this new Linux behaviour?

Yeah, I've been cooking over that myself.

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

We have a storage abstraction that makes this way, way less painful than it should be.

We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.

Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.

That way we can use something like your #1 (which is what I was also thinking about then rejecting previously), but reduce the pain by reducing the FD count drastically so exhausting FDs stops being a problem.


Previously I was leaning toward what you've described here:
 
* whenever you open a file, either tell the checkpointer so it can
open it too (and wait for it to tell you that it has done so, because
it's not safe to write() until then), or send it a copy of the file
descriptor via IPC (since duplicated file descriptors share the same
f_wb_err)

* if the checkpointer can't take any more file descriptors (how would
that limit even work in the IPC case?), then it somehow needs to tell
you that so that you know that you're responsible for fsyncing that
file yourself, both on close (due to fd cache recycling) and also when
the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible.  Is
there some simpler idea along these lines that could make sure that
fsync() is only ever called on file descriptors that were opened
before all unflushed writes, or file descriptors cloned from such file
descriptors?

... and got stuck on "yuck, that's awful".

I was assuming we'd force early checkpoints if the checkpointer hit its fd limit, but that's even worse.

We'd need to urgently do away with segmented relations, and partitions would start to become a hinderance.

Even then it's going to be an unworkable nightmare with heavily partitioned systems, systems that use schema-sharding, etc. And it'll mean we need to play with process limits and, often, system wide limits on FDs. I imagine the performance implications won't be pretty.


Idea 2:

Give up, complain that this implementation is defective and
unworkable, both on POSIX-compliance grounds and on POLA grounds, and
campaign to get it fixed more fundamentally (actual details left to
the experts, no point in speculating here, but we've seen a few
approaches that work on other operating systems including keeping
buffers dirty and marking the whole filesystem broken/read-only).

This appears to be what SQLite does AFAICS.


though it has the huge luxury of a single writer, so it's probably only subject to the original issue not the multiprocess / checkpointer issues we face.

 
Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

That seems to be what the kernel folks will expect. But that's going to KILL performance. We'll need writer threads to have any hope of it not *totally* sucking, because otherwise simple things like updating a heap tuple and two related indexes will incur enormous disk latencies.

But I suspect it's the path forward.

Goody.


 
Any other ideas?

For a while I considered suggesting an idea which I now think doesn't
work.  I thought we could try asking for a new fcntl interface that
spits out wb_err counter.  Call it an opaque error token or something.
Then we could store it in our fsync queue and safely close the file.
Check again before fsync()ing, and if we ever see a different value,
PANIC because it means a writeback error happened while we weren't
looking.  Sadly I think it doesn't work because AIUI inodes are not
pinned in kernel memory when no one has the file open and there are no
dirty buffers, so I think the counters could go away and be reset.
Perhaps you could keep inodes pinned by keeping the associated buffers
dirty after an error (like FreeBSD), but if you did that you'd have
solved the problem already and wouldn't really need the wb_err system
at all.  Is there some other idea long these lines that could work?

I think our underlying data syncing concept is fundamentally broken, and it's not really the kernel's fault. 

We assume that we can safely:

procA: open()
procA: write()
procA: close()

... some long time later, unbounded as far as the kernel is concerned ...

procB: open()
procB: fsync()
procB: close()


If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

I never really clicked to the fact that we closed relations with pending buffered writes, left them closed, then reopened them to fsync. That's .... well, the kernel isn't the only thing doing crazy things here.

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Fun times.

This also means AFAICS that running Pg on NFS is extremely unsafe, you MUST make sure you don't run out of disk. Because the usual safeguard of space reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested this with nfsv3 in sync,hard,nointr mode yet, *maybe* that's safe, but I doubt it). The same applies to thin-provisioned storage. Just. Don't.

This helps explain various reports of corruption in Docker and various other tools that use various sorts of thin provisioning. If you hit ENOSPC in fsync(), bye bye data.




--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: pgsql: Support partition pruning at execution time
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS