Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Дата
Msg-id CAMsr+YGXcjheOijrUhqz7vtMe0C7uqx=6HpxTAuntaHnZ5K5uw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Thomas Munro <thomas.munro@enterprisedb.com>)
Ответы Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Thomas Munro <thomas.munro@enterprisedb.com>)
Список pgsql-hackers
On 6 April 2018 at 10:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 6 April 2018 at 07:37, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>> Note: as I've brought up in another thread, it turns out that PG is not
>> handling fsync errors correctly even when the OS _does_ do the right
>> thing (discovered by testing on FreeBSD).
>
> Yikes. For other readers, the related thread for this is
> https://www.postgresql.org/message-id/87y3i1ia4w.fsf@news-spur.riddles.org.uk

Yeah.  That's really embarrassing, especially after beating up on
various operating systems all week.  It's also an independent issue --
let's keep that on the other thread and get it fixed.

> I see the failed fync, then the same fd being fsync()d without error on the
> next checkpoint, which succeeds.
>
>         postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd:
> 0x00000005
>         postgres  9602 [003] 72380.325931:  syscalls:sys_exit_fsync:
> 0xfffffffffffffffb
> ...
>         postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd:
> 0x00000005
>         postgres  9602 [000] 72381.336840:  syscalls:sys_exit_fsync: 0x0
>
> ... and Pg continues merrily on its way without realising it lost data:
>
> [72379.834872] XFS (dm-0): writeback error on sector 118752
> [72380.324707] XFS (dm-0): writeback error on sector 118688
>
> In this test I set things up so the checkpointer would see the first fsync()
> error. But if I make checkpoints less frequent, the bgwriter aggressive, and
> kernel dirty writeback aggressive, it should be possible to have the failure
> go completely unobserved too. I'll try that next, because we've already
> largely concluded that the solution to the issue above is to PANIC on
> fsync() error. But if we don't see the error at all we're in trouble.

I suppose you only see errors because the file descriptors linger open
in the virtual file descriptor cache, which is a matter of luck
depending on how many relation segment files you touched.

In this case I think it's because the kernel didn't get around to doing the writeback before the eagerly forced checkpoint fsync()'d it. Or we didn't even queue it for writeback from our own shared_buffers until just before we fsync()'d it. After all, it's a contrived test case that tries to reproduce the issue rapidly with big writes and frequent checkpoints.

So the checkpointer had the relation open to fsync() it, and it was the checkpointer's fsync() that did writeback on the dirty page and noticed the error.

If we the kernel had done the writeback before the checkpointer opened the relation to fsync() it, we might not have seen the error at all - though as you note this depends on the file descriptor cache. You can see the silent-error behaviour in my standalone test case where I confirmed the post-4.13 behaviour. (I'm on 4.14 here).

I can try to reproduce it with postgres too, but it not only requires closing and reopening the FDs, it also requires forcing writeback before opening the fd. To make it occur in a practical timeframe I have to make my kernel writeback settings insanely aggressive and/or call sync() before re-open()ing. I don't really think it's worth it, since I've confirmed the behaviour already with the simpler test in standalone/ in the rest repo. To try it yourself, clone 


and in the master branch

cd testcases/fsync-error-clear
less README
make REOPEN=reopen standalone-run


I've pushed the postgres test to that repo too; "make postgres-run".

You'll need docker, and be warned, it's using privileged docker containers and messing with dmsetup.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: csv format for psql
Следующее
От: David Rowley
Дата:
Сообщение: Re: [HACKERS] Runtime Partition Pruning