Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Anthony Iliopoulos
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: 20180401005822.GJ11627@technoir
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Craig Ringer <craig@2ndquadrant.com>)
List: pgsql-hackers
On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:
>    On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop@altatus.com>
>    wrote:
> 
>      On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
> 
>      > >> Yeah, I see why you want to PANIC.
>      > >
>      > > Indeed. Even doing that leaves question marks about all the kernel
>      > > versions before v4.13, which at this point is pretty much everything
>      > > out there, not even detecting this reliably. This is messy.
> 
>      There may still be a way to reliably detect this on older kernel
>      versions from userspace, but it will be messy in any case. On EIO
>      errors, the kernel will not restore the dirty page flags, but it
>      will flip the error flags on the failed pages. One could mmap()
>      the file in question, obtain the PFNs (via /proc/pid/pagemap)
>      and enumerate those to match the ones with the error flag switched
>      on (via /proc/kpageflags). This could serve at least as a detection
>      mechanism, but one could also further use this info to logically
>      map the pages that failed IO back to the original file offsets,
>      and potentially retry IO just for those file ranges that cover
>      the failed pages. Just an idea, not tested.
> 
>    That sounds like a huge amount of complexity, with uncertainty as to how
>    it'll behave kernel-to-kernel, for negligible benefit.

Those interfaces have been around since the 2.6 kernel days and are
rather stable; I was merely responding to the comment in your original
post about having a way of finding out which page(s) failed. I assume
that there would indeed be no benefit, especially since those errors
are usually not transient (typically they come from hard medium faults),
and although a filesystem could theoretically mask the error by allocating
a different logical block, I am not aware of any implementation that
currently does that.
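For the sake of illustration, a rough sketch of the idea (untested, like
the original suggestion; it assumes CAP_SYS_ADMIN, since otherwise the
kernel zeroes the PFN field in pagemap, and KPF_ERROR as defined in
linux/kernel-page-flags.h; most error handling omitted for brevity):

/*
 * Map a file, translate each virtual page to its PFN via
 * /proc/self/pagemap, then test the PG_error bit (KPF_ERROR, bit 1)
 * of that PFN via /proc/kpageflags.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define PM_PRESENT   (1ULL << 63)        /* pagemap: page is present  */
#define PM_PFN_MASK  ((1ULL << 55) - 1)  /* pagemap: PFN in bits 0-54 */
#define KPF_ERROR    (1ULL << 1)         /* kpageflags: PG_error set  */

int
main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    long        psz = sysconf(_SC_PAGESIZE);
    int         fd = open(argv[1], O_RDONLY);
    struct stat st;

    fstat(fd, &st);

    size_t      npages = (st.st_size + psz - 1) / psz;
    volatile const unsigned char *map =
        mmap(NULL, npages * psz, PROT_READ, MAP_SHARED, fd, 0);

    /* Fault every page in so it has a PFN we can look up. */
    for (size_t i = 0; i < npages; i++)
        (void) map[i * psz];

    int         pagemap = open("/proc/self/pagemap", O_RDONLY);
    int         kpf = open("/proc/kpageflags", O_RDONLY);

    for (size_t i = 0; i < npages; i++)
    {
        uint64_t    pme, flags;
        off_t       off = ((uintptr_t) map / psz + i) * sizeof(uint64_t);

        pread(pagemap, &pme, sizeof(pme), off);   /* virtual page -> PFN */
        if (!(pme & PM_PRESENT))
            continue;
        pread(kpf, &flags, sizeof(flags),
              (pme & PM_PFN_MASK) * sizeof(flags)); /* PFN -> page flags */
        if (flags & KPF_ERROR)
            printf("failed IO at file offset %zu\n", i * (size_t) psz);
    }
    return 0;
}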

>    I was exploring the idea of doing selective recovery of one relfilenode,
>    based on the assumption that we know the filenode related to the fd that
>    failed to fsync(). We could redo only WAL on that relation. But it fails
>    the same test: it's too complex for a niche case that shouldn't happen in
>    the first place, so it'll probably have bugs, or grow bugs in bitrot over
>    time.

Fully agree; those cases should be sufficiently rare that a complex
and possibly unmaintainable solution is not really warranted.

>    Remember, if you're on ext4 with errors=remount-ro, you get shut down even
>    harder than a PANIC. So we should just use the big hammer here.

I am not entirely sure what you mean here: does Pg really treat write()
errors as fatal? Also, the kinds of errors that ext4 detects with this
option are at the superblock level and concern metadata rather than actual
data writes (recall that data writes are buffered anyway; no actual device
IO has to take place at the time of write()).
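To illustrate the buffering point, a minimal example (the exact point at
which a writeback error is reported can vary across kernel versions):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    char    buf[8192] = {0};

    /*
     * A buffered write merely dirties pages in the page cache; no
     * device IO happens here, so a medium error cannot surface yet.
     */
    if (write(fd, buf, sizeof(buf)) < 0)
        perror("write");    /* errors detectable at write() time only */

    /*
     * The writeback failure (EIO) is reported when the dirty pages
     * are flushed to the device, i.e. at fsync() time.
     */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}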

Best regards,
Anthony

