Re: Direct I/O

From Thomas Munro
Subject Re: Direct I/O
Date
Msg-id CA+hUKGKLr1G5DFWZWNPvmyj5tGFMRqZj=VnX7PYOqkQbR4B_kQ@mail.gmail.com
In response to Re: Direct I/O  (Andres Freund <andres@anarazel.de>)
Responses Re: Direct I/O
List pgsql-hackers
On Tue, Apr 11, 2023 at 2:15 PM Andres Freund <andres@anarazel.de> wrote:
> And the fix has been merged into
> https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next
>
> I think that means it'll have to wait for 6.4 development to open (in a few
> weeks), and then will be merged into the stable branches from there.

Great!  Let's hope/assume for now that that'll fix phenomenon #2.
That still leaves the checksum-vs-concurrent-modification thing that I
called phenomenon #1, which we've not actually hit with PostgreSQL yet
but is clearly possible and can be seen with the stand-alone
repro-program I posted upthread.  You wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> I think we really need to think about whether we eventually we want to do
> something to avoid modifying pages while IO is in progress. The only
> alternative is for filesystems to make copies of everything in the IO path,
> which is far from free (and obviously prevents from using DMA for the whole
> IO). The copy we do to avoid the same problem when checksums are enabled,
> shows up quite prominently in write-heavy profiles, so there's a "purely
> postgres" reason to avoid these issues too.

+1

I wonder what the other file systems that maintain checksums (see list
at [1]) do when the data changes underneath a write.  ZFS's policy is
conservative[2], while BTRFS took the demons-will-fly-out-of-your-nose
route.  I can see arguments for both approaches (ZFS can only reach
zero-copy optimum by turning off checksums completely, while BTRFS is
happy to assume that if you break this programming rule that is not
written down anywhere then you must never want to see your data ever
again).  What about ReFS?  CephFS?

I tried to find out what POSIX says about this WRT synchronous
pwrite() (as Tom suggested, maybe we're doing something POSIX doesn't
allow), but couldn't find it on my first attempt.  It *does* say it's
undefined for aio_write() (which means that my prototype
io_method=posix_aio code that uses that stuff is undefined in the
presence of hint-bit modifications).  I don't really see why it should
vary between synchronous and asynchronous interfaces (considering the
existence of threads, shared memory etc, the synchronous interface
only removes one thread from the list of possible suspects that could
flip some bits).

But yeah, in any case, it doesn't seem great that we do that.
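
For illustration only, here is a minimal sketch of the copy-then-write
approach Andres mentions above: snapshot the shared page into a private
buffer, checksum the snapshot, and hand only the snapshot to pwrite().
Everything named here is an assumption made for the example:
sketch_checksum() is a stand-in FNV-1a hash (not PostgreSQL's or any
file system's algorithm), SKETCH_PAGE_SIZE is assumed to be 8192, and
the file is opened with plain buffered I/O rather than O_DIRECT (which
would also need an aligned buffer).  The later bit flip is done
sequentially to simulate a hint-bit update; it is not a real race.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700       /* for pread()/pwrite() declarations */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed page size for this sketch */

/* Hypothetical stand-in checksum (FNV-1a over the whole page). */
static uint32_t
sketch_checksum(const unsigned char *page)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
    {
        h ^= page[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Copy the shared page to a private buffer, checksum the copy, and write
 * the copy.  Later changes to shared_page cannot affect the bytes in
 * flight.  Returns 0 on success, -1 on a failed or short write.
 */
static int
write_page_from_copy(int fd, const unsigned char *shared_page,
                     off_t offset, uint32_t *checksum_out)
{
    unsigned char copy[SKETCH_PAGE_SIZE];

    memcpy(copy, shared_page, SKETCH_PAGE_SIZE);
    *checksum_out = sketch_checksum(copy);
    return pwrite(fd, copy, SKETCH_PAGE_SIZE, offset) == SKETCH_PAGE_SIZE ? 0 : -1;
}

/*
 * Returns 1 if the page read back from the file still matches the checksum
 * taken at write time while the live page no longer does, 0 otherwise,
 * and -1 on I/O error.
 */
static int
demo_copy_then_write(const char *path)
{
    static unsigned char shared_page[SKETCH_PAGE_SIZE];
    unsigned char readback[SKETCH_PAGE_SIZE];
    uint32_t written_sum;
    int fd;
    int ok = -1;

    memset(shared_page, 0xAB, sizeof(shared_page));
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write_page_from_copy(fd, shared_page, 0, &written_sum) == 0)
    {
        /* Simulated hint-bit update after the write was issued. */
        shared_page[100] ^= 0x01;

        if (pread(fd, readback, sizeof(readback), 0) == (ssize_t) sizeof(readback))
            ok = (sketch_checksum(readback) == written_sum &&
                  sketch_checksum(shared_page) != written_sum);
    }

    close(fd);
    unlink(path);
    return ok;
}
```

Calling demo_copy_then_write() with a scratch path should return 1.
The price is the extra memcpy() per page, which is exactly the cost
that shows up in write-heavy profiles.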

[1] https://en.wikipedia.org/wiki/Comparison_of_file_systems#Block_capabilities
[2] https://openzfs.topicbox.com/groups/developer/T950b02acdf392290/odirect-semantics-in-zfs


