Re: new option to allow pg_rewind to run without full_page_writes

From: Andres Freund
Subject: Re: new option to allow pg_rewind to run without full_page_writes
Msg-id: 20221106023819.tpmvqa6kuy4cvtc7@awork3.anarazel.de
In response to: new option to allow pg_rewind to run without full_page_writes  (Jérémie Grauer <jeremie.grauer@cosium.com>)
Responses: Re: new option to allow pg_rewind to run without full_page_writes  (Jérémie Grauer <jeremie.grauer@cosium.com>)
List: pgsql-hackers

Hi,

On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> prevent it to run into a torn page during operation.
>
> This is usually a good call, but some file systems like ZFS are naturally
> immune to torn page (maybe btrfs too, but I don't know for sure for this
> one).

Note that this isn't about torn pages in case of crashes, but about reading
pages while they're being written to.

Right now, that definitely allows for torn reads, because of the way
pg_read_binary_file() is implemented.  We only ensure a 4k read size from the
view of our code, which obviously can lead to torn 8k page reads, no matter
what the filesystem guarantees.
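
As a rough illustration of the torn-read scenario (this is a simulation, not
PostgreSQL code; BLCKSZ and CHUNK_SIZE mirror the 8k and 4k sizes mentioned
above, and every function name is made up for the example), a page copied in
two 4k pieces can mix two versions if a writer replaces it in between:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLCKSZ     8192		/* 8k page size, as discussed above */
#define CHUNK_SIZE 4096		/* 4k read granularity, as discussed above */

/* Hypothetical stand-in for a concurrent writer: replaces the whole page. */
static void
overwrite_page(char *page, char fill)
{
	memset(page, fill, BLCKSZ);
}

/*
 * Copy one page out of "shared" in CHUNK_SIZE pieces.  If interleave is
 * nonzero, the simulated writer rewrites the page after the first chunk
 * has been copied, which is what a concurrent write could do to a real
 * chunked read.
 */
static void
read_page_in_chunks(char *shared, char *dst, int interleave)
{
	size_t		off;

	assert(BLCKSZ % CHUNK_SIZE == 0);
	for (off = 0; off < BLCKSZ; off += CHUNK_SIZE)
	{
		memcpy(dst + off, shared + off, CHUNK_SIZE);
		if (interleave && off == 0)
			overwrite_page(shared, 'B');
	}
}

/* In this simulation a page is torn if it mixes bytes of two versions. */
static int
page_is_torn(const char *page)
{
	size_t		i;

	for (i = 1; i < BLCKSZ; i++)
		if (page[i] != page[0])
			return 1;
	return 0;
}
```

With the page initially filled with 'A', an interleaved read ends up with 'A'
in the first half and 'B' in the second. Nothing on the storage side was ever
torn; the reader assembled the mix itself.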

Also, for reasons I don't understand we use C streaming IO for
pg_read_binary_file(), so you'd also need to ensure that the buffer size used
by the stream implementation can't cause the reads to happen in smaller
chunks.  Afaict we really shouldn't use file streams here, then we'd at least
have control over that aspect.
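
A minimal sketch of what "control over that aspect" could look like, assuming
plain POSIX file descriptors instead of a FILE stream (the function names are
hypothetical and this is not how pg_read_binary_file() is written): the whole
block is requested with one pread() call, and a short read is reported to the
caller rather than silently stitched together with a second read.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192		/* 8k page size, as discussed above */

/*
 * Request one whole block with a single pread() call.
 * Returns 1 if the full block arrived, 0 on a short read (including EOF),
 * and -1 on error.  A short read is not retried here, because combining
 * two reads would reintroduce the risk of a torn page.
 */
static int
read_block_once(int fd, off_t blkno, char *buf)
{
	ssize_t		n;

	assert(buf != NULL);
	n = pread(fd, buf, BLCKSZ, blkno * (off_t) BLCKSZ);
	if (n < 0)
		return -1;
	return n == BLCKSZ ? 1 : 0;
}

/* Convenience wrapper (hypothetical): open, read one block, close. */
static int
read_block_from_path(const char *path, off_t blkno, char *buf)
{
	int			fd = open(path, O_RDONLY);
	int			rc;

	if (fd < 0)
		return -1;
	rc = read_block_once(fd, blkno, buf);
	close(fd);
	return rc;
}
```

This only removes the stdio buffer size as a variable. Whether a single 8k
read can still observe a partially written page depends on the kernel and the
filesystem, which is the question raised below.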


Does ZFS actually guarantee that there never can be short reads? As soon as
they are possible, full page writes are needed.



This isn't a fundamental issue - we could have a version of
pg_read_binary_file() for relation data that prevents the page being written
out concurrently by locking the buffer page. In addition it could often avoid
needing to read the page from the OS / disk, if present in shared buffers
(perhaps minus cases where we haven't flushed the WAL yet, but we could also
flush the WAL in those).

Greetings,

Andres Freund


