Re: XLogReadRecord() error in XlogReadTwoPhaseData()

From: Noah Misch
Subject: Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Msg-id: 20220116210241.GC756210@rfd.leadboat.com
In response to: Re: XLogReadRecord() error in XlogReadTwoPhaseData()  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: XLogReadRecord() error in XlogReadTwoPhaseData()  (Thomas Munro <thomas.munro@gmail.com>)
Re: XLogReadRecord() error in XlogReadTwoPhaseData()  (Noah Misch <noah@leadboat.com>)
Re: XLogReadRecord() error in XlogReadTwoPhaseData()  (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
Cancel that kernel upgrade idea.  I no longer expect it to help...

On Sun, Jan 16, 2022 at 10:19:30PM +1300, Thomas Munro wrote:
> On Sun, Jan 16, 2022 at 8:12 PM Noah Misch <noah@leadboat.com> wrote:
> > For specifics of the kernel bug, see the attached test program.  In brief, the
> > bug arises if one process is write()ing or pwrite()ing a file at about the
> > same time that another process is read()ing or pread()ing the same.  POSIX
> > says the reader should see the data as it existed before the write or the
> > newly-written data.  On this kernel, the reader can see zeros instead.  That
> > leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
> > block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The writers
> > aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
> > the kernel bug yield the failure.
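
The outcomes described in the quoted paragraph can be told apart mechanically.
The following is a minimal sketch, not taken from the attached test program:
it assumes the writer replaces a block filled with one nonzero byte value by a
block filled with another, and the helper name classify_read and the enum
values are hypothetical.  With identical old and new values (the
-DCHANGE_CONTENT=0 situation), any result other than READ_OLD points at
something like the zeros bug.

```c
#include <stddef.h>

/* What a reader may observe while racing a writer that overwrites a block
 * of old_byte with a block of new_byte (both assumed nonzero). */
enum read_outcome { READ_OLD, READ_NEW, READ_TORN, READ_UNEXPECTED };

/* Hypothetical helper: classify the len bytes the reader got back. */
enum read_outcome
classify_read(const unsigned char *buf, size_t len,
              unsigned char old_byte, unsigned char new_byte)
{
    size_t n_old = 0;
    size_t n_new = 0;

    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] == old_byte)
            n_old++;
        else if (buf[i] == new_byte)
            n_new++;
        else
            return READ_UNEXPECTED; /* neither image, e.g. a zero byte */
    }
    if (n_old == len)
        return READ_OLD;            /* data as it existed before the write */
    if (n_new == len)
        return READ_NEW;            /* the newly-written data */
    return READ_TORN;               /* mix of both images: non-atomic */
}
```

READ_OLD and READ_NEW are the two results the quoted reading of POSIX allows;
READ_TORN corresponds to the "non-atomic" entries in the list below and
READ_UNEXPECTED to the "zeros bug" entries.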

The difference between kittiwake and thorntail comes from thorntail using xfs
and kittiwake using ext4.  Running the io-rectitude.c tests on an ext4
partition on thorntail, I see the zeros bug just like I do on kittiwake.  I
don't see the zeros bug on ppc64 or x86_64, just sparc64 so far:

 * ext4, Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009):
 * pwrite/pread is non-atomic if count>16 (no -D switches)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 *
 * ext4, Linux 4.9.0-13-sparc64-smp (Debian):
 * pwrite/pread is non-atomic if count>4 (no -D switches)
 * write/read is non-atomic if count>4 (-DUSE_SEEK)
 * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
 * pwrite/pread has zeros bug for count>127 (-DCHANGE_CONTENT=0)
 * pwrite/pread w/ O_SYNC has zeros bug (-DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC)
 *    far less frequent w/ O_SYNC, but it still happens
 * pwrite/pread w/o REOPEN also has zeros bug for count>127 (-DCHANGE_CONTENT=0 -DREOPEN=0)
 * write/read has zeros bug for count>127 (-DUSE_SEEK -DCHANGE_CONTENT=0)
 * write/read w/ O_SYNC has zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC)
 * write/read w/o REOPEN is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DREOPEN=0)
 *
 * ext4, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
 * [behaviors match the previous kernel exactly]
 *
 * ext4, Linux 5.15.0-2-powerpc64 (Debian bookworm/sid):
 * [atomicity matches previous kernel, but zeros bug does not]
 * pwrite/pread is non-atomic if count>4 (no -D switches)
 * write/read is non-atomic if count>4 (-DUSE_SEEK)
 * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 *
 * ext4, Linux 5.15.5-0-virt x86_64 (Alpine):
 * [behaviors match the previous kernel exactly]
 *
 * xfs, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
 * pwrite/pread is atomic (-DXLOG_BLCKSZ=8192000)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)

> > We could opt to work around that by writing
> > only the not-already-written portion of a WAL block, but I doubt that's
> > worthwhile unless it happens to be a performance win anyway.
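
To make the quoted workaround concrete, here is a minimal sketch of "write
only the not-already-written portion" of a block.  It is not PostgreSQL code:
the function name write_block_tail and its parameters are hypothetical, and it
assumes the caller tracks how many bytes of the block earlier calls already
wrote.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write bytes [already_written, valid_len) of an in-memory block to their
 * place in the file, leaving earlier bytes of the block untouched on disk.
 * block_start is the file offset of the block's first byte.
 * Returns 0 on success, -1 on error with errno set.
 */
int
write_block_tail(int fd, off_t block_start, const char *block,
                 size_t already_written, size_t valid_len)
{
    size_t done = already_written;

    while (done < valid_len)
    {
        ssize_t n = pwrite(fd, block + done, valid_len - done,
                           block_start + (off_t) done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;           /* interrupted: retry the same range */
            return -1;
        }
        if (n == 0)
        {
            errno = ENOSPC;         /* no progress: treat as out of space */
            return -1;
        }
        done += (size_t) n;         /* short write: continue after it */
    }
    return 0;
}
```

A caller that has already flushed the first 100 bytes of a block and now has
300 valid bytes would call write_block_tail(fd, start, block, 100, 300), so a
concurrent reader of the first 100 bytes never races a write to those bytes.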

My next steps:

- Report a Debian bug for the sparc64+ext4 zeros problem.

- Try to falsify the idea that "write only the not-already-written portion of
  a WAL block" is an effective workaround.  Specifically, modify the test
  program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
  while the reader process reads offset N.  If the reader sees a zero, that
  workaround is ineffective.

- Implement the workaround, if I didn't falsify its effectiveness.  If it
  doesn't hurt performance on x86_64, we can use it unconditionally.
  Otherwise, limit its use to sparc64 Linux.
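
The falsification test in the second step could look roughly like the sketch
below.  It is an illustration under assumptions, not the modified test
program: the function name count_zero_reads_at, the 64-byte cap on k, and the
use of one shared descriptor with pread()/pwrite() are all choices made here.
Whether single-byte reads can trigger the bug at all is part of what such a
test would have to establish.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Race a writer that touches only offsets [n-k, n-1] and [n+1, n+k] against
 * a reader of offset n.  fd must be open read-write on a file of at least
 * n+k+1 bytes whose byte at offset n is nonzero; requires 1 <= k <= 64 and
 * k <= n.  Returns how often the reader saw a zero at offset n, or -1 on
 * bad arguments or fork() failure.
 */
long
count_zero_reads_at(int fd, off_t n, size_t k, long reads)
{
    char    fill[64];
    long    zeros = 0;
    pid_t   writer;

    if (k == 0 || k > sizeof(fill) || (off_t) k > n)
        return -1;

    writer = fork();
    if (writer < 0)
        return -1;
    if (writer == 0)
    {
        /* Writer: keep mutating both neighbor ranges with nonzero bytes. */
        for (unsigned char c = 1;; c = (unsigned char) (c % 255 + 1))
        {
            memset(fill, c, k);
            if (pwrite(fd, fill, k, n - (off_t) k) < 0 ||
                pwrite(fd, fill, k, n + 1) < 0)
                _exit(1);
        }
    }

    /* Reader: offset n is never written, so a zero here is the bug. */
    for (long i = 0; i < reads; i++)
    {
        char b;

        if (pread(fd, &b, 1, n) == 1 && b == 0)
            zeros++;
    }

    kill(writer, SIGKILL);
    waitpid(writer, NULL, 0);
    return zeros;
}
```

A nonzero return value would mean the reader saw zeros at a byte nobody was
writing, i.e. that writing only the not-already-written portion of a block
does not protect readers of the earlier portion.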

> > Separately, while I don't know of relevance to PostgreSQL, I was interested to
> > see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.
> 
> FWIW there was some related discussion over here:
> 
> https://www.postgresql.org/message-id/flat/17064-bb0d7904ef72add3%40postgresql.org

That gave me the idea to test different filesystems.  Thanks.  Incidentally, I
find https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic is
mistaken about POSIX requirements.  There's no precedent for POSIX writing
"two threads" when it means "two threads of the same process".  Moreover, the
part about "shall also apply whenever a file descriptor is successfully
closed, however caused (for example [...] process termination)" would be
superfluous in a requirement specific to threads of one process.  Having said
that, if the most-prominent POSIX regular file implementation (ext4 on x86_64)
doesn't implement a POSIX requirement, that has the same practical
consequences for PostgreSQL as POSIX not requiring it.

I now see that newer Linux ext4 has drifted further from POSIX atomicity than
CentOS 7 ext4 had.  In CentOS 7 ext4, plain write()/read() was still atomic.
By Linux 5.15.5, write()/read() had become non-atomic as well.


