Обсуждение: XLogReadRecord() error in XlogReadTwoPhaseData()

Поиск

Список

Период

Сортировка

XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

07 ноября 2021 г., 04:31:57

On Sun, Oct 24, 2021 at 04:35:02PM -0700, Noah Misch wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-10-24%2012%3A01%3A10
> got an interesting v9.6 failure [...]:
> 
> 2021-10-24 14:25:29.263 CEST [34569:175] pgbench ERROR:  could not read two-phase state from xlog at 0/158F4E0
> 2021-10-24 14:25:29.263 CEST [34569:176] pgbench STATEMENT:  COMMIT PREPARED 'c1';

As a first step, let's report the actual XLogReadRecord() error message.
Attached.  All the other sites that expect no error already do this.

Вложения

could-not-read-two-phase-state-v1.patch

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Michael Paquier

Дата:

08 ноября 2021 г., 07:42:46

On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote:
> As a first step, let's report the actual XLogReadRecord() error message.
> Attached.

Good catch!  This looks good.

> All the other sites that expect no error already do this.

Indeed.  Looking closer, I think that we'd better improve
DecodingContextFindStartpoint(),
pg_logical_replication_slot_advance(), XLogSendLogical() as well as
pg_logical_slot_get_changes_guts() to follow a format closer to what
you have in your patch, with an error message that describes the
context where the problem has been found, instead of just elog()'ing
what XLogReadRecord() returns.
--
Michael

Вложения

signature.asc

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Andrey Borodin

Дата:

08 ноября 2021 г., 14:38:47


> 7 нояб. 2021 г., в 06:31, Noah Misch <noah@leadboat.com> написал(а):
>
> As a first step, let's report the actual XLogReadRecord() error message.
> Attached.  All the other sites that expect no error already do this.

BTW some time ago I've spotted a good number of related unreported errors [0].


[0] https://www.postgresql.org/message-id/A0A36AEE-3476-4326-B877-EE2B55BAEEED%40yandex-team.ru

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Michael Paquier

Дата:

09 ноября 2021 г., 04:51:40

On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote:
> Indeed.  Looking closer, I think that we'd better improve
> DecodingContextFindStartpoint(),
> pg_logical_replication_slot_advance(), XLogSendLogical() as well as
> pg_logical_slot_get_changes_guts() to follow a format closer to what
> you have in your patch, with an error message that describes the
> context where the problem has been found, instead of just elog()'ing
> what XLogReadRecord() returns.

FYI, I have just begun a new thread about those ones:
https://www.postgresql.org/message-id/YYnTH6OyOwQcAdkw@paquier.xyz
--
Michael

Вложения

signature.asc

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

12 ноября 2021 г., 04:22:34

On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote:
> On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote:
> > On Sun, Oct 24, 2021 at 04:35:02PM -0700, Noah Misch wrote:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-10-24%2012%3A01%3A10
> > > got an interesting v9.6 failure [...]:
> > > 
> > > 2021-10-24 14:25:29.263 CEST [34569:175] pgbench ERROR:  could not read two-phase state from xlog at 0/158F4E0
> > > 2021-10-24 14:25:29.263 CEST [34569:176] pgbench STATEMENT:  COMMIT PREPARED 'c1';

Tom Lane reported another instance today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-11%2013%3A29%3A58

Each of the three failures happened on a sparc64 Debian+gcc machine.  I had
tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
without reproducing this.

> > As a first step, let's report the actual XLogReadRecord() error message.
> > Attached.
> 
> Good catch!  This looks good.

Pushed.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Tom Lane

Дата:

18 ноября 2021 г., 01:47:10

Noah Misch <noah@leadboat.com> writes:
> Tom Lane reported another instance today:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-11%2013%3A29%3A58

> Each of the three failures happened on a sparc64 Debian+gcc machine.  I had
> tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> without reproducing this.

>>> As a first step, let's report the actual XLogReadRecord() error message.
>>> Attached.

>> Good catch!  This looks good.

> Pushed.

Well, we didn't have to wait too long [1]:

#   at t/003_cic_2pc.pl line 143.
#                   'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR:  could not read two-phase
statefrom WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048 
# pgbench: error: client 2 script 3 aborted in command 2 query 0: ERROR:  canceling statement due to lock timeout
# pgbench: fatal: Run was aborted; the above results are incomplete.

I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
written out yet.  Have we got any synchronization around that?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

18 ноября 2021 г., 10:05:06

On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Each of the three failures happened on a sparc64 Debian+gcc machine.  I had
> > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > without reproducing this.

> #                   'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR:  could not read two-phase
statefrom WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
 
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24

Two others:
ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at
0/1668020
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52

> I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
> that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
> written out yet.  Have we got any synchronization around that?

If the WAL address isn't on disk, that error doesn't happen.  Instead,
read_local_xlog_page() blocks waiting for the WAL to become available.  It's
still possible that we make the WAL region exist, but it somehow doesn't
contain the right data until shortly later.  FinishPreparedTransaction() takes
TwoPhaseStateLock and looks for an entry having gxact->valid.  EndPrepare()
fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid
under TwoPhaseStateLock.  All freelist (freeGXacts) interaction holds
TwoPhaseStateLock.  I'm not seeing a gap in that synchronization.

I don't have a great theory, but here are candidates to examine next:

- Run with wal_debug=on to confirm logged write location matches read location.
- Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
  pg_waldump -s 0/01000000" at the end of the test.
- Dump WAL page binary image at the point of failure.
- Log which branches in XLogReadRecord() are taken.

What else might help?

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Andrey Borodin

Дата:

18 ноября 2021 г., 15:48:02


> 18 нояб. 2021 г., в 12:05, Noah Misch <noah@leadboat.com> написал(а):
>
> What else might help?

Let's add more tests that check survival of 2PC through crash recovery? We do now only one restart. Maybe it worth to
do4 or 8? 

Best regards, Andrey Borodin.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Tom Lane

Дата:

18 ноября 2021 г., 17:44:04

Andrey Borodin <x4mmm@yandex-team.ru> writes:
> Let's add more tests that check survival of 2PC through crash recovery? We do now only one restart. Maybe it worth to
do4 or 8? 

That seems a little premature when we can't explain the failure
we have.  Also, buildfarm cycles aren't free.

            regards, tom lane

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Tom Lane

Дата:

20 ноября 2021 г., 08:22:28

Noah Misch <noah@leadboat.com> writes:
> Tom Turelinckx, are you able to provide remote access to kittiwake or
> tadarida?  I'd use it to attempt the above things.  All else being equal,
> kittiwake is more relevant since it's still supported upstream.

snapper just exhibited the same failure, too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-11-18%2016%3A09%3A49

            regards, tom lane

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Tom Lane

Дата:

20 ноября 2021 г., 08:55:16

I wrote:
> snapper just exhibited the same failure, too:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-11-18%2016%3A09%3A49

I grepped the buildfarm logs for all recent (last 3 months) occurrences of
'could not read two-phase state'.  Here's the results:

  sysname  |    branch     |      snapshot       |          stage          |
                                                                    l
                                                               

-----------+---------------+---------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 kittiwake | REL9_6_STABLE | 2021-10-24 12:01:10 | pgbenchCheck            | #                   'client 1 aborted in
state3: ERROR:  could not read two-phase state from xlog at 0/158F4E0 
 kittiwake | REL_13_STABLE | 2021-10-26 12:51:11 | ContribCheck-en_US.utf8 | #                   'pgbench: error:
client3 script 1 aborted in command 4 query 0: ERROR:  could not read two-phase state from WAL at 0/168C8D8 
 kittiwake | REL_14_STABLE | 2021-11-08 15:42:35 | ContribCheck-en_US.utf8 | #                   'pgbench: error:
client0 script 0 aborted in command 3 query 0: ERROR:  could not read two-phase state from WAL at 0/17ABF48 
 kittiwake | REL_13_STABLE | 2021-11-16 15:00:52 | ContribCheck-en_US.utf8 | #                   'pgbench: error:
client3 script 1 aborted in command 4 query 0: ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect
resourcemanager data checksum in record at 0/1668020 
 snapper   | REL_14_STABLE | 2021-11-18 16:09:49 | contrib-amcheckCheck    | #                   'pgbench: error:
client3 script 1 aborted in command 4 query 0: ERROR:  could not read two-phase state from WAL at 0/1770328: unexpected
pageaddr0/0 in log segment 000000010000000000000001, offset 7798784 
 tadarida  | REL_11_STABLE | 2021-11-11 13:29:58 | pgbenchCheck            | #                   'client 3 aborted in
command3 (SQL) of script 0; ERROR:  could not read two-phase state from WAL at 0/1716C68 
 tadarida  | REL_10_STABLE | 2021-11-12 13:01:15 | pgbenchCheck            | #                   'client 4 aborted in
command3 of script 0; ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850:
wanted24, got 0 
 tadarida  | HEAD          | 2021-11-17 13:01:24 | contrib-amcheckCheck    | #                   'pgbench: error:
client0 script 1 aborted in command 4 query 0: ERROR:  could not read two-phase state from WAL at 0/159EF88: unexpected
pageaddr0/0 in log segment 000000010000000000000001, offset 5890048 

So not all are exactly 'unexpected pageaddr 0/0', but they do all
look like we read garbage data.

            regards, tom lane

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Thomas Munro

Дата:

16 января 2022 г., 12:19:30

On Sun, Jan 16, 2022 at 8:12 PM Noah Misch <noah@leadboat.com> wrote:
> For specifics of the kernel bug, see the attached test program.  In brief, the
> bug arises if one process is write()ing or pwrite()ing a file at about the
> same time that another process is read()ing or pread()ing the same.  POSIX
> says the reader should see the data as it existed before the write or the
> newly-written data.  On this kernel, the reader can see zeros instead.  That
> leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
> block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The writers
> aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
> the kernel bug yield the failure.  We could opt to work around that by writing
> only the not-already-written portion of a WAL block, but I doubt that's
> worthwhile unless it happens to be a performance win anyway.
>
> Separately, while I don't know of relevance to PostgreSQL, I was interested to
> see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.

FWIW there was some related discussion over here:

https://www.postgresql.org/message-id/flat/17064-bb0d7904ef72add3%40postgresql.org

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

17 января 2022 г., 00:02:41

Cancel that kernel upgrade idea.  I no longer expect it to help...

On Sun, Jan 16, 2022 at 10:19:30PM +1300, Thomas Munro wrote:
> On Sun, Jan 16, 2022 at 8:12 PM Noah Misch <noah@leadboat.com> wrote:
> > For specifics of the kernel bug, see the attached test program.  In brief, the
> > bug arises if one process is write()ing or pwrite()ing a file at about the
> > same time that another process is read()ing or pread()ing the same.  POSIX
> > says the reader should see the data as it existed before the write or the
> > newly-written data.  On this kernel, the reader can see zeros instead.  That
> > leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
> > block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The writers
> > aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
> > the kernel bug yield the failure.

The difference between kittiwake and thorntail comes from thorntail using xfs
and kittiwake using ext4.  Running the io-rectitude.c tests on an ext4
partition on thorntail, I see the zeros bug just like I do on kittiwake.  I
don't see the zeros bug on ppc64 or x86_64, just sparc64 so far:

 * ext4, Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009):
 * pwrite/pread is non-atomic if count>16 (no -D switches)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 *
 * ext4, Linux 4.9.0-13-sparc64-smp (Debian):
 * pwrite/pread is non-atomic if count>4 (no -D switches)
 * write/read is non-atomic if count>4 (-DUSE_SEEK)
 * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
 * pwrite/pread has zeros bug for count>127 (-DCHANGE_CONTENT=0)
 * pwrite/pread w/ O_SYNC has zeros bug (-DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC)
 *    far less frequent w/ O_SYNC, but it still happens
 * pwrite/pread w/o REOPEN also has zeros bug for count>127 (-DCHANGE_CONTENT=0 -DREOPEN=0)
 * write/read has zeros bug for count>127 (-DUSE_SEEK -DCHANGE_CONTENT=0)
 * write/read w/ O_SYNC has zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC)
 * write/read w/o REOPEN is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DREOPEN=0)
 *
 * ext4, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
 * [behaviors match the previous kernel exactly]
 *
 * ext4, Linux 5.15.0-2-powerpc64 (Debian bookworm/sid):
 * [atomicity matches previous kernel, but zeros bug does not]
 * pwrite/pread is non-atomic if count>4 (no -D switches)
 * write/read is non-atomic if count>4 (-DUSE_SEEK)
 * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 *
 * ext4, Linux 5.15.5-0-virt x86_64 (Alpine):
 * [behaviors match the previous kernel exactly]
 *
 * xfs, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
 * pwrite/pread is atomic (-DXLOG_BLCKSZ=8192000)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)

> > We could opt to work around that by writing
> > only the not-already-written portion of a WAL block, but I doubt that's
> > worthwhile unless it happens to be a performance win anyway.

My next steps:

- Report a Debian bug for the sparc64+ext4 zeros problem.

- Try to falsify the idea that "write only the not-already-written portion of
  a WAL block" is an effective workaround.  Specifically, modify the test
  program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
  while the reader process reads offset N.  If the reader sees a zero, that
  workaround is ineffective.

- Implement the workaround, if I didn't falsify its effectiveness.  If it
  doesn't hurt performance on x86_64, we can use it unconditionally.
  Otherwise, limit its use to sparc64 Linux.

> > Separately, while I don't know of relevance to PostgreSQL, I was interested to
> > see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.
> 
> FWIW there was some related discussion over here:
> 
> https://www.postgresql.org/message-id/flat/17064-bb0d7904ef72add3%40postgresql.org

That gave me the idea to test different filesystems.  Thanks.  Incidentally, I
find https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic is
mistaken about POSIX requirements.  There's no precedent for POSIX writing
"two threads" when it means "two threads of the same process".  Moreover, the
part about "shall also apply whenever a file descriptor is successfully
closed, however caused (for example [...] process termination)" would be
superfluous in a requirement specific to threads of one process.  Having said
that, if the most-prominent POSIX regular file implementation (ext4 on x86_64)
doesn't implement a POSIX requirement, that has the same practical
consequences for PostgreSQL as POSIX not requiring it.

I now see newer Linux ext4 has drifted further away from POSIX atomicity,
compared to CentOS 7.  In CentOS 7 ext4, plain write()/read() was still
atomic.  By Linux 5.15.5, those abandoned atomicity.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Thomas Munro

Дата:

20 января 2022 г., 22:34:22

On Mon, Jan 17, 2022 at 10:02 AM Noah Misch <noah@leadboat.com> wrote:
> - Report a Debian bug for the sparc64+ext4 zeros problem.

I suspect that 027_stream_regress.pl hits this kernel bug with high
probability[1].  I wonder if the owner of kittiwake and tadarida would
consider setting up an xfs file system?  Or alternatively, since ext4
didn't support concurrent writes until recently, I wonder if there is
an option somewhere to turn the new concurrency stuff off, or failing
that, if we could temporarily downgrade the kernel to an older version
that does inode-level read/write locking.

[1]
https://www.postgresql.org/message-id/CA%2BhUKG%2BeuZ%3Ddc27ZB%3Ds74x0q%3DzU%3D2%3Dvs8%2B6TkJoTUiCPUd2dQA%40mail.gmail.com

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

21 января 2022 г., 06:30:42

On Fri, Jan 21, 2022 at 08:34:22AM +1300, Thomas Munro wrote:
> On Mon, Jan 17, 2022 at 10:02 AM Noah Misch <noah@leadboat.com> wrote:
> > - Report a Debian bug for the sparc64+ext4 zeros problem.
> 
> I suspect that 027_stream_regress.pl hits this kernel bug with high
> probability[1].  I wonder if the owner of kittiwake and tadarida would
> consider setting up an xfs file system?  Or alternatively, since ext4
> didn't support concurrent writes until recently, I wonder if there is
> an option somewhere to turn the new concurrency stuff off, or failing
> that, if we could temporarily downgrade the kernel to an older version
> that does inode-level read/write locking.

If the write-only-new-bytes approach works, I think we'd want to revert those
changes.  Perhaps a cheaper stopgap is to make the affected tests skip on
sparc Linux.  Is that worth doing?  (Could even limit the skip to ext4,
e.g. by testing "df -x ext4 . >/dev/null".)

> [1]
https://www.postgresql.org/message-id/CA%2BhUKG%2BeuZ%3Ddc27ZB%3Ds74x0q%3DzU%3D2%3Dvs8%2B6TkJoTUiCPUd2dQA%40mail.gmail.com

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

22 января 2022 г., 21:52:41

On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote:
> My next steps:
> 
> - Report a Debian bug for the sparc64+ext4 zeros problem.

(Not done yet.)

> - Try to falsify the idea that "write only the not-already-written portion of
>   a WAL block" is an effective workaround.  Specifically, modify the test
>   program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
>   while the reader process reads offset N.  If the reader sees a zero, that
>   workaround is ineffective.

The reader did not see a zero.  In addition to bytes outside the write being
immune to the zeros bug, the first and last forty bytes of a write were immune
to the zeros bug.

> - Implement the workaround, if I didn't falsify its effectiveness.  If it
>   doesn't hurt performance on x86_64, we can use it unconditionally.
>   Otherwise, limit its use to sparc64 Linux.

Attached.  With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl.  Without
the patch, it failed many times, always within 1.3hr.  For easier review, this
patch uses the new behavior on all platforms.  Before commit and back-patch, I
plan to limit use of the new behavior to sparc Linux.  Future work can
benchmark the new behavior and, if it performs well, make it unconditional in
v15+.  I would expect performance to be unchanged or slightly better, because
the new behavior requests less futile work from the OS.

Вложения

xlog-write-once-v1.patch

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Thomas Munro

Дата:

23 января 2022 г., 23:42:13

On Sun, Jan 23, 2022 at 7:52 AM Noah Misch <noah@leadboat.com> wrote:
> Attached.  With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl.  Without
> the patch, it failed many times, always within 1.3hr.  For easier review, this
> patch uses the new behavior on all platforms.  Before commit and back-patch, I
> plan to limit use of the new behavior to sparc Linux.  Future work can
> benchmark the new behavior and, if it performs well, make it unconditional in
> v15+.  I would expect performance to be unchanged or slightly better, because
> the new behavior requests less futile work from the OS.

One detail is that wal_level=open_datasync, wal_senders=0,
wal_level=minimal will panic, because O_DIRECT requires fs
page-aligned access (and fails in various other ways on other OSes, eg
expensive read-before-write every time).  That's an ultra-niche
concern likely affecting nobody, especially when multiplied by the
odds that anyone is using that stack at all (considering that
streaming rep has apparently been borked for years on linux/sparc/ext4
and nobody told us).

I was +1 for the control file locking change in that other thread (I
view the atomicity stuff as Linux-realpolitik-vs-POSIX, not to mention
that we run Windows too, which requires separate analysis).  I'm less
sure it makes sense to do anything to support the presumed bogus
zeroes bug for (probably) no real users, especially before we've even
reported it and heard some analysis, for example acceptance that it's
broken and confirmation that this really is just a sparc problem.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Andres Freund

Дата:

24 января 2022 г., 01:26:53

Hi,

On 2022-01-24 09:42:13 +1300, Thomas Munro wrote:
> On Sun, Jan 23, 2022 at 7:52 AM Noah Misch <noah@leadboat.com> wrote:
> > Future work can benchmark the new behavior and, if it performs well, make
> > it unconditional in v15+.  I would expect performance to be unchanged or
> > slightly better, because the new behavior requests less futile work from
> > the OS.

I doubt it'll be generally applicable. Turning a write operation into a
read-write isn't free. Yes, often enough it's likely that the prior page is
still in cache, but I don't think we can rely on that in general.

It also just fundamentally locks us into never using O_DIRECT in anger. I
don't think that's a good direction.


> One detail is that wal_level=open_datasync, wal_senders=0,
> wal_level=minimal will panic, because O_DIRECT requires fs
> page-aligned access (and fails in various other ways on other OSes, eg
> expensive read-before-write every time).  That's an ultra-niche
> concern likely affecting nobody, especially when multiplied by the
> odds that anyone is using that stack at all (considering that
> streaming rep has apparently been borked for years on linux/sparc/ext4
> and nobody told us).

Seems like the patch should at least make this error out?

Greetings,

Andres Freund

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

24 января 2022 г., 02:29:27

On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote:
> I'm less
> sure it makes sense to do anything to support the presumed bogus
> zeroes bug for (probably) no real users, especially before we've even
> reported it and heard some analysis, for example acceptance that it's
> broken and confirmation that this really is just a sparc problem.

Got it.  I've already done a bad thing leaving the buildfarm broken for three
months, so I don't want to let the buildfarm wait for a kernel fix.  These are
the two main options I'm seeing now:

(a) Modify the tests so the affected animals can skip affected tests by
setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.

(b) Remove --enable-tap-tests from affected animals.

Do you have a preference among those two or some other option that gets the
buildfarm green on a predictable schedule?  I somewhat prefer (a), since
--enable-tap-tests is where most of the interesting buildfarm reports happen
these days.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Thomas Munro

Дата:

24 января 2022 г., 02:49:16

On Mon, Jan 24, 2022 at 12:29 PM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote:
> > I'm less
> > sure it makes sense to do anything to support the presumed bogus
> > zeroes bug for (probably) no real users, especially before we've even
> > reported it and heard some analysis, for example acceptance that it's
> > broken and confirmation that this really is just a sparc problem.
>
> Got it.  I've already done a bad thing leaving the buildfarm broken for three
> months, so I don't want to let the buildfarm wait for a kernel fix.  These are
> the two main options I'm seeing now:
>
> (a) Modify the tests so the affected animals can skip affected tests by
> setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
>
> (b) Remove --enable-tap-tests from affected animals.
>
> Do you have a preference among those two or some other option that gets the
> buildfarm green on a predictable schedule?  I somewhat prefer (a), since
> --enable-tap-tests is where most of the interesting buildfarm reports happen
> these days.

Trying out a new idea: what if we could tell the buildfarm website
that a certain test is currently expected to fail for reasons we can't
fix yet (configuration change needed but owner not responding, or
bugfix from another project needed, etc)?  That could cause it to be
displayed in a different shade of green, or grey, or whatever?  Other
kinds of failures would still show as red.  Perhaps this would be
configured with a file in a git repo that any committer can push to.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

24 января 2022 г., 03:19:14

On Mon, Jan 24, 2022 at 12:49:16PM +1300, Thomas Munro wrote:
> On Mon, Jan 24, 2022 at 12:29 PM Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote:
> > > I'm less
> > > sure it makes sense to do anything to support the presumed bogus
> > > zeroes bug for (probably) no real users, especially before we've even
> > > reported it and heard some analysis, for example acceptance that it's
> > > broken and confirmation that this really is just a sparc problem.
> >
> > Got it.  I've already done a bad thing leaving the buildfarm broken for three
> > months, so I don't want to let the buildfarm wait for a kernel fix.  These are
> > the two main options I'm seeing now:
> >
> > (a) Modify the tests so the affected animals can skip affected tests by
> > setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
> >
> > (b) Remove --enable-tap-tests from affected animals.
> >
> > Do you have a preference among those two or some other option that gets the
> > buildfarm green on a predictable schedule?  I somewhat prefer (a), since
> > --enable-tap-tests is where most of the interesting buildfarm reports happen
> > these days.
> 
> Trying out a new idea: what if we could tell the buildfarm website
> that a certain test is currently expected to fail for reasons we can't
> fix yet (configuration change needed but owner not responding, or
> bugfix from another project needed, etc)?  That could cause it to be
> displayed in a different shade of green, or grey, or whatever?  Other
> kinds of failures would still show as red.  Perhaps this would be
> configured with a file in a git repo that any committer can push to.

That would be a better capability to use if we had it, agreed.  Is it feasible
to acquire that capability soon enough?

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Tom Lane

Дата:

24 января 2022 г., 03:28:47

Noah Misch <noah@leadboat.com> writes:
> On Mon, Jan 24, 2022 at 12:49:16PM +1300, Thomas Munro wrote:
>> Trying out a new idea: what if we could tell the buildfarm website
>> that a certain test is currently expected to fail for reasons we can't
>> fix yet (configuration change needed but owner not responding, or
>> bugfix from another project needed, etc)?  That could cause it to be
>> displayed in a different shade of green, or grey, or whatever?  Other
>> kinds of failures would still show as red.  Perhaps this would be
>> configured with a file in a git repo that any committer can push to.

> That would be a better capability to use if we had it, agreed.  Is it feasible
> to acquire that capability soon enough?

It's not merely a website issue: you'd really rather that the
buildfarm animal runs the rest of the tests rather than going belly-up
after an expected failure.  I think your suggestion about skipping
problematic tests based on an environment variable is more practical
in the near term.  We already have some cases like that, too, eg in
src/bin/psql/t/010_tab_completion.pl.

In the long term I could get behind having some less ad-hoc way
of skipping tests, but I don't think we can have that quickly.

            regards, tom lane

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Andres Freund

Дата:

24 января 2022 г., 04:03:04


On January 23, 2022 3:29:27 PM PST
>(a) Modify the tests so the affected animals can skip affected tests by
>setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.

Why not just detect the problem in the tap test and skip, rather than requiring multiple buildfarm configs to be
changedas well as the test itself?  
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

24 января 2022 г., 04:17:59

On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote:
> On January 23, 2022 3:29:27 PM PST
> >(a) Modify the tests so the affected animals can skip affected tests by
> >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
> 
> Why not just detect the problem in the tap test and skip, rather than requiring multiple buildfarm configs to be
changedas well as the test itself? 

End users running PostgreSQL test suites to acceptance-test their stack should
consider the affected stack unusable for PostgreSQL.  Hence, I ruled out that
approach, despite having implemented it at one point.  Under some plausible
set of goals, it is optimal.

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Andres Freund

Дата:

24 января 2022 г., 04:40:54

Hi,

On 2022-01-23 17:17:59 -0800, Noah Misch wrote:
> On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote:
> > On January 23, 2022 3:29:27 PM PST
> > >(a) Modify the tests so the affected animals can skip affected tests by
> > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
> > 
> > Why not just detect the problem in the tap test and skip, rather than requiring multiple buildfarm configs to be
changedas well as the test itself? 

> 
> End users running PostgreSQL test suites to acceptance-test their stack should
> consider the affected stack unusable for PostgreSQL.

I'd bet that that's zero users ;)

> Hence, I ruled out that
> approach, despite having implemented it at one point.  Under some plausible
> set of goals, it is optimal.

It's not perfect due to the way we run our tests (seeing output is hard, it's
not aggregated), but marking the test as todo rather than SKIP seems like the
most appropriate test status. It's known to be a problem, we've not fixed it,
but we want to be able to run the tests.

Test::more's description: "If it's something the programmer hasn't done yet,
use TODO. This is for any code you haven't written yet, or bugs you have yet
to fix, but want to put tests in your testing script (always a good idea)."

Greetings,

Andres Freund

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

24 января 2022 г., 05:10:07

On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote:
> On 2022-01-23 17:17:59 -0800, Noah Misch wrote:
> > On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote:
> > > On January 23, 2022 3:29:27 PM PST
> > > >(a) Modify the tests so the affected animals can skip affected tests by
> > > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
> > > 
> > > Why not just detect the problem in the tap test and skip, rather than requiring multiple buildfarm configs to be
changedas well as the test itself? 
 
> > 
> > End users running PostgreSQL test suites to acceptance-test their stack should
> > consider the affected stack unusable for PostgreSQL.
> 
> I'd bet that that's zero users ;)

Wouldn't surprise me.  I'm attaching what I had written and discarded.  If
nobody else hates it, I can live with it.

> > Hence, I ruled out that
> > approach, despite having implemented it at one point.  Under some plausible
> > set of goals, it is optimal.
> 
> It's not perfect due to the way we run our tests (seeing output is hard, it's
> not aggregated), but marking the test as todo rather than SKIP seems like the
> most appropriate test status. It's known to be a problem, we've not fixed it,
> but we want to be able to run the tests.
> 
> Test::more's description: "If it's something the programmer hasn't done yet,
> use TODO. This is for any code you haven't written yet, or bugs you have yet
> to fix, but want to put tests in your testing script (always a good idea)."

Could do that.  Every run that doesn't get the flaky failure will print a
message like "TODO passed:  3-5", though the test file could mitigate that by
declaring the TODO only on configurations where we expect a failure.  The
027_stream_regress.pl trouble involves reaching a die(), not failing a test in
this sense, so that one would take more work.

Вложения

skip-sparc-wal-read-bug-v1.patch

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Michael Paquier

Дата:

24 января 2022 г., 05:20:21

On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote:
> Could do that.  Every run that doesn't get the flaky failure will print a
> message like "TODO passed:  3-5", though the test file could mitigate that by
> declaring the TODO only on configurations where we expect a failure.  The
> 027_stream_regress.pl trouble involves reaching a die(), not failing a test in
> this sense, so that one would take more work.

Using a TODO has the advantage to allow the tests to run on a periodic
basis, even if they could fail in this unexpected way.  We've had our
load of issues in the past proper to specific architectures, so this
could help with the 2PC code paths and SPARC.  Not to mention that we
could just let the code be as-is, and it would fix itself once the
kernel is updated.  So that sounds much better to me than a skip
phase.
--
Michael

Вложения

signature.asc

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Tom Lane

Дата:

24 января 2022 г., 05:25:04

Michael Paquier <michael@paquier.xyz> writes:
> On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote:
>> Could do that.  Every run that doesn't get the flaky failure will print a
>> message like "TODO passed:  3-5", though the test file could mitigate that by
>> declaring the TODO only on configurations where we expect a failure.  The
>> 027_stream_regress.pl trouble involves reaching a die(), not failing a test in
>> this sense, so that one would take more work.

> Using a TODO has the advantage to allow the tests to run on a periodic
> basis, even if they could fail in this unexpected way.

I'm okay with this *if* the TODO marking can be constrained to platforms
where we know there's a problem.  Otherwise I'm afraid it will mask
unexpected problems.

            regards, tom lane

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Andres Freund

Дата:

24 января 2022 г., 05:34:32

Hi,

On 2022-01-23 18:10:07 -0800, Noah Misch wrote:
> On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote:
> > Test::more's description: "If it's something the programmer hasn't done yet,
> > use TODO. This is for any code you haven't written yet, or bugs you have yet
> > to fix, but want to put tests in your testing script (always a good idea)."
>
> Could do that.  Every run that doesn't get the flaky failure will print a
> message like "TODO passed:  3-5", though the test file could mitigate that by
> declaring the TODO only on configurations where we expect a failure.

Yea, that's what I was thinking we'd do.


> The
> 027_stream_regress.pl trouble involves reaching a die(), not failing a test in
> this sense, so that one would take more work.

I'm really not a perl person... But my understanding is that todo_skip() would
address this? I.e. something like

TODO:
{
   $todo_skip "linux/sparc has unaddressed problems with partial page  overwrites"
     if ($^O eq 'linux' and $Config{archname'} ~= 'sparc');

   ok(whatever is broken);
}

(no idea if the above todo condition even approximates something working)

Greetings,

Andres Freund

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Andres Freund

Дата:

24 января 2022 г., 05:38:34

On 2022-01-23 21:25:04 -0500, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote:
> >> Could do that.  Every run that doesn't get the flaky failure will print a
> >> message like "TODO passed:  3-5", though the test file could mitigate that by
> >> declaring the TODO only on configurations where we expect a failure.  The
> >> 027_stream_regress.pl trouble involves reaching a die(), not failing a test in
> >> this sense, so that one would take more work.
> 
> > Using a TODO has the advantage to allow the tests to run on a periodic
> > basis, even if they could fail in this unexpected way.
> 
> I'm okay with this *if* the TODO marking can be constrained to platforms
> where we know there's a problem.  Otherwise I'm afraid it will mask
> unexpected problems.

Yep. Very weird syntax... TODO blocks only take effect if a local $TODO is set
(to a text explaining the reason hopefully) or if there's a todo_skip. So
something like

TODO:
{
   local $TODO = 'linux + sparc is borked' if $^O eq 'linux' and ...;

   tests...
}

would run the tests and accept failures if the if matches, and otherwise not.

Greetings,

Andres Freund

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

24 января 2022 г., 11:02:43

On Sun, Jan 23, 2022 at 06:34:32PM -0800, Andres Freund wrote:
> On 2022-01-23 18:10:07 -0800, Noah Misch wrote:
> > On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote:
> > > Test::more's description: "If it's something the programmer hasn't done yet,
> > > use TODO. This is for any code you haven't written yet, or bugs you have yet
> > > to fix, but want to put tests in your testing script (always a good idea)."
> >
> > Could do that.  Every run that doesn't get the flaky failure will print a
> > message like "TODO passed:  3-5", though the test file could mitigate that by
> > declaring the TODO only on configurations where we expect a failure.
> 
> Yea, that's what I was thinking we'd do.
> 
> > The
> > 027_stream_regress.pl trouble involves reaching a die(), not failing a test in
> > this sense, so that one would take more work.
> 
> I'm really not a perl person... But my understanding is that todo_skip() would
> address this? I.e. something like
> 
> TODO:
> {
>    $todo_skip "linux/sparc has unaddressed problems with partial page  overwrites"
>      if ($^O eq 'linux' and $Config{archname'} ~= 'sparc');
> 
>    ok(whatever is broken);
> }

Yes.  todo_skip() behaves much like regular skip().  The enclosed tests don't
run.  Hence, it prevents die() and BAIL_OUT() failures.  $TODO is a different
beast; tests still run, and it changes the reporting.  For 003_cic_2pc.pl, I'm
fine using $TODO so we continue to run all test commands and quietly log their
results.  For 027_stream_regress.pl, which would need deep changes to use
$TODO, it works to use any of todo_skip, skip, or skip_all.  I prefer
skip_all, because it prints the skip reason to gmake's stdout.  (If the number
of affected users is zero as theorized, the choice doesn't matter.)  Any
objections?  Here's the appearance of each strategy on gmake's stdout:

=== todo_skip
[23:56:45] t/003_cic_2pc.pl .. ok       60 ms ( 0.00 usr  0.00 sys +  0.05 cusr  0.00 csys =  0.05 CPU)
[23:56:45]
All tests successful.
Files=1, Tests=5,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.05 cusr  0.00 csys =  0.06 CPU)
Result: PASS

=== skip
[23:55:47] t/003_cic_2pc.pl .. ok       59 ms ( 0.00 usr  0.00 sys +  0.05 cusr  0.00 csys =  0.05 CPU)
[23:55:48]
All tests successful.
Files=1, Tests=5,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.05 cusr  0.00 csys =  0.06 CPU)
Result: PASS

=== skip_all
[23:31:04] t/003_cic_2pc.pl .. skipped: filesystem bug
[23:31:04]
Files=1, Tests=0,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.05 cusr  0.00 csys =  0.07 CPU)
Result: NOTESTS

=== $TODO, test 1 is expected fail, tests 2-5 are unexpected pass
[23:32:32] t/003_cic_2pc.pl .. ok     1371 ms ( 0.01 usr  0.00 sys +  0.51 cusr  0.24 csys =  0.76 CPU)
[23:32:33]
All tests successful.

Test Summary Report
-------------------
t/003_cic_2pc.pl (Wstat: 0 Tests: 5 Failed: 0)
  TODO passed:   2-5
Files=1, Tests=5,  1 wallclock secs ( 0.02 usr  0.00 sys +  0.51 cusr  0.24 csys =  0.77 CPU)
Result: PASS

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

02 февраля 2022 г., 08:55:56

On Mon, Jan 24, 2022 at 12:02:43AM -0800, Noah Misch wrote:
> For 003_cic_2pc.pl, I'm
> fine using $TODO so we continue to run all test commands and quietly log their
> results.  For 027_stream_regress.pl, which would need deep changes to use
> $TODO, it works to use any of todo_skip, skip, or skip_all.  I prefer
> skip_all, because it prints the skip reason to gmake's stdout.  (If the number
> of affected users is zero as theorized, the choice doesn't matter.)  Any
> objections?

I pushed that on 2022-01-26 as ce6d793, but it wasn't enough:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2022-02-01%2013%3A29%3A58

In older Test::More versions, $TODO only works if tests know the right package
in which to examine $that_pkg::TODO.  In PostgreSQL v12+, all relevant
functions set $Test::Builder::Level, so Test::More finds $main::TODO as
intended.  This would be one way to fix v11:

--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1869,4 +1869,6 @@ command_ok(...)
 sub command_checks_all
 {
+    local $Test::Builder::Level = $Test::Builder::Level + 1;
+
     my $self = shift;
 
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 0c56156..2ab78b4 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -598,4 +598,6 @@ sub command_fails_like
 sub command_checks_all
 {
+    local $Test::Builder::Level = $Test::Builder::Level + 1;
+
     my ($cmd, $expected_ret, $out, $err, $test_name) = @_;

However, I'm inclined to use a less-fragile way, in all branches:

--- a/src/bin/pgbench/t/023_cic_2pc.pl
+++ b/src/bin/pgbench/t/023_cic_2pc.pl
@@ -14 +14,2 @@ use Test::More tests => 6;
-local $TODO = 'filesystem bug' if TestLib::has_wal_read_bug;
+Test::More->builder->todo_start('filesystem bug')
+  if TestLib::has_wal_read_bug;

None of this matters under newer Test::More.  kittiwake's Test::More 1.302133
is new enough not to care, but tadarida's Test::More 1.001014 is not.  The
Test::More->builder->todo_start() syntax works in 0.82, if not earlier.
(PostgreSQL v10 requires Test::More 0.87.)

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

От

Noah Misch

Дата:

04 марта 2022 г., 04:41:39

On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote:
> My next steps:
> 
> - Report a Debian bug for the sparc64+ext4 zeros problem.

Reported to Debian, then upstream:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006157
https://marc.info/?t=164539269900001

Last week, someone confirmed the behavior on kernel 5.17.0-rc5.  No other news.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: XLogReadRecord() error in XlogReadTwoPhaseData()

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения