Re: XLogReadRecord() error in XlogReadTwoPhaseData()

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Дата
Msg-id 20211118070506.GA1294133@rfd.leadboat.com
обсуждение исходный текст
Ответ на Re: XLogReadRecord() error in XlogReadTwoPhaseData()  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: XLogReadRecord() error in XlogReadTwoPhaseData()  (Andrey Borodin <x4mmm@yandex-team.ru>)
Список pgsql-hackers
On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Each of the three failures happened on a sparc64 Debian+gcc machine.  I had
> > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > without reproducing this.

> #                   'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR:  could not read two-phase
statefrom WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
 
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24

Two others:
ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at
0/1668020
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52

> I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
> that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
> written out yet.  Have we got any synchronization around that?

If the WAL address isn't on disk, that error doesn't happen.  Instead,
read_local_xlog_page() blocks waiting for the WAL to become available.  It's
still possible that we make the WAL region exist, but it somehow doesn't
contain the right data until shortly later.  FinishPreparedTransaction() takes
TwoPhaseStateLock and looks for an entry having gxact->valid.  EndPrepare()
fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid
under TwoPhaseStateLock.  All freelist (freeGXacts) interaction holds
TwoPhaseStateLock.  I'm not seeing a gap in that synchronization.

I don't have a great theory, but here are candidates to examine next:

- Run with wal_debug=on to confirm logged write location matches read location.
- Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
  pg_waldump -s 0/01000000" at the end of the test.
- Dump WAL page binary image at the point of failure.
- Log which branches in XLogReadRecord() are taken.

What else might help?



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Smith
Дата:
Сообщение: Re: row filtering for logical replication
Следующее
От: "osumi.takamichi@fujitsu.com"
Дата:
Сообщение: RE: Optionally automatically disable logical replication subscriptions on error