Re: A failure in t/038_save_logical_slots_shutdown.pl
От | vignesh C |
---|---|
Тема | Re: A failure in t/038_save_logical_slots_shutdown.pl |
Дата | |
Msg-id | CALDaNm1y+yJioqwVUJzoYnhZX33=f1Qp0MLPR54Ni3ff9ACEWQ@mail.gmail.com обсуждение исходный текст |
Ответ на | A failure in t/038_save_logical_slots_shutdown.pl (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Ответы |
Re: A failure in t/038_save_logical_slots_shutdown.pl
(Amit Kapila <amit.kapila16@gmail.com>)
Re: A failure in t/038_save_logical_slots_shutdown.pl (vignesh C <vignesh21@gmail.com>) |
Список | pgsql-hackers |
On Wed, 10 Jan 2024 at 14:08, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi, > > I've been observing a failure in t/038_save_logical_slots_shutdown.pl > of late on my developer system: > > t/038_save_logical_slots_shutdown.pl .. 1/? > # Failed test 'Check that the slot's confirmed_flush LSN is the same > as the latest_checkpoint location' > # at t/038_save_logical_slots_shutdown.pl line 35. > # Looks like you failed 1 test of 2. > t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1 > (wstat 256, 0x100) > Failed 1/2 subtests > > I did a quick analysis of the failure and commit > https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0 > that introduced this test. I think the issue is that the slot's > confirmed_flush LSN (0/1508000) and shutdown checkpoint LSN > (0/1508018) are not the same: > > tmp_check/log/038_save_logical_slots_shutdown_pub.log: > > 2024-01-10 07:55:44.539 UTC [57621] sub LOG: starting logical > decoding for slot "sub" > 2024-01-10 07:55:44.539 UTC [57621] sub DETAIL: Streaming > transactions committing after 0/1508000, reading WAL from 0/1507FC8. > 2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT: START_REPLICATION > SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any', > publication_names '"pub"') > > ubuntu:~/postgres$ pg17/bin/pg_controldata -D > src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/ > Database cluster state: in production > pg_control last modified: Wed Jan 10 07:55:44 2024 > Latest checkpoint location: 0/1508018 > Latest checkpoint's REDO location: 0/1508018 > > But the tests added by t/038_save_logical_slots_shutdown.pl expects > both LSNs to be same: > > sub compare_confirmed_flush > { > # Is it same as the value read from log? > ok( $latest_checkpoint eq $confirmed_flush_from_log, > "Check that the slot's confirmed_flush LSN is the same as the > latest_checkpoint location" > ); > > I suspect that it's quite not right to expect the slot's > confirmed_flush and latest checkpoint location to be same in the test. > This is because the shutdown checkpoint gets an LSN that's greater > than the slot's confirmed_flush LSN - see the shutdown checkpoint > record getting inserted into WAL after the slot is marked dirty in > CheckPointReplicationSlots(). > > With this analysis in mind, I think the tests need to do something > like the following: > > diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl > b/src/test/recovery/t/038_save_logical_slots_shut > down.pl > index 5a4f5dc1d4..d49e6014fc 100644 > --- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl > +++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl > @@ -32,7 +32,7 @@ sub compare_confirmed_flush > unless defined($latest_checkpoint); > > # Is it same as the value read from log? > - ok( $latest_checkpoint eq $confirmed_flush_from_log, > + ok( $latest_checkpoint ge $confirmed_flush_from_log, > "Check that the slot's confirmed_flush LSN is the same > as the latest_checkpoint location" > ); > > Thoughts? I got the log files from Bharath offline. Thanks Bharath for sharing the log files offline. The WAL record sequence is exactly the same in the failing test and tests which are passing. One observation in our case the confirmed flush lsn points exactly to shutdown checkpoint, but in the failing test the lsn pointed is invalid, pg_waldump says that address is invalid and skips about 24 bytes and then sees a valid record Passing case confirm flush lsn(0/150D158) from my machine: pg_waldump 000000010000000000000001 -s 0/150D158 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/0150D158, prev 0/0150D120, desc: CHECKPOINT_SHUTDOWN redo 0/150D158; tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0; oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown Failing case confirm flush lsn( 0/1508000) from failing tests log file: pg_waldump 000000010000000000000001 -s 0/1508000 pg_waldump: first record is after 0/1508000, at 0/1508018, skipping over 24 bytes rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01508018, prev 0/01507FC8, desc: CHECKPOINT_SHUTDOWN redo 0/1508018; tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0; oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown I'm still not sure why in this case, it is not exactly pointing to a valid WAL record, it has to skip 24 bytes to find the valid checkpoint shutdown record. I will investigate this further and share the analysis. Regards, Vignesh
В списке pgsql-hackers по дате отправления: