Re: A failure in t/038_save_logical_slots_shutdown.pl

Поиск
Список
Период
Сортировка
От vignesh C
Тема Re: A failure in t/038_save_logical_slots_shutdown.pl
Дата
Msg-id CALDaNm1y+yJioqwVUJzoYnhZX33=f1Qp0MLPR54Ni3ff9ACEWQ@mail.gmail.com
обсуждение исходный текст
Ответ на A failure in t/038_save_logical_slots_shutdown.pl  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Ответы Re: A failure in t/038_save_logical_slots_shutdown.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Re: A failure in t/038_save_logical_slots_shutdown.pl  (vignesh C <vignesh21@gmail.com>)
Список pgsql-hackers
On Wed, 10 Jan 2024 at 14:08, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> of late on my developer system:
>
> t/038_save_logical_slots_shutdown.pl .. 1/?
> #   Failed test 'Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location'
> #   at t/038_save_logical_slots_shutdown.pl line 35.
> # Looks like you failed 1 test of 2.
> t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1
> (wstat 256, 0x100)
> Failed 1/2 subtests
>
> I did a quick analysis of the failure and commit
> https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0
> that introduced this test. I think the issue is that the slot's
> confirmed_flush LSN (0/1508000) and shutdown checkpoint LSN
> (0/1508018) are not the same:
>
> tmp_check/log/038_save_logical_slots_shutdown_pub.log:
>
> 2024-01-10 07:55:44.539 UTC [57621] sub LOG:  starting logical
> decoding for slot "sub"
> 2024-01-10 07:55:44.539 UTC [57621] sub DETAIL:  Streaming
> transactions committing after 0/1508000, reading WAL from 0/1507FC8.
> 2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT:  START_REPLICATION
> SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any',
> publication_names '"pub"')
>
> ubuntu:~/postgres$ pg17/bin/pg_controldata -D
> src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/
> Database cluster state:               in production
> pg_control last modified:             Wed Jan 10 07:55:44 2024
> Latest checkpoint location:           0/1508018
> Latest checkpoint's REDO location:    0/1508018
>
> But the tests added by t/038_save_logical_slots_shutdown.pl expects
> both LSNs to be same:
>
> sub compare_confirmed_flush
> {
>     # Is it same as the value read from log?
>     ok( $latest_checkpoint eq $confirmed_flush_from_log,
>         "Check that the slot's confirmed_flush LSN is the same as the
> latest_checkpoint location"
>     );
>
> I suspect that it's quite not right to expect the slot's
> confirmed_flush and latest checkpoint location to be same in the test.
> This is because the shutdown checkpoint gets an LSN that's greater
> than the slot's confirmed_flush LSN - see the shutdown checkpoint
> record getting inserted into WAL after the slot is marked dirty in
> CheckPointReplicationSlots().
>
> With this analysis in mind, I think the tests need to do something
> like the following:
>
> diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> b/src/test/recovery/t/038_save_logical_slots_shut
> down.pl
> index 5a4f5dc1d4..d49e6014fc 100644
> --- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> +++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> @@ -32,7 +32,7 @@ sub compare_confirmed_flush
>           unless defined($latest_checkpoint);
>
>         # Is it same as the value read from log?
> -       ok( $latest_checkpoint eq $confirmed_flush_from_log,
> +       ok( $latest_checkpoint ge $confirmed_flush_from_log,
>                 "Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location"
>         );
>
> Thoughts?

I got the log files from Bharath offline. Thanks Bharath for sharing
the log files offline.
The WAL record sequence is exactly the same in the failing test and
tests which are passing.
One observation in our case the confirmed flush lsn points exactly to
shutdown checkpoint, but in the failing test the lsn pointed is
invalid, pg_waldump says that address is invalid and skips about 24
bytes and then sees a valid record

Passing case confirm flush lsn(0/150D158) from my machine:
pg_waldump 000000010000000000000001 -s 0/150D158
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn:
0/0150D158, prev 0/0150D120, desc: CHECKPOINT_SHUTDOWN redo 0/150D158;
tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 0; shutdown

Failing case confirm flush lsn( 0/1508000) from failing tests log file:
pg_waldump 000000010000000000000001 -s 0/1508000
pg_waldump: first record is after 0/1508000, at 0/1508018, skipping
over 24 bytes
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn:
0/01508018, prev 0/01507FC8, desc: CHECKPOINT_SHUTDOWN redo 0/1508018;
tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 0; shutdown

I'm still not sure why in this case, it is not exactly pointing to a
valid WAL record, it has to skip 24 bytes to find the valid checkpoint
shutdown record.
I will investigate this further and share the analysis.

Regards,
Vignesh



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Matthias van de Meent
Дата:
Сообщение: Re: the s_lock_stuck on perform_spin_delay
Следующее
От: Magnus Hagander
Дата:
Сообщение: Re: System username in pg_stat_activity