At Wed, 28 Jul 2021 11:38:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> This test is still unstable :-(
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-20%2012%3A46%3A11
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-20%2015%3A05%3A39
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
>
> These all look like
>
> # Failed test 'check that segments have been removed'
> # at t/019_replslot_limit.pl line 226.
> # got: '000000010000000000000020'
> # expected: '000000010000000000000024'
> # Looks like you failed 1 test of 16.
>
> with varying values mentioned. It looks to me like WAL file cleanup
> is not as synchronous with slot creation as the test thinks.
> Maybe it needs to loop until the oldest WAL file matches what it
> expects?
Sorry for the kludge.
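A wait loop along the lines Tom suggests could use the TAP helper
poll_query_until (a sketch only; $expected_seg stands for the segment
name the test already computes via pg_walfile_name):

    # Sketch: poll until the oldest segment remaining in pg_wal is the
    # expected one, rather than assuming removal is synchronous.
    # $expected_seg is a placeholder for the test's computed name.
    $node_primary->poll_query_until('postgres',
        qq[SELECT min(pg_ls_dir) = '$expected_seg'
           FROM pg_ls_dir('pg_wal')
           WHERE pg_ls_dir ~ '^[0-9A-F]{24}\$'], 't')
      or die "timed out waiting for WAL segment removal";

But the log suggests the problem here is a different one.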
Mmm. In the failure cases, the directory scan (@16:52:22.036) runs before
the targeted checkpoint completes (@16:52:22.144).
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
16:52:17.328 LOG: checkpoint starting: wal
16:52:19.140 LOG: invalidating slot "rep1" because its restart_lsn 0/1D00000 exceeds max_slot_wal_keep_size
16:52:19.316 019_replslot_limit.pl LOG: statement: SELECT pg_walfile_name(lsn) FROM pg_create_physical_replication_slot('s2',true)
16:52:22.036 019_replslot_limit.pl LOG: statement: SELECT pg_ls_dir AS f FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$' ORDER BY 1 LIMIT
16:52:22.077 019_replslot_limit.pl LOG: statement: SELECT pg_drop_replication_slot('s2')
16:52:22.144 LOG: checkpoint complete: wrote 18 buffers (14.1%); 0 WAL file(s) added, 4 removed, 3 recycled; write=1.806 s, sync=0.001 s, total=4.817 s; sync files=0, longest=0.000 s, average=0.000 s; distance=3072 kB, estimate=3072 kB
The reason is that the previous checkpoint, started while the test was
advancing segments, completed only later:
> my $logstart = get_log_size($node_primary);
> advance_wal($node_primary, 7); !!!! another checkpoint runs/ends
...
16:52:19.140 # check for "invalidate slots" in log.
# check for the "lost" state in pg_replication_slots.
!! check for "checkpoint complete" in log.
16:52:22.077 # read redo segment and oldest wal.
16:52:22.144 !! The target checkpoint ends.
So it seems to me we need to explicitly prevent unexpected checkpoints
from happening, perhaps by enlarging max_wal_size temporarily.
I'll go that way.
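A minimal sketch of that approach, assuming the test's existing
$node_primary handle (max_wal_size is reloadable, so no restart is
needed; the exact value is a guess):

    # Sketch: raise max_wal_size so that advance_wal() cannot trigger
    # a checkpoint of its own; only the checkpoints the test issues
    # explicitly will run.
    $node_primary->safe_psql('postgres',
        "ALTER SYSTEM SET max_wal_size = '1GB'");
    $node_primary->reload;

    advance_wal($node_primary, 7);

    # ... slot invalidation and segment-removal checks go here ...

    # Restore the default once the checks are done.
    $node_primary->safe_psql('postgres',
        "ALTER SYSTEM RESET max_wal_size");
    $node_primary->reload;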
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center