Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Дата
Msg-id d975bb14-2c90-25bf-7d71-7ce5f9d7fa82@gmail.com
обсуждение исходный текст
Ответ на Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed  (Alexander Lakhin <exclusion@gmail.com>)
Ответы Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Список pgsql-hackers
Hello,

15.01.2024 12:00, Alexander Lakhin wrote:
> If this approach looks promising to you, maybe we could add a submodule to
> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> 019_replslot_limit) as well.
>
> Personally I think that having such a functionality for using in tests
> might be useful not only to avoid some "problematic" behaviour but also to
> test the opposite cases.

After spending a few days on it, I've discovered two more issues:
https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com

(The latter is not related to bgwriter directly, but it was discovered
thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)

So it becomes clear that the 035 test is not the only one, which might
suffer from bgwriter's activity, and inventing a way to stop bgwriter/
control it is a different subject, getting out of scope of the current
issue.


15.01.2024 11:49, Bertrand Drouvot wrote:
> We did a few things in this thread, so to sum up what we've discovered:
>
> - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
> ( proposal in in v6 attached)
> - we need a way to control how frequent xl_running_xacts are emmitted (to ensure
> they are not triggered in a middle of an active slot invalidation test).
>
> I'm not sure it's possible to address Tom's concern and keep the test "predictable".
>
> So, I think I'd vote for Michael's proposal to implement a superuser-settable
> developer GUC (as sending a SIGSTOP on the bgwriter (and bypass $windows_os) would
> still not address Tom's concern anyway).
>
> Another option would be to "sacrifice" the full predictablity of the test (in
> favor of real-world behavior testing)?
>
> [1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal

So, now we have the test 035 failing due to nondeterministic vacuum
activity in the first place, and due to bgwriter's activity in the second.
Maybe it would be a right move to commit the fix, and then think about
more rare failures.

Though I have a couple of question regarding the fix left, if you don't
mind:
1) The test has minor defects in the comments, that I noted before [1];
would you like to fix them in passing?

> BTW, it looks like the comment:
> # One way to produce recovery conflict is to create/drop a relation and
> # launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
> in scenario 3 is a copy-paste error.
> Also, there are two "Scenario 4" in this test.
>

2) Shall we align the 035_standby_logical_decoding with
031_recovery_conflict in regard to improving stability of vacuum?
I see the following options for this:
a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
b) use FREEZE and autovacuum = off in both tests;
c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
  autovacuum = off in both.

I've re-tested the v6 patch today and confirmed that it makes the test
more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
parallel and got failures ('inactiveslot slot invalidation is logged with
vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
(With unlimited CPU, when average test duration is around 70 seconds.)

But with v6 applied, 60 iterations succeeded.

[1] https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010@gmail.com

Best regards,
Alexander



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Shubham Khanna
Дата:
Сообщение: Re: Add 64-bit XIDs into PostgreSQL 15
Следующее
От: shveta malik
Дата:
Сообщение: Re: Synchronizing slots from primary to standby