Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

Поиск
Список
Период
Сортировка
От vignesh C
Тема Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Дата
Msg-id CALDaNm2=SjA_=e9bhYO0ChBm-Wdh-=HYF0RHsR+_2gnqwcyOkg@mail.gmail.com
обсуждение исходный текст
Ответ на Race condition in FetchTableStates() breaks synchronization of subscription tables  (Alexander Lakhin <exclusion@gmail.com>)
Ответы RE: Race condition in FetchTableStates() breaks synchronization of subscription tables  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Список pgsql-hackers
On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello hackers,
>
> After determining a possible cause for intermittent failures of the test
> subscription/031_column_list [1], I was wondering what makes another
> subscription test (014_binary) fail on the buildfarm:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39
>
> All those failures caused by a timeout when waiting for a message expected
> in _subscriber.log. For example, in the snakefly's case:
> [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type
> timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at
> /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.
>
> _subscriber.log contains:
> 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> 2023-12-21 01:17:46.756 UTC [409999] ERROR:  could not receive data from WAL stream: server closed the connection
> unexpectedly
>          This probably means the server terminated abnormally
>          before or while processing the request.
> 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical replication apply worker" (PID 409999) exited
with
> exit code 1
> 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply worker for subscription "tsub" has started
> ...
>
> While _subscriber.log from a successful test run contains:
> 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: session time: 0:00:00.014 user=postgres
> database=postgres host=[local]
> 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply worker for subscription "tsub" has started
> 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table synchronization worker for subscription "tsub",
> table "test_mismatching_types" has started
> 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data format
> 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY test_mismatching_types, line 1, column a
>
> In this case, "logical replication apply worker for subscription "tsub" has
> started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later.
>
> I've managed to reproduce this failure locally by running multiple tests in
> parallel, and my analysis shows that it is caused by a race condition when
> accessing variable table_states_valid inside tablesync.c.
>
> tablesync.c does the following with table_states_valid:
> /*
>   * Callback from syscache invalidation.
>   */
> void
> invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
> {
>      table_states_valid = false;
> }
> ...
> static bool
> FetchTableStates(bool *started_tx)
> {
> ...
>      if (!table_states_valid)
>      {
> ...
>          /* Fetch all non-ready tables. */
>          rstates = GetSubscriptionRelations(MySubscription->oid, true);
> ...
>          table_states_valid = true;
>      }
>
> So, when syscache invalidation occurs inside the code block
> "if (!table_states_valid)", that invalidation is effectively ignored.
>
> In a failed case I observe the following events:
> 1. logical replication apply worker performs
>   LogicalRepApplyLoop() -> process_syncing_tables() ->
>   process_syncing_tables_for_apply() -> FetchTableStates() periodically.
>
> 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
>   for SUBSCRIPTIONRELMAP, and that leads to calling
>   invalidate_syncing_table_states().
>
> 3. If the apply worker manages to fetch no non-ready tables in
>   FetchTableStates() and ignore "table_states_valid = false" from
>   invalidate_syncing_table_states(), then it just misses the invalidation
>   event, so it keeps working without noticing non-ready tables appeared as
>   the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips
>   a loop "foreach(lc, table_states_not_ready) ..." until some other event
>   occurs).
>
> pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my
> analysis -- without it, I need tens of iterations (with 50 tests running in
> parallel) to catch the failure, but with it, I get the failure on the first
> iteration.

Thanks for the analysis, I was able to reproduce the issue with the
steps you had shared. I agree with your analysis. I added some logs to
verify that the invalidation was getting missed.

I felt that this invalidation is getting ignored because we have used
a boolean variable here, how about changing it slightly so that
table_states_invalid gets incremented for every invalidation and then
decrementing table_states_invalid after getting the non-ready tables
like in the attached patch.  I was able to verify that the test passes
with the attached patch.

Regards,
Vignesh

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Zhijie Hou (Fujitsu)"
Дата:
Сообщение: RE: Synchronizing slots from primary to standby
Следующее
От: Andrew Dunstan
Дата:
Сообщение: Re: PG versus libxml2 2.12.x