RE: Skipping logical replication transactions on subscriber side

Поиск
Список
Период
Сортировка
От houzj.fnst@fujitsu.com
Тема RE: Skipping logical replication transactions on subscriber side
Дата
Msg-id OS0PR01MB5716D985D4F2430E817CE30B94689@OS0PR01MB5716.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: Skipping logical replication transactions on subscriber side  (Masahiko Sawada <sawada.mshk@gmail.com>)
Ответы Re: Skipping logical replication transactions on subscriber side  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Tues, Nov 30, 2021 9:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Nov 30, 2021 at 8:41 PM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > On Tue, Nov 30, 2021 at 6:28 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Mon, Nov 29, 2021 at 11:38 AM vignesh C <vignesh21@gmail.com>
> wrote:
> > > >
> > >
> > > I have pushed this patch and there is a buildfarm failure for it. See:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&d
> > > t=2021-11-30%2005%3A05%3A25
> > >
> > > Sawada-San has shared his initial analysis on pgsql-committers [1]
> > > and I am responding here as the fix requires some more discussion.
> > >
> > > > Looking at the result the test actually got, we had two error
> > > > entries for test_tab1 instead of one:
> > > >
> > > > #   Failed test 'check the error reported by the apply worker'
> > > > #   at t/026_worker_stats.pl line 33.
> > > > #          got: 'tap_sub|INSERT|test_tab1|t
> > > > # tap_sub||test_tab1|t'
> > > > #     expected: 'tap_sub|INSERT|test_tab1|t'
> > > >
> > > > The possible scenarios are:
> > > >
> > > > The table sync worker for test_tab1 failed due to an error
> > > > unrelated to apply changes:
> > > >
> > > > 2021-11-30 06:24:02.137 CET [18990:2] ERROR:  replication origin
> > > > with OID 2 is already active for PID 23706
> > > >
> > > > At this time, the view had one error entry for the table sync worker.
> > > > After retrying table sync, it succeeded:
> > > >
> > > > 2021-11-30 06:24:04.202 CET [28117:2] LOG:  logical replication
> > > > table synchronization worker for subscription "tap_sub", table
> "test_tab1"
> > > > has finished
> > > >
> > > > Then after inserting a row on the publisher, the apply worker
> > > > inserted the row but failed due to violating a unique key
> > > > violation, which is
> > > > expected:
> > > >
> > > > 2021-11-30 06:24:04.307 CET [4806:2] ERROR:  duplicate key value
> > > > violates unique constraint "test_tab1_pkey"
> > > > 2021-11-30 06:24:04.307 CET [4806:3] DETAIL:  Key (a)=(1) already exists.
> > > > 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT:  processing remote
> > > > data during "INSERT" for replication target relation
> > > > "public.test_tab1" in transaction 721 at 2021-11-30
> > > > 06:24:04.305096+01
> > > >
> > > > As a result, we had two error entries for test_tab1: the table
> > > > sync worker error and the apply worker error. I didn't expect that
> > > > the table sync worker for test_tab1 failed due to "replication
> > > > origin with OID 2 is already active for PID 23706” error.
> > > >
> > > > Looking at test_subscription_error() in 026_worker_stats.pl, we
> > > > have two checks; in the first check, we wait for the view to show
> > > > the error entry for the given relation name and xid. This check
> > > > was passed since we had the second error (i.g., apply worker
> > > > error). In the second check, we get error entries from
> > > > pg_stat_subscription_workers by specifying only the relation name.
> > > > Therefore, we ended up getting two entries and failed the tests.
> > > >
> > > > To fix this issue, I think that in the second check, we can get
> > > > the error from pg_stat_subscription_workers by specifying the
> > > > relation name *and* xid like the first check does. I've attached the patch.
> > > > What do you think?
> > > >
> > >
> > > I think this will fix the reported failure but there is another race
> > > condition in the test. Isn't it possible that for table test_tab2,
> > > we get an error "replication origin with OID ..." or some other
> > > error before copy, in that case also, we will proceed from the
> > > second call of test_subscription_error() which is not what we expect in the
> test?
> >
> > Right.
> >
> > > Shouldn't we someway check that the error message also starts with
> > > "duplicate key value violates ..."?
> >
> > Yeah, I think it's a good idea to make the checks more specific. That
> > is, probably we can specify the prefix of the error message and
> > subrelid in addition to the current conditions: relid and xid. That
> > way, we can check what error was reported by which workers (tablesync
> > or apply) for which relations. And both check queries in
> > test_subscription_error() can have the same WHERE clause.
> 
> I've attached a patch that fixes this issue. Please review it.
> 

I have a question about the testcase (I could be wrong here).

Is it possible that the race condition happen between apply worker(test_tab1)
and table sync worker(test_tab2) ? If so, it seems the error("replication
origin with OID") could happen randomly until we resolve the conflict.
Based on this, for the following code:
-----
    # Wait for the error statistics to be updated.
    my $check_sql = qq[SELECT count(1) > 0 ] . $part_sql;
    $node->poll_query_until(
    'postgres', $check_sql,
) or die "Timed out while waiting for statistics to be updated";

* [1] *

    $check_sql =
    qq[
SELECT subname, last_error_command, last_error_relid::regclass,
last_error_count > 0 ] . $part_sql;
    my $result = $node->safe_psql('postgres', $check_sql);
    is($result, $expected, $msg);
-----

Is it possible that the error("replication origin with OID") happen again at the
place [1]. In this case, the error message we have checked could be replaced by
another error("replication origin ...") and then the test fail ?

Best regards,
Hou zj

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: row filtering for logical replication
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: Skipping logical replication transactions on subscriber side