Re: Build-farm - intermittent error in 031_column_list.pl

From Amit Kapila
Subject Re: Build-farm - intermittent error in 031_column_list.pl
Msg-id CAA4eK1Lc=NDV1HrY2gNasFK90MtysnA575a+rd0p+POjXN+Spw@mail.gmail.com
In response to Re: Build-farm - intermittent error in 031_column_list.pl  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Build-farm - intermittent error in 031_column_list.pl  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
> > Hi hackers.
> >
> > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> >
> > The error happened during "subscriptionCheck" phase in the TAP test
> > t/031_column_list.pl
> > This test file was added by this [2] commit.
> > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
>
> What is happening in all of them looks like this: the name of a
> publication created by CREATE PUBLICATION, with no failure reported,
> is missing for a walsender that came later. It seems that CREATE
> PUBLICATION can silently fail to create a publication, or the
> walsender somehow failed to find the existing one.
>

Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?

>
> > ~~
> >
>
> 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG:  disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]
>
> "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> the session has been disconnected. But the following request for the
> same publication fails due to the absence of the publication.
>
> 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
> 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  publication "pub9" does not exist
>

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION restarts the apply worker
(which we can see in the logs as well), and after the restart the
apply worker uses the existing slot and replication origin
corresponding to the subscription. It is possible that the origin had
not been updated before the restart, so the WAL start location points
to a position before PUBLICATION pub9 exists, which can lead to such
an error. Once this error occurs, the apply worker can never proceed
and will always return the same error. Does this make sense?
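
To make the theory above concrete, here is a toy model in Python. It is
not PostgreSQL code: WalRecord, publications_visible_at, decode and the
LSN values are all hypothetical, and it assumes the walsender checks the
publication against the catalog state as of each decoded change. Under
that assumption, starting before the CREATE PUBLICATION record fails,
and retrying from the same start location fails the same way.

```python
from dataclasses import dataclass


class PublicationMissing(Exception):
    """Stands in for: ERROR: publication "..." does not exist."""


@dataclass(frozen=True)
class WalRecord:
    lsn: int       # toy LSN, just an increasing integer
    kind: str      # "create_publication" or "change"
    payload: str   # publication name or table name


def publications_visible_at(wal, lsn):
    """Publications created at or before lsn (a toy historic catalog view)."""
    return {r.payload for r in wal
            if r.kind == "create_publication" and r.lsn <= lsn}


def decode(wal, start_lsn, requested_pub):
    """Replay changes from start_lsn, checking the publication per change."""
    sent = []
    for rec in wal:
        if rec.lsn < start_lsn or rec.kind != "change":
            continue
        if requested_pub not in publications_visible_at(wal, rec.lsn):
            raise PublicationMissing(
                f'publication "{requested_pub}" does not exist')
        sent.append(rec.lsn)
    return sent


# Hypothetical WAL: one change precedes CREATE PUBLICATION pub9.
WAL = [
    WalRecord(100, "change", "test_part_d"),
    WalRecord(200, "create_publication", "pub9"),
    WalRecord(300, "change", "test_part_d"),
]
```

In this model, decode(WAL, 250, "pub9") succeeds, while
decode(WAL, 100, "pub9") raises on every attempt, because the start
location never moves past the failing record.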

Unless you or others have a different theory, this seems to be an
existing problem in logical replication that this test exposes. If we
just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

Note: I have added Tomas to get his views, as he committed this test.

-- 
With Regards,
Amit Kapila.


