Re: long-standing data loss bug in initial sync of logical replication

Поиск
Список
Период
Сортировка
От Nitin Motiani
Тема Re: long-standing data loss bug in initial sync of logical replication
Дата
Msg-id CAH5HC97tLvRfg7YV7vh8S7LkLgE=vmFz5CmLq=i-nRwO80AFzA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: long-standing data loss bug in initial sync of logical replication  (Nitin Motiani <nitinmotiani@google.com>)
Ответы Re: long-standing data loss bug in initial sync of logical replication
Список pgsql-hackers
On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I tested these scenarios, and as you expected, it throws an error for
> > the create publication case:
> > 2024-07-17 14:50:01.145 IST [481526] 481526  ERROR:  could not receive
> > data from WAL stream: ERROR:  publication "pub1" does not exist
> >         CONTEXT:  slot "sub1", output plugin "pgoutput", in the change
> > callback, associated LSN 0/1510CD8
> > 2024-07-17 14:50:01.147 IST [481450] 481450  LOG:  background worker
> > "logical replication apply worker" (PID 481526) exited with exit code
> > 1
> >
> > The steps for this process are as follows:
> > 1) Create tables in both the publisher and subscriber.
> > 2) On the publisher: Create a replication slot.
> > 3) On the subscriber: Create a subscription using the slot created by
> > the publisher.
> > 4) On the publisher:
> > 4.a) Session 1: BEGIN; INSERT INTO T1;
> > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
> > 4.c) Session 1: COMMIT;
> >
> > Since we are throwing out a "publication does not exist" error, there
> > is no inconsistency issue here.
> >
> > However, an issue persists with DROP ALL TABLES publication, where
> > data continues to replicate even after the publication is dropped.
> > This happens because the open transaction consumes the invalidation,
> > causing the publications to be revalidated using old snapshot. As a
> > result, both the open transactions and the subsequent transactions are
> > getting replicated.
> >
> > We can reproduce this issue by following these steps in a logical
> > replication setup with an "ALL TABLES" publication:
> > On the publisher:
> > Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
> > In another session on the publisher:
> > Session 2: DROP PUBLICATION
> > Back in Session 1 on the publisher:
> > COMMIT;
> > Finally, in Session 1 on the publisher:
> > INSERT INTO T1 VALUES (val2);
> >
> > Even after dropping the publication, both val1 and val2 are still
> > being replicated to the subscriber. This means that both the
> > in-progress concurrent transaction and the subsequent transactions are
> > being replicated.
> >
>
> Hi,
>
> I tried the 'DROP PUBLICATION' command even for a publication with a
> single table. And there also the data continues to get replicated.
>
> To test this, I did a similar experiment as the above but instead of
> creating publication on all tables, I did it for one specific table.
>
> Here are the steps :
> 1. Create table test_1 and test_2 on both the publisher and subscriber
> instances.
> 2. Create publication p for table test_1 on the publisher.
> 3. Create a subscription s which subscribes to p.
> 4. On the publisher
> 4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1);
> 4b) Session 2 : DROP PUBLICATION p;
> 4c) Session 1 : Commit;
> 5. On the publisher : INSERT INTO test_1 VALUES(val2);
>
> After these, when I check the subscriber, both val1 and val2 have been
> replicated. I tried a few more inserts on publisher after this and
> they all got replicated to the subscriber. Only after explicitly
> creating a new publication p2 for test_1 on the publisher, the
> replication stopped. Most likely because the create publication
> command invalidated the cache.
>
> My guess is that this issue probably comes from the fact that
> RemoveObjects in dropcmds.c doesn't do any special handling or
> invalidation for the object drop command.
>

I checked further and I see that RemovePublicationById does do cache
invalidation but it is only done in the scenario when the publication
is on all tables. This is done without taking any locks. But for the
other cases (eg. publication on one table), I don't see any cache
invalidation in RemovePublicationById. That would explain why the
replication kept happening for multiple transactions after the drop
publication command in my example..

Thanks & Regards
Nitin Motiani
Google



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: Feature Request: Extending PostgreSQL's Identifier Length Limit
Следующее
От: Nitin Motiani
Дата:
Сообщение: Re: long-standing data loss bug in initial sync of logical replication