Обсуждение: Issue with logical replication

Поиск
Список
Период
Сортировка

Issue with logical replication

От
Kacey Holston
Дата:
I ran into an issue of missing rows during an upgrade from  PostgreSQL 11 to PostgreSQL 13 using in core logical
replication.

## Procedure Used:

After copying over all the users and schema I ran this on the publisher:

"CREATE PUBLICATION upgrade11to13 FOR ALL TABLES;"

Then ran this on the subscriber:

"CREATE SUBSCRIPTION server13 CONNECTION ‘dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted >
password=<PASSWORDredacted > port=5432' PUBLICATION upgrade11to13;" 

The initial copy and replication appeared to run as expected. I saw all the tables in the ready state in
pg_subscription_rel.The tables pg_stat_repllication and pg_stat_subscription looked as expected as well. 

We then reversed the direction of replication by running this on the PostgreSQL 13 server:

"DROP SUBSCRIPTION server13;”

And this on the PostgreSQL 11 server:

"DROP PUBLICATION upgrade11to13;”

Back on the PostgreSQL 13 server:

"CREATE PUBLICATION reverse_to_11 FOR ALL TABLES;"

And on the PostgreSQL 11 server:

"CREATE SUBSCRIPTION oldserver11 CONNECTION 'dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted
>password=<PASSWORD redacted > port=5432' PUBLICATION reverse_to_11 with (copy_data=false);" 

At this point unexpected query results were returning and we discovered it was from missing rows in at least one table.
(Weunfortunately never checked additional tables).  

We were able to determine that all the rows missing were created on the publisher starting seven minutes after
replicationbegan and lasted for three hours.  

This table had a date created time stamp so we checked what time period the rows were missing from.  If you run a query
requestingall the records created during those three hours there is a 50k record discrepancy. If you run a query
requestingall the records created after those three hours all rows are in-tact. 

We did not retain the logs but at the time did not see any related errors.

## Second try:

We attempted to run this again.

We created a similar, new and fresh instance. Followed the same steps as above but this time created a replication slot
manually.

We received identical results. The records were missing from a handful of minutes after replication starts and last for
approximately3 hours. 

We were able to note that the data loss appeared to coincide approximately with the start of the table’s copy but the
dataloss ended before the table finished the copy. The table took about 12 hours to copy while the data loss was only
threehours. 

## Logs:

On the Subscriber, during the entire initial copy there were only four, spread out errors of the type:

"could not receive data from WAL stream: SSL SYSCALL error: EOF detected”

After these calls the next line in the server indicates a new copy and for different tables.

There were no related errors on the Publisher side.

## Server information

Publisher:

Ubuntu 16.04.7 LTS
GNU/Linux 4.4.0-1114-aws x86_64

Subscriber:

Ubuntu 20.04.2
GNU/Linux 5.4.0-1045-aws x86_64








Re: Issue with logical replication

От
Amit Kapila
Дата:
On Mon, Aug 9, 2021 at 1:58 AM Kacey Holston
<kacey.holston@pgexperts.com> wrote:
>
> I ran into an issue of missing rows during an upgrade from  PostgreSQL 11 to PostgreSQL 13 using in core logical
replication.
>
> ## Procedure Used:
>
> After copying over all the users and schema I ran this on the publisher:
>
> "CREATE PUBLICATION upgrade11to13 FOR ALL TABLES;"
>
> Then ran this on the subscriber:
>
> "CREATE SUBSCRIPTION server13 CONNECTION ‘dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted >
password=<PASSWORDredacted > port=5432' PUBLICATION upgrade11to13;" 
>
> The initial copy and replication appeared to run as expected. I saw all the tables in the ready state in
pg_subscription_rel.The tables pg_stat_repllication and pg_stat_subscription looked as expected as well. 
>
> We then reversed the direction of replication by running this on the PostgreSQL 13 server:
>
> "DROP SUBSCRIPTION server13;”
>
> And this on the PostgreSQL 11 server:
>
> "DROP PUBLICATION upgrade11to13;”
>
> Back on the PostgreSQL 13 server:
>
> "CREATE PUBLICATION reverse_to_11 FOR ALL TABLES;"
>
> And on the PostgreSQL 11 server:
>
> "CREATE SUBSCRIPTION oldserver11 CONNECTION 'dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER
redacted> password=<PASSWORD redacted > port=5432' PUBLICATION reverse_to_11 with (copy_data=false);" 
>
..
>
> We were able to note that the data loss appeared to coincide approximately with the start of the table’s copy but the
dataloss ended before the table finished the copy. The table took about 12 hours to copy while the data loss was only
threehours. 
>

It is not very clear from the above description where exactly the data
was missing on the PG-11 server or PG-13 server. If it is missing from
PG-11, then how is it related to the initial copy because you
mentioned (copy_data=false) while creating a subscription on 11 so it
should not perform initial sync?

--
With Regards,
Amit Kapila.



Re: Issue with logical replication

От
Kacey Holston
Дата:
My apologies.

The data was missing on PG13 from the initial copy.

> On Aug 9, 2021, at 1:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 9, 2021 at 1:58 AM Kacey Holston
> <kacey.holston@pgexperts.com> wrote:
>>
>> I ran into an issue of missing rows during an upgrade from  PostgreSQL 11 to PostgreSQL 13 using in core logical
replication.
>>
>> ## Procedure Used:
>>
>> After copying over all the users and schema I ran this on the publisher:
>>
>> "CREATE PUBLICATION upgrade11to13 FOR ALL TABLES;"
>>
>> Then ran this on the subscriber:
>>
>> "CREATE SUBSCRIPTION server13 CONNECTION ‘dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted
>password=<PASSWORD redacted > port=5432' PUBLICATION upgrade11to13;" 
>>
>> The initial copy and replication appeared to run as expected. I saw all the tables in the ready state in
pg_subscription_rel.The tables pg_stat_repllication and pg_stat_subscription looked as expected as well. 
>>
>> We then reversed the direction of replication by running this on the PostgreSQL 13 server:
>>
>> "DROP SUBSCRIPTION server13;”
>>
>> And this on the PostgreSQL 11 server:
>>
>> "DROP PUBLICATION upgrade11to13;”
>>
>> Back on the PostgreSQL 13 server:
>>
>> "CREATE PUBLICATION reverse_to_11 FOR ALL TABLES;"
>>
>> And on the PostgreSQL 11 server:
>>
>> "CREATE SUBSCRIPTION oldserver11 CONNECTION 'dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER
redacted> password=<PASSWORD redacted > port=5432' PUBLICATION reverse_to_11 with (copy_data=false);" 
>>
> ..
>>
>> We were able to note that the data loss appeared to coincide approximately with the start of the table’s copy but
thedata loss ended before the table finished the copy. The table took about 12 hours to copy while the data loss was
onlythree hours. 
>>
>
> It is not very clear from the above description where exactly the data
> was missing on the PG-11 server or PG-13 server. If it is missing from
> PG-11, then how is it related to the initial copy because you
> mentioned (copy_data=false) while creating a subscription on 11 so it
> should not perform initial sync?
>
> --
> With Regards,
> Amit Kapila.




Re: Issue with logical replication

От
Amit Kapila
Дата:
On Tue, Aug 10, 2021 at 12:25 AM Kacey Holston
<kacey.holston@pgexperts.com> wrote:
>
> My apologies.
>
> The data was missing on PG13 from the initial copy.
>

Okay, so all the steps after you say "We then reversed the direction
of replication by running this on the PostgreSQL 13 server:" in your
email are not required to see the problem, right?

Can you please tell whether the data you see missing is the new data
generated on the publisher when the copy is in progress or it was the
original data of the table? Please tell me the situation when you have
used Create Subscription command *not* when you were using manually
created slot. Also, do you have any subscriber/publisher side logs
during that time which you can share?

Note - Please respond to the email in-line as it is easier to
communicate that way and that is the style we use to discuss here.

-- 
With Regards,
Amit Kapila.



Re: Issue with logical replication

От
Kacey Holston
Дата:



> On Aug 9, 2021, at 9:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 10, 2021 at 12:25 AM Kacey Holston
> <kacey.holston@pgexperts.com> wrote:
>>
>> My apologies.
>>
>> The data was missing on PG13 from the initial copy.
>>
>
> Okay, so all the steps after you say "We then reversed the direction
> of replication by running this on the PostgreSQL 13 server:" in your
> email are not required to see the problem, right?

The reversed direction is only included for fullness of steps. The data is missing before that happens, we just did not
discoverthe missing data the first time until after the we reversed direction. 

When made the second attempt we confirmed the data was missing after the initial sync completed.

So, the steps can be ignored after the CREATE SUBSCRIPTION command.

>
> Can you please tell whether the data you see missing is the new data
> generated on the publisher when the copy is in progress or it was the
> original data of the table?

It was the original data of the table that is missing.

> Please tell me the situation when you have
> used Create Subscription command *not* when you were using manually
> created slot. Also, do you have any subscriber/publisher side logs
> during that time which you can share?

I do have the logs from the second attempt but do not feel comfortable publicly exposing them.

>
> Note - Please respond to the email in-line as it is easier to
> communicate that way and that is the style we use to discuss here.
>
> --
> With Regards,
> Amit Kapila.




Re: Issue with logical replication

От
Masahiko Sawada
Дата:
On Wed, Aug 11, 2021 at 3:28 AM Kacey Holston
<kacey.holston@pgexperts.com> wrote:
>
>
>
>
>
> > On Aug 9, 2021, at 9:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Aug 10, 2021 at 12:25 AM Kacey Holston
> > <kacey.holston@pgexperts.com> wrote:
> >>
> >> My apologies.
> >>
> >> The data was missing on PG13 from the initial copy.
> >>
> >
> > Okay, so all the steps after you say "We then reversed the direction
> > of replication by running this on the PostgreSQL 13 server:" in your
> > email are not required to see the problem, right?
>
> The reversed direction is only included for fullness of steps. The data is missing before that happens, we just did
notdiscover the missing data the first time until after the we reversed direction.
 
>
> When made the second attempt we confirmed the data was missing after the initial sync completed.
>
> So, the steps can be ignored after the CREATE SUBSCRIPTION command.
>
> >
> > Can you please tell whether the data you see missing is the new data
> > generated on the publisher when the copy is in progress or it was the
> > original data of the table?
>
> It was the original data of the table that is missing.

This seems to mean that COPY by table sync workers missed rows, which
isn't likely to happen, though. On the other hand, reading the
following description in the first your message, I thought the data
you see missing on the subscriber (pg13) were created on the publisher
*after* starting logical replication:

---
We were able to determine that all the rows missing were created on
the publisher starting seven minutes after replication began and
lasted for three hours.
---

Am I missing something?

The queries executed on the publisher and the subscriber don't always
necessarily match since the data are replicated asynchronously. I
suppose the publisher (pg11) was receiving queries during table sync.
Could you elaborate on how you checked if the rows were missing?

Also, could you share the server versions including the minor version?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/