Обсуждение: Trying to understand a failed upgrade in AWS RDS

Поиск
Список
Период
Сортировка

Trying to understand a failed upgrade in AWS RDS

От
Mike Lissner
Дата:
Hi all,

In AWS RDS, we are using logical replication between a postgresql 14 publisher and a postgresql 10 subscriber. The subscriber is rather old, so yesterday I tried to update it using AWS's built in upgrade tool (which uses pg_upgrade behind the scenes).

I did a pretty thorough test run before beginning, but the live run went pretty poorly. My process was:

1. Disable the subscription to pg10.
2. Run RDS's upgrade (which runs pg_upgrade).
3. Re-Enable the subscription to the newly upgraded server.

The idea was that the publisher could still be live and collect changes, and then on step 3, those changes would flush to the newly upgraded server.

When I hit step three, things went awry. From what I can tell, it seems like pg_upgrade might have wiped out the LSN location of the subscriber, because I was getting many messages in the logs saying:

2023-05-19 01:01:09 UTC:100.20.224.120(56536):django@courtlistener:[29669]:STATEMENT: CREATE_REPLICATION_SLOT "pg_18278_sync_86449755_7234675743763347169" LOGICAL pgoutput USE_SNAPSHOT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: replication slot "pg_18278_sync_16561_7234675743763347169" does not exist
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:STATEMENT: DROP_REPLICATION_SLOT pg_18278_sync_16561_7234675743763347169 WAIT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: all replication slots are in use
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:HINT: Free one or increase max_replication_slots.
I followed those instructions, and upped max_replication_slots to 200. That fixed that error, but then I had errors about COPY commands failing, and looking in the publisher I saw about 150 slots like:
select * from pg_replication_slots ;                 slot_name                  |  plugin  | slot_type | datoid |   database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase 
--------------------------------------------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------+----------- pg_18278_sync_86449408_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | t      |       6906 |      |    859962500 | EA5/954A9F18 |                     | reserved   |               | f pg_18278_sync_20492279_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EDF0 | EA5/9548EE28        | reserved   |               | f pg_18278_sync_16940_7234675743763347169    | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EE60 | EA5/9548EE98        | reserved   |               | f

So this looks like it's trying to sync all of the existing tables all over again when I re-enabled the subscription.

Does that make sense? In the future, I'll DROP the subscription and then create a new one with copy_data=False, but this was a real gotcha. 

Anybody know what's going on here?

Thanks,

Mike

Re: Trying to understand a failed upgrade in AWS RDS

От
Mike Lissner
Дата:
I also am realizing belatedly that my solution of dropping the subscriber probably won't work anyway, since I'd lose the changes on the publisher for the duration of the upgrade. Maybe I could drop the subscription while keeping the slot on the publisher, and then create a new subscription after the upgrade using that slot and copy_data=False? Getting wonky.

On Fri, May 19, 2023 at 8:17 AM Mike Lissner <mlissner@michaeljaylissner.com> wrote:
Hi all,

In AWS RDS, we are using logical replication between a postgresql 14 publisher and a postgresql 10 subscriber. The subscriber is rather old, so yesterday I tried to update it using AWS's built in upgrade tool (which uses pg_upgrade behind the scenes).

I did a pretty thorough test run before beginning, but the live run went pretty poorly. My process was:

1. Disable the subscription to pg10.
2. Run RDS's upgrade (which runs pg_upgrade).
3. Re-Enable the subscription to the newly upgraded server.

The idea was that the publisher could still be live and collect changes, and then on step 3, those changes would flush to the newly upgraded server.

When I hit step three, things went awry. From what I can tell, it seems like pg_upgrade might have wiped out the LSN location of the subscriber, because I was getting many messages in the logs saying:

2023-05-19 01:01:09 UTC:100.20.224.120(56536):django@courtlistener:[29669]:STATEMENT: CREATE_REPLICATION_SLOT "pg_18278_sync_86449755_7234675743763347169" LOGICAL pgoutput USE_SNAPSHOT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: replication slot "pg_18278_sync_16561_7234675743763347169" does not exist
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:STATEMENT: DROP_REPLICATION_SLOT pg_18278_sync_16561_7234675743763347169 WAIT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: all replication slots are in use
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:HINT: Free one or increase max_replication_slots.
I followed those instructions, and upped max_replication_slots to 200. That fixed that error, but then I had errors about COPY commands failing, and looking in the publisher I saw about 150 slots like:
select * from pg_replication_slots ;                 slot_name                  |  plugin  | slot_type | datoid |   database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase 
--------------------------------------------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------+----------- pg_18278_sync_86449408_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | t      |       6906 |      |    859962500 | EA5/954A9F18 |                     | reserved   |               | f pg_18278_sync_20492279_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EDF0 | EA5/9548EE28        | reserved   |               | f pg_18278_sync_16940_7234675743763347169    | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EE60 | EA5/9548EE98        | reserved   |               | f

So this looks like it's trying to sync all of the existing tables all over again when I re-enabled the subscription.

Does that make sense? In the future, I'll DROP the subscription and then create a new one with copy_data=False, but this was a real gotcha. 

Anybody know what's going on here?

Thanks,

Mike

Re: Trying to understand a failed upgrade in AWS RDS

От
"Elterman, Michael"
Дата:
Please, use the following runbook. 
1. Disable the subscription to pg10.
2. Disable Application Users on Publisher.
3. Drop all replication slots on Publisher (The upgrade can not be executed if there are any replication slots)
4. Run RDS's upgrade (which runs pg_upgrade).
5. Recreate replication slots with the same names.
6. Enable Application Users on Publisher.
7. Re-Enable the subscriptions to the newly upgraded server.
Good luck

On Fri, May 19, 2023 at 11:49 AM Mike Lissner <mlissner@michaeljaylissner.com> wrote:
I also am realizing belatedly that my solution of dropping the subscriber probably won't work anyway, since I'd lose the changes on the publisher for the duration of the upgrade. Maybe I could drop the subscription while keeping the slot on the publisher, and then create a new subscription after the upgrade using that slot and copy_data=False? Getting wonky.

On Fri, May 19, 2023 at 8:17 AM Mike Lissner <mlissner@michaeljaylissner.com> wrote:
Hi all,

In AWS RDS, we are using logical replication between a postgresql 14 publisher and a postgresql 10 subscriber. The subscriber is rather old, so yesterday I tried to update it using AWS's built in upgrade tool (which uses pg_upgrade behind the scenes).

I did a pretty thorough test run before beginning, but the live run went pretty poorly. My process was:

1. Disable the subscription to pg10.
2. Run RDS's upgrade (which runs pg_upgrade).
3. Re-Enable the subscription to the newly upgraded server.

The idea was that the publisher could still be live and collect changes, and then on step 3, those changes would flush to the newly upgraded server.

When I hit step three, things went awry. From what I can tell, it seems like pg_upgrade might have wiped out the LSN location of the subscriber, because I was getting many messages in the logs saying:

2023-05-19 01:01:09 UTC:100.20.224.120(56536):django@courtlistener:[29669]:STATEMENT: CREATE_REPLICATION_SLOT "pg_18278_sync_86449755_7234675743763347169" LOGICAL pgoutput USE_SNAPSHOT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: replication slot "pg_18278_sync_16561_7234675743763347169" does not exist
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:STATEMENT: DROP_REPLICATION_SLOT pg_18278_sync_16561_7234675743763347169 WAIT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: all replication slots are in use
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:HINT: Free one or increase max_replication_slots.
I followed those instructions, and upped max_replication_slots to 200. That fixed that error, but then I had errors about COPY commands failing, and looking in the publisher I saw about 150 slots like:
select * from pg_replication_slots ;                 slot_name                  |  plugin  | slot_type | datoid |   database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase 
--------------------------------------------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------+----------- pg_18278_sync_86449408_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | t      |       6906 |      |    859962500 | EA5/954A9F18 |                     | reserved   |               | f pg_18278_sync_20492279_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EDF0 | EA5/9548EE28        | reserved   |               | f pg_18278_sync_16940_7234675743763347169    | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EE60 | EA5/9548EE98        | reserved   |               | f

So this looks like it's trying to sync all of the existing tables all over again when I re-enabled the subscription.

Does that make sense? In the future, I'll DROP the subscription and then create a new one with copy_data=False, but this was a real gotcha. 

Anybody know what's going on here?

Thanks,

Mike

Re: Trying to understand a failed upgrade in AWS RDS

От
Mike Lissner
Дата:
Thanks for the suggestions. I think in the future I'll do something like this rather than try to re-use existing subscriptions.

I'm still trying to understand what went wrong though. Putting a finer point on my question: Does pg_upgrade mess up disabled subscriptions?

On Fri, May 19, 2023 at 1:55 PM Elterman, Michael <melterman@enova.com> wrote:
Please, use the following runbook. 
1. Disable the subscription to pg10.
2. Disable Application Users on Publisher.
3. Drop all replication slots on Publisher (The upgrade can not be executed if there are any replication slots)
4. Run RDS's upgrade (which runs pg_upgrade).
5. Recreate replication slots with the same names.
6. Enable Application Users on Publisher.
7. Re-Enable the subscriptions to the newly upgraded server.
Good luck

On Fri, May 19, 2023 at 11:49 AM Mike Lissner <mlissner@michaeljaylissner.com> wrote:
I also am realizing belatedly that my solution of dropping the subscriber probably won't work anyway, since I'd lose the changes on the publisher for the duration of the upgrade. Maybe I could drop the subscription while keeping the slot on the publisher, and then create a new subscription after the upgrade using that slot and copy_data=False? Getting wonky.

On Fri, May 19, 2023 at 8:17 AM Mike Lissner <mlissner@michaeljaylissner.com> wrote:
Hi all,

In AWS RDS, we are using logical replication between a postgresql 14 publisher and a postgresql 10 subscriber. The subscriber is rather old, so yesterday I tried to update it using AWS's built in upgrade tool (which uses pg_upgrade behind the scenes).

I did a pretty thorough test run before beginning, but the live run went pretty poorly. My process was:

1. Disable the subscription to pg10.
2. Run RDS's upgrade (which runs pg_upgrade).
3. Re-Enable the subscription to the newly upgraded server.

The idea was that the publisher could still be live and collect changes, and then on step 3, those changes would flush to the newly upgraded server.

When I hit step three, things went awry. From what I can tell, it seems like pg_upgrade might have wiped out the LSN location of the subscriber, because I was getting many messages in the logs saying:

2023-05-19 01:01:09 UTC:100.20.224.120(56536):django@courtlistener:[29669]:STATEMENT: CREATE_REPLICATION_SLOT "pg_18278_sync_86449755_7234675743763347169" LOGICAL pgoutput USE_SNAPSHOT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: replication slot "pg_18278_sync_16561_7234675743763347169" does not exist
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:STATEMENT: DROP_REPLICATION_SLOT pg_18278_sync_16561_7234675743763347169 WAIT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: all replication slots are in use
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:HINT: Free one or increase max_replication_slots.
I followed those instructions, and upped max_replication_slots to 200. That fixed that error, but then I had errors about COPY commands failing, and looking in the publisher I saw about 150 slots like:
select * from pg_replication_slots ;                 slot_name                  |  plugin  | slot_type | datoid |   database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase 
--------------------------------------------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------+----------- pg_18278_sync_86449408_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | t      |       6906 |      |    859962500 | EA5/954A9F18 |                     | reserved   |               | f pg_18278_sync_20492279_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EDF0 | EA5/9548EE28        | reserved   |               | f pg_18278_sync_16940_7234675743763347169    | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EE60 | EA5/9548EE98        | reserved   |               | f

So this looks like it's trying to sync all of the existing tables all over again when I re-enabled the subscription.

Does that make sense? In the future, I'll DROP the subscription and then create a new one with copy_data=False, but this was a real gotcha. 

Anybody know what's going on here?

Thanks,

Mike

Re: Trying to understand a failed upgrade in AWS RDS

От
Julien Rouhaud
Дата:
On Sat, 20 May 2023, 05:56 Mike Lissner, <mlissner@michaeljaylissner.com> wrote:

I'm still trying to understand what went wrong though. Putting a finer point on my question: Does pg_upgrade mess up disabled subscriptions?

yes, whether they're disabled or not. As far as I know it's impossible to reliably pg_upgrade a node that has subscriptions and eventually resume logical replication. 

It's possible to make it work with some efforts in some basic configurations and / or if no changes happen on the publications, but it's up to you trying to find out if your specific scenario can work as it's not documented. It's also impossible to check whether some incompatible events happened on any of the publisher nodes so you have to make sure that you have total control and knowledge of all the activity that happens on the publisher nodes during the upgrade. 

Re: Trying to understand a failed upgrade in AWS RDS

От
Mike Lissner
Дата:
As far as I know it's impossible to reliably pg_upgrade a node that has subscriptions and eventually resume logical replication. 

Should this go in the documentation somewhere? Maybe in the pg_upgrade notes? I still don't understand the mechanism. You also say that:
 
It's possible to make it work with some efforts in some basic configurations and / or if no changes happen on the publications

But that kind of surprises me too, actually, because it seemed like pg_upgrade wiped out the LSN locations of the subcriber, making it start all over.

Upgrading a subscriber seems like something that could/should work, so it should be documented if pg_upgrade is incompatible with maintaining a subscription, shouldn't it?

Re: Trying to understand a failed upgrade in AWS RDS

От
Alan Hodgson
Дата:
On Sun, 2023-05-21 at 07:56 -0700, Mike Lissner wrote:
As far as I know it's impossible to reliably pg_upgrade a node that has subscriptions and eventually resume logical replication. 


Should this go in the documentation somewhere? Maybe in the pg_upgrade notes? I still don't understand the mechanism. You also say that:
 
It's possible to make it work with some efforts in some basic configurations and / or if no changes happen on the publications


But that kind of surprises me too, actually, because it seemed like pg_upgrade wiped out the LSN locations of the subcriber, making it start all over.

Upgrading a subscriber seems like something that could/should work, so it should be documented if pg_upgrade is incompatible with maintaining a subscription, shouldn't it?

The docs are strangely silent on this. AFAIK pg_upgrade on either the publisher or subscriber breaks logical replication, which does make sense since pg_upgrade basically makes a new database cluster as it runs.

There is a way to manually set the LSN position of an enabled=false replication slot, but I've failed to make that work right in tests so far.