Re: Help diagnosing replication (copy) error

From: Jeff Ross
Subject: Re: Help diagnosing replication (copy) error
Msg-id: f39e6929-c290-4f08-bcdc-fe409c740fd7@openvistas.net
In reply to: Help diagnosing replication (copy) error (Steve Baldwin <steve.baldwin@gmail.com>)
Responses: Re: Help diagnosing replication (copy) error
List: pgsql-general

On 3/8/24 14:50, Steve Baldwin wrote:

> Hi,
>
> I'm in the process of migrating a cluster from 15.3 to 16.2. We have a 
> 'zero downtime' requirement so I'm using logical replication to create 
> the new cluster and then perform the switch in the application.
>
> I have a situation where all but one table have done their initial 
> copy. The remaining table is the largest (of course), and the 
> replication slot that is assigned for the copy 
> (pg_378075177_sync_60067_7343845372910323059) is showing as 
> 'active=false' if I select from pg_replication_slots on the publisher.
>
> I've checked the recent logs for both the publishing cluster and the 
> subscribing cluster but I can't see any replication errors. I guess I 
> could have missed them, but it doesn't seem like anything is being 
> 'retried' like I've seen in the past with replication errors.
>
> I've used this mechanism for zero-downtime upgrades multiple times in 
> the past, and have recently used it to upgrade smaller clusters from 
> 15.x to 16.2 without issue.
>
> The clusters are hosted on AWS RDS, so I have no access to the 
> servers, but if that's the only way to diagnose the issue, I can 
> create a support case.
>
> Does anyone have any suggestions as to where I should look for the issue?
>
> Thanks,
>
> Steve

In our setup we're logically replicating a 450G database hosted on real 
hardware to an RDS instance.

Multiple times we've had replication simply stop and we could never find 
any reason for that on either publisher or subscriber.
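
When the logs are silent, the catalogs are usually the next place to look. The sketch below is only an illustration, not something taken from our setup: it holds two catalog queries (one for the subscriber, one for the publisher) and a small helper that decodes the per-table `srsubstate` codes from `pg_subscription_rel`. The helper name `unfinished_tables` is made up for this example, and running the queries is left to whatever client you already use.

```python
# Per-table sync states stored in pg_subscription_rel.srsubstate,
# as listed in the PostgreSQL catalog documentation.
SYNC_STATES = {
    "i": "initialize",
    "d": "data is being copied",
    "f": "finished table copy",
    "s": "synchronized",
    "r": "ready (normal replication)",
}

# Run on the subscriber: one row per table in each subscription.
SUBSCRIBER_QUERY = """
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel
"""

# Run on the publisher: every slot and whether a connection is using it.
PUBLISHER_QUERY = """
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
"""


def unfinished_tables(rows):
    """Return (table, description) for every table not yet in the 'r' state.

    rows: iterable of (table_name, srsubstate) tuples, e.g. the result
    of SUBSCRIBER_QUERY fetched with any database driver.
    """
    return [
        (table, SYNC_STATES.get(state, "unknown state " + repr(state)))
        for table, state in rows
        if state != "r"
    ]
```

A table that stays in state `d` while its sync slot shows `active = false` on the publisher would match the situation Steve describes: the copy was started but nothing is currently connected to continue it.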

The *only* solution that ever worked in these cases was dropping the 
subscription in RDS and re-creating it with (copy_data = false).
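
For reference, the drop and re-create step boils down to two statements on the subscriber. Here is a minimal sketch that only builds the SQL text; the subscription name, connection string and publication name are placeholders, and the quoting helpers are simplified and assume `standard_conforming_strings` is on.

```python
def quote_ident(name):
    """Quote an SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Quote an SQL string literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def recreate_subscription_sql(sub_name, conninfo, publication):
    """Build the statements to drop a subscription and re-create it
    without the initial table copy.

    All three arguments are placeholders supplied by the caller.
    """
    return [
        # By default this also drops the subscription's slot on the
        # publisher, so it should not be wrapped in a transaction block.
        f"DROP SUBSCRIPTION {quote_ident(sub_name)};",
        f"CREATE SUBSCRIPTION {quote_ident(sub_name)} "
        f"CONNECTION {quote_literal(conninfo)} "
        f"PUBLICATION {quote_ident(publication)} "
        f"WITH (copy_data = false);",
    ]
```

Because the old slot goes away with the subscription, the new one starts from the publisher's current position, which is why changes made during the outage do not arrive on their own.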

At that point replication picks right up again for new transactions, 
*but* at the expense of losing all of the WAL that should have been 
replicated during the outage. I wrote a Python-based "logical 
replication fixer" to fill in those gaps.
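
The fixer itself is not shown here. Purely as an illustration of the idea, and not the actual tool, one way to find the gaps is to compare rows by primary key on both sides and work out what to insert, update or delete on the subscriber. The function name `plan_repair` and the dict-based input format are assumptions made for this sketch; fetching the rows is left out.

```python
def plan_repair(publisher_rows, subscriber_rows):
    """Compare two snapshots of one table and plan the repair.

    Both arguments are dicts mapping primary key -> row tuple.
    Returns the keys the subscriber is missing, the keys whose rows
    differ, and the keys that no longer exist on the publisher.
    """
    to_insert = sorted(k for k in publisher_rows if k not in subscriber_rows)
    to_delete = sorted(k for k in subscriber_rows if k not in publisher_rows)
    to_update = sorted(
        k
        for k, row in publisher_rows.items()
        if k in subscriber_rows and subscriber_rows[k] != row
    )
    return {"insert": to_insert, "update": to_update, "delete": to_delete}


# Example: key 2 changed, key 3 was added and key 9 was deleted on the
# publisher while replication was stopped.
publisher = {1: ("a",), 2: ("b2",), 3: ("c",)}
subscriber = {1: ("a",), 2: ("b",), 9: ("z",)}
plan = plan_repair(publisher, subscriber)
```

A comparison like this only makes sense for tables with a primary key, and on a large table it would need to run in key ranges rather than loading everything at once.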

Given that the subscriber is the one that initiates the connection to 
the publisher, and that replication resumes as soon as the subscription 
is dropped and re-created, my hunch is that this is squarely on RDS. 
With both publisher and subscriber on RDS, as in your case, YMMV.

RDS is a black box--who knows what's really going on there?  It would be 
interesting to see what the response is after you open a support case.  
I hope you'll be able to share that with the list.

Jeff






