Replication Question / Issue - PRIMARY with SYNC and ASYNC Replication

Поиск
Список
Период
Сортировка
От Edwin UY
Тема Replication Question / Issue - PRIMARY with SYNC and ASYNC Replication
Дата
Msg-id CA+wokJ_sf=EZrdHgPnPn_6vUG43B0yTHS50MaM5jLeS1z6LYyQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: Replication Question / Issue - PRIMARY with SYNC and ASYNC Replication
Список pgsql-admin
Hi,

Apologies for a long email. I suppose as much information as possible will help with troubleshooting
PostgreSQL is Version 11. I know, it's old, I don't have a choice due to the application.

There is a PRIMARY and 2 replicas, SYNC and ASYNC.
We had a network outage that rendered the application unusable for some reason even though we still have a PRIMARY and a replication server in place.
This is now resolved since the network is restored so I am just wanting to get some guidance for a quick resolution in the future.

Not really sure how to confirm which one is SYNC or ASYNC.
select * from pg_stat_replication from the PRIMARY shows nothing
So, I am left with no choice but to trust the documentation where it says

SERVER -E = PRIMARY
SERVER -F  = ASYNC
SERVER -G = SYNC

When we have the network issue.
SERVER-E and SERVER-F are accessible and they can communicate to each other. SERVER-G is not accessible. However the application connection is intermittently dropping.

The primary is showing several errors like below:
STATEMENT:  ROLLBACK PREPARED 'gid'
ERROR:  prepared transaction with identifier "gid" is busy

SERVER-F is showing
FATAL:  could not connect to the primary server: could not connect to server: No route to host
                Is the server running on host "SERVER-G" and accepting
                TCP/IP connections on port 5432?

Can't check SERVER-G as it is not accessible.

I assume the prepared transactions are from the replication, not from the application.
The error from SERVER-F is as expected since SERVER-G is not accessible.
Under this scenario, the application is intermittently having issues connecting to the database. Not sure why.
We have re-started both databases SERVER-E and SERVER-F and clear up the prepared transaction as well using https://www.cybertec-postgresql.com/en/prepared-transactions.
After startup we can see the prepared transaction gone, pg_prepared_xacts is emptty and then will show one one prepare transaction that is active based on pg_stat_activity.
select * from pg_stat_replication still shows nothing.
To resolve the SERVER-F error, we change the recovery.conf and change primary_conninfo to use SERVER-E.
This still did not resolve the application issue and the primary log still shows the following every so often.

STATEMENT:  ROLLBACK PREPARED 'gid'
ERROR:  prepared transaction with identifier "gid" is busy

At this stage, I thought maybe the PRIMARY and the replicas are configured in such a way that the PRIMARY must receive confirmation from both that it has committed too otherwise it will just continue waiting.
Under this scenario, it is not able too since SERVER-G is not accessible. Does that make sense?

Anyway, maybe someone will be interested to read this email and can shed some light on this and can advise whether there's some configuration setting somewhere that we should have modified as a temporary workaround.
Could it be because of synchronous_commit= on? Maybe we should have changed this when SERVER-G is not accessible?

Everything is back to normal once SERVER-G has become accessible again.
That is about 6 hours though :( and doesn't explain why things will stop working normally when a replica is down and the PRIMARY is still accessible.
Does that mean, if both replicas are down and only the PRIMARY is accessible, we have to totally turn off / disable replication?
If we do need to break the replica, when the PRIMARY is UP and both replicas are inaccessible, do we just unset synchronous_standby_names?

Any reply is much appreciated. Thanks in advance.

Regards,
Ed

В списке pgsql-admin по дате отправления: