Re: Race condition with restore_command on streaming replica

From: Dilip Kumar
Subject: Re: Race condition with restore_command on streaming replica
Msg-id: CAFiTN-t7-1TOF-x2bc1h1CMyVqzJxKhzcMgsQxWqxAi2nN0gdw@mail.gmail.com
In reply to: Race condition with restore_command on streaming replica  ("Brad Nicholson" <bradn@ca.ibm.com>)
Responses: RE: Race condition with restore_command on streaming replica  ("Brad Nicholson" <bradn@ca.ibm.com>)
List: pgsql-general

On Thu, Nov 5, 2020 at 1:39 AM Brad Nicholson <bradn@ca.ibm.com> wrote:
>
> Hi,
>
> I've recently been seeing a race condition with the restore_command on replicas using streaming replication.
>
> On the primary, we are archiving wal files to s3 compatible storage via pgBackRest. In the recovery.conf section of the postgresql.conf file on the replicas, we define the restore command as follows:
>
> restore_command = '/usr/bin/pgbackrest --config /conf/postgresql/pgbackrest_restore.conf --stanza=formation archive-get %f "%p"'
>
> We have a three member setup - m-0, m-1, m-2. Consider the case where m-0 is the Primary and m-1 and m-2 are replicas connected to the m-0.
>
> When issuing a switchover (via Patroni) from m-0 to m-1, the connection from m-2 to m-0 is terminated. The restore_command on m-2 is run, and it looks for the .history file for the new timeline. If this happens before the history file is created and pushed to the archive, m-2 will look for the next wal file on the existing timeline in the archive. It will never be created as the source has moved on, so this m-2 hangs waiting on that file. The restore_command on the replica looking for this non-existent file is only run once. This seems like an odd state to be in. The replica is waiting on a new file, but it's not actually looking for it. Is this expected, or should the restore_command be polling for that file?

I am not sure how Patroni does this internally; can you explain the
scenario in more detail? Suppose you execute the promote on m-1. If
the promotion succeeds, m-1 switches to a new timeline and creates
the timeline history file. Once the promotion has succeeded, if we
change primary_conninfo on m-2, it will restart the walreceiver and
look for the latest .history file, which it should find either
through direct streaming or through the restore_command. If you are
saying that m-2 looked for the history file before m-1 had created
it, then it seems that primary_conninfo on m-2 was changed before
the m-1 promotion had completed.
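
To make the ordering concrete, here is a rough sketch (not how Patroni
does it, which I have not checked) of waiting for the new timeline's
history file before repointing m-2. The helper names and the exists()
callback are hypothetical; how you actually check the archive depends on
your setup. The only facts relied on are that a history file is named
after its timeline ID as 8 hex digits plus ".history", and that the
first 8 hex digits of a WAL segment name are the timeline ID.

```python
import time
from typing import Callable


def history_file_name(timeline_id: int) -> str:
    """Return the history file name for a timeline, e.g. 2 -> 00000002.history."""
    return f"{timeline_id:08X}.history"


def timeline_of_wal_segment(segment_name: str) -> int:
    """Extract the timeline ID from a 24-character WAL segment file name."""
    if len(segment_name) != 24:
        raise ValueError(f"not a WAL segment name: {segment_name!r}")
    return int(segment_name[:8], 16)


def wait_for_history(
    exists: Callable[[str], bool],  # hypothetical: True once the file is in the archive
    new_timeline: int,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the new timeline's history file is visible, or time out."""
    name = history_file_name(new_timeline)
    deadline = time.monotonic() + timeout_s
    while True:
        if exists(name):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The idea would be to change primary_conninfo on m-2 only after
wait_for_history() returns True for the timeline that m-1 was promoted
to (usually, but not necessarily, the old timeline plus one).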

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


