Re: Infinite loop in XLogPageRead() on standby

From Alexander Kukushkin
Subject Re: Infinite loop in XLogPageRead() on standby
Date
Msg-id CAFh8B==zUj1+asN5REAvqJccgUZFgOh5Ze9c=mOrGypRuTEm=g@mail.gmail.com
In reply to Re: Infinite loop in XLogPageRead() on standby (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Infinite loop in XLogPageRead() on standby
List pgsql-hackers
Hi Kyotaro,

On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> In the first place, it's important to note that we do not guarantee
> that an async standby can always switch its replication connection to
> the old primary or another sibling standby. This is due to the
> variations in replication lag among standbys. pg_rewind is required to
> adjust such discrepancies.

Sure, I know. But in this case the async standby received and flushed exactly the same amount of WAL as the promoted one.
 

> I might be overlooking something, but I don't understand how this
> occurs without purposefully tweaking WAL files. The repro script
> pushes an incomplete WAL file to the archive as a non-partial
> segment. This shouldn't happen in the real world.

It easily happens if the primary crashed and the standbys didn't receive the next page, the one holding the continuation of the record.

> In the repro script, the replication connection of the second standby
> is switched from the old primary to the first standby after its
> promotion. After the switching, replication is expected to continue
> from the beginning of the last replayed segment.

Well, maybe, but apparently the standby is busy trying to decode a record that spans multiple pages, and it just waits indefinitely for the next page to arrive. Also, a restart "fixes" the problem, because the standby then indeed reads the file from the beginning.
 
> But with the script,
> the second standby copies the intentionally broken file, which differs
> from the data that should be received via streaming.

As I already said, this is a simple way to emulate a primary crash while the standbys are receiving WAL.
It could easily happen that a record spanning multiple pages is not fully received and flushed.
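
To make the scenario concrete, here is a toy model of it. This is only a sketch and not PostgreSQL code: the page size, the function names (flushed_pages, read_record) and the max_attempts cap are all assumptions made up for illustration. It shows a reader that needs the continuation page of a multi-page record and keeps polling for it; the cap exists only so the example terminates, whereas an uncapped loop would spin forever when the page never arrives.

```python
from typing import List, Optional

PAGE_SIZE = 8  # toy page size in bytes (assumption); real WAL pages are far larger


def flushed_pages(record: bytes, pages_received: int) -> List[bytes]:
    """Split a record into pages and keep only those the standby flushed."""
    pages = [record[i:i + PAGE_SIZE] for i in range(0, len(record), PAGE_SIZE)]
    return pages[:pages_received]


def read_record(pages: List[bytes], record_len: int,
                max_attempts: int = 5) -> Optional[bytes]:
    """Reassemble a record of record_len bytes from the flushed pages.

    Returns the record, or None after max_attempts polls for a page that
    never shows up. Without the cap this loop would never terminate.
    """
    buf = b""
    next_page = 0
    attempts = 0
    while len(buf) < record_len:
        if next_page < len(pages):
            # The page is available locally: consume it and move on.
            buf += pages[next_page]
            next_page += 1
            continue
        # The continuation page is missing: "wait" for it and retry.
        attempts += 1
        if attempts >= max_attempts:
            return None
    return buf[:record_len]


record = b"x" * 20  # spans three toy pages: 8 + 8 + 4 bytes
print(read_record(flushed_pages(record, 3), len(record)) == record)
print(read_record(flushed_pages(record, 2), len(record)))
```

With all three pages flushed the record is reassembled; with only two, the reader gives up solely because of the artificial cap.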

--
Regards,
--
Alexander Kukushkin
