Re: BUG #17928: Standby fails to decode WAL on termination of primary

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: BUG #17928: Standby fails to decode WAL on termination of primary
Дата
Msg-id 20230711.155414.939136339819647605.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Noah Misch <noah@leadboat.com>)
Ответы Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Noah Misch <noah@leadboat.com>)
Список pgsql-bugs
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> > On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > 2023-05-11 20:19:22.248 MSK [2037134] FATAL:  invalid memory alloc request size 2021163525
> > > 2023-05-11 20:19:22.248 MSK [2037114] LOG:  startup process (PID 2037134) exited with exit code 1
> > 
> > Thanks Alexander.  Looking into this.  I think it is probably
> > something like: recycled standby pages are not zeroed (something we
> > already needed to do something about[1]), and when we read a recycled
> > garbage size (like your "xxxx") at the end of a page at an offset
> > where we don't have a full record header on one page, we skip the
> > ValidXLogRecordHeader() call (and always did), but the check in
> > allocate_recordbuf() which previously handled that "gracefully" (well,
> > it would try to allocate up to 1GB bogusly, but it wouldn't try to
> > allocate more than that and ereport) is a bit too late.  I probably
> > need to add an earlier not-too-big validation.  Thinking.
> 
> I agree about an earlier not-too-big validation.  Like the attached?  I
> haven't tested it with Alexander's recipe or pondered it thoroughly.

I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages. However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.

> > [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
> 
> Regarding [1], is it still worth zeroing recycled pages on standbys and/or
> reading the whole header before allocating xl_tot_len?  (Are there benefits
> other than avoiding a 1G backend allocation or 4G frontend allocation, or is
> that benefit worth the cycles?)

I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.

regards.


Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: BUG #18009: Postgres Recovery not happening
Следующее
От: Andrew Dunstan
Дата:
Сообщение: Re: BUG #17994: Invalidating relcache corrupts tupDesc inside ExecEvalFieldStoreDeForm()