Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

From Imseih (AWS), Sami
Subject Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Msg-id D98404B4-F94D-4FD8-99E4-773E4615669F@amazon.com
In response to Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  ("Imseih (AWS), Sami" <simseih@amazon.com>)
List pgsql-hackers
> Any luck with this?

Apologies for the delay, as I have been away. 
I will test this next week and report back my findings.

Thanks

Sami Imseih
Amazon Web Services (AWS)


On 6/28/22, 2:10 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:

    I'd like to look into the WAL segments related to the failure.

    With the patch, xlogreader->abortedRecPtr is valid if and only if the
    last failed record read was an aborted contrec. If recovery ends
    there, the first inserted record is an "aborted contrec" record. The
    only way I can see for an aborted contrec to be followed by a
    non-"aborted contrec" record is that recovery somehow fetches two
    consecutive WAL segments that are inconsistent at the boundary.


    I found the reason why the TAP test doesn't respond to the first
    proposed patch (shown below).

    -                       if (!StandbyMode &&
    +                       if (!StandbyModeRequested &&
                                    !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))

    The cause was that I had disabled standby mode in the test to make it
    simpler and more reliable, whereas the change has an effect only while
    standby mode is on. The first attachment detects the same error (in a
    somewhat unstable way), responds to the fix above, and also responds
    to aborted_contrec_reset_3.patch.

    So, aborted_contrec_reset_3 looks closer to the issue than before.

    Would you mind trying the second attachment to obtain a detailed log
    in your testing environment? With that patch, the modified TAP test
    yields log lines like the ones below.

    2022-06-28 15:49:20.661 JST [165472] LOG:  ### [A] @0/1FFD338: abort=(0/1FFD338)0/0, miss=(0/2000000)0/0, SbyMode=0,SbyModeReq=1
    ...
    2022-06-28 15:49:20.681 JST [165472] LOG:  ### [F] @0/2094610: abort=(0/0)0/1FFD338, miss=(0/0)0/2000000, SbyMode=1,SbyModeReq=1
    ...
    2022-06-28 15:49:20.767 JST [165472] LOG:  ### [S] @0/2094610: abort=(0/0)0/1FFD338, miss=(0/0)0/2000000, SbyMode=0,SbyModeReq=1
    ...
    2022-06-28 15:49:20.777 JST [165470] PANIC:  xlog flush request 0/2094610 is not satisfied --- flushed only to 0/2000088

    In this example, abortedRecPtr is set at the first line, recovery
    continues to 0/2094610 without abortedRecPtr being reset, and then
    the server PANICs. ([A] marks an aborted contrec failure; [F] and [S]
    mark failed and succeeded reads, respectively.)
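
To make the PANIC line easier to read, here is a small sketch that parses the two positions it reports and compares them. WAL positions are printed as two hexadecimal halves separated by a slash (high 32 bits, then low 32 bits). The helper names parse_lsn and flush_satisfied are hypothetical and only model the comparison; they are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse "hi/lo" hexadecimal text into a 64-bit WAL position.
 * Returns false if the text does not have that shape. */
static bool
parse_lsn(const char *text, uint64_t *out)
{
    unsigned int hi;
    unsigned int lo;

    assert(text != NULL && out != NULL);
    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return false;
    *out = ((uint64_t) hi << 32) | lo;
    return true;
}

/* A flush request is satisfied once WAL is flushed at least up to it. */
static bool
flush_satisfied(uint64_t requested, uint64_t flushed)
{
    return flushed >= requested;
}
```

Feeding it "0/2094610" as the requested position and "0/2000088" as the flushed position makes flush_satisfied() return false: the request lies 0x94588 bytes beyond what was actually flushed, which matches the situation the PANIC message describes.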

    regards.

    --
    Kyotaro Horiguchi
    NTT Open Source Software Center

