Discussion: Re: 011_crash_recovery.pl intermittently fails

Re: 011_crash_recovery.pl intermittently fails

From
Thomas Munro
Date:
On Mon, Mar 8, 2021 at 9:32 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sun, 07 Mar 2021 20:09:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > Thanks!  I'm afraid I wouldn't get around to it for a few weeks, so if
> > > you have time, please do.  (I'm not sure if it's strictly necessary to
> > > log *this* xid, if a higher xid has already been logged, considering
> > > that the goal is just to avoid getting confused about an xid that is
> > > recycled after crash recovery, but coordinating that might be more
> > > complicated; I don't know.)
> >
> > Yeah, ideally the patch wouldn't add any unnecessary WAL flush,
> > if there's some cheap way to determine that our XID must already
> > have been written out.  But I'm not sure that it's worth adding
> > any great amount of complexity to avoid that.  For sure I would
> > not advocate adding any new bookkeeping overhead in the mainline
> > code paths to support it.
>
> We need to *write* an additional record if the current transaction
> hasn't yet written one (EnsureTopTransactionIdLogged()). One
> annoyance is what is possibly the most common usage, calling
> pg_current_xact_id() at the beginning of a transaction, which leads to
> an additional 8-byte-long XLOG_XACT_ASSIGNMENT record. We could also
> avoid that by detecting that a larger xid has already been flushed out.

Yeah, that would be very expensive for users doing that.

> I haven't found a simple and clean way to track the maximum
> flushed-out XID.  The new cooperation between xlog.c and xact.c
> on XIDs and LSNs would happen through a shared variable, which makes
> things complex...
>
> So the attached doesn't contain the max-flushed-xid tracking feature.

I guess that would be just as expensive if the user does that
sequentially with small transactions (ie allocating xids one by one).

I remembered this thread after seeing the failure of Michael's new
build farm animal "tanager".  I think we need to solve this somehow...
according to our documentation "Applications might use this function,
for example, to determine whether their transaction committed or
aborted after the application and database server become disconnected
while a COMMIT is in progress.", but it's currently useless or
dangerous for that purpose.
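
To illustrate why an unflushed xid is a problem for that documented use, here is a minimal toy model in Python. It is only a sketch under stated assumptions: `ToyServer`, its methods, the starting xid of 100, and the `log_and_flush` switch are all hypothetical names invented for illustration and are not PostgreSQL code or APIs. The model assumes that after a crash the next xid is derived from whatever reached the flushed WAL, so an xid handed to a client but never logged can be given out again.

```python
class ToyServer:
    """Toy model of xid assignment and crash recovery (not PostgreSQL code)."""

    def __init__(self, checkpoint_next_xid=100):
        self.checkpoint_next_xid = checkpoint_next_xid
        self.next_xid = checkpoint_next_xid  # in-memory counter, lost on crash
        self.wal = []                        # list of (kind, xid) records
        self.flushed = 0                     # number of records that are durable

    def assign_xid(self, log_and_flush=False):
        """Hand out an xid; optionally make the assignment durable first."""
        xid = self.next_xid
        self.next_xid += 1
        if log_and_flush:
            self.wal.append(("assign", xid))
            self.flushed = len(self.wal)
        return xid

    def commit(self, xid):
        """Commit records are always flushed in this model."""
        self.wal.append(("commit", xid))
        self.flushed = len(self.wal)

    def crash_and_recover(self):
        """Drop unflushed WAL and rebuild the xid counter from what survived."""
        self.wal = self.wal[:self.flushed]
        seen = [xid for _, xid in self.wal]
        self.next_xid = max(seen + [self.checkpoint_next_xid - 1]) + 1

    def status(self, xid):
        """Simplified status lookup for an xid."""
        if ("commit", xid) in self.wal:
            return "committed"
        if xid < self.next_xid:
            return "aborted"
        return None  # xid is not known to the server


def run(log_and_flush):
    server = ToyServer()
    xid_a = server.assign_xid(log_and_flush)  # client A remembers this xid
    server.crash_and_recover()                # A's transaction never committed
    xid_b = server.assign_xid()               # unrelated transaction B
    server.commit(xid_b)
    return xid_a, xid_b, server.status(xid_a)


print(run(log_and_flush=False))  # (100, 100, 'committed'): A gets a wrong answer
print(run(log_and_flush=True))   # (100, 101, 'aborted'): the xid is not reused
```

In this model the flush is what keeps the xid from being reused, which is also why doing it unconditionally costs an extra WAL write and flush for every caller, as discussed above.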



Re: 011_crash_recovery.pl intermittently fails

From
Michael Paquier
Date:
On Wed, Jan 25, 2023 at 12:40:02PM +1300, Thomas Munro wrote:
> I remembered this thread after seeing the failure of Michael's new
> build farm animal "tanager".  I think we need to solve this somehow...

Well, this host has a problem, for what looks like a kernel issue, I
guess..  This is repeatable across all the branches, randomly, with
various errors with the POSIX DSM implementation:
# [63cf68b7.5e5a:1] ERROR:  could not open shared memory segment "/PostgreSQL.543738922": No such file or directory
# [63cf68b7.5e58:6] ERROR:  could not open shared memory segment "/PostgreSQL.543738922": No such file or directory

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tanager&dt=2023-01-24%2004%3A23%3A53
dynamic_shared_memory_type should be using posix in this case.
Switching to mmap may help, perhaps..  I need to test it.

Anyway, sorry for the noise on this one.
--
Michael

Re: 011_crash_recovery.pl intermittently fails

From
Thomas Munro
Date:
On Wed, Jan 25, 2023 at 1:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> Well, this host has a problem, for what looks like a kernel issue, I
> guess..  This is repeatable across all the branches, randomly, with
> various errors with the POSIX DSM implementation:
> # [63cf68b7.5e5a:1] ERROR:  could not open shared memory segment "/PostgreSQL.543738922": No such file or directory
> # [63cf68b7.5e58:6] ERROR:  could not open shared memory segment "/PostgreSQL.543738922": No such file or directory

Something to do with
https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC
?

The failure I saw looked like a straight up case of the bug reported
in this thread to me.



Re: 011_crash_recovery.pl intermittently fails

From
Michael Paquier
Date:
On Wed, Jan 25, 2023 at 01:20:39PM +1300, Thomas Munro wrote:
> Something to do with
> https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC
> ?

Still this is unrelated?  This is a buildfarm instance, so the backend
does not run with systemd.

> The failure I saw looked like a straight up case of the bug reported
> in this thread to me.

Okay...
--
Michael

Re: 011_crash_recovery.pl intermittently fails

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Wed, Jan 25, 2023 at 01:20:39PM +1300, Thomas Munro wrote:
>> Something to do with
>> https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC
>> ?

> Still this is unrelated?  This is a buildfarm instance, so the backend
> does not run with systemd.

That systemd behavior affects IPC resources regardless of what process
created them.

            regards, tom lane



Re: 011_crash_recovery.pl intermittently fails

From
Michael Paquier
Date:
On Tue, Jan 24, 2023 at 07:42:06PM -0500, Tom Lane wrote:
> That systemd behavior affects IPC resources regardless of what process
> created them.

Thanks, my memory was fuzzy regarding that.  I am curious if the error
in the recovery tests will persist with that set up.  The next run
will be in a few hours, so let's see..
--
Michael

Re: 011_crash_recovery.pl intermittently fails

From
Michael Paquier
Date:
On Wed, Jan 25, 2023 at 10:32:10AM +0900, Michael Paquier wrote:
> Thanks, my memory was fuzzy regarding that.  I am curious if the error
> in the recovery tests will persist with that set up.  The next run
> will be in a few hours, so let's see..

So it looks like tanager is able to reproduce the failure of this
thread much more frequently than the other animals:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tanager&dt=2023-01-25%2003%3A05%3A05

That's interesting.  FWIW, this environment is just a Raspberry Pi 4
with 8GB of memory with clang.
--
Michael
