Re: subscriptionCheck failures on nightjar

From Tom Lane
Subject Re: subscriptionCheck failures on nightjar
Date
Msg-id 2636.1569016167@sss.pgh.pa.us
In response to Re: subscriptionCheck failures on nightjar  (Andres Freund <andres@anarazel.de>)
Responses Re: subscriptionCheck failures on nightjar  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: subscriptionCheck failures on nightjar  (Andres Freund <andres@anarazel.de>)
Re: subscriptionCheck failures on nightjar  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
>> I recreated my freebsd-9-under-qemu setup and I can still reproduce
>> the problem, though not with high reliability (order of 1 time in 10).
>> Anything particular you want logged?

> A DEBUG2 log would help a fair bit, because it'd log some information
> about what changes the "horizons" determining when data may be removed.

Actually, what I did was as attached [1], and I am getting traces like
[2].  The problem seems to occur only when there are two or three
processes concurrently creating the same snapshot file.  It's not
obvious from the debug trace, but the snapshot file *does* exist
after the music stops.

It is very hard to look at this trace and conclude anything other
than "rename(2) is broken, it's not atomic".  Nothing in our code
has deleted the file: no checkpoint has started, nor do we see
the DEBUG1 output that CheckPointSnapBuild ought to produce.
But fsync_fname momentarily can't see it (and then later another
process does see it).
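
For readers who want the sequence spelled out, here is a minimal C
sketch of the write-then-rename-then-fsync pattern under discussion.
It is an illustration only, not the snapshot-serialization code itself:
publish_file() and fsync_path() are hypothetical names, and error
handling is reduced to returning -1.  If rename(2) is atomic, the
open() inside fsync_path() should never fail with ENOENT once the
rename has returned, even when several processes rename their own
temp files onto the same target.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open an existing file by name and fsync it.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int
fsync_path(const char *path)
{
    int         fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;              /* ENOENT here is the symptom in question */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Hypothetical helper: write "len" bytes to "tmp", flush them to disk,
 * rename "tmp" onto "final", then fsync "final" by name.
 * A short write is treated as a failure to keep the sketch small.
 */
int
publish_file(const char *tmp, const char *final,
             const void *data, size_t len)
{
    int         fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* The target name should exist continuously from here on. */
    if (rename(tmp, final) != 0)
        return -1;

    return fsync_path(final);
}
```

In a concurrent test each process would pass its own temp name (for
example one that embeds its PID) and the same final name; a failure
of the last step with ENOENT, with no unlink anywhere in the program,
would point at the filesystem rather than the caller.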

It is now apparent why we're only seeing this on specific ancient
platforms.  I looked around for info about rename(2) not being
atomic, and I found this info about FreeBSD:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=94849

The reported symptom there isn't quite the same, so probably there
is another issue, but there is plenty of reason to be suspicious
that UFS rename(2) is buggy in this release.  As for dromedary's
ancient version of macOS, Apple is exceedingly untransparent about
their bugs, but I found

http://www.weirdnet.nl/apple/rename.html

In short, what we have here are OS bugs that were probably
resolved years ago.

The question is what to do next.  Should we just retire these
specific buildfarm critters, or do we want to push ahead with
getting rid of the PANIC here?

            regards, tom lane


