Re: Random pg_upgrade test failure on drongo

From Alexander Lakhin
Subject Re: Random pg_upgrade test failure on drongo
Date
Msg-id 5fd57074-dfe2-809b-9c57-630aff6f1468@gmail.com
In reply to RE: Random pg_upgrade test failure on drongo  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses Re: Random pg_upgrade test failure on drongo  (Andrew Dunstan <andrew@dunslane.net>)
RE: Random pg_upgrade test failure on drongo  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:
> Thanks for attaching a program. This helps us to understand the issue.
> I wanted to confirm your env - this failure was occurred on windows server XXXX, right?

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure depends not on the Windows edition, but
rather on its build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It would also be interesting to know the full OS version/build on drongo
and fairywren.

>> That is, my idea was to try removing a file through renaming it as a fast
>> path (thus avoiding that troublesome state DELETE PENDING), and if that
>> fails, to perform removal as before. May be the whole function might be
>> simplified, but I'm not sure about special cases yet.
> I felt that your result showed pgrename() would be more rarely delayed than unlink().
> That's why a file which has original name would not exist when subsequent open() was called.

I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.
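
To illustrate the rename-first idea quoted above, here is a minimal sketch.
It is not the PoC patch itself: remove_via_rename(), demo_unlink, the
".deleted" suffix and DEMO_MAXPATH are hypothetical names chosen for this
example, and it uses plain rename()/unlink() rather than pgrename()/pgunlink().
The file is first moved out of the way, so the original name is free at once
even if the subsequent delete is completed lazily; if the rename fails, it
falls back to removing the file as before.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <io.h>
#define demo_unlink _unlink
#else
#include <unistd.h>
#define demo_unlink unlink
#endif

#define DEMO_MAXPATH 1024		/* hypothetical limit for this sketch */

/*
 * Remove "path" by renaming it first (fast path) and then unlinking the
 * renamed file.  Returns 0 if the original name no longer exists, or -1
 * if both the rename and the fallback unlink failed.
 */
int
remove_via_rename(const char *path)
{
	char		tmppath[DEMO_MAXPATH];
	int			n;

	assert(path != NULL);

	n = snprintf(tmppath, sizeof(tmppath), "%s.deleted", path);
	if (n < 0 || (size_t) n >= sizeof(tmppath))
		return demo_unlink(path);	/* name too long: remove as before */

	if (rename(path, tmppath) == 0)
	{
		/*
		 * The original name is free now.  A failure to delete the renamed
		 * file is ignored here; a real implementation would have to
		 * decide how to report or retry it.
		 */
		(void) demo_unlink(tmppath);
		return 0;
	}

	/* Fast path failed: perform removal as before. */
	return demo_unlink(path);
}
```

Note that on Windows rename() fails if the target name already exists, so a
real implementation would need a unique temporary name; the fixed suffix
here only keeps the sketch short.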

>>> * IIUC, the important points is the latter part, which waits until the status is
>>>     changed. Based on that, can we remove a double rmtree() from
>> cleanup_output_dirs()?
>>>     They seems to be add for the similar motivation.
>> I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
>> was observed in [1]), with c28911750 reverted, so I need more time to
>> research that issue to answer this question.
> Yeah, as the first place, this failure seldom occurred....

I've managed to reproduce that issue (or at least a situation that
manifested similarly) with a sleep added in miscinit.c:
         ereport(IsPostmasterEnvironment ? LOG : NOTICE,
                         (errmsg("database system is shut down")));
+       pg_usleep(500000L);

With this change, I get the same warning as in [1] when running 10
instances of the 002_pg_upgrade test in parallel with a minimal olddump
(on iterations 33, 46, 8). And with my PoC patch applied, I could see the
same warning as well (on iteration 6).

I believe that's because rename() can't rename a directory containing an
open file, just as unlink() can't remove it.

In the light of the above, I think that the issue in question should be
fixed in accordance with/as a supplement to [2].

[1] https://www.postgresql.org/message-id/20230131172806.GM22427%40telsasoft.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com

Best regards,
Alexander


