Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Дата
Msg-id 5c7eaa87-8670-2008-3a15-337c133c12c2@gmail.com
обсуждение исходный текст
Ответ на Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
19.12.2019 18:44, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> Please look at the patch that implements (2). It makes `vcregress
>> recoverycheck` pass (and my demo restart test still passes too).
> I think we could be a little smarter here.  If the problem is
> STATUS_DELETE_PENDING, doesn't that affect stat() as well?  That is,
> if stat() succeeds, we needn't wait, independently of whether it
> says S_ISDIR or not?  This seems like a noticeable improvement if
> true, because it would mean that ordinary file-permission failures
> also need not wait.  We'd still get confused if we get a permission
> failure on a containing directory rather than the file itself, but
> that I'm prepared to dismiss as an uncommon case.
>
> Entirely-untested patch along this line attached.
I have tested it with my demo restart test and it works. Yes, stat() is
resulted in the CreateFile() call too, so it should fail the same way. 
I haven't managed to catch both CreateFile() and stat() failing with
STATUS_DELETE_PENDING (CreateFile() failed for me 37 times per 1000
loops), but I believe that sleep() will work correctly in this case.
`vcregress recoverytest` is passed too.
The DELETE_PENDING conflict now looks like this:
1 "8:50:44.9928200
PM","postgres.exe","2792","CreateFile","...\pgdata\postmaster.pid","SUCCESS","Desired
Access: Read Attributes, Delete, ..."
2 "8:50:44.9928951
PM","postgres.exe","2792","QueryAttributeTagFile","...\pgdata\postmaster.pid","SUCCESS",...
3 "8:50:44.9929114
PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","DELETE
PENDING","Desired Access: Generic Read, Disposition: Open, ..."
4 "8:50:44.9929162
PM","postgres.exe","2792","SetDispositionInformationFile","...\pgdata\postmaster.pid","SUCCESS","Delete:
True"
5 "8:50:44.9929377
PM","postgres.exe","2792","CloseFile","...\pgdata\postmaster.pid","SUCCESS",""
6 "8:50:44.9932617
PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Read Attributes, ..."
7 "8:50:45.1042027
PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Generic Read, ..."
8 "8:50:45.9926445
PM","pg_ctl.exe","2396","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Generic Read ..."
Here 1, 2, 4, 5 are calls behind unlink() in postmaster, 3 is
CreateFile() in pgwin32_open(), 6 is stat() in pgwin32_safestat(), then
we have sleep for 0.1 sec, 7 is CreateFile() in pgwin32_open() again,
and 8 is CreateFile() performed by a new pg_ctl instance.

Maybe we should change the condition to 'if (stat(fileName, &st) != 0 &&
(err = GetLastError()) == ERROR_ACCESS_DENIED)' to avoid unnecessary
sleep with a loop iteration...
With this change the same conflict looks like this:
1 "10:23:12.9940793
PM","postgres.exe","1908","CreateFile","...\pgdata\postmaster.pid","SUCCESS","Desired
Access: Read Attributes, ..."
2 "10:23:12.9941500
PM","postgres.exe","1908","QueryAttributeTagFile","...\pgdata\postmaster.pid","SUCCESS",
...
3 "10:23:12.9941713
PM","postgres.exe","1908","SetDispositionInformationFile","...\pgdata\postmaster.pid","SUCCESS","Delete:
True"
4 "10:23:12.9941766
PM","pg_ctl.exe","2420","CreateFile","...\pgdata\postmaster.pid","DELETE
PENDING","Desired Access: Generic Read ..."
5 "10:23:12.9941936
PM","postgres.exe","1908","CloseFile","...\pgdata\postmaster.pid","SUCCESS",""
6 "10:23:12.9945254
PM","pg_ctl.exe","2420","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Read Attributes, ..."
7 "10:23:13.9940261
PM","pg_ctl.exe","1676","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Generic Read, ..."
Here 1, 2, 3, 5 are calls behind unlink() in postmaster, 4 is
CreateFile() in pgwin32_open(), 6 is stat() in pgwin32_safestat(), 7 is
CreateFile() performed by a new pg_ctl instance.
`vcregress recoverytest` is passed.
The log of access to the logical decoding slot now looks like this:
1 "10:37:22.9496662
PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY","Desired Access: Generic Read, ..."
2 "10:37:22.9498061
PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","Desired
Access: Read Attributes, ..."
3 "10:37:22.9498478

PM","postgres.exe","2956","QueryInformationVolume","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","VolumeCreationTime:
... "
4 "10:37:22.9498634
PM","postgres.exe","2956","QueryAllInformationFile","...\pgdata\pg_replslot\otherdb_slot","BUFFER
OVERFLOW","CreationTime: ..."
5 "10:37:22.9498860
PM","postgres.exe","2956","CloseFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS",""
6 "10:37:22.9499936
PM","postgres.exe","2956","QueryOpen","...\pgdata\pg_replslot\otherdb_slot","FAST
IO DISALLOWED",""
7 "10:37:22.9500733
PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","Desired
Access: ..."
8 "10:37:22.9500973

PM","postgres.exe","2956","QueryNetworkOpenInformationFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","CreationTime:..."
9 "10:37:22.9501108
PM","postgres.exe","2956","CloseFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS",""

> BTW ... it's likely that stat() here is actually going to invoke
> pgwin32_safestat(), which has its own opinions about this, and
> indeed seems to think we'll get ERROR_DELETE_PENDING.  I think
> that's harmless here, but it makes me wonder if we should use
> a native Windows API instead of going through stat().
Yes, I would call native stat(), as we don't need to perform an
excessive call to get file size (for directories).
It seems that the check for ERROR_DELETE_PENDING was added to
pgwin32_safestat() blindly, the issue wasn't reproduced at that time:
https://www.postgresql.org/message-id/CAB7nPqRJV6trFta-Qzgi6j2feuYR2ZC%2BKHvWdHnbpDG2scTrxw%40mail.gmail.com

>> As open(dir) is getting a little more expensive than before, maybe it's
>> still worthwhile to patch fsync*(..., isdir,...). I can prepare such a
>> patch if so.
> I think we should leave that for later, so that the buildfarm can
> actually test whatever logic we put in here.
Agreed. It doesn't directly affect this bug and should be done separately.

Best regards,
Alexander



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Следующее
От: Tom Lane
Дата:
Сообщение: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)