Re: BUG #16750: VACUUM sporadically fails to access pgstat on Cygwin

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: BUG #16750: VACUUM sporadically fails to access pgstat on Cygwin
Дата
Msg-id dc97c060-59be-c69f-ff57-ef0c1bf5db5e@gmail.com
обсуждение исходный текст
Список pgsql-bugs
29.11.2020 08:00, PG Bug reporting form wrote:
pg_regress/vacuum WARNING:  could not open statistics file
"/cygdrive/w/lorikeet/REL_12_STABLE/global.stat": Permission denied
pg_regress/create_index WARNING:  could not open statistics file
"/cygdrive/w/lorikeet/REL_13_STABLE/global.stat": Device or resource busy

The issue with the "Permission denied" error is already resolved for Windows
(see bug #16161), but cygwin still suffers from it.

As I have found out, it happens when multiple VACUUM processes access the pgstat statistics. There can be the autovacuum process or just several VACUUM's executed simultaneously, i.e. we have the following scenario:
The first process initiates a call to rename() in PgstatCollectorMain, via pgstat_vacuum_stat() -> backend_read_statsfile() -> pgstat_send_inquiry() | PgstatCollectorMain: pgstat_recv_inquiry() -> pending_write_requests -> pgstat_write_statsfiles() -> rename(tmpfile, statfile).
The second proccess calls AllocateFile()/fopen(), via backend_read_statsfile() -> pgstat_read_db_statsfile_timestamp().
Also there can be third process that calls backend_read_statsfile() and thus fopen(), too.

So one process (PgstatCollectorMain) tries to rename pg_stat_tmp/global.tmp to pg_stat_tmp/global.stat and another process (backend) tries to open it, and yet another one can have it open.

The important fact is that rename() in cygwin() is not atomic at all, and it could be performed in two ways [1]. First, cygwin tries to execute IRP_MJ_SET_INFORMATION for global.tmp with "Type: SetRenameInformationFile, ReplaceIfExists: True, FileName: global.stat". (See LogFile-SharingViolation1.CSV attached.)
If it fails with ACCESS_DENIED (because global.stat is open by another process, see line 51 of the log), it tries to replace file hard (and bad for us) way:
IRP_MJ_CREATE for global.stat with "Desired Access: Delete, Disposition: Open.." (see line 78) and then IRP_MJ_WRITE, IRP_MJ_SET_INFORMATION, IRP_MJ_CLOSE...

So the first possibility to fail for the opening backend is to get EBUSY error [1] (see line 89) . In this case we should just wait as we already do in pgwin32_open().

The second possibility is to get the EACCES error due to the delete pending state. But it's important to note that the global.stat file is not deleted at the end, it's replaced by global.tmp. So stat() just returns zero and in this case we shouldn't just return error (as in pgwin32_open).

While fixing bug #16161 (when we added similar handling for the "delete pending" state) we've also encountered an issue [2].
There was a delay for 1 sec added on each open(directory) call as Windows returns ERROR_ACCESS_DENIED for such calls.
Fortunately, cygwin doesn't do the same. It allows to open(directory) without an error, as the attached Logfile-006_logical_decoding.CSV shows.
So the checking for stat() result is not necessary to prevent that behavior.
There can be still unnecessary delay for 1 second in case of actual file permissions problems, but for now I see no code paths on cygwin where it can be significant.

Please look at the proposed patch, that makes regress/vacuum test (and an extra test with three vacuums in parallel) pass reliably.

Best regards,
Alexander

[1] https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/syscalls.cc;h=885ca375ac2b9afaa8fb85cb3f51cbffe9996a3d;hb=HEAD#l2728
[2] https://www.postgresql.org/message-id/b2b8cc4c-b7d7-0d1b-f6ee-a02bbef5bdd0%40gmail.com
[16161] https://www.postgresql.org/message-id/flat/16161-7a985d2f1bbe8f71%40postgresql.org
Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: PG Bug reporting form
Дата:
Сообщение: BUG #16750: VACUUM sporadically fails to access pgstat on Cygwin
Следующее
От: James Coleman
Дата:
Сообщение: Re: BUG #16741: Error: subplan "SubPlan 1" was not initialized