Re: Intermittent pg_ctl failures on Windows

Поиск
Список
Период
Сортировка
От r.zharkov@postgrespro.ru
Тема Re: Intermittent pg_ctl failures on Windows
Дата
Msg-id 0057a0ce5a8893e4713a9a9d7e352162@postgrespro.ru
обсуждение исходный текст
Ответ на RE: Intermittent pg_ctl failures on Windows  (Badrul Chowdhury <bachow@microsoft.com>)
Ответы Re: Intermittent pg_ctl failures on Windows  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hello!

We reproduced these errors on out buildfarm and my windows workstation.
We used small TAP test that restarts PostgresNode in loop. Additionally,
constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000 
times.

There are two different problems with pg_ctl:
1 - share access to postmaster.pid;

2 - share access to logfile. Postmaster runs in the shell (CMD.EXE).
When the server stops cmd.exe may blocks logfile from been opened by new 
cmd.exe.
> waiting for server to shut down.... done
> server stopped
> waiting for server to start....The process cannot access the file 
> because it is being used by another process.
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.

To avoid the first error we made two changes:
pg_ctl now opens the postmaster.pid file using pgwin32_open() function 
to correctly handle share locks.
On Windows systems we cannot handle ERROR_DELETE_PENDING because 
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

To avoid the second error we added the wait for cmd.exe finish in the 
do_stop() and do_restart() functions.

Git patch is in the attachment.

On 2018-03-12 18:55, Badrul Chowdhury wrote:
> Hi Tom,
> 
> This is a great catch. I am looking into it: I will start by
> reproducing the error as you suggested.
> 
> Thanks,
> Badrul
> 
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Saturday, March 10, 2018 2:48 PM
> To: pgsql-hackers@lists.postgresql.org
> Subject: Intermittent pg_ctl failures on Windows
> 
> The buildfarm's Windows members occasionally show weird pg_ctl
> failures, for instance this recent case:
> 
>
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0
> 
> ### Restarting node "master"
> # Running: pg_ctl -D
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
> -l
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log
> restart waiting for server to shut down.... done server stopped
> waiting for server to start....The process cannot access the file
> because it is being used by another process.
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.
> Bail out!  system pg_ctl failed
> 
> or this one:
> 
>
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0
> 
> ### Stopping node "subscriber" using mode fast # Running: pg_ctl -D
> c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata
> -m fast stop waiting for server to shut down....pg_ctl: could not open
> PID file
>
"c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
> Permission denied Bail out!  system pg_ctl failed
> 
> I'd been writing these off as Microsoft randomness and/or antivirus
> interference, but it suddenly occurred to me that there might be a
> consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
> for server start/stop, it is trying to read postmaster.pid
> more-or-less concurrently with the postmaster writing to that file.
> On Unix that's not much of a problem, but I believe that on Windows
> you have to specifically open the file with sharing enabled, or you
> get error messages like these.
> The postmaster should be enabling sharing, because port.h redirects
> open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing
> flags.
> But it only does that #ifndef FRONTEND.  So pg_ctl is just using naked
> open(), which could explain these failures.
> 
> If this theory is accurate, it should be pretty easy to replicate the
> problem if you modify the postmaster to hold postmaster.pid open
> longer when rewriting it, e.g. stick fractional-second sleeps into
> CreateLockFile and AddToDataDirLockFile.
> 
> I'm not in a position to investigate this in detail nor test a fix,
> but I think somebody should.
> 
>             regards, tom lane

-- 

regars, Roman Zharkov
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: SegFault on 9.6.14
Следующее
От: David Rowley
Дата:
Сообщение: Re: Tid scan improvements