Re: Intermittent pg_ctl failures on Windows
От | r.zharkov@postgrespro.ru |
---|---|
Тема | Re: Intermittent pg_ctl failures on Windows |
Дата | |
Msg-id | 0057a0ce5a8893e4713a9a9d7e352162@postgrespro.ru обсуждение исходный текст |
Ответ на | RE: Intermittent pg_ctl failures on Windows (Badrul Chowdhury <bachow@microsoft.com>) |
Ответы |
Re: Intermittent pg_ctl failures on Windows
(Tom Lane <tgl@sss.pgh.pa.us>)
|
Список | pgsql-hackers |
Hello! We reproduced these errors on out buildfarm and my windows workstation. We used small TAP test that restarts PostgresNode in loop. Additionally, constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000 times. There are two different problems with pg_ctl: 1 - share access to postmaster.pid; 2 - share access to logfile. Postmaster runs in the shell (CMD.EXE). When the server stops cmd.exe may blocks logfile from been opened by new cmd.exe. > waiting for server to shut down.... done > server stopped > waiting for server to start....The process cannot access the file > because it is being used by another process. > stopped waiting > pg_ctl: could not start server > Examine the log output. To avoid the first error we made two changes: pg_ctl now opens the postmaster.pid file using pgwin32_open() function to correctly handle share locks. On Windows systems we cannot handle ERROR_DELETE_PENDING because GetLastError() returns ERROR_ACCESS_DENIED instead. So we rename the lock files before delete them. To avoid the second error we added the wait for cmd.exe finish in the do_stop() and do_restart() functions. Git patch is in the attachment. On 2018-03-12 18:55, Badrul Chowdhury wrote: > Hi Tom, > > This is a great catch. I am looking into it: I will start by > reproducing the error as you suggested. > > Thanks, > Badrul > > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Saturday, March 10, 2018 2:48 PM > To: pgsql-hackers@lists.postgresql.org > Subject: Intermittent pg_ctl failures on Windows > > The buildfarm's Windows members occasionally show weird pg_ctl > failures, for instance this recent case: > > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0 > > ### Restarting node "master" > # Running: pg_ctl -D > G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata > -l > G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log > restart waiting for server to shut down.... done server stopped > waiting for server to start....The process cannot access the file > because it is being used by another process. > stopped waiting > pg_ctl: could not start server > Examine the log output. > Bail out! system pg_ctl failed > > or this one: > > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0 > > ### Stopping node "subscriber" using mode fast # Running: pg_ctl -D > c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata > -m fast stop waiting for server to shut down....pg_ctl: could not open > PID file > "c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid": > Permission denied Bail out! system pg_ctl failed > > I'd been writing these off as Microsoft randomness and/or antivirus > interference, but it suddenly occurred to me that there might be a > consistent explanation: since commit f13ea95f9, when pg_ctl is waiting > for server start/stop, it is trying to read postmaster.pid > more-or-less concurrently with the postmaster writing to that file. > On Unix that's not much of a problem, but I believe that on Windows > you have to specifically open the file with sharing enabled, or you > get error messages like these. > The postmaster should be enabling sharing, because port.h redirects > open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing > flags. > But it only does that #ifndef FRONTEND. So pg_ctl is just using naked > open(), which could explain these failures. > > If this theory is accurate, it should be pretty easy to replicate the > problem if you modify the postmaster to hold postmaster.pid open > longer when rewriting it, e.g. stick fractional-second sleeps into > CreateLockFile and AddToDataDirLockFile. > > I'm not in a position to investigate this in detail nor test a fix, > but I think somebody should. > > regards, tom lane -- regars, Roman Zharkov
Вложения
В списке pgsql-hackers по дате отправления: