Thread: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
From
Yugo Nagata
Date:
Hi,

Recently, one of our clients reported a problem where Windows 10 sometimes
(approximately once in 300 tries) hung at OS startup while the PostgreSQL
9.3.x service was starting up. My co-worker analyzed this and found that
PostgreSQL's auxiliary process and Windows' logon processes were in a deadlock
situation.

Although this problem has been found only with PostgreSQL 9.3.x and Windows 10
in our client's environment so far, the same problem may occur with other
versions of PostgreSQL.

He reported this problem to the pgsql-general list as below. He also created a patch
that adds a build-time option for inserting a 0.5 or 3.0 second delay after each
subprocess starts. The attached patch is the same one. Our client confirmed that this
patch resolves the deadlock problem. Is it acceptable to add this option to
PostgreSQL? Any comments would be appreciated.

Regards,

Begin forwarded message:

Date: Fri, 29 Jun 2018 15:03:10 +0900
From: TAKATSUKA Haruka <harukat@sraoss.co.jp>
To: pgsql-general@postgresql.org
Subject: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

I ran into trouble with PostgreSQL 9.3.x on Windows 10. I would like to add
new delay code as an official build option.

Windows 10 sometimes (approximately once in 300 tries) hung at OS startup.
The logs say it happened while the PostgreSQL service was starting. When the
OS stopped, some postgres auxiliary processes had been started and some had
not been started yet.

The Windows dump says that some threads of the postgres auxiliary process are
waiting on OS-level locks and that the logon processes' threads are also
waiting on a lock.

The MS help desk said that PostgreSQL's OS-level deadlock caused the OS
freeze. I think that is a strange story. But, in fact, the hang did not happen
in repeated tests once I removed PostgreSQL from the services that start
automatically at boot.

I tweaked PostgreSQL 9.3.x (the newest from the repository) to add a 0.5 or
3.0 second delay after each subprocess starts. After that, the hang was gone.
This test patch is attached. It is implemented only for Windows. Also, I did
not use the existing pg_usleep because it contains locking code (e.g.
WaitForSingleObject and Enter/LeaveCriticalSection).

Although the Windows OS may have some problems, I think we should have a means
to avoid it. Could PostgreSQL accept such delay code as a build-time option
controlled by preprocessor variables?

Thanks,
Takatsuka Haruka

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
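For readers without access to the attachment, the kind of build-time delay being proposed might look roughly like the sketch below. This is not the attached patch, whose contents are not reproduced in this thread. The macro DEMO_SUBPROCESS_START_DELAY_MS and every demo_* function are hypothetical names. The sketch assumes that the intent is to wait without calling a sleep-style wait primitive, so it polls the clock in a loop. It is written in portable C11 rather than against the Windows API.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical build-time switch: compile with
 * -DDEMO_SUBPROCESS_START_DELAY_MS=500 (or 3000) to enable the delay.
 * The default of 0 compiles the delay out entirely.
 */
#ifndef DEMO_SUBPROCESS_START_DELAY_MS
#define DEMO_SUBPROCESS_START_DELAY_MS 0
#endif

/* Current wall-clock time in whole milliseconds (C11 timespec_get). */
static long long
demo_now_ms(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Busy-wait until at least "ms" milliseconds have passed. This burns CPU,
 * but the loop itself never blocks on a wait object.
 */
static void
demo_busy_delay_ms(long long ms)
{
    long long deadline;

    assert(ms >= 0);
    deadline = demo_now_ms() + ms;
    while (demo_now_ms() < deadline)
        ;                       /* spin */
}

/* Hypothetical hook to call right after a subprocess has been started. */
static void
demo_after_subprocess_start(void)
{
#if DEMO_SUBPROCESS_START_DELAY_MS > 0
    demo_busy_delay_ms(DEMO_SUBPROCESS_START_DELAY_MS);
#endif
}
```

With the macro left at its default the hook is a no-op, so a normal build pays nothing. A build with the option enabled stalls the caller for the full delay after every subprocess start.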
Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
From
Michael Paquier
Date:
On Fri, Jul 20, 2018 at 05:58:13PM +0900, Yugo Nagata wrote:
> He reported this problem to the pgsql-general list as below. He also created a patch
> that adds a build-time option for inserting a 0.5 or 3.0 second delay after each
> subprocess starts. The attached patch is the same one. Our client confirmed that this
> patch resolves the deadlock problem. Is it acceptable to add this option to
> PostgreSQL? Any comments would be appreciated.

If the OS startup gets slower, then an arbitrary delay is not going to
solve things and you would end up facing the same problem. It seems
to me that we need to understand which low-level locks get stuck,
whether it is possible to make their acquisition conditional, and then
loop conditionally with multiple retries.
--
Michael
Attachments
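The alternative suggested in the reply above, conditional acquisition with a bounded number of retries, can be sketched as follows. This is only an illustration under stated assumptions. The lock type and all demo_* names are hypothetical. A C11 atomic_flag stands in for the OS-level lock, which the thread has not yet identified. The pause between attempts is left to a caller-supplied callback so that the sketch does not commit to any particular wait primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the real lock; not a PostgreSQL or Windows type. */
typedef struct
{
    atomic_flag held;
} demo_lock;

/* Conditional acquire: returns at once, true only if we took the lock. */
static bool
demo_try_acquire(demo_lock *lock)
{
    return !atomic_flag_test_and_set(&lock->held);
}

static void
demo_release(demo_lock *lock)
{
    atomic_flag_clear(&lock->held);
}

/*
 * Try to take the lock up to max_attempts times. After each failed attempt
 * except the last, call backoff(attempt) if one was supplied. Returns false
 * instead of blocking forever when the lock never becomes free.
 */
static bool
demo_acquire_with_retries(demo_lock *lock, int max_attempts,
                          void (*backoff) (int attempt))
{
    assert(max_attempts > 0);

    for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
        if (demo_try_acquire(lock))
            return true;
        if (backoff != NULL && attempt < max_attempts)
            backoff(attempt);
    }
    return false;
}
```

A caller that gets false back can log the failure and carry on or retry later, so it never sits inside an unconditional wait. On Windows, TryEnterCriticalSection offers the same non-blocking behavior for critical sections, but whether the locks seen in the dump can be taken conditionally at all is exactly the open question.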
Yugo Nagata <nagata@sraoss.co.jp> writes:
> Recently, one of our clients reported a problem where Windows 10 sometimes
> (approximately once in 300 tries) hung at OS startup while the PostgreSQL
> 9.3.x service was starting up. My co-worker analyzed this and found that
> PostgreSQL's auxiliary process and Windows' logon processes were in a deadlock
> situation.

Really? What would they deadlock on? Why is there any connection
whatsoever? Why has nobody else run into this?

> He reported this problem to the pgsql-general list as below. He also created a patch
> that adds a build-time option for inserting a 0.5 or 3.0 second delay after each
> subprocess starts.

This seems like an ugly hack that probably doesn't reliably resolve
whatever the problem is, but does manage to kill postmaster
responsiveness :-(. It'd be especially awful to insert such a delay
after forking parallel worker processes, which would be a problem in
anything much newer than 9.3.

I think we need more investigation; and to start with, reproducing
the problem in a branch that's not within hailing distance of its EOL
would be a good idea. (Not that I have reason to think PG's behavior
has changed much here ... but 9.3 is just not a good basis for asking
us to do anything now.)

regards, tom lane
On Fri, 20 Jul 2018 19:13:21 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Jul 20, 2018 at 05:58:13PM +0900, Yugo Nagata wrote:
> > He reported this problem to the pgsql-general list as below. He also created a patch
> > that adds a build-time option for inserting a 0.5 or 3.0 second delay after each
> > subprocess starts. The attached patch is the same one. Our client confirmed that this
> > patch resolves the deadlock problem. Is it acceptable to add this option to
> > PostgreSQL? Any comments would be appreciated.
>
> If the OS startup gets slower, then an arbitrary delay is not going to
> solve things and you would end up facing the same problem. It seems
> to me that we need to understand which low-level locks get stuck,
> whether it is possible to make their acquisition conditional, and then
> loop conditionally with multiple retries.

From our investigation of the Windows kernel dump, it seems that PushLocks
were acquired in postgres processes and that this caused the deadlock.
However, it is still not clear which part of the postgres code is involved in
this lock. We will investigate further and report if we find something.

> --
> Michael

--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 20 Jul 2018 10:48:15 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yugo Nagata <nagata@sraoss.co.jp> writes:
> > Recently, one of our clients reported a problem where Windows 10 sometimes
> > (approximately once in 300 tries) hung at OS startup while the PostgreSQL
> > 9.3.x service was starting up. My co-worker analyzed this and found that
> > PostgreSQL's auxiliary process and Windows' logon processes were in a deadlock
> > situation.
>
> Really? What would they deadlock on? Why is there any connection
> whatsoever? Why has nobody else run into this?

It is not clear where the hang occurred, but this might be a problem only with
a specific version of Windows. Our client reported that the hang occurred with
Windows 10 IoT Enterprise 2015 LTSB, but not with Windows 10 IoT Enterprise
2016 LTSB or Windows 7.

> > He reported this problem to the pgsql-general list as below. He also created a patch
> > that adds a build-time option for inserting a 0.5 or 3.0 second delay after each
> > subprocess starts.
>
> This seems like an ugly hack that probably doesn't reliably resolve
> whatever the problem is, but does manage to kill postmaster
> responsiveness :-(. It'd be especially awful to insert such a delay
> after forking parallel worker processes, which would be a problem in
> anything much newer than 9.3.

Agreed.

> I think we need more investigation; and to start with, reproducing
> the problem in a branch that's not within hailing distance of its EOL
> would be a good idea. (Not that I have reason to think PG's behavior
> has changed much here ... but 9.3 is just not a good basis for asking
> us to do anything now.)

They also reported that this problem occurred with Windows 10 IoT Enterprise
2015 LTSB + PostgreSQL 10.3 as well as PostgreSQL 9.3.22. However, reproducing
it would be hard because we don't have a Windows 10 IoT environment and the
hang occurs only about once in 300 OS startups.

We will investigate further and report if we find something.

Regards,

--
Yugo Nagata <nagata@sraoss.co.jp>