Thread: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.


Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

From
Yugo Nagata
Date:
Hi,

Recently, one of our clients reported that Windows 10 sometimes
(approximately once in 300 tries) hangs during OS startup while the PostgreSQL
9.3.x service is starting. My co-worker analyzed this and found that
a PostgreSQL auxiliary process and the Windows logon processes are in a
deadlock situation.

Although this problem has so far been found only with PostgreSQL 9.3.x and
Windows 10 in our client's environment, the same problem may occur with other
versions of PostgreSQL.

He reported this problem to the pgsql-general list as shown below. He also
created a patch that adds a build-time option to insert a 0.5 or 3.0 second
delay after each subprocess starts.  The same patch is attached.  Our client
confirmed that this patch resolves the deadlock problem. Would it be acceptable
to add this option to PostgreSQL?  Any comments would be appreciated.

Regards,




Begin forwarded message:

Date: Fri, 29 Jun 2018 15:03:10 +0900
From: TAKATSUKA Haruka <harukat@sraoss.co.jp>
To: pgsql-general@postgresql.org
Subject: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.


I ran into trouble with PostgreSQL 9.3.x on Windows 10.
I would like to add new delay code as an official build option.

Windows 10 sometimes (approximately once in 300 tries) hung
during OS startup. The logs say it happened while the PostgreSQL
service was starting. When the OS stopped, some postgres auxiliary
processes had been started and some had not been started yet.

The Windows dump says some threads of the postgres auxiliary processes
are waiting on OS-level locks, and the logon processes' threads are
also waiting on a lock. The MS help desk said that PostgreSQL's OS-level
deadlock caused the OS freeze. I think that is a strange story. But,
in fact, the hang did not happen in repeated tests once I removed
PostgreSQL from the services that start automatically at boot.

I tweaked PostgreSQL 9.3.x (the newest from the repository) to add
a 0.5 or 3.0 second delay after each subprocess starts.
After that, the hang was gone. This test patch is attached.
It is only implemented for Windows. Also, I did not use the existing
pg_usleep because it contains locking code (e.g. WaitForSingleObject
and Enter/LeaveCriticalSection).

Although the Windows OS may have some problems, I think we should have
a means to avoid them. Could PostgreSQL accept such delay code
as a build-time option controlled by preprocessor variables?


Thanks,
Takatsuka Haruka
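
For readers without the attachment, here is a rough sketch of what a build-time delay option of this kind might look like. It is not the attached patch: the macro name CHILD_START_DELAY_MS and both function names are hypothetical, and it uses POSIX nanosleep only so that the sketch compiles off Windows (the patch described above is Windows-only and deliberately avoids pg_usleep).

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/*
 * Hypothetical build-time switch (not a real PostgreSQL option).
 * Compile with -DCHILD_START_DELAY_MS=500 for a 0.5 s pause, or
 * -DCHILD_START_DELAY_MS=3000 for 3.0 s. The default of 0 disables it.
 */
#ifndef CHILD_START_DELAY_MS
#define CHILD_START_DELAY_MS 0
#endif

/* Pause after a child process has been launched; a no-op when disabled. */
void
delay_after_child_start(void)
{
#if CHILD_START_DELAY_MS > 0
    struct timespec ts;

    ts.tv_sec = CHILD_START_DELAY_MS / 1000;
    ts.tv_nsec = (CHILD_START_DELAY_MS % 1000) * 1000000L;
    nanosleep(&ts, NULL);
#endif
}

/* Total time added to startup when n_children are launched one by one. */
long
added_startup_delay_ms(int n_children)
{
    return (long) n_children * CHILD_START_DELAY_MS;
}
```

The second function only makes the cost visible: the added time grows linearly with the number of child processes started.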


-- 
Yugo Nagata <nagata@sraoss.co.jp>

Attachments

Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

From
Michael Paquier
Date:
On Fri, Jul 20, 2018 at 05:58:13PM +0900, Yugo Nagata wrote:
> He reported this problem to pgsql-general list as below. Also, he created a patch
> to add a build-time option for adding 0.5 or 3.0 seconds delay after each sub
> process starts.  The attached is the same one.  Our client confirmed that this
> patch resolves the dead-lock problem. Is it acceptable to add this option to
> PostgreSQL?  Any comment would be appreciated.

If the OS startup gets slower, then an arbitrary delay is not going to
solve things and you would finish by facing the same problem.  It seems
to me that we need to understand what are the low-level locks which get
stuck, if it is possible to make their acquirement conditional, and then
loop conditionally with multiple retries.
--
Michael
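
The "conditional acquisition with multiple retries" idea can be sketched as below. This is only an illustration under loud assumptions: the thread has not identified which lock is involved, so a POSIX mutex stands in for it, and acquire_with_retries and its parameters are hypothetical names, not PostgreSQL or Windows APIs.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/*
 * Try to take the lock without blocking. On failure, back off briefly and
 * try again, up to max_attempts times. Returns true if the lock was taken,
 * in which case the caller is responsible for unlocking it.
 */
bool
acquire_with_retries(pthread_mutex_t *lock, int max_attempts, long backoff_ms)
{
    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (pthread_mutex_trylock(lock) == 0)
            return true;

        /* Lock is busy: wait a little before the next attempt, if any. */
        if (attempt + 1 < max_attempts)
        {
            struct timespec ts;

            ts.tv_sec = backoff_ms / 1000;
            ts.tv_nsec = (backoff_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }

    /* Still contended: the caller decides whether to give up or escalate. */
    return false;
}
```

Unlike a fixed delay, this never blocks indefinitely on the lock, so the caller can give up or report an error instead of taking part in a deadlock.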

Attachments

Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

From
Tom Lane
Date:
Yugo Nagata <nagata@sraoss.co.jp> writes:
> Recently, one of our clients reported a problem that Windows 10 sometime
> (approximately once in 300 tries) hung up at OS starting up while PostgreSQL
> 9.3.x service is starting up. My co-worker analyzed this and found that
> PostgreSQL's auxiliary process and Windows' logon processes are in a dead-lock
> situation.

Really?  What would they deadlock on?  Why is there any connection
whatsoever?  Why has nobody else run into this?

> He reported this problem to pgsql-general list as below. Also, he created a patch
> to add a build-time option for adding 0.5 or 3.0 seconds delay after each sub
> process starts.

This seems like an ugly hack that probably doesn't reliably resolve
whatever the problem is, but does manage to kill postmaster
responsiveness :-(.  It'd be especially awful to insert such a delay
after forking parallel worker processes, which would be a problem in
anything much newer than 9.3.

I think we need more investigation; and to start with, reproducing
the problem in a branch that's not within hailing distance of its EOL
would be a good idea.  (Not that I have reason to think PG's behavior
has changed much here ... but 9.3 is just not a good basis for asking
us to do anything now.)

            regards, tom lane


Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

From
Yugo Nagata
Date:
On Fri, 20 Jul 2018 19:13:21 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Jul 20, 2018 at 05:58:13PM +0900, Yugo Nagata wrote:
> > He reported this problem to pgsql-general list as below. Also, he created a patch
> > to add a build-time option for adding 0.5 or 3.0 seconds delay after each sub 
> > process starts.  The attached is the same one.  Our client confirmed that this 
> > patch resolves the dead-lock problem. Is it acceptable to add this option to 
> > PostgreSQL?  Any comment would be appreciated.
> 
> If the OS startup gets slower, then an arbitrary delay is not going to
> solve things and you would finish by facing the same problem.  It seems
> to me that we need to understand what are the low-level locks which get
> stuck, if it is possible to make their acquirement conditional, and then
> loop conditionally with multiple retries.

From investigation of the Windows kernel dump, it seems that PushLocks
were acquired in postgres processes and that this caused the deadlock.
However, it is still not clear which part of the postgres code is involved
in this lock. We will investigate this further and report if we find something.

> --
> Michael


-- 
Yugo Nagata <nagata@sraoss.co.jp>


Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

From
Yugo Nagata
Date:
On Fri, 20 Jul 2018 10:48:15 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Yugo Nagata <nagata@sraoss.co.jp> writes:
> > Recently, one of our clients reported a problem that Windows 10 sometime 
> > (approximately once in 300 tries) hung up at OS starting up while PostgreSQL
> > 9.3.x service is starting up. My co-worker analyzed this and found that
> > PostgreSQL's auxiliary process and Windows' logon processes are in a dead-lock
> > situation.
> 
> Really?  What would they deadlock on?  Why is there any connection
> whatsoever?  Why has nobody else run into this?

It is not clear where the hang occurred, but this might be a problem
only on a specific version of Windows. Our client reported that
the hang occurred with Windows 10 IoT Enterprise 2015 LTSB, but not
with Windows 10 IoT Enterprise 2016 LTSB or Windows 7.

> 
> > He reported this problem to pgsql-general list as below. Also, he created a patch
> > to add a build-time option for adding 0.5 or 3.0 seconds delay after each sub 
> > process starts.
> 
> This seems like an ugly hack that probably doesn't reliably resolve
> whatever the problem is, but does manage to kill postmaster
> responsiveness :-(.  It'd be especially awful to insert such a delay
> after forking parallel worker processes, which would be a problem in
> anything much newer than 9.3.

Agreed.

> I think we need more investigation; and to start with, reproducing
> the problem in a branch that's not within hailing distance of its EOL
> would be a good idea.  (Not that I have reason to think PG's behavior
> has changed much here ... but 9.3 is just not a good basis for asking
> us to do anything now.)

They also reported that this problem occurred with Windows 10 IoT Enterprise
2015 LTSB + PostgreSQL 10.3 as well as PostgreSQL 9.3.22. However,
reproducing this would be hard because we don't have a Windows 10 IoT
environment and also the frequency is approximately once in 300 retries
of OS startup.

We will investigate this further and report if we find something.

Regards,


-- 
Yugo Nagata <nagata@sraoss.co.jp>