Thread: Cygwin PostgreSQL Regression Test Problems (Revisited)

Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
On Mon, Jan 15, 2001 at 11:37:55PM -0500, Jason Tishler wrote:
> 2. I am unable to successfully run the regression tests on a NT 4.0 SP5
> machine with only 64 MB of physical memory and about 175 MB of swap space.
> Other than lacking RAM and swap space, this machine is the "same" as other
> NT/2000 machines which can successfully run the regression tests.
>
> The tests usually hang during the "parallel group (18 tests)" test
> right after numerology.  By "hang," I mean that the original postmaster
> is still running, but there are no postmaster children, and there are
> some number of psql processes hanging around.  Using NT's TaskManager,
> I can see that the machine is running out of memory.  I have even seen
> the "Windows is running low on virtual memory" dialog a few times.
> Should I expect this behavior from such a lame machine?

I previously reported the above problem with the parallel version of
the regression test (i.e., make check) on a machine with limited memory.
Unfortunately, I am seeing similar problems on a machine with 192 MB of
physical memory and about 208 MB of swap space.  So, now I feel that my
initial conclusion that limited memory was the root cause is erroneous.

My current WAG is that there is a race condition in Cygwin that is
causing some back-end postgres processes to abort.  This in turn
causes the associated front-end psql processes to hang which in turn
causes the regression test to hang.

What is the best way to "catch" this problem?  What is the best set of
options to pass to postmaster that will be in turn passed to the back-end
postgres processes to hopefully shed some light on this situation?  Can I
get the individual back-end postgres processes to log to separate files?

There is so much going on during a parallel regression test that it's
hard to figure out what is really happening.  Any help would be greatly
appreciated.

Thanks,
Jason

--
Jason Tishler
Director, Software Engineering       Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp.               Fax:   +1 (732) 264-8798
82 Bethany Road, Suite 7             Email: Jason.Tishler@dothill.com
Hazlet, NJ 07730 USA                 WWW:   http://www.dothill.com

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Tom Lane
Jason Tishler <Jason.Tishler@dothill.com> writes:
> I previously reported the above problem with the parallel version of
> the regression test (i.e., make check) on a machine with limited memory.
> Unfortunately, I am seeing similar problems on a machine with 192 MB of
> physical memory and about 208 MB of swap space.  So, now I feel that my
> initial conclusion that limited memory was the root cause is erroneous.

Not necessarily.  18 parallel tests imply 54 concurrent processes
(a shell, a psql, and a backend per test).  Depending on whether Windoze
is any good about sharing sharable pages across processes, it's not hard
at all to believe that each process might chew up a few meg of memory
and/or swap.  You don't have a whole lot of headroom there if so.

Try modifying the parallel_schedule file to break the largest set of
concurrent tests down into two sets of nine tests.

Considering that we've seen people run into maxuprc limits on some Unix
versions, I wonder whether we ought to just do that across-the-board.

> What is the best way to "catch" this problem?  What are the best set of
> options to pass to postmaster that will be in turn passed to the back-end
> postgres processes to hopefully shed some light on this situation?

I'd use -d1 which should be enough to see backends starting and exiting.
Any more will clutter the log with individual queries, which probably
would be more detail than you really want...

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Wed, Mar 28, 2001 at 01:57:33PM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > I previously reported the above problem with the parallel version of
> > the regression test (i.e., make check) on a machine with limited memory.
> > Unfortunately, I am seeing similar problems on a machine with 192 MB of
> > physical memory and about 208 MB of swap space.  So, now I feel that my
> > initial conclusion that limited memory was the root cause is erroneous.
>
> Not necessarily.  18 parallel tests imply 54 concurrent processes
> (a shell, a psql, and a backend per test).  Depending on whether Windoze
> is any good about sharing sharable pages across processes, it's not hard
> at all to believe that each process might chew up a few meg of memory
> and/or swap.  You don't have a whole lot of headroom there if so.

I just increased the swap space (i.e., pagefile.sys) to 384 MB and I
still get hangs.  Watching memory usage via the NT Task Manager, Windows
tells me that the memory usage during the regression test is <= 80 MB
which is significantly less than my physical memory.

I wonder if I'm bucking up against some Cygwin limitations.  On the
cygwin-developers list, there was a recent discussion that indicated
that a Cygwin process can only have a max of 64 children.  Maybe there
is a limit like that which is causing backends to abort?

> Try modifying the parallel_schedule file to break the largest set of
> concurrent tests down into two sets of nine tests.

I'm sure that will work (at least most of the time), since I only get one
or two psql processes to hang in any given run.  But "fixing" the
problem this way just doesn't feel right to me.

> Considering that we've seen people run into maxuprc limits on some Unix
> versions, I wonder whether we ought to just do that across-the-board.

Of course, this solution is much better. :,)

> > What is the best way to "catch" this problem?  What are the best set of
> > options to pass to postmaster that will be in turn passed to the back-end
> > postgres processes to hopefully shed some light on this situation?
>
> I'd use -d1 which should be enough to see backends starting and exiting.
> Any more will clutter the log with individual queries, which probably
> would be more detail than you really want...

I've done the above and it seems to indicate that all backends exited
with a status of 0.  So, I still don't know why some backends "aborted."

Any other suggestions?  Such as somehow specifying an individual log
file for each backend.

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Tom Lane
Jason Tishler <Jason.Tishler@dothill.com> writes:
> I've done the above and it seems to indicate that all backends exited
> with a status of 0.  So, I still don't know why some backends "aborted."

Hm.  So what exactly is the failure mode?  Do the psql processes report
any errors?  Have they produced (any/all of) the expected output?

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Wed, Mar 28, 2001 at 04:40:30PM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > I've done the above and it seems to indicate that all backends exited
> > with a status of 0.  So, I still don't know why some backends "aborted."
>
> Hm.  So what exactly is the failure mode?  Do the psql processes report
> any errors?  Have they produced (any/all of) the expected output?

The failure mode is always something like the following:

The regression test proceeds normally until one of the larger parallel
groups is running.  Then it will hang after output such as:

parallel group (18 tests):  point lseg box path circle date polygon time abstime inet interval reltime type_sanity
oidjoins opr_sanity timestamp...

If I do a ps, I will see the postmaster process and one or more psql
processes.  The corresponding postgres processes are no longer running.
(Were they ever running?)  The NT Task Manager shows essentially 100% idle.

I usually kill the psql processes, with the following command:

    kill $(ps | fgrep psql | awk '{print $1}')

Then the regression test will continue with output like the following:

                                                             ...Signal 15
Signal 15
 comments tinterval
     point                ... ok
     lseg                 ... ok
     box                  ... ok
     path                 ... ok
     polygon              ... ok
     circle               ... ok
     date                 ... ok
     time                 ... ok
     timestamp            ... ok
     interval             ... ok
     abstime              ... ok
     reltime              ... ok
     tinterval            ... FAILED
     inet                 ... ok
     comments             ... FAILED
     oidjoins             ... ok
     type_sanity          ... ok
     opr_sanity           ... ok
test geometry             ... ok
..

I believe that the "failures" above correspond to the psql processes
that I killed.

Sometimes the regression test will run to completion without any more
hangs.  Sometimes it will hang at one or more large parallel groups.  If
I continue to kill the psql processes as above, the regression test will
eventually complete (with more "failures").

I've been trying another experiment: killing a postgres backend to see if
the psql process notices the backend dying.  It does, but I was only able
to kill the postgres backend with kill -9.  Otherwise, postgres ignored the
signal.  So, I don't know if my experiment was valid.  If a backend
exits normally while a psql is connected, will the psql process notice
this event?

Any other suggestions?  Or, should I just run the serial_schedule and
stop my head banging?

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Tom Lane
Jason Tishler <Jason.Tishler@dothill.com> writes:
> Then the regression test will continue with output like the following:

>                                                              ...Signal 15
> Signal 15
>  comments tinterval
>      point                ... ok
>      lseg                 ... ok
>      box                  ... ok
>      path                 ... ok
>      polygon              ... ok
>      circle               ... ok
>      date                 ... ok
>      time                 ... ok
>      timestamp            ... ok
>      interval             ... ok
>      abstime              ... ok
>      reltime              ... ok
>      tinterval            ... FAILED
>      inet                 ... ok
>      comments             ... FAILED
>      oidjoins             ... ok
>      type_sanity          ... ok
>      opr_sanity           ... ok
> test geometry             ... ok
> ..

This doesn't tell us much.  What shows up in the output files of the
failed tests --- what are the *diffs*, not just the summary display?

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Hiroshi Inoue
Tom Lane wrote:
>
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > Then the regression test will continue with output like the following:
>
> >                                                              ...Signal 15
> > Signal 15
> >  comments tinterval
> >      point                ... ok
> >      lseg                 ... ok
> >      box                  ... ok
> >      path                 ... ok
> >      polygon              ... ok
> >      circle               ... ok
> >      date                 ... ok
> >      time                 ... ok
> >      timestamp            ... ok
> >      interval             ... ok
> >      abstime              ... ok
> >      reltime              ... ok
> >      tinterval            ... FAILED
> >      inet                 ... ok
> >      comments             ... FAILED
> >      oidjoins             ... ok
> >      type_sanity          ... ok
> >      opr_sanity           ... ok
> > test geometry             ... ok
> > ..
>
> This doesn't tell us much.  What shows up in the output files of the
> failed tests --- what are the *diffs*, not just the summary display?
>

Hmmm, the *diffs* show very little.
psql hangs in PQsetdbLogin() (at the select() in the
first pqWait() call in connectDBComplete()).

regards,
Hiroshi Inoue

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Yutaka tanida
Jason,

On Wed, 28 Mar 2001 13:36:45 -0500
Jason Tishler <Jason.Tishler@dothill.com> wrote:

> On Mon, Jan 15, 2001 at 11:37:55PM -0500, Jason Tishler wrote:
> > The tests usually hang during the "parallel group (18 tests)" test
> > right after numerology.  By "hang," I mean that the original postmaster
> > is still running, but there are no postmaster children, and there are
> > some number of psql processes hanging around.  Using NT's TaskManager,
> > I can see that the machine is running out of memory.  I have even seen
> > the "Windows is running low on virtual memory" dialog a few times.
> > Should I expect this behavior from such a lame machine?
> I previously reported the above problem with the parallel version of
> the regression test (i.e., make check) on a machine with limited memory.
> Unfortunately, I am seeing similar problems on a machine with 192 MB of
> physical memory and about 208 MB of swap space.  So, now I feel that my
> initial conclusion that limited memory was the root cause is erroneous.

I can't reproduce it. The parallel regression test works perfectly and
returns "All 76 tests passed."  There is no hang.

Environment:
PIII-550, 256MB RAM PC
NT4.0 + SP6

PostgreSQL 7.1Beta6
cygipc 1.08 + my 2 patches
Cygwin1.dll 010215 snapshot

--
Yutaka tanida <yutaka@hi-net.zaq.ne.jp>


Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Wed, Mar 28, 2001 at 06:06:22PM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> So no queries get executed at all before the backend exits.  Given that
> the backend seems to be exiting normally, one would suppose that the
> backend thinks it is seeing an EOF from the client.  Is there anything
> about "unexpected EOF on client connection" in the postmaster log?

I grep-ed for EOF in postmaster.log but came up empty.  Did I need to
run with debugging turned on to see this error message?  I was running
*without* debugging turned on.

> Another possibility is that the failing psqls are never managing to
> connect in the first place.  Can you attach to one of the stuck psqls
> with gdb and get a backtrace to see where it is?

I get the following backtrace for one of the hung psql processes:

    (gdb) bt
    #0  0x77f682cb in ?? ()
    #1  0x77f1cd76 in ?? ()
    #2  0x6103deee in _size_of_stack_reserve__ ()
    #3  0x6103d84e in _size_of_stack_reserve__ ()
    #4  0x67989978 in pqWait (forRead=0, forWrite=1, conn=0xa010258)
        at fe-misc.c:738
    #5  0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103
    #6  0x67981fb1 in PQsetdbLogin (pghost=0x0, pgport=0x0, pgoptions=0x0,
        pgtty=0x0, dbName=0x1a0260e8 "regression", login=0x0, pwd=0x0)
        at fe-connect.c:524
    #7  0x40e43f in main (argc=6, argv=0x1a021ad8) at startup.c:178

On Thu, Mar 29, 2001 at 03:20:59PM +0900, Hiroshi Inoue wrote:
> psql hangs at PQsetdbLogin()(select() in the
> first pqWait() in connectDBComplete()).

Note that my hang seems identical to the one reported by Hiroshi Inoue.

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Tom Lane
Jason Tishler <Jason.Tishler@dothill.com> writes:
> I get the following backtrace for one of the hung psql processes:

>     (gdb) bt
>     #0  0x77f682cb in ?? ()
>     #1  0x77f1cd76 in ?? ()
>     #2  0x6103deee in _size_of_stack_reserve__ ()
>     #3  0x6103d84e in _size_of_stack_reserve__ ()
>     #4  0x67989978 in pqWait (forRead=0, forWrite=1, conn=0xa010258)
>         at fe-misc.c:738
>     #5  0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103
>     #6  0x67981fb1 in PQsetdbLogin (pghost=0x0, pgport=0x0, pgoptions=0x0,
>         pgtty=0x0, dbName=0x1a0260e8 "regression", login=0x0, pwd=0x0)
>         at fe-connect.c:524
>     #7  0x40e43f in main (argc=6, argv=0x1a021ad8) at startup.c:178

It would be helpful to see the contents of the conn object ("f 5" then
"p *conn" should do it).

If Hiroshi is correct that this is the *first* call to pqWait in
connectDBComplete, then I think we are looking at a kernel bug (or more
likely a cygwin bug).  psql has opened a TCP connection socket and is
now waiting for the socket to show as write-ready before it will send
a connection request.  If select() never reports the socket as
write-ready, you have a hang ... and it's not possible to blame the hang
on anything else but the kernel.  Both ends of the connection are on the
same machine, so there's no network problem or anything like that.
There is not anything else that we should need to do at the application
level before we should be allowed to send data.

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Tom Lane
Jason Tishler <Jason.Tishler@dothill.com> writes:
>   status = CONNECTION_STARTED, asyncStatus = PGASYNC_IDLE,

Oh-ho, that's interesting!  If you look at fe-connect.c you'll see that
CONNECTION_STARTED must indicate that connect() returned EINPROGRESS
rather than a success indication.  The socket is supposed to go
write-ready when the connection is finished --- for example HPUX's
connect man page sez

          [EINPROGRESS]            Nonblocking I/O is enabled using
                                   O_NONBLOCK, O_NDELAY, or FIOSNBIO, and
                                   the connection cannot be completed
                                   immediately.  This is not a failure.
                                   Make the connect() call again a few
                                   seconds later.  Alternatively, wait for
                                   completion by calling select() and
                                   selecting for write.

But, evidently, it never is coming ready for write.
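The sequence under discussion (nonblocking connect(), EINPROGRESS, then select() for write) can be sketched in plain C roughly as below.  This is only an illustration, not the code in fe-connect.c: connect_nonblocking, open_unix_listener and fail_close are hypothetical names, and the timeout is an addition so that a socket which never turns write-ready produces ETIMEDOUT rather than a hang.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/un.h>
#include <unistd.h>

/* Close fd without clobbering the errno that describes the failure. */
static int
fail_close(int fd)
{
    int saved = errno;

    close(fd);
    errno = saved;
    return -1;
}

/*
 * Hypothetical helper (not libpq code): start a nonblocking connect() and
 * wait until select() reports the socket write-ready.  Returns the socket,
 * or -1 with errno set.  The timeout is an assumption of this sketch.
 */
int
connect_nonblocking(const struct sockaddr *addr, socklen_t addrlen,
                    int timeout_sec)
{
    fd_set          wfds;
    struct timeval  tv;
    int             fd, flags, n, err = 0;
    socklen_t       errlen = sizeof(err);

    assert(addr != NULL);
    fd = socket(addr->sa_family, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return fail_close(fd);

    if (connect(fd, addr, addrlen) == 0)
        return fd;              /* connection completed immediately */
    if (errno != EINPROGRESS)
        return fail_close(fd);  /* a real failure, e.g. ECONNREFUSED */

    /* EINPROGRESS is not a failure: select for write to await completion. */
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    n = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (n == 0)
        errno = ETIMEDOUT;
    if (n <= 0)
        return fail_close(fd);

    /* Write-ready means the attempt finished; SO_ERROR says how it ended. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return fail_close(fd);
    if (err != 0)
    {
        errno = err;
        return fail_close(fd);
    }
    return fd;
}

/*
 * Hypothetical server half for local testing: listen on a Unix-domain
 * socket at 'path' and fill in *sa so that a client can connect to it.
 */
int
open_unix_listener(const char *path, struct sockaddr_un *sa)
{
    int fd;

    if (strlen(path) >= sizeof(sa->sun_path))
    {
        errno = ENAMETOOLONG;
        return -1;
    }
    memset(sa, 0, sizeof(*sa));
    sa->sun_family = AF_UNIX;
    strcpy(sa->sun_path, path);
    unlink(path);               /* remove a stale socket file, if any */

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *) sa, sizeof(*sa)) < 0 ||
        listen(fd, 8) < 0)
        return fail_close(fd);
    return fd;
}
```

The returned socket is still in nonblocking mode.  On many Unix kernels a local connect tends to complete at once, so the EINPROGRESS branch is more likely to be exercised with a TCP address passed to the same function.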

BTW, I note that we are trying to use Unix sockets here.  Does the bug
still appear if you force pg_regress to use TCP connections?

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Thu, Mar 29, 2001 at 10:43:49AM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > I get the following backtrace for one of the hung psql processes:
>
> >     (gdb) bt
> >     #0  0x77f682cb in ?? ()
> >     #1  0x77f1cd76 in ?? ()
> >     #2  0x6103deee in _size_of_stack_reserve__ ()
> >     #3  0x6103d84e in _size_of_stack_reserve__ ()
> >     #4  0x67989978 in pqWait (forRead=0, forWrite=1, conn=0xa010258)
> >         at fe-misc.c:738
> >     #5  0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103
> >     #6  0x67981fb1 in PQsetdbLogin (pghost=0x0, pgport=0x0, pgoptions=0x0,
> >         pgtty=0x0, dbName=0x1a0260e8 "regression", login=0x0, pwd=0x0)
> >         at fe-connect.c:524
> >     #7  0x40e43f in main (argc=6, argv=0x1a021ad8) at startup.c:178
>
> It would be helpful to see the contents of the conn object ("f 5" then
> "p *conn" should do it).

I did as you suggested above and got the following:

(gdb) f 5
#5  0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103
1103                                    if (pqWait(0, 1, conn))
(gdb) p *conn
$1 = {pghost = 0x0, pghostaddr = 0x0, pgport = 0xa016610 "65432",
  pgunixsocket = 0x0, pgtty = 0xa016620 "", pgoptions = 0xa016630 "",
  dbName = 0xa017170 "regression", pguser = 0xa017150 "jt",
  pgpass = 0xa017160 "", Pfdebug = 0x0,
  noticeHook = 0x67984e8c <defaultNoticeProcessor>, noticeArg = 0x0,
  status = CONNECTION_STARTED, asyncStatus = PGASYNC_IDLE,
  notifyList = 0xa0103e0, sock = 3, laddr = {sa = {sa_family = 0,
      sa_data = '\000' <repeats 13 times>}, in = {sin_family = 0,
      sin_port = 0, sin_addr = {s_addr = 0},
      __pad = "\000\000\000\000\000\000\000"}, un = {sun_family = 0,
      sun_path = '\000' <repeats 107 times>}}, raddr = {sa = {sa_family = 1,
      sa_data = "/tmp/.s.PGSQL."}, in = {sin_family = 1, sin_port = 29743,
      sin_addr = {s_addr = 774860909}, __pad = "s.PGSQL."}, un = {
      sun_family = 1,
      sun_path = "/tmp/.s.PGSQL.65432", '\000' <repeats 88 times>}},
  raddr_len = 21, be_pid = 0, be_key = 0, salt = "\000", lobjfuncs = 0x0,
  inBuffer = 0xa0103f0 "", inBufSize = 16384, inStart = 0, inCursor = 0,
  inEnd = 0, nonblocking = 0, outBuffer = 0xa0143f8 "", outBufSize = 8192,
  outCount = 0, result = 0x0, curTuple = 0x0,
  setenv_state = SETENV_STATE_IDLE, next_eo = 0x0, errorMessage = {
    data = 0xa016400 "", len = 0, maxlen = 256}, workBuffer = {
    data = 0xa016508 "", len = 0, maxlen = 256}, client_encoding = 0}

> If Hiroshi is correct that this is the *first* call to pqWait in
> connectDBComplete, then I think we are looking at a kernel bug (or more
> likely a cygwin bug).  psql has opened a TCP connection socket and is
> now waiting for the socket to show as write-ready before it will send
> a connection request.  If select() never reports the socket as
> write-ready, you have a hang ... and it's not possible to blame the hang
> on anything else but the kernel.  Both ends of the connection are on the
> same machine, so there's no network problem or anything like that.
> There is not anything else that we should need to do at the application
> level before we should be allowed to send data.

Do the details reported above support your hypothesis?  If so, can you
assist me in formulating a minimal test case that I can take back to
the Cygwin community?

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Thu, Mar 29, 2001 at 11:40:08AM -0500, Tom Lane wrote:
> BTW, I note that we are trying to use Unix sockets here.  Does the bug
> still appear if you force pg_regress to use TCP connections?

I'm not sure if you already know this, but Cygwin Unix sockets are
actually implemented as TCP/IP sockets.  Anyway, I forced TCP connections
and got the same psql hang:

(gdb) p *conn
$1 = {pghost = 0xa016618 "localhost", pghostaddr = 0x0,
  pgport = 0xa016628 "65432", pgunixsocket = 0x0, pgtty = 0xa016638 "",
  pgoptions = 0xa016648 "", dbName = 0xa017188 "regression",
  pguser = 0xa017168 "jt", pgpass = 0xa017178 "", Pfdebug = 0x0,
  noticeHook = 0x67984e8c <defaultNoticeProcessor>, noticeArg = 0x0,
  status = CONNECTION_STARTED, asyncStatus = PGASYNC_IDLE,
  notifyList = 0xa0103e8, sock = 3, laddr = {sa = {sa_family = 0,
      sa_data = '\000' <repeats 13 times>}, in = {sin_family = 0,
      sin_port = 0, sin_addr = {s_addr = 0},
      __pad = "\000\000\000\000\000\000\000"}, un = {sun_family = 0,
      sun_path = '\000' <repeats 107 times>}}, raddr = {sa = {sa_family = 2,
      sa_data = "ÿ\230\177\000\000\001\000\000\000\000\000\000\000"}, in = {
      sin_family = 2, sin_port = 39167, sin_addr = {s_addr = 16777343},
      __pad = "\000\000\000\000\000\000\000"}, un = {sun_family = 2,
      sun_path = "ÿ\230\177\000\000\001", '\000' <repeats 101 times>}},
  raddr_len = 16, be_pid = 0, be_key = 0, salt = "\000", lobjfuncs = 0x0,
  inBuffer = 0xa0103f8 "", inBufSize = 16384, inStart = 0, inCursor = 0,
  inEnd = 0, nonblocking = 0, outBuffer = 0xa014400 "", outBufSize = 8192,
  outCount = 0, result = 0x0, curTuple = 0x0,
  setenv_state = SETENV_STATE_IDLE, next_eo = 0x0, errorMessage = {
    data = 0xa016408 "", len = 0, maxlen = 256}, workBuffer = {
    data = 0xa016510 "", len = 0, maxlen = 256}, client_encoding = 0}

> But, evidently, it never is coming ready for write.

Any ideas on how to demonstrate this to the Cygwin community without
all of the PostgreSQL baggage?  Sorry, but I'm not very experienced
with sockets.

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Thu, Mar 29, 2001 at 01:00:44PM -0500, Tom Lane wrote:
> Not sure why this guy only responded to me and not the list, but here's
> a lead you might want to follow up ...
>
> On Thu, 29 Mar 2001 10:49:16 -0700, Scott Ribe wrote:
> > On Thu, Mar 29, 2001, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >Oh-ho, that's interesting!  If you look at fe-connect.c you'll see that
> > >CONNECTION_STARTED must indicate that connect() returned EINPROGRESS
> > >rather than a success indication.  The socket is supposed to go
> > >write-ready when the connection is finished...
> >
> > Uhm, generally speaking I am not qualified to participate in this
> > discussion...
> >
> > BUT I am pretty sure that some time past while searching for some other
> > network-related info on the MS web site I came across a document
> > describing bugs (or unique MS "features") in non-blocking IO and
> > particularly discussed the EINPROGRESS return value.
> >
> > I don't know what I'm talking about, I could be wrong, but I think you
> > should search on the MS web site for nonblocking IO and EINPROGRESS and you
> > might find the exact info that you need to discuss with the Cygwin folks.

I quickly searched the MSDN and could not find anything explicitly
mentioning problems with non-blocking I/O and EINPROGRESS.  Nevertheless,
in src/interfaces/libpq/fe-connect.c, I found the following comment:

    /* ----------
     * Since I have no idea whether this is a valid thing to do under Windows
     * before a connection is made, and since I have no way of testing it, I
     * leave the code looking as below.  When someone decides that they want
     * non-blocking connections under Windows, they can define
     * WIN32_NON_BLOCKING_CONNECTIONS before compilation.  If it works, then
     * this code can be cleaned up.

Cygwin is essentially Windows in this regard since Cygwin uses Windows
sockets to implement Posix sockets.  My WAG is that if EINPROGRESS is
returned during a connect attempt then the regression test hangs;
otherwise, the regression test runs to completion.

So, I applied the attached patch so that non-blocking I/O is not enabled
until after the connection has been established (just like with Win32
and SSL).  I have the regression test running in a forever loop.  So far
it has succeeded 10 times without a hang.  On this machine, I have never
been able to get more than three in a row to succeed before.

I am going to run the regression tests all night.  I will report back
tomorrow to let the list know whether or not I got any hangs.

Would the PostgreSQL team be willing to accept this patch, at least
until I determine whether or not I can get Cygwin "fixed"?  I will post
to the Cygwin list tomorrow (when/if they are back up).

BTW, Cygwin did not support non-blocking (socket) I/O until 1.1.5
which is in the November 2000 time frame.  So another WAG is that this
problem started to occur then, but I don't really remember that well.

Thanks,
Jason

Attachments

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Thu, Mar 29, 2001 at 04:55:17PM -0500, Jason Tishler wrote:
> Cygwin is essentially Windows in this regard since Cygwin uses Windows
> sockets to implement Posix sockets.  My WAG is that if EINPROGRESS is
> returned during a connect attempt then the regression test hangs;
> otherwise, the regression test runs to completion.
>
> So, I applied the attached patch so that non-blocking I/O is not enabled
> until after the connection has been established (just like with Win32
> and SSL).  I have the regression test running in a forever loop.  So far
> it has succeeded 10 times without a hang.  On this machine, I have never
> been able to get more than three in a row to succeed before.
>
> I am going to run the regression tests all night.  I will report back
> tomorrow to let the list know whether or not I got any hangs.

The regression test forever loop ran all night without a hang -- 150+
successes in a row.  So, I feel that it is safe to say that Cygwin (or
Windows Sockets) has problems with nonblocking connects.

> Would the PostgreSQL team be willing to accept this patch?

Any feedback on my patch (reattached for convenience)?  I would hate to
see 7.1 go out the door with this issue.

BTW, why is libpq's connection policy currently nonblocking for all
platforms except (straight) Win32?  Do people try to connect to multiple
postmasters concurrently?  If not, then what is the benefit over a
blocking connect?

> At least,
> until I determine whether or not I can get Cygwin "fixed?" I will post
> to the Cygwin list tomorrow (when/if they are back up).

I will post to the Cygwin list regarding this problem.  Just to make sure
that I have my story straight: psql is hanging while trying a nonblocking
connect to postmaster (not one of the backends).  Correct?

If anyone has any nonblocking socket client code with a corresponding
server lying around please let me know.  I would like to post this to
the Cygwin list to facilitate their debugging.

Thanks,
Jason

Attachments

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From: Jason Tishler
Tom,

On Fri, Mar 30, 2001 at 09:25:47AM -0500, Jason Tishler wrote:
> Any feedback on my patch (reattached for convenience)?  I would hate to
> see 7.1 go out the door with this issue.

I believe that I have finally found the root cause of the psql hangs.
IMO, Cygwin is functioning properly and the issue lies in libpq's
pqWait() use of select().

The MSDN states the following for select():

..
Summary:
    A socket will be identified in a particular set when select returns if:

..

exceptfds:
    If processing a connect call (nonblocking), connection attempt failed.
..

In libpq's pqWait(), we have the following:

    if (select(conn->sock + 1, &input_mask, &output_mask, (fd_set *) NULL,
        (struct timeval *) NULL) < 0)

After reading the above code, I hypothesized that select() was hanging
because the exceptfds was NULL.

Sure enough, if I apply the attached (nasty, hacky) patch, then the
regression test does *not* hang anymore -- even with nonblocking connects.
Some tests still fail due to a connection refused condition, which is
not unreasonable since postmaster is very busy.

IMO, pqWait() should be enhanced to check the exceptfds too -- at least
for Cygwin.  If it is not too late in the release cycle to consider such
a change, then someone with much more intimate knowledge of libpq should
only use my patch as a starting point and then do the right thing.
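
For illustration only, here is a minimal sketch of a select() wait that
also watches the exception set.  It is neither the attached patch nor
libpq's actual pqWait(); the helper name wait_on_socket() and the WAIT_*
flags are hypothetical, and a real caller would also retry on EINTR.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result flags; not part of libpq. */
enum { WAIT_READ = 1, WAIT_WRITE = 2, WAIT_EXCEPT = 4 };

/*
 * Block until sock is readable and/or writable (as requested) or is
 * reported in the exception set.  Returns a bitmask of WAIT_* flags,
 * or -1 if select() itself fails.
 */
static int
wait_on_socket(int sock, int for_read, int for_write)
{
    fd_set input_mask, output_mask, except_mask;
    int    flags = 0;

    assert(sock >= 0 && sock < FD_SETSIZE);

    FD_ZERO(&input_mask);
    FD_ZERO(&output_mask);
    FD_ZERO(&except_mask);
    if (for_read)
        FD_SET(sock, &input_mask);
    if (for_write)
        FD_SET(sock, &output_mask);
    FD_SET(sock, &except_mask);     /* always watch for exceptions */

    /* NULL timeout: wait indefinitely, as the quoted pqWait() call does */
    if (select(sock + 1, &input_mask, &output_mask, &except_mask, NULL) < 0)
        return -1;

    if (FD_ISSET(sock, &input_mask))
        flags |= WAIT_READ;
    if (FD_ISSET(sock, &output_mask))
        flags |= WAIT_WRITE;
    if (FD_ISSET(sock, &except_mask))
        flags |= WAIT_EXCEPT;       /* e.g. failed nonblocking connect on Winsock */
    return flags;
}
```

With the exception set passed in, a failed nonblocking connect that Winsock
reports only through exceptfds wakes the caller instead of leaving it
blocked in select().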

If the above enhancement is deemed too risky, then I implore the
PostgreSQL team to accept my previous patch that just makes connects
blocking for Cygwin.  Note with this patch applied, I did see some
regression test failures due to a connection refused condition -- for
the same reasons as above.

Thanks,
Jason

Attachments

RE: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Jason Tishler
>
> Tom,
>
> On Fri, Mar 30, 2001 at 09:25:47AM -0500, Jason Tishler wrote:
> > Any feedback on my patch (reattached for convenience)?  I would hate to
> > see 7.1 go out the door with this issue.
>
> I believe that I have finally found the root cause to the psql hangs.
> IMO, Cygwin is functioning properly and the issue lies in the libpq's
> pqWait() use of select().
>
> The MSDN states the following for select():
>
> ..
> Summary:
>     A socket will be identified in a particular set when select
> returns if:
>
> ..
>
> exceptfds:
>     If processing a connect call (nonblocking), connection
> attempt failed.
> ..
>

Oh, I found the same description yesterday, though I've had no time
to test it. If your patch resolves the *hang*, it may be the right
solution, at least for the Cygwin port.
BTW, I've never passed the parallel regression test without a hang or a
refusal (with your previous patch applied) under my Cygwin environment.
I added one more connect() call after the refusal and passed
all regression tests successfully. Hmm, it may be a preferable
solution.

regards,
Hiroshi Inoue

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> Oh I found the same description yesterday though I've had no time
> to test it. If your patch resolves *hang*, it may be the right solution
> at least for cygwin port.

It seems clear that it's a good idea for fe-misc.c to check the
exceptfds bit as well as read/write ready --- I'm surprised we have not
seen problems associated with that on other platforms.  I think it
should check exceptfds all the time, regardless of whether we are
waiting to read or to write.

I'm inclined to also accept Jason's change to do the connect() in
blocking mode on Cygwin.  If we do both of those things, have we
resolved the issue on Cygwin, or is there still a problem?

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Jason Tishler
Date:
Tom,

On Sat, Mar 31, 2001 at 05:45:45PM -0500, Tom Lane wrote:
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > Oh I found the same description yesterday though I've had no time
> > to test it. If your patch resolves *hang*, it may be the right solution
> > at least for cygwin port.
>
> It seems clear that it's a good idea for fe-misc.c to check the
> exceptfds bit as well as read/write ready --- I'm surprised we have not
> seen problems associated with that on other platforms.  I think it
> should check exceptfds all the time, regardless of whether we are
> waiting to read or to write.

I'm glad that you agree.  Please post to the list when the change is in
CVS and I will test that this solves the Cygwin regression test (i.e.,
psql) hangs.

BTW, this will also solve the problem of Cygwin psql hanging when no
postmaster is running which I stumbled across when enabling Unix domain
socket support.  Previously, I thought that it was a Cygwin problem but
now I know that it is caused by the same pqWait() problem.

> I'm inclined to also accept Jason's change to do the connect() in
> blocking mode on Cygwin.

Actually, the blocking connect() change for Cygwin is obviated by the
pqWait() fix.  So, I am now no longer recommending making the blocking
connect() change for Cygwin.  Unless, you do so for other Unixes too.

> If we do both of those things, have we
> resolved the issue on Cygwin, or is there still a problem?

If you do both of these changes, then the pqWait() fix will never be
triggered under Cygwin.  When I tested my hacky patch to pqWait(), I had
to back out my blocking connect() patch in order for the pqWait() changes
to take effect.  The regression test still did not hang -- although I
continued to have spurious failures due to connection refused conditions.

On Sat, Mar 31, 2001 at 10:15:08AM +0900, Hiroshi Inoue wrote:
> BTW, I've never passed the parallel regression test without a hang or a
> refusal (with your previous patch applied) under my Cygwin environment.
> I added one more connect() call after the refusal and passed
> all regression tests successfully. Hmm, it may be a preferable
> solution.

I'm wondering whether it makes sense to add a simple connection retry
policy as suggested above by Hiroshi?  Otherwise, make check will
generate false negatives due to connection refused conditions.
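
To make the idea concrete, here is a minimal sketch of a bounded retry
policy that a client layer above libpq could apply.  Everything in it is
hypothetical: connect_with_retry(), the connect_fn callback type, and the
flaky_connect() stand-in (which only simulates two refusals) are
illustrative names, not libpq or psql APIs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns 0 on success, nonzero on failure. */
typedef int (*connect_fn)(void *arg);

/*
 * Call try_connect up to max_tries times, sleeping delay_ms between
 * attempts.  Returns the number of attempts used, or -1 if all failed.
 */
static int
connect_with_retry(connect_fn try_connect, void *arg,
                   int max_tries, long delay_ms)
{
    int attempt;

    assert(try_connect != NULL && max_tries > 0 && delay_ms >= 0);

    for (attempt = 1; attempt <= max_tries; attempt++)
    {
        if (try_connect(arg) == 0)
            return attempt;

        if (attempt < max_tries && delay_ms > 0)
        {
            struct timespec ts;

            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }
    return -1;
}

/* Demonstration stand-in: refuses twice, then accepts. */
static int
flaky_connect(void *arg)
{
    int *calls = (int *) arg;

    (*calls)++;
    return (*calls >= 3) ? 0 : -1;
}
```

The retry count and delay are parameters precisely because there is no
obviously right value for either; a caller has to pick a policy.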

If it is considered too late in the release cycle for such a change,
then I offer the following suggestions:

1. Change make check to use the serial_schedule or at least allow it to
be easily selected via a make variable (e.g., make schedule=serial_schedule
check).

2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
to a number that will "ensure" that the parallel_schedule version of the
regression test does not generate connection refused conditions.  Note
that I'm not even sure this will really work on all (or any) platforms.

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Jason Tishler
Date:
Tom,

On Sat, Mar 31, 2001 at 10:07:22PM -0500, Jason Tishler wrote:
> BTW, this will also solve the problem of Cygwin psql hanging when no
> postmaster is running which I stumbled across when enabling Unix domain
> socket support.  Previously, I thought that it was a Cygwin problem but
> now I know that it is caused by the same pqWait() problem.

Oops, I meant an unconnected socket file (e.g., /tmp/.s.PGSQL.5432) above
-- not the case where no postmaster is running.

That's the problem with taking notes (which I rarely do but did in this
case) -- you actually have to review the notes for them to be useful... :,)

Jason

RE: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Jason Tishler [mailto:Jason.Tishler@dothill.com]
>
> 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> to a number that will "ensure" that the parallel_schedule version of the
> regression test does not generate connection refused conditions.  Note
> that I'm not even sure this will really work on all (or any) platforms.
>

Hmm, I changed the backlog parameter as a trial, but I wasn't able
to see any improvement.

regards,
Hiroshi Inoue


Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Jason Tishler
Date:
Hiroshi,

On Sun, Apr 01, 2001 at 11:45:04PM +0900, Hiroshi Inoue wrote:
> > From: Jason Tishler [mailto:Jason.Tishler@dothill.com]
> >
> > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> > to a number that will "ensure" that the parallel_schedule version of the
> > regression test does not generate connection refused conditions.  Note
> > that I'm not even sure this will really work on all (or any) platforms.
> >
>
> Hmm, I changed the backlog parameter as a trial, but I wasn't able
> to see any improvement.

That is what I kind of expected.  Even if it worked, it would not have
been a foolproof solution anyway.

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Tom Lane
Date:
Jason Tishler <Jason.Tishler@dothill.com> writes:
>> It seems clear that it's a good idea for fe-misc.c to check the
>> exceptfds bit as well as read/write ready --- I'm surprised we have not
>> seen problems associated with that on other platforms.  I think it
>> should check exceptfds all the time, regardless of whether we are
>> waiting to read or to write.

> I'm glad that you agree.  Please post to the list when the change is in
> CVS and I will test that this solves the Cygwin regression test (i.e.,
> psql) hangs.

Done as of yesterday; should be in this morning's snapshot.

> Actually, the blocking connect() change for Cygwin is obviated by the
> pqWait() fix.  So, I am now no longer recommending making the blocking
> connect() change for Cygwin.  Unless, you do so for other Unixes too.

I made both changes in the hope that the blocking connect change would
suppress your problem with connection-refused failures.  If it does not,
then we may as well reverse out the fe-connect.c change.  Let me know.

>> If we do both of those things, have we
>> resolved the issue on Cygwin, or is there still a problem?

> I'm wondering whether it makes sense to add a simple connection retry
> policy as suggested above by Hiroshi?

I do not think it is appropriate for libpq to do that.  For one thing,
where would you stop --- why exactly two tries?

> 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> to a number that will "ensure" that the parallel_schedule version of the
> regression test does not generate connection refused conditions.  Note
> that I'm not even sure this will really work on all (or any) platforms.

We already use SOMAXCONN which is supposed to be defined by the system
as the maximum allowed queue depth.  If Cygwin fails to define it, or
defines it as something less than it should be, then we might consider
installing a Cygwin-specific hack to redefine SOMAXCONN.  However
Hiroshi says later that he already tried this.  I'm inclined to think
that Cygwin simply has a problem with servicing concurrent connection
requests, perhaps even before the alleged SOMAXCONN value is reached.

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Jason Tishler
Date:
Tom,

On Sun, Apr 01, 2001 at 01:57:35PM -0400, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > I'm glad that you agree.  Please post to the list when the change is in
> > CVS and I will test that this solves the Cygwin regression test (i.e.,
> > psql) hangs.
>
> Done as of yesterday; should be in this morning's snapshot.

Thanks.

> > Actually, the blocking connect() change for Cygwin is obviated by the
> > pqWait() fix.  So, I am now no longer recommending making the blocking
> > connect() change for Cygwin.  Unless, you do so for other Unixes too.
>
> I made both changes in the hope that the blocking connect change would
> suppress your problem with connection-refused failures.  If it does not,
> then we may as well reverse out the fe-connect.c change.  Let me know.

With both changes or only the fe-connect.c one, psql does not hang and
displays the following error message when the connection is refused:

psql: connectDBStart() -- connect() failed: Connection refused
        Is the postmaster running locally
        and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?

With only the fe-misc.c change, psql does not hang and displays the
following error message when the connection is refused:

psql: PQconnectPoll() -- connect() failed: error 10061
        Is the postmaster running locally
        and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?

In both cases there are no hangs, just the error messages are different.
Unfortunately, for the non-blocking case the error message is cryptic.
I tried tracking down error "10061" which comes from getsockopt(), but
I was unsuccessful.  Is there any way to improve the readability of this
error message?
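
For reference, the usual POSIX pattern for finding out why a nonblocking
connect failed is to wait for the socket and then read SO_ERROR with
getsockopt().  The sketch below shows that pattern under stated
assumptions: nonblocking_connect_status() is a hypothetical helper, not the
PQconnectPoll() code, and it assumes an IPv4 TCP address.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Start a nonblocking TCP connect to addr, wait for it to finish, and
 * return 0 on success or an errno-style code describing the failure.
 */
static int
nonblocking_connect_status(const struct sockaddr_in *addr)
{
    int       sock, flags, result, optval = 0;
    socklen_t optlen = sizeof(optval);
    fd_set    write_mask, except_mask;

    assert(addr != NULL);

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return errno;

    flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0 || fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0)
    {
        result = errno;
        close(sock);
        return result;
    }

    if (connect(sock, (const struct sockaddr *) addr, sizeof(*addr)) == 0)
    {
        close(sock);
        return 0;               /* connected immediately */
    }
    if (errno != EINPROGRESS)
    {
        result = errno;         /* failed immediately */
        close(sock);
        return result;
    }

    /* Wait for completion; watch the exception set as well. */
    FD_ZERO(&write_mask);
    FD_ZERO(&except_mask);
    FD_SET(sock, &write_mask);
    FD_SET(sock, &except_mask);
    if (select(sock + 1, NULL, &write_mask, &except_mask, NULL) < 0)
    {
        result = errno;
        close(sock);
        return result;
    }

    /* SO_ERROR holds the outcome of the pending connect. */
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &optval, &optlen) < 0)
        result = errno;
    else
        result = optval;
    close(sock);
    return result;
}
```

The value returned through SO_ERROR is whatever the platform's socket layer
supplies, so the text shown to the user depends on that value being an
errno code that strerror() understands.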

Also, the blocking connect change did *not* fix the connection refused
(spurious) regression test failures.  So this change should probably be
backed out.

> > I'm wondering whether it makes sense to add a simple connection retry
> > policy as suggested above by Hiroshi?
>
> I do not think it is appropriate for libpq to do that.

When I made my suggestion above, I was concerned that maybe libpq was not
the right layer to be implementing connection policies and that possibly
psql was the better place.

> For one thing, where would you stop --- why exactly two tries?

This was another one of my concerns too.

> > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> > to a number that will "ensure" that the parallel_schedule version of the
> > regression test does not generate connection refused conditions.  Note
> > that I'm not even sure this will really work on all (or any) platforms.
>
> We already use SOMAXCONN which is supposed to be defined by the system
> as the maximum allowed queue depth.  If Cygwin fails to define it, or
> defines it as something less than it should be, then we might consider
> installing a Cygwin-specific hack to redefine SOMAXCONN.

Cygwin's socket.h defines SOMAXCONN to be 5.  On the Windows side,
winsock.h also defines it to be 5, while winsock2.h defines it to be
0x7fffffff.  So, I'm not sure what the real Cygwin (i.e., Windows)
maximum is.

> However Hiroshi says later that he already tried this.

Even if it worked, this would have just pushed the problem instead of
really fixing it.

> I'm inclined to think
> that Cygwin simply has a problem with servicing concurrent connection
> requests, perhaps even before the alleged SOMAXCONN value is reached.

You meant Windows.  Right? :,)

In summary, I feel that the fe-connect.c change should be backed out so
that Cygwin will be consistent with other UNIXes.  I also hope that the
non-blocking connection failure message can be made more readable and
that make check will not generate spurious failure messages under Cygwin
on slow machines.

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Tom Lane
Date:
Jason Tishler <Jason.Tishler@dothill.com> writes:
> In both cases there are no hangs, just the error messages are different.
> Unfortunately, for the non-blocking case the error message is cryptic.
> I tried tracking down error "10061" which comes from getsockopt(), but
> I was unsuccessful.  Is there any way to improve the readability of this
> error message?

I'm inclined to leave the blocking-connect change in there just to
suppress that peculiar error message.  No, I have no idea where it's
coming from, either.

>> However Hiroshi says later that he already tried [ raising SOMAXCONN ]

> Even if it worked, this would have just pushed the problem instead of
> really fixing it.

If the problem were overflow of the accept queue, then raising the
listen() parameter ought to fix it, assuming that Windows does actually
allow larger values for the parameter.  Given that we are only hearing
this problem reported on Windows, I have a sneaking suspicion that the
effective queue length limit is 1 on that platform no matter what we
pass to listen().  Is there anyone we might ask about concurrent
connection-request handling on Windows?

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Jason Tishler
Date:
Tom,

On Mon, Apr 02, 2001 at 01:44:14PM -0400, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > In both cases there are no hangs, just the error messages are different.
> > Unfortunately, for the non-blocking case the error message is cryptic.
> > I tried tracking down error "10061" which comes from getsockopt(), but
> > I was unsuccessful.  Is there any way to improve the readability of this
> > error message?
>
> I'm inclined to leave the blocking-connect change in there just to
> suppress that peculiar error message.  No, I have no idea where it's
> coming from, either.

I just figured out what error 10061 is -- it is WSAECONNREFUSED, Winsock's
version of ECONNREFUSED.  I just submitted a patch to Cygwin that maps
getsockopt optvals from the Winsock versions to their corresponding
errno values.  I just tried psql with an unconnected socket file and
psql displayed:

    psql: PQconnectPoll() -- connect() failed: Connection refused
        Is the postmaster running locally
        and accepting connections on Unix socket '/tmp/.s.PGSQL.5432'?

as desired.
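
The kind of translation involved can be sketched as follows.  This is not
the Cygwin patch itself: winsock_to_errno() is a hypothetical name, the
table is deliberately partial, and the Winsock numbers are written out as
SK_* constants (my own names) so the sketch compiles without <winsock.h>.

```c
#include <assert.h>
#include <errno.h>

/* Winsock error numbers, spelled out so this compiles on any platform. */
#define SK_WSAEWOULDBLOCK   10035
#define SK_WSAEINPROGRESS   10036
#define SK_WSAECONNRESET    10054
#define SK_WSAETIMEDOUT     10060
#define SK_WSAECONNREFUSED  10061

/*
 * Translate a Winsock error number into the corresponding errno value.
 * Codes not in this (partial) table are passed through unchanged.
 */
static int
winsock_to_errno(int wsa_error)
{
    assert(wsa_error >= 0);

    switch (wsa_error)
    {
        case 0:
            return 0;
        case SK_WSAEWOULDBLOCK:
            return EWOULDBLOCK;
        case SK_WSAEINPROGRESS:
            return EINPROGRESS;
        case SK_WSAECONNRESET:
            return ECONNRESET;
        case SK_WSAETIMEDOUT:
            return ETIMEDOUT;
        case SK_WSAECONNREFUSED:
            return ECONNREFUSED;
        default:
            return wsa_error;
    }
}
```

Once the value read from getsockopt() is an errno code, strerror() can
produce a readable message in place of a bare "error 10061".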

If interested, see the following for details:

    http://www.cygwin.com/ml/cygwin-patches/2001-q2/msg00003.html

If my Cygwin patch is accepted, I'll let the list know.  At that time, I
think that the fe-connect.c change should be backed out.

> >> However Hiroshi says later that he already tried [ raising SOMAXCONN ]
>
> > Even if it worked, this would have just pushed the problem instead of
> > really fixing it.
>
> If the problem were overflow of the accept queue, then raising the
> listen() parameter ought to fix it, assuming that Windows does actually
> allow larger values for the parameter.  Given that we are only hearing
> this problem reported on Windows, I have a sneaking suspicion that the
> effective queue length limit is 1 on that platform no matter what we
> pass to listen().  Is there anyone we might ask about concurrent
> connection-request handling on Windows?

In digging some more through the MSDN, I found out the backlog limit
on NT 4.0 Workstation and Server is 5 and 200, respectively.  So, it
would appear that NT is really using this parameter.  If interested,
see the following for more details:

    http://support.microsoft.com/support/kb/articles/Q127/1/44.asp

When running the parallel_schedule, as many as 18 psql's are trying to
connect to postmaster.  Isn't it conceivable that more than 6 are trying
to connect concurrently?

Thanks,
Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Tom Lane
Date:
Jason Tishler <Jason.Tishler@dothill.com> writes:
> I just figured out what is error 10061 -- it is WSAECONNREFUSED, Winsock's
> version of ECONNREFUSED.  I just submitted a patch to Cygwin that maps
> getsockopt optval's from the Winsock versions to their corresponding
> errno values.

Ah so.  Sounds good.

> If my Cygwin patch is accepted, I'll let the list know.  At that time, I
> think that the fe-connect.c change should be backed out.

My feeling is that we should leave it in place for 7.1 in any case.
Once there's a shipping Cygwin version that maps the error number
correctly, we can back out the patch so that Cygwin is treated more
like other platforms.

> In digging some more through the MSDN, I found out the backlog limit
> on NT 4.0 Workstation and Server is 5 and 200, respectively.

This page only talks about NT; what of other flavors of Windows?  Cygwin
runs on more than NT, doesn't it?

Interesting point here: a copy of Postgres compiled on NT WS would
presumably see SOMAXCONN = 5 in the system headers.  If the executable
is then moved to NT Server, it would fail to take advantage of the
higher queue limit.  Do we need to hardwire a hack to use the larger
value always on Windows?

> When running the parallel_schedule, as many as 18 psql's are trying to
> connect to postmaster.  Isn't it conceivable that more than 6 are trying
> to connect concurrently?

Yes (although that's still hypothesis, not the proven cause of failure).

I still suspect there's something else going on here, anyway.  SOMAXCONN
is nominally 5 on quite a lot of Unixen, but we've only heard reports of
transient "make check" connect failures on Windows.  Why is Windows so
much more prone to show this problem?

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Tom Lane
Date:
I wrote:
> I still suspect there's something else going on here, anyway.  SOMAXCONN
> is nominally 5 on quite a lot of Unixen, but we've only heard reports of
> transient "make check" connect failures on Windows.  Why is Windows so
> much more prone to show this problem?

Hm, maybe I need to take this back.  Some poking around shows that
SOMAXCONN is defined as 128 on Linux, 20 on HPUX, which are the
platforms I've tested most.  As an experiment I reduced the listen()
parameter to 5 on HPUX, and bingo: I get connection-refused failures
in "make check".  So it seems that Windows' behavior is not so out of
line after all.  We would probably see similar failures on BSD-derived
systems, since BSD systems traditionally set SOMAXCONN to 5.  (Any
BSD partisans able to check this?)

I do not think that we should change "make check" to avoid this issue.
If you are on a platform that has a problem with supporting lots of
parallel connection requests, it seems to me that you'd best know about
that limitation, and "make check" is doing you a service by pointing
out the problem.

What I do think we should consider is whether to believe SOMAXCONN
unconditionally, or to use a large value in the listen() call no matter
what the system headers claim SOMAXCONN is.  This would avoid
sillinesses such as using an NT-Workstation limit on an NT-Server
machine.  The only risk I can see is that some platforms might reject
as erroneous a listen() parameter that's more than they are prepared to
support.  The Unix man pages I have access to claim that a too-large
listen() parameter will be clamped to the kernel's SOMAXCONN without
raising an error, but does anyone have an idea whether that behavior
is universal?
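
As a rough sketch of the second option: pass listen() the larger of the
header's SOMAXCONN and a requested depth, and rely on the kernel to clamp
it.  The helper names effective_backlog() and listen_generously() are
hypothetical, and the sketch simply assumes the clamping behavior described
above, which is exactly the part that is not known to be universal.

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Pick the backlog to request: never less than the header's SOMAXCONN,
 * but allow a larger requested depth.
 */
static int
effective_backlog(int header_somaxconn, int wanted)
{
    assert(wanted > 0);
    return (header_somaxconn > wanted) ? header_somaxconn : wanted;
}

/*
 * Call listen() with a backlog that is not capped by the compile-time
 * SOMAXCONN.  Assumes the kernel silently clamps an oversized value;
 * a platform that rejects it would make listen() return -1 here.
 */
static int
listen_generously(int sock, int wanted)
{
    return listen(sock, effective_backlog(SOMAXCONN, wanted));
}
```

A binary built where the headers say 5 would then still ask for the deeper
queue when it runs on a system that supports one.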

In the longer term, we should think about whether we can reduce the
postmaster's connection service delay.  Someone recently suggested
that the postmaster should fork a child immediately upon receiving
a connection, and let the child work on the authentication process
while the parent goes right back to accept().  I'm not sure if that
would help "make check" very much, since it's presumably not running
anything more complex than "trust" authentication anyway.  But it
should eliminate auth delays caused by SSL, malfunctioning ident
daemons, and sundry other problems.
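
The fork-immediately idea might look roughly like the sketch below.  It is
an illustration only, not postmaster code: accept_and_fork(), the
session_fn callback type, and the greet_session() stand-in are hypothetical
names, and a real server would also reap its children with waitpid().

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-connection work, run in the child. */
typedef void (*session_fn)(int client_sock);

/*
 * Accept one connection and hand it to a child at once, so that the
 * parent can go straight back to accept() instead of waiting on
 * authentication.  Returns the child's pid, or -1 on failure.
 */
static pid_t
accept_and_fork(int listen_sock, session_fn run_session)
{
    int   client;
    pid_t pid;

    assert(run_session != NULL);

    client = accept(listen_sock, NULL, NULL);
    if (client < 0)
        return -1;

    pid = fork();
    if (pid == 0)
    {
        /* Child: authentication and the session would happen here. */
        close(listen_sock);
        run_session(client);
        close(client);
        _exit(0);
    }

    /* Parent (or fork failure): the child owns the connection now. */
    close(client);
    return pid;
}

/* Demonstration stand-in for a session: send a two-byte greeting. */
static void
greet_session(int client_sock)
{
    if (write(client_sock, "ok", 2) != 2)
        _exit(1);
}
```

The parent's time per connection is then one accept() plus one fork(),
regardless of how slow the authentication step turns out to be.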

            regards, tom lane

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Jason Tishler
Date:
Tom,

On Mon, Apr 02, 2001 at 03:50:55PM -0400, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > If my Cygwin patch is accepted, I'll let the list know.  At that time, I
> > think that the fe-connect.c change should be backed out.
>
> My feeling is that we should leave it in place for 7.1 in any case.
> Once there's a shipping Cygwin version that maps the error number
> correctly, we can back out the patch so that Cygwin is treated more
> like other platforms.

OK, the above plan is reasonable.

> > In digging some more through the MSDN, I found out the backlog limit
> > on NT 4.0 Workstation and Server is 5 and 200, respectively.
>
> This page only talks about NT; what of other flavors of Windows?  Cygwin
> runs on more than NT, doesn't it?

Yes, it runs on 2000, 9X/Me, and even XP.  Unfortunately, I couldn't
(easily) find the limits for these versions.  My WAG is that 2000 and
XP will be the same or similar to NT.  I am not concerned about 9X/Me
because I find them unusable for other reasons.

> Interesting point here: a copy of Postgres compiled on NT WS would
> presumably see SOMAXCONN = 5 in the system headers.  If the executable
> is then moved to NT Server, it would fail to take advantage of the
> higher queue limit.

Actually, even if compiled on NT Server, SOMAXCONN is set to 5 due to
Cygwin's socket.h.

> Do we need to hardwire a hack to use the larger
> value always on Windows?

Sounds like a good idea, but the effort only seems reasonable if we can
conclude that Windows will really take advantage of it.

> > When running the parallel_schedule, as many as 18 psql's are trying to
> > connect to postmaster.  Isn't it conceivable that more than 6 are trying
> > to connect concurrently?
>
> Yes (although that's still hypothesis, not the proven cause of failure).
>
> I still suspect there's something else going on here, anyway.  SOMAXCONN
> is nominally 5 on quite a lot of Unixen, but we've only heard reports of
> transient "make check" connect failures on Windows.  Why is Windows so
> much more prone to show this problem?

I don't know!  I've been banging my head to find out why and my head is
starting to hurt... :,)

Jason

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Bruce Momjian
Date:
> I wrote:
> > I still suspect there's something else going on here, anyway.  SOMAXCONN
> > is nominally 5 on quite a lot of Unixen, but we've only heard reports of
> > transient "make check" connect failures on Windows.  Why is Windows so
> > much more prone to show this problem?
>
> Hm, maybe I need to take this back.  Some poking around shows that
> SOMAXCONN is defined as 128 on Linux, 20 on HPUX, which are the
> platforms I've tested most.  As an experiment I reduced the listen()
> parameter to 5 on HPUX, and bingo: I get connection-refused failures
> in "make check".  So it seems that Windows' behavior is not so out of
> line after all.  We would probably see similar failures on BSD-derived
> systems, since BSD systems traditionally set SOMAXCONN to 5.  (Any
> BSD partisans able to check this?)

BSDi 4.01 has:

    /*
     * Maximum queue length specifiable by listen.
     * The kernel has a configurable limit;
     * the non-kernel value is the traditional one.
     */
    #ifndef KERNEL
    #define SOMAXCONN   64  /* XXX, really run-time settable */
    #else
    #ifndef _POSIX_SOURCE
    #define SOMAXCONN_DFLT  64
    #endif
    #endif

and sysctl has:

    net.socket.maxconn = 64

that can be easily changed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Cygwin PostgreSQL Regression Test Problems (Revisited)

From
Bruce Momjian
Date:
> In the longer term, we should think about whether we can reduce the
> postmaster's connection service delay.  Someone recently suggested
> that the postmaster should fork a child immediately upon receiving
> a connection, and let the child work on the authentication process
> while the parent goes right back to accept().  I'm not sure if that
> would help "make check" very much, since it's presumably not running
> anything more complex than "trust" authentication anyway.  But it
> should eliminate auth delays caused by SSL, malfunctioning ident
> daemons, and sundry other problems.

I think the trust for SSL/ident would be a good idea.
