Обсуждение: Intermittent "make check" failures on hyena

Поиск
Список
Период
Сортировка

Intermittent "make check" failures on hyena

От
Tom Lane
Дата:
I'm noticing that buildfarm member hyena sometimes fails the parallel
regression tests, for instance
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=hyena&dt=2006-07-19%2009:20:00

The symptom is always that one of the tests fails entirely because
psql couldn't connect:

psql: could not connect to server: Connection refusedIs the server running locally and acceptingconnections on Unix
domainsocket "/tmp/.s.PGSQL.55678"?
 

It's a different test failing in each occurrence.  Sometimes there are
ensuing failures in subsequent tests that expect the side-effects
of the one that failed, but there's clearly a common cause here.

AFAIK it is not possible for Postgres itself to cause a "connection
refused" failure --- that's a kernel-level errno.  So what's going on
here?  The only idea that comes to mind is that this version of Solaris
has some very low limit on SOMAXCONN, and when the timing is just so
it's bouncing connection requests because several of them arrive faster
than the postmaster can fork off children.  Googling suggests that there
are versions of Solaris with SOMAXCONN as low as 5 :-( ... but other
pages say that the default is higher, so this theory might be wrong.

What is SOMAXCONN set to on that box, anyway?  If it's tiny, I suggest
you increase SOMAXCONN to something saner, or if you can't, run "make
check" with MAX_CONNECTIONS=5 added to the make command.  Does the
buildfarm script have provisions for site-local settings of this
parameter?
        regards, tom lane


Re: Intermittent "make check" failures on hyena

От
Andrew Dunstan
Дата:

Tom Lane wrote:

>I'm noticing that buildfarm member hyena sometimes fails the parallel
>regression tests, for instance
>http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=hyena&dt=2006-07-19%2009:20:00
>
>The symptom is always that one of the tests fails entirely because
>psql couldn't connect:
>
>psql: could not connect to server: Connection refused
>    Is the server running locally and accepting
>    connections on Unix domain socket "/tmp/.s.PGSQL.55678"?
>
>It's a different test failing in each occurrence.  Sometimes there are
>ensuing failures in subsequent tests that expect the side-effects
>of the one that failed, but there's clearly a common cause here.
>
>AFAIK it is not possible for Postgres itself to cause a "connection
>refused" failure --- that's a kernel-level errno.  So what's going on
>here?  The only idea that comes to mind is that this version of Solaris
>has some very low limit on SOMAXCONN, and when the timing is just so
>it's bouncing connection requests because several of them arrive faster
>than the postmaster can fork off children.  Googling suggests that there
>are versions of Solaris with SOMAXCONN as low as 5 :-( ... but other
>pages say that the default is higher, so this theory might be wrong.
>
>What is SOMAXCONN set to on that box, anyway?  If it's tiny, I suggest
>you increase SOMAXCONN to something saner, or if you can't, run "make
>check" with MAX_CONNECTIONS=5 added to the make command.  Does the
>buildfarm script have provisions for site-local settings of this
>parameter?
>
>    
>  
>


Yes it sure does.

This is the box that Sun donated, btw.

I get: ndd /dev/tcp tcp_conn_req_max_q   => 128

Is that the Solaris equivalent of SOMAXCONN? That's low, maybe, but not 
impossibly low.

I don't have root on the box, though. For now I have set MAX_CONNECTIONS 
to 8, to provide a modest limit on parallelism. I will see if I can 
coordinate with Robert and Josh to increase the OS limits.

Thanks for the diagnosis.


cheers

andrew


Re: Intermittent "make check" failures on hyena

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> AFAIK it is not possible for Postgres itself to cause a "connection
>> refused" failure --- that's a kernel-level errno.  So what's going on
>> here?  The only idea that comes to mind is that this version of Solaris
>> has some very low limit on SOMAXCONN, and when the timing is just so
>> it's bouncing connection requests because several of them arrive faster
>> than the postmaster can fork off children.  Googling suggests that there
>> are versions of Solaris with SOMAXCONN as low as 5 :-( ... but other
>> pages say that the default is higher, so this theory might be wrong.

> This is the box that Sun donated, btw.
> I get: ndd /dev/tcp tcp_conn_req_max_q   => 128
> Is that the Solaris equivalent of SOMAXCONN? That's low, maybe, but not 
> impossibly low.

Yeah, I found that variable name in googling.  If it's 128 then there's
no way that it's causing the problem --- you'd have to assume a value in
the single digits to explain the observed failures.

I see one occurrence in the 8.1 branch on hyena, but the failure
probability seems to have jumped way up in HEAD since we put in the
C-coded pg_regress.  This lends weight to the idea that it's a
timing-related issue, because pg_regress.c is presumably much faster
at forking off a parallel gang of psqls than the shell script was;
and it's hard to see what else about the pg_regress change could be
affecting the psqls' ability to connect once forked.

We probably need to get some Solaris experts involved in diagnosing
what's happening.  Judging by the buildfarm results you should be able
to replicate it fairly easily by doing "make installcheck-parallel"
repeatedly.
        regards, tom lane


Re: Intermittent "make check" failures on hyena

От
Andrew Dunstan
Дата:

Tom Lane wrote:

>I see one occurrence in the 8.1 branch on hyena, but the failure
>probability seems to have jumped way up in HEAD since we put in the
>C-coded pg_regress.  This lends weight to the idea that it's a
>timing-related issue, because pg_regress.c is presumably much faster
>at forking off a parallel gang of psqls than the shell script was;
>and it's hard to see what else about the pg_regress change could be
>affecting the psqls' ability to connect once forked.
>
>We probably need to get some Solaris experts involved in diagnosing
>what's happening.  Judging by the buildfarm results you should be able
>to replicate it fairly easily by doing "make installcheck-parallel"
>repeatedly.
>
>
>  
>

I will refer this to those experts - my Solaris-fu is a tad rusty these 
days.

cheers

andrew


Re: Intermittent "make check" failures on hyena

От
Zdenek Kotala
Дата:
Andrew Dunstan wrote:
> 
> 
> Tom Lane wrote:
> 
>> I see one occurrence in the 8.1 branch on hyena, but the failure
>> probability seems to have jumped way up in HEAD since we put in the
>> C-coded pg_regress.  This lends weight to the idea that it's a
>> timing-related issue, because pg_regress.c is presumably much faster
>> at forking off a parallel gang of psqls than the shell script was;
>> and it's hard to see what else about the pg_regress change could be
>> affecting the psqls' ability to connect once forked.
>>
>> We probably need to get some Solaris experts involved in diagnosing
>> what's happening.  Judging by the buildfarm results you should be able
>> to replicate it fairly easily by doing "make installcheck-parallel"
>> repeatedly.
>>
> 
> I will refer this to those experts - my Solaris-fu is a tad rusty these 
> days.

How Tom mentioned, problem is in the size of TCP connection queue 
(parameter tcp_conn_req_max_q). Default is 128 in solaris 10. Second 
limit is twice number of backends. See ./backend/libpq/pqcomm.c
                  /*                 * Select appropriate accept-queue length limit. 
PG_SOMAXCONN is only                 * intended to provide a clamp on the request on 
platforms where an                 * overly large request provokes a kernel error (are 
there any?).                 */                maxconn = MaxBackends * 2;                if (maxconn > PG_SOMAXCONN)
                   maxconn = PG_SOMAXCONN;
 
                err = listen(fd, maxconn);


However what happened? I think that following scenarios occurred. 
Postmaster listen only in one process and there are many clients run 
really parallel. T2000 server has 32 threads ( 8 core and each has 4 
threads). These clients generate more TCP/IP request at one time, than 
postmaster is able accepted.


Zdenek


Re: Intermittent "make check" failures on hyena

От
Josh Berkus
Дата:
Zdenek,

> However what happened? I think that following scenarios occurred.
> Postmaster listen only in one process and there are many clients run
> really parallel. T2000 server has 32 threads ( 8 core and each has 4
> threads). These clients generate more TCP/IP request at one time, than
> postmaster is able accepted.

I don't quite follow this ... are you saying that the regression test 
generate more than 128 connections?    And that Solaris ships by default 
only allowing 128 connections?   

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Intermittent "make check" failures on hyena

От
Zdenek Kotala
Дата:
Josh Berkus napsal(a):
> Zdenek,
> 
>> However what happened? I think that following scenarios occurred.
>> Postmaster listen only in one process and there are many clients run
>> really parallel. T2000 server has 32 threads ( 8 core and each has 4
>> threads). These clients generate more TCP/IP request at one time, than
>> postmaster is able accepted.
> 
> I don't quite follow this ... are you saying that the regression test 
> generate more than 128 connections?    And that Solaris ships by default 
> only allowing 128 connections?   

My idea is completely wrong. The problem is not related to TCP/IP stack. 
It related only to UNIX_AF sockets. I have tried analyze problem with 
dtrace, but I cannot reproduce this on my machine. Josh could I have 
access to this server (for D-Trace I need root access as well).
Zdenek


Re: Intermittent "make check" failures on hyena

От
Zdenek Kotala
Дата:
Josh Berkus wrote:
> Zdenek,
> 
>> However what happened? I think that following scenarios occurred.
>> Postmaster listen only in one process and there are many clients run
>> really parallel. T2000 server has 32 threads ( 8 core and each has 4
>> threads). These clients generate more TCP/IP request at one time, than
>> postmaster is able accepted.
> 
> I don't quite follow this ... are you saying that the regression test 
> generate more than 128 connections?    And that Solaris ships by default 
> only allowing 128 connections?   
> 

Default is 128 for queue of TCP established socket, but not accepted. It 
is called backlog (see man listen). This values is possible change by 
ndd command. However problem is, that unix domain socket has different 
backlog limit. This limit is 32 waiting request for accept and this 
value is not possible setup. However, kernel patch 118855-15 increase 
this value up to 128 - see bugid 4352289. New version of 
solaris/opensolaris has bigger value - 4096.

Please, install latest solaris 10 patches on this machine and check if 
it helps.
    Zdenek


Re: Intermittent "make check" failures on hyena

От
Andrew Dunstan
Дата:
Zdenek Kotala wrote:
> Josh Berkus wrote:
>> Zdenek,
>>
>>> However what happened? I think that following scenarios occurred.
>>> Postmaster listen only in one process and there are many clients run
>>> really parallel. T2000 server has 32 threads ( 8 core and each has 4
>>> threads). These clients generate more TCP/IP request at one time, than
>>> postmaster is able accepted.
>>
>> I don't quite follow this ... are you saying that the regression test 
>> generate more than 128 connections?    And that Solaris ships by 
>> default only allowing 128 connections?  
>
> Default is 128 for queue of TCP established socket, but not accepted. 
> It is called backlog (see man listen). This values is possible change 
> by ndd command. However problem is, that unix domain socket has 
> different backlog limit. This limit is 32 waiting request for accept 
> and this value is not possible setup. However, kernel patch 118855-15 
> increase this value up to 128 - see bugid 4352289. New version of 
> solaris/opensolaris has bigger value - 4096.
>
> Please, install latest solaris 10 patches on this machine and check if 
> it helps.
>
>  

Even at 32 this hardly seems to be a likely cause of the problem. I 
think the maximum parallelism of our tests is around 20. Anyway, lets's 
get the patch installed - I have a test regime set up that will 
reproduce the error moderately reliably within about a dozen or so tries.

cheers

andrew



Re: Intermittent "make check" failures on hyena

От
Zdenek Kotala
Дата:
Andrew Dunstan napsal(a):
> 
> Even at 32 this hardly seems to be a likely cause of the problem. I 
> think the maximum parallelism of our tests is around 20. Anyway, lets's 
> get the patch installed - I have a test regime set up that will 
> reproduce the error moderately reliably within about a dozen or so tries.


I deeply analyzed this problem and it is problem in the local transport 
layer modul - tl driver in the solaris kernel. It is kind of race 
condition and it appears when there are parallel request to close and 
connect. I raised bug. Reconnect should be work around.
    Zdenek