Discussion: Intermittent "make check" failures on hyena
I'm noticing that buildfarm member hyena sometimes fails the parallel regression tests, for instance http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=hyena&dt=2006-07-19%2009:20:00

The symptom is always that one of the tests fails entirely because psql couldn't connect:

psql: could not connect to server: Connection refused
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.55678"?

It's a different test failing in each occurrence. Sometimes there are ensuing failures in subsequent tests that expect the side-effects of the one that failed, but there's clearly a common cause here.

AFAIK it is not possible for Postgres itself to cause a "connection refused" failure --- that's a kernel-level errno. So what's going on here? The only idea that comes to mind is that this version of Solaris has some very low limit on SOMAXCONN, and when the timing is just so it's bouncing connection requests because several of them arrive faster than the postmaster can fork off children. Googling suggests that there are versions of Solaris with SOMAXCONN as low as 5 :-( ... but other pages say that the default is higher, so this theory might be wrong.

What is SOMAXCONN set to on that box, anyway? If it's tiny, I suggest you increase SOMAXCONN to something saner, or if you can't, run "make check" with MAX_CONNECTIONS=5 added to the make command. Does the buildfarm script have provisions for site-local settings of this parameter?

regards, tom lane
Tom Lane wrote:
> What is SOMAXCONN set to on that box, anyway? If it's tiny, I suggest
> you increase SOMAXCONN to something saner, or if you can't, run "make
> check" with MAX_CONNECTIONS=5 added to the make command. Does the
> buildfarm script have provisions for site-local settings of this
> parameter?

Yes it sure does.

This is the box that Sun donated, btw. I get:

  ndd /dev/tcp tcp_conn_req_max_q => 128

Is that the Solaris equivalent of SOMAXCONN? That's low, maybe, but not impossibly low. I don't have root on the box, though.

For now I have set MAX_CONNECTIONS to 8, to provide a modest limit on parallelism. I will see if I can coordinate with Robert and Josh to increase the OS limits.

Thanks for the diagnosis.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> This is the box that Sun donated, btw.
> I get: ndd /dev/tcp tcp_conn_req_max_q => 128
> Is that the Solaris equivalent of SOMAXCONN? That's low, maybe, but not
> impossibly low.

Yeah, I found that variable name in googling. If it's 128 then there's no way that it's causing the problem --- you'd have to assume a value in the single digits to explain the observed failures.

I see one occurrence in the 8.1 branch on hyena, but the failure probability seems to have jumped way up in HEAD since we put in the C-coded pg_regress. This lends weight to the idea that it's a timing-related issue, because pg_regress.c is presumably much faster at forking off a parallel gang of psqls than the shell script was; and it's hard to see what else about the pg_regress change could be affecting the psqls' ability to connect once forked.

We probably need to get some Solaris experts involved in diagnosing what's happening. Judging by the buildfarm results you should be able to replicate it fairly easily by doing "make installcheck-parallel" repeatedly.

regards, tom lane
Tom Lane wrote:
> We probably need to get some Solaris experts involved in diagnosing
> what's happening. Judging by the buildfarm results you should be able
> to replicate it fairly easily by doing "make installcheck-parallel"
> repeatedly.

I will refer this to those experts - my Solaris-fu is a tad rusty these days.

cheers

andrew
Andrew Dunstan wrote:
> Tom Lane wrote:
>> We probably need to get some Solaris experts involved in diagnosing
>> what's happening.
>
> I will refer this to those experts - my Solaris-fu is a tad rusty these
> days.

As Tom mentioned, the problem is in the size of the TCP connection queue (the tcp_conn_req_max_q parameter); the default is 128 on Solaris 10. A second limit is twice the number of backends, see ./backend/libpq/pqcomm.c:

	/*
	 * Select appropriate accept-queue length limit.  PG_SOMAXCONN is only
	 * intended to provide a clamp on the request on platforms where an
	 * overly large request provokes a kernel error (are there any?).
	 */
	maxconn = MaxBackends * 2;
	if (maxconn > PG_SOMAXCONN)
		maxconn = PG_SOMAXCONN;

	err = listen(fd, maxconn);

So what happened? I think the following scenario occurred: the postmaster listens in only one process, while the clients really do run in parallel; the T2000 server has 32 hardware threads (8 cores with 4 threads each). These clients can generate more connection requests at one time than the postmaster is able to accept.

Zdenek
Zdenek,

> So what happened? I think the following scenario occurred: the
> postmaster listens in only one process, while the clients really do
> run in parallel; the T2000 server has 32 hardware threads (8 cores
> with 4 threads each). These clients can generate more connection
> requests at one time than the postmaster is able to accept.

I don't quite follow this ... are you saying that the regression tests generate more than 128 connections? And that Solaris ships by default only allowing 128 connections?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus wrote:
> I don't quite follow this ... are you saying that the regression tests
> generate more than 128 connections? And that Solaris ships by default
> only allowing 128 connections?

My idea was completely wrong: the problem is not related to the TCP/IP stack at all, only to AF_UNIX sockets. I have tried to analyze the problem with DTrace, but I cannot reproduce it on my machine. Josh, could I have access to this server? (For DTrace I need root access as well.)

Zdenek
Josh Berkus wrote:
> I don't quite follow this ... are you saying that the regression tests
> generate more than 128 connections? And that Solaris ships by default
> only allowing 128 connections?

The default of 128 is for the queue of TCP connections that are established but not yet accepted; this queue length is called the backlog (see man listen), and the TCP value can be changed with the ndd command. The problem, however, is that Unix domain sockets have a different backlog limit: 32 requests waiting for accept, and that value is not tunable. Kernel patch 118855-15 increases it to 128 (see bugid 4352289), and newer versions of Solaris/OpenSolaris use a bigger value, 4096.

Please install the latest Solaris 10 patches on this machine and check whether that helps.

Zdenek
Zdenek Kotala wrote:
> The problem, however, is that Unix domain sockets have a different
> backlog limit: 32 requests waiting for accept, and that value is not
> tunable. Kernel patch 118855-15 increases it to 128 (see bugid
> 4352289), and newer versions of Solaris/OpenSolaris use a bigger
> value, 4096.
>
> Please install the latest Solaris 10 patches on this machine and check
> whether that helps.

Even at 32 this hardly seems to be a likely cause of the problem. I think the maximum parallelism of our tests is around 20. Anyway, let's get the patch installed - I have a test regime set up that will reproduce the error moderately reliably within about a dozen or so tries.

cheers

andrew
Andrew Dunstan wrote:
> Even at 32 this hardly seems to be a likely cause of the problem. I
> think the maximum parallelism of our tests is around 20. Anyway, let's
> get the patch installed - I have a test regime set up that will
> reproduce the error moderately reliably within about a dozen or so
> tries.

I have analyzed this problem in depth, and it is a problem in the local transport layer module (the tl driver) in the Solaris kernel. It is a kind of race condition that appears when there are parallel requests to close and connect. I have raised a bug. Reconnecting should be a workaround.

Zdenek