Обсуждение: regresssion script hole
While investigating some problems with buildfarm member spoonbill I came across this piece of code in pg_regress.sh, which seems less than robust: # Wait till postmaster is able to accept connections (normally only # a second or so, but Cygwin is reportedly *much*slower). Don't # wait forever, however. i=0 max=60 until "$bindir/psql" -X $psql_options postgres </dev/null2>/dev/null do i=`expr $i + 1` if [ $i -ge $max ] then break fi if kill-0 $postmaster_pid >/dev/null 2>&1 then : still starting up else break fi sleep 1 done if kill -0 $postmaster_pid >/dev/null 2>&1 then echo "running on port $PGPORT with pid $postmaster_pid" else echo echo "$me: postmaster did not start" echo "Examine $LOGDIR/postmaster.log for the reason." echo (exit 2); exit fi The problem is that if the postmaster takes more than 60 seconds to start listening (as is apparently happening on spoonbill - don't yet know why) this code falls through. I'm inclined to run the psql test one more time to make sure we can actually connect, and if not then fail at this point. I wouldn't bother but it did confuse the heck out of both Stefan and me when createlang failed. Thoughts? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > The problem is that if the postmaster takes more than 60 seconds to > start listening (as is apparently happening on spoonbill - don't yet > know why) this code falls through. If the postmaster takes that long to start listening, I'd say we need to fix the postmaster not pg_regress. > I'm inclined to run the psql test one more time to make sure we can > actually connect, and if not then fail at this point. How does that differ from just iterating the loop one more time? regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>The problem is that if the postmaster takes more than 60 seconds to >>start listening (as is apparently happening on spoonbill - don't yet >>know why) this code falls through. >> >> > >If the postmaster takes that long to start listening, I'd say we need to >fix the postmaster not pg_regress. > > > We need both, I think. I am still trying to find out why it's taking so long. This is on the 8.0 branch, though. Later branches seem to be working. >>I'm inclined to run the psql test one more time to make sure we can >>actually connect, and if not then fail at this point. >> >> > >How does that differ from just iterating the loop one more time? > > There is no guarantee that at the end of the loop we have connected successfully to postgres. I will post a patch that shows what I suggest. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > We need both, I think. I am still trying to find out why it's taking so > long. This is on the 8.0 branch, though. Later branches seem to be working. One idea that comes to mind is a DNS lookup timeout. Can you strace the postmaster to see what it's doing? regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>We need both, I think. I am still trying to find out why it's taking so >>long. This is on the 8.0 branch, though. Later branches seem to be working. >> >> > >One idea that comes to mind is a DNS lookup timeout. Can you strace the >postmaster to see what it's doing? > > > > There is ktrace output I managed to capture at http://developer.postgresql.org/~adunstan/ktrace.txt Not sure what it tells us. I do see it reading in the whole timezone db, which isn't pretty. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> One idea that comes to mind is a DNS lookup timeout. Can you strace the >> postmaster to see what it's doing? > There is ktrace output I managed to capture at > http://developer.postgresql.org/~adunstan/ktrace.txt > Not sure what it tells us. I do see it reading in the whole timezone db, > which isn't pretty. Yeah... I think we fixed that in 8.1. Anyway, the tail end of the trace shows it repeatedly sending off a UDP packet and getting practically the same data back: 24583 postgres CALL socket(0x2,0x2,0)24583 postgres RET socket 324583 postgres CALL sendto(0x3,0x43e1e000,0x25,0,0x493a6338,0x10)24583postgres GIO fd 3 wrote 37 bytes "\M-Sr\^A\0\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"24583postgres RET sendto 37/0x2524583 postgres CALL getpid()24583 postgres RET getpid 24583/0x600724583 postgres CALL select(0x4,0x40739600,0,0,0xfffffffffffe4d90)24583 postgresRET select 124583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)24583postgres GIO fd 3 read 37 bytes "\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"24583postgres RET recvfrom 37/0x2524583 postgres CALL close(0x3)24583 postgres RET close 0 I'm not too up on what the DNS protocol looks like on-the-wire, but I'll bet this is it. I think it's trying to look up "kaltenbrunner.cc" and failing. regards, tom lane
Tom Lane wrote: >Anyway, the tail end of the trace >shows it repeatedly sending off a UDP packet and getting practically the >same data back: > > >I'm not too up on what the DNS protocol looks like on-the-wire, but I'll >bet this is it. I think it's trying to look up "kaltenbrunner.cc" and >failing. > > > Why are we actually looking up anything? Just so we can bind to a listening socket? Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to direct it to use files first, something like lookup file bind Stefan, can you look into that? It would be a bit ugly if it's calling DNS (and failing) to resolve localhost. cheers andrew
On Sun, Jun 18, 2006 at 07:50:04PM -0400, Tom Lane wrote: > 24583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c) > 24583 postgres GIO fd 3 read 37 bytes > "\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A" > 24583 postgres RET recvfrom 37/0x25 > 24583 postgres CALL close(0x3) > 24583 postgres RET close 0 > > I'm not too up on what the DNS protocol looks like on-the-wire, but I'll > bet this is it. I think it's trying to look up "kaltenbrunner.cc" and > failing. Maybe I'm misreading the packet, but I think the query is for ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc) and the DNS server is responding with SRVFAIL. -- Michael Fuhr
On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote: > Maybe I'm misreading the packet, but I think the query is for > ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc) Correction: ''.kaltenbrunner.cc -- Michael Fuhr
Michael Fuhr wrote: > On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote: >> Maybe I'm misreading the packet, but I think the query is for >> ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc) > > Correction: ''.kaltenbrunner.cc yes that is exactly the issue - the postmaster tries to resolve ''.kaltenbrunner.cc multiple times during startup and getting ServFail as a response from the upstream resolver. Stefan
Andrew Dunstan wrote: > > > Tom Lane wrote: > >> Anyway, the tail end of the trace >> shows it repeatedly sending off a UDP packet and getting practically the >> same data back: >> >> >> I'm not too up on what the DNS protocol looks like on-the-wire, but I'll >> bet this is it. I think it's trying to look up "kaltenbrunner.cc" and >> failing. >> >> >> > > Why are we actually looking up anything? Just so we can bind to a > listening socket? > > Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to > direct it to use files first, something like > > lookup file bind > > Stefan, can you look into that? It would be a bit ugly if it's calling > DNS (and failing) to resolve localhost. no - resolving localhost works fine (both using /etc/hosts and through the dns-resolver) - and I infact verified that when we initially started to investigate that issue a while ago :-) Stefan
Stefan Kaltenbrunner wrote: >Andrew Dunstan wrote: > > >>Why are we actually looking up anything? Just so we can bind to a >>listening socket? >> >>Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to >>direct it to use files first, something like >> >> lookup file bind >> >>Stefan, can you look into that? It would be a bit ugly if it's calling >>DNS (and failing) to resolve localhost. >> >> > > >no - resolving localhost works fine (both using /etc/hosts and through >the dns-resolver) - and I infact verified that when we initially started >to investigate that issue a while ago :-) > > > > Why are we looking up 'kaltenbrunner.cc' at all then? In any case, can we just try with that resolver line? The question isn't whether is succeeds, it's how long it takes to succeed. When I increased the pg_regress timeout it actually went through the whole regression test happily. I suspect we have 2 things eating up the 60s timeout here: loading the timezone db and resolving whatever it is we are trying to resolve. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > The question isn't whether is succeeds, it's how long it takes to > succeed. When I increased the pg_regress timeout it actually went > through the whole regression test happily. I suspect we have 2 things > eating up the 60s timeout here: loading the timezone db and resolving > whatever it is we are trying to resolve. The behavior of loading the whole TZ database was there for awhile before anyone noticed; I believe it could only be responsible for a few seconds. So the failed DNS responses must be the problem. Could we get a ktrace with timestamps on the syscalls to confirm that? Of course the $64 question is *why* is 8.0 trying to resolve that name, particularly seeing that the later branches apparently aren't. regards, tom lane
On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote: > Of course the $64 question is *why* is 8.0 trying to resolve that name, > particularly seeing that the later branches apparently aren't. The formatting of the message suggests it is a gethostbyname("''") doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the configuration files? I wonder it it'd be worth adding some conditional code around gethostbyname() calls that warn if the call took longer than say 10 seconds. By printing out that and the string it's looking up you could save a lot of time confirming if the delay is there or elsewhere... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote: > On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote: >> Of course the $64 question is *why* is 8.0 trying to resolve that name, >> particularly seeing that the later branches apparently aren't. > > The formatting of the message suggests it is a gethostbyname("''") > doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the > configuration files? I tcpdump'd the dns-traffic on that box during a postmaster startup and it's definitly trying to look up ''.kaltenbrunner.cc a lot of times. And from what it looks like it might be getting somehow rate limited by my ISPs recursive resolvers after doing the same query a dozens of times and getting a "servfail" every time. At least the timestamps seem to indicate that the responses are getting delayed up to 10 seconds after a number of queries ... It might be a complete shot in the dark but spoonbill worked fine on REL_8_0_STABLE until i disabled reporting 3 month ago. During this time the large escaping security fix/standard_strings patch went in - could this be related in any way ? Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > I tcpdump'd the dns-traffic on that box during a postmaster startup and > it's definitly trying to look up ''.kaltenbrunner.cc a lot of times. I just strace'd postmaster start on a Fedora box and can see nothing corresponding. Since this is a "make check" we know that the PG configuration files it's using are stock ... so it must be something about the system config that's sending it round the bend. What do you have in /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf? regards, tom lane
Oh, I think I see the problem: 8.0 pg_regress: if [ "$unix_sockets" = no ]; then postmaster_options="$postmaster_options -c listen_addresses=$hostname" else postmaster_options="$postmaster_options -c listen_addresses=''" fi 8.1 pg_regress: if [ "$unix_sockets" = no ]; then postmaster_options="$postmaster_options -c listen_addresses=$hostname" else postmaster_options="$postmaster_options -c listen_addresses=" fi regards, tom lane
Stefan Kaltenbrunner wrote: > Tom Lane wrote: > >> Andrew Dunstan <andrew@dunslane.net> writes: >> >>> The question isn't whether is succeeds, it's how long it takes to >>> succeed. When I increased the pg_regress timeout it actually went >>> through the whole regression test happily. I suspect we have 2 things >>> eating up the 60s timeout here: loading the timezone db and resolving >>> whatever it is we are trying to resolve. >>> >> The behavior of loading the whole TZ database was there for awhile >> before anyone noticed; I believe it could only be responsible for a >> few seconds. So the failed DNS responses must be the problem. Could >> we get a ktrace with timestamps on the syscalls to confirm that? >> >> Of course the $64 question is *why* is 8.0 trying to resolve that name, >> particularly seeing that the later branches apparently aren't. >> > > hmm maybe the later branches are trying to resolve that too - but only > the combination of the TZ database loading + the failed DNS-queries is > pushing the startup time over the 60 second limit on this (quite slow) box ? > > I will try to verify what the later branches are doing exactly ... > > > Yes, we're on the margin here. The successful runs I saw got through the timeout in 5 or 10 seconds over the 60 that we currently allow. What interests me is where it even gets the string 'kaltenbrunner.cc' from. It looks to me like the most likely place is the search line in /etc/resolv.conf. Ir would be nice to know exactly what it is trying to resolve. cheers andrew cheers andrew
I wrote: > 8.0 pg_regress: > postmaster_options="$postmaster_options -c listen_addresses=''" > 8.1 pg_regress: > postmaster_options="$postmaster_options -c listen_addresses=" and in fact here's the commit that changed that: 2005-06-19 22:26 tgl * src/test/regress/pg_regress.sh: Change shell syntax that seemsnot to work right on FreeBSD 6-CURRENT buildfarm machines. So apparently it's some generic disease of the BSD shell. I should have back-patched at the time but did not. Will take care of it. On the timezone search business, it's still the case that HEAD will search through all the timezones if it's not given an explicit setting (eg an explicit environment TZ value). We could suppress that by having pg_regress set TZ, but then the regression tests wouldn't exercise the search code at all, which is probably not a great idea. regards, tom lane
Tom Lane wrote: > Oh, I think I see the problem: > > 8.0 pg_regress: > > if [ "$unix_sockets" = no ]; then > postmaster_options="$postmaster_options -c listen_addresses=$hostname" > else > postmaster_options="$postmaster_options -c listen_addresses=''" > fi > > 8.1 pg_regress: > > if [ "$unix_sockets" = no ]; then > postmaster_options="$postmaster_options -c listen_addresses=$hostname" > else > postmaster_options="$postmaster_options -c listen_addresses=" > fi > > > Good catch! I'm impressed! This is surely the heart of the problem. That change (from rev 1.56) clearly needs to be backported to 8.0. cheers andrew
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> The question isn't whether is succeeds, it's how long it takes to >> succeed. When I increased the pg_regress timeout it actually went >> through the whole regression test happily. I suspect we have 2 things >> eating up the 60s timeout here: loading the timezone db and resolving >> whatever it is we are trying to resolve. > > The behavior of loading the whole TZ database was there for awhile > before anyone noticed; I believe it could only be responsible for a > few seconds. So the failed DNS responses must be the problem. Could > we get a ktrace with timestamps on the syscalls to confirm that? > > Of course the $64 question is *why* is 8.0 trying to resolve that name, > particularly seeing that the later branches apparently aren't. hmm maybe the later branches are trying to resolve that too - but only the combination of the TZ database loading + the failed DNS-queries is pushing the startup time over the 60 second limit on this (quite slow) box ? I will try to verify what the later branches are doing exactly ... Stefan