Discussion: spoonbill vs. -HEAD
Hi all!

I finally started to investigate why spoonbill stopped reporting to the
buildfarm about 2 months ago. It seems that the foreign-keys locking
patch (or something committed very close to January 23rd) broke it in a
fairly annoying way: running the buildfarm script consistently "stalls"
during the isolationtester part of the regression testing, leaving the
postgresql instance running and causing all future buildfarm runs to
fail...

The process listing at that time looks like:
https://www.kaltenbrunner.cc/files/process_listing.txt

pg_stat_activity of the running instance:
https://www.kaltenbrunner.cc/files/pg_stat_activity.txt

pg_locks:
https://www.kaltenbrunner.cc/files/pg_locks.txt

backtraces of the three backends:
https://www.kaltenbrunner.cc/files/bt_20467.txt
https://www.kaltenbrunner.cc/files/bt_20897.txt
https://www.kaltenbrunner.cc/files/bt_24285.txt

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> I finally started to investigate why spoonbill stopped reporting to the
> buildfarm about 2 months ago. It seems that the foreign-keys locking
> patch (or something committed very close to January 23rd) broke it in a
> fairly annoying way: running the buildfarm script consistently "stalls"
> during the isolationtester part of the regression testing, leaving the
> postgresql instance running and causing all future buildfarm runs to
> fail...

It looks from here like the isolationtester client is what's dropping
the ball --- the backend states are unsurprising, and two of them are
waiting for a new client command. Can you get a stack trace from the
isolationtester process?

regards, tom lane
On 03/26/2013 08:45 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> I finally started to investigate why spoonbill stopped reporting to the
>> buildfarm about 2 months ago. It seems that the foreign-keys locking
>> patch (or something committed very close to January 23rd) broke it in a
>> fairly annoying way: running the buildfarm script consistently "stalls"
>> during the isolationtester part of the regression testing, leaving the
>> postgresql instance running and causing all future buildfarm runs to
>> fail...
>
> It looks from here like the isolationtester client is what's dropping
> the ball --- the backend states are unsurprising, and two of them are
> waiting for a new client command. Can you get a stack trace from the
> isolationtester process?

https://www.kaltenbrunner.cc/files/isolationtester.txt

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 03/26/2013 08:45 PM, Tom Lane wrote:
>> It looks from here like the isolationtester client is what's dropping
>> the ball --- the backend states are unsurprising, and two of them are
>> waiting for a new client command. Can you get a stack trace from the
>> isolationtester process?

> https://www.kaltenbrunner.cc/files/isolationtester.txt

Hmm ... isolationtester.c:584 is in the code that tries to cancel the
current permutation (test case) after realizing that it's constructed
an invalid permutation. It looks like the preceding PQcancel() failed
for some reason, since the waiting backend is still waiting. The
isolationtester code doesn't bother to check for an error result there,
which is kinda bad, not that it's clear what it could do about it.

Could you look at the contents of the local variable "buf" in the
run_permutation stack frame? Or else try modifying the code along the
lines of

-	PQcancel(cancel, buf, sizeof(buf));
+	if (!PQcancel(cancel, buf, sizeof(buf)))
+		fprintf(stderr, "PQcancel failed: %s\n", buf);

and see if it prints anything interesting before hanging up.

regards, tom lane
On 03/26/2013 02:50 PM, Stefan Kaltenbrunner wrote:
> Hi all!
>
> I finally started to investigate why spoonbill stopped reporting to the
> buildfarm about 2 months ago. It seems that the foreign-keys locking
> patch (or something committed very close to January 23rd) broke it in a
> fairly annoying way: running the buildfarm script consistently "stalls"
> during the isolationtester part of the regression testing, leaving the
> postgresql instance running and causing all future buildfarm runs to
> fail...

There is some timeout code already in the buildfarm client. It was
originally put there to help us when we got CVS hangs, a not infrequent
occurrence in the early days, so it's currently only used (if
configured) for the checkout phase. But it could easily be reused to
create a build timeout that kills the whole process group when the
timeout expires. It wouldn't work on Windows, and of course it won't
solve whatever problem caused the hang in the first place, but it still
might be worth doing.

cheers

andrew
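The kill-the-whole-process-group timeout Andrew describes could be sketched roughly as follows. (The actual buildfarm client is Perl; `run_with_timeout`, its return conventions, and the 100 ms polling interval are illustrative assumptions, not buildfarm code.)

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch: run a command in its own process group and kill
 * the entire group if it runs longer than timeout_secs.  Returns the
 * command's exit status, -1 on timeout, -2 on abnormal termination.
 */
static int
run_with_timeout(char *const argv[], int timeout_secs)
{
	pid_t		child = fork();

	if (child < 0)
		return -2;
	if (child == 0)
	{
		setpgid(0, 0);			/* become leader of a fresh process group */
		execvp(argv[0], argv);
		_exit(127);				/* exec failed */
	}

	/* poll for exit, roughly every 100 ms */
	for (int i = 0; i < timeout_secs * 10; i++)
	{
		int			status;

		if (waitpid(child, &status, WNOHANG) == child)
			return WIFEXITED(status) ? WEXITSTATUS(status) : -2;
		usleep(100000);
	}

	killpg(child, SIGKILL);		/* timeout: kill the whole group */
	waitpid(child, NULL, 0);
	return -1;
}
```

Because the child calls setpgid() before exec, a stalled grandchild (such as a postgres backend left behind by a hung isolationtester) gets killed along with its parent, which is exactly what a stuck buildfarm run needs.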
Andrew Dunstan <andrew@dunslane.net> writes:
> There is some timeout code already in the buildfarm client. It was
> originally put there to help us when we got CVS hangs, a not infrequent
> occurrence in the early days, so it's currently only used if configured
> for the checkout phase, but it could easily be used to create a build
> timeout which would kill the whole process group if the timeout expired.
> It wouldn't work on Windows, and of course it won't solve whatever
> problem caused the hang in the first place, but it still might be worth
> doing.

+1 --- at least then we'd get reports of failures, rather than the
current behavior where the animal just stops reporting.

regards, tom lane
On 03/26/2013 09:33 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> On 03/26/2013 08:45 PM, Tom Lane wrote:
>>> It looks from here like the isolationtester client is what's dropping
>>> the ball --- the backend states are unsurprising, and two of them are
>>> waiting for a new client command. Can you get a stack trace from the
>>> isolationtester process?
>
>> https://www.kaltenbrunner.cc/files/isolationtester.txt
>
> Hmm ... isolationtester.c:584 is in the code that tries to cancel the
> current permutation (test case) after realizing that it's constructed
> an invalid permutation. It looks like the preceding PQcancel() failed
> for some reason, since the waiting backend is still waiting. The
> isolationtester code doesn't bother to check for an error result there,
> which is kinda bad, not that it's clear what it could do about it.
> Could you look at the contents of the local variable "buf" in the
> run_permutation stack frame? Or else try modifying the code along the
> lines of
>
> -	PQcancel(cancel, buf, sizeof(buf));
> +	if (!PQcancel(cancel, buf, sizeof(buf)))
> +		fprintf(stderr, "PQcancel failed: %s\n", buf);
>
> and see if it prints anything interesting before hanging up.

hmm - will look into that in a bit - but I also just noticed that on the
same day spoonbill broke there was also a commit to that file,
immediately before that code, adding the fflush() calls.

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> hmm - will look into that in a bit - but I also just noticed that on the
> same day spoonbill broke there was also a commit to that file
> immediately before that code adding the fflush() calls.

It's hard to see how those would be related to this symptom. My bet
is that the new fk-deadlock test exposed some pre-existing issue in
isolationtester. Not quite clear what yet, though.

A different line of thought is that the cancel was received by the
backend but didn't succeed in cancelling the query for some reason.

regards, tom lane
On 03/26/2013 11:30 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> hmm - will look into that in a bit - but I also just noticed that on the
>> same day spoonbill broke there was also a commit to that file
>> immediately before that code adding the fflush() calls.
>
> It's hard to see how those would be related to this symptom. My bet
> is that the new fk-deadlock test exposed some pre-existing issue in
> isolationtester. Not quite clear what yet, though.

yeah, removing them does not seem to change the behaviour at all.

> A different line of thought is that the cancel was received by the
> backend but didn't succeed in cancelling the query for some reason.

I added the "PQcancel failed" codepath you suggested, but it does not
seem to get triggered at all, so the above might actually be what is
happening...

Stefan
On 03/26/2013 10:18 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> There is some timeout code already in the buildfarm client. It was
>> originally put there to help us when we got CVS hangs, a not infrequent
>> occurrence in the early days, so it's currently only used if configured
>> for the checkout phase, but it could easily be used to create a build
>> timeout which would kill the whole process group if the timeout expired.
>> It wouldn't work on Windows, and of course it won't solve whatever
>> problem caused the hang in the first place, but it still might be worth
>> doing.
>
> +1 --- at least then we'd get reports of failures, rather than the
> current behavior where the animal just stops reporting.

yeah, I have had multiple buildfarm members running into similar issues
(like the still-unexplained issues on spoonbill from a year back:
http://www.postgresql.org/message-id/4FE4B674.3020500@kaltenbrunner.cc),
so I would really like to see an option for a global timeout.

Stefan
On 03/27/2013 03:49 PM, Stefan Kaltenbrunner wrote:
> On 03/26/2013 10:18 PM, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> There is some timeout code already in the buildfarm client. It was
>>> originally put there to help us when we got CVS hangs, a not infrequent
>>> occurrence in the early days, so it's currently only used if configured
>>> for the checkout phase, but it could easily be used to create a build
>>> timeout which would kill the whole process group if the timeout expired.
>>> It wouldn't work on Windows, and of course it won't solve whatever
>>> problem caused the hang in the first place, but it still might be worth
>>> doing.
>>
>> +1 --- at least then we'd get reports of failures, rather than the
>> current behavior where the animal just stops reporting.
>
> yeah I have had multiple buildfarm members running into similar issues
> (like the still-unexplained issues on spoonbill from a year back:
> http://www.postgresql.org/message-id/4FE4B674.3020500@kaltenbrunner.cc)
> so I would really like to see an option for a global timeout.

I have been looking at it briefly. There are a few wrinkles, but it's
on the TODO list.

cheers

andrew
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 03/26/2013 11:30 PM, Tom Lane wrote:
>> A different line of thought is that the cancel was received by the
>> backend but didn't succeed in cancelling the query for some reason.

> I added the "PQcancel failed" codepath you suggested, but it does not
> seem to get triggered at all, so the above might actually be what is
> happening...

Stefan was kind enough to grant me access to spoonbill, and after some
experimentation I found out the problem. It seems that OpenBSD blocks
additional deliveries of a signal while the signal handler is in
progress, and that this is implemented by just calling sigprocmask()
before and after calling the handler. Therefore, if the handler doesn't
return normally --- like, say, it longjmps --- the restoration of the
previous mask never happens. So we're left with the signal still
blocked, meaning second and subsequent attempts to interrupt the
backend don't work.

This isn't revealed by the regular regression tests because they don't
exercise PQcancel, but several recently-added isolation tests do
attempt to PQcancel the same backend more than once. It's a bit
surprising that it's taken us this long to recognize the problem.
Typical use of PQcancel doesn't necessarily cause a failure:
StatementCancelHandler() won't exit through longjmp unless
ImmediateInterruptOK is true, which is only the case while waiting for
a heavyweight lock. But still, you'd think somebody would've run into
the case in normal usage.

I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
StatementCancelHandler() and any other handlers that might exit via
longjmp. I'm a bit inclined to only do this on platforms where a
problem is demonstrable, which so far is only OpenBSD. (You'd think
that all BSDen would have the same issue, but the buildfarm shows
otherwise.)
BTW, this does not seem to explain the symptoms shown at
http://www.postgresql.org/message-id/4FE4D89A.8020002@kaltenbrunner.cc
because what we were seeing there was that *all* signals appeared to be
blocked. However, after this round of debugging I no longer have a lot
of faith in OpenBSD's ps, because it was lying to me about whether the
process had signals blocked or not (or at least, it couldn't see the
effects of the interrupt-signal disable, although when I added
debugging code to print the active signal mask according to
sigprocmask() I got told the truth). So I'm not sure how much trust to
put in those older ps results. It's possible that the previous failures
were a manifestation of something related to this bug.

regards, tom lane
I wrote:
> I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
> StatementCancelHandler() and any other handlers that might exit via
> longjmp. I'm a bit inclined to only do this on platforms where a
> problem is demonstrable, which so far is only OpenBSD. (You'd
> think that all BSDen would have the same issue, but the buildfarm
> shows otherwise.)

BTW, on further thought it seems like maybe this is an OpenBSD bug, at
least in part: what is evidently happening is that the temporary
blockage of SIGINT during the handler persists even after we've
longjmp'd back to the main loop. But we're using sigsetjmp(..., 1) to
establish that longjmp handler --- so why isn't the original signal
mask reinstalled when we return to the main loop?

If (your version of) OpenBSD is getting this wrong, it'd explain why
we've not seen similar behavior elsewhere.

This theory doesn't really exonerate our code completely, because we
use sigsetjmp(..., 0) in PG_TRY, which means that in a code path where
we catch a longjmp and don't PG_RE_THROW it, we could be left with the
signal disabled. I don't believe that is happening in the
isolation-test cases, though.

regards, tom lane
On 04/03/2013 12:59 AM, Tom Lane wrote:
> I wrote:
>> I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
>> StatementCancelHandler() and any other handlers that might exit via
>> longjmp. I'm a bit inclined to only do this on platforms where a
>> problem is demonstrable, which so far is only OpenBSD. (You'd
>> think that all BSDen would have the same issue, but the buildfarm
>> shows otherwise.)
>
> BTW, on further thought it seems like maybe this is an OpenBSD bug,
> at least in part: what is evidently happening is that the temporary
> blockage of SIGINT during the handler persists even after we've
> longjmp'd back to the main loop. But we're using sigsetjmp(..., 1)
> to establish that longjmp handler --- so why isn't the original signal
> mask reinstalled when we return to the main loop?
>
> If (your version of) OpenBSD is getting this wrong, it'd explain why
> we've not seen similar behavior elsewhere.

hmm - trolling the openbsd cvs history brings up this:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 04/03/2013 12:59 AM, Tom Lane wrote:
>> BTW, on further thought it seems like maybe this is an OpenBSD bug,
>> at least in part: what is evidently happening is that the temporary
>> blockage of SIGINT during the handler persists even after we've
>> longjmp'd back to the main loop. But we're using sigsetjmp(..., 1)
>> to establish that longjmp handler --- so why isn't the original signal
>> mask reinstalled when we return to the main loop?
>>
>> If (your version of) OpenBSD is getting this wrong, it'd explain why
>> we've not seen similar behavior elsewhere.

> hmm - trolling the openbsd cvs history brings up this:
> http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143

That's about alternate signal stacks, which we're not using.

I put together a simple test program (attached) and tried it on
spoonbill, and it says that the signal *does* get unblocked when
control returns to the sigsetjmp(...,1). So now I'm really confused.
Somehow the results we're getting in a full-fledged backend do not
match up with the results gotten by this test program ... but why?

regards, tom lane
On 04/04/2013 02:18 AM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> hmm - trolling the openbsd cvs history brings up this:
>> http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/sparc64/sparc64/machdep.c?r1=1.143;sortby=date#rev1.143
>
> That's about alternate signal stacks, which we're not using.
>
> I put together a simple test program (attached) and tried it on
> spoonbill, and it says that the signal *does* get unblocked when
> control returns to the sigsetjmp(...,1). So now I'm really confused.
> Somehow the results we're getting in a full-fledged backend do not
> match up with the results gotten by this test program ... but why?

as a followup to this - I spent some time upgrading spoonbill to the
latest OpenBSD release (5.3) the other day, and it seems to be able to
pass a full regression test run now on a manual run. I will add it to
the regular reporting schedule again, but it seems at least part of the
problem is/was an operating-system-level issue that got fixed in either
5.2 or 5.3 (spoonbill was on 5.1 before).

Stefan