Discussion: There's something rotten in the PG infrastructure
Last night, and again tonight, I noticed weird delays in the buildfarm's
response to commits. Last night I supposed that the buildfarm server was
down, and chided Andrew about it --- but now it seems the blame is
elsewhere, at least in part. Facts (all times GMT-4 unless noted):

* According to git, Peter committed fab6ca23eaf114d1ae12377c7f5c8c952b5e0159
at Sun, 17 May 2015 03:35:29 +0000 (23:35 -0400). This is not too far
from reality, because according to my mail logs, the commit message came
through from pgsql-committers at Sat May 16 23:36:19 2015.

* However, neither of my buildfarm critters noticed anything had happened
for about an hour and a half. prairiedog lit off with a run around 0:55
Sunday, dromedary around 1:10. (Both of them check every 20 minutes, not
on the same schedule.) dromedary's run finished around 1:25.

* As of right now, 1:56 AM, the buildfarm status page is not showing an
update from dromedary, or indeed any other machine for nearly four hours.
There should have been a lot of updates by now.

It looks to me like not only is the buildfarm server wedged, but there's
something wrong with pushing from gitmaster to the mirror used by
buildfarm members. It's not continuous, because stuff pushed during the
day Saturday seemed to get acted on promptly, but what's happening now?

regards, tom lane
I wrote:
> * However, neither of my buildfarm critters noticed anything had happened
> for about an hour and a half. prairiedog lit off with a run around 0:55
> Sunday, dromedary around 1:10. (Both of them check every 20 minutes, not
> on the same schedule.) dromedary's run finished around 1:25.

> * As of right now, 1:56 AM, the buildfarm status page is not showing an
> update from dromedary, or indeed any other machine for nearly four hours.
> There should have been a lot of updates by now.

Hmm ... actually, it looks like both prairiedog and dromedary are
rebuilding repeatedly, which is odd because certainly nothing is happening
on gitmaster. But the run start times quoted above probably only
represent the runs that were active when I got annoyed enough to go look
at what was happening.

regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I wrote:
> > * However, neither of my buildfarm critters noticed anything had happened
> > for about an hour and a half. prairiedog lit off with a run around 0:55
> > Sunday, dromedary around 1:10. (Both of them check every 20 minutes, not
> > on the same schedule.) dromedary's run finished around 1:25.
>
> > * As of right now, 1:56 AM, the buildfarm status page is not showing an
> > update from dromedary, or indeed any other machine for nearly four hours.
> > There should have been a lot of updates by now.
>
> Hmm ... actually, it looks like both prairiedog and dromedary are
> rebuilding repeatedly, which is odd because certainly nothing is happening
> on gitmaster. But the run start times quoted above probably only
> represent the runs that were active when I got annoyed enough to go look
> at what was happening.

I've been poking around and I don't see any obvious issues on either
gitmaster or git.p.o. The cronjobs appear to be running. Not saying
there isn't an issue but it's at least less than obvious if there is.
Further, these systems are still on wheezy (many other boxes have been
upgraded to jessie at this point), so nothing much has changed on them
in quite some time. It's possible that there is an issue due to jessie
being on the host server, but hopefully not.

Thanks,

Stephen
On May 17, 2015 08:30, "Stephen Frost" <sfrost@snowman.net> wrote:
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > I wrote:
> > > * However, neither of my buildfarm critters noticed anything had happened
> > > for about an hour and a half. prairiedog lit off with a run around 0:55
> > > Sunday, dromedary around 1:10. (Both of them check every 20 minutes, not
> > > on the same schedule.) dromedary's run finished around 1:25.
> >
> > > * As of right now, 1:56 AM, the buildfarm status page is not showing an
> > > update from dromedary, or indeed any other machine for nearly four hours.
> > > There should have been a lot of updates by now.
> >
> > Hmm ... actually, it looks like both prairiedog and dromedary are
> > rebuilding repeatedly, which is odd because certainly nothing is happening
> > on gitmaster. But the run start times quoted above probably only
> > represent the runs that were active when I got annoyed enough to go look
> > at what was happening.
>
> I've been poking around and I don't see any obvious issues on either
> gitmaster or git.p.o. The cronjobs appear to be running. Not saying
> there isn't an issue but it's at least less than obvious if there is.
> Further, these systems are still on wheezy (many other boxes have been
> upgraded to jessie at this point), so nothing much has changed on them
> in quite some time. It's possible that there is an issue due to jessie
> being on the host server, but hopefully not.
>

We've seen some emails from our buildfarm client as well, where the buildfarm complains about database errors. I'm guessing some intermittent errors there, since the site itself works. That's also the reason some snapshots on the ftp site are currently slightly out of date - they are uploaded at the end of successful buildfarm runs.

So Afaict, the problem seems to be in the buildfarm, not in the pg infrastructure.

The quick check for the replication between the git servers is to just look at the gitweb interface. That one reads directly from the same git repository as the buildfarm clients would.

/Magnus
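
The replication check described above can also be done without a browser by comparing branch heads on the two repositories. The following is only a minimal sketch under stated assumptions, not how the infrastructure team actually checks it: the helper names (`parse_ls_remote`, `lagging_refs`, `ls_remote`) are hypothetical, and the repository URLs are left as parameters because the thread does not give them. It relies only on `git ls-remote --heads`, which prints one `<hash><TAB><ref>` pair per line.

```python
import subprocess


def parse_ls_remote(output):
    """Map ref name -> commit hash from `git ls-remote` output."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is "<hash><whitespace><ref name>".
        sha, ref = line.split(None, 1)
        refs[ref.strip()] = sha
    return refs


def lagging_refs(upstream, mirror):
    """Return refs whose hash on the mirror differs from upstream.

    The value is (upstream_hash, mirror_hash); mirror_hash is None
    when the mirror does not have the ref at all.
    """
    return {
        ref: (sha, mirror.get(ref))
        for ref, sha in upstream.items()
        if mirror.get(ref) != sha
    }


def ls_remote(url):
    """Fetch branch heads from a repository URL (needs git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", url],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ls_remote(result.stdout)
```

Calling `lagging_refs(ls_remote(upstream_url), ls_remote(mirror_url))` with the two real URLs substituted in would return an empty dict when the mirror is caught up, and the out-of-date branches otherwise.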
Magnus Hagander <magnus@hagander.net> writes:
> So Afaict, the problem seems to be in the buildfarm, not in the pg
> infrastructure.

The BF status page seems to have started updating again, fifteen minutes
or so ago. No idea what was up before that.

regards, tom lane
On Sun, May 17, 2015 at 9:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > So Afaict, the problem seems to be in the buildfarm, not in the pg
> > infrastructure.
>
> The BF status page seems to have started updating again, fifteen minutes
> or so ago. No idea what was up before that.

Snapshot building has recovered as well. Hopefully it was something temporary that won't be coming back.
On 05/17/2015 01:57 AM, Tom Lane wrote:
> Last night, and again tonight, I noticed weird delays in the buildfarm's
> response to commits. Last night I supposed that the buildfarm server was
> down, and chided Andrew about it --- but now it seems the blame is
> elsewhere, at least in part. Facts (all times GMT-4 unless noted):
>
> * According to git, Peter committed fab6ca23eaf114d1ae12377c7f5c8c952b5e0159
> at Sun, 17 May 2015 03:35:29 +0000 (23:35 -0400). This is not too far
> from reality, because according to my mail logs, the commit message came
> through from pgsql-committers at Sat May 16 23:36:19 2015.
>
> * However, neither of my buildfarm critters noticed anything had happened
> for about an hour and a half. prairiedog lit off with a run around 0:55
> Sunday, dromedary around 1:10. (Both of them check every 20 minutes, not
> on the same schedule.) dromedary's run finished around 1:25.
>
> * As of right now, 1:56 AM, the buildfarm status page is not showing an
> update from dromedary, or indeed any other machine for nearly four hours.
> There should have been a lot of updates by now.
>
> It looks to me like not only is the buildfarm server wedged, but there's
> something wrong with pushing from gitmaster to the mirror used by
> buildfarm members. It's not continuous, because stuff pushed during the
> day Saturday seemed to get acted on promptly, but what's happening now?
>
>

This is my fault, I'm sorry. JD warned me a little while ago that we were running out of database disk space, and I said I'd do something about it, and let it slip. I'm taking some emergency measures to relieve the situation.

cheers

andrew