Discussion: There's something rotten in the PG infrastructure

There's something rotten in the PG infrastructure

From
Tom Lane
Date:
Last night, and again tonight, I noticed weird delays in the buildfarm's
response to commits.  Last night I supposed that the buildfarm server was
down, and chided Andrew about it --- but now it seems the blame is
elsewhere, at least in part.  Facts (all times GMT-4 unless noted):

* According to git, Peter committed fab6ca23eaf114d1ae12377c7f5c8c952b5e0159
at Sun, 17 May 2015 03:35:29 +0000 (23:35 -0400).  This is not too far
from reality, because according to my mail logs, the commit message came
through from pgsql-committers at Sat May 16 23:36:19 2015.

* However, neither of my buildfarm critters noticed anything had happened
for about an hour and a half.  prairiedog lit off with a run around 0:55
Sunday, dromedary around 1:10.  (Both of them check every 20 minutes, not
on the same schedule.)  dromedary's run finished around 1:25.

* As of right now, 1:56 AM, the buildfarm status page is not showing an
update from dromedary, or indeed any other machine for nearly four hours.
There should have been a lot of updates by now.

It looks to me like not only is the buildfarm server wedged, but there's
something wrong with pushing from gitmaster to the mirror used by
buildfarm members.  It's not continuous, because stuff pushed during the
day Saturday seemed to get acted on promptly, but what's happening now?
        regards, tom lane
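
The timing argument above can be checked with a little arithmetic. The following is a minimal sketch, assuming the approximate run start times quoted in the message ("around 0:55", "around 1:10") are exact and that GMT-4 applies throughout; the names `pickup_delays` and `POLL_INTERVAL` are made up for this illustration and are not part of the buildfarm client.

```python
from datetime import datetime, timedelta, timezone

# "All times GMT-4 unless noted" in the message above.
EDT = timezone(timedelta(hours=-4))

# Commit timestamp as reported by git (given in UTC in the message).
commit = datetime(2015, 5, 17, 3, 35, 29, tzinfo=timezone.utc)

# Approximate run start times from the message, treated as exact here.
run_starts = {
    "prairiedog": datetime(2015, 5, 17, 0, 55, tzinfo=EDT),
    "dromedary": datetime(2015, 5, 17, 1, 10, tzinfo=EDT),
}

# Both animals are said to check every 20 minutes.
POLL_INTERVAL = timedelta(minutes=20)


def pickup_delays(commit_time, starts):
    """Return how long after the commit each run started."""
    return {name: start - commit_time for name, start in starts.items()}


delays = pickup_delays(commit, run_starts)
for name, delay in delays.items():
    status = "late" if delay > POLL_INTERVAL else "within one poll"
    print(f"{name}: {delay} after commit ({status})")
```

With a 20-minute polling interval, a healthy setup should start a run at most about 20 minutes after the commit becomes visible, so both delays computed here are well outside that window.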



Re: There's something rotten in the PG infrastructure

From
Tom Lane
Date:
I wrote:
> * However, neither of my buildfarm critters noticed anything had happened
> for about an hour and a half.  prairiedog lit off with a run around 0:55
> Sunday, dromedary around 1:10.  (Both of them check every 20 minutes, not
> on the same schedule.)  dromedary's run finished around 1:25.

> * As of right now, 1:56 AM, the buildfarm status page is not showing an
> update from dromedary, or indeed any other machine for nearly four hours.
> There should have been a lot of updates by now.

Hmm ... actually, it looks like both prairiedog and dromedary are
rebuilding repeatedly, which is odd because certainly nothing is happening
on gitmaster.  But the run start times quoted above probably only
represent the runs that were active when I got annoyed enough to go look
at what was happening.
        regards, tom lane



Re: There's something rotten in the PG infrastructure

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I wrote:
> > * However, neither of my buildfarm critters noticed anything had happened
> > for about an hour and a half.  prairiedog lit off with a run around 0:55
> > Sunday, dromedary around 1:10.  (Both of them check every 20 minutes, not
> > on the same schedule.)  dromedary's run finished around 1:25.
>
> > * As of right now, 1:56 AM, the buildfarm status page is not showing an
> > update from dromedary, or indeed any other machine for nearly four hours.
> > There should have been a lot of updates by now.
>
> Hmm ... actually, it looks like both prairiedog and dromedary are
> rebuilding repeatedly, which is odd because certainly nothing is happening
> on gitmaster.  But the run start times quoted above probably only
> represent the runs that were active when I got annoyed enough to go look
> at what was happening.

I've been poking around and I don't see any obvious issues on either
gitmaster or git.p.o.  The cronjobs appear to be running.  Not saying
there isn't an issue but it's at least less than obvious if there is.
Further, these systems are still on wheezy (many other boxes have been
upgraded to jessie at this point), so nothing much has changed on them
in quite some time.  It's possible that there is an issue due to jessie
being on the host server, but hopefully not.
Thanks,
    Stephen

Re: There's something rotten in the PG infrastructure

From
Magnus Hagander
Date:
On May 17, 2015 08:30, "Stephen Frost" <sfrost@snowman.net> wrote:
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > I wrote:
> > > * However, neither of my buildfarm critters noticed anything had happened
> > > for about an hour and a half.  prairiedog lit off with a run around 0:55
> > > Sunday, dromedary around 1:10.  (Both of them check every 20 minutes, not
> > > on the same schedule.)  dromedary's run finished around 1:25.
> >
> > > * As of right now, 1:56 AM, the buildfarm status page is not showing an
> > > update from dromedary, or indeed any other machine for nearly four hours.
> > > There should have been a lot of updates by now.
> >
> > Hmm ... actually, it looks like both prairiedog and dromedary are
> > rebuilding repeatedly, which is odd because certainly nothing is happening
> > on gitmaster.  But the run start times quoted above probably only
> > represent the runs that were active when I got annoyed enough to go look
> > at what was happening.
>
> I've been poking around and I don't see any obvious issues on either
> gitmaster or git.p.o.  The cronjobs appear to be running.  Not saying
> there isn't an issue but it's at least less than obvious if there is.
> Further, these systems are still on wheezy (many other boxes have been
> upgraded to jessie at this point), so nothing much has changed on them
> in quite some time.  It's possible that there is an issue due to jessie
> being on the host server, but hopefully not.
>

We've seen some emails from our buildfarm client as well, where the
buildfarm complains about database errors. I'm guessing some intermittent
errors there, since the site itself works. That's also the reason some
snapshots on the ftp site are currently slightly out of date - they are
uploaded at the end of successful buildfarm runs.

So Afaict, the problem seems to be in the buildfarm, not in the pg
infrastructure.

The quick check for the replication between the git servers is to just
look at the gitweb interface. That one reads directly from the same git
repository as the buildfarm clients would.

/Magnus
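
The check described here is a manual look at gitweb. As a scripted alternative (not what the message proposes), one could compare branch heads on the two repositories with `git ls-remote`. The sketch below is a hedged illustration: the helper names `parse_ls_remote`, `remote_refs`, and `lagging_refs` are invented for this example, and the repository URLs are left as parameters because the thread does not give them.

```python
import subprocess


def parse_ls_remote(output):
    """Parse `git ls-remote` output ("<sha>\\t<ref>" per line) into {ref: sha}."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split("\t", 1)
        refs[ref] = sha
    return refs


def remote_refs(url):
    """Fetch branch heads from a remote; `url` is supplied by the caller."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", url],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ls_remote(result.stdout)


def lagging_refs(upstream, mirror):
    """Return {ref: (upstream_sha, mirror_sha)} for refs the mirror lacks or has stale."""
    return {
        ref: (sha, mirror.get(ref))
        for ref, sha in upstream.items()
        if mirror.get(ref) != sha
    }


# Hypothetical usage, with URLs filled in by whoever runs the check:
#   behind = lagging_refs(remote_refs(UPSTREAM_URL), remote_refs(MIRROR_URL))
#   An empty dict means the mirror has every upstream branch head.
```

An empty result would suggest replication is keeping up, which points the investigation elsewhere (here, at the buildfarm server itself).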

Re: There's something rotten in the PG infrastructure

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> So Afaict, the problem seems to be in the buildfarm, not in the pg
> infrastructure.

The BF status page seems to have started updating again, fifteen minutes
or so ago.  No idea what was up before that.
        regards, tom lane



Re: There's something rotten in the PG infrastructure

From
Magnus Hagander
Date:
On Sun, May 17, 2015 at 9:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > So Afaict, the problem seems to be in the buildfarm, not in the pg
> > infrastructure.
>
> The BF status page seems to have started updating again, fifteen minutes
> or so ago.  No idea what was up before that.

Snapshot building has recovered as well. Hopefully it was something temporary that won't be coming back. 


Re: There's something rotten in the PG infrastructure

From
Andrew Dunstan
Date:
On 05/17/2015 01:57 AM, Tom Lane wrote:
> Last night, and again tonight, I noticed weird delays in the buildfarm's
> response to commits.  Last night I supposed that the buildfarm server was
> down, and chided Andrew about it --- but now it seems the blame is
> elsewhere, at least in part.  Facts (all times GMT-4 unless noted):
>
> * According to git, Peter committed fab6ca23eaf114d1ae12377c7f5c8c952b5e0159
> at Sun, 17 May 2015 03:35:29 +0000 (23:35 -0400).  This is not too far
> from reality, because according to my mail logs, the commit message came
> through from pgsql-committers at Sat May 16 23:36:19 2015.
>
> * However, neither of my buildfarm critters noticed anything had happened
> for about an hour and a half.  prairiedog lit off with a run around 0:55
> Sunday, dromedary around 1:10.  (Both of them check every 20 minutes, not
> on the same schedule.)  dromedary's run finished around 1:25.
>
> * As of right now, 1:56 AM, the buildfarm status page is not showing an
> update from dromedary, or indeed any other machine for nearly four hours.
> There should have been a lot of updates by now.
>
> It looks to me like not only is the buildfarm server wedged, but there's
> something wrong with pushing from gitmaster to the mirror used by
> buildfarm members.  It's not continuous, because stuff pushed during the
> day Saturday seemed to get acted on promptly, but what's happening now?
>
>         


This is my fault, I'm sorry. JD warned me a little while ago that we 
were running out of database disk space, and I said I'd do something 
about it, and let it slip.

I'm taking some emergency measures to relieve the situation.

cheers

andrew
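
Since the root cause turned out to be the database volume filling up, a periodic free-space check is one way to get an earlier warning. This is only a sketch under assumptions: the function names and the 10% default threshold are hypothetical, and nothing here reflects how the buildfarm server is actually monitored.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def needs_attention(path, min_free=0.10):
    """True when free space drops below `min_free` (hypothetical 10% default)."""
    return free_fraction(path) < min_free


# Hypothetical usage from a cron job, with the data directory supplied by the operator:
#   if needs_attention(DATA_DIR):
#       print("database volume is running low on space")
```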