Thread: Yet another infrastructure problem


Yet another infrastructure problem

From: "Greg Sabino Mullane"



People have been complaining on IRC that nothing can be
downloaded from our site, as the mirror-picking script throws
an internal error.

When are we going to fix our infrastructure properly?

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200810240920
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8




Re: Yet another infrastructure problem

From: Magnus Hagander

Greg Sabino Mullane wrote:
> 
> People have been complaining on IRC that nothing can be
> downloaded from our site, as the mirror-picking script throws
> an internal error.
> 
> When are we going to fix our infrastructure properly?

As Stefan has already posted on this very list, he is performing
maintenance on that machine in order to move it to new hardware.

//Magnus


Re: Yet another infrastructure problem

From: "David Blewett"

On Fri, Oct 24, 2008 at 9:22 AM, Greg Sabino Mullane <greg@turnstep.com> wrote:
> People have been complaining on IRC that nothing can be
> downloaded from our site, as the mirror-picking script throws
> an internal error.

It looks like it's still throwing an error:
http://wwwmaster.postgresql.org/download/mirrors-ftp?file=%2Fsource%2Fv8.3.4%2Fpostgresql-8.3.4.tar.bz2

returns:
Internal Server Error
No mirrors were found


Re: Yet another infrastructure problem

From: "Greg Sabino Mullane"



>> People have been complaining on IRC that nothing can be
>> downloaded from our site, as the mirror-picking script throws
>> an internal error.
>
>> When are we going to fix our infrastructure properly?

> As Stefan has already posted on this very list, he is performing
> maintenance on that machine in order to move it to new hardware.

I understand that, but I think this project is big enough, and
important enough, and has enough smart people involved in it,
that things like this should just not happen. Some thoughts, in
order of descending importance to the matter at hand:

* Why do we have so many eggs in one basket? I know that "jails"
allows us to have many subdomains/services on one physical box,
but we've seen three problems with the concept lately:

1) Global software updates that break things in all jails
2) Battling over resources and causing one jail to affect another
3) Hardware problems that affect more than one jail

* One way around problems like this is to mirror the services.
That may involve load balancing, DNS tricks, database replication,
and other assorted goodies. It may be difficult, but it's something
I'd like to at least start us talking about.

* As much as I love the concept of BSD (and I might even be running it
at home if it didn't always coredump while installing on my laptop), we
should realize that there are many people in our community who are
really, really good with Linux. Many of the people on the PG lists do
Linuxy support as their day job. I'm not saying we should dump BSD, but
I'm dismayed to see the resistance given to adding non-BSD boxes to our
mix.


--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200810241713
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8




Re: Yet another infrastructure problem

From: Stefan Kaltenbrunner

Greg Sabino Mullane wrote:
>>> People have been complaining on IRC that nothing can be
>>> downloaded from our site, as the mirror-picking script throws
>>> an internal error.
>>> When are we going to fix our infrastructure properly?
> 
>> As Stefan has already posted on this very list, he is performing
>> maintenance on that machine in order to move it to new hardware.
> 
> I understand that, but I think this project is big enough, and
> important enough, and has enough smart people involved in it,
> that things like this should just not happen. Some thoughts, in
> order of descending importance to the matter at hand:
> 
> * Why do we have so many eggs in one basket? I know that "jails"
> allows us to have many subdomains/services on one physical box,
> but we've seen three problems with the concept lately:
> 
> 1) Global software updates that break things in all jails

we need to do software upgrades once in a while because OSes reach their
EOL date (and therefore lose security support). Software updates tend to
break stuff, and OSes are more complex than a single application, so we
have to expect some issues.
Security/feature upgrades of userspace apps obviously affect only a
single jail.

> 2) Battling over resources and causing one jail to affect another

that one has happened - but only once or twice over the last few
years, so I'm not convinced it is a real issue rather than an isolated
incident.

> 3) Hardware problems that affect more than one jail

the very same would happen if we used some sort of full virtualization
technology, so I'm not sure I see the point. Or are you actively
proposing we should request and run 40+ physical servers in the future?
I don't think that would be sensible in any way (both from a
resource-wasting POV and the administrative overhead - and we don't
have that many boxes either).

> 
> * One way around problems like this is to mirror the services.
> That may involve load balancing, DNS tricks, database replication,
> and other assorted goodies. It may be difficult, but it's something
> I'd like to at least start us talking about.

the low-hanging fruit in that regard has already been taken (have you
seen the static part of the website being down in the last few years?) -
most of the other services are much, much harder to operate in a
load-balanced (or master-master) setup, or doing so is simply overkill.
Furthermore, I don't think that just making services more complex (as in
redundant) will necessarily result in better availability. However, I
acknowledge that we can improve in some areas (like wiki authentication).

> 
> * As much as I love the concept of BSD (and I might even be running it
> at home if it didn't always coredump while installing on my laptop), we
> should realize that the there are many people in our community who are
> really, really good with Linux. Many of the people on the PG lists do
> Linuxy support as their dayjob. I'm not saying we should dump BSD, but
> I'm dismayed to see the resistance given to adding non-BSD boxes to our
> mix.

I'm not against that idea in general (and we already have a fair share of
Linux boxes too), but how would Linux solve any of the issues you mentioned?
All of the Linux distributions have had their fair share of "breaking stuff
with security/point updates/upgrades", and if hardware breaks it doesn't
matter whether we run BSD, Linux or Windows.



Stefan


Re: Yet another infrastructure problem

From: Stefan Kaltenbrunner

David Blewett wrote:
> On Fri, Oct 24, 2008 at 9:22 AM, Greg Sabino Mullane <greg@turnstep.com> wrote:
>> People have been complaining on IRC that nothing can be
>> downloaded from our site, as the mirror-picking script throws
>> an internal error.
> 
> It looks like it's still throwing an error:
> http://wwwmaster.postgresql.org/download/mirrors-ftp?file=%2Fsource%2Fv8.3.4%2Fpostgresql-8.3.4.tar.bz2

seems to work for me - at least now.


Stefan


Re: Yet another infrastructure problem

From: Magnus Hagander

Stefan Kaltenbrunner wrote:
>> * One way around problems like this is to mirror the services.
>> That may involve load balancing, DNS tricks, database replication,
>> and other assorted goodies. It may be difficult, but it's something
>> I'd like to at least start us talking about.
> 
> the low-hanging fruit in that regard has already been taken (have you
> seen the static part of the website being down in the last few years?) -
> most of the other services are much, much harder to operate in a
> load-balanced (or master-master) setup, or doing so is simply overkill.
> Furthermore, I don't think that just making services more complex (as in
> redundant) will necessarily result in better availability. However, I
> acknowledge that we can improve in some areas (like wiki authentication).

I think the most important thing to get a workaround for is our mirror
management, because right now, if wwwmaster goes down, nobody can
download our stuff from the website - even if both the website and 100
ftp servers are up.

Getting that one done shouldn't be too hard, since the data really only
flows one way (I don't mind if we lose click-through stats). So we can
either:

1) replicate the mirror database to a secondary jail somewhere, running
the wwwmaster code. Link the downloads to a separate DNS name that maps
to both these machines, and do checking similar to our static machines
to remove them from DNS if they go down.

2) reimplement the mirror management stuff in client-side javascript
somehow, and serve it off the static mirrors. I'm not entirely sure how
to do this cleanly, or how to fall back if $user has javascript
disabled, but it would have the advantage of not needing another jail.


My vote would be for #1 here, as I think is clear.
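
To illustrate why #1 should be cheap, here is a minimal sketch of the
kind of read-only lookup the secondary would have to serve (the schema
is purely hypothetical - the real wwwmaster code may look different):

    -- Hypothetical mirrors table: one row per mirror, flagged active
    -- by the mirror-checking job that runs on the master.
    CREATE TABLE mirrors (
        id       serial PRIMARY KEY,
        base_url text NOT NULL,    -- e.g. 'http://mirror.example.org/pub/'
        country  text NOT NULL,
        active   boolean NOT NULL DEFAULT false
    );

    -- The mirror-picking step is then a single read-only query, which
    -- any one-way replica of the database can answer:
    SELECT base_url
      FROM mirrors
     WHERE active
     ORDER BY random()
     LIMIT 1;

Since all writes (mirror checks, click-through stats) happen on the
master, a plain one-way replica on the secondary jail is enough.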



The login system would also be good to have distributed, but it's used
by orders of magnitude fewer people. Still, if we replicate the
database off to the other machine, it should be possible to serve the
logins on both machines as well - it's a simple pl/pgsql function that
needs to be called. We'll just need to deal with the "last logged in"
part, which won't work then.
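
For concreteness, a minimal sketch of what such a function might look
like (names and schema hypothetical - the real function differs):

    -- Hypothetical login check: verify credentials and record the
    -- login time. The UPDATE is exactly the "last logged in" part
    -- that breaks on a read-only replica; the lookup itself is fine.
    CREATE FUNCTION check_login(_user text, _md5pass text)
    RETURNS boolean LANGUAGE plpgsql AS $$
    DECLARE
        ok boolean;
    BEGIN
        SELECT (passwordhash = _md5pass) INTO ok
          FROM users
         WHERE username = _user;
        IF ok THEN
            UPDATE users SET last_login = now() WHERE username = _user;
        END IF;
        RETURN COALESCE(ok, false);
    END;
    $$;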

//Magnus



Re: Yet another infrastructure problem

From: Robert Treat

On Saturday 25 October 2008 04:23:34 Magnus Hagander wrote:
> The login system would also be good to have distributed, but it's used
> by orders of magnitude fewer people. Still, if we replicate the
> database off to the other machine, it should be possible to serve the
> logins on both machines as well - it's a simple pl/pgsql function that
> needs to be called. We'll just need to deal with the "last logged in"
> part, which won't work then.

If you used pl/proxy rather than pl/pgsql, I think you could eliminate this problem.
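
Roughly, a sketch of what the proxy side could look like, reusing the
hypothetical check_login() sketched up-thread (the cluster name is
invented, and this assumes the usual pl/proxy cluster-configuration
functions are already set up):

    -- pl/proxy wrapper with the same name and signature: it forwards
    -- the call to one node of the cluster, where the real pl/pgsql
    -- function runs and does its "last logged in" UPDATE locally.
    CREATE FUNCTION check_login(_user text, _md5pass text)
    RETURNS boolean LANGUAGE plproxy AS $$
        CLUSTER 'logincluster';
        RUN ON ANY;
    $$;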

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Yet another infrastructure problem

From: Robert Treat

On Saturday 25 October 2008 04:04:08 Stefan Kaltenbrunner wrote:
> All of the Linux distributions have had their fair share of "breaking stuff
> with security/point updates/upgrades", and if hardware breaks it doesn't
> matter whether we run BSD, Linux or Windows.
>

Good point; we should switch to Solaris! :-D

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Yet another infrastructure problem

From: "Joshua D. Drake"

Robert Treat wrote:
> On Saturday 25 October 2008 04:04:08 Stefan Kaltenbrunner wrote:
>> All of the Linux distributions have had their fair share of "breaking stuff
>> with security/point updates/upgrades", and if hardware breaks it doesn't
>> matter whether we run BSD, Linux or Windows.
>>
> 
> Good point; we should switch to Solaris! :-D
> 

*cough* :P


Guys, we seem to have this argument every 3-6 months; I know I have
started it myself once or twice. So, for the sake of everyone's
bandwidth and time, let me just break it down.

PostgreSQL.Org uses a FreeBSD architecture. To my knowledge there are
only two exceptions to this, one of which will go away by the end of the
month. Don't ask for Linux -- you aren't going to get it.

We use jails. Deal with it.

We do have problems with the infrastructure but they are being dealt 
with as time and resources allow. That being said, the infrastructure 
design in terms of the base technologies is set.

In short, yes, I would like to see us move to all Linux; Debian or Ubuntu
Hardy would be my choice. However, that is not going to happen, so I
have accepted that I will learn to be a FreeBSD admin. I can deal with
that: FreeBSD, although a tad weird for SysV guys, is a good OS, and it
solves the problem we are trying to solve.

I used to buy into the argument of "if we had Linux, more people would
be willing to help". That argument is crap. People will help if they want
to help. They will learn what they need to help. Those that say "if you
were running Linux I would help" have a good heart but aren't people
who are really going to help in the long run anyway.

So can we just put on the Wiki that this is the way it is? That way the 
next time it comes up, we just point.

Joshua D. Drake



Re: Yet another infrastructure problem

From: Magnus Hagander

Robert Treat wrote:
> On Saturday 25 October 2008 04:23:34 Magnus Hagander wrote:
>> The login system would also be good to have distributed, but it's used
>> by orders of magnitude fewer people. Still, if we replicate the
>> database off to the other machine, it should be possible to serve the
>> logins on both machines as well - it's a simple pl/pgsql function that
>> needs to be called. We'll just need to deal with the "last logged in"
>> part, which won't work then.
>>
> 
> If you used pl/proxy rather than pl/pgsql, I think you could eliminate this problem.
> 

Does pl/proxy actually help with that? I haven't used it myself, but
from what I can tell, dealing with failover is still on its TODO list
("RUN ON ANY: if one con failed, try another"). Or?

//Magnus



Re: Yet another infrastructure problem

From: Russell Smith

Magnus Hagander wrote:
> Greg Sabino Mullane wrote:
>   
>> People have been complaining on IRC that nothing can be
>> downloaded from our site, as the mirror-picking script throws
>> an internal error.
>>
>> When are we going to fix our infrastructure properly?
>>     
>
> As Stefan has already posted on this very list, he is performing
> maintenance on that machine in order to move it to new hardware.
>
> //Magnus
>
>   
We are still missing the one important thing: notification. Lots and
lots of people who use the website will never go near the lists, IRC or
anything else. Notifying the email lists of downtime will stop the
heavily involved community from complaining, but it does absolutely
nothing for the general user trying to download something from the internet.

You can argue about replication, downtime and the like until you are
blue in the face. There will always be some downtime. The question is:
how do people know about it, when is it, and what do they do about it?

Until reading this thread I had never even thought about how PostgreSQL
does or doesn't notify people about downtime or potential downtime.
Reading down-thread, this notification issue appears to have been
ignored. To me it seems like relatively low-hanging fruit to allow
messages to be posted on the website about planned outages, and
notifications of recent unplanned outages. Complaining on IRC is one of
the only ways for a casual user to find out what's going on at the
moment. When Marc's hosting had trouble a couple of years back, the
only way to find out anything was on IRC.
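
Even a small amount of structure would do here; as a sketch (table and
columns entirely hypothetical):

    -- Hypothetical outage-notice table that the static frontends
    -- could render as a banner, for planned and unplanned outages.
    CREATE TABLE outage_notices (
        id        serial PRIMARY KEY,
        posted_at timestamptz NOT NULL DEFAULT now(),
        starts_at timestamptz NOT NULL,
        ends_at   timestamptz,            -- NULL = open-ended
        planned   boolean NOT NULL,
        message   text NOT NULL
    );

    -- What a frontend would show right now:
    SELECT message
      FROM outage_notices
     WHERE starts_at <= now()
       AND (ends_at IS NULL OR ends_at > now());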

I'd look into this, but I'd need a lot more knowledge about how the web
stuff is set up, and I'm probably not going to be able to glean that
from people in a couple of weeks. But if I can, great!

Russell.




Re: Yet another infrastructure problem

From: Magnus Hagander

On 26 Oct 2008, at 02.03, Russell Smith <mr-russ@pws.com.au> wrote:

> Magnus Hagander wrote:
>> Greg Sabino Mullane wrote:
>>
>>> People have been complaining on IRC that nothing can be
>>> downloaded from our site, as the mirror-picking script throws
>>> an internal error.
>>>
>>> When are we going to fix our infrastructure properly?
>>>
>>
>> As Stefan has already posted on this very list, he is performing
>> maintenance on that machine in order to move it to new hardware.
>>
>> //Magnus
>>
>>
> We are still missing the one important thing: notification. Lots and
> lots of people who use the website will never go near the lists, IRC
> or anything else. Notifying the email lists of downtime will stop the
> heavily involved community from complaining, but it does absolutely
> nothing for the general user trying to download something from the
> internet.

That is a very good point. And it actually goes to many other parts of  
the project, and not just the infrastructure. Basically the  
authoritative version of *all* important information is the lists.

>
> You can argue about replication, downtime and the like until you are
> blue in the face. There will always be some downtime. The question is:
> how do people know about it, when is it, and what do they do about it?

Agreed.


> Until reading this thread I had never even thought about how
> PostgreSQL does or doesn't notify people about downtime or potential
> downtime. Reading down-thread, this notification issue appears to
> have been ignored. To me it seems like relatively low-hanging fruit
> to allow messages to be posted on the website about planned outages,
> and notifications of recent unplanned

So how do you deal with a case like the one discussed here, where the
web is what didn't work? The static frontends were up, but not the
master which is used to update them...


> outages. Complaining on IRC is one of the only ways for a casual user
> to find out what's going on at the moment.

The casual user would be using the lists, certainly not IRC. People who
aren't deep in the project will certainly hit the lists first, because
that's what we say on our website.

Now, what they really do is email webmaster, which a lot of people did.

That said, I agree a better way would be good to have.

> When Marc's hosting had trouble a couple of years back, the only
> way to find out anything was on IRC.

That outlines one of the major problems: whatever we do must not be too
hard for the guy trying to fix the actual problem. Sending an email is
*easy*, and Stefan did so in this case. But as you also note, even this
is too much for some people.

We could publish a snapshot of our nagios data, but I doubt that would
actually be helpful to these people.


> I'd look into this, but I'd need a lot more knowledge about how the web
> stuff is set up, and I'm probably not going to be able to glean that
> from people in a couple of weeks. But if I can, great!
>

Hey, give it a shot. Just remember that the technical part is the easy
part; creating a process and getting buy-in for that is going to be
the hard part.

/Magnus



Re: Yet another infrastructure problem

From: "Dave Page"

Russell,

The planned maintenance (to replace the troublesome hardware) was
announced publicly by Stefan.

/D



-- 
Dave Page
EnterpriseDB UK:   http://www.enterprisedb.com



Re: Yet another infrastructure problem

From: "Greg Sabino Mullane"



Stefan wrote:

>> 2) Battling over resources and causing one jail to affect another

> that one has happened - but only once or twice over the last few
> years, so I'm not convinced it is a real issue rather than an isolated
> incident.

I think this happens more than you realize. Isn't the jabber service
still causing problems now? Wasn't the wiki recently affected by something
else? Who knows how often it happens to a lesser extent? It's only the
extreme cases that cause notices to be sent to this list.

>> 3) Hardware problems that affect more than one jail

> the very same would happen if we used some sort of full virtualization
> technology, so I'm not sure I see the point. Or are you actively
> proposing we should request and run 40+ physical servers in the future?
> I don't think that would be sensible in any way (both from a
> resource-wasting POV and the administrative overhead - and we don't
> have that many boxes either).

No, not 40+, but having the small handful of important services distributed
on separate boxes/data centers would be a good idea. Specifically, the
archives, search, website, wiki, cvs, and mailing lists should ideally all be
on different servers, to minimize the impact on the project as a whole
when something goes down.

>> * One way around problems like this is to mirror the services.
>> That may involve load balancing, DNS tricks, database replication,
>> and other assorted goodies. It may be difficult, but it's something
>> I'd like to at least start us talking about.

> the low-hanging fruit in that regard has already been taken (have you
> seen the static part of the website being down in the last few years?) -

No, I'm completely happy with the static part of the website.

> most of the other services are much, much harder to operate in a
> load-balanced (or master-master) setup, or doing so is simply overkill.
> Furthermore, I don't think that just making services more complex (as in
> redundant) will necessarily result in better availability. However, I
> acknowledge that we can improve in some areas (like wiki authentication).

Er... how do you figure redundant services do not necessarily result in
better availability? That's kind of the point of redundancy - and we
certainly don't have anywhere near 100% uptime for practically any part
of our infrastructure. I do recognize there is a complexity tradeoff to
be made, so perhaps only some (or none) of the services are worth it.
However, I consider it a valid point to be raised. This touches on
disaster recovery as well, so perhaps some of the services (e.g. cvs)
are already mirrored in some fashion, and all we need to do is tweak
some things?

>> * As much as I love the concept of BSD (and I might even be running it
>> at home if it didn't always coredump while installing on my laptop), we
>> should realize that there are many people in our community who are
>> really, really good with Linux. Many of the people on the PG lists do
>> Linuxy support as their day job. I'm not saying we should dump BSD, but
>> I'm dismayed to see the resistance given to adding non-BSD boxes to our
>> mix.

> I'm not against that idea in general (and we already have a fair share of
> Linux boxes too), but how would Linux solve any of the issues you mentioned?
> All of the Linux distributions have had their fair share of "breaking stuff
> with security/point updates/upgrades", and if hardware breaks it doesn't
> matter whether we run BSD, Linux or Windows.

It wouldn't solve any of the above issues, which is why it was the last
bullet point. As Robert points out, we could just switch to Sun's Solaris;
then we wouldn't have any problems. Look how well MySQL is doing under
their watch! :)

Joshua Drake wrote:

> PostgreSQL.Org uses a FreeBSD architecture. To my knowledge there are
> only two exceptions to this, one of which will go away by the end of the
> month. Don't ask for Linux -- you aren't going to get it.

> We use jails. Deal with it.

We are dealing with it; that's one of the big problems.

> I used to buy into the argument of "if we had Linux, more people would
> be willing to help". That argument is crap. People will help if they want
> to help. They will learn what they need to help. Those that say "if you
> were running Linux I would help" have a good heart but aren't people
> who are really going to help in the long run anyway.

That counter-argument is crap as well. People will 'learn what they need
to help'? This is a volunteer project, so the more barriers we put in
front of people, the less will get done. While homogeneity of servers
can be a good thing from a sysadmin perspective, expanding the pool of
potential help can be as well.

> So can we just put on the Wiki that this is the way it is? That way the
> next time it comes up, we just point.

Next time can we send a message to -announce and -general letting people
know the website, cvs, wiki, and pgadmin are going to be down? I think that
was one of the most annoying aspects of this whole incident.


--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200810292113
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8




Re: Yet another infrastructure problem

From: Alvaro Herrera

Greg Sabino Mullane wrote:

> Stefan wrote:
> 
> > that one has happened - but only once or twice over the last few
> > years, so I'm not convinced it is a real issue rather than an isolated
> > incident.
> 
> I think this happens more than you realize. Isn't the jabber service
> still causing problems now?

This is a bad example, because Jabber runs on a Linux server :-)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Yet another infrastructure problem

From: "Joshua D. Drake"

Alvaro Herrera wrote:
> Greg Sabino Mullane wrote:
> 
>> Stefan wrote:
>>
>>> that one has happened - but only once or twice over the last few
>>> years, so I'm not convinced it is a real issue rather than an isolated
>>> incident.
>> I think this happens more than you realize. Isn't the jabber service
>> still causing problems now?
> 
> This is a bad example, because Jabber runs on a Linux server :-)
> 

And it's actually a hardware issue that is being dealt with, not a
software one.

Joshua D. Drake


Re: Yet another infrastructure problem

From: "Joshua D. Drake"

Greg Sabino Mullane wrote:

> That counter-argument is crap as well. People will 'learn what they need
> to help'? This is a volunteer project, so the more barriers we put in
> front of people, the less will get done. While homogeneity of servers
> can be a good thing from a sysadmin perspective, expanding the pool of
> potential help can be as well.

I learned FreeBSD very much against my will for this project. I use PHP 
very much against my will for this project. I use Docbook SGML very much 
against my will for this project. I use CVS very much against my will 
for this project.

There are plenty of things this project does that I think are plain
outright dumb. However, because I want to contribute to this project,
and that is what the project (regardless of whether I agree) has deemed
will be done, I do them.

> 
>> So can we just put on the Wiki that this is the way it is? That way the
>> next time it comes up, we just point.
> 
> Next time can we send a message to -announce and -general letting people
> know the website, cvs, wiki, and pgadmin are going to be down? I think that
> was one of the most annoying aspects of this whole incident.

Well I certainly can't argue with that.

Joshua D. Drake


Re: Yet another infrastructure problem

From: Andrew Sullivan

On Wed, Oct 29, 2008 at 09:15:20PM -0700, Joshua D. Drake wrote:
>
> And it's actually a hardware issue that is being dealt with, not a
> software one.

I fail totally to see how either that, or the OS in question, in any
way constitutes a premise against Greg's argument.  His argument is
just that there are a lot of services, and several of them appear, to
the uneducated eye, to be rather less reliable than one might hope.

He has proposed a way to help: increase diversity of the systems by
introducing another operating system and some additional hardware.
Apart from the duplication of services, such a diversity of code bases
adds robustness to a distributed system, because a code problem in one
system will not affect the other one.  (For instance, if it turns out
that jails have a bug in some release of FreeBSD, it's likely not to
be the same problem in xen running on Linux.)  As a side benefit, this
might lower the initial cost of volunteering for enough people that
there would be more volunteers.  (Just to prove I can argue both sides
of the fence, though: adding more sysadmins to a distributed system
often does not improve the reliability of the system.  The system
needs to be designed for many hands, and I don't know if this one is.)

Since I'm officially Not Volunteering to help with this, I don't have
a dog in the race.  But I don't think responding to Greg's sound
argument with red herrings is going to address his point.  "This is
what we picked; deal with it," is a pretty lame argument in the face
of public failures of the stuff one picked.

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Yet another infrastructure problem

From: "Dave Page"

On Thu, Oct 30, 2008 at 12:42 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> On Wed, Oct 29, 2008 at 09:15:20PM -0700, Joshua D. Drake wrote:
>>
>> And it's actually a hardware issue that is being dealt with, not a
>> software one.
>
> I fail totally to see how either that, or the OS in question, in any
> way constitutes a premise against Greg's argument.

He raised the issue of the Jabber server (and some other services) in
response to a comment on how we've (once) seen a FreeBSD jail hog
resources and adversely affect another.

> His argument is
> just that there are a lot of services, and several of them appear, to
> the uneducated eye, to be rather less reliable than one might hope.

Right - and as far as I'm aware, pretty much all of those issues boil
down to what a couple of days ago was thought to be two hardware
issues. One of them definitely is (the Linux-based Jabber server), which
CP staff (or JD) are apparently migrating to new hardware; the other
now looks like a kernel bug that manifests itself on certain hardware
configurations, which is hopefully now resolved as well.

> He has proposed a way to help: increase diversity of the systems by
> introducing another operating system and some additional hardware.

Aside from us not having the additional hardware, running multiple OSs
itself adds significant management overhead, as I'm sure you realise -
in a project where volunteers to handle the day-to-day tasks rarely last
more than a week, that's something that's difficult to justify to the
few of us who do continue to do the work on an ongoing basis.

Further, it reduces our ability to re-deploy services quickly wherever
we like (though granted, that would be offset somewhat by having
more machines). We would need to ensure that our OSs were equally well
spread out geographically to ensure we could redeploy any service in
any data center as we currently can (and, on occasion, do).

I guess what I'm saying is that whilst in an ideal world we'd have
diverse OSs, different hardware for all services, and even different
data centers for each, in reality it just isn't practical for us.

-- 
Dave Page
EnterpriseDB UK:   http://www.enterprisedb.com


Re: Yet another infrastructure problem

From: Robert Treat

On Thursday 30 October 2008 09:06:32 Dave Page wrote:
> Further, it reduces our ability to re-deploy services quickly wherever
> we like (though granted, that would be offset somewhat by having
> more machines). We would need to ensure that our OSs were equally well
> spread out geographically to ensure we could redeploy any service in
> any data center as we currently can (and, on occasion, do).
>

Is this really true? ISTM that for much of the software we maintain, having
the config files / scripts all checked into svn would be enough to enable us
to re-deploy on even completely different hardware/OS without a large amount
of effort, which would probably open up other options for hosting... case in
point, I'm pretty sure we could pop out a VM for postgres needs, but it would
need to run Solaris, since we're geared toward zones. For something like
jabber, I believe we already run the same jabber software as postgresql.org,
so it sure seems like it should be easy to move it back and forth even across
different systems... I'd guess there are other services like this too, and
probably other people in similar situations.

-- 
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com


Re: Yet another infrastructure problem

From: "Dave Page"

On Fri, Oct 31, 2008 at 2:40 AM, Robert Treat
<xzilla@users.sourceforge.net> wrote:

> Is this really true? ISTM that for much of the software we maintain, having
> the config files / scripts all checked into svn would be enough to enable us
> to re-deploy on even completely different hardware/OS without a large amount
> of effort, which would probably open up other options for hosting... case in
> point, I'm pretty sure we could pop out a VM for postgres needs, but it would
> need to run Solaris, since we're geared toward zones. For something like
> jabber, I believe we already run the same jabber software as postgresql.org,
> so it sure seems like it should be easy to move it back and forth even across
> different systems... I'd guess there are other services like this too, and
> probably other people in similar situations.

By 'redeploy', I mean start a backup copy of a jail on a different box
within a few minutes. As you say, we also automatically back up key
config files to SVN, so we can move services to different OSs as well,
but obviously not nearly as quickly.

-- 
Dave Page
EnterpriseDB UK:   http://www.enterprisedb.com