Discussion: Yet another infrastructure problem
-----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 People have been complaining on IRC that nothing can be downloaded from our site, as the mirror-picking script throws an internal error. When are we going to fix our infrastructure properly? - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200810240920 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAkkBy8gACgkQvJuQZxSWSsjlzQCghZjQgwb4tpaAflhYfesj9RWS NsUAn3nxlF3yDoMx8B7rolH/qq5HWxuc =4vcE -----END PGP SIGNATURE-----
Greg Sabino Mullane wrote: > > People have been complaining on IRC that nothing can be > downloaded from our site, as the mirror-picking script throws > an internal error. > > When are we going to fix our infrastructure properly? As Stefan has already posted on this very list, he is performing maintenance on that machine in order to move it to new hardware. //Magnus
On Fri, Oct 24, 2008 at 9:22 AM, Greg Sabino Mullane <greg@turnstep.com> wrote: > People have been complaining on IRC that nothing can be > downloaded from our site, as the mirror-picking script throws > an internal error. It looks like it's still throwing an error: http://wwwmaster.postgresql.org/download/mirrors-ftp?file=%2Fsource%2Fv8.3.4%2Fpostgresql-8.3.4.tar.bz2 returns: Internal Server Error No mirrors were found
-----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 >> People have been complaining on IRC that nothing can be >> downloaded from our site, as the mirror-picking script throws >> an internal error. > >> When are we going to fix our infrastructure properly? > As Stefan has already posted on this very list, he is performing > maintenance on that machine in order to move it to new hardware. I understand that, but I think this project is big enough, and important enough, and has enough smart people involved in it, that things like this should just not happen. Some thoughts, in order of descending importance to the matter at hand: * Why do we have so many eggs in one basket? I know that "jails" allow us to have many subdomains/services on one physical box, but we've seen three problems with the concept lately: 1) Global software updates that break things in all jails 2) Battling over resources, causing one jail to affect another 3) Hardware problems that affect more than one jail * One way around problems like this is to mirror the services. That may involve load balancing, DNS tricks, database replication, and other assorted goodies. It may be difficult, but it's something I'd like to at least start us talking about. * As much as I love the concept of BSD (and I might even be running it at home if it didn't always coredump while installing on my laptop), we should realize that there are many people in our community who are really, really good with Linux. Many of the people on the PG lists do Linuxy support as their dayjob. I'm not saying we should dump BSD, but I'm dismayed to see the resistance given to adding non-BSD boxes to our mix. - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200810241713 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAkkCPIYACgkQvJuQZxSWSsgTPwCg86uW8cM+x6sLXnIPCIUoXNnD 21sAoMoFOu+VVt0bVAtifG5qGHweht9c =qKED -----END PGP SIGNATURE-----
Greg Sabino Mullane wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: RIPEMD160 > > >>> People have been complaining on IRC that nothing can be >>> downloaded from our site, as the mirror-picking script throws >>> an internal error. >>> When are we going to fix our infrastructure properly? > >> As Stefan has already posted on this very list, he is performing >> maintenance on that machine in order to move it to new hardware. > > I understand that, but I think this project is big enough, and > important enough, and has enough smart people involved in it, > that things like this should just not happen. Some thoughts, in > order of descending importance to the matter at hand: > > * Why do we have so many eggs in one basket? I know that "jails" > allow us to have many subdomains/services on one physical box, > but we've seen three problems with the concept lately: > > 1) Global software updates that break things in all jails We need to do software upgrades once in a while because OSes reach their EOL date (and therefore lose security support). Software updates tend to break stuff, and OSes are more complex than a single application, so we have to expect some issues. Security/feature upgrades of userspace apps obviously affect only a single jail. > 2) Battling over resources and causing one jail to affect another That one has happened - but only once or twice over the last few years, so I'm not convinced it is a real issue rather than an isolated incident. > 3) Hardware problems that affect more than one jail The very same would happen if we used some sort of full virtualization technology, so I'm not sure I see the point. Or are you actively proposing we should request and run 40+ physical servers in the future? I don't think that would be sensible in any way (both from a resource-wasting POV and the administrative overhead - and we don't have that many boxes either). > > * One way around problems like this is to mirror the services. 
> That may involve load balancing, DNS tricks, database replication, > and other assorted goodies. It may be difficult, but it's something > I'd like to at least start us talking about. The low-hanging fruit in that regard has already been taken (have you seen the static part of the website being down in the last few years?) - most of the other services are much, much harder to operate in a loadbalanced (or master-master) setup, or doing so seems simply overkill. Furthermore, I don't think that just making services more complex (as in redundant) will necessarily result in better availability. However, I acknowledge that we can improve in some areas (like wiki authentication). > > * As much as I love the concept of BSD (and I might even be running it > at home if it didn't always coredump while installing on my laptop), we > should realize that there are many people in our community who are > really, really good with Linux. Many of the people on the PG lists do > Linuxy support as their dayjob. I'm not saying we should dump BSD, but > I'm dismayed to see the resistance given to adding non-BSD boxes to our > mix. I'm not against that idea in general (and we already have a fair share of Linux boxes too), but how would Linux solve any of the issues you mentioned? All of the Linux distributions have had their fair share of "breaking stuff with security/point updates/upgrades", and if hardware breaks it doesn't matter if we run BSD, Linux or Windows. Stefan
David Blewett wrote: > On Fri, Oct 24, 2008 at 9:22 AM, Greg Sabino Mullane <greg@turnstep.com> wrote: >> People have been complaining on IRC that nothing can be >> downloaded from our site, as the mirror-picking script throws >> an internal error. > > It looks like it's still throwing an error: > http://wwwmaster.postgresql.org/download/mirrors-ftp?file=%2Fsource%2Fv8.3.4%2Fpostgresql-8.3.4.tar.bz2 seems to work for me - at least now. Stefan
Stefan Kaltenbrunner wrote: >> * One way around problems like this is to mirror the services. >> That may involve load balancing, DNS tricks, database replication, >> and other assorted goodies. It may be difficult, but it's something >> I'd like to at least start us talking about. > > the low-hanging fruit in that regard has already been taken (have you > seen the static part of the website being down in the last few years?) - > most of the other services are much, much harder to operate in a > loadbalanced (or master-master) setup, or doing so seems simply overkill. > Furthermore I don't think that just making services more complex (as in > redundant) will necessarily result in better availability. However I > acknowledge that we can improve in some areas (like wiki authentication). I think the most important thing to get a workaround for is our mirror management. Because right now, if wwwmaster goes down, nobody can download our stuff from the website - even if both the website and 100 ftp servers are up. Now, getting that one done shouldn't be too hard, since the data really only flows one way (I don't mind if we lose click-thru stats). So we can either: 1) replicate the mirror database to a secondary jail somewhere, running the wwwmaster code. Link the downloads to a separate DNS name that maps to both these machines, and do checking similar to our static machines to remove them from DNS if they go down. 2) reimplement the mirror management stuff in client-side javascript somehow, and serve it off the static mirrors. Not entirely sure how to do this cleanly, or how to fall back if $user has javascript disabled, but it would have the advantage of not needing another jail. My vote would be for #1 here, as I think is clear. The login system would also be good to have distributed, but it's used by orders of magnitude fewer people. 
But if we replicate the database off to the other machine, it should be possible to serve the logins on both machines as well - it's a simple pl/pgsql function that needs to be called. We'll just need to deal with the "last logged in" part, which won't work then. //Magnus
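Magnus's option #1 boils down to a health-checked pool of two master nodes behind a single DNS name. A minimal sketch of the selection logic, in Python purely for illustration - the node names and the is_up() probe are hypothetical, and the real mechanism would live in the DNS check scripts:

```python
# Sketch of option #1: two wwwmaster-style nodes behind one download name.
# Dead nodes are skipped, much like the existing checks that drop static
# mirrors from DNS. The host names below are made up for the example.

MASTERS = ["wwwmaster1.postgresql.org", "wwwmaster2.postgresql.org"]

def pick_master(hosts, is_up):
    """Return the first host whose health probe succeeds, or None when the
    whole pool is down (the very situation the redundancy should avoid)."""
    for host in hosts:
        if is_up(host):
            return host
    return None
```

Because the mirror data only flows one way, the replica can lag slightly without users noticing; lost click-thru stats are the only cost, as noted above.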
On Saturday 25 October 2008 04:23:34 Magnus Hagander wrote: > Stefan Kaltenbrunner wrote: > >> * One way around problems like this is to mirror the services. > >> That may involve load balancing, DNS tricks, database replication, > >> and other assorted goodies. It may be difficult, but it's something > >> I'd like to at least start us talking about. > > > > the low-hanging fruit in that regard has already been taken (have you > > seen the static part of the website being down in the last few years?) - > > most of the other services are much, much harder to operate in a > > loadbalanced (or master-master) setup, or doing so seems simply overkill. > > Furthermore I don't think that just making services more complex (as in > > redundant) will necessarily result in better availability. However I > > acknowledge that we can improve in some areas (like wiki authentication). > > I think the most important thing to get a workaround for is our mirror > management. Because right now, if wwwmaster goes down, nobody can download our > stuff from the website - even if both the website and 100 ftp servers > are up. > > Now, getting that one done shouldn't be too hard, since the data > really only flows one way (I don't mind if we lose click-thru stats). So we > can either: > > 1) replicate the mirror database to a secondary jail somewhere, running > the wwwmaster code. Link the downloads to a separate DNS name that maps > to both these machines, and do checking similar to our static machines > to remove them from DNS if they go down. > > 2) reimplement the mirror management stuff in client-side javascript > somehow, and serve it off the static mirrors. Not entirely sure how to > do this cleanly, or how to fall back if $user has javascript disabled, > but it would have the advantage of not needing another jail. > > > My vote would be for #1 here, as I think is clear. > > > > The login system would also be good to have distributed, but it's used > by orders of magnitude fewer people. 
But if we replicate the > database off to the other machine, it should be possible to serve the > logins on both machines as well - it's a simple pl/pgsql function that > needs to be called. We'll just need to deal with the "last logged in" > part, which won't work then. > If you used plproxy rather than plpgsql, I think you could eliminate this problem. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Saturday 25 October 2008 04:04:08 Stefan Kaltenbrunner wrote: > All of the linux distributions had their fair share of "breaking stuff > with security/point updates/upgrades" and if hardware breaks it doesn't > matter if we run BSD, Linux or Windows. > good point, we should switch to solaris! :-D -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Robert Treat wrote: > On Saturday 25 October 2008 04:04:08 Stefan Kaltenbrunner wrote: >> All of the linux distributions had their fair share of "breaking stuff >> with security/point updates/upgrades" and if hardware breaks it doesn't >> matter if we run BSD, Linux or Windows. >> > > good point, we should switch to solaris! :-D > *cough* :P Guys, we seem to have this argument every 3-6 months. I know that I have started it on my own once or twice. So for the sake of everyone's bandwidth and time, let me just break it down. PostgreSQL.Org uses a FreeBSD architecture. To my knowledge there are only two exceptions to this, one of which will go away by the end of the month. Don't ask for Linux -- you aren't going to get it. We use jails. Deal with it. We do have problems with the infrastructure, but they are being dealt with as time and resources allow. That being said, the infrastructure design in terms of the base technologies is set. In short, yes, I would like to see us move to all Linux. Debian or Ubuntu Hardy would be my choice. However, that is not going to happen. So I have accepted that I will learn/be a FreeBSD admin. I can deal with that; FreeBSD, although a tad weird for SysV guys, is a good OS, and it solves the problem we are trying to solve. I used to buy into the argument of ... if we had Linux more people would be willing to help. That argument is crap. People will help if they want to help. They will learn what they need to help. Those that say, "if you were running linux I would help" have a good heart but aren't people that are really going to help in the long run anyway. So can we just put on the Wiki that this is the way it is? That way the next time it comes up, we just point. Joshua D. Drake
Robert Treat wrote: > On Saturday 25 October 2008 04:23:34 Magnus Hagander wrote: >> The login system would also be good to have distributed, but it's used >> by orders of magnitude fewer people. But if we replicate the >> database off to the other machine, it should be possible to serve the >> logins on both machines as well - it's a simple pl/pgsql function that >> needs to be called. We'll just need to deal with the "last logged in" >> part, which won't work then. >> > > If you used plproxy rather than plpgsql, I think you could eliminate this problem. > Does pl/proxy actually help with that? I haven't actually used it, but from what I can tell dealing with failover is still on the TODO list for it ("RUN ON ANY: if one con failed, try another"). Or? //Magnus
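For what it's worth, the "RUN ON ANY: if one con failed, try another" behaviour that the pl/proxy TODO describes can be approximated on the client side until pl/proxy grows it. A rough Python sketch - the connect() callable and the host list are purely hypothetical stand-ins for a real database driver:

```python
# Client-side failover sketch: try each host in turn and run the call on
# the first one that works - roughly the "RUN ON ANY" retry semantics the
# pl/proxy TODO describes, done outside the database.

def call_on_any(connect, hosts, func, *args):
    """Return the result of calling func on the first usable host; raise
    RuntimeError only when every host in the list has failed."""
    last_error = None
    for host in hosts:
        try:
            conn = connect(host)           # may raise if the host is down
            return conn.call(func, *args)  # may raise if the call fails
        except Exception as e:
            last_error = e                 # remember the error, try the next host
    raise RuntimeError("all hosts failed") from last_error
```

The simple pl/pgsql login function Magnus mentions could then be invoked through such a wrapper against whichever replica answers.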
Magnus Hagander wrote: > Greg Sabino Mullane wrote: > >> People have been complaining on IRC that nothing can be >> downloaded from our site, as the mirror-picking script throws >> an internal error. >> >> When are we going to fix our infrastructure properly? >> > > As Stefan has already posted on this very list, he is performing > maintenance on that machine in order to move it to new hardware. > > //Magnus > > We are still missing the one important thing: notification. Lots and lots of people who use the website will never go near the lists, IRC or anything else. Notifying the email lists of downtime will stop the heavily involved community from complaining, but it does absolutely nothing for the general user trying to download something from the internet. You can argue about replication, downtime and the like until you are blue in the face. There will always be some downtime. The question is how people know about it, when it is, and what they should do about it. Until reading this thread I had never even thought about how PostgreSQL does or doesn't notify people about downtime or potential downtime. Reading down the thread, this notification issue appears to have been ignored. To me it seems like relatively low-hanging fruit to allow messages to be posted on the website about planned outages, and notifications of recent unplanned outages. Complaining on IRC is one of the only ways to find out what's going on at the moment for a casual user. When Marc's hosting had trouble a couple of years back, the only way to find out anything was on IRC. I'd look into this, but I'd need a lot more knowledge about how the web stuff is set up, and I'm probably not going to be able to glean that from people in a couple of weeks. But if I can, great! Russell.
On 26 okt 2008, at 02.03, Russell Smith <mr-russ@pws.com.au> wrote: > Magnus Hagander wrote: >> Greg Sabino Mullane wrote: >> >>> People have been complaining on IRC that nothing can be >>> downloaded from our site, as the mirror-picking script throws >>> an internal error. >>> >>> When are we going to fix our infrastructure properly? >>> >> >> As Stefan has already posted on this very list, he is performing >> maintenance on that machine in order to move it to new hardware. >> >> //Magnus >> >> > We are still missing the one important thing: notification. Lots and > lots of people who use the website will never go near the lists, > irc or > anything else. Notifying the email lists of downtime will stop the > heavily involved community from complaining, but it does absolutely > nothing for the general user trying to download something from the > internet. That is a very good point. And it actually goes to many other parts of the project, not just the infrastructure. Basically, the authoritative version of *all* important information is the lists. > > You can argue about replication, downtime and the like until you are > blue in the face. There will always be some downtime. The question > is > how do people know about it, when is it and what do they do about it? Agreed. > Until reading this thread I had never even thought about how > PostgreSQL > does or doesn't notify people about downtime or potential downtime. > Reading down thread this notification issue appears to have been > ignored. To me it seems like relatively low hanging fruit to allow > messages to be posted on the website about planned outages, and > notifications of recent unplanned So how do you deal with a case like the one discussed here, where the web is what didn't work? The static frontends were up, but not the master which is used to update them... > outages. Complaining on IRC is one of > the only ways to find out what's going on at the moment for a casual > user. 
The casual user would be using the lists, certainly not IRC. People who aren't deep in the project will certainly hit the lists first, because that's what we say on our website. Now, what they really do is email webmaster, which a lot of people did. That said, I agree a better way would be good to have. > When Marc's hosting had trouble a couple of years back, the only > way to find out anything was on irc. That outlines one of the major problems. It must not be too hard to deal with for the guy trying to fix the actual problem. Sending an email is *easy*, and Stefan did so in this case. But as you also note, even this is too much for some people. We could publish a snapshot of our nagios data, but I doubt that would actually be helpful to these people. > I'd look into this, but I'd need a lot more knowledge about how the > web > stuff is setup, and I'm probably not going to be able to glean that > from > people in a couple of weeks. But if I can. Great!. > Hey, give it a shot. Just remember that the technical part is the easy part. Creating a process and getting buy-in for that is going to be the hard part. /Magnus
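One way to square Russell's notification idea with Magnus's point - that the dynamic master is exactly the part that tends to be down - is to render outage notices into static files pushed to the frontends, which stay up. A small Python sketch; the notice fields and wording are invented for illustration:

```python
# Render outage notices to a static HTML fragment. Because the output is a
# plain file, it can be synced to the static frontends and stays reachable
# even while wwwmaster itself is down. The notice fields are hypothetical.
import html

def render_status(notices):
    """Return an HTML fragment listing notices, or an all-clear message."""
    if not notices:
        return "<p>All services operating normally.</p>\n"
    items = "\n".join(
        "<li><b>%s</b>: %s</li>" % (html.escape(n["service"]), html.escape(n["text"]))
        for n in notices
    )
    return "<ul>\n%s\n</ul>\n" % items
```

The person fixing the actual problem then only has to append one entry and let the regular sync push it out, keeping the burden on the admin close to that of sending an email.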
Russell, The planned maintenance (to replace the troublesome hardware) was announced publicly by Stefan. /D On 10/26/08, Russell Smith <mr-russ@pws.com.au> wrote: > Magnus Hagander wrote: >> Greg Sabino Mullane wrote: >> >>> People have been complaining on IRC that nothing can be >>> downloaded from our site, as the mirror-picking script throws >>> an internal error. >>> >>> When are we going to fix our infrastructure properly? >>> >> >> As Stefan has already posted on this very list, he is performing >> maintenance on that machine in order to move it to new hardware. >> >> //Magnus >> >> > We are still missing the one important thing "Notification" lots and > lots of people use the website that will never go near the lists, > irc or > anything else. Notifying the email lists of downtime will stop the > heavily involved community from complaining, but it does absolutely > nothing for general user trying to download something from the > internet. > > You can argue about replication, downtime and the like until you are > blue in the face. There will always be some downtime. The question > is > how do people know about it, when is it and what do they do about it? > > Until reading this thread I had never even thought about how > PostgreSQL > does or doesn't notify people about downtime or potential downtime. > Reading down thread this notification issue appears to have been > ignored. To me it seems like relatively low hanging fruit to allow > messages to be posted on the website about planned outages, and > notifications of recent unplanned outages. Complaining on IRC is > one of > the only ways to find out what's going on at the moment for a casual > user. When Marc's hosting had trouble a couple of years back, the > only > way to find out anything was on irc. > > I'd look into this, but I'd need a lot more knowledge about how the > web > stuff is setup, and I'm probably not going to be able to glean that > from > people in a couple of weeks. But if I can. Great!. 
> > Russell. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
-----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Stefan wrote: >> 2) Battling over resources and causing one jail to affect another > that one has happened - but only one or two times over the last few > years so I'm not convinced it is a real issue rather than an isolated > incident. I think this happens more than you realize. Isn't the jabber service still causing problems now? Wasn't the wiki recently affected by something else? Who knows how often it happens to a lesser extent? It's only the extreme cases that cause notices to be sent to this list. >> 3) Hardware problems that affect more than one jail > the very same would happen if we used some sort of full virtualization > technology so I'm not sure I see the point. Or are you actively > proposing we should request and run 40+ physical servers in the future ? > I don't think that would be sensible in any way (both from a resource > wasting pov and the administrative overhead - and we don't have that > many boxes either). No, not 40+, but having the small handful of important services distributed on separate boxes/data centers would be a good idea. Specifically, the archives, search, website, wiki, cvs, and mailing lists should ideally all be on different servers, to minimize the impact on the project as a whole when something goes down. >> * One way around problems like this is to mirror the services. >> That may involve load balancing, DNS tricks, database replication, >> and other assorted goodies. It may be difficult, but it's something >> I'd like to at least start us talking about. > the low hanging fruit in that regard has already been taken (have you > seen the static part of website being down in the last few years?) - No, I'm completely happy with the static part of the website. > most of the other services are much much harder to operate in a > loadbalanced (or master-master) setup or doing it seems simply overkill. 
> Furthermore I don't think that just making services more complex (as in > redundant) will necessarily result in better availability. However I > acknowledge that we can improve in some areas (like wiki authentication). Er... how do you figure redundant services do not necessarily result in better availability? That's kind of the point of redundancy - and we certainly don't have anywhere near 100% uptime for practically any part of our infrastructure. I do recognize there is a complexity tradeoff to be made, so perhaps only some (or none) of the services need that tradeoff to be made. However, I consider it a valid point to be raised. This goes a little to disaster recovery as well, so perhaps some of the services (e.g. cvs) are already mirrored in some fashion, and all we need to do is to tweak some things? >> * As much as I love the concept of BSD (and I might even be running it >> at home if it didn't always coredump while installing on my laptop), we >> should realize that there are many people in our community who are >> really, really good with Linux. Many of the people on the PG lists do >> Linuxy support as their dayjob. I'm not saying we should dump BSD, but >> I'm dismayed to see the resistance given to adding non-BSD boxes to our >> mix. > Not against that idea in general (and we already have a fair share of > linux boxes too) how would linux solve any of the issues you mentioned ? > All of the linux distributions had their fair share of "breaking stuff > with security/point updates/upgrades" and if hardware breaks it doesn't > matter if we run BSD, Linux or Windows. It wouldn't solve any of the above issues, which is why it was the last bullet point. As Robert points out, we could just switch to Sun's Solaris, then we wouldn't have any problems. Look how well MySQL is going under their watch! :) Joshua Drake wrote: > PostgreSQL.Org uses a FreeBSD architecture. 
To my knowledge there are > only two exceptions to this, one of which will go away by the end of the > month. Don't ask for Linux -- you aren't going to get it. > We use jails. Deal with it. We are dealing with it; that's one of the big problems. > I used to buy into the argument of ... if we had Linux more people would > be willing to help. That argument is crap. People will help if they want > to help. They will learn what they need to help. Those that say, "if you > were running linux I would help" have a good heart but aren't people > that are really going to help in the long run anyway. That counter-argument is crap as well. People will 'learn what they need to help'? This is a volunteer project, so the more barriers we put in front of people, the less that will get done. While homogeneity of servers can be a good thing from a sysadmin perspective, expanding the pool of potential help can be as well. > So can we just put on the Wiki that this is the way it is? That way the > next time it comes up, we just point. Next time can we send a message to -announce and -general letting people know the website, cvs, wiki, and pgadmin are going to be down? I think that was one of the most annoying aspects of this whole incident. - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200810292113 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAkkJDBEACgkQvJuQZxSWSsgMFQCgt51u2F4c/7TrSaVAO79Y293+ HbEAn1dM6owdqWZK0Ey06BzX9u56e6U8 =J3av -----END PGP SIGNATURE-----
Greg Sabino Mullane wrote: > Stefan wrote: > > > that one has happened - but only one or two times over the last few > > years so I'm not convinced it is a real issue rather than an isolated > > incident. > > I think this happens more than you realize. Isn't the jabber service > still causing problems now? This is a bad example, because Jabber runs on a Linux server :-) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Greg Sabino Mullane wrote: > >> Stefan wrote: >> >>> that one has happened - but only one or two times over the last few >>> years so I'm not convinced it is a real issue rather than an isolated >>> incident. >> I think this happends more than you realize. Isn't the jabber service >> still causing problems now? > > This is a bad example, because Jabber runs on a Linux server :-) > And is actually a hardware issue that is being dealt with, not a software one. Joshua D. Drake
Greg Sabino Mullane wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: RIPEMD160 > That counter-argument is crap as well. People will 'learn what they need to > help'? This is a volunteer project, so the more barriers we put in front of > people, the less that will get done. While homogeneity of servers can be a > good thing from a sysadmin perspective, expanding the pool of potential help > can be as well. I learned FreeBSD very much against my will for this project. I use PHP very much against my will for this project. I use Docbook SGML very much against my will for this project. I use CVS very much against my will for this project. There are plenty of things this project does that I think are plain outright dumb. However, because I want to contribute to this project, and this is what the project (regardless of whether I agree) has deemed will be done, I do them. > >> So can we just put on the Wiki that this is the way it is? That way the >> next time it comes up, we just point. > > Next time can we send a message to -announce and -general letting people > know the website, cvs, wiki, and pgadmin are going to be down? I think that > was one of the most annoying aspects of this whole incident. Well I certainly can't argue with that. Joshua D. Drake
On Wed, Oct 29, 2008 at 09:15:20PM -0700, Joshua D. Drake wrote: > > And is actually a hardware issue that is being dealt with, not a software > one. I fail totally to see how either that, or the OS in question, in any way constitutes a premise against Greg's argument. His argument is just that there are a lot of services, and several of them appear, to the uneducated eye, to be rather less reliable than one might hope. He has proposed a way to help: increase diversity of the systems by introducing another operating system and some additional hardware. Apart from the duplication of services, such a diversity of code bases adds robustness to a distributed system, because a code problem in one system will not affect the other one. (For instance, if it turns out that jails have a bug in some release of FreeBSD, it's likely not to be the same problem in xen running on Linux.) As a side benefit, this might lower the initial cost of volunteering for enough people that there would be more volunteers. (Just to prove I can argue both sides of the fence, though: adding more sysadmins to a distributed system often does not improve the reliability of the system. The system needs to be designed for many hands, and I don't know if this one is.) Since I'm officially Not Volunteering to help with this, I don't have a dog in the race. But I don't think responding to Greg's sound argument with red herrings is going to address his point. "This is what we picked; deal with it," is a pretty lame argument in the face of public failures of the stuff one picked. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On Thu, Oct 30, 2008 at 12:42 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote: > On Wed, Oct 29, 2008 at 09:15:20PM -0700, Joshua D. Drake wrote: >> >> And is actually a hardware issue that is being dealt with, not a software >> one. > > I fail totally to see how either that, or the OS in question, in any > way constitutes a premise against Greg's argument. He raised the issue of the Jabber server (and some other services) in response to a comment on how we've (once) seen a FreeBSD jail hog resources and adversely affect another. > His argument is > just that there are a lot of services, and several of them appear, to > the uneducated eye, to be rather less reliable than one might hope. Right - and as far as I'm aware, pretty much all of those issues boil down to what a couple of days ago was thought to be two hardware issues. One of them definitely is (the Linux-based Jabber server), which CP staff (or JD) are apparently migrating to new hardware; the other is now looking like a kernel bug that manifests itself on certain hardware configurations, which is hopefully now resolved as well. > He has proposed a way to help: increase diversity of the systems by > introducing another operating system and some additional hardware. Aside from us not having additional hardware, running multiple OSes itself adds significant management overhead, as I'm sure you realise - in a project where volunteers to handle the day-to-day tasks rarely last more than a week, that's something that's difficult to justify to the few of us that do continue to do the work on an ongoing basis. Further, it reduces our ability to re-deploy services quickly wherever we like (though granted, that would be offset somewhat by having more machines). We would need to ensure that our OSes were equally well spread out geographically to ensure we could redeploy any service in any data center as we currently can (and, on occasion, do). 
I guess what I'm saying is that whilst in an ideal world we'd have diverse OSes, different hardware for all services, and even different data centers for each, in reality it just isn't practical for us. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
On Thursday 30 October 2008 09:06:32 Dave Page wrote: > Further, it reduces our ability to re-deploy services quickly where > ever we like (though granted that would be offset somewhat by having > more machines). We would need to ensure that our OSs were equally well > spread out geographically to ensure we could redeploy any service in > any data center as we currently can (and, on occasion, do). > Is this really true? ISTM for much of the software we maintain, having the config files / scripts all checked into svn would be enough to enable us to re-deploy on even completely different hardware/OS without a large amount of effort, which probably would open up other options for hosting... case in point, I'm pretty sure we could pop out a VM for postgres needs, but it would need to run solaris since we're geared toward zones. For something like jabber, I believe we already run the same jabber software as postgresql.org, so it sure seems like it should be easy to move it back and forth even across different systems... I'd guess there are other services like this too, and probably other people in similar situations. -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Fri, Oct 31, 2008 at 2:40 AM, Robert Treat <xzilla@users.sourceforge.net> wrote: > Is this really true? ISTM for much of the software we maintain, having the > config files / scripts all checked into svn would be enough to enable us to > re-deploy on even completely different hardware/OS without a large amount of > effort, which probably would open up other options for hosting... case in > point, I'm pretty sure we could pop out a VM for postgres needs, but it would > need to run solaris since since we're geared toward zones. For something like > jabber, I believe we already run the same jabber software as postgresql.org, > so it sure seems like it should be easy to move it back and forth even across > different systems... I'd guess there are other services like this too, and > probably other people in similar situations. By 'redeploy', I mean start a backup copy of a jail on a different box within a few minutes. As you say, we do also automatically backup key config files to SVN so we can move services to different OSs as well, but obviously not nearly as quickly. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com