Re: race condition in sync rep
From | Robert Haas |
---|---|
Subject | Re: race condition in sync rep |
Date | |
Msg-id | AANLkTikE0KVCNOB4o=ZqAX=8TQ9CjvAzriZML=oeNgpk@mail.gmail.com |
In response to | Re: race condition in sync rep (Simon Riggs <simon@2ndQuadrant.com>) |
List | pgsql-hackers |
On Sun, Mar 27, 2011 at 7:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Are the master and standby on same system or are they separated by a network?
>
> I'm surprised that a network roundtrip takes less time than the
> backend takes to mark clog and then queue for the SyncRepLock.

When I first noticed that it was slow (really hanging, though I failed to realize it) with fsync=off, I had two clusters on the same system. I didn't know what was going on at that point, and wasn't specifically looking for this bug - I was actually testing some other aspect of the behavior and hit upon it by accident.

Then while I was at PG East I realized there was a race condition. (I think I actually realized it while I was dreaming about PostgreSQL; if you think dreaming about PostgreSQL is a sign that something is seriously wrong with me, you are likely correct.) Just to convince myself that I wasn't making things up, I then stuck a sleep(1) in right before the sync rep wait, for testing purposes, which of course made it trivial to demonstrate the hang. I again ran master and standby on a single system (a different machine this time), but of course with the sleep in there it wouldn't have mattered. Then later I realized that the race condition and the fsync=off slowness were probably the same problem, so I wrote up the email that way.

If your point is that I never demonstrated the problem with sync rep between two different systems, I agree. I suspect it could be done, but you'd probably have to load the master down pretty heavily while keeping the load on the standby very light - or possibly it would work to just run a single-threaded test for a really long time, but I don't know because I haven't tried it. I'm actually not that interested in quantifying the exact probability of this happening under any given set of circumstances; it seems like enough that it's been found and fixed.

If something in the phrasing of my original email gave offence, it wasn't intended to: in particular, the use of the word "nasty" was intended to convey "difficult to find; tricky". I think my fear that it would prove difficult to fix also may have affected that word choice; I didn't anticipate it being resolved so quickly and with such a small patch.

I am doing my best to help fix the things that I believe to be bugs in the code without pissing anybody off. Clearly, at least in your case, that doesn't seem to have been entirely successful, but in all honesty it's not for lack of trying. I really, really want to get this release out the door and get back to writing code and doing CommitFests; but I also want it to be good (as I'm sure you do as well) and I think we're not quite there yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company