Re: Patch for fail-back without fresh backup

From: Pavan Deolasee
Subject: Re: Patch for fail-back without fresh backup
Date:
Msg-id: CABOikdNnfgy-1Px8=_AeWjwnP+rSQ+fYNJYVn_iqKkXZF_EiOA@mail.gmail.com
In reply to: Re: Patch for fail-back without fresh backup (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: Patch for fail-back without fresh backup (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers



On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:


> My perspective is that if the master crashed, assuming that you know
> everything about that and suddenly jumping back on seem like a recipe
> for disaster. Attempting that is currently blocked by the technical
> obstacles you've identified, but that doesn't mean they are the only
> ones - we don't yet understand what all the problems lurking might be.
> Personally, I won't be following you onto that minefield anytime soon.


Would it be fair to say that a user will be willing to trust her crashed master in all scenarios where she would have done so in a single-instance setup? IOW, without a replication setup, AFAIU users have traditionally trusted WAL recovery to recover from failed instances. This would include some common failures such as power outages and hardware failures, but may not include others such as on-disk corruption.
 
> So I strongly object to calling this patch anything to do with
> "failback safe". You simply don't have enough data to make such a bold
> claim. (Which is why we call it synchronous replication and not "zero
> data loss", for example).


I agree. We should probably find a better name for this. Any suggestions?
 
> But that's not the whole story. I can see some utility in a patch that
> makes all WAL transfer synchronous, rather than just commits. Some
> name like synchronous_transfer might be appropriate. e.g.
> synchronous_transfer = all | commit (default).


It's an interesting idea, but I think there is some difference here. For example, the proposed feature allows a backend to wait at points other than commit, but not at commit itself. Since commits are more foreground in nature, and this feature does not require us to wait during such common foreground activities, we want a configuration where the master waits for synchronous transfers only at points other than commit. Maybe we can solve that by allowing more granular values for the proposed parameter?
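
For instance, something along these lines (purely illustrative; the value names are hypothetical, not from any actual patch):

    # illustrative only -- value names are hypothetical
    synchronous_transfer = commit      # wait only at commit; today's sync rep behaviour
    synchronous_transfer = data_flush  # wait only when flushing data pages, not at commit
    synchronous_transfer = all         # wait at commit as well as at data-page flushes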
 
> The idea of another slew of parameters that are very similar to
> synchronous replication but yet somehow different seems weird. I can't
> see a reason why we'd want a second lot of parameters. Why not just
> use the existing ones for sync rep? (I'm surprised the Parameter
> Police haven't visited you in the night...) Sure, we might want to
> expand the design for how we specify multi-node sync rep, but that is
> a different patch.

How would we then distinguish between a synchronous standby and the new kind of standby? I am told one of the very popular setups for DR is to have one local sync standby and one async standby (possibly cascaded off the local sync one). Since this new feature is more useful for DR, because taking a fresh backup over a slower link is even more challenging, IMHO we should support such setups.
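
To illustrate the kind of setup I mean (the GUC and recovery.conf lines below are the usual ones; the point about the knob is speculative):

    # on the master: only the local standby is synchronous
    synchronous_standby_names = 'local1'

    # recovery.conf on each standby; application_name decides which standby
    # matches the list above, so the DR box stays asynchronous
    primary_conninfo = 'host=master application_name=local1 ...'  # local sync standby
    primary_conninfo = 'host=master application_name=dr1 ...'     # remote async DR standby

If we simply reuse the sync rep parameters, I don't see how the master could wait for WAL transfer to dr1 without also turning it into a synchronous standby.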
 

> I'm worried to see that adding this feature and yet turning it off
> causes a measureable drop in performance. I don't think we want that
> at all. That clearly needs more work and thought.


I agree. We need to repeat those tests. I don't trust that merely having the feature, turned off, causes a 1-2% drop. In one of the tests, turning the feature on actually shows better numbers than having it turned off. That's clearly noise, or else it needs a concrete argument to explain it.
 
> I also think your performance results are somewhat bogus. Fast
> transaction workloads were already mostly commit waits -

But not in the case of an async standby, right?
 
> measurements
> of what happens to large loads, index builds etc would likely reveal
> something quite different.


I agree. I also feel we need tests where FlushBuffer gets called more often by normal backends, to see how much performance drops from the added wait in that code path. Another important thing to test would be how this works over slower, high-latency links.
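
For example, something like this should make ordinary backends do their own flushes (untested, and the numbers are only a starting point):

    # keep shared_buffers deliberately small so that backends must evict
    # and flush dirty pages themselves during a write-heavy run
    pg_ctl restart -D $PGDATA -o "-c shared_buffers=32MB"
    pgbench -i -s 100 bench           # data set much larger than shared_buffers
    pgbench -c 16 -j 4 -T 300 bench   # default TPC-B-like write workload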
 
> I'm tempted by the thought that we should put the WaitForLSN inside
> XLogFlush, rather than scatter additional calls everywhere and then
> have us inevitably miss one.


That indeed seems cleaner.
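
Roughly like this, I suppose (untested sketch; SyncTransferRequested() is just a placeholder for whatever guard the patch ends up using):

    /*
     * Sketch: centralise the wait in the one routine every flush goes
     * through, instead of sprinkling WaitForLSN() calls at each caller.
     */
    void
    XLogFlush(XLogRecPtr record)
    {
        /* ... existing code that writes and fsyncs WAL up to 'record' ... */

        /*
         * Wait until the standby confirms receipt of WAL up to 'record',
         * so no caller of XLogFlush() can accidentally skip the wait.
         */
        if (SyncTransferRequested())
            WaitForLSN(record);
    }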

Thanks,
Pavan 
