Re: Replication server timeout patch
From | Daniel Farina |
---|---|
Subject | Re: Replication server timeout patch |
Date | |
Msg-id | AANLkTi=X+ucrE6FRNvOQDidoHVkbQ5rG212fHqz_u0yf@mail.gmail.com |
In response to | Re: Replication server timeout patch (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On Feb 11, 2011 8:20 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 11, 2011 at 4:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Feb 11, 2011 at 4:30 PM, Heikki Linnakangas
> > <heikki.linnakangas@enterprisedb.com> wrote:
> >> On 11.02.2011 22:11, Robert Haas wrote:
> >>>
> >>> On Fri, Feb 11, 2011 at 2:02 PM, Daniel Farina <drfarina@acm.org> wrote:
> >>>>
> >>>> I split this out of the synchronous replication patch for independent
> >>>> review. I'm dashing out the door, so I haven't put it on the CF yet or
> >>>> anything, but I just wanted to get it out there... I'll be around in
> >>>> Not Too Long to finish any other details.
> >>>
> >>> This looks like a useful and separately committable change.
> >>
> >> Hmm, so this patch implements a watchdog, where the master disconnects the
> >> standby if the heartbeat from the standby stops for more than
> >> 'replication_[server]_timeout' seconds. The standby sends the heartbeat
> >> every wal_receiver_status_interval seconds.
> >>
> >> It would be nice if the master and standby could negotiate those settings.
> >> As the patch stands, it's easy to have a pathological configuration where
> >> replication_server_timeout < wal_receiver_status_interval, so that the
> >> master repeatedly disconnects the standby because it doesn't reply in time.
> >> Maybe the standby should report how often it's going to send a heartbeat,
> >> and master should wait for that long + some safety margin. Or maybe the
> >> master should tell the standby how often it should send the heartbeat?
> >
> > I guess the biggest use case for that behavior would be in a case
> > where you have two standbys, one of which doesn't send a heartbeat and
> > the other of which does. Then you really can't rely on a single
> > timeout.
> >
> > Maybe we could change the server parameter to indicate what multiple
> > of wal_receiver_status_interval causes a hangup, and then change the
> > client to notify the server what value it's using. But that gets
> > complicated, because the value could be changed while the standby is
> > running.
>
> On reflection I'm deeply uncertain this is a good idea. It's pretty
> hopeless to suppose that we can keep the user from choosing parameter
> settings which will cause them problems, and there are certainly far
> stupider things they could do than set replication_timeout <
> wal_receiver_status_interval. They could, for example, set fsync=off
> or work_mem=4GB or checkpoint_segments=3 (never mind that we ship that
> last one out of the box). Any of those settings have the potential to
> thoroughly destroy their system in one way or another, and there's not
> a darn thing we can do about it. Setting up some kind of handshake
> system based on a multiple of the wal_receiver_status_interval is
> going to be complex, and it's not necessarily going to deliver the
> behavior someone wants anyway. If someone has
> wal_receiver_status_interval=10 on one system and =30 on another
> system, does it therefore follow that the timeouts should also be
> different by 3X? Perhaps, but it's non-obvious.
>
> There are two things that I think are pretty clear. If the receiver
> has wal_receiver_status_interval=0, then we should ignore
> replication_timeout for that connection. And also we need to make
> sure that the replication_timeout can't kill off a connection that is
> in the middle of streaming a large base backup. Maybe we should try
> to get those two cases right and not worry about the rest. Dan, can
> you check whether the base backup thing is a problem with this as
> implemented?

Yes, I will have something to say come Saturday.

--
fdr
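
For illustration only, the two rules Robert calls "pretty clear", together with the pathological configuration Heikki describes, might be sketched as below. This is not code from the patch. The function names are hypothetical, the sketch assumes the master somehow knows the standby's wal_receiver_status_interval (which is exactly the negotiation question left open in the thread), and it assumes a non-positive replication_timeout means the timeout is disabled.

```python
from typing import Optional


def effective_timeout(replication_timeout: float,
                      standby_status_interval: float,
                      streaming_base_backup: bool = False) -> Optional[float]:
    """Return the timeout (seconds) the master would enforce, or None for none.

    Hypothetical helper, not part of the patch under discussion.
    """
    # Assumption: a non-positive timeout means the feature is switched off.
    if replication_timeout <= 0:
        return None
    # Rule 1 from the thread: a standby with wal_receiver_status_interval=0
    # never sends a heartbeat, so the timeout is ignored for that connection.
    if standby_status_interval == 0:
        return None
    # Rule 2 from the thread: the timeout must not kill a connection that is
    # in the middle of streaming a base backup.
    if streaming_base_backup:
        return None
    return replication_timeout


def is_pathological(replication_timeout: float,
                    standby_status_interval: float) -> bool:
    """True when the master would give up before the next heartbeat is due."""
    return 0 < replication_timeout < standby_status_interval


def should_disconnect(seconds_since_last_heartbeat: float,
                      timeout: Optional[float]) -> bool:
    """Watchdog check: disconnect only if a timeout applies and has elapsed."""
    return timeout is not None and seconds_since_last_heartbeat > timeout
```

With these definitions, `is_pathological(5, 10)` flags the configuration in which the master repeatedly drops a healthy standby, while `effective_timeout(60, 0)` returns `None` so a standby that sends no heartbeat is never disconnected by the watchdog.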