Re: Replication server timeout patch

From Daniel Farina
Subject Re: Replication server timeout patch
Date
Msg-id AANLkTi=X+ucrE6FRNvOQDidoHVkbQ5rG212fHqz_u0yf@mail.gmail.com
In reply to Re: Replication server timeout patch  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Feb 11, 2011 8:20 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 11, 2011 at 4:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Feb 11, 2011 at 4:30 PM, Heikki Linnakangas
> > <heikki.linnakangas@enterprisedb.com> wrote:
> >> On 11.02.2011 22:11, Robert Haas wrote:
> >>>
> >>> On Fri, Feb 11, 2011 at 2:02 PM, Daniel Farina <drfarina@acm.org> wrote:
> >>>>
> >>>> I split this out of the synchronous replication patch for independent
> >>>> review. I'm dashing out the door, so I haven't put it on the CF yet or
> >>>> anything, but I just wanted to get it out there...I'll be around in
> >>>> Not Too Long to finish any other details.
> >>>
> >>> This looks like a useful and separately committable change.
> >>
> >> Hmm, so this patch implements a watchdog, where the master disconnects the
> >> standby if the heartbeat from the standby stops for more than
> >> 'replication_[server]_timeout' seconds. The standby sends the heartbeat
> >> every wal_receiver_status_interval seconds.
> >>
> >> It would be nice if the master and standby could negotiate those settings.
> >> As the patch stands, it's easy to have a pathological configuration where
> >> replication_server_timeout < wal_receiver_status_interval, so that the
> >> master repeatedly disconnects the standby because it doesn't reply in time.
> >> Maybe the standby should report how often it's going to send a heartbeat,
> >> and master should wait for that long + some safety margin. Or maybe the
> >> master should tell the standby how often it should send the heartbeat?
> >
> > I guess the biggest use case for that behavior would be in a case
> > where you have two standbys, one of which doesn't send a heartbeat and
> > the other of which does.  Then you really can't rely on a single
> > timeout.
> >
> > Maybe we could change the server parameter to indicate what multiple
> > of wal_receiver_status_interval causes a hangup, and then change the
> > client to notify the server what value it's using.  But that gets
> > complicated, because the value could be changed while the standby is
> > running.
>
> On reflection I'm deeply uncertain this is a good idea.  It's pretty
> hopeless to suppose that we can keep the user from choosing parameter
> settings which will cause them problems, and there are certainly far
> stupider things they could do than set replication_timeout <
> wal_receiver_status_interval.  They could, for example, set fsync=off
> or work_mem=4GB or checkpoint_segments=3 (never mind that we ship that
> last one out of the box).  Any of those settings have the potential to
> thoroughly destroy their system in one way or another, and there's not
> a darn thing we can do about it.  Setting up some kind of handshake
> system based on a multiple of the wal_receiver_status_interval is
> going to be complex, and it's not necessarily going to deliver the
> behavior someone wants anyway.  If someone has
> wal_receiver_status_interval=10 on one system and =30 on another
> system, does it therefore follow that the timeouts should also be
> different by 3X?  Perhaps, but it's non-obvious.
>
> There are two things that I think are pretty clear.  If the receiver
> has wal_receiver_status_interval=0, then we should ignore
> replication_timeout for that connection.  And also we need to make
> sure that the replication_timeout can't kill off a connection that is
> in the middle of streaming a large base backup.  Maybe we should try
> to get those two cases right and not worry about the rest.  Dan, can
> you check whether the base backup thing is a problem with this as
> implemented?

Yes, I will have something to say come Saturday.

--
fdr
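
For reference, the watchdog rule and the two exemptions Robert lists can be modeled roughly as below. This is only an illustrative Python sketch, not the patch's C code: the function names, the argument names, and the convention that a timeout of 0 disables the check are all assumptions made for the example.

```python
def should_disconnect(now_s: float,
                      last_reply_s: float,
                      replication_timeout_s: int,
                      status_interval_s: int,
                      streaming_base_backup: bool = False) -> bool:
    """Decide whether the master would drop a standby connection.

    All names here are hypothetical; times are in seconds.
    """
    # Assumption for this sketch: a non-positive timeout disables the watchdog.
    if replication_timeout_s <= 0:
        return False
    # Exemption 1: a standby with wal_receiver_status_interval = 0 never
    # sends a heartbeat, so the timeout is ignored for that connection.
    if status_interval_s == 0:
        return False
    # Exemption 2: a connection in the middle of streaming a base backup
    # must not be killed by the timeout.
    if streaming_base_backup:
        return False
    # Watchdog rule: disconnect once the heartbeat has been silent for
    # longer than the timeout.
    return (now_s - last_reply_s) > replication_timeout_s


def is_pathological(replication_timeout_s: int, status_interval_s: int) -> bool:
    """True for the misconfiguration Heikki describes: the timeout is
    shorter than the heartbeat interval, so the standby is dropped
    repeatedly even though it is healthy."""
    return status_interval_s > 0 and 0 < replication_timeout_s < status_interval_s
```

With replication_timeout=5 and wal_receiver_status_interval=10, `is_pathological(5, 10)` is true, which is the case a negotiation scheme would have tried to prevent and which Robert argues is not worth guarding against.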
