Re: streaming replication breaks horribly if master crashes

From Heikki Linnakangas
Subject Re: streaming replication breaks horribly if master crashes
Date
Msg-id 4C19BC2C.6060105@enterprisedb.com
In response to Re: streaming replication breaks horribly if master crashes  (Greg Stark <gsstark@mit.edu>)
Responses Re: streaming replication breaks horribly if master crashes  (Rafael Martinez <r.m.guerrero@usit.uio.no>)
List pgsql-hackers
On 17/06/10 02:40, Greg Stark wrote:
> On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov>  wrote:
>> Greg Stark<gsstark@mit.edu>  wrote:
>>
>>> TCP keepalives are for detecting broken network connections
>>
>> Yeah.  That seems like what we have here.  If you shoot the OS in
>> the head, the network connection is broken rather abruptly, without
>> the normal packets exchanged to close the TCP connection.  It sounds
>> like it behaves just fine except for not detecting a broken
>> connection.
>
> So I think there are two things happening here. If you shut down the
> master and don't replace it then you'll get no network errors until
> TCP gives up entirely. Similarly if you pull the network cable or your
> switch powers off or your routing table becomes messed up, or anything
> else occurs which prevents packets from getting through then you'll
> see similar breakage. You wouldn't want your database to suddenly come
> up as master in such circumstances, though, when you'll have to fix the
> problem anyway; doing so won't solve any problems, it would just
> create a second problem.

We're not talking about a timeout for promoting standby to master. The 
problem is that the standby doesn't notice that from the master's point 
of view, the connection has been broken. Whether it's because of a 
network error or because the master server crashed doesn't matter, the 
standby should reconnect in any case. TCP keepalives are a perfect fit, 
as long as you can tune the keepalive time short enough, where "short 
enough" is up to the admin to decide depending on the application.
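
To illustrate what tuning the keepalive time looks like at the socket level, here is a minimal Python sketch. It is not how walreceiver or libpq does it; the function name enable_keepalive and the timer values are made up for illustration, and the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are only applied on platforms where Python exposes them (Linux does).

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Enable TCP keepalives and shorten the timers where the OS allows it.

    idle, interval and count are illustrative values, not recommendations.
    Returns a dict of the per-socket timer options that were actually applied.
    """
    # SO_KEEPALIVE itself is widely available.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    applied = {}
    wanted = (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            continue  # this platform does not expose the tunable
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            continue  # exposed but rejected; fall back to the system default
        applied[name] = value
    return applied
```

With values like these, and assuming all three options were applied, a silently dead peer would be noticed after roughly idle + interval * count = 60 seconds. An empty return value is the case where the OS does not allow per-socket tuning.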

Having said that, it would probably make life easier if we implemented 
an application-level heartbeat anyway. Not all OSes allow tuning keepalives.
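
As a rough idea of what the receiving side of an application-level heartbeat could look like, here is a small Python sketch. Nothing like this exists in the tree; the class name HeartbeatMonitor, its methods and the timeout handling are all hypothetical. It assumes the sender emits a heartbeat message whenever it has no data to send, so that silence longer than the timeout suggests the connection is gone.

```python
import time

class HeartbeatMonitor:
    """Track when the peer was last heard from (hypothetical sketch)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = clock()

    def message_received(self):
        # Call for every message from the peer, data or heartbeat.
        self.last_seen = self.clock()

    def connection_presumed_dead(self):
        # True once the peer has been silent for longer than the timeout.
        return self.clock() - self.last_seen > self.timeout
```

The receiver would call message_received() on each incoming message and check connection_presumed_dead() whenever its wait times out, dropping the connection and reconnecting once it returns True. Unlike TCP keepalives, the timeout here is entirely under the application's control.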

> But there's a second case. The Postgres master just stops responding
> -- perhaps it starts seeing disk errors and becomes stuck in disk-wait
> or the machine just becomes very heavily loaded and Postgres can't
> get any cycles, or someone attaches to it with gdb, or one of any
> number of things happen which cause it to stop sending data. In that
> case replication will not see any data from the master but TCP will
> never time out because the network is just fine. That's why there
> needs to be an application level health check if you want to have
> timeouts. You can't depend on the network layer to detect problems
> between the application.

If the PostgreSQL master stops responding, it's OK for the slave to sit 
and wait for the master to recover. Reconnecting wouldn't help.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

