Re: max_standby_delay considered harmful

From Greg Smith
Subject Re: max_standby_delay considered harmful
Msg-id 4BE0AD31.9010600@2ndquadrant.com
In reply to max_standby_delay considered harmful  (Tom Lane <tgl@sss.pgh.pa.us>)
Replies Re: max_standby_delay considered harmful  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: max_standby_delay considered harmful  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
Tom Lane wrote:
> 1. The timestamps we are reading from the log might be historical,
> if we are replaying from archive rather than reading a live SR stream.
> In the current implementation that means zero grace period for standby
> queries.  Now if your only interest is catching up as fast as possible,
> that could be a sane behavior, but this is clearly not the only possible
> interest --- in fact, if that's all you care about, why did you allow
> standby queries at all?
>   

If the standby is not current, you may not want people to execute 
queries against it.  In some situations, returning results against 
obsolete data is worse than not letting the query execute at all.  As I 
see it, the current max_standby_delay implementation includes the 
expectation that the results you are getting are no more than 
max_standby_delay behind the master, presuming that new data is still 
coming in.  If the standby has really fallen further behind than that, 
there are situations where you don't want it doing anything but catching 
up until that is no longer the case, and you especially don't want it 
returning stale query data.

The fact that tuning in that direction could mean the standby never 
actually executes any queries is something you need to monitor for; it 
suggests the standby isn't powerful enough, or well enough connected to 
the master, to keep up.  But that's not necessarily the wrong behavior.  
Saying "I only want the standby to execute queries if it's not too far 
behind the master" is the answer to "why did you allow standby queries 
at all?" when tuning for that use case.
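
To make the "zero grace period" case concrete, here is a minimal sketch of 
the comparison as it is described in this thread: the standby's clock is 
checked against the latest timestamp read from the WAL.  This is a 
simplified model, not PostgreSQL source code; the function name 
grace_remaining and the dates are made up for illustration.

```python
from datetime import datetime, timedelta


def grace_remaining(last_wal_timestamp, standby_now, max_standby_delay):
    """Simplified model: how long a conflicting query may still hold up replay.

    The apparent delay is the standby clock minus the newest timestamp
    seen in the WAL. Once it exceeds max_standby_delay, nothing is left.
    """
    apparent_delay = standby_now - last_wal_timestamp
    remaining = max_standby_delay - apparent_delay
    return max(remaining, timedelta(0))


now = datetime(2010, 5, 4, 12, 0, 0)   # arbitrary example time
limit = timedelta(seconds=30)          # example max_standby_delay

# Live stream: the newest WAL timestamp is 5 seconds old.
live = grace_remaining(now - timedelta(seconds=5), now, limit)

# Archive replay: the newest WAL timestamp is an hour old.
archived = grace_remaining(now - timedelta(hours=1), now, limit)
```

In this model the live case leaves 25 seconds for queries, while the 
archive case leaves none until replay has caught up to within the limit.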

> 2. There could be clock skew between the master and slave servers.
>   

Not the database's problem to worry about.  Document that time should be 
carefully sync'd and move on.  I'll add that.

> 3. There could be significant propagation delay from master to slave,
> if the WAL stream is being transmitted with pg_standby or some such.
> Again this results in cutting into the standby queries' grace period,
> for no defensible reason.
>   

Then people should adjust their max_standby_delay upwards to account for 
that.  For high availability purposes, it's vital that the delay number 
be referenced to the commit records on the master.  If lag is eating a 
portion of that, again it's something people should be monitoring for, 
but not something we can correct.  The whole idea here is that 
max_standby_delay is an upper bound on how stale the data on the standby 
can be, and whether or not lag is a component of that doesn't impact how 
the database is being asked to act.
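
The arithmetic behind "adjust upwards" can be sketched as follows.  It 
assumes, as a simplification, that propagation delay is constant and is 
subtracted directly from the time queries get; both function names are 
hypothetical.

```python
def effective_grace(max_standby_delay_s, propagation_delay_s):
    """Seconds left for standby queries once transport lag is subtracted."""
    return max(0, max_standby_delay_s - propagation_delay_s)


def required_max_standby_delay(desired_grace_s, propagation_delay_s):
    """Setting needed to keep a desired grace period despite transport lag."""
    return desired_grace_s + propagation_delay_s


# With a 30 s setting and 20 s of shipping lag, queries only get 10 s.
grace = effective_grace(30, 20)

# To keep the full 30 s for queries, the setting has to grow to 50 s.
setting = required_max_standby_delay(30, 20)
```

The cost of raising the setting is that the upper bound on staleness rises 
by the same amount, which is why the lag still needs monitoring.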

> In addition to these fundamental problems there's a fatal implementation
> problem: the actual comparison is not to the master's current clock
> reading, but to the latest commit, abort, or checkpoint timestamp read
> from the WAL.
Right; this has been documented for months at 
http://wiki.postgresql.org/wiki/Hot_Standby_TODO and on the list before 
that, i.e. "If there's little activity in the master, that can lead to 
surprising results."  The suggested long-term fix has been adding 
keepalive timestamps into SR, which seems to get reinvented every time 
somebody plays with this for a bit.  The HS documentation improvements 
I'm working on will suggest that people keep some sort of keepalive 
WAL-generating activity running regularly on the master if they expect 
max_standby_delay to work reasonably in the face of an idle master.  
It's not ideal, but it's straightforward to work around in user space.
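
One possible shape for such a user-space keepalive is a small write that 
is committed on the master at a regular interval, for example from cron.  
The sketch below is an assumption, not something from the HS 
documentation: the table hs_keepalive and the function send_keepalive are 
hypothetical, and sqlite3 stands in for the database only so the example 
runs without a server.  Any Python DB-API driver for PostgreSQL exposes 
the same cursor(), execute() and commit() calls.

```python
import sqlite3

# Hypothetical one-row heartbeat table; CURRENT_TIMESTAMP is standard SQL.
KEEPALIVE_SQL = "UPDATE hs_keepalive SET last_beat = CURRENT_TIMESTAMP"


def send_keepalive(conn):
    """Commit one tiny write so a fresh commit timestamp enters the WAL."""
    cur = conn.cursor()
    cur.execute(KEEPALIVE_SQL)
    conn.commit()
    return cur.rowcount


# Stand-in database for demonstration; on a real master this would be a
# connection opened with a PostgreSQL driver.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hs_keepalive (last_beat TEXT)")
demo.execute("INSERT INTO hs_keepalive VALUES (NULL)")
rows_touched = send_keepalive(demo)
```

The interval should be comfortably shorter than max_standby_delay, so 
that an idle master never looks like a lagging one.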

> I'm inclined to think that we should throw away all this logic and just
> have the slave cancel competing queries if the replay process waits
> more than max_standby_delay seconds to acquire a lock.  This is simple,
> understandable, and behaves the same whether we're reading live data or
> not.

I don't think something that allows queries to execute when the standby 
is not replaying recent "live" data is necessarily a step forward, from 
the perspective of implementations that prefer high availability.  It's 
reasonable for some people to request that a standby that's not current 
(more than max_standby_delay behind the master, based on the last thing 
received) not answer any queries at all, since it doesn't have current 
data and should be working on catching up instead.
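
The difference between the two rules shows up when the standby is far 
behind but replay has only just started waiting on a lock.  The sketch 
below models both rules as plain predicates; it is a simplification based 
on the descriptions in this thread, and the function names are 
hypothetical.

```python
def cancel_by_wal_timestamp(standby_now_s, last_wal_timestamp_s, max_delay_s):
    """Committed rule as described here: compare against WAL timestamps."""
    return standby_now_s - last_wal_timestamp_s > max_delay_s


def cancel_by_lock_wait(lock_wait_s, max_delay_s):
    """Proposed rule: compare only how long replay has waited for the lock."""
    return lock_wait_s > max_delay_s


MAX_DELAY_S = 30  # example setting

# Standby is 600 s behind the last WAL timestamp; replay began waiting just now.
timestamp_rule = cancel_by_wal_timestamp(1000, 400, MAX_DELAY_S)
lock_wait_rule = cancel_by_lock_wait(0, MAX_DELAY_S)
```

Under the timestamp rule the conflicting query is cancelled at once, so 
the standby spends its time catching up; under the lock-wait rule the 
query keeps running against data that is ten minutes old.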

Discussion has obviously wandered past your fundamental objections and 
onto implementation trivia, but I didn't think the difference between 
what you expected and what's actually committed had been properly 
addressed before that happened.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


