Re: Hot Standby conflict resolution handling

From: Pavan Deolasee
Subject: Re: Hot Standby conflict resolution handling
Date:
Message-ID: CABOikdOCtU9bw-FGCr4+gbJdGosMddAK3EE1yfBamxBTNLRDEA@mail.gmail.com
In reply to: Re: Hot Standby conflict resolution handling  (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: Hot Standby conflict resolution handling
List: pgsql-hackers


On Tue, Dec 4, 2012 at 1:44 PM, Andres Freund <andres@2ndquadrant.com> wrote:

>
> After max_standby_streaming_delay, the standby starts cancelling the
> queries. I get an error like this on the standby:
> postgres=# explain verbose select count(b) from test WHERE a > 100000;
> FATAL:  terminating connection due to conflict with recovery
> DETAIL:  User query might have needed to see row versions that must be
> removed.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
>
> I have a couple of questions/concerns here:
>
> 1. Why throw a FATAL error here? A plain ERROR should be enough to
> abort the transaction. There are four places in ProcessInterrupts() where
> we throw these kinds of errors, and three of them are FATAL.

The problem here is that we're in IDLE IN TRANSACTION in this case, which
currently cannot be cancelled (i.e. pg_cancel_backend() just won't do
anything).

There are two problems making this non-trivial. For one, while we're in
IDLE IN TXN the client doesn't expect a response on a protocol level, so
we can't simply ereport() at that time.
For another, when we're in IDLE IN TXN we're potentially inside openssl,
so we can't jump out of there anyway, because that would quite likely
corrupt the internal state of openssl.
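
To make the first problem concrete, here is a minimal toy model (not PostgreSQL code; ToySession, toy_cancel() and toy_next_command() are all hypothetical names invented for illustration). It assumes the only safe moment to send an error is when the client has a command outstanding, so a cancellation that arrives while the session is idle gets saved and is handed back when the next command comes in:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model of a backend session; not PostgreSQL code. */
typedef struct ToySession
{
    bool idle_in_txn;       /* waiting for the client's next command */
    bool error_pending;     /* a cancellation arrived while idle */
    char pending_msg[128];  /* text of the saved error */
} ToySession;

/*
 * Called when something wants to cancel the session. Returns true if the
 * error may be sent to the client right now, false if it had to be saved.
 */
bool
toy_cancel(ToySession *s, const char *msg)
{
    if (!s->idle_in_txn)
        return true;        /* a command is running, a reply is expected */

    /* The client is not expecting a message: remember the error instead. */
    strncpy(s->pending_msg, msg, sizeof(s->pending_msg) - 1);
    s->pending_msg[sizeof(s->pending_msg) - 1] = '\0';
    s->error_pending = true;
    return false;
}

/*
 * Called when the next client command arrives. Returns the saved error text
 * (and clears the pending flag), or NULL if the command may run normally.
 */
const char *
toy_next_command(ToySession *s)
{
    s->idle_in_txn = false;
    if (!s->error_pending)
        return NULL;
    s->error_pending = false;
    return s->pending_msg;
}
```

The point of the sketch is only the ordering: the error is recorded immediately but delivered at a moment when the protocol allows a response.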

I tried to fix this before (cf. "Idle in transaction cancellation" or
similar), but while I had some kind of fix for the first issue (I saved
the error and reported it later, when the protocol state allows it), I
missed the jumping-out-of-openssl bit. I think it's not that hard to
solve, though. I remember having something preliminary, but I never had
the time to finish it. If I remember correctly, the trick was to set
openssl into non-blocking mode temporarily and return to the caller
inside be-secure.c:my_sock_read.
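
The "return to the caller instead of jumping out" idea can be sketched with plain POSIX calls. This is not the be-secure.c code and does not use OpenSSL at all; toy_set_nonblocking() and toy_read_once() are hypothetical helpers. The assumption is simply that a read on a non-blocking descriptor comes back with "would block" rather than sleeping, so control returns through the normal call stack and the caller can check for a pending interrupt at a point where no library state is half-updated:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

typedef enum
{
    TOY_READ_OK,            /* *nread bytes were read (0 means EOF) */
    TOY_READ_WOULD_BLOCK,   /* nothing available; caller may check interrupts */
    TOY_READ_ERROR          /* a real error; errno is set */
} ToyReadStatus;

/* Put fd into non-blocking mode. Returns the previous flags, or -1. */
int
toy_set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return flags;
}

/*
 * One read attempt. It reports its outcome through the return value and
 * never uses longjmp, so whoever called us can unwind normally.
 */
ToyReadStatus
toy_read_once(int fd, void *buf, size_t len, ssize_t *nread)
{
    ssize_t n = read(fd, buf, len);

    if (n >= 0)
    {
        *nread = n;
        return TOY_READ_OK;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return TOY_READ_WOULD_BLOCK;
    return TOY_READ_ERROR;
}
```

A caller would loop on toy_read_once() and, on TOY_READ_WOULD_BLOCK, look at its own interrupt flag before waiting for the descriptor again. Restoring blocking mode afterwards would be a second fcntl(F_SETFL) with the flags returned by toy_set_nonblocking().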

Thanks Andres. I also read the original thread, and I now understand why we are using FATAL here, at least until we have a better solution. The connection reset is not ideal either: as someone commented in the original discussion, it looks like a server crash when it is not, and that is exactly what I thought I was seeing.

 

>
> AFAICS the first of these should be ereport(ERROR). Otherwise irrespective
> of whether RecoveryConflictRetryable is true or false, we will always
> ereport(FATAL).

Which is fine, because we're below if (ProcDiePending). Note there's a
separate path for QueryCancelPending. We go on to kill connections once
the normal conflict handling has tried several times.
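
As a rough picture of that control flow, here is a simplified model, not the actual ProcessInterrupts() source. The struct, the enums and toy_classify() are hypothetical; only the flag names mirror the variables mentioned in this thread. It assumes nothing more than what is said here: everything under the ProcDiePending branch is FATAL, with the error code depending on the sub-case, and query cancellation goes through a separate branch.

```c
#include <stdbool.h>

typedef enum { TOY_NONE, TOY_ERROR, TOY_FATAL } ToySeverity;

/* Hypothetical labels standing in for the different error codes. */
typedef enum
{
    TOY_CODE_NONE,
    TOY_CODE_QUERY_CANCELED,
    TOY_CODE_RETRYABLE_CONFLICT,
    TOY_CODE_NONRETRYABLE_CONFLICT,
    TOY_CODE_OTHER_TERMINATION
} ToyCode;

typedef struct ToyInterruptState
{
    bool proc_die_pending;
    bool query_cancel_pending;
    bool recovery_conflict_pending;
    bool recovery_conflict_retryable;
} ToyInterruptState;

typedef struct ToyReport
{
    ToySeverity severity;
    ToyCode     code;
} ToyReport;

ToyReport
toy_classify(const ToyInterruptState *st)
{
    ToyReport r = { TOY_NONE, TOY_CODE_NONE };

    if (st->proc_die_pending)
    {
        /* Every sub-case here terminates the connection. */
        r.severity = TOY_FATAL;
        if (st->recovery_conflict_pending && st->recovery_conflict_retryable)
            r.code = TOY_CODE_RETRYABLE_CONFLICT;
        else if (st->recovery_conflict_pending)
            r.code = TOY_CODE_NONRETRYABLE_CONFLICT;
        else
            r.code = TOY_CODE_OTHER_TERMINATION;
    }
    else if (st->query_cancel_pending)
    {
        /* Simplification: modelled as a plain ERROR that aborts the query. */
        r.severity = TOY_ERROR;
        r.code = TOY_CODE_QUERY_CANCELED;
    }
    return r;
}
```

In this model RecoveryConflictRetryable changes only the code that is reported, never the severity, which is the behaviour discussed above.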


OK, understood. I now see that every path below if (ProcDiePending) ends in FATAL, albeit with different error codes. That explains the current code.

 

I think we desperately need to improve *all* of these messages with
significantly more detail (cause for cancellation, relation, current
xid, conflicting xid, current/last query).


I agree.

Thanks,
Pavan 


--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
