Re: Hot Standby conflict resolution handling

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Hot Standby conflict resolution handling
Дата
Msg-id 20121204081229.GA26353@alap2.lan
обсуждение исходный текст
Ответ на Hot Standby conflict resolution handling  (Pavan Deolasee <pavan.deolasee@gmail.com>)
Ответы Re: Hot Standby conflict resolution handling
Список pgsql-hackers
Hi,

On 2012-12-04 12:30:43 +0530, Pavan Deolasee wrote:
> I was trying some simple queries on a Hot Standby with streaming
> replication.
>
> On standby, I do this:
> postgres=# begin transaction isolation level repeatable read;
> BEGIN
> postgres=# explain verbose select count(b) from test WHERE a > 100000;
>
> On master, I insert a bunch of tuples in the table and run VACUUM ANALYZE.
> postgres=# INSERT INTO test VALUES (generate_series(110001,120000), 'foo',
> 1);
> INSERT 0 10000
> postgres=# VACUUM ANALYZE test;
> VACUUM
>
> After max_standby_streaming_delay, the standby starts cancelling the
> queries. I get an error like this on the standby:
> postgres=# explain verbose select count(b) from test WHERE a > 100000;
> FATAL:  terminating connection due to conflict with recovery
> DETAIL:  User query might have needed to see row versions that must be
> removed.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
>
> So I've couple questions/concerns here
>
> 1. Why to throw a FATAL error here ? A plain ERROR should be enough to
> abort the transaction. There are four places in ProcessInterrupts() where
> we throw these kind of errors and three of them are FATAL.

The problem here is that were in IDLE IN TRANSACTION in this case. Which
currently cannot be cancelled (i.e. pg_cancel_backend() just won't do
anything).

There are two problems making this non-trivial. For one, while we're in
IDLE IN TXN the client doesn't expect a response on a protocol level, so
we can't simply ereport() at that time.
For another, when were in IDLE IN TXN we're potentially inside openssl
so we can't jump out of there anyway because that would quite likely
corrupt the internal state of openssl.

I tried to fix this before (c.f. "Idle in transaction cancellation" or
similar) but while I had some kind of fix for the first issue (i saved
the error and reported it later when the protocol state allows it) I
missed the jumping out of openssl bit. I think its not that hard to
solve though. I remember having something preliminary but I never had
the time to finish it. If I remember correctly the trick was to set
openssl into non-blocking mode temporarily and return to the caller
inside be-secure.c:my_sock_read.
At that location ProcessInterrupts can run safely, error out silently,
and reraise the error once were in the right protocol state.

> 2836         else if (RecoveryConflictPending && RecoveryConflictRetryable)
> 2837         {
> 2838             pgstat_report_recovery_conflict(RecoveryConflictReason);
> 2839             ereport(FATAL,
> 2840                     (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
> 2841               errmsg("terminating connection due to conflict with
> recovery"),
> 2842                      errdetail_recovery_conflict()));
> 2843         }
> 2844         else if (RecoveryConflictPending)
> 2845         {
> 2846             /* Currently there is only one non-retryable recovery
> conflict */
> 2847             Assert(RecoveryConflictReason ==
> PROCSIG_RECOVERY_CONFLICT_DATABASE);
> 2848             pgstat_report_recovery_conflict(RecoveryConflictReason);
> 2849             ereport(FATAL,
> 2850                     (errcode(ERRCODE_DATABASE_DROPPED),
> 2851               errmsg("terminating connection due to conflict with
> recovery"),
> 2852                      errdetail_recovery_conflict()));
> 2853         }
>
> AFAICS the first of these should be ereport(ERROR). Otherwise irrespective
> of whether RecoveryConflictRetryable is true or false, we will always
> ereport(FATAL).

Which is fine, because were below if (ProcDiePending). Note there's a
separate path for QueryCancelPending. We go on to kill connections once
the normal conflict handling has tried several times.

> 2. For my test, the error message itself looks wrong because I did not
> actually remove any rows on the master. VACUUM probably marked a bunch of
> pages as all-visible and that should have triggered a conflict on the
> standby in order to support index-only scans. IMHO we should  improve the
> error message to avoid any confusion. Or we can add a new ProcSignalReason
> to differentiate between a cancel due to clean up vs visibilitymap_set()
> operation.

I think we desparately need to improve *all* of these message with
significantly more detail (cause for cancellation, relation, current
xid, conflicting xid, current/last query).

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jeff Davis
Дата:
Сообщение: Re: Enabling Checksums
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: visibilitymap_count() at the end of vacuum