Re: Sync Rep and shutdown Re: Sync Rep v19

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Sync Rep and shutdown Re: Sync Rep v19
Дата
Msg-id AANLkTi=+0mNL289VGDdT4dKCYWd572toA3-6KyrptXVx@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Sync Rep and shutdown Re: Sync Rep v19  (Fujii Masao <masao.fujii@gmail.com>)
Ответы Re: Sync Rep and shutdown Re: Sync Rep v19  (Simon Riggs <simon@2ndQuadrant.com>)
Список pgsql-hackers
On Thu, Mar 17, 2011 at 2:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> This occurs to me; we should ensure that, in shutdown case, walwriter
> should exit after all the backends have gone out? I'm not sure if it's worth
> thinking of the case, but what if synchronous_standby_names is unset
> and config file is reloaded after smart shutdown is requested? In this
> case, the reload cannot wake up the waiting backends since walwriter
> has already exited. This behavior looks a bit inconsistent.

I agree we need to fix smart shutdown.  I think that's going to have
to be a separate patch, however; it's got more problems than this
patch can fix without expanding into a monster.

> In the patch, in order to read the latest value, you take a light-weight lock.
> But I wonder why taking a lock can ensure that the value is up-to-date.

A lock acquisition acts as a memory sequence point.

> + * WAL writer calls this as needed to update the shared sync_standbys_needed
>
> Typo: s/sync_standbys_needed/sync_standbys_defined

Fixed.

> + * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.
>
> Typo: the reference to SYNC_REP_MUST_DISCONNECT is not required.

Fixed.

> +                * So in this case we issue a NOTICE (which some clients may
>
> Typo: s/NOTICE/WARNING

Fixed.

> +               if (ProcDiePending)
> +               {
> +                       ereport(WARNING,
> +                                       (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +                                        errmsg("canceling the wait for replication and terminating
> connection due to administrator command"),
> +                                        errdetail("The transaction has already been committed locally
> but might have not been replicated to the standby.")));
> +                       whereToSendOutput = DestNone;
> +                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> +                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
> +                       SHMQueueDelete(&(MyProc->syncRepLinks));
>
> SHMQueueIsDetached() should be checked before calling SHMQueueDelete().
> Walsender can already delete the backend from the queue before reaching here.

Fixed.  But come to think of it, doesn't this mean
SyncRepCleanupAtProcExit() needs to repeat the test after acquiring
the lock?

> +               if (QueryCancelPending)
> +               {
> +                       QueryCancelPending = false;
> +                       ereport(WARNING,
> +                                       (errmsg("canceling wait for synchronous replication due to user request"),
> +                                        errdetail("The transaction has committed locally, but may not
> have replicated to the standby.")));
> +                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> +                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
> +                       SHMQueueDelete(&(MyProc->syncRepLinks));
> +                       LWLockRelease(SyncRepLock);
>
> Same as above.

Fixed.

> +               if (!PostmasterIsAlive(true))
> +               {
> +                       whereToSendOutput = DestNone;
> +                       proc_exit(1);
>
> proc_exit() should not be called at that point because it leads PANIC.

Fixed, although I'm not sure it matters.

> I think that it's better to check ProcDiePending, QueryCancelPending
> and PostmasterIsAlive *before* waiting on the latch, not after. Because
> those events can occur before reaching there, and it's not worth waiting
> for 60 seconds to detect them.

Not necessary.  Inspired by one of your earlier patches, I made die()
and StatementCancelHandler() set the latch.  We could still wait up to
60 s to detect postmaster death, but that's a very rare situation and
it's not worth slowing down the common case for it, especially since
there is a race condition no matter what.

Thanks for the review!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Cédric Villemain
Дата:
Сообщение: Re: really lazy vacuums?
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Sync Rep and shutdown Re: Sync Rep v19