Re: Review for GetWALAvailability()

Поиск
Список
Период
Сортировка
От Alvaro Herrera
Тема Re: Review for GetWALAvailability()
Дата
Msg-id 20200616185047.GA19008@alvherre.pgsql
обсуждение исходный текст
Ответ на Re: Review for GetWALAvailability()  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Ответы Re: Review for GetWALAvailability()  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Список pgsql-hackers
On 2020-Jun-17, Fujii Masao wrote:

> While reading InvalidateObsoleteReplicationSlots() code, I found another issue.
> 
>             ereport(LOG,
>                     (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
>                             wspid, NameStr(slotname))));
>             (void) kill(wspid, SIGTERM);
> 
> wspid indicates the PID of process using the slot. That process can be
> a backend, for example, executing pg_replication_slot_advance().
> So "walsender" in the above log message is not always correct.

Good point.

>             int            wspid = ReplicationSlotAcquire(NameStr(slotname),
>                                                        SAB_Inquire);
> 
> Why do we need to call ReplicationSlotAcquire() here and mark the slot as
> used by the checkpointer? Isn't it enough to check directly the slot's
> active_pid, instead?
> 
> Maybe ReplicationSlotAcquire() is necessary because
> ReplicationSlotRelease() is called later? If so, why do we need to call
> ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
> active_pid is zero. No?

I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.


> +        /*
> +         * Signal to terminate the process using the replication slot.
> +         *
> +         * Try to signal every 100ms until it succeeds.
> +         */
> +        if (!killed && kill(active_pid, SIGTERM) == 0)
> +            killed = true;
> +        ConditionVariableTimedSleep(&slot->active_cv, 100,
> +                                    WAIT_EVENT_REPLICATION_SLOT_DROP);
> +    } while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: Review for GetWALAvailability()
Следующее
От: Robert Haas
Дата:
Сообщение: Re: global barrier & atomics in signal handlers (Re: Atomicoperations within spinlocks)