Re: Review for GetWALAvailability()

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: Review for GetWALAvailability()
Дата
Msg-id 20200616.140026.1624101556608079529.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: Review for GetWALAvailability()  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Ответы Re: Review for GetWALAvailability()  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Список pgsql-hackers
At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> > In short, it is known behavior but it was judged as useless to prevent
> > that.
> > That can happen when checkpointer removes up to the segment that is
> > being read by walsender.  I think that that doesn't happen (or
> > happenswithin a narrow time window?) for physical replication but
> > happenes for logical replication.
> > While development, I once added walsender a code to exit for that
> > reason, but finally it is moved to InvalidateObsoleteReplicationSlots
> > as a bit defferent function.
> 
> BTW, I read the code of InvalidateObsoleteReplicationSlots() and
> probably
> found some issues in it.
> 
> 1. Each cycle of the "for" loop in
> InvalidateObsoleteReplicationSlots()
>     emits the log message  "terminating walsender ...". This means that
>     if it takes more than 10ms for walsender to exit after it's signaled,
>     the second and subsequent cycles would happen and output the same
>     log message several times. IMO that log message should be output
>     only once.

Sounds reasonable.

> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan
> replication                              
>     slots array and uses the "for" loop in each scan. Also it calls
>     ReplicationSlotAcquire() for each "for" loop cycle, and
>     ReplicationSlotAcquire() uses another loop to scan replication slots
>     array. I don't think this is good design.
> 
>     ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
>     InvalidateObsoleteReplicationSlots() already know the index of the
>     slot
>     that we want to find. The attached patch does that. Thought?

The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.

The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot.  The only caller of the function
with SAB_Block is ReplicationSlotDrop and I don't come up with a case
where another slot with the same name is created at different place
before the condition variable fires. But I'm not sure the change is
completely safe. Maybe some assertion is needed?

> 3. There is a corner case where the termination of walsender cleans up
>     the temporary replication slot while
>     InvalidateObsoleteReplicationSlots()
>     is sleeping on ConditionVariableTimedSleep(). In this case,
>     ReplicationSlotAcquire() is called in the subsequent cycle of the
>     "for"
>     loop, cannot find the slot and then emits ERROR message. This leads
>     to the failure of checkpoint by the checkpointer.

Agreed.

>     To avoid this case, if SAB_Inquire is specified,
>     ReplicationSlotAcquire()
>     should return the special value instead of emitting ERROR even when
>     it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
>     should
>     handle that special returned value.

I thought the same thing hearing that. 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Ashutosh Bapat
Дата:
Сообщение: Re: factorial of negative numbers
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: language cleanups in code and docs