Re: Problem with synchronous replication

From: Kyotaro Horiguchi
Subject: Re: Problem with synchronous replication
Message-ID: 20191029.195001.642314195780172818.horikyota.ntt@gmail.com
In reply to: Problem with synchronous replication  ("Dongming Liu" <lingce.ldm@alibaba-inc.com>)
Responses: Re: Problem with synchronous replication
Re: Problem with synchronous replication
List: pgsql-hackers
Hello.

At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" <lingce.ldm@alibaba-inc.com> wrote in 
> 
> Hi,
> 
> I recently discovered two possible bugs in synchronous replication.
> 
> 1. SyncRepCleanupAtProcExit may delete an element that has already been deleted
> SyncRepCleanupAtProcExit first checks whether the queue element is detached and, if it is not,
> acquires SyncRepLock and deletes it. If the walsender has already deleted this element in the
> meantime, it is deleted a second time and SHMQueueDelete crashes with a segmentation fault.
> 
> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then check
> whether the queue is detached or not.

I think you're right here.
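
To make the ordering concrete, here is a minimal standalone sketch of the
lock-then-check pattern. It is not the PostgreSQL source: QueueLink,
queue_lock and remove_if_queued are hypothetical stand-ins for SHM_QUEUE,
SyncRepLock and the cleanup code, and a pthread mutex stands in for the
LWLock. The detached test and the delete happen under the same lock, so a
second remover cannot act on a stale answer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared-memory queue link (not SHM_QUEUE). */
typedef struct QueueLink
{
    struct QueueLink *prev;
    struct QueueLink *next;
} QueueLink;

/* Hypothetical stand-in for SyncRepLock. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* An empty circular queue points at itself. */
static void
queue_init(QueueLink *head)
{
    head->prev = head->next = head;
}

/* A link that is on no queue has NULL neighbours. */
static bool
link_is_detached(const QueueLink *link)
{
    return link->next == NULL;
}

static void
queue_insert_tail(QueueLink *head, QueueLink *link)
{
    link->prev = head->prev;
    link->next = head;
    head->prev->next = link;
    head->prev = link;
}

/* Caller must hold queue_lock; calling this on a detached link would
 * dereference NULL, which is the double-delete crash in this model. */
static void
link_delete(QueueLink *link)
{
    link->prev->next = link->next;
    link->next->prev = link->prev;
    link->prev = link->next = NULL;
}

/*
 * Take the lock first, then test and delete. Returns true if this call
 * removed the link, false if someone else had already removed it.
 */
static bool
remove_if_queued(QueueLink *link)
{
    bool removed = false;

    pthread_mutex_lock(&queue_lock);
    if (!link_is_detached(link))
    {
        link_delete(link);
        removed = true;
    }
    pthread_mutex_unlock(&queue_lock);
    return removed;
}
```

In this model both the exiting backend and the walsender side would go
through remove_if_queued(); whichever runs second sees the link already
detached and does nothing.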

> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport processes an interrupt.
> In SyncRepWaitForLSN, if a query cancel interrupt arrives, we just terminate the wait
> with a suitable warning, as follows:
> 
> a. set QueryCancelPending to false
> b. ereport outputs a warning
> c. call SyncRepCancelWait to delete the element from the queue
> 
> If another cancel interrupt arrives while we are outputting the warning at step b, errfinish
> will call CHECK_FOR_INTERRUPTS, which will raise an ERROR such as "canceling autovacuum
> task", and the process will jump to the sigsetjmp. Unfortunately, step c is then skipped
> and the element that SyncRepCancelWait should have deleted remains in the queue.
> 
> The easiest way to fix this is to swap the order of steps b and c. Alternatively,
> letting the sigsetjmp handler clean up the queue may also be a good choice. What do you think?
> 
> Patch attached; any feedback is greatly appreciated.

This is not right. SyncRepWaitForLSN is called during transaction
commit, so it runs inside a HOLD_INTERRUPTS section. ProcessInterrupts
does not act on cancel/die interrupts while interrupts are held off,
so the ereport should return normally.
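
As a rough illustration of that argument, the following standalone model
shows why a cancel that arrives during the warning is left pending rather
than acted on. It is a simplified sketch and not the PostgreSQL code: all
names are hypothetical lower-case stand-ins for HOLD_INTERRUPTS,
RESUME_INTERRUPTS, CHECK_FOR_INTERRUPTS and ProcessInterrupts, and
"processing" a cancel only bumps a counter where the real code would raise
an ERROR and longjmp away.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; not the real PostgreSQL globals. */
static volatile bool interrupt_pending = false;
static volatile bool query_cancel_pending = false;
static int interrupt_holdoff_count = 0;
static int cancels_processed = 0;

static void
hold_interrupts(void)
{
    interrupt_holdoff_count++;
}

static void
resume_interrupts(void)
{
    assert(interrupt_holdoff_count > 0);
    interrupt_holdoff_count--;
}

/* Stand-in for ProcessInterrupts: does nothing while held off. */
static void
process_interrupts(void)
{
    if (interrupt_holdoff_count != 0)
        return;                 /* leave the interrupt pending */

    interrupt_pending = false;
    if (query_cancel_pending)
    {
        query_cancel_pending = false;
        cancels_processed++;    /* the real code would raise an ERROR here */
    }
}

/* Stand-in for CHECK_FOR_INTERRUPTS. */
static void
check_for_interrupts(void)
{
    if (interrupt_pending)
        process_interrupts();
}

/* Stand-in for a signal handler noting a cancel request. */
static void
signal_cancel(void)
{
    query_cancel_pending = true;
    interrupt_pending = true;
}

/*
 * Models step b followed by step c: a cancel arrives while the warning is
 * being emitted and the reporting code checks for interrupts. Returns true
 * if control would reach the cleanup step, false if the cancel was
 * processed first (where the real code would have jumped away).
 */
static bool
warn_then_cleanup(void)
{
    int before = cancels_processed;

    signal_cancel();
    check_for_interrupts();
    return cancels_processed == before;
}
```

With interrupts held, warn_then_cleanup() reaches the cleanup step and the
cancel stays pending until the matching resume and the next check; without
the hold, the same call processes the cancel before the cleanup step.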

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


