Re: Problem with synchronous replication

Поиск
Список
Период
Сортировка
От lingce.ldm
Тема Re: Problem with synchronous replication
Дата
Msg-id 4BA09384-F86E-4A3A-AA5D-544C8B3F32B5@alibaba-inc.com
обсуждение исходный текст
Ответ на Re: Problem with synchronous replication  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Ответы Re: Problem with synchronous replication
Список pgsql-hackers
On Oct 29, 2019, at 18:50, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Hello.

At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" <lingce.ldm@alibaba-inc.com> wrote in 

Hi,

I recently discovered two possible bugs about synchronous replication.

1. SyncRepCleanupAtProcExit may delete an element that has been deleted
SyncRepCleanupAtProcExit first checks whether the queue is detached, if it is not detached, 
acquires the SyncRepLock lock and deletes it. If this element has been deleted by walsender, 
it will be deleted repeatedly, SHMQueueDelete will core with a segment fault. 

IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then check
whether the queue is detached or not.

I think you're right here.

Thanks.


2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport check one interrupt.
For SyncRepWaitForLSN, if a query cancel interrupt arrives, we just terminate the wait 
with suitable warning. As follows:

a. set QueryCancelPending to false
b. errport outputs one warning
c. calls SyncRepCancelWait to delete one element from the queue

If another cancel interrupt arrives when we are outputting warning at step b, the errfinish
will call CHECK_FOR_INTERRUPTS that will output an ERROR, such as "canceling autovacuum
task", then the process will jump to the sigsetjmp. Unfortunately, the step c will be skipped
and the element that should be deleted by SyncRepCancelWait is remained.

The easiest way to fix this is to swap the order of step b and step c. On the other hand, 
let sigsetjmp clean up the queue may also be a good choice. What do you think?

Attached the patch, any feedback is greatly appreciated.

This is not right. It is in transaction commit so it is in a
HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
cancel/die interrupts thus the ereport should return.

I read the relevant code, you are right. I called SyncRepWaitForLSN somewhere else, 
but forgot to put it in a HOLD_INTERRUPTS and  triggered an exception.

regards.

Dongming Liu
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dilip Kumar
Дата:
Сообщение: Re: [HACKERS] Block level parallel vacuum
Следующее
От: lingce.ldm
Дата:
Сообщение: Re: Problem with synchronous replication