Fix a race condition in ConditionVariableTimedSleep()
От | Bertrand Drouvot |
---|---|
Тема | Fix a race condition in ConditionVariableTimedSleep() |
Дата | |
Msg-id | aBhuTqNhMN3prcqe@ip-10-97-1-34.eu-west-3.compute.internal обсуждение исходный текст |
Ответы |
Re: Fix a race condition in ConditionVariableTimedSleep()
Re: Fix a race condition in ConditionVariableTimedSleep() |
Список | pgsql-hackers |
Hi hackers, While working on wait event related stuff I observed a failed assertion: " TRAP: failed Assert("node->next == 0 && node->prev == 0"), File: "../../../../src/include/storage/proclist.h", Line: 91 " during pg_regress/database. To reproduce, add an ereport(LOG,..) or CHECK_FOR_INTERRUPTS() or whatever would trigger ConditionVariableBroadcast() in pgstat_report_wait_end(): And launch: postgres=# CREATE TABLESPACE regress_tblspace LOCATION '/home/postgres/bdttbs'; CREATE TABLESPACE postgres=# create database regression_bdt9; CREATE DATABASE postgres=# ALTER DATABASE regression_bdt9 SET TABLESPACE regress_tblspace; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Succeeded. The issue is that in ConditionVariableTimedSleep() during the WaitLatch() we’ll also trigger ProcessInterrupts() that will call ConditionVariableCancelSleep() that will: - remove the process from the wait list - and also set cv_sleep_target to NULL. Then in ConditionVariableTimedSleep(), we’ll go through: “ /* * If this process has been taken out of the wait list, then we know * that it has been signaled by ConditionVariableSignal (or * ConditionVariableBroadcast), so we should return to the caller. But * that doesn't guarantee that the exit condition is met, only that we * ought to check it. So we must put the process back into the wait * list, to ensure we don't miss any additional wakeup occurring while * the caller checks its exit condition. We can take ourselves out of * the wait list only when the caller calls * ConditionVariableCancelSleep. * * If we're still in the wait list, then the latch must have been set * by something other than ConditionVariableSignal; though we don't * guarantee not to return spuriously, we'll avoid this obvious case. */ SpinLockAcquire(lock: &cv->mutex); if (!proclist_contains(&cv->wakeup, MyProcNumber, cvWaitLink)) { done = true; proclist_push_tail(&cv->wakeup, MyProcNumber, cvWaitLink); } SpinLockRelease(&cv->mutex); “ Means we re-add the process in the wait list. The issue is that cv_sleep_target is still NULL so that we’ll exit ConditionVariableTimedSleep() due to: " if (cv != cv_sleep_target) done = true; /* We were signaled, so return */ if (done) return false; " Later we’ll want to add our process to a wait list, calling ConditionVariablePrepareToSleep() that will not call ConditionVariableCancelSleep() due to: “ if (cv_sleep_target != NULL) ConditionVariableCancelSleep(); “ As cv_sleep_target is still NULL. Finally calling proclist_push_tail() that will trigger the failed assertion. The proposed fix attached is done in ConditionVariableTimedSleep() as this is the place that introduces the race condition. It re-assigns cv_sleep_target to cv and then ensures that cv_sleep_target accurately describes which condition variable we’re prepared to wait on. Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
В списке pgsql-hackers по дате отправления: