Assuming that TAS() will succeed the first time is verboten

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Assuming that TAS() will succeed the first time is verboten
Дата
Msg-id 5926.978036890@sss.pgh.pa.us
обсуждение исходный текст
Ответы Re: Assuming that TAS() will succeed the first time is verboten  (ncm@zembu.com (Nathan Myers))
Список pgsql-hackers
I have been digging into the observed failure

FATAL: Checkpoint lock is busy while data base is shutting down

on some Alpha machines.  It apparently doesn't happen on all Alphas,
but it's quite reproducible on some of them.

The bottom line turns out to be that on the Alpha hardware, it is
possible for TAS() to fail even when the lock is initially zero,
because that hardware's locking protocol will fail to acquire the
lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
called in a retry loop on Alphas.  Thus, the coding presently in
xlog.c,
   while (TAS(&(XLogCtl->chkp_lck)))   {       struct timeval delay = {2, 0};
       if (shutdown)           elog(STOP, "Checkpoint lock is busy while data base is shutting down");       (void)
select(0,NULL, NULL, NULL, &delay);   }
 

is no good because it does not allow for multiple retries.

Offhand I see no good reason why the above-quoted code isn't just
   S_LOCK(&(XLogCtl->chkp_lck));

and propose to fix this problem by reducing it to that.  If the lock
is held when it shouldn't be, we'll fail with a stuck-spinlock error.

It also bothers me that xlog.c contains several places where there is a
potentially infinite wait for a lock.  It seems to me that these should
time out with stuck-spinlock messages.  Do you object to such a change?
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Patrick Dunford"
Дата:
Сообщение: Connecting across internet
Следующее
От: Brent Verner
Дата:
Сообщение: Re: Alpha tas() patch