> The bottom line turns out to be that on the Alpha hardware, it is
> possible for TAS() to fail even when the lock is initially zero,
> because that hardware's locking protocol will fail to acquire the
> lock if the ldq_l/stq_c sequence is interrupted. TAS() *must* be
> called in a retry loop on Alphas. Thus, the coding presently in
> xlog.c,
>
> while (TAS(&(XLogCtl->chkp_lck)))
> {
> struct timeval delay = {2, 0};
>
> if (shutdown)
> elog(STOP, "Checkpoint lock is busy while data
> base is shutting down");
> (void) select(0, NULL, NULL, NULL, &delay);
> }
>
> is no good because it does not allow for multiple retries.
>
> Offhand I see no good reason why the above-quoted code isn't just
>
> S_LOCK(&(XLogCtl->chkp_lck));
>
> and propose to fix this problem by reducing it to that. If the lock
> is held when it shouldn't be, we'll fail with a stuck-spinlock error.
It was not test against stuck slock: postmaster can shutdown db only
when there is no backends running and this was just additional test
for that. Assuming that postmaster does right things I don't object
to get rid of this test, but I would just remove shutdown test itself
- checkpoints run long time and 2 sec timeout is better than ones
in s_lock_sleep();
> It also bothers me that xlog.c contains several places where
> there is a potentially infinite wait for a lock. It seems to me
> that these should time out with stuck-spinlock messages. Do you object
> to such a change?
I always considered time-based decision is slock stuck or not as hack -
in old times elog(ERROR/FATAL) didn't unlock slocks sometimes, isn't
it fixed now?
Anyway I don't object if it bothers you -:)
But please do not use S_LOCK - as you see WAL code try to do other
things if slock of "primary interest" is busy.
Vadim