Re: Issue with the PRNG used by Postgres

From Tom Lane
Subject Re: Issue with the PRNG used by Postgres
Msg-id 4090821.1712772140@sss.pgh.pa.us
In reply to Re: Issue with the PRNG used by Postgres  (Andres Freund <andres@anarazel.de>)
Responses Re: Issue with the PRNG used by Postgres  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> On 2024-04-10 13:03:05 -0400, Tom Lane wrote:
>> So I think we need something like the attached.

> LGTM.

On third thought ... while I still think this is a misuse of
perform_spin_delay and we should change it, I'm not sure it'll do
anything to address Parag's problem, because IIUC he's seeing actual
"stuck spinlock" reports.  That implies that the inner loop of
LWLockWaitListLock slept NUM_DELAYS times without ever seeing
LW_FLAG_LOCKED clear.  What I'm suggesting would change the triggering
condition to "NUM_DELAYS sleeps without acquiring the lock", which is
strictly more likely to happen, so it's not going to help him.  It's
certainly still well out in we-shouldn't-get-there territory, though.
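
To make the difference between the two triggering conditions concrete, here is a toy model in C. It is not the lwlock.c code: the Observation enum, would_report_stuck(), and the tiny NUM_DELAYS value are all invented for illustration, and each trace element stands for what a waiter finds after one sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Deliberately tiny so traces stay short; not the real s_lock.c value. */
#define NUM_DELAYS 3

/* Hypothetical outcomes of one sleep-then-look cycle. */
typedef enum
{
    SAW_LOCKED,     /* flag still set after the sleep */
    SAW_CLEAR_LOST, /* flag seen clear, but another process won the retry */
    ACQUIRED        /* flag seen clear and the retry succeeded */
} Observation;

/*
 * Returns true if the waiter would report "stuck" on this trace.
 * reset_on_clear = true:  count sleeps since the flag was last seen clear.
 * reset_on_clear = false: count sleeps since the wait began.
 */
static bool
would_report_stuck(const Observation *trace, size_t n, bool reset_on_clear)
{
    int delays = 0;

    assert(trace != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (++delays > NUM_DELAYS)
            return true;        /* too many sleeps: give up */
        if (trace[i] == ACQUIRED)
            return false;
        if (trace[i] == SAW_CLEAR_LOST && reset_on_clear)
            delays = 0;         /* seeing the flag clear restarts the count */
    }
    return false;               /* trace ended without a verdict */
}
```

For the trace locked, locked, clear-but-lost, locked, locked, acquired, the first policy never reports stuck while the second does at the fourth sleep. In this model the resetting counter never exceeds the non-resetting one, so any trace that trips the first policy also trips the second.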

Also, fooling around with the cur_delay adjustment doesn't affect
this at all: "stuck spinlock" is still going to be raised after
NUM_DELAYS failures to observe the lock clear or obtain the lock.
Increasing cur_delay won't change that, it'll just spread the
fixed number of attempts over a longer period; and there's no
reason to believe that does anything except make it take longer
to fail.  Per the header comment for s_lock.c:

 * We time out and declare error after NUM_DELAYS delays (thus, exactly
 * that many tries).  With the given settings, this will usually take 2 or
 * so minutes.  It seems better to fix the total number of tries (and thus
 * the probability of unintended failure) than to fix the total time
 * spent.
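
The point that a larger delay only stretches the time to failure can be checked with a small simulation. This is a sketch, not the s_lock.c implementation: the constants are assumed values, simulate_stuck_wait() and StuckResult are hypothetical names, and the fixed "growth" multiplier stands in for the real, randomized cur_delay adjustment.

```c
#include <assert.h>

/* Assumed values for illustration; check s_lock.c for the real settings. */
#define NUM_DELAYS      1000
#define MIN_DELAY_USEC  1000L
#define MAX_DELAY_USEC  1000000L

typedef struct
{
    int  sleeps;      /* sleeps performed before giving up */
    long total_usec;  /* simulated time spent sleeping */
} StuckResult;

/*
 * Simulate waiting on a lock that never becomes free.  No real sleeping
 * happens; the delays are only summed.  "growth" is a hypothetical fixed
 * multiplier applied to the delay after each sleep.
 */
static StuckResult
simulate_stuck_wait(double growth)
{
    StuckResult r = {0, 0};
    long        cur_delay = MIN_DELAY_USEC;

    assert(growth >= 1.0);

    while (r.sleeps < NUM_DELAYS)
    {
        r.total_usec += cur_delay;      /* stands in for a sleep */
        r.sleeps++;
        cur_delay = (long) (cur_delay * growth);
        if (cur_delay > MAX_DELAY_USEC)
            cur_delay = MIN_DELAY_USEC; /* wrap back to the minimum */
    }
    return r;
}
```

With growth 1.0 the model gives up after 1000 sleeps totalling 1 second; with growth 2.0 it gives up after the same 1000 sleeps, about 102 seconds in. The number of chances to see the lock free is identical in both runs.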

If you believe that waiting processes can be awakened close enough to
simultaneously to hit the behavior I posited earlier, then encouraging
them to have different cur_delay values will help; but Andres doesn't
believe it and I concede it seems like a stretch.

So I think fooling with the details in s_lock.c is pretty much beside
the point.  The most likely bet is that Parag's getting bit by the
bug fixed in a4adc31f690.  It's possible he's seeing the effect of
some different issue that causes lwlock.c to hold that lock a long
time at scale, but that's where I'd look first.

            regards, tom lane


