Re: failures in t/031_recovery_conflict.pl on CI

From Andres Freund
Subject Re: failures in t/031_recovery_conflict.pl on CI
Date
Msg-id 20220503182025.wvbebs2ojk6vpi5f@alap3.anarazel.de
In response to Re: failures in t/031_recovery_conflict.pl on CI  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: failures in t/031_recovery_conflict.pl on CI  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
> >> I can poke into that tomorrow, but are you sure that that isn't an
> >> expectable result?
> 
> > It's not expected. But I think I might see what the problem is:
> > We wait for the FETCH (and thus the buffer pin to be acquired). But that
> > doesn't guarantee that the lock has been acquired. We can't check that with
> > pump_until() afaics, because there'll not be any output. But a query_until()
> > checking pg_locks should do the trick?
> 
> Irritatingly, it doesn't reproduce (at least not easily) in a manual
> build on the same box.

Odd, given how readily it seems to reproduce on the bf. I assume you built with
> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN
> -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS


> So it's almost surely a timing issue, and your theory here seems plausible.

Unfortunately I don't think my theory holds, because I actually had added a
defense against this into the test that I forgot about momentarily...

# just to make sure we're waiting for lock already
ok( $node_standby->poll_query_until(
        'postgres', qq[
SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
], 'waiting'),
    "$sect: lock acquisition is waiting");

and on longfin that step completes successfully.


I think what happens is that we get a buffer pin conflict, because these days
we can actually process buffer pin conflicts while waiting for a lock. The
easiest way to get around that is to increase the replay timeout for that
test, I think?

I think we need a restart, not a reload, because reloads aren't guaranteed to
be processed at any certain point in time :/.


Testing a fix in a variety of timing circumstances now...


Greetings,

Andres Freund


