Re: Race conditions in 019_replslot_limit.pl

From: Kyotaro Horiguchi
Subject: Re: Race conditions in 019_replslot_limit.pl
Date:
Msg-id: 20220531.103107.1261637053934370702.horikyota.ntt@gmail.com
In reply to: Race conditions in 019_replslot_limit.pl (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
At Mon, 30 May 2022 12:01:55 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> > On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > > Andres Freund <andres@anarazel.de> writes:
> > > > I still feel like there's something off here. But that's probably not enough
> > > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > > longer, but not fail the test anymore?
> > > 
> > > WFM.
> > 
> > I've done so now.
> 
> I did look over the test results a couple times since then and once more
> today. There were a few cases with pretty significant numbers of iterations:
> 
> The highest is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
> showing:
> # multiple walsenders active in iteration 19
> 
> It's somewhat interesting that the worst case was just around the feature
> freeze, where the load on my buildfarm animal boxes was higher than normal.

If the disk is too busy, CheckPointReplicationSlots may take a very long time.
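CheckPointReplicationSlots fsyncs slot state files, so its duration on a loaded disk is dominated by fsync latency. A rough standalone way to get a feel for that latency on a given filesystem (an illustrative Python sketch, not part of PostgreSQL; the function name is my own):

```python
import os
import tempfile
import time

def fsync_latency(path=None, iterations=20):
    """Measure average fsync latency (seconds) by repeatedly syncing
    a small temporary file in the given directory (default: system tmp)."""
    with tempfile.NamedTemporaryFile(dir=path) as f:
        start = time.monotonic()
        for _ in range(iterations):
            f.write(b"x")
            f.flush()
            os.fsync(f.fileno())
        return (time.monotonic() - start) / iterations
```

On an idle disk this typically reports well under 100ms per fsync; if background load pushes it into hundreds of milliseconds, a checkpoint that syncs several files could plausibly stall for seconds.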

> I comparison to earlier approaches, with the current in-tree approach, we
> don't do anything when hitting the "problem", other than wait. Which does give
> us additional information - afaics there's nothing at all indicating that some
> other backend existed allowing the replication slot drop to finish.

preventing?  The checkpointer and a client backend that ran "SELECT * FROM
pg_stat_activity" are the only processes running during the blocking state.

> It just looks like for reasons I still do not understand, removing a directory
> and 2 files or so takes multiple seconds (at least ~36 new connections, 18
> pg_usleep(100_100)), while there are no other indications of problems.

That fact supports the theory that CheckPointReplicationSlots took a long time.

> I also still don't have a theory why this suddenly started to happen.

Maybe we need to see the OS-wide disk load at that time. Couldn't a
compiler or other non-postgres tools have put a significant load on the disks?
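On Linux, one way to check for such external load after the fact is to sample /proc/diskstats around the test run and diff the counters. A minimal parser sketch (field layout per the kernel's iostats documentation; function and key names are my own, for illustration only):

```python
def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into the fields useful for load analysis."""
    f = line.split()
    return {
        "device": f[2],
        "reads_completed": int(f[3]),
        "sectors_read": int(f[5]),
        "writes_completed": int(f[7]),
        "sectors_written": int(f[9]),
        # milliseconds spent doing I/O; the delta between two samples,
        # divided by wall time, approximates device utilization
        "io_time_ms": int(f[12]),
    }

def snapshot(path="/proc/diskstats"):
    """Return {device: parsed fields} for all block devices."""
    with open(path) as fh:
        return {d["device"]: d for d in map(parse_diskstats_line, fh)}
```

Comparing `io_time_ms` deltas across the window where the test stalled would show whether some non-postgres process kept the device saturated.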

> Unless somebody has another idea, I'm planning to remove all the debugging
> code added, but keep the retry based approach in 019_replslot_limit.pl, so we
> don't again get all the spurious failures.

+1.  
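For reference, the retry-based approach amounts to polling until the expected condition (here, a single active walsender) holds, rather than failing on the first mismatch. A sketch of that idea in Python (the actual test is a Perl TAP script; the names here are illustrative only):

```python
import time

def wait_for(check, timeout=30.0, interval=0.1):
    """Poll check() until it returns True; give up after `timeout` seconds.

    Transient states (e.g. a walsender that has not exited yet) are
    waited out instead of being treated as immediate test failures.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

In the test, the polled condition would wrap a query counting active walsenders; polling turns races into bounded waits at the cost of hiding how long the condition actually took to settle.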

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


