Re: pg_basebackup WAL streamer shutdown is bogus - leading to slow tests

Поиск
Список
Период
Сортировка
От Magnus Hagander
Тема Re: pg_basebackup WAL streamer shutdown is bogus - leading to slow tests
Дата
Msg-id CABUevEzwa+y4V8VtOLuTZ7OhvqdOBzrwQu-BgbgDwPgL_cVhsA@mail.gmail.com
обсуждение исходный текст
Ответ на pg_basebackup WAL streamer shutdown is bogus - leading to slow tests  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
On Sun, Jan 16, 2022 at 10:22 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I was wondering in [1] what we could do about the slowest tests on
> windows.
>
> On 2021-12-31 11:25:28 -0800, Andres Freund wrote:
> > Picking a random successful cfbot run [1] I see the following tap tests taking
> > more than 20 seconds:
> >
> > 67188 ms pg_basebackup t/010_pg_basebackup.pl
> > 59710 ms recovery t/001_stream_rep.pl
>
> Comparing these times to measurements taken on my normal linux workstation,
> something seemed just *very* off, even with a slow CI instance and windows in
> the mix.
>
> A bunch of printf debugging later, I realized the problem is that several of
> the pg_basebackups in tests take a *long* time. E.g. for t/001_stream_rep.pl
> the backups from the standby each take just over 10s. That's awfully
> specific...
>
> # Taking pg_basebackup my_backup from node "standby_1"
> # Running: pg_basebackup -D C:/dev/postgres/./tmp_check/t_001_stream_rep_standby_1_data/backup/my_backup -h
C:/Users/myadmin/AppData/Local/Temp/yba26PBYX1-p 59181 --checkpoint fast --no-sync --label my_backup -v
 
> # ran in 10.145s
> # Backup finished
>
> This reproduceably happens and it's *not* related to the socket shutdown()
> changes we've been debugging lately - even after a revert the problem
> persists.
>
> Because our logging for basebackups is quite weak, both for server and client
> side, I needed to add a fair bit more debugging to figure it out:
>
> pg_basebackup: wait to finish at 0.492
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: stream poll timeout 10.112
>
> The problem is that there's just no implemented way to timely shutdown the WAL
> streaming thread in pg_basebackup. The code in pg_basebackup.c says:
>
>         if (verbose)
>             pg_log_info("waiting for background process to finish streaming ...");
>         ...
>         /*
>          * On Windows, since we are in the same process, we can just store the
>          * value directly in the variable, and then set the flag that says
>          * it's there.
>          */
> ...
>                 xlogendptr = ((uint64) hi) << 32 | lo;
>                 InterlockedIncrement(&has_xlogendptr);
>
> But just setting a variable doesn't do much if the thread is in
> HandleCopyStream()->CopyStreamPoll()->select()
>
> The only reason we ever succeed shutting down, without more WAL coming in, is
> that pg_basebackup defaults to sending a status message every 10 seconds. At
> which point the thread sees has_xlogendptr = true, and shuts down.
>
>
> A test specific workaround would be to just add --status-interval=1 to
> Cluster.pm::backup(). But that seems very unsatisfying.
>
> I don't immediately see a solution for this, other than to add
> StreamCtl->stop_event (mirroring ->stop_socket) and then convert
> CopyStreamPoll() to use WaitForMultipleObjects().  Microsoft's select()
> doesn't support pipes and there's no socketpair().
>
> Any more straightforward ideas?
>
>
> From a cursory look at history, it used to be that pg_basebackup had this
> behaviour on all platforms, but it got fixed for other platforms in
> 7834d20b57a by Tom (assuming the problem wasn't present there).

Ugh, yeah that sounds like a correct analysis to me, and ugh, yeah
that's not very nice.

And yes, I think we have to create an event, and then use
WSAEventSelect() + WaitForSingleObjectEx(). Should be enough to just
use one event I think, and then the timeout -- but it might be more
readable to have a separate event for the socket and the stop? But we
can have just one event that's both used to stop and then use
WSAEventSelect() to associate it with the socket as well as neede.

(And yes, I agree that it's a lot better to fix it properly than to
just reduce the timeout)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alexander Lakhin
Дата:
Сообщение: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Следующее
От: Tom Lane
Дата:
Сообщение: Re: pg_basebackup WAL streamer shutdown is bogus - leading to slow tests