Re: Windows vs recovery tests

From: Andres Freund
Subject: Re: Windows vs recovery tests
Msg-id: 20220112235826.nzwqga5b6felcqkn@alap3.anarazel.de
In reply to: Windows vs recovery tests (Andrew Dunstan <andrew@dunslane.net>)
Responses: Re: Windows vs recovery tests (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hi,

On 2022-01-12 14:34:00 -0500, Andrew Dunstan wrote:
> For some considerable time the recovery tests have been at best flaky on
> Windows, and at worst disastrous (i.e. they can hang rather than just
> fail). It's a problem I worked around on my buildfarm animals by
> disabling the tests, hoping to find time to get back to analysing the
> problem. But now we are seeing failures on the cfbot too (e.g.
> https://cirrus-ci.com/task/5860692694663168 and
> https://cirrus-ci.com/task/5316745152954368 ) so I think we need to
> spend some effort on finding out what's going on here.

I'm somewhat certain that this is caused by assertions or aborts hanging on a
GUI popup, e.g. one raised by a check in the CRT.

I saw these kinds of hangs a lot in the aio development tree before I merged
the changes to error/abort handling on Windows. Before the recent CI changes,
cfbot ran the Windows tests without assertions. Besides simply running fewer
tests, that explains why we saw fewer such hangs before: the debug CRT has
more sources of these error popups.
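If popups are the cause, one harness-side mitigation is for the test driver to set the Win32 error mode before it starts anything, because child processes inherit the error mode of their parent. The following is a minimal Python sketch under that assumption, not the approach taken in the patch referenced in this message; the function name is hypothetical. It only covers the system's critical-error and fault-reporting dialogs. Debug-CRT dialogs (assertion failures, the abort() message) have to be redirected inside each process, e.g. via `_CrtSetReportMode` or `_set_abort_behavior`.

```python
import ctypes
import sys
from typing import Optional

# Flag values documented for the Win32 SetErrorMode function.
SEM_FAILCRITICALERRORS = 0x0001
SEM_NOGPFAULTERRORBOX = 0x0002


def suppress_error_dialogs() -> Optional[int]:
    """Hypothetical helper: on Windows, add flags that stop the system from
    showing critical-error and fault dialogs for this process and for the
    children it spawns afterwards (they inherit the error mode).

    Returns the previous error mode, or None on non-Windows platforms.
    Does NOT affect debug-CRT assertion/abort dialogs.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    # Keep any flags that are already set and OR in the two we want.
    previous = kernel32.GetErrorMode()
    kernel32.SetErrorMode(previous | SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX)
    return previous
```

The call would have to happen before the driver launches perl, psql, or postgres, since the mode is copied into a child only when it is created.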

It'd be nice if somebody could look at the patch and discussion in
https://www.postgresql.org/message-id/20211005193033.tg4pqswgvu3hcolm%40alap3.anarazel.de


The debugging information for the cirrus-ci tasks has a list of
processes. E.g. for https://cirrus-ci.com/task/5860692694663168 there's

      1 agent.exe
      1 CExecSvc.exe
      1 csrss.exe
      1 fontdrvhost.exe
      1 lsass.exe
      1 msdtc.exe
      1 psql.exe
      1 services.exe
      1 wininit.exe
      9 cmd.exe
      9 perl.exe
      9 svchost.exe
     49 postgres.exe
processes.

So we know that some tests were actually still in progress... It's
particularly interesting that there's a psql process still hanging around...
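The process listing looks like `sort | uniq -c` output. The following sketch reproduces that tally from a flat list of process image names and picks out leftovers that suggest a test run was still in progress. It makes some assumptions: that the debugging step can produce such a flat list, and that the names in the hypothetical watchlist `TEST_PROCESSES` are the relevant ones.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical watchlist: processes whose presence implies tests were still running.
TEST_PROCESSES = {"perl.exe", "psql.exe", "postgres.exe"}


def summarize_processes(names: Iterable[str]) -> List[Tuple[int, str]]:
    """Count process names and order them by count, then case-insensitively
    by name, matching the layout of the listing above."""
    counts = Counter(names)
    return sorted(
        ((n, name) for name, n in counts.items()),
        key=lambda pair: (pair[0], pair[1].lower()),
    )


def format_summary(summary: List[Tuple[int, str]]) -> str:
    """Render like `uniq -c`: count right-aligned in 7 columns, then the name."""
    return "\n".join(f"{n:7d} {name}" for n, name in summary)


def still_running(names: Iterable[str]) -> List[str]:
    """Return the watched process names that are present, sorted."""
    return sorted(set(names) & TEST_PROCESSES)
```

Applied to the task above, `still_running` would report `perl.exe`, `postgres.exe`, and `psql.exe`, which matches the observation that tests were still in progress.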


Before I "simplified" that away, the CI patch ran all tests with an individual
timeout shorter than the overall CI timeout, so we'd still get to see the error
logs etc. Perhaps removing that was a mistake. IIRC I did something like

"C:\Program Files\Git\usr\bin\timeout.exe" -v --kill-after=35m 30m perl path/to/vcregress.pl ...

Perhaps worth re-adding?
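The `timeout` invocation above sends TERM after 30 minutes and, if the command is still alive 35 minutes after that, sends KILL. Here is a minimal Python sketch of the same two-stage idea, in case a wrapper inside the test driver is preferable to depending on Git's bundled `timeout.exe`. The function name and its return shape are hypothetical. Two caveats apply. On Windows, `terminate()` and `kill()` both end the process abruptly. And neither call reaches grandchildren such as the postgres.exe processes a hung perl script may have started.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_with_timeout(
    cmd: Sequence[str], timeout_s: float, kill_after_s: float
) -> Tuple[int, bool]:
    """Hypothetical wrapper: run cmd, send TERM after timeout_s seconds, then
    KILL if it is still running kill_after_s seconds after the TERM.

    Returns (returncode, timed_out).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        print(f"timeout: sending TERM to {cmd[0]}", file=sys.stderr)
        proc.terminate()
    try:
        return proc.wait(timeout=kill_after_s), True
    except subprocess.TimeoutExpired:
        print(f"timeout: sending KILL to {cmd[0]}", file=sys.stderr)
        proc.kill()
        return proc.wait(), True
```

Mirroring the command above would be `run_with_timeout(["perl", "path/to/vcregress.pl", ...], 30 * 60, 35 * 60)`, with the elided arguments filled in. Returning `timed_out` separately lets the driver go on to collect logs before reporting the failure.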

Greetings,

Andres Freund


