Re: stress test for parallel workers

From Thomas Munro
Subject Re: stress test for parallel workers
Msg-id CA+hUKGL6cDyb2maq2P60cEsjFK=3saBCAj7sDzE3jysL-PRwqg@mail.gmail.com
In response to Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: stress test for parallel workers  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do you have an example to hand?  Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster exited during a parallel transaction" failures is accurate,
> there is a very strong correlation I'd not noticed: it's just a few
> buildfarm critters that are producing those.  To wit, I find that
> string in these recent failures (checked all runs in the past 3 months):
>
>   sysname  |    branch     |      snapshot
> -----------+---------------+---------------------
>  lorikeet  | HEAD          | 2019-06-16 20:28:25
>  lorikeet  | HEAD          | 2019-07-07 14:58:38
>  lorikeet  | HEAD          | 2019-07-02 10:38:08
>  lorikeet  | HEAD          | 2019-06-14 14:58:24
>  lorikeet  | HEAD          | 2019-07-04 20:28:44
>  lorikeet  | HEAD          | 2019-04-30 11:00:49
>  lorikeet  | HEAD          | 2019-06-19 20:29:27
>  lorikeet  | HEAD          | 2019-05-21 08:28:26
>  lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
>  lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
>  lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
>  lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
>  lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
>  vulpes    | HEAD          | 2019-06-14 09:18:18
>  vulpes    | HEAD          | 2019-06-27 09:17:19
>  vulpes    | HEAD          | 2019-07-21 09:01:45
>  vulpes    | HEAD          | 2019-06-12 09:11:02
>  vulpes    | HEAD          | 2019-07-05 08:43:29
>  vulpes    | HEAD          | 2019-07-15 08:43:28
>  vulpes    | HEAD          | 2019-07-19 09:28:12
>  wobbegong | HEAD          | 2019-06-09 20:43:22
>  wobbegong | HEAD          | 2019-07-02 21:17:41
>  wobbegong | HEAD          | 2019-06-04 21:06:07
>  wobbegong | HEAD          | 2019-07-14 20:43:54
>  wobbegong | HEAD          | 2019-06-19 21:05:04
>  wobbegong | HEAD          | 2019-07-08 20:55:18
>  wobbegong | HEAD          | 2019-06-28 21:18:46
>  wobbegong | HEAD          | 2019-06-02 20:43:20
>  wobbegong | HEAD          | 2019-07-04 21:01:37
>  wobbegong | HEAD          | 2019-06-14 21:20:59
>  wobbegong | HEAD          | 2019-06-23 21:36:51
>  wobbegong | HEAD          | 2019-07-18 21:31:36
> (32 rows)
>
> We already knew that lorikeet has its own peculiar stability
> problems, and these other two critters run different compilers
> on the same Fedora 27 ppc64le platform.
>
> So I think I've got to take back the assertion that we've got
> some lurking generic problem.  This pattern looks way more
> like a platform-specific issue.  Overaggressive OOM killer
> would fit the facts on vulpes/wobbegong, perhaps, though
> it's odd that it only happens on HEAD runs.

chipmunk also:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.

On some systems (depending on sysctl kernel.dmesg_restrict on Linux,
security.bsd.unprivileged_read_msgbuf on FreeBSD, etc.) you can run
dmesg as a non-root user, and that is where the OOM killer's footprints
or signaled exit statuses for processes under init would normally be
found, but that seems a bit invasive for the host system (I guess you'd
filter it carefully).  Unfortunately it isn't enabled on many common
systems anyway.
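
To make the "filter it carefully" part concrete, here is a rough sketch
of what a build farm client might do on Linux.  Everything in it is an
assumption for illustration, not existing build farm code: the function
names are made up, the regular expression only covers OOM-killer
message wordings commonly seen in Linux kernel logs (they vary between
kernel versions), and it assumes the postmaster shows up under the
command name "postgres".

```python
import re

# Assumed wordings of Linux OOM-killer log lines; real kernels differ by version.
OOM_PATTERN = re.compile(
    r"(?:Out of memory: Kill(?:ed)? process|Killed process) "
    r"(?P<pid>\d+) \((?P<comm>[^)]*)\)"
)


def dmesg_restricted(path="/proc/sys/kernel/dmesg_restrict"):
    """Return True if unprivileged dmesg is restricted, False if it is
    allowed, or None if the sysctl file cannot be read (e.g. not Linux)."""
    try:
        with open(path) as f:
            return f.read().strip() != "0"
    except OSError:
        return None


def oom_kills(log_text, comm="postgres"):
    """Return (pid, line) pairs for OOM kills of processes named comm,
    ignoring every other line of the kernel log."""
    hits = []
    for line in log_text.splitlines():
        m = OOM_PATTERN.search(line)
        if m and m.group("comm") == comm:
            hits.append((int(m.group("pid")), line))
    return hits
```

A client could call dmesg_restricted() first, and only if it returns
False feed the output of dmesg to oom_kills(), so that nothing but
lines about killed postgres processes ever leaves the host.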

Maybe there is a systemd-specific way to get the info we need without
being root?

Another idea: start the postmaster under a subreaper (Linux 3.4+
prctl(PR_SET_CHILD_SUBREAPER), FreeBSD 10.2+
procctl(PROC_REAP_ACQUIRE)) that exists just to report on its
children's exit status, so the build farm could see a "pid XXX was
killed by signal 9" message if it is nuked by the OOM killer.  Perhaps
there is a common subreaper wrapper out there that would wait, print
messages like that, rinse and repeat until it has no children and then
exit, or perhaps pg_ctl or even a perl script could do something like
that if requested.  Another thought, not explored, is the brand new
Linux pidfd stuff that can be used to wait and get exit status for a
non-child process (or the older BSD equivalent), but the paint isn't
even dry on that stuff anyway.
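
For illustration, a minimal Linux-only sketch of such a subreaper
wrapper follows.  It is not an existing tool: the function names are
hypothetical, the constant 36 is PR_SET_CHILD_SUBREAPER from
<linux/prctl.h>, and prctl() is reached through ctypes because Python
has no direct binding for it.  The FreeBSD procctl() variant is not
covered.

```python
import ctypes
import os
import sys

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>, Linux 3.4+


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def describe_exit(pid, status):
    """Turn a raw wait status into a one-line report."""
    if os.WIFSIGNALED(status):
        return "pid %d was killed by signal %d" % (pid, os.WTERMSIG(status))
    if os.WIFEXITED(status):
        return "pid %d exited with status %d" % (pid, os.WEXITSTATUS(status))
    return "pid %d changed state (raw status %d)" % (pid, status)


def run_and_report(argv, out=sys.stderr):
    """Run argv, then reap and report descendants until none remain."""
    become_subreaper()
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    reports = []
    while True:
        try:
            pid, status = os.wait()
        except ChildProcessError:
            break  # no children left, so we are done
        line = describe_exit(pid, status)
        print(line, file=out)
        reports.append(line)
    return reports


if __name__ == "__main__" and len(sys.argv) > 1:
    run_and_report(sys.argv[1:])
```

The point of the subreaper flag is that a postmaster started via
pg_ctl is orphaned when pg_ctl exits; without the flag it would be
reparented to init, with it the wrapper becomes its parent and gets to
see the exit status.  Note that the wrapper only returns once every
descendant has exited, so it would have to wrap the whole test run.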

--
Thomas Munro
https://enterprisedb.com


