Re: "could not reattach to shared memory" on buildfarm member dory

From: Noah Misch
Subject: Re: "could not reattach to shared memory" on buildfarm member dory
Msg-id: 20190402135442.GA1173872@rfd.leadboat.com
In reply to: Re: "could not reattach to shared memory" on buildfarm member dory  (Noah Misch <noah@leadboat.com>)
Responses: Re: "could not reattach to shared memory" on buildfarm member dory  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Sun, Dec 02, 2018 at 09:35:06PM -0800, Noah Misch wrote:
> On Tue, Sep 25, 2018 at 08:05:12AM -0700, Noah Misch wrote:
> > On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> > > Overall, I agree that neither of these approaches are exactly attractive.
> > > We're paying a heck of a lot of performance or complexity to solve a
> > > problem that shouldn't even be there, and that we don't understand well.
> > > In particular, the theory that some privileged code is injecting a thread
> > > into every new process doesn't square with my results at
> > > https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
> > > 
> > > I think our best course of action at this point is to do nothing until
> > > we have a clearer understanding of what's actually happening on dory.
> > > Perhaps such understanding will yield an idea for a less painful fix.
> 
> Could one of you having a dory login use
> https://live.sysinternals.com/Procmon.exe to capture process events during
> backend startup?  The ideal would be one capture where startup failed reattach
> and another where it succeeded, but having the successful run alone would be a
> good start.

Joseph Ayers provided, off-list, the capture from a successful startup.  It
wasn't materially different from the one my system generates, so I abandoned
that line of inquiry.  Having explored other aspects of the problem, I expect
the attached fix will work.  I can reproduce the 4 MiB allocations described
in https://postgr.es/m/29823.1525132900@sss.pgh.pa.us; a few times per
"vcregress check", they emerge in the middle of PGSharedMemoryReAttach().  On
my system, there's 5.7 MiB of free address space just before UsedShmemSegAddr,
so the 4 MiB allocation fits in there, and PGSharedMemoryReAttach() does not
fail.  Still, it's easy to imagine that boring variations between environments
could surface dory's problem by reducing that free 5.7 MiB to, say, 3.9 MiB.
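How much free address space precedes the segment can be checked directly. The following is a hypothetical diagnostic sketch (not part of the patch): it walks backward from a target address with VirtualQuery, summing contiguous MEM_FREE regions. `free_space_below` is an illustrative name, and the reserved segment here merely stands in for the region at UsedShmemSegAddr.

```c
/*
 * Hypothetical diagnostic, assuming Windows: measure the free (MEM_FREE)
 * address space immediately below a target address by walking regions
 * backward with VirtualQuery.  The reserved segment stands in for the
 * shared memory region at UsedShmemSegAddr.
 */
#include <windows.h>
#include <stdio.h>

static SIZE_T
free_space_below(void *target)
{
	MEMORY_BASIC_INFORMATION mbi;
	char	   *p = (char *) target;
	SIZE_T		free_bytes = 0;

	/* Step down one region at a time while the regions are free. */
	while (p > (char *) NULL &&
		   VirtualQuery(p - 1, &mbi, sizeof(mbi)) == sizeof(mbi) &&
		   mbi.State == MEM_FREE)
	{
		free_bytes += p - (char *) mbi.BaseAddress;
		p = (char *) mbi.BaseAddress;
	}
	return free_bytes;
}

int
main(void)
{
	/* Reserve a stand-in segment, then report the free gap below it. */
	void	   *seg = VirtualAlloc(NULL, 4 * 1024 * 1024,
								   MEM_RESERVE, PAGE_NOACCESS);

	if (seg != NULL)
	{
		printf("free below segment: %zu bytes\n",
			   (size_t) free_space_below(seg));
		VirtualFree(seg, 0, MEM_RELEASE);
	}
	return 0;
}
```

A gap reported near 5.7 MiB matches the observation above; anything under 4 MiB would make a mid-reattach thread-stack allocation fatal.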

The 4 MiB allocations are stacks for new threads of the default thread
pool[1].  (I confirmed that by observing their size change when I changed
StackReserveSize in MSBuildProject.pm and by checking all stack pointers with
"thread apply all info frame" in gdb.)  The API calls in
PGSharedMemoryReAttach() don't cause the thread creation; it's a timing
coincidence.  Commit 2307868 would have worked around the problem, but
pg_usleep() is essentially a no-op on Windows before
pgwin32_signal_initialize() runs.  (I'll push Assert(pgwin32_signal_event) to
some functions.)  While one fix is to block until all expected threads have
started, that could be notably slow, and I don't know how to implement it
cleanly.  I think a better fix is to arrange for the system to prefer a
different address space region for these thread stacks; for details, see the
first comment the patch adds to win32_shmem.c.  This works here.
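The general shape of a reattach that keeps the target range occupied until the last moment can be sketched as follows. This is an illustration of the idea, not the patch itself; `reattach_at` and its parameters are hypothetical names, and a real implementation would retry on failure.

```c
/*
 * Sketch of a reattach that holds a MEM_RESERVE placeholder over the
 * range it will need, so stacks for system-spawned threads get placed
 * elsewhere, then swaps the placeholder for the real mapping.  Names
 * and structure are illustrative, not PostgreSQL's implementation.
 */
#include <windows.h>

void *
reattach_at(HANDLE hmap, void *wanted_addr, SIZE_T size)
{
	/* Placeholder keeps the allocator away from wanted_addr. */
	void	   *hold = VirtualAlloc(wanted_addr, size,
									MEM_RESERVE, PAGE_NOACCESS);

	if (hold != wanted_addr)
		return NULL;			/* the range is already taken */

	/*
	 * Race window: between the release and the map, a new thread stack
	 * could still land in the range; a real implementation would detect
	 * that and retry or fail cleanly.
	 */
	VirtualFree(hold, 0, MEM_RELEASE);
	return MapViewOfFileEx(hmap, FILE_MAP_ALL_ACCESS, 0, 0, 0,
						   wanted_addr);
}
```

The window between VirtualFree and MapViewOfFileEx is exactly why steering the thread stacks toward a different region, as the patch's comment describes, is more robust than racing them.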

> backend startup sees six thread creations:
> 
> 1. main thread
> 2. thread created before postgres.exe has control
> 3. thread created before postgres.exe has control
> 4. thread created before postgres.exe has control
> 5. in pgwin32_signal_initialize()
> 6. in src\backend\port\win32\timer.c:setitimer()
> 
> Threads 2-4 exit exactly 30s after creation.  If we fail to reattach to shared
> memory, we'll exit before reaching code to start 5 or 6.

Threads 2-4 proved to be worker threads of the default thread pool.

[1] https://docs.microsoft.com/en-us/windows/desktop/ProcThread/thread-pools

Attachments
