Re: "could not reattach to shared memory" on buildfarm member dory

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: "could not reattach to shared memory" on buildfarm member dory
Дата
Msg-id 20180819020007.GD3795674@rfd.leadboat.com
обсуждение исходный текст
Ответ на Re: "could not reattach to shared memory" on buildfarm member dory  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: "could not reattach to shared memory" on buildfarm member dory
Список pgsql-hackers
On Tue, May 01, 2018 at 11:31:50AM -0400, Tom Lane wrote:
> Well, at this point the only thing that's entirely clear is that none
> of the ideas I had work.  I think we are going to be forced to pursue
> Noah's idea of doing an end-to-end retry.  Somebody else will need to
> take point on that; I lack a Windows environment and have already done
> a lot more blind patch-pushing than I like in this effort.

Having tried this, I find a choice between performance and complexity.  Both
of my designs use proc_exit(4) to indicate failure to reattach.  The simpler,
slower design has WIN32 internal_forkexec() block until the child reports (via
SetEvent()) that it reattached to shared memory.  This caused a fivefold
reduction in process creation performance[1].  The less-simple, faster design
stashes the Port structure and retry count in the BackendList entry, which
reaper() uses to retry the fork upon seeing status 4.  Notably, this requires
new code for regular backends, for bgworkers, and for others.  It's currently
showing a 30% performance _increase_ on the same benchmark; I can't explain
that increase and doubt it will last, but I think it's plausible for the
less-simple design to be performance-neutral.

I see these options:

1. Use the simpler design with a GUC, disabled by default, to control whether
   the new code is active.  Mention the GUC in a new errhint() for the "could
   not reattach to shared memory" error.

2. Like (1), but enable the GUC by default.

3. Like (1), but follow up with a patch to enable the GUC by default in v12
   only.

4. In addition to (1), enable retries if the GUC is set _or_ this postmaster
   has seen at least one child fail to reattach.

5. Use the less-simple design, with retries enabled unconditionally.

I think I prefer (3), with (1) being a close second.  My hesitation on (3) is
that parallel query has made startup time count even if you use a connection
pool, and all the Windows users not needing these retries will see parallel
query become that much slower.  I dislike (5) for its impact on
platform-independent postmaster code.  Other opinions?

I'm attaching a mostly-finished patch for the slower design.  I tested
correctness with -DREATTACH_FORCE_FAIL_PERCENT=99.  I'm also attaching a
proof-of-concept patch for the faster design.  In this proof of concept, the
postmaster does not close its copy of a backend socket until the backend
exits.  Also, bgworkers can change from BGWH_STARTED back to
BGWH_NOT_YET_STARTED; core code tolerates this, but external code may not.
Those would justify paying some performance to fix.  The proof of concept
handles bgworkers and regular backends, but it does not handle the startup
process, checkpointer, etc.  That doesn't affect benchmarking, of course.

nm

[1] This (2 forks per transaction) dropped from 139tps to 27tps:
    echo 'select 1' >script
    env PGOPTIONS="--default_transaction_isolation=repeatable\\ read --force_parallel_mode=on" pgbench -T15 -j30 -c30
--connect-n -fscript 

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Jonathan S. Katz"
Дата:
Сообщение: Re: Fix for REFRESH MATERIALIZED VIEW ownership error message
Следующее
От: "Jonathan S. Katz"
Дата:
Сообщение: Re: Fix for REFRESH MATERIALIZED VIEW ownership error message