Thread: can we optimize STACK_DEPTH_SLOP

can we optimize STACK_DEPTH_SLOP

From
Greg Stark
Date:
Poking at NetBSD kernel source it looks like the default ulimit -s
depends on the architecture and ranges from 512k to 16M. Postgres
insists on max_stack_depth being STACK_DEPTH_SLOP -- ie 512kB -- less
than the ulimit setting making it impossible to start up on
architectures with a default of 512kB without raising the ulimit.

If we could just lower it to 384kB then Postgres would start up but I
wonder if we should just use MIN(stack_rlimit/2, STACK_DEPTH_SLOP) so
that there's always a setting of max_stack_depth that
would allow Postgres to start.
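
A minimal sketch of that idea, assuming a getrlimit()-based check at startup
(the function name and layout below are illustrative, not actual backend code):

#include <stdio.h>
#include <sys/resource.h>

#define STACK_DEPTH_SLOP    (512 * 1024L)

/* Hypothetical helper: the slop actually required given the current rlimit. */
static long
effective_stack_slop(void)
{
    struct rlimit rlim;
    long        slop = STACK_DEPTH_SLOP;

    if (getrlimit(RLIMIT_STACK, &rlim) == 0 &&
        rlim.rlim_cur != RLIM_INFINITY &&
        (long) (rlim.rlim_cur / 2) < slop)
        slop = (long) (rlim.rlim_cur / 2);  /* MIN(stack_rlimit/2, STACK_DEPTH_SLOP) */

    return slop;
}

int
main(void)
{
    printf("required slop: %ld bytes\n", effective_stack_slop());
    return 0;
}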

./arch/sun2/include/vmparam.h:73:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/arm/include/arm32/vmparam.h:66:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/sun3/include/vmparam3.h:109:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/sun3/include/vmparam3x.h:58:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/luna68k/include/vmparam.h:70:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/hppa/include/vmparam.h:62:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/hp300/include/vmparam.h:82:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/alpha/include/vmparam.h:79:#define DFLSSIZ (1<<21) /* initial stack size (2M) */
./arch/acorn26/include/vmparam.h:55:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/amd64/include/vmparam.h:83:#define DFLSSIZ (4*1024*1024) /* initial stack size limit */
./arch/amd64/include/vmparam.h:101:#define DFLSSIZ32 (2*1024*1024) /* initial stack size limit */
./arch/ia64/include/vmparam.h:57:#define DFLSSIZ (1<<21) /* initial stack size (2M) */
./arch/mvme68k/include/vmparam.h:82:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/i386/include/vmparam.h:74:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/amiga/include/vmparam.h:82:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/sparc/include/vmparam.h:94:#define DFLSSIZ (8*1024*1024) /* initial stack size limit */
./arch/mips/include/vmparam.h:95:#define DFLSSIZ (4*1024*1024) /* initial stack size limit */
./arch/mips/include/vmparam.h:114:#define DFLSSIZ (16*1024*1024) /* initial stack size limit */
./arch/mips/include/vmparam.h:134:#define DFLSSIZ32 DFLTSIZ /* initial stack size limit */
./arch/sh3/include/vmparam.h:69:#define DFLSSIZ (2 * 1024 * 1024)
./arch/mac68k/include/vmparam.h:115:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/next68k/include/vmparam.h:89:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/news68k/include/vmparam.h:82:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/x68k/include/vmparam.h:74:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/cesfic/include/vmparam.h:82:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/usermode/include/vmparam.h:69:#define DFLSSIZ (2 * 1024 * 1024)
./arch/usermode/include/vmparam.h:78:#define DFLSSIZ (4 * 1024 * 1024)
./arch/powerpc/include/oea/vmparam.h:74:#define DFLSSIZ (2*1024*1024) /* default stack size */
./arch/powerpc/include/ibm4xx/vmparam.h:60:#define DFLSSIZ (2*1024*1024) /* default stack size */
./arch/powerpc/include/booke/vmparam.h:75:#define DFLSSIZ (2*1024*1024) /* default stack size */
./arch/vax/include/vmparam.h:74:#define DFLSSIZ (512*1024) /* initial stack size limit */
./arch/sparc64/include/vmparam.h:100:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/sparc64/include/vmparam.h:125:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */
./arch/sparc64/include/vmparam.h:145:#define DFLSSIZ32 (2*1024*1024) /* initial stack size limit */
./arch/atari/include/vmparam.h:81:#define DFLSSIZ (2*1024*1024) /* initial stack size limit */


-- 
greg



Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Poking at NetBSD kernel source it looks like the default ulimit -s
> depends on the architecture and ranges from 512k to 16M. Postgres
> insists on max_stack_depth being STACK_DEPTH_SLOP -- ie 512kB -- less
> than the ulimit setting making it impossible to start up on
> architectures with a default of 512kB without raising the ulimit.

> If we could just lower it to 384kB then Postgres would start up but I
> wonder if we should just use MIN(stack_rlimit/2, STACK_DEPTH_SLOP) so
> that there's always a setting of max_stack_depth that
> would allow Postgres to start.

I'm pretty nervous about reducing that materially without any
investigation into how much of the slop we actually use.  Our assumption
so far has generally been that only recursive routines need to have any
stack depth check; but there are plenty of very deep non-recursive call
paths.  I do not think we're doing people any favors by letting them skip
fooling with "ulimit -s" if the result is that their database crashes
under stress.  For that matter, even if we were sure we'd produce a
"stack too deep" error rather than crashing, that's still not very nice
if it happens on run-of-the-mill queries.
        regards, tom lane



Re: can we optimize STACK_DEPTH_SLOP

From
Robert Haas
Date:
On Tue, Jul 5, 2016 at 11:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> Poking at NetBSD kernel source it looks like the default ulimit -s
>> depends on the architecture and ranges from 512k to 16M. Postgres
>> insists on max_stack_depth being STACK_DEPTH_SLOP -- ie 512kB -- less
>> than the ulimit setting making it impossible to start up on
>> architectures with a default of 512kB without raising the ulimit.
>
>> If we could just lower it to 384kB then Postgres would start up but I
>> wonder if we should just use MIN(stack_rlimit/2, STACK_DEPTH_SLOP) so
>> that there's always a setting of max_stack_depth that
>> would allow Postgres to start.
>
> I'm pretty nervous about reducing that materially without any
> investigation into how much of the slop we actually use.  Our assumption
> so far has generally been that only recursive routines need to have any
> stack depth check; but there are plenty of very deep non-recursive call
> paths.  I do not think we're doing people any favors by letting them skip
> fooling with "ulimit -s" if the result is that their database crashes
> under stress.  For that matter, even if we were sure we'd produce a
> "stack too deep" error rather than crashing, that's still not very nice
> if it happens on run-of-the-mill queries.

To me it seems like using anything based on stack_rlimit/2 is pretty
risky for the reason that you state, but I also think that not being
able to start the database at all on some platforms with small stacks
is bad.  If I had to guess, I'd bet that most functions in the backend
use a few hundred bytes of stack space or less, so that even 100kB of
stack space is enough for hundreds of stack frames.  If we're putting
that kind of depth on the stack without ever checking the stack depth,
we deserve what we get.  That having been said, it wouldn't surprise
me to find that we have functions here and there which put objects
that are many kB in size on the stack, making it much easier to
overrun the available stack space in only a few frames.  It would be
nice if there were a tool that you could run over your binaries and
have it dump out the names of all functions that create large stack
frames, but I don't know of one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 5, 2016 at 11:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm pretty nervous about reducing that materially without any
>> investigation into how much of the slop we actually use.

> To me it seems like using anything based on stack_rlimit/2 is pretty
> risky for the reason that you state, but I also think that not being
> able to start the database at all on some platforms with small stacks
> is bad.

My point was that this is something we should investigate, not just
guess about.

I did some experimentation using the attached quick-kluge patch, which
(1) causes each exiting server process to report its actual ending stack
size, and (2) hacks the STACK_DEPTH_SLOP test so that you can set
max_stack_depth considerably higher than what rlimit(2) claims.
Unfortunately the way I did (1) only works on systems with pmap; I'm not
sure how to make it more portable.

My results on an x86_64 RHEL6 system were pretty interesting:

1. All but two of the regression test scripts have ending stack sizes
of 188K to 196K.  There is one outlier at 296K (most likely the regex
test, though I did not stop to confirm that) and then there's the
errors.sql test, which intentionally provokes a "stack too deep" failure
and will therefore consume approximately max_stack_depth stack if it can.

2. With the RHEL6 default "ulimit -s" setting of 10240kB, you actually
have to increase max_stack_depth to 12275kB before you get a crash in
errors.sql.  At the highest passing value, 12274kB, pmap says we end
with
      1 00007ffc51f6e000  12284K rw---    [ stack ]
which is just shy of 2MB more than the alleged limit.  I conclude that
at least in this kernel version, the kernel doesn't complain until your
stack would be 2MB *more* than the ulimit -s value.

That result also says that at least for that particular test, the
value of STACK_DEPTH_SLOP could be as little as 10K without a crash,
even without this surprising kernel forgiveness.  But of course that
test isn't really pushing the slop factor, since it's only compiling a
trivial expression at each recursion depth.

Given these results I definitely wouldn't have a problem with reducing
STACK_DEPTH_SLOP to 200K, and you could possibly talk me down to less.
On x86_64.  Other architectures might be more stack-hungry, though.
I'm particularly worried about IA64 --- I wonder if anyone can perform
these same experiments on that?

            regards, tom lane

diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index cc36b80..7740120 100644
*** a/src/backend/storage/ipc/ipc.c
--- b/src/backend/storage/ipc/ipc.c
*************** static int    on_proc_exit_index,
*** 98,106 ****
--- 98,113 ----
  void
  proc_exit(int code)
  {
+     char    sysbuf[256];
+
      /* Clean up everything that must be cleaned up */
      proc_exit_prepare(code);

+     /* report stack size to stderr */
+     snprintf(sysbuf, sizeof(sysbuf), "pmap %d | grep stack 1>&2",
+              (int) getpid());
+     system(sysbuf);
+
  #ifdef PROFILE_PID_DIR
      {
          /*
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 7254355..009bec2 100644
*** a/src/include/tcop/tcopprot.h
--- b/src/include/tcop/tcopprot.h
***************
*** 27,33 ****


  /* Required daylight between max_stack_depth and the kernel limit, in bytes */
! #define STACK_DEPTH_SLOP (512 * 1024L)

  extern CommandDest whereToSendOutput;
  extern PGDLLIMPORT const char *debug_query_string;
--- 27,33 ----


  /* Required daylight between max_stack_depth and the kernel limit, in bytes */
! #define STACK_DEPTH_SLOP (-100 * 1024L * 1024L)

  extern CommandDest whereToSendOutput;
  extern PGDLLIMPORT const char *debug_query_string;

Re: can we optimize STACK_DEPTH_SLOP

From
Greg Stark
Date:
On Tue, Jul 5, 2016 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Unfortunately the way I did (1) only works on systems with pmap; I'm not
> sure how to make it more portable.

I did a similar(ish) test which is admittedly not as exhaustive as
using pmap. I instrumented check_stack_depth() itself to keep track of
a high water mark (and based on Robert's thought process) to keep
track of the largest increment over the previous checked stack depth.
This doesn't cover any cases where there's no check_stack_depth() call
in the call stack at all (but then if there's no check_stack_depth
call at all it's hard to see how any setting of STACK_DEPTH_SLOP is
necessarily going to help).
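
A rough sketch of that instrumentation (the names are illustrative, not the
actual patch): a hook called from check_stack_depth() with the depth it just
computed, tracking the high-water mark and the largest jump between two
successive checks.

static long highest_stack_depth = 0;        /* high-water mark, in bytes */
static long largest_stack_increment = 0;
static long previous_checked_depth = 0;

/* Hypothetical hook, invoked from check_stack_depth() with the current depth. */
static void
record_checked_stack_depth(long depth)
{
    long        increment = depth - previous_checked_depth;

    if (depth > highest_stack_depth)
        highest_stack_depth = depth;
    if (increment > largest_stack_increment)
        largest_stack_increment = increment;

    previous_checked_depth = depth;
}

The two counters would then be reported in a log line at disconnection, like
the ones quoted below.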

I see similar results to you. The regexp test shows:
LOG:  disconnection: highest stack depth: 392256 largest stack increment: 35584

And the infinite_recurse() test shows:
STATEMENT:  select infinite_recurse();
LOG:  disconnection: highest stack depth: 2097584 largest stack increment: 1936

There were a couple of other tests with stack increments similar to
the regular expression test:

STATEMENT:  alter table atacc2 add constraint foo check (test>0) no inherit;
LOG:  disconnection: highest stack depth: 39232 largest stack increment: 34224
STATEMENT:  SELECT chr(0);
LOG:  disconnection: highest stack depth: 44144 largest stack increment: 34512

But aside from those two, the next largest increment between two
successive check_stack_depth calls was about 12kB:

STATEMENT:  select array_elem_check(121.00);
LOG:  disconnection: highest stack depth: 24256 largest stack increment: 12896

This was all on x86_64 too.

--
greg

Attachments

Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> I did a similar(ish) test which is admittedly not as exhaustive as
> using pmap. I instrumented check_stack_depth() itself to keep track of
> a high water mark (and based on Robert's thought process) to keep
> track of the largest increment over the previous checked stack depth.
> This doesn't cover any cases where there's no check_stack_depth() call
> in the call stack at all (but then if there's no check_stack_depth
> call at all it's hard to see how any setting of STACK_DEPTH_SLOP is
> necessarily going to help).

Well, the point of STACK_DEPTH_SLOP is that we don't want to have to
put check_stack_depth calls in every function in the backend, especially
not otherwise-inexpensive leaf functions.  So the idea is for the slop
number to cover the worst-case call graph after the last function with a
check.  Your numbers are pretty interesting, in that they clearly prove
we need a slop value of at least 40-50K, but they don't really show that
that's adequate.
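
For readers outside the backend, a small self-contained illustration of that
arrangement (everything here except check_stack_depth() itself is made up):
recursive code checks near its entry, and whatever it calls below that point
runs unchecked, so the slop has to cover the deepest unchecked call graph.

typedef struct ExampleNode
{
    struct ExampleNode *left;
    struct ExampleNode *right;
} ExampleNode;

extern void check_stack_depth(void);   /* provided by the backend; errors out near the limit */

/* Hypothetical leaf helper; imagine a chain of unchecked calls in here. */
static void
unchecked_leaf_work(ExampleNode *node)
{
    (void) node;
}

static void
walk_tree(ExampleNode *node)
{
    if (node == NULL)
        return;

    check_stack_depth();        /* recursive code checks here ... */

    /* ... but nothing below this point is checked, so it must fit in the slop */
    unchecked_leaf_work(node);
    walk_tree(node->left);
    walk_tree(node->right);
}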

I'm a bit disturbed by the fact that you seem to be showing maximum
measured depth for the non-outlier tests as only around 40K-ish.
That doesn't match up very well with my pmap results, since in no
case did I see a physical stack size below 188K.

[ pokes around for a little bit... ]  Oh, this is interesting: it looks
like the *postmaster*'s stack size is 188K, and of course every forked
child is going to inherit that as a minimum stack depth.  What's more,
pmap shows stack sizes near that for all my running postmasters going back
to 8.4.  But 8.3 and before show a stack size of 84K, which seems to be
some sort of minimum on this machine; even a trivial "cat" process has
that stack size according to pmap.

Conclusion: something we did in 8.4 greatly bloated the postmaster's
stack space consumption, to the point that it's significantly more than
anything a normal backend does.  That's surprising and scary, because
it means the postmaster is *more* exposed to stack SIGSEGV than most
backends.  We need to find the cause, IMO.
        regards, tom lane



Re: can we optimize STACK_DEPTH_SLOP

From
Greg Stark
Date:
On Wed, Jul 6, 2016 at 2:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Conclusion: something we did in 8.4 greatly bloated the postmaster's
> stack space consumption, to the point that it's significantly more than
> anything a normal backend does.  That's surprising and scary, because
> it means the postmaster is *more* exposed to stack SIGSEGV than most
> backends.  We need to find the cause, IMO.

Hm. I did something based on your test: I built a .so and started the
postmaster with -c shared_preload_libraries to load it. I tried to
run it on every revision I have built for the historic benchmarks.
That only worked as far back as 8.4.0 -- which makes me suspect that
the growth in stack size is possibly due precisely to
shared_preload_libraries and the dynamic linker....

The only thing it actually revealed was a *drop* of 50kB between
REL9_2_0~1610 and REL9_2_0~1396.


REL8_4_0~1702 188K
REL8_4_0~1603 192K
REL8_4_0~1498 188K
REL8_4_0~1358 192K
REL8_4_0~1218 184K
REL8_4_0~1013 188K
REL8_4_0~996 192K
REL8_4_0~856 192K
REL8_4_0~775 192K
REL8_4_0~567 192K
REL8_4_0~480 188K
REL8_4_0~360 188K
REL8_4_0~151 188K
REL9_0_0~1855 188K
REL9_0_0~1654 188K
REL9_0_0~1538 192K
REL9_0_0~1454 184K
REL9_0_0~1351 184K
REL9_0_0~1249 188K
REL9_0_0~1107 184K
REL9_0_0~938 184K
REL9_0_0~627 184K
REL9_0_0~414 184K
REL9_0_0~202 184K
REL9_1_0~1867 188K
REL9_1_0~1695 184K
REL9_1_0~1511 188K
REL9_1_0~1328 192K
REL9_1_0~978 192K
REL9_1_0~948 188K
REL9_1_0~628 188K
REL9_1_0~382 192K
REL9_2_0~1825 184K
REL9_2_0~1610 192K                                          <--------------- here
REL9_2_0~1396 148K
REL9_2_0~1226 148K
REL9_2_0~1190 148K
REL9_2_0~1072 140K
REL9_2_0~1071 144K
REL9_2_0~984 144K
REL9_2_0~777 144K
REL9_2_0~767 148K
REL9_2_0~551 148K
REL9_2_0~309 144K
REL9_3_0~1509 148K
REL9_3_0~1304 148K
REL9_3_0~1099 144K
REL9_3_0~1030 144K
REL9_3_0~944 140K
REL9_3_0~789 144K
REL9_3_0~735 148K
REL9_3_0~589 144K
REL9_3_0~390 148K
REL9_3_0~223 144K
REL9_4_0~1923 148K
REL9_4_0~1894 148K
REL9_4_0~1755 144K
REL9_4_0~1688 144K
REL9_4_0~1617 144K
REL9_4_0~1431 144K
REL9_4_0~1246 144K
REL9_4_0~1142 148K
REL9_4_0~995 148K
REL9_4_0~744 140K
REL9_4_0~462 148K
REL9_5_0~2370 148K
REL8_4_22 192K
REL9_5_0~2183 148K
REL9_5_0~1996 148K
REL9_5_0~1782 144K
REL9_5_0~1569 148K
REL9_5_0~1557 144K
REL9_5_ALPHA1-20-g7b156c1 144K
REL9_5_ALPHA1-299-g47ebbdc 144K
REL9_5_ALPHA1-489-ge06b2e1 144K
REL9_0_23 188K
REL9_1_19 192K
REL9_2_14 144K
REL9_3_10 148K
REL9_4_5 148K
REL9_5_ALPHA1-683-ge073490 144K
REL9_5_ALPHA1-844-gdfcd9cb 148K
REL9_5_0 148K
REL9_5_ALPHA1-972-g7dc09c1 144K
REL9_5_ALPHA1-1114-g57a6a72 148K



-- 
greg



Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Ok, I managed to get __attribute__((destructor)) working and captured
> the attached pmap output for all the revisions. You can see the git
> revision in the binary name along with a putative date, though in the
> case of branches the date can be deceptive. It looks to me like REL8_4
> is already bloated by REL8_4_0~2268 (which means 2268 commits *before*
> the REL8_4_0 tag -- i.e. soon after it branched).

I traced through this by dint of inserting a lot of system("pmap") calls,
and what I found out is that it's the act of setting one of the timezone
variables that does it.  This is because tzload() allocates a local
variable "union local_storage ls", which sounds harmless enough, but
in point of fact the darn thing is 78K!  And to add insult to injury,
with my setting (US/Eastern) there is a recursive call to parse the
"posixrules" timezone file.  So that's 150K worth of stack right
there, although possibly it's only half that for some zone settings.
(And if you use "GMT" you escape all of this, since that's hard coded.)

So now I understand why the IANA code has provisions for malloc'ing
that storage rather than just using the stack.  We should do likewise.
        regards, tom lane
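
A minimal illustration of the fix direction being suggested, with stand-in
types and names rather than the actual IANA or PostgreSQL code: move the
large scratch union off the stack and onto the heap.

#include <stdlib.h>

/* Stand-in for tzload()'s large scratch area; the size here is illustrative. */
union local_storage
{
    char        filebuf[64 * 1024];
    char        namebuf[16 * 1024];
};

static int
load_zone_file(const char *name)
{
    /* Heap-allocate the scratch space instead of declaring ~78K of locals. */
    union local_storage *lsp = malloc(sizeof(union local_storage));

    if (lsp == NULL)
        return -1;              /* out of memory */

    (void) name;                /* real code would open and parse this file into lsp->filebuf */

    free(lsp);
    return 0;
}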



Re: can we optimize STACK_DEPTH_SLOP

From
Greg Stark
Date:
Ok, I managed to get __attribute__((destructor)) working and captured
the attached pmap output for all the revisions. You can see the git
revision in the binary name along with a putative date, though in the
case of branches the date can be deceptive. It looks to me like REL8_4
is already bloated by REL8_4_0~2268 (which means 2268 commits *before*
the REL8_4_0 tag -- i.e. soon after it branched).

I can't really make heads or tails of this. I don't see any commits in
the early days of 8.4 that could change the stack depth in the
postmaster.

Attachments

Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
I found out that pmap can give much more fine-grained results than I was
getting before, if you give it the -x flag and then pay attention to the
"dirty" column rather than the "nominal size" column.  That gives a
reliable indication of how much stack space the process ever actually
touched, with resolution apparently 4KB on my machine.
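
For reference, a variant of the earlier proc_exit() hack adapted to this,
assuming a procps-style pmap that accepts -x (illustrative only, not part of
any posted patch):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    char        sysbuf[256];

    /* Like the quick-kluge patch upthread, but "pmap -x" adds the "Dirty" column. */
    snprintf(sysbuf, sizeof(sysbuf), "pmap -x %d | grep stack 1>&2",
             (int) getpid());
    system(sysbuf);
    return 0;
}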

I redid my measurements with commit 62c8421e8 applied, and now get results
like this for one run of the standard regression tests:

$ grep '\[ stack \]' postmaster.log | sort -k 4n | uniq -c
    137 00007fff0f615000      84      36      36 rw---    [ stack ]
     21 00007fff0f615000      84      40      40 rw---    [ stack ]
      4 00007fff0f615000      84      44      44 rw---    [ stack ]
     20 00007fff0f615000      84      48      48 rw---    [ stack ]
      8 00007fff0f615000      84      52      52 rw---    [ stack ]
      2 00007fff0f615000      84      56      56 rw---    [ stack ]
     10 00007fff0f615000      84      60      60 rw---    [ stack ]
      3 00007fff0f615000      84      64      64 rw---    [ stack ]
      3 00007fff0f615000      84      68      68 rw---    [ stack ]
      2 00007fff0f615000      84      72      72 rw---    [ stack ]
      1 00007fff0f612000      96      76      76 rw---    [ stack ]
      2 00007fff0f60e000     112     112     112 rw---    [ stack ]
      1 00007fff0f5e0000     296     296     296 rw---    [ stack ]
      1 00007fff0f427000    2060    2060    2060 rw---    [ stack ]

The rightmost numeric column is the "dirty KB in region" column, and 36KB
is the floor established by the postmaster.  (It looks like selecting
timezone is still the largest stack-space hog in that, but it's no longer
enough to make me want to do something about it.)  So now we're seeing
some cases that exceed that floor, which is good.  regex and errors are
still the outliers, as expected.

Also, I found that on OS X "vmmap -dirty" could produce results comparable
to pmap, so here are the numbers for the same test case on current OS X:
    154 Stack                             8192K      36K        2
      5 Stack                             8192K      40K        2
     11 Stack                             8192K      44K        2
      6 Stack                             8192K      48K        2
     11 Stack                             8192K      52K        2
      7 Stack                             8192K      56K        2
      8 Stack                             8192K      60K        2
      2 Stack                             8192K      64K        2
      2 Stack                             8192K      68K        2
      4 Stack                             8192K      72K        2
      1 Stack                             8192K      76K        2
      2 Stack                             8192K     108K        2
      1 Stack                             8192K     384K        2
      1 Stack                             8192K    2056K        2

(The "virtual" stack size seems to always be the same as ulimit -s,
ie 8MB by default, on this platform.)  This is good confirmation
that the actual stack consumption is pretty stable across different
compilers, though it looks like OS X's version of clang is a bit
more stack-wasteful for the regex recursion.

Based on these numbers, I'd have no fear of reducing STACK_DEPTH_SLOP
to 256KB on x86_64.  It would sure be good to check things on some
other architectures, though ...
        regards, tom lane



Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
I wrote:
> Based on these numbers, I'd have no fear of reducing STACK_DEPTH_SLOP
> to 256KB on x86_64.  It would sure be good to check things on some
> other architectures, though ...

I went to the work of doing the same test on a PPC Mac:
    182 Stack                   [   8192K/     40K]
     25 Stack                   [   8192K/     48K]
      2 Stack                   [   8192K/     56K]
     11 Stack                   [   8192K/     60K]
      5 Stack                   [   8192K/     64K]
      2 Stack                   [   8192K/    108K]
      1 Stack                   [   8192K/    576K]
      1 Stack                   [   8192K/   2056K]

The last number here is "resident pages", not "dirty pages", because
this older version of OS X doesn't provide the latter.  Still, the
numbers seem to track pretty well with the ones I got on x86_64.
Which is a bit odd when you think about it: a 32-bit machine ought
to consume less stack space because pointers are narrower.

Also on my old HPPA dinosaur:
     40  addr 0x7b03a000, length 8, physical pages 8, type STACK
    166  addr 0x7b03a000, length 10, physical pages 9, type STACK
     26  addr 0x7b03a000, length 12, physical pages 11, type STACK
     16  addr 0x7b03a000, length 14, physical pages 13, type STACK
      1  addr 0x7b03a000, length 15, physical pages 13, type STACK
      1  addr 0x7b03a000, length 16, physical pages 15, type STACK
      2  addr 0x7b03a000, length 28, physical pages 27, type STACK
      1  addr 0x7b03a000, length 190, physical pages 190, type STACK
      1  addr 0x7b03a000, length 514, physical pages 514, type STACK

As best I can tell, "length" is the nominal virtual space for the stack,
and "physical pages" is the actually allocated/resident space, both
measured in 4K pages.  So that again matches pretty well, although the
stack-efficiency of the recursive regex functions seems to get worse with
each new case I look at.

However ... the thread here
https://www.postgresql.org/message-id/flat/21563.1289064886%40sss.pgh.pa.us
says that depending on your choice of compiler and optimization level,
IA64 can be 4x to 5x worse for stack space than x86_64, even after
spotting it double the memory allocation to handle its two separate
stacks.  I don't currently have access to an IA64 machine to check.

Based on what I'm seeing so far, really 100K ought to be more than plenty
of slop for most architectures, but I'm afraid to go there for IA64.

Also, there might be some more places like tzload() that are putting
unreasonably large variables on the stack, but that the regression tests
don't exercise (I've not tested anything replication-related, for
example).

Bottom line: I propose that we keep STACK_DEPTH_SLOP at 512K for IA64
but reduce it to 256K for everything else.
        regards, tom lane



Re: can we optimize STACK_DEPTH_SLOP

From
Greg Stark
Date:
On Fri, Jul 8, 2016 at 4:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Based on what I'm seeing so far, really 100K ought to be more than plenty
> of slop for most architectures, but I'm afraid to go there for IA64.

Searching for info on ia64 turned up this interesting thread:

https://www.postgresql.org/message-id/21563.1289064886%40sss.pgh.pa.us

From that discussion it seems we should probably run these tests with
-O0 because the stack usage can be substantially higher without
optimizations. And it doesn't sound like ia64 uses much more *normal*
stack, just that there's the additional register stack.

It might not be unreasonable to commit the pmap hack, gather the data
from the build farm then later add an #ifdef around it. (or just make
it #ifdef USE_ASSERTIONS which I assume most build farm members are
running with anyways).

Alternatively it wouldn't be very hard to use mincore(2) to implement
it natively. I believe mincore is nonstandard but present in Linux and
BSD.


-- 
greg



Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Searching for info on ia64 turned up this interesting thread:
> https://www.postgresql.org/message-id/21563.1289064886%40sss.pgh.pa.us

Yeah, that's the same one I referenced upthread ;-)

> From that discussion it seems we should probably run these tests with
> -O0 because the stack usage can be substantially higher without
> optimizations. And it doesn't sound like ia64 uses much more *normal*
> stack, just that there's the additional register stack.

> It might not be unreasonable to commit the pmap hack, gather the data
> from the build farm then later add an #ifdef around it. (or just make
> it #ifdef USE_ASSERTIONS which I assume most build farm members are
> running with anyways).

Hmm.  The two IA64 critters in the farm are running HPUX, which means
they likely don't have pmap.  But I could clean up the hack I used to
gather stack size data on gaur's host and commit it temporarily.
On non-HPUX platforms we could just try system("pmap -x") and see what
happens; as long as we're ignoring the result it shouldn't cause anything
really bad.

I was going to object that this would probably not tell us anything
about the worst-case IA64 stack usage, but I see that neither of those
critters are selecting any optimization, so actually it would.

So, agreed, let's commit some temporary debug code and see what the
buildfarm can teach us.  Will go work on that in a bit.

> Alternatively it wouldn't be very hard to use mincore(2) to implement
> it natively. I believe mincore is nonstandard but present in Linux and
> BSD.

Hm, after reading the man page I don't quite see how that would help?
You'd have to already know the mapped stack address range in order to
call the function without getting ENOMEM.
        regards, tom lane



Re: can we optimize STACK_DEPTH_SLOP

From
Greg Stark
Date:
On Fri, Jul 8, 2016 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, after reading the man page I don't quite see how that would help?
> You'd have to already know the mapped stack address range in order to
> call the function without getting ENOMEM.


I had assumed unmapped pages would just return a 0 in the bitmap. I
suppose you could still do it by just probing one page at a time until
you find an unmapped page. In a way that's better since you can count
stack pages even if they're paged out.
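
A rough, Linux-oriented sketch of that probing idea (not proposed PostgreSQL
code; the function name is made up). mincore() fails with ENOMEM for unmapped
addresses, so walking downward one page at a time from a local variable finds
how far the stack mapping has ever grown, whether or not those pages are
currently resident; a downward-growing stack is assumed.

#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t
mapped_stack_below_here(void)
{
    long        pagesize = sysconf(_SC_PAGESIZE);
    volatile char marker;       /* lives on the stack */
    uintptr_t   page = ((uintptr_t) &marker) & ~((uintptr_t) (pagesize - 1));
    unsigned char vec;
    size_t      npages = 0;

    /* Walk down one page at a time until mincore() says "not mapped". */
    while (mincore((void *) page, (size_t) pagesize, &vec) == 0)
    {
        npages++;
        page -= (uintptr_t) pagesize;
    }
    return npages * (size_t) pagesize;
}

int
main(void)
{
    /* Called near the top of the stack, this approximates total stack growth. */
    printf("stack mapping extends ~%zu bytes below main()\n",
           mapped_stack_below_here());
    return 0;
}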


Fwiw here's the pmap info from burbot (Linux Sparc64):
136      48      48 rw---    [ stack ]
136      48      48 rw---    [ stack ]
136      48      48 rw---    [ stack ]
136      48      48 rw---    [ stack ]
136      56      56 rw---    [ stack ]
136      80      80 rw---    [ stack ]
136      96      96 rw---    [ stack ]
136     112     112 rw---    [ stack ]
136     112     112 rw---    [ stack ]
576     576     576 rw---    [ stack ]
2056    2056    2056 rw---    [ stack ]

I'm actually a bit confused how to interpret these numbers. This
appears to be an 8kB pagesize architecture so is that 576*8kB or over
5MB of stack for the regexp test? But we don't know if there are any
check_stack_depth calls in that call tree?

-- 
greg



Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Fwiw here's the pmap info from burbot (Linux Sparc64):
> 136      48      48 rw---    [ stack ]
> 136      48      48 rw---    [ stack ]
> 136      48      48 rw---    [ stack ]
> 136      48      48 rw---    [ stack ]
> 136      56      56 rw---    [ stack ]
> 136      80      80 rw---    [ stack ]
> 136      96      96 rw---    [ stack ]
> 136     112     112 rw---    [ stack ]
> 136     112     112 rw---    [ stack ]
> 576     576     576 rw---    [ stack ]
> 2056    2056    2056 rw---    [ stack ]

> I'm actually a bit confused how to interpret these numbers. This
> appears to be an 8kB pagesize architecture so is that 576*8kB or over
> 5MB of stack for the regexp test?

No, pmap specifies that its outputs are measured in kilobytes.  So this
is by and large the same as what I'm seeing on x86_64, again with the
caveat that the recursive regex routines seem to vary all over the place
in terms of stack consumption.

> But we don't know if there are any
> check_stack_depth calls in that call tree?

The regex recursion definitely does have check_stack_depth calls in it
(since commit b63fc2877).  But what we're trying to measure here is the
worst-case stack depth regardless of any check_stack_depth calls.  That's
a ceiling on what we might need to set STACK_DEPTH_SLOP to --- probably a
very loose ceiling, but I don't want to err on the side of underestimating
it.  I wouldn't consider either the regex or errors tests as needing to
bound STACK_DEPTH_SLOP, since we know that most of their consumption is
from recursive code that contains check_stack_depth calls.  But it's
useful to look at those depths just as a sanity check that we're getting
valid numbers.
        regards, tom lane



Re: can we optimize STACK_DEPTH_SLOP

From
Tom Lane
Date:
I wrote:
> So, agreed, let's commit some temporary debug code and see what the
> buildfarm can teach us.  Will go work on that in a bit.

After reviewing the buildfarm results, I'm feeling nervous about this
whole idea again.  For the most part, the unaccounted-for daylight between
the maximum stack depth measured by check_stack_depth and the actually
dirtied stack space reported by pmap is under 100K.  But there are a
pretty fair number of exceptions.  The worst cases I found were on
"dunlin", which approached 200K extra space in a couple of places:
 dunlin        | 2016-07-09 22:05:09 | check.log           | 00007ffff2667000     268     208     208 rw---   [ stack ]
 dunlin        | 2016-07-09 22:05:09 | check.log           | max measured stack depth 14kB
 dunlin        | 2016-07-09 22:05:09 | install-check-C.log | 00007fffee650000     268     200     200 rw---   [ stack ]
 dunlin        | 2016-07-09 22:05:09 | install-check-C.log | max measured stack depth 14kB

This appears to be happening in the tsdicts test script.  Other machines
also show a significant discrepancy between pmap and check_stack_depth
results for that test, which suggests that maybe the tsearch code is being
overly reliant on large local variables.  But I haven't dug through it.

Another area of concern is PLs.  For instance, on capybara, a machine
otherwise pretty unexceptional in stack-space appetite, quite a few of the
PL tests ate ~100K of unaccounted-for space:
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbe000     132     104     104 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbe000     132       0       0 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | max measured stack depth 8kB
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbd000     136     136     136 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbd000     136       0       0 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | max measured stack depth 0kB
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbe000     132     104     104 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbe000     132       0       0 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | max measured stack depth 5kB
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbe000     132     116     116 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | 00007ffc61bbe000     132       0       0 rw---    [ stack ]
 capybara      | 2016-07-09 21:15:56 | pl-install-check-C.log | max measured stack depth 7kB

Presumably that reflects some oddity of the local version of perl or
python, but I have no idea what.

So while we could possibly get away with reducing STACK_DEPTH_SLOP
to 256K, there is good reason to think that that would be leaving
little or no safety margin.

At this point I'm inclined to think we should leave well enough alone.
At the very least, if we were to try to reduce that number, I'd want
to have some plan for tracking our stack space consumption better than
we have done in the past.
        regards, tom lane


PS: for amusement's sake, here are some numbers I extracted concerning
the relative stack-hungriness of different buildfarm members.  First,
the number of recursion levels each machine could accomplish before
hitting "stack too deep" in the errors.sql regression test (measured by
counting the number of CONTEXT lines in the relevant error message):
   sysname    |      snapshot       | count 
---------------+---------------------+-------
 protosciurus  | 2016-07-10 12:03:06 |   731
 chub          | 2016-07-10 15:10:01 |  1033
 quokka        | 2016-07-10 02:17:31 |  1033
 hornet        | 2016-07-09 23:42:32 |  1156
 clam          | 2016-07-09 22:00:01 |  1265
 anole         | 2016-07-09 22:41:40 |  1413
 spoonbill     | 2016-07-09 23:00:05 |  1535
 sungazer      | 2016-07-09 23:51:33 |  1618
 gaur          | 2016-07-09 04:53:13 |  1634
 kouprey       | 2016-07-10 04:58:00 |  1653
 nudibranch    | 2016-07-10 09:18:10 |  1664
 grouse        | 2016-07-10 08:43:02 |  1708
 sprat         | 2016-07-10 08:43:55 |  1717
 pademelon     | 2016-07-09 06:12:10 |  1814
 mandrill      | 2016-07-10 00:10:02 |  2093
 gharial       | 2016-07-10 01:15:50 |  2248
 francolin     | 2016-07-10 13:00:01 |  2379
 piculet       | 2016-07-10 13:00:01 |  2379
 lorikeet      | 2016-07-10 08:04:19 |  2422
 caecilian     | 2016-07-09 19:31:50 |  2423
 jacana        | 2016-07-09 22:36:38 |  2515
 bowerbird     | 2016-07-10 02:13:47 |  2617
 locust        | 2016-07-09 21:50:26 |  2838
 prairiedog    | 2016-07-09 22:44:58 |  2838
 dromedary     | 2016-07-09 20:48:06 |  2840
 damselfly     | 2016-07-10 10:27:09 |  2880
 curculio      | 2016-07-09 21:30:01 |  2905
 mylodon       | 2016-07-09 20:50:01 |  2974
 tern          | 2016-07-09 23:51:23 |  3015
 burbot        | 2016-07-10 03:30:45 |  3042
 magpie        | 2016-07-09 21:38:02 |  3043
 reindeer      | 2016-07-10 04:00:05 |  3043
 friarbird     | 2016-07-10 04:20:01 |  3187
 nightjar      | 2016-07-09 21:17:52 |  3187
 sittella      | 2016-07-09 21:46:29 |  3188
 crake         | 2016-07-09 22:06:09 |  3267
 guaibasaurus  | 2016-07-10 00:17:01 |  3267
 ibex          | 2016-07-09 20:59:06 |  3267
 mule          | 2016-07-09 23:30:02 |  3267
 spurfowl      | 2016-07-09 21:06:39 |  3267
 anchovy       | 2016-07-09 21:41:04 |  3268
 blesbok       | 2016-07-09 21:17:46 |  3268
 capybara      | 2016-07-09 21:15:56 |  3268
 conchuela     | 2016-07-09 21:00:01 |  3268
 handfish      | 2016-07-09 04:37:57 |  3268
 macaque       | 2016-07-08 21:25:06 |  3268
 minisauripus  | 2016-07-10 03:19:42 |  3268
 rhinoceros    | 2016-07-09 21:45:01 |  3268
 sidewinder    | 2016-07-09 21:45:00 |  3272
 jaguarundi    | 2016-07-10 06:52:05 |  3355
 loach         | 2016-07-09 21:15:00 |  3355
 okapi         | 2016-07-10 06:15:02 |  3425
 fulmar        | 2016-07-09 23:47:57 |  3436
 longfin       | 2016-07-09 21:10:17 |  3444
 brolga        | 2016-07-10 09:40:46 |  3537
 dunlin        | 2016-07-09 22:05:09 |  3616
 coypu         | 2016-07-09 22:20:46 |  3626
 hyrax         | 2016-07-09 19:52:03 |  3635
 treepie       | 2016-07-09 22:41:37 |  3635
 frogmouth     | 2016-07-10 02:00:09 |  3636
 narwhal       | 2016-07-10 10:00:05 |  3966
 rover_firefly | 2016-07-10 15:01:45 |  4084
 lapwing       | 2016-07-09 21:15:01 |  4085
 cockatiel     | 2016-07-10 13:40:47 |  4362
 currawong     | 2016-07-10 05:16:03 |  5136
 mastodon      | 2016-07-10 11:00:01 |  5136
 termite       | 2016-07-09 21:01:30 |  5452
 hamster       | 2016-07-09 16:00:06 |  5685
 dangomushi    | 2016-07-09 18:00:27 |  5692
 gull          | 2016-07-10 04:48:28 |  5692
 mereswine     | 2016-07-10 10:40:57 |  5810
 axolotl       | 2016-07-09 22:12:12 |  5811
 chipmunk      | 2016-07-10 08:18:07 |  5949
 grison        | 2016-07-09 21:00:02 |  5949
(74 rows)

(coypu gets a gold star for this one, since it makes a good showing
despite having max_stack_depth set to 1536kB --- everyone else seems
to be using 2MB.)

Second, the stack space consumed for the regex regression test --- here,
smaller is better:
 currawong     | 2016-07-10 05:16:03 | max measured stack depth 213kB
 mastodon      | 2016-07-10 11:00:01 | max measured stack depth 213kB
 axolotl       | 2016-07-09 22:12:12 | max measured stack depth 240kB
 hamster       | 2016-07-09 16:00:06 | max measured stack depth 240kB
 mereswine     | 2016-07-10 10:40:57 | max measured stack depth 240kB
 brolga        | 2016-07-10 09:40:46 | max measured stack depth 284kB
 narwhal       | 2016-07-10 10:00:05 | max measured stack depth 284kB
 cockatiel     | 2016-07-10 13:40:47 | max measured stack depth 285kB
 francolin     | 2016-07-10 13:00:01 | max measured stack depth 285kB
 hyrax         | 2016-07-09 19:52:03 | max measured stack depth 285kB
 magpie        | 2016-07-09 21:38:02 | max measured stack depth 285kB
 piculet       | 2016-07-10 13:00:01 | max measured stack depth 285kB
 reindeer      | 2016-07-10 04:00:05 | max measured stack depth 285kB
 treepie       | 2016-07-09 22:41:37 | max measured stack depth 285kB
 lapwing       | 2016-07-09 21:15:01 | max measured stack depth 287kB
 rover_firefly | 2016-07-10 15:01:45 | max measured stack depth 287kB
 coypu         | 2016-07-09 22:20:46 | max measured stack depth 288kB
 friarbird     | 2016-07-10 04:20:01 | max measured stack depth 289kB
 nightjar      | 2016-07-09 21:17:52 | max measured stack depth 289kB
 gharial       | 2016-07-10 01:15:50 | max measured stack depths 290kB, 384kB
 bowerbird     | 2016-07-10 02:13:47 | max measured stack depth 378kB
 caecilian     | 2016-07-09 19:31:50 | max measured stack depth 378kB
 frogmouth     | 2016-07-10 02:00:09 | max measured stack depth 378kB
 mylodon       | 2016-07-09 20:50:01 | max measured stack depth 378kB
 jaguarundi    | 2016-07-10 06:52:05 | max measured stack depth 379kB
 loach         | 2016-07-09 21:15:00 | max measured stack depth 379kB
 longfin       | 2016-07-09 21:10:17 | max measured stack depth 379kB
 sidewinder    | 2016-07-09 21:45:00 | max measured stack depth 379kB
 anchovy       | 2016-07-09 21:41:04 | max measured stack depth 381kB
 blesbok       | 2016-07-09 21:17:46 | max measured stack depth 381kB
 capybara      | 2016-07-09 21:15:56 | max measured stack depth 381kB
 conchuela     | 2016-07-09 21:00:01 | max measured stack depth 381kB
 crake         | 2016-07-09 22:06:09 | max measured stack depth 381kB
 curculio      | 2016-07-09 21:30:01 | max measured stack depth 381kB
 guaibasaurus  | 2016-07-10 00:17:01 | max measured stack depth 381kB
 handfish      | 2016-07-09 04:37:57 | max measured stack depth 381kB
 ibex          | 2016-07-09 20:59:06 | max measured stack depth 381kB
 macaque       | 2016-07-08 21:25:06 | max measured stack depth 381kB
 minisauripus  | 2016-07-10 03:19:42 | max measured stack depth 381kB
 mule          | 2016-07-09 23:30:02 | max measured stack depth 381kB
 rhinoceros    | 2016-07-09 21:45:01 | max measured stack depth 381kB
 sittella      | 2016-07-09 21:46:29 | max measured stack depth 381kB
 spurfowl      | 2016-07-09 21:06:39 | max measured stack depth 381kB
 dromedary     | 2016-07-09 20:48:06 | max measured stack depth 382kB
 pademelon     | 2016-07-09 06:12:10 | max measured stack depth 382kB
 fulmar        | 2016-07-09 23:47:57 | max measured stack depth 383kB
 dunlin        | 2016-07-09 22:05:09 | max measured stack depth 388kB
 okapi         | 2016-07-10 06:15:02 | max measured stack depth 389kB
 mandrill      | 2016-07-10 00:10:02 | max measured stack depth 489kB
 tern          | 2016-07-09 23:51:23 | max measured stack depth 491kB
 damselfly     | 2016-07-10 10:27:09 | max measured stack depth 492kB
 burbot        | 2016-07-10 03:30:45 | max measured stack depth 567kB
 locust        | 2016-07-09 21:50:26 | max measured stack depth 571kB
 prairiedog    | 2016-07-09 22:44:58 | max measured stack depth 571kB
 clam          | 2016-07-09 22:00:01 | max measured stack depth 573kB
 jacana        | 2016-07-09 22:36:38 | max measured stack depth 661kB
 lorikeet      | 2016-07-10 08:04:19 | max measured stack depth 662kB
 gaur          | 2016-07-09 04:53:13 | max measured stack depth 756kB
 chub          | 2016-07-10 15:10:01 | max measured stack depth 856kB
 quokka        | 2016-07-10 02:17:31 | max measured stack depth 856kB
 hornet        | 2016-07-09 23:42:32 | max measured stack depth 868kB
 grouse        | 2016-07-10 08:43:02 | max measured stack depth 944kB
 kouprey       | 2016-07-10 04:58:00 | max measured stack depth 944kB
 nudibranch    | 2016-07-10 09:18:10 | max measured stack depth 945kB
 sprat         | 2016-07-10 08:43:55 | max measured stack depth 946kB
 sungazer      | 2016-07-09 23:51:33 | max measured stack depth 963kB
 protosciurus  | 2016-07-10 12:03:06 | max measured stack depth 1432kB

The second list omits a couple of machines whose reports got garbled
by concurrent insertions into the log file.