Re: Re: [PATCH] unified frontend support for pg_malloc et al and palloc/pfree mulation (was xlogreader-v4)

Andres Freund <andres@2ndquadrant.com> writes:
> Well, I *did* benchmark it as noted elsewhere in the thread, but that's
> obviously just one machine (E5520 x 2) with one rather restricted workload
> (pgbench -S -jc 40 -T60). At least it's rather palloc heavy.

> Here are the numbers:

> before:
> #101646.763208 101350.361595 101421.425668 101571.211688 101862.172051 101449.857665
> after:
> #101553.596257 102132.277795 101528.816229 101733.541792 101438.531618 101673.400992

> So on my system, if there is a difference, it's positive (0.12%).
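
The 0.12% figure can be reproduced from the six runs quoted above. The following is a minimal sketch in Python, not part of the original benchmark. The variable names are made up for illustration, and the standard deviation is included only as a rough indication of run-to-run noise.

```python
from statistics import mean, stdev

# pgbench -S transactions per second, copied from the six runs quoted above.
before = [101646.763208, 101350.361595, 101421.425668,
          101571.211688, 101862.172051, 101449.857665]
after = [101553.596257, 102132.277795, 101528.816229,
         101733.541792, 101438.531618, 101673.400992]

mean_before = mean(before)
mean_after = mean(after)

# Relative change of the mean throughput, in percent.
change_pct = (mean_after - mean_before) / mean_before * 100.0

# Sample standard deviations give a rough idea of run-to-run noise.
noise_before = stdev(before)
noise_after = stdev(after)

print(f"mean before: {mean_before:.1f} tps (stdev {noise_before:.1f})")
print(f"mean after:  {mean_after:.1f} tps (stdev {noise_after:.1f})")
print(f"change:      {change_pct:+.2f}%")
```

Comparing the gap between the two means with the printed standard deviations is a quick way to judge whether a difference this small is distinguishable from noise.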

pgbench-based testing doesn't fill me with a lot of confidence for this
--- its numbers contain a lot of communication overhead, not to mention
that pgbench itself can be a bottleneck.  It struck me that we have a
recent test case that's known to be really palloc-intensive, namely
Pavel's example here:
http://www.postgresql.org/message-id/CAFj8pRCKfoz6L82PovLXNK-1JL=jzjwaT8e2BD2PwNKm7i7KVg@mail.gmail.com

I set up a non-cassert build of commit
78a5e738e97b4dda89e1bfea60675bcf15f25994 (ie, just before the patch that
reduced the data-copying overhead for that).  On my Fedora 16 machine
(dual 2.0GHz Xeon E5503, gcc version 4.6.3 20120306 (Red Hat 4.6.3-2))
I get a runtime for Pavel's example of 17023 msec (average over five
runs).  I then applied oprofile and got a breakdown like this:
 samples|      %|
------------------
  108409 84.5083 /home/tgl/testversion/bin/postgres
   13723 10.6975 /lib64/libc-2.14.90.so
    3153  2.4579 /home/tgl/testversion/lib/postgresql/plpgsql.so

samples  %        symbol name
10960    10.1495  AllocSetAlloc
6325      5.8572  MemoryContextAllocZeroAligned
6225      5.7646  base_yyparse
3765      3.4866  copyObject
2511      2.3253  MemoryContextAlloc
2292      2.1225  grouping_planner
2044      1.8928  SearchCatCache
1956      1.8113  core_yylex
1763      1.6326  expression_tree_walker
1347      1.2474  MemoryContextCreate
1340      1.2409  check_stack_depth
1276      1.1816  GetCachedPlan
1175      1.0881  AllocSetFree
1106      1.0242  GetSnapshotData
1106      1.0242  _SPI_execute_plan
1101      1.0196  extract_query_dependencies_walker

I then applied the palloc.h and mcxt.c hunks of your patch and rebuilt.
Now I get an average runtime of 16666 ms, a full 2% faster, which is a
bit astonishing, particularly because the oprofile results haven't moved
much:
  107642 83.7427 /home/tgl/testversion/bin/postgres
   14677 11.4183 /lib64/libc-2.14.90.so
    3180  2.4740 /home/tgl/testversion/lib/postgresql/plpgsql.so

samples  %        symbol name
10038     9.3537  AllocSetAlloc
6392      5.9562  MemoryContextAllocZeroAligned
5763      5.3701  base_yyparse
4810      4.4821  copyObject
2268      2.1134  grouping_planner
2178      2.0295  core_yylex
1963      1.8292  palloc
1867      1.7397  SearchCatCache
1835      1.7099  expression_tree_walker
1551      1.4453  check_stack_depth
1374      1.2803  _SPI_execute_plan
1282      1.1946  MemoryContextCreate
1187      1.1061  AllocSetFree
...
653       0.6085  palloc0
...
552       0.5144  MemoryContextAlloc

The number of calls of AllocSetAlloc certainly hasn't changed at all, so
how did that get faster?
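
To put the two measurements side by side, here is a small Python sketch (an illustration, not part of the original test) that computes the runtime change and the per-symbol change in oprofile samples. It uses only numbers quoted in this message, and the dictionary and variable names are hypothetical. Sample counts are statistical and the total sample counts of the two runs differ slightly, so small deltas should not be over-interpreted.

```python
# Average runtimes of the test case in msec, as reported in this message.
runtime_before_ms = 17023
runtime_after_ms = 16666

# oprofile sample counts for a few symbols that appear in both profiles.
samples_before = {
    "AllocSetAlloc": 10960,
    "MemoryContextAllocZeroAligned": 6325,
    "base_yyparse": 6225,
    "copyObject": 3765,
    "MemoryContextAlloc": 2511,
    "AllocSetFree": 1175,
}
samples_after = {
    "AllocSetAlloc": 10038,
    "MemoryContextAllocZeroAligned": 6392,
    "base_yyparse": 5763,
    "copyObject": 4810,
    "MemoryContextAlloc": 552,
    "AllocSetFree": 1187,
}

# Relative runtime reduction, in percent.
speedup_pct = (runtime_before_ms - runtime_after_ms) / runtime_before_ms * 100.0
print(f"runtime: {runtime_before_ms} -> {runtime_after_ms} ms ({speedup_pct:.1f}% faster)")

# Absolute and relative change in sample count for each symbol.
deltas = {}
for symbol, old in samples_before.items():
    new = samples_after[symbol]
    deltas[symbol] = new - old
    print(f"{symbol:32s} {old:6d} -> {new:6d} ({(new - old) / old * 100.0:+.1f}%)")
```

The drop in MemoryContextAlloc samples is expected, since those calls now show up under palloc and palloc0 instead. The change in AllocSetAlloc is the part that the call counts alone do not explain.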

I notice that the postgres executable is about 0.2% smaller, presumably
because a whole lot of inlined fetches of CurrentMemoryContext are gone.
This makes me wonder if my result is due to chance improvements of cache
line alignment for inner loops.

I would like to know if other people get comparable results on other
hardware (non-Intel hardware would be especially interesting).  If this
result holds up across a range of platforms, I'll withdraw my objection
to making palloc a plain function.
        regards, tom lane


