Re: [HACKERS] Improve catcache/syscache performance.

From Andres Freund
Subject Re: [HACKERS] Improve catcache/syscache performance.
Date
Msg-id 20170922061545.wkbzt7b6p6x47bzg@alap3.anarazel.de
In reply to [HACKERS] Improve catcache/syscache performance.  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] Improve catcache/syscache performance.  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Improve catcache/syscache performance.  (tushar <tushar.ahuja@enterprisedb.com>)
List pgsql-hackers
Hi,

On 2017-09-13 23:12:07 -0700, Andres Freund wrote:
> Attached is a patch that tries to improve sys/catcache performance,
> going further than the patch referenced earlier.

Here's a variant that cleans up the previous changes a bit and adds
some further improvements.

Here's the main commit message:

  Improve sys/catcache performance.

  The following are the individual improvements:
  1) Avoidance of FunctionCallInfo based function calls, replaced by
     more efficient functions with a native C argument interface.
  2) Don't extract columns from a cache entry's tuple whenever matching
     entries - instead store them as a Datum array. This also makes it
     possible to get rid of building dummy tuples for negative & list
     entries, and of a hack for dealing with cstring vs. text weirdness.
  3) Reorder members of the catcache.h structs, so important entries are
     more likely to be on one cacheline.
  4) Allowing the compiler to specialize the critical SearchCatCache for
     a specific number of attributes makes it possible to unroll loops
     and avoid other nkeys dependent initialization.
  5) Only initializing the ScanKey when necessary, i.e. on catcache
     misses, greatly reduces unnecessary cpu cache misses.
  6) Split off the cache-miss case from the hash lookup, reducing stack
     allocations etc in the common case.
  7) CatCTup and their corresponding heaptuple are allocated in one
     piece.
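
  To illustrate items 2 and 4, here is a minimal standalone sketch, not
  the patch itself. The Entry struct, search_internal and the
  search1/search2 wrappers are hypothetical names, Datum is reduced to a
  plain integer type, and keys are compared by value, whereas the real
  catcache has to use type-specific equality functions. The point is
  only that an inline worker called with a literal nkeys lets the
  compiler emit an unrolled copy per arity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 4              /* catcache lookups use at most 4 keys */

typedef uintptr_t Datum;        /* simplified stand-in for PostgreSQL's Datum */

/* Hypothetical cache entry that keeps its keys as a Datum array (item 2). */
typedef struct Entry
{
    struct Entry *next;         /* hash bucket chain */
    uint32_t    hash;           /* precomputed hash of the keys */
    Datum       keys[MAX_KEYS]; /* key values, no tuple deforming needed */
} Entry;

/*
 * Generic worker.  It is inline and every caller below passes a literal
 * nkeys, so the compiler is free to unroll the comparison loop per caller.
 */
static inline Entry *
search_internal(Entry *bucket, int nkeys, uint32_t hash, const Datum *keys)
{
    assert(nkeys >= 1 && nkeys <= MAX_KEYS);

    for (Entry *e = bucket; e != NULL; e = e->next)
    {
        bool        match = (e->hash == hash);

        /* assumption: by-value equality is enough for this sketch */
        for (int i = 0; match && i < nkeys; i++)
            match = (e->keys[i] == keys[i]);
        if (match)
            return e;
    }
    return NULL;                /* miss: a real cache would take a separate slow path */
}

/* Thin per-arity wrappers, analogous in spirit to SearchCatCache1..4. */
Entry *
search1(Entry *bucket, uint32_t hash, Datum k1)
{
    Datum       keys[1] = {k1};

    return search_internal(bucket, 1, hash, keys);
}

Entry *
search2(Entry *bucket, uint32_t hash, Datum k1, Datum k2)
{
    Datum       keys[2] = {k1, k2};

    return search_internal(bucket, 2, hash, keys);
}
```

  Callers pick the wrapper matching their key count, so the hit path
  never has to branch on nkeys at run time.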

  This results in making cache lookups themselves roughly three times as
  fast - full-system benchmarks obviously improve less than that.

  I've also evaluated further techniques:
  - replace open coded hash with simplehash - the list walk right now
    shows up in profiles. Unfortunately it's not easy to do so safely as
    an entry's memory location can change at various times, which
    doesn't work well with the refcounting and cache invalidation.
  - Cacheline-aligning CatCTup entries - helps some with performance,
    but the win isn't big and the code for it is ugly, because the
    tuples have to be freed as well.
  - add more proper functions, rather than macros for
    SearchSysCacheCopyN etc., but right now they don't show up in
    profiles.

  The reason the macro wrappers for syscache.c/h have to be changed,
  rather than just catcache, is that doing otherwise would require
  exposing the SysCache array to the outside.  That might be a good idea
  anyway, but it's for another day.
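
Item 7 of the commit message could be sketched roughly as follows. This
is a simplified standalone illustration assuming plain malloc instead of
memory contexts; CacheEntry, entry_create and entry_free are
hypothetical names, not the patch's API. The header and the payload
share one allocation, so a single free releases both and the two are
adjacent in memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry header; the payload lives right behind it. */
typedef struct CacheEntry
{
    int         refcount;
    size_t      payload_len;
    char       *payload;        /* points into the same allocation */
} CacheEntry;

/* Round n up to the strictest fundamental alignment (C11). */
static size_t
align_up(size_t n)
{
    size_t      a = _Alignof(max_align_t);

    return (n + a - 1) & ~(a - 1);
}

/* Allocate header and payload copy in one chunk; returns NULL on OOM. */
CacheEntry *
entry_create(const void *data, size_t len)
{
    size_t      header = align_up(sizeof(CacheEntry));
    CacheEntry *e;

    assert(data != NULL);

    e = malloc(header + len);
    if (e == NULL)
        return NULL;

    e->refcount = 0;
    e->payload_len = len;
    e->payload = (char *) e + header;   /* aligned start of the payload */
    memcpy(e->payload, data, len);
    return e;
}

/* One free releases both the header and the payload. */
void
entry_free(CacheEntry *e)
{
    free(e);
}
```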


With the attached benchmark for wide tuples and simple queries I get:

pgbench -M prepared -f ~/tmp/pgbench-many-cols.sql

master:
tps = 16112.117859 (excluding connections establishing)
tps = 16192.186504 (excluding connections establishing)
tps = 16091.257399 (excluding connections establishing)

patch:
tps = 18616.116993 (excluding connections establishing)
tps = 18584.036276 (excluding connections establishing)
tps = 18843.246281 (excluding connections establishing)

~17% gain


pgbench -M prepared -f ~/tmp/pgbench-many-cols.sql -c -j 16:
master:
tps = 73277.282455 (excluding connections establishing)
tps = 73078.408303 (excluding connections establishing)
tps = 73432.476550 (excluding connections establishing)

patch:
tps = 89424.043728 (excluding connections establishing)
tps = 89223.731307 (excluding connections establishing)
tps = 87830.665009 (excluding connections establishing)

~21% gain


standard pgbench readonly:
1 client:
master:
tps = 41662.984894 (excluding connections establishing)
tps = 40965.435121 (excluding connections establishing)
tps = 41438.197117 (excluding connections establishing)

patch:
tps = 42657.455818 (excluding connections establishing)
tps = 42834.812173 (excluding connections establishing)
tps = 42784.306987 (excluding connections establishing)

So roughly 2.3%, much smaller, as expected, because the syscache is
much less of a bottleneck here.

-cj 16:
master:
tps = 204642.558752 (excluding connections establishing)
tps = 205834.493312 (excluding connections establishing)
tps = 207781.943687 (excluding connections establishing)

patch:
tps = 211459.087649 (excluding connections establishing)
tps = 214890.093976 (excluding connections establishing)
tps = 214526.773530 (excluding connections establishing)

So ~3.3%.

I personally find these numbers quite convincing for a fairly localized
microoptimization.
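
For anyone wanting to recompute the gains from their own runs, a small
helper along these lines compares the mean tps of two sets of runs. It
is only a sketch; mean_tps and gain_percent are hypothetical names, and
percentages derived from individual runs rather than means will differ
somewhat:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n tps samples; n must be positive. */
double
mean_tps(const double *tps, size_t n)
{
    double      sum = 0.0;

    assert(tps != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        sum += tps[i];
    return sum / (double) n;
}

/* Relative gain of the patched runs over the baseline runs, in percent. */
double
gain_percent(const double *base, size_t nbase,
             const double *patched, size_t npatched)
{
    return (mean_tps(patched, npatched) / mean_tps(base, nbase) - 1.0) * 100.0;
}
```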


For the attached benchmark, here's the difference in profiles:
before:
single function overhead:
+    8.10%  postgres  postgres            [.] SearchCatCache
-    7.26%  postgres  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   - __memmove_avx_unaligned_erms
      + 59.29% SearchCatCache
      + 23.51% appendBinaryStringInfo
      + 5.56% pgstat_report_activity
      + 4.05% socket_putmessage
      + 2.86% pstrdup
      + 2.65% AllocSetRealloc
      + 0.73% hash_search_with_hash_value
      + 0.68% btrescan
        0.67% 0x55c02baea83f
+    4.97%  postgres  postgres            [.] appendBinaryStringInfo
+    2.92%  postgres  postgres            [.] ExecBuildProjectionInfo
+    2.60%  postgres  libc-2.24.so        [.] __strncpy_sse2_unaligned
+    2.27%  postgres  postgres            [.] hashoid
+    2.18%  postgres  postgres            [.] fmgr_info
+    2.02%  postgres  libc-2.24.so        [.] strlen

hierarchical / include child costs:
+   21.35%     8.86%  postgres  postgres            [.] SearchCatCache

after:
single function overhead:
+    6.34%  postgres  postgres            [.] appendBinaryStringInfo
+    5.12%  postgres  postgres            [.] SearchCatCache1
-    4.44%  postgres  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   - __memmove_avx_unaligned_erms
      + 60.08% appendBinaryStringInfo
      + 13.88% AllocSetRealloc
      + 11.58% socket_putmessage
      + 6.54% pstrdup
      + 4.67% pgstat_report_activity
      + 1.20% pq_getbytes
      + 1.03% btrescan
        1.03% 0x560d35168dab
+    4.02%  postgres  postgres            [.] fmgr_info
+    3.18%  postgres  postgres            [.] ExecBuildProjectionInfo
+    2.43%  postgres  libc-2.24.so        [.] strlen

hierarchical / include child costs:
+    6.63%     5.12%  postgres  postgres  [.] SearchCatCache1
+    0.49%     0.49%  postgres  postgres  [.] SearchSysCache1
+    0.10%     0.10%  postgres  postgres  [.] SearchCatCache3


(Most of the other top entries here are addressed in nearby threads)

- Andres

