scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra
Subject: scalability bottlenecks with (many) partitions (and more)
Msg-id: 510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
Replies: Re: scalability bottlenecks with (many) partitions (and more)  (Ronan Dunklau <ronan.dunklau@aiven.io>)
List: pgsql-hackers
Hi,

I happened to investigate a query involving a partitioned table, which
led me to a couple of bottlenecks severely affecting queries dealing
with multiple partitions (or relations in general). After a while I came
up with three WIP patches that improve the behavior by an order of
magnitude, and not just in some extreme cases.


Consider a partitioned pgbench with 100 partitions, say:

   pgbench -i -s 100 --partitions 100 testdb

but let's modify pgbench_accounts a little bit:

   ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
   UPDATE pgbench_accounts SET aid_parent = aid;
   CREATE INDEX ON pgbench_accounts(aid_parent);
   VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not a partition key. And
now let's run a query

    SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb
        ON (pa.bid = pb.bid) WHERE pa.aid_parent = :aid

so pretty much the regular "pgbench -S", except on a column that does not
allow partition elimination. Now, the plan looks like this:

                                QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=1.52..34.41 rows=10 width=465)
   Hash Cond: (pa.bid = pb.bid)
   ->  Append  (cost=0.29..33.15 rows=10 width=101)
         ->  Index Scan using pgbench_accounts_1_aid_parent_idx on
pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_2_aid_parent_idx on
pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_3_aid_parent_idx on
pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_4_aid_parent_idx on
pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         -> ...
   ->  Hash  (cost=1.10..1.10 rows=10 width=364)
         ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10
width=364)


So yeah, scanning all 100 partitions. Not great, but no partitioning
scheme is perfect for all queries. Anyway, let's see how this works on a
big AMD EPYC machine with 96/192 cores - with "-M simple" we get:

parts      1       8      16      32     64       96     160      224
-----------------------------------------------------------------------
0      13877  105732  210890  410452  709509  844683  1050658  1163026
100      653    3957    7120   12022   12707   11813    10349     9633
1000      20     142     270     474     757     808      567      427

These are transactions per second, for different numbers of clients (the
numbers in the header row). With -M prepared the story doesn't change - the
numbers are higher, but the overall behavior is pretty much the same.

Firstly, with no partitions (first row), the throughput grows by ~13k tps
per client initially, then the growth gradually levels off. But it keeps
increasing all the time.

But with 100 or 1000 partitions, the throughput peaks and then starts
dropping again. Moreover, it's just a tiny fraction of the non-partitioned
value, and the ratio is roughly equal to the number of partitions - for
example with 96 clients, the ratio between 0 and 1000 partitions is
844683/808 ≈ 1045.

I could demonstrate the same behavior with fewer partitions - e.g. with
10 partitions you get ~10x difference, and so on.

Another thing I'd mention is that this is not just about partitioning.
Imagine a star schema with a fact table and dimensions - you'll get the
same behavior depending on the number of dimensions you need to join
with. With "-M simple" you may get this, for example:

dims        1      8      16      32      64      96     160      224
----------------------------------------------------------------------
1       11737  92925  183678  361497  636598  768956  958679  1042799
10        462   3558    7086   13889   25367   29503   25353    24030
100         4     31      61     122     231     292     292      288

So, similar story - significant slowdown as we're adding dimensions.


Now, what could be causing this? Clearly, there's a bottleneck of some
kind, and we're hitting it. Some of this may simply be due to the executor
doing more stuff (more index scans, more initialization, ...), but maybe
not - one of the reasons I started looking into this was that the CPU was
not fully utilized even at small scales - maybe 60% or so.

So I started poking at things. The first thing that I thought about was
locking, obviously. That's consistent with the limited CPU utilization
(waiting on a lock = not running), and it's somewhat expected when using
many partitions - we need to lock all of them, and with 100 or 1000 of
them that's potentially a lot of locks.

From past experiments I knew about two places where such a bottleneck
could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to
give it a try: increase these values and see what happens.

For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
a machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.
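
For reference, NUM_LOCK_PARTITIONS is derived from LOG2_NUM_LOCK_PARTITIONS
in src/include/storage/lwlock.h, so the bump presumably amounts to
something like this (just a sketch, the actual 0001 patch may differ):

    /* Number of partitions of the shared lock tables (was 4, i.e. 16) */
    #define LOG2_NUM_LOCK_PARTITIONS  6
    #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)  /* 64 */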

For fast-path locking the changes are more complicated (see 0002). We
allow keeping 16 relation locks right in PGPROC, and only when this gets
full do we promote them to the regular lock table. But with enough
partitions we're guaranteed to fill these 16 slots, of course. And
increasing the number of slots is not simple - firstly, the information
is split between an array of 16 OIDs and a uint64 serving as a bitmap.
Increasing the size of the OID array is simple, but it's harder for the
auxiliary bitmap. And there are more problems - with more OIDs a simple
linear search won't do, but a plain hash table is not a good idea either,
because of poor locality and the need to delete stuff ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
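
To illustrate the layout (just a sketch of the idea, not the actual 0002
patch - the names and the choice of hash function are made up):

    #include "postgres.h"
    #include "common/hashfn.h"              /* murmurhash32() */

    #define FP_LOCK_GROUPS          64
    #define FP_LOCK_SLOTS_PER_GROUP 16

    /* one "piece": the same 16 OIDs + uint64 bitmap PGPROC has today */
    typedef struct FastPathGroup
    {
        uint64  lockBits;                   /* 3 bits per slot, as now */
        Oid     relids[FP_LOCK_SLOTS_PER_GROUP];
    } FastPathGroup;

    /* each backend keeps FastPathGroup fpGroups[FP_LOCK_GROUPS] in PGPROC */

    /* map a relation OID to its group, like a hash table bucket */
    static inline int
    FastPathLockGroupIndex(Oid relid)
    {
        return murmurhash32((uint32) relid) % FP_LOCK_GROUPS;
    }

Within the selected group, the existing logic (linear search, insertion,
removal) applies unchanged.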

Unfortunately, for the pgbench join this does not make much difference.
But for the "star join" (with -M prepared) it does this:

             1      8     16    32       64       96      160       224
------------------------------------------------------------------------
master   21610 137450 247541 300902   270932   229692   191454   189233
patched  21664 151695 301451 594615  1036424  1211716  1480953  1656203
speedup    1.0    1.1    1.2    2.0      3.8      5.3      7.7      8.8

That's a pretty nice speedup, I think.

However, why doesn't the partitioned join improve (at least not by much)?
Well, the perf profile says stuff like this:


9.16%    0.77%  postgres      [kernel.kallsyms]      [k] asm_exc_page_fault
     |      
      --8.39%--asm_exc_page_fault
            |      
             --7.52%--exc_page_fault
                    |      
                    --7.13%--do_user_addr_fault
                           |      
                            --6.64%--handle_mm_fault
                                  |      
                                   --6.29%--__handle_mm_fault
                                          |      
                                          |--2.17%--__mem_cgroup_charge
                                          |       |      
                                          |       |--1.25%--charge_memcg
                                          |       |       |      
                                          |       |        --0.57%-- ...
                                          |       |      
                                          |        --0.67%-- ...
                                          |      
                                          |--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion this may be
some sort of a scalability problem in glibc/malloc. I decided to check if
the "memory pool" patch (which I've mentioned in the memory limit thread
as an alternative way to introduce backend-level accounting/limits) could
serve as a backend-level malloc cache, and how that would work. So I
cleaned up the PoC patch I already had (see 0003), and gave it a try.

And with both patches applied, the results for the partitioned join with
100 partitions look like this:

-M simple

                1      8      16      32      64      96     160    224
------------------------------------------------------------------------
master        653   3957    7120   12022   12707   11813   10349   9633
both patches  954   7356   14580   28259   51552   65278   70607  69598
speedup       1.5    1.9     2.0     2.4     4.1     5.5     6.8    7.2


-M prepared

                1      8      16      32      64      96     160    224
------------------------------------------------------------------------
master       1639   8273   14138   14746   13446   14001   11129  10136
both patches 4792  30102   62208  122157  220984  267763  315632 323567
speedup       2.9    3.6     4.4     8.3    16.4    19.1    28.4   31.9


That's pretty nice, I think. And I've seen many such improvements - it's
not a cherry-picked example. For the star join, the improvements are
very similar.

I'm attaching PDF files with a table visualizing results for these two
benchmarks - there are results for different numbers of partitions/scales
and different builds (master, one or both of the patches). There's also
a comparison to master, with color scale "red = slower, green = faster"
(but there's no red anywhere, not even for low client counts).

It's also interesting that with just the 0003 patch applied, the change
is much smaller. It's as if the two bottlenecks (locking and malloc) are
in balance - if you only address one of them, you don't get much. But if you
address both, it flies.

FWIW, where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:

    so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt
it's the only such allocation.

I don't want to get into too much detail about the memory pool, but I
think it's something we should consider doing - I'm sure there's stuff
to improve, but caching the malloc-ed blocks can clearly be very beneficial. The
basic idea is to have a cache that is "adaptive" (i.e. adjusts to
caching blocks of sizes needed by the workload) but also cheap. The
patch is PoC/WIP and needs more work, but I think it works quite well.
If anyone wants to take a look or have a chat at FOSDEM, for example,
I'm available.
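
To give a rough idea of the shape (a much simplified sketch of the concept,
not the 0003 patch - all names here are made up): a small per-backend set
of freelists bucketed by block size, sitting between the memory contexts
and malloc():

    #include <stdlib.h>

    #define MEMPOOL_BUCKETS  16     /* power-of-two size classes */

    typedef struct MemPoolEntry
    {
        struct MemPoolEntry *next;
    } MemPoolEntry;

    static MemPoolEntry *pool_freelist[MEMPOOL_BUCKETS];
    static int pool_nfree[MEMPOOL_BUCKETS];     /* blocks cached right now */
    static int pool_maxfree[MEMPOOL_BUCKETS];   /* adjusted to the workload */

    /* round a request up to its power-of-two size class; -1 = don't cache */
    static int
    mempool_bucket(size_t size)
    {
        int     idx = 0;
        size_t  bucket = 1024;              /* smallest cached block: 1kB */

        while (bucket < size)
        {
            bucket <<= 1;
            idx++;
        }
        return (idx < MEMPOOL_BUCKETS) ? idx : -1;
    }

    static size_t
    mempool_bucket_size(int idx)
    {
        return (size_t) 1024 << idx;
    }

    static void *
    mempool_alloc(size_t size)
    {
        int     idx = mempool_bucket(size);

        if (idx < 0)
            return malloc(size);            /* too large to cache */

        if (pool_freelist[idx] != NULL)
        {
            MemPoolEntry *entry = pool_freelist[idx];

            pool_freelist[idx] = entry->next;
            pool_nfree[idx]--;
            return (void *) entry;          /* cache hit, no malloc() */
        }
        return malloc(mempool_bucket_size(idx));
    }

    static void
    mempool_free(void *ptr, size_t size)
    {
        int     idx = mempool_bucket(size);

        if (idx >= 0 && pool_nfree[idx] < pool_maxfree[idx])
        {
            MemPoolEntry *entry = (MemPoolEntry *) ptr;

            entry->next = pool_freelist[idx];   /* keep it for reuse */
            pool_freelist[idx] = entry;
            pool_nfree[idx]++;
        }
        else
            free(ptr);                          /* bucket full, give it back */
    }

The "adaptive" part would be in how pool_maxfree[] gets adjusted based on
which block sizes the workload actually allocates (omitted above, so it
starts at 0 and nothing is cached until the adaptation bumps it).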

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.
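
(Regarding tweaking glibc: it does have knobs for this - raising the
trim/mmap thresholds makes malloc hold on to freed memory instead of
returning it to the kernel. A sketch, untested for this workload, with
arbitrary example values:)

    #include <malloc.h>

    static void
    tune_glibc_malloc(void)
    {
        /* don't trim the heap until 64MB of it is free */
        mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);
        /* serve larger allocations from the heap instead of mmap() */
        mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);
    }

The same can be done without code changes via the MALLOC_TRIM_THRESHOLD_
and MALLOC_MMAP_THRESHOLD_ environment variables.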

FWIW there's another bottleneck people may not realize, and that's the
number of file descriptors. Once you get to >1000 relations, you can
easily get into a situation like this:


54.18%    0.48%  postgres       [kernel.kallsyms]       [k]
entry_SYSCALL_64_after_hwframe
      |       
       --53.70%--entry_SYSCALL_64_after_hwframe
               |       
               --53.03%--do_syscall_64
                       |       
                       |--28.29%--__x64_sys_openat
                       |        |       
                       |         --28.14%--do_sys_openat2
                       |                |       
                       |                |--23.14%--do_filp_open
                       |                |        |       
                       |                |         --22.72%--path_openat


That's pretty bad - it means we're closing/opening file descriptors like
crazy, because every query needs all the files. If I increase the number of
file descriptors (both in ulimit and max_files_per_process) to prevent
this thrashing, I can increase the throughput ~5x. Of course, this is not
a bottleneck that we can "fix" in code, it's simply a consequence of not
having enough file descriptors etc. But I wonder if we might make it
easier to monitor this, e.g. by tracking the fd cache hit ratio, or
something like that ...


There's a more complete set of benchmarking scripts and results for
these and other tests, in various formats (PDF, ODS, ...) at

    https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big EPYC machine,
but also smaller Intel machines (4C and 16C), and even two rpi5 machines
(yes, it helps quite a bit even on the rpi5).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
