Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From: Tomas Vondra
Subject: Re: 9.5: Better memory accounting, towards memory-bounded HashAgg
Msg-id: 54AD83E1.1090109@2ndquadrant.com
In reply to: Re: 9.5: Better memory accounting, towards memory-bounded HashAgg  (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: 9.5: Better memory accounting, towards memory-bounded HashAgg  (Michael Paquier <michael.paquier@gmail.com>)
    Re: 9.5: Better memory accounting, towards memory-bounded HashAgg  (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers
Hi,

On 23.12.2014 10:16, Jeff Davis wrote:
> It seems that these two patches are being reviewed together. Should
> I just combine them into one? My understanding was that some wanted
> to review the memory accounting patch separately.
>
> On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
>> That's the only conflict, and after fixing it it compiles OK.
>> However, I got a segfault on the very first query I tried :-(
>
> If lookup_hash_entry doesn't find the group, and there's not enough
> memory to create it, then it returns NULL; but the caller wasn't
> checking for NULL. My apologies for such a trivial mistake, I was
> doing most of my testing using DISTINCT. My fix here was done
> quickly, so I'll take a closer look later to make sure I didn't miss
> something else.
>
> New patch attached (rebased, as well).
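
(If I understand the fix correctly, it amounts to a caller-side check
roughly like this - just an illustrative sketch, not the actual patch
code, and hash_agg_spill_tuple is a made-up name:)

    entry = lookup_hash_entry(aggstate, outerslot);

    if (entry == NULL)
    {
        /*
         * Not enough memory to create the new group - spill the input
         * tuple to a batch file instead of dereferencing NULL.
         */
        hash_agg_spill_tuple(aggstate, outerslot);
        continue;
    }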

I did a review today, using these two patches:

    * memory-accounting-v9.patch (submitted on December 2)
    * hashagg-disk-20141222.patch

I started with some basic performance measurements, comparing hashagg
queries with and without the patches (whereas you compared hashagg and
sort). That's IMHO an interesting comparison, especially when no
batching is necessary - in the optimal case users should see no
slowdown at all (we shouldn't make them pay for batching unless it's
actually needed).

So I did this:

    drop table if exists test_hash_agg;

    create table test_hash_agg as
        select
            i AS a,
            mod(i,1000000) AS b,
            mod(i,100000) AS c,
            mod(i,10000) AS d,
            mod(i,1000) AS e,
            mod(i,100) AS f
    from generate_series(1,10000000) s(i);

    vacuum (full, analyze) test_hash_agg;

i.e. a ~500MB table with 10M rows, and columns with different
cardinalities. And then queries like this:

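   -- 10M groups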
   select count(*) from (select a, count(a) from test_hash_agg
                         group by a) foo;

   -- 10M groups (OOM)
   select count(*) from (select a, array_agg(a) from test_hash_agg
                         group by a) foo;

   -- 100 groups
   select count(*) from (select f, array_agg(f) from test_hash_agg
                         group by f) foo;

which performed quite well, i.e. I've seen absolutely no slowdown.
Which, in the array_agg case, is quite suspicious, because on every
lookup_hash_entry call it has to do MemoryContextMemAllocated() on
10M contexts, and I really doubt that can be done in ~0 time.
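
For context: with recurse=true, that function has to walk the whole
subtree of child contexts recursively, i.e. something like this (a
simplified sketch of what such a walk has to do - field names may
differ from the actual patch):

    Size
    MemoryContextMemAllocated(MemoryContext context, bool recurse)
    {
        Size        total = context->mem_allocated;

        if (recurse)
        {
            MemoryContext child;

            for (child = context->firstchild;
                 child != NULL;
                 child = child->nextchild)
                total += MemoryContextMemAllocated(child, true);
        }

        return total;
    }

So with 10M per-group contexts hanging off the accounted context, that
would be 10M pointer hops per call - clearly not free.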

So I started digging in the code and I noticed this:

    hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);

which is IMHO obviously wrong, because that accounts only for the
hashtable itself. It might be correct for aggregates with state passed
by value, but it's incorrect for state passed by reference (e.g.
Numeric, arrays etc.), because initialize_aggregates does this:

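    /* the initial value of a by-reference transition state gets
     * copied into aggcontext, so that's where the memory accumulates */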
    oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
    pergroupstate->transValue = datumCopy(peraggstate->initValue,
                                        peraggstate->transtypeByVal,
                                        peraggstate->transtypeLen);
    MemoryContextSwitchTo(oldContext);

and it's also wrong for all the user-defined aggregates that have no
access to hashcontext at all, and only get a reference to aggcontext
using AggCheckCallContext(). array_agg() is a prime example.
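
To illustrate, a typical transition function with by-reference state
looks roughly like this (a minimal made-up example; array_agg's
accumArrayResult follows the same pattern and even creates a child
context under aggcontext):

    typedef struct MyState { int64 count; } MyState;

    Datum
    my_agg_transfn(PG_FUNCTION_ARGS)
    {
        MemoryContext aggcontext;
        MyState    *state;

        if (!AggCheckCallContext(fcinfo, &aggcontext))
            elog(ERROR, "my_agg_transfn called in non-aggregate context");

        if (PG_ARGISNULL(0))
        {
            /*
             * First call for this group - allocate the state in
             * aggcontext, the only context the aggregate can get at.
             * hashcontext is not visible from here at all.
             */
            state = (MyState *) MemoryContextAllocZero(aggcontext,
                                                       sizeof(MyState));
        }
        else
            state = (MyState *) PG_GETARG_POINTER(0);

        state->count += 1;

        PG_RETURN_POINTER(state);
    }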

In those cases the patch actually does no memory accounting, and as
hashcontext has no child contexts, there's no accounting overhead
either.

After fixing this bug (replacing hashcontext with aggcontext in both
calls to MemoryContextMemAllocated) the performance drops dramatically.
Both calls now simply read:
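
    hash_mem = MemoryContextMemAllocated(aggstate->aggcontext, true);
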
For the query with 100 groups (not requiring any batching) I see this:

test=# explain analyze select count(x) from (select f, array_agg(1) AS x
from test_hash_agg group by f) foo;

                        QUERY PLAN (original patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=2539.156..2539.156 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=2492.264..2539.012 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 24kB  Disk Usage:0kB
         ->  Seq Scan on test_hash_agg
                     (cost=0.00..163693.71 rows=9999871 width=4)
                     (actual time=0.022..547.379 rows=10000000 loops=1)
 Planning time: 0.039 ms
 Execution time: 2542.932 ms
(7 rows)

                        QUERY PLAN (fixed patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=5670.885..5670.885 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=5623.254..5670.803 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 117642kB  Disk Usage:0kB
         ->  Seq Scan on test_hash_agg
                     (cost=0.00..163693.71 rows=9999871 width=4)
                     (actual time=0.014..577.924 rows=10000000 loops=1)
 Planning time: 0.103 ms
 Execution time: 5677.187 ms
(7 rows)

So the performance drops 2x. With more groups, the impact is even
worse. For example, with the array_agg query producing 10M groups,
this is what I get in perf:

explain analyze select count(x) from (select a, array_agg(1) AS x from
test_hash_agg group by a) foo;


   PerfTop:    4671 irqs/sec  kernel:11.2%  exact:  0.0%
------------------------------------------------------------------------

    87.07%  postgres                 [.] MemoryContextMemAllocated
     1.63%  [zcommon]                [k] fletcher_4_native
     1.60%  [kernel]                 [k] acpi_processor_ffh_cstate_enter
     0.68%  [kernel]                 [k] xhci_irq
     0.30%  ld-2.19.so               [.] _dl_sysinfo_int80
     0.30%  [kernel]                 [k] memmove
     0.26%  [kernel]                 [k] ia32_syscall
     0.16%  libglib-2.0.so.0.3800.2  [.] 0x000000000008c52a
     0.15%  libpthread-2.19.so       [.] pthread_mutex_lock


and it runs indefinitely (I gave up after a few minutes). I believe this
renders the proposed memory accounting approach dead.

Granted, 10M groups is a bit of an extreme example, but the query with
100 groups certainly is not.

I understand the effort to avoid the 2% regression measured by Robert on
a PowerPC machine, but I don't think that's a sufficient reason to cause
so much trouble to everyone using array_agg() or user-defined aggregates
based on the same 'create subcontext' pattern.

Especially when the REINDEX case can often be improved simply by
increasing maintenance_work_mem, while there's nothing the user can do
about this slowdown (it's not caused by batching, so increasing
work_mem does nothing).

The array_agg() patch I submitted to this CF would fix this particular
query, as it removes the child contexts (so there's no need for
recursion in MemoryContextMemAllocated), but it does nothing for the
user-defined aggregates out there. And it's not committed yet.

Also, ISTM this makes it rather unusable as a general accounting
approach. If a mere 100 subcontexts result in a 2x slowdown, then a
handful of subcontexts will certainly have a measurable impact if we
decide to use this somewhere else (and not just in hash aggregate).

So I was curious how the accounting mechanism I proposed (a parallel
MemoryAccounting hierarchy next to MemoryContext) would handle this,
and I used it instead of memory-accounting-v9.patch. I measured no
difference compared to master (no slowdown at all).
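
The idea is that the accounting is updated eagerly, at allocation
time, propagating up the accounting hierarchy (which may be much
shallower than the context hierarchy), so reading the total is O(1).
Roughly like this (a simplified sketch of the idea, not the exact
patch code):

    typedef struct MemoryAccountingData
    {
        int64       totalallocated; /* bytes allocated in this subtree */
        struct MemoryAccountingData *parent;    /* parent account, or NULL */
    } MemoryAccountingData;

    /* called by the allocator whenever a block is allocated or freed */
    static void
    MemoryAccountingUpdate(MemoryAccountingData *account, int64 delta)
    {
        for (; account != NULL; account = account->parent)
            account->totalallocated += delta;
    }

Reading the current total is then a single counter lookup, no matter
how many child contexts there are.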

I also tried array_agg() queries that actually require batching, e.g.

  select a, array_agg(1) x from test_hash_agg group by a;

which produces 10M groups, each using a separate 8kB context - about
80GB in total. With work_mem=1GB this should proceed just fine with
~80 batches. In practice, it runs indefinitely (again, I lost patience
after a few minutes), and I see this:

Device: sda
    rrqm/s:      0,00    wrqm/s:       0,00
    r/s:         0,00    w/s:        281,00
    rkB/s:       0,00    wkB/s:   140784,00
    avgrq-sz: 1002,02    avgqu-sz:   143,25
    await:     512,57    r_await:      0,00    w_await: 512,57
    svctm:       3,56    %util:      100,00

tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ du -s
374128  .
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | wc -l
130
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | head
celkem 372568
-rw------- 1 tomas users 2999480  7. led 19.46 pgsql_tmp23267.2637
-rw------- 1 tomas users 2973520  7. led 19.46 pgsql_tmp23267.2638
-rw------- 1 tomas users 2978800  7. led 19.46 pgsql_tmp23267.2639
-rw------- 1 tomas users 2959880  7. led 19.46 pgsql_tmp23267.2640
-rw------- 1 tomas users 3010040  7. led 19.46 pgsql_tmp23267.2641
-rw------- 1 tomas users 3083520  7. led 19.46 pgsql_tmp23267.2642
-rw------- 1 tomas users 3053160  7. led 19.46 pgsql_tmp23267.2643
-rw------- 1 tomas users 3044360  7. led 19.46 pgsql_tmp23267.2644
-rw------- 1 tomas users 3014000  7. led 19.46 pgsql_tmp23267.2645

That is, there are ~130 files, each ~3MB, ~370MB in total. Yet the
system keeps writing ~140MB/s the whole time, although the source
table is only ~500MB.

So either I'm missing something, or there's some sort of bug.

Attached you can find a bunch of files:

  * 0001-memory-accounting.patch (my memory accounting)
  * 0002-hashagg-patch.patch (jeff's patch, for completeness)
  * 0003-context-fix.patch (fix hashcontext -> aggcontext)
  * test.sql (data / queries I used for testing)

regards
Tomas
