Re: accounting for memory used for BufFile during hash joins

From: Melanie Plageman
Subject: Re: accounting for memory used for BufFile during hash joins
Date:
Msg-id: CAAKRu_YAtexcKhU3g8P563gBR5gfOCKUiPPqZjDOeYBpPJohng@mail.gmail.com
In reply to: Re: accounting for memory used for BufFile during hash joins  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: accounting for memory used for BufFile during hash joins  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers


On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
>>   One thing I was a little confused by was the nbatch_inmemory member
>>   of the hashtable.  The comment in ExecChooseHashTableSize says that
>>   it is determining the number of batches we can fit in memory.  I
>>   thought that the problem was the amount of space taken up by the
>>   BufFile data structure itself--which is related to the number of
>>   open BufFiles you need at a time. This comment in
>>   ExecChooseHashTableSize makes it sound like you are talking about
>>   fitting more than one batch of tuples into memory at a time. I was
>>   under the impression that you could only fit one batch of tuples in
>>   memory at a time.

> I suppose you mean this chunk:
>
>     /*
>      * See how many batches we can fit into memory (driven mostly by size
>      * of BufFile, with PGAlignedBlock being the largest part of that).
>      * We need one BufFile for inner and outer side, so we count it twice
>      * for each batch, and we stop once we exceed (work_mem/2).
>      */
>     while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
>            <= (work_mem * 1024L / 2))
>         nbatch_inmemory *= 2;
>
> Yeah, that comment is a bit confusing. What the code actually does is
> computing the largest "slice" of batches for which we can keep the
> BufFile structs in memory, without exceeding work_mem/2.
>
> Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

I definitely would prefer to see hashtable->nbatch_inmemory renamed to
hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?

I've been poking around the code for a while today, and, even though I
know that nbatch_inmemory refers to the BufFiles that can fit in
memory, I keep forgetting and thinking it refers to the amount of
tuple data that can fit in memory.
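
To convince myself of what it is actually counting, I worked through
the loop above for the work_mem=64 setting I use in the example
further down. Assuming the default 8kB block size (so
sizeof(PGAlignedBlock) is 8192) and that nbatch_inmemory starts at 1:

    (1 * 2) * 8192 * 2 = 32768 <= 64 * 1024 / 2 = 32768   -> double to 2
    (2 * 2) * 8192 * 2 = 65536 >  32768                   -> stop

so nbatch_inmemory ends up as 2, which, if I'm reading it right, is
why the patched EXPLAIN output below reports "in-memory 2".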

It might be worth explicitly calling out somewhere in the comments
that overflow slices will only be created in two cases: when the
number of batches was underestimated and ExecHashIncreaseNumBatches
pushes the new number of batches past hashtable->nbatch_inmemory, or
when the number of batches at hashtable creation time already exceeds
hashtable->nbatch_inmemory (the name confuses me especially at
hashtable creation time) -- that is, nbatch_inmemory is really the
number of actual BufFiles that can be managed in memory.
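
In case it is useful, this is the mental model I have ended up with --
just a sketch with names I made up (curslice, batch_file,
overflow_file), not the patch's actual code:

    /*
     * Sketch only: given a batch number, pick the file its tuples go to.
     * Batches are grouped into slices of nbatch_inmemory consecutive
     * batches; only the current slice keeps per-batch BufFiles open,
     * everything else is funnelled into one overflow file per slice.
     */
    static BufFile *
    select_batch_file(int batchno, int nbatch_inmemory, int curslice,
                      BufFile **batch_file, BufFile **overflow_file)
    {
        int     slice = batchno / nbatch_inmemory;

        if (slice == curslice)
            return batch_file[batchno % nbatch_inmemory];  /* open per-batch file */

        return overflow_file[slice];    /* per-slice "damage control" file */
    }

If that roughly matches what the patch does, it might be worth
spelling it out in a comment near the nbatch_inmemory member.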
 

> Attached is an updated patch, fixing this. I tried to clarify some of
> the comments too, and I fixed another bug I found while running the
> regression tests. It's still very much a crappy PoC code, though.


So, I ran the following example on master and with your patch.

drop table foo;
drop table bar;
create table foo(a int, b int);
create table bar(c int, d int);
insert into foo select i, i from generate_series(1,10000)i;
insert into bar select 1, 1 from generate_series(1,1000)i;
insert into bar select i%3, i%3 from generate_series(1000,10000)i;
insert into foo select 1,1 from generate_series(1,1000)i;
analyze foo; analyze bar;
set work_mem=64;

On master, the explain analyze output looked like this:

postgres=# explain analyze verbose select * from foo, bar where a = c;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.962..1048.442 rows=4008001 loops=1)
   Output: foo.a, foo.b, bar.c, bar.d
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.030..1.777 rows=10001 loops=1)
         Output: bar.c, bar.d
   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.285..12.285 rows=11000 loops=1)
         Output: foo.a, foo.b
         Buckets: 2048 (originally 2048)  Batches: 64 (originally 16)  Memory Usage: 49kB
         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.023..3.786 rows=11000 loops=1)
               Output: foo.a, foo.b
 Planning Time: 0.435 ms
 Execution Time: 1206.904 ms
(12 rows)

and with your patch, it looked like this:

postgres=# explain analyze verbose select * from foo, bar where a = c;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.256..1102.026 rows=4008001 loops=1)
   Output: foo.a, foo.b, bar.c, bar.d
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.040..1.717 rows=10001 loops=1)
         Output: bar.c, bar.d
   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.327..12.327 rows=11000 loops=1)
         Output: foo.a, foo.b
         Buckets: 2048 (originally 2048)  Batches: 16384 (originally 16, in-memory 2)  Memory Usage: 131160kB
         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.029..3.569 rows=11000 loops=1)
               Output: foo.a, foo.b
 Planning Time: 0.260 ms
 Execution Time: 1264.995 ms
(12 rows)

I noticed that the number of batches is much higher with the patch,
and, checking $PGDATA/base/pgsql_tmp, I saw that the number of temp
files (the overflow files) present at any given time was quite high.

I would imagine that the desired behaviour is to keep memory usage
within work_mem. In this example, the number of slices is about 8000,
each of which would have an overflow file. Is this the case you
mention in the comment in ExecChooseHashTableSize?

* We ignore (per-slice)
* overflow files, because those serve as "damage control" for cases
* when per-batch BufFiles would exceed work_mem. Given enough batches
* it's impossible to enforce work_mem strictly, because the overflow
* files alone will consume more memory.
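
The arithmetic here seems to line up with that: 16384 batches with 2
per slice is 8192 slices, and if each slice has its own overflow
BufFile, the ~8kB buffers for those overflow files alone come to
roughly 64MB -- about a thousand times the 64kB work_mem. (The
reported 131160kB is also suspiciously close to 16384 * 8kB =
131072kB, though I may be misreading what Memory Usage is counting
under the patch.)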

--
Melanie Plageman
