Re: Use generation context to speed up tuplesorts

From: Ronan Dunklau
Subject: Re: Use generation context to speed up tuplesorts
Date:
Msg-id: 8046109.NyiUUSuA9g@aivenronan
In response to: Re: Use generation context to speed up tuplesorts  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Use generation context to speed up tuplesorts  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Thursday, September 9, 2021 15:37:59 CET Tomas Vondra wrote:
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
>       block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>      ------------+------------+---------------+----------+---------
>            32768 |        512 |     806256640 |    37159 |   76669
>
> but if I reconnect and run it in the new backend, I get this:
>
>       block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>      ------------+------------+---------------+----------+---------
>            32768 |        512 |     806158336 |   233909 |  100785
>      (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention.
I also noticed asm_exc_page_fault showing up in the perf report in that case.

Running strace on the backend shows that in one case we issue a lot of brk
calls, while when we run in the same process as the previous tests, we don't.
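
For reference, this is the kind of invocation I mean, attached to the backend
before running the query ($backend_pid being the pid of the backend under
test); -c prints per-syscall counts on exit instead of the full trace:

    strace -c -f -e trace=brk,mmap,munmap -p $backend_pid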

My suspicion is that the previous workload makes glibc's malloc raise its
trim_threshold (and possibly other dynamically adjusted options), which leads
to constantly moving the brk pointer in one case and not the other.
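
If I read the NOTES section of mallopt(3) correctly, that adjustment happens
when an mmap'ed block is freed: glibc then bumps the mmap threshold up to that
block's size, and the trim threshold to twice that. Here is a minimal
standalone sketch of that behaviour as I understand it (the sizes are
arbitrary, this is not the benchmark code):

#include <stdlib.h>

int main(void)
{
	/* Well above the default 128 kB mmap threshold: glibc serves this
	 * with mmap(), and on free() raises the mmap threshold to roughly
	 * this block's size and the trim threshold to twice that. */
	void *big = malloc(8 * 1024 * 1024);
	free(big);

	/* Still large, but now below the raised mmap threshold: glibc
	 * extends the heap with brk() instead of calling mmap(). */
	void *heap = malloc(4 * 1024 * 1024);
	free(heap);

	return 0;
}

Running it under strace -e trace=brk,mmap,munmap should show the first
allocation as an mmap/munmap pair and the second one as brk calls only.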

Running your fifo test with absurdly large malloc options shows that this
might indeed be the case (I needed to change several of them, because changing
just one disables the dynamic adjustment for all of them, and malloc would
then fall back to mmap'ing and munmap'ing the memory on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);        /* keep up to 1 GB of slack at the top of the heap */
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);  /* don't return memory to the kernel below 256 MB of free heap */
mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024 * sizeof(long)); /* raise the mmap cutoff to its 32 MB maximum (on 64-bit) */

I get the following results for your self-contained test. In each case I ran
the query twice, to show how much the first run's allocations cost compared
to the subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207


With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223


 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension is of real-life
allocation / free patterns, but there is probably something we can improve
here.
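
As a side note for experimenting without recompiling: if I'm not mistaken,
glibc exposes the same knobs as environment variables (also documented in
mallopt(3)), so something like the following, set before starting the server,
should mirror the oversized values above (the data directory is a
placeholder):

    MALLOC_TOP_PAD_=1073741824 \
    MALLOC_TRIM_THRESHOLD_=268435456 \
    MALLOC_MMAP_THRESHOLD_=33554432 \
    pg_ctl -D <your datadir> start

Setting any of them disables the dynamic adjustment, just like calling
mallopt() does.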

I'll try to see if I can understand more precisely what is happening.

--
Ronan Dunklau




