Re: Use generation context to speed up tuplesorts

From: Tomas Vondra
Subject: Re: Use generation context to speed up tuplesorts
Date:
Msg-id: bcdd4e3e-c12d-cd2b-7ead-a91ad416100a@enterprisedb.com
In reply to: Re: Use generation context to speed up tuplesorts  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Use generation context to speed up tuplesorts  (Ronan Dunklau <ronan.dunklau@aiven.io>)
           Re: Use generation context to speed up tuplesorts  (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers
Hi,

I've spent quite a bit of time investigating the regressions clearly 
visible in the benchmark results for some allocation/free patterns. 
Attached is a subset of results from a recent run, but the old results 
show mostly the same thing.

Some allocation patterns (e.g. fifo-cycle or lifo-cycle) are perfectly 
fine - the patches are a clear improvement. But the results for the 
fifo-decrease pattern, for example, look like this (showing just the 
palloc timings):

     block    chunk      0001     0002    0003     0004
    ----------------------------------------------------
     32768      512     38497    37234   39817   226451

0001 just adds an extension used for the benchmarking, so it's pretty 
much "master" without any behavior changes. 0002 and 0003 make some 
changes that seem to have no impact on this workload, but 0004 makes it 
about 5x slower. Which is bizarre, because 0004 just tweaks how we 
calculate the threshold for oversized chunks.

But that shouldn't have any impact, because with the benchmark 
parameters the threshold should be 64kB, way more than the chunk size 
(which is what we allocate).
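
To make that concrete, here's roughly what such a threshold check amounts 
to - a simplified stand-alone sketch with made-up names and plain malloc, 
not the actual patch code. With a ~64kB threshold and the small chunks 
used by the benchmark, the dedicated-block branch should simply never be 
taken:

    #include <stddef.h>
    #include <stdlib.h>

    /*
     * Simplified sketch of a generation-style allocator with an
     * oversized-chunk threshold (illustrative names/numbers, not the
     * actual patch code; free path and alignment handling omitted).
     */
    #define SKETCH_BLOCK_SIZE   (1024 * 1024)   /* max block size */
    #define SKETCH_CHUNK_LIMIT  (64 * 1024)     /* assumed threshold */

    typedef struct SketchBlock
    {
        char       *freeptr;        /* next free byte in this block */
        char       *endptr;         /* end of this block */
        char        data[];
    } SketchBlock;

    static SketchBlock *current_block = NULL;

    static void *
    sketch_alloc(size_t size)
    {
        SketchBlock *block;

        if (size > SKETCH_CHUNK_LIMIT)
        {
            /* oversized: give the chunk its own exactly-sized block */
            block = malloc(offsetof(SketchBlock, data) + size);
            block->freeptr = block->endptr = block->data + size;
            return block->data;
        }

        /* normal path: carve the chunk from the current block */
        if (current_block == NULL ||
            (size_t) (current_block->endptr - current_block->freeptr) < size)
        {
            current_block = malloc(offsetof(SketchBlock, data) +
                                   SKETCH_BLOCK_SIZE);
            current_block->freeptr = current_block->data;
            current_block->endptr = current_block->data + SKETCH_BLOCK_SIZE;
        }

        block = current_block;
        block->freeptr += size;
        return block->freeptr - size;
    }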

Similarly, it should not matter whether we double or quadruple the block 
size, because we reach the maximum block size (1MB) very fast in both 
cases, and then just use 1MB blocks for 99% of the benchmark.
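
(As a trivial worked example - assuming a 32kB starting block size like 
in the results above - doubling hits the 1MB cap after 5 blocks and 
quadrupling after 3:)

    #include <stdio.h>

    /*
     * Worked example (illustrative): how many block allocations it takes
     * to grow from a 32kB initial block size to the 1MB maximum, for
     * growth factors 2 and 4.
     */
    int
    main(void)
    {
        size_t  max_block_size = 1024 * 1024;

        for (int factor = 2; factor <= 4; factor *= 2)
        {
            size_t  blksize = 32 * 1024;    /* initial block size */
            int     ngrowths = 0;

            while (blksize < max_block_size)
            {
                blksize *= factor;
                if (blksize > max_block_size)
                    blksize = max_block_size;
                ngrowths++;
            }

            /* prints: factor 2 -> 5 growths, factor 4 -> 3 growths */
            printf("factor %d: %d growths to reach %zu bytes\n",
                   factor, ngrowths, blksize);
        }
        return 0;
    }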

I was thinking that maybe the extra allocChunkLimit increases the size 
of the GenerationContext struct, requiring another cacheline. But no, 
it's 3 cachelines in both cases.
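
(The check itself is just sizeof rounded up to 64-byte lines; the struct 
below is a padded stand-in, since I'm not reproducing the real 
GenerationContext layout here:)

    #include <stdio.h>

    /*
     * Quick sanity check (illustrative): how many 64-byte cachelines a
     * struct of a given size occupies.
     */
    #define CACHELINE_SIZE  64

    typedef struct PaddedStandIn
    {
        char    pad[152];           /* pretend size of the context struct */
    } PaddedStandIn;

    int
    main(void)
    {
        size_t  sz = sizeof(PaddedStandIn);

        printf("%zu bytes -> %zu cachelines\n",
               sz, (sz + CACHELINE_SIZE - 1) / CACHELINE_SIZE);
        return 0;
    }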

But while trying to investigate / profile this, I noticed a rather 
bizarre behavior, which I can't explain but I suspect it may be related 
to the slowdown.

The benchmark queries execute generation_bench_fifo() for a range of 
block_size / chunk_size combinations, and look roughly like this:

     select block_size, chunk_size, x.*
       from generate_series(32,512,32) a(chunk_size),
          (values (1024), (2048), (4096), (8192), (16384), (32768))
          AS b(block_size),
       lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                     2*chunk_size, 100, 10000, 5000) x;

Which takes a while to run, and then produces something like this:

      block_size | chunk_size | mem_allocated | alloc_ms | free_ms
     ------------+------------+---------------+----------+---------
            1024 |         32 |      81528832 |    62016 |   24601
            1024 |         64 |     137449472 |    61951 |   36034
            1024 |         96 |     208408464 |    85437 |   57984
     ...
           32768 |        448 |     707756032 |    36605 |   67648
           32768 |        480 |     757104640 |    36838 |   71967
           32768 |        512 |     806256640 |    37073 |   76489
     (96 rows)


Now, I can run the benchmark for a single case (e.g. the last one in the 
results above), say like this:

     select block_size, chunk_size, x.* from
        (values (512)) AS a(chunk_size),
        (values (32768)) AS b(block_size),
     lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                   2*chunk_size, 100, 10000, 5000) x;


And now comes the funny part - if I run it in the same backend as the 
"full" benchmark, I get roughly the same results:

      block_size | chunk_size | mem_allocated | alloc_ms | free_ms
     ------------+------------+---------------+----------+---------
           32768 |        512 |     806256640 |    37159 |   76669

but if I reconnect and run it in a new backend, I get this:

      block_size | chunk_size | mem_allocated | alloc_ms | free_ms
     ------------+------------+---------------+----------+---------
           32768 |        512 |     806158336 |   233909 |  100785
     (1 row)

It does not matter if I wait a bit before running the query, if I run it 
repeatedly, etc. The machine is not doing anything else, the CPU is set 
to use "performance" governor, etc.

This is really strange, and the suspicious thing is that the slower 
timing is very close to the 0004 timing above (226451). So perhaps the 
root cause of these slowdowns is the same, it's just that with 0004 it 
happens all the time.

According to perf profiling, it seems there's one pretty significant 
change - 0004 spends quite a bit of time in asm_exc_page_fault

   41.17%     5.26%  postgres  postgres   [.] GenerationAlloc
           |
            --36.06%--GenerationAlloc
                      |
                      |--28.95%--asm_exc_page_fault
                      |          |
                      |           --23.17%--exc_page_fault
                      |                     |

while on 0003 there's not a single frame with asm_exc_page_fault. I have 
no idea why, though.

I wonder if this might be some unexpected / unfortunate interaction with 
kernel memory management, or something like that. Any ideas what might 
be the root cause, what to look for, etc.?
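
One thing that might help confirm or rule that out - just a sketch using 
plain getrusage(), nothing PostgreSQL-specific - is comparing the 
minor/major fault counters around the allocation loop, to see whether the 
slow runs really spend their time faulting pages back in:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    /*
     * Sketch (not PostgreSQL code): compare minor/major page-fault
     * counters around an allocation loop.
     */
    static void
    allocation_loop(void)
    {
        /* stand-in for the benchmark: allocate and free 1M small chunks */
        for (int i = 0; i < 1000000; i++)
        {
            void   *p = malloc(512);

            free(p);
        }
    }

    int
    main(void)
    {
        struct rusage   before,
                        after;

        getrusage(RUSAGE_SELF, &before);
        allocation_loop();
        getrusage(RUSAGE_SELF, &after);

        printf("minor faults: %ld, major faults: %ld\n",
               after.ru_minflt - before.ru_minflt,
               after.ru_majflt - before.ru_majflt);
        return 0;
    }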


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments
