Discussion: [PATCH] Resolve Parallel Hash Join Performance Issue


[PATCH] Resolve Parallel Hash Join Performance Issue

From: "Deng, Gang"
Date:

Hi all,

 

Attached is a patch to resolve a parallel hash join performance issue. This is my first time contributing a patch to the PostgreSQL community; I used one of the previous threads as a template to report the issue and patch. Please let me know if you need more information about the problem or the patch.

 

A. Problem Summary

When we ran a query that was executed with a hash join, we could not achieve a good performance improvement by adding more threads. More specifically, when we ran query02 of the TPC-DS workload at scale 500 (a 500 GB dataset), execution time with 8 threads was 124.6 sec, while with 28 threads it was 103.5 sec. Here is the execution time for different numbers of threads:

 

number of threads:     1      4       8      16      28
time used (sec):     460.4   211   124.6   101.9   103.5

 

The test was made on a server with 384 GB of DRAM and 56 cores/112 hyper-threads. The data had been cached in the OS page cache, so there was no disk I/O during execution, and there were enough physical CPU cores to run 28 threads in parallel.

 

We investigated this problem with the perf c2c tool (http://man7.org/linux/man-pages/man1/perf-c2c.1.html) and confirmed that it was caused by false sharing and the resulting cache coherence traffic. The code that writes the contended cache line is at line 457 of nodeHashjoin.c (PostgreSQL version 12.0).

 

B. Patch

    Change line 457 in the ExecHashJoinImpl function of nodeHashjoin.c (applicable to both 12.0 and 12.1).

    original code:

        HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));

    changed to:

        if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
        {
            HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
        }

    Compared with the original code, the modified code avoids an unnecessary write to memory/cache when the bit is already set.
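
    As an illustration only (hypothetical names, not PostgreSQL source), the idiom is a read-before-write guard: in the common case where the bit is already set, the code performs no store and therefore never dirties the cache line shared with neighbouring tuples:

        /* Hypothetical sketch of the test-before-set idiom used in the patch.
         * A reader that finds the flag already set performs only a load, so it
         * does not invalidate the cache line held by other cores. */
        static inline void
        set_flag_if_unset(unsigned short *mask, unsigned short flag)
        {
            if ((*mask & flag) == 0)    /* pure read in the common case */
                *mask |= flag;          /* store only when the bit actually changes */
        }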

 

C. Test case:

  1. Use https://github.com/pivotalguru/TPC-DS to set up the TPC-DS benchmark for PostgreSQL.
  2. Run the commands below to ensure the query will be executed with the expected parallelism:

            psql postgres -h localhost -U postgres -c "alter table web_sales set (parallel_workers =28);"; psql postgres -h localhost -U postgres -c "alter table catalog_sales set (parallel_workers =28);"

  3. Run the query once (psql postgres -h localhost -U postgres -f 102.tpcds.02.sql) to ensure the data is loaded into the page cache.
  4. Run "time psql postgres -h localhost -U postgres -f 102.tpcds.02.sql" again to measure performance.

 

D. Result

With the modified code, the performance of the hash join operation scales better with the number of threads. Here is the result for query02 after the patch; for example, performance improved ~2.5x when running 28 threads.

 

number of threads:     1       4       8     16     28
time used (sec):     465.1   193.1   97.9   55.9   41

 

I have attached 5 files with more information:

  1. query_plan_q2_no_opt_28_thread: query plan using 28 threads without the patch
  2. query_plan_q2_opt_28_thread: query plan using 28 threads with the patch
  3. perf_c2c_no_opt.txt: perf c2c output before the patch
  4. perf_c2c_opt.txt: perf c2c output after the patch
  5. git diff of the patch

 

 

Thanks and Best Regards

 

Deng, Gang (邓刚)

IAGS-CPDP-CEE PRC Enterprise

Mobile: 13161337000

Office: 010-57511964

 

Attachments

Re: [PATCH] Resolve Parallel Hash Join Performance Issue

From: Thomas Munro
Date:
On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang <gang.deng@intel.com> wrote:
> Attached is a patch to resolve a parallel hash join performance issue. This is my first time contributing a patch to
> the PostgreSQL community; I used one of the previous threads as a template to report the issue and patch. Please let
> me know if you need more information about the problem or the patch.

Thank you very much for investigating this and for your report.

>         HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
>     changed to:
>
>         if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
>
>         {
>
>             HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
>         }
>
>     Compared with original code, modified code can avoid unnecessary write to memory/cache.

Right, I see.  The funny thing is that the match bit is not even used
in this query (it's used for right and full hash join, and those
aren't supported for parallel joins yet).  Hmm.  So, instead of the
test you proposed, an alternative would be to use if (!parallel).
That's a value that will be constant-folded, so that there will be no
branch in the generated code (see the pg_attribute_always_inline
trick).  If, in a future release, we need the match bit for parallel
hash join because we add parallel right/full hash join support, we
could do it the way you showed, but only if it's one of those join
types, using another constant parameter.
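
A self-contained sketch of that constant-folding idea (hypothetical names, not the actual PostgreSQL code) could look like this: the implementation takes "parallel" as a parameter and is forced inline, so every caller that passes a literal constant gets the branch removed by the compiler:

    #include <stdbool.h>

    /* Hypothetical illustration of the trick described above, using GCC/Clang
     * attribute syntax; pg_attribute_always_inline expands to something
     * similar on those compilers. */
    typedef struct DemoTuple { unsigned short infomask2; } DemoTuple;

    static inline __attribute__((always_inline)) void
    demo_hash_join_impl(DemoTuple *tup, bool parallel)
    {
        if (!parallel)                 /* constant-folded at each call site */
            tup->infomask2 |= 0x0001;  /* set the "match" bit (serial case only) */
    }

    void demo_exec_hash_join(DemoTuple *tup)          { demo_hash_join_impl(tup, false); }
    void demo_exec_parallel_hash_join(DemoTuple *tup) { demo_hash_join_impl(tup, true); }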

> D. Result
>
> With the modified code, the performance of the hash join operation scales better with the number of threads. Here is
> the result for query02 after the patch; for example, performance improved ~2.5x when running 28 threads.
>
> number of thread:    1       4        8     16    28
> time used(sec):    465.1  193.1   97.9   55.9  41

Wow.  That is a very nice improvement.



Re: [PATCH] Resolve Parallel Hash Join Performance Issue

From: Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Right, I see.  The funny thing is that the match bit is not even used
> in this query (it's used for right and full hash join, and those
> aren't supported for parallel joins yet).  Hmm.  So, instead of the
> test you proposed, an alternative would be to use if (!parallel).
> That's a value that will be constant-folded, so that there will be no
> branch in the generated code (see the pg_attribute_always_inline
> trick).  If, in a future release, we need the match bit for parallel
> hash join because we add parallel right/full hash join support, we
> could do it the way you showed, but only if it's one of those join
> types, using another constant parameter.

Can we base the test off the match type today, and avoid leaving
something that will need to be fixed later?

I'm pretty sure that the existing coding is my fault, and that it's
like that because I reasoned that setting the bit was too cheap
to justify having a test-and-branch around it.  Apparently that's
not true anymore in a parallel join, but I have to say that it's
unclear why.  In any case, the reasoning probably still holds good
in non-parallel cases, so it'd be a shame to introduce a run-time
test if we can avoid it.

            regards, tom lane



RE: [PATCH] Resolve Parallel Hash Join Performance Issue

From: "Deng, Gang"
Date:
Thank you for the comment. Yes, I agree with the alternative of using '(!parallel)', so there is no need to test the
bit. Will someone submit a patch for it accordingly?
 

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com> 
Sent: Thursday, January 9, 2020 6:04 PM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang <gang.deng@intel.com> wrote:
> Attached is a patch to resolve a parallel hash join performance issue. This is my first time contributing a patch to
> the PostgreSQL community; I used one of the previous threads as a template to report the issue and patch. Please let
> me know if you need more information about the problem or the patch.
 

Thank you very much for investigating this and for your report.

>         HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
>     changed to:
>
>         if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
>         {
>             HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>         }
>
>     Compared with original code, modified code can avoid unnecessary write to memory/cache.

Right, I see.  The funny thing is that the match bit is not even used
in this query (it's used for right and full hash join, and those
aren't supported for parallel joins yet).  Hmm.  So, instead of the
test you proposed, an alternative would be to use if (!parallel).
That's a value that will be constant-folded, so that there will be no
branch in the generated code (see the pg_attribute_always_inline
trick).  If, in a future release, we need the match bit for parallel
hash join because we add parallel right/full hash join support, we
could do it the way you showed, but only if it's one of those join
types, using another constant parameter.
 

> D. Result
>
> With the modified code, the performance of the hash join operation scales better with the number of threads. Here is
> the result for query02 after the patch; for example, performance improved ~2.5x when running 28 threads.
 
>
> number of thread:    1       4        8     16    28
> time used(sec):    465.1  193.1   97.9   55.9  41

Wow.  That is a very nice improvement.

RE: [PATCH] Resolve Parallel Hash Join Performance Issue

From: "Deng, Gang"
Date:
Regarding the reason why setting the bit is no longer cheap in a parallel join: as I explained in my original mail, it is
because of false sharing (cache coherence). In short, setting the bit dirties the whole cache line (64 bytes), so all CPU
cores that hold that cache line have to load it again, which wastes a lot of CPU time. The article
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads explains this in more detail.
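
For readers unfamiliar with the effect, here is a minimal stand-alone C sketch (illustration only, not PostgreSQL code): two threads updating logically independent fields that happen to share a 64-byte cache line keep invalidating each other's copy of the line, while padding each field onto its own line removes the contention:

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    /* Both counters sit in the same 64-byte cache line, so each store by one
     * thread forces the other core to re-fetch the line (false sharing).
     * Enlarging "pad" to 64 bytes pushes b onto a separate line and removes
     * the contention, even though the program's logic is unchanged. */
    struct counters {
        volatile long a;            /* written only by thread 1 */
        char          pad[1];       /* try pad[64] to eliminate false sharing */
        volatile long b;            /* written only by thread 2 */
    };

    static struct counters c;

    static void *bump_a(void *arg) { (void) arg; for (long i = 0; i < ITERS; i++) c.a++; return NULL; }
    static void *bump_b(void *arg) { (void) arg; for (long i = 0; i < ITERS; i++) c.b++; return NULL; }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);
        return 0;
    }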

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Thursday, January 9, 2020 10:43 PM
To: Thomas Munro <thomas.munro@gmail.com>
Cc: Deng, Gang <gang.deng@intel.com>; pgsql-hackers@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

Thomas Munro <thomas.munro@gmail.com> writes:
> Right, I see.  The funny thing is that the match bit is not even used
> in this query (it's used for right and full hash join, and those
> aren't supported for parallel joins yet).  Hmm.  So, instead of the
> test you proposed, an alternative would be to use if (!parallel).
> That's a value that will be constant-folded, so that there will be no
> branch in the generated code (see the pg_attribute_always_inline
> trick).  If, in a future release, we need the match bit for parallel
> hash join because we add parallel right/full hash join support, we
> could do it the way you showed, but only if it's one of those join
> types, using another constant parameter.

Can we base the test off the match type today, and avoid leaving something that will need to be fixed later?

I'm pretty sure that the existing coding is my fault, and that it's
like that because I reasoned that setting the bit was too cheap
to justify having a test-and-branch around it.  Apparently that's
not true anymore in a parallel join, but I have to say that it's
unclear why.  In any case, the reasoning probably still holds good
in non-parallel cases, so it'd be a shame to introduce a run-time
test if we can avoid it.

            regards, tom lane



Re: [PATCH] Resolve Parallel Hash Join Performance Issue

From: Thomas Munro
Date:
(Replies to both Gang and Tom below).

On Fri, Jan 10, 2020 at 1:52 PM Deng, Gang <gang.deng@intel.com> wrote:
> Thank you for the comment. Yes, I agree with the alternative of using '(!parallel)', so there is no need to test the
> bit. Will someone submit a patch for it accordingly?
 

Here's a patch like that.

On Fri, Jan 10, 2020 at 3:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Right, I see.  The funny thing is that the match bit is not even used
> > in this query (it's used for right and full hash join, and those
> > aren't supported for parallel joins yet).  Hmm.  So, instead of the
> > test you proposed, an alternative would be to use if (!parallel).
> > That's a value that will be constant-folded, so that there will be no
> > branch in the generated code (see the pg_attribute_always_inline
> > trick).  If, in a future release, we need the match bit for parallel
> > hash join because we add parallel right/full hash join support, we
> > could do it the way you showed, but only if it's one of those join
> > types, using another constant parameter.
>
> Can we base the test off the match type today, and avoid leaving
> something that will need to be fixed later?

I agree that it'd be nicer to use the logically correct thing, namely
a test of HJ_FILL_INNER(node), but that'd be a run-time check.  I'd
like to back-patch this and figured that we don't want to add new
branches too casually.

I have an experimental patch where "fill_inner" and "fill_outer" are
compile-time constants and you can skip various bits of code without
branching (part of a larger experiment to figure out which of many
parameters are worth specialising at a cost of a couple of KB of text
per combination, including the ability to use wider hashes so that
monster sized joins work better).  Then I could test the logically
correct thing explicitly without branches.

Attachments

Re: [PATCH] Resolve Parallel Hash Join Performance Issue

From: Thomas Munro
Date:
On Tue, Jan 21, 2020 at 6:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 10, 2020 at 1:52 PM Deng, Gang <gang.deng@intel.com> wrote:
> > Thank you for the comment. Yes, I agree with the alternative of using '(!parallel)', so there is no need to test
> > the bit. Will someone submit a patch for it accordingly?
 
>
> Here's a patch like that.

Pushed.  Thanks again for the report!

I didn't try the TPC-DS query, but could see a small improvement from
this on various simple queries, especially with a fairly small hash
table and a large outer relation, when many cores are probing.

(Off topic for this thread, but after burning a few hours on a 72-way
box investigating various things including this, I was reminded of the
performance drop-off for joins with large hash tables that happens
somewhere around 8-16 workers.  That's because we can't give 32KB
chunks out fast enough, and if you increase the chunk size it helps
only a bit.  That really needs some work; maybe something like a
separation of reservation and allocation, so that multiple segments
can be created in parallel while respecting limits, or something like
that.  The other thing I was reminded of: FreeBSD blows Linux out of
the water on big parallel hash joins on identical hardware; I didn't
dig further today but I suspect this may be down to lack of huge pages
(TLB misses), and perhaps also those pesky fallocate() calls.  I'm
starting to wonder if we should have a new GUC shared_work_mem that
reserves a wodge of shm in the main region, and hand out 'fast DSM
segments' from there, or some other higher level abstraction that's
wired into the resource release system; they would benefit from
huge_pages=try on Linux, they'd be entirely allocated (in the VM
sense) and there'd be no system calls, though admittedly there'd be
more ways for things to go wrong...)