Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: Lawrence, Ramon
Subject: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Msg-id: 6EEA43D22289484890D119821101B1DF2C1683@exchange20.mercury.ad.ubc.ca
Replies: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Joshua Tolley" <eggyknap@gmail.com>)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  (Joshua Tolley <eggyknap@gmail.com>)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  (Joshua Tolley <eggyknap@gmail.com>)
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

We propose a patch that improves hybrid hash join’s performance for large multi-batch joins where the probe relation has skew.

Project name: Histojoin

Patch file: histojoin_v1.patch

This patch implements the Histojoin join algorithm as an optional feature of the standard Hybrid Hash Join (HHJ).  A flag enables or disables the Histojoin features; when they are disabled, HHJ behaves as before.  The Histojoin features allow HHJ to use PostgreSQL’s statistics to do skew-aware partitioning.  The basic idea is to keep the build relation tuples whose join values occur frequently in the probe relation in a small in-memory hash table.  For skewed data sets, this improves the performance of HHJ by 10% to 50% when multiple batches are used.  The performance improvements are shown in the paper (pages 25-30) at:

http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf

All generators and materials needed to verify these results can be provided.
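
As a rough illustration of the basic idea described above, the following Python sketch keeps build tuples whose join values are among the probe relation's most common values in a small in-memory table, and partitions all remaining tuples into batches. It is a simplified model under stated assumptions, not the code in the patch: the function name histojoin_sketch, the mcv_values argument, and the use of Python lists in place of on-disk batch files are all hypothetical. It also treats every non-MCV tuple as deferred to a batch, whereas a real hybrid hash join keeps its first batch in memory as well.

```python
from collections import defaultdict


def histojoin_sketch(build, probe, mcv_values, num_batches, key=lambda t: t[0]):
    """Inner-join build and probe on key, treating frequent probe values specially.

    mcv_values is assumed to hold the join values that occur most often in
    the probe relation (for example, taken from optimizer statistics).
    Lists stand in for the batch files a real hybrid hash join writes to disk.
    """
    mcv_values = set(mcv_values)
    mcv_table = defaultdict(list)                      # small in-memory table
    build_batches = [[] for _ in range(num_batches)]   # stand-in for disk
    probe_batches = [[] for _ in range(num_batches)]
    results = []
    deferred_probe_tuples = 0

    # Build phase: tuples matching a frequent probe value stay in memory.
    for b in build:
        k = key(b)
        if k in mcv_values:
            mcv_table[k].append(b)
        else:
            build_batches[hash(k) % num_batches].append(b)

    # Probe phase: frequent values are joined at once; the rest are deferred.
    for p in probe:
        k = key(p)
        if k in mcv_values:
            results.extend((b, p) for b in mcv_table.get(k, ()))
        else:
            probe_batches[hash(k) % num_batches].append(p)
            deferred_probe_tuples += 1

    # Batch phase: join each pair of partitions with an ordinary hash table.
    for build_part, probe_part in zip(build_batches, probe_batches):
        table = defaultdict(list)
        for b in build_part:
            table[key(b)].append(b)
        for p in probe_part:
            results.extend((b, p) for b in table.get(key(p), ()))

    return results, deferred_probe_tuples
```

For a probe relation in which one join value accounts for 8 of 10 tuples, passing that value in mcv_values leaves only 2 probe tuples to be deferred to batches, while the join result is the same as with an empty mcv_values.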

This is a patch against the HEAD of the repository.

This patch contains no platform-specific code.  It compiles and has been tested on our machines on both Windows (MSVC++) and Linux (GCC).

Currently, the Histojoin feature is enabled by default and is used whenever HHJ is used and Most Common Value (MCV) statistics are available on the probe-side base relation of the join.  To disable it, set the enable_hashjoin_usestatmcvs flag to off in the database configuration file, or at run time with the SET command (for example, SET enable_hashjoin_usestatmcvs = off;).
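
To illustrate how MCV statistics could feed such a decision, the sketch below takes a list of most common values with their frequencies (PostgreSQL exposes such lists in the pg_stats view as most_common_vals and most_common_freqs) and picks the most frequent values up to a cap on the number of values. The function name choose_mcvs and the max_values cap are hypothetical; the post does not say how the patch selects values or sizes the in-memory table, so this is only one plausible approach.

```python
def choose_mcvs(mcv_vals, mcv_freqs, max_values):
    """Pick up to max_values of the most frequent probe-side join values.

    mcv_vals and mcv_freqs are parallel lists in the style of pg_stats'
    most_common_vals and most_common_freqs (each frequency is a fraction
    of the rows). Returns the chosen values and the fraction of probe
    tuples they cover.
    """
    if len(mcv_vals) != len(mcv_freqs):
        raise ValueError("values and frequencies must have the same length")
    # Rank by frequency, highest first; ties keep their original order.
    ranked = sorted(zip(mcv_freqs, mcv_vals), key=lambda fv: fv[0], reverse=True)
    chosen = ranked[:max(0, max_values)]
    values = [v for _, v in chosen]
    covered = sum(f for f, _ in chosen)
    return values, covered
```

With frequencies of 0.40, 0.25, 0.10 and 0.05 and a cap of two values, the two most frequent values are chosen and together cover about 65% of the probe tuples; those are the tuples that would be joined against the in-memory table rather than deferred to a later batch.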

One limitation that the patch does not address is that Most Common Value (MCV) statistics are only obtained when the probe relation is produced directly by a scan operator.  There is a benefit to using MCVs even when the probe relation is not a base scan, but we were unable to determine how to find the statistics of a base relation after other operators have been applied.

This patch was created by Bryce Cutt as part of his work on his M.Sc. thesis.

--

Dr. Ramon Lawrence

Assistant Professor, Department of Computer Science, University of British Columbia Okanagan

E-mail: ramon.lawrence@ubc.ca
