Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: Joshua Tolley
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date:
Msg-id e7e0a2570811011541x28612963w1f17dcb6d2fe846a@mail.gmail.com
In reply to: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>)
Responses: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>)
List: pgsql-hackers
On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:
> We propose a patch that improves hybrid hash join's performance for large
> multi-batch joins where the probe relation has skew.
>
> Project name: Histojoin
> Patch file: histojoin_v1.patch
>
> This patch implements the Histojoin join algorithm as an optional feature
> added to the standard Hybrid Hash Join (HHJ).  A flag is used to enable or
> disable the Histojoin features.  When Histojoin is disabled, HHJ acts as
> normal.  The Histojoin features allow HHJ to use PostgreSQL's statistics to
> do skew-aware partitioning.  The basic idea is to keep, in a small in-memory
> hash table, the build relation tuples whose join values occur frequently in
> the probe relation.  For skewed data sets, this improves HHJ performance by
> 10% to 50% when multiple batches are used.  The
> performance improvements of this patch can be seen in the paper (pages
> 25-30) at:
>
> http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
>
> All generators and materials needed to verify these results can be provided.
>
> This is a patch against the HEAD of the repository.
>
> This patch does not contain platform specific code.  It compiles and has
> been tested on our machines in both Windows (MSVC++) and Linux (GCC).
>
> Currently the Histojoin feature is enabled by default and is used whenever
> HHJ is used and there are Most Common Value (MCV) statistics available on
> the probe side base relation of the join.  To disable this feature simply
> set the enable_hashjoin_usestatmcvs flag to off in the database
> configuration file or at run time with the 'set' command.
>
> One potential improvement not included in the patch: at present, Most Common
> Value (MCV) statistics are only determined when the probe relation is
> produced by a scan operator.  There is a benefit to using MCVs even when the
> probe relation is not a base scan, but we were unable to determine how to
> find statistics from a base relation after other operators are performed.
>
> This patch was created by Bryce Cutt as part of his work on his M.Sc.
> thesis.
>
> --
> Dr. Ramon Lawrence
> Assistant Professor, Department of Computer Science, University of British
> Columbia Okanagan
> E-mail: ramon.lawrence@ubc.ca

I'm interested in trying to review this patch. I haven't done patch
review before, so I can't exactly promise grand results, but could you
provide me with the data needed to check your results? In the meantime
I'll go read the paper.

- Josh / eggyknap
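
For readers following the thread, the basic idea described in the quoted
proposal (keep build tuples whose join values are frequent in the probe
relation in a small in-memory table) can be illustrated with a toy model.
This is only a sketch under stated assumptions, not the patch's code: the
function name skew_aware_hash_join, the explicit mcvs set (standing in for
PostgreSQL's MCV statistics), and the "deferred" counter (standing in for
probe tuples written to temporary batch files) are all hypothetical, and
batch 0 is assumed to be the only regular batch held in memory.

```python
from collections import defaultdict


def skew_aware_hash_join(build, probe, mcvs, num_batches):
    """Toy hybrid hash join with a separate in-memory table for skewed keys.

    build, probe: lists of (join_key, payload) tuples.
    mcvs: set of join keys assumed to be frequent in the probe relation
          (hypothetical stand-in for MCV statistics).
    Returns (results, deferred), where deferred counts the probe tuples
    that had to wait for a later batch instead of joining immediately.
    """
    # Build phase: tuples with MCV keys go into the small skew table;
    # all other tuples are partitioned into batches by hash value.
    skew_table = defaultdict(list)
    build_batches = [defaultdict(list) for _ in range(num_batches)]
    for key, payload in build:
        if key in mcvs:
            skew_table[key].append(payload)
        else:
            build_batches[hash(key) % num_batches][key].append(payload)

    results = []
    probe_batches = [[] for _ in range(num_batches)]
    deferred = 0

    # Probe phase, first pass: MCV keys and batch 0 are joined right away,
    # everything else is set aside for its own batch.
    for key, payload in probe:
        if key in mcvs:
            for build_payload in skew_table.get(key, ()):
                results.append((key, build_payload, payload))
            continue
        batch_no = hash(key) % num_batches
        if batch_no == 0:
            for build_payload in build_batches[0].get(key, ()):
                results.append((key, build_payload, payload))
        else:
            probe_batches[batch_no].append((key, payload))
            deferred += 1

    # Later passes: process each remaining batch in turn.
    for batch_no in range(1, num_batches):
        table = build_batches[batch_no]
        for key, payload in probe_batches[batch_no]:
            for build_payload in table.get(key, ()):
                results.append((key, build_payload, payload))

    return results, deferred


if __name__ == "__main__":
    build = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
    # Key 1 is heavily over-represented on the probe side.
    probe = [(1, "x")] * 5 + [(2, "y"), (3, "z"), (5, "w")]
    print(skew_aware_hash_join(build, probe, mcvs={1}, num_batches=2)[1])
    print(skew_aware_hash_join(build, probe, mcvs=set(), num_batches=2)[1])
```

In this model both calls produce the same join result; the difference is
only in how many probe tuples are set aside for a later batch, which is
where the proposal expects skewed data sets to benefit.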

