Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: Lawrence, Ramon
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Message-ID: 6EEA43D22289484890D119821101B1DF2C16BC@exchange20.mercury.ad.ubc.ca
In reply to: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Joshua Tolley" <eggyknap@gmail.com>)
Responses: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Joshua Tolley" <eggyknap@gmail.com>)
           Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

Joshua,

Thank you for offering to review the patch.

The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing.  I have posted the TPC-H generator
at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The generator can produce skewed data sets.  It was produced by
Microsoft Research.

After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database of scale 1 GB with a Zipfian skew of
z=1.  More information on the generator is in the document README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.
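
To give a feel for what a Zipfian skew of z=1 means, here is a minimal sketch in Python. It is only an illustration of the Zipf law itself (probability of the k-th most frequent value proportional to 1/k^z); the function names are hypothetical and it makes no claim about how dbgen generates its data internally.

```python
def zipf_probabilities(n_values, z):
    """Return P(rank k) for k = 1..n_values under a Zipf law with exponent z."""
    weights = [1.0 / (rank ** z) for rank in range(1, n_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_share(n_values, z, top_k):
    """Fraction of all occurrences that fall on the top_k most frequent values."""
    return sum(zipf_probabilities(n_values, z)[:top_k])


if __name__ == "__main__":
    # z = 0 is uniform; larger z concentrates more tuples on few values.
    for z in (0.0, 1.0, 2.0):
        print(f"z={z}: top 10 of 1000 values cover {top_share(1000, z, 10):.1%}")
```

With z=0 every value is equally likely, while at z=1 a small number of values account for a large share of the rows, which is the situation the patch targets.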

The schema DDL is at:

http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt

Note that the load time for 1 GB of data is 1-2 hours and for 10 GB of
data is about 24 hours.  I recommend you do not add the foreign keys
until after the data is loaded.

The other alternative is for us to do a pg_dump of our data sets.
However, the download size would be quite large, and it will take a
couple of days for us to get you the data in that form.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawrence@ubc.ca


> -----Original Message-----
> From: Joshua Tolley [mailto:eggyknap@gmail.com]
> Sent: November 1, 2008 3:42 PM
> To: Lawrence, Ramon
> Cc: pgsql-hackers@postgresql.org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of
> Multi-Batch Hash Join for Skewed Data Sets
>
> On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca>
> wrote:
> > We propose a patch that improves hybrid hash join's performance for
> > large multi-batch joins where the probe relation has skew.
> >
> > Project name: Histojoin
> > Patch file: histojoin_v1.patch
> >
> > This patch implements the Histojoin join algorithm as an optional
> > feature added to the standard Hybrid Hash Join (HHJ).  A flag is used
> > to enable or disable the Histojoin features.  When Histojoin is
> > disabled, HHJ acts as normal.  The Histojoin features allow HHJ to use
> > PostgreSQL's statistics to do skew aware partitioning.  The basic idea
> > is to keep build relation tuples in a small in-memory hash table that
> > have join values that are frequently occurring in the probe relation.
> > This improves performance of HHJ when multiple batches are used by 10%
> > to 50% for skewed data sets.  The performance improvements of this
> > patch can be seen in the paper (pages 25-30) at:
> >
> > http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
> >
> > All generators and materials needed to verify these results can be
> > provided.
> >
> > This is a patch against the HEAD of the repository.
> >
> > This patch does not contain platform specific code.  It compiles and
> > has been tested on our machines in both Windows (MSVC++) and Linux
> > (GCC).
> >
> > Currently the Histojoin feature is enabled by default and is used
> > whenever HHJ is used and there are Most Common Value (MCV) statistics
> > available on the probe side base relation of the join.  To disable
> > this feature simply set the enable_hashjoin_usestatmcvs flag to off in
> > the database configuration file or at run time with the 'set' command.
> >
> > One potential improvement not included in the patch is that Most
> > Common Value (MCV) statistics are only determined when the probe
> > relation is produced by a scan operator.  There is a benefit to using
> > MCVs even when the probe relation is not a base scan, but we were
> > unable to determine how to find statistics from a base relation after
> > other operators are performed.
> >
> > This patch was created by Bryce Cutt as part of his work on his M.Sc.
> > thesis.
> >
> > --
> > Dr. Ramon Lawrence
> > Assistant Professor, Department of Computer Science, University of
> > British Columbia Okanagan
> > E-mail: ramon.lawrence@ubc.ca
>
> I'm interested in trying to review this patch. Having not done patch
> review before, I can't exactly promise grand results, but if you could
> provide me with the data to check your results? In the meantime I'll
> go read the paper.
>
> - Josh / eggyknap
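
For readers following the thread, the skew-aware partitioning idea in the quoted proposal can be sketched roughly as follows. This is a simplified illustration in Python, not the patch's code: the function name, the tuple layout, and the `mcv_keys` argument (standing in for the probe relation's MCV statistics) are all assumptions, and unlike a real hybrid hash join it treats every non-MCV batch as spilled.

```python
from collections import defaultdict


def skew_aware_hash_join(build, probe, mcv_keys, n_batches):
    """Inner-join two lists of (key, payload) tuples.

    Build tuples whose key is in mcv_keys stay in a small in-memory
    table, so matching probe tuples are joined at once and never spilled.
    Returns (results, number_of_probe_tuples_sent_to_batches).
    """
    mcv_keys = set(mcv_keys)
    hot_table = defaultdict(list)  # in-memory table for frequent keys
    build_batches = [defaultdict(list) for _ in range(n_batches)]
    for key, payload in build:
        if key in mcv_keys:
            hot_table[key].append(payload)
        else:
            build_batches[hash(key) % n_batches][key].append(payload)

    results = []
    probe_batches = [[] for _ in range(n_batches)]
    spilled = 0
    for key, payload in probe:
        if key in mcv_keys:
            # Frequent key: join immediately against the in-memory table.
            for build_payload in hot_table.get(key, []):
                results.append((key, build_payload, payload))
        else:
            # Everything else is deferred to its batch (simulated spill).
            probe_batches[hash(key) % n_batches].append((key, payload))
            spilled += 1

    # Process the deferred batches one at a time.
    for batch_no in range(n_batches):
        table = build_batches[batch_no]
        for key, payload in probe_batches[batch_no]:
            for build_payload in table.get(key, []):
                results.append((key, build_payload, payload))
    return results, spilled
```

The more skewed the probe side is toward the keys in `mcv_keys`, the fewer probe tuples end up in the deferred batches, which is where the claimed savings would come from.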

