Re: Gather Merge

From Rushabh Lathia
Subject Re: Gather Merge
Date
Msg-id CAGPqQf3eBDMLYCzNWixbNQ8o5Czfr_Qv5eSomDEBstzMROgioQ@mail.gmail.com
In response to Re: Gather Merge  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: Gather Merge
Re: Gather Merge
List pgsql-hackers


On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Please find attached latest patch which fix the review point as well as
> additional clean-up.

I've signed up to review this patch and I'm planning to do some
testing.  Here's some initial feedback after a quick read-through:


Thanks Thomas.
 
+ if (gather_merge_readnext(gm_state, i, initialize ? false : true))

Clunky ternary operator... how about "!initialize".


Fixed.
 
+/*
+ * Function clear out a slot in the tuple table for each gather merge
+ * slots and returns the clear clear slot.
+ */

Maybe better like this?  "_Clear_ out a slot in the tuple table for
each gather merge _slot_ and _return_ the _cleared_ slot."


Fixed.
 
+ /* Free tuple array as we no more need it */

"... as we don't need it any more"


Fixed
 
+/*
+ * Read the next tuple for gather merge.
+ *
+ * Function fetch the sorted tuple out of the heap.
+ */

"_Fetch_ the sorted tuple out of the heap."


Fixed
 
+ * Otherwise, pull the next tuple from whichever participate we
+ * returned from last time, and reinsert the index into the heap,
+ * because it might now compare differently against the existing

s/participate/participant/


Fixed.
 
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"? 

Fixed.
 
Are we supposed to be blaming the University of California
for new files?


Not quite sure about this, so keeping this as it is.
 
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "lib/binaryheap.h"

Not correctly sorted.


Copied from nodeGather.c, but fixed here.
 
+ /*
+ * store the tuple descriptor into gather merge state, so we can use it
+ * later while initilizing the gather merge slots.
+ */

s/initilizing/initializing/


Fixed.
 
+/* ----------------------------------------------------------------
+ * ExecEndGatherMerge
+ *
+ * frees any storage allocated through C routines.
+ * ----------------------------------------------------------------

The convention in Postgres code seems to be to use a form like "Free
any storage ..." in function documentation.  Not sure if that's an
imperative, an infinitive, or if the word "we" is omitted since
English is so fuzzy like that, but it's inconsistent with other
documentation to use "frees" here.  Oh, I see that exact wording is in
several other files.  I guess I'll just leave this as a complaint
about all those files then :-)


Sure.
 
+ * Pull atleast single tuple from each worker + leader and set up the heap.

s/atleast single/at least a single/


Fixed.
 
+ * Read the tuple for given reader into nowait mode, and form the tuple array.

s/ into / in /


Fixed.
 
+ * Function attempt to read tuple for the given reader and store it into reader

s/Function attempt /Attempt /


Fixed.
 
+ * Function returns true if found tuple for the reader, otherwise returns

s/Function returns /Return /


Fixed.
 
+ * First try to read tuple for each worker (including leader) into nowait
+ * mode, so that we initialize read from each worker as well as leader.

I wonder if it would be good to standardise on the terminology we use
when we mean workers AND the leader.  In my Parallel Shared Hash work,
I've been saying 'participants' if I mean = workers + leader.  What do
you think?


I am not quite sure about "participants". In my opinion, when we explicitly
say workers + leader it's more clear. I am open to changing it if the
committer thinks otherwise.

 
+ * After this if all active worker unable to produce the tuple, then
+ * re-read and this time read the tuple into wait mode. For the worker,
+ * which was able to produced single tuple in the earlier loop and still
+ * active, just try fill the tuple array if more tuples available.
+ */

How about this?  "After this, if all active workers are unable to
produce a tuple, then re-read and this time use wait mode.  For workers
that were able to produce a tuple in the earlier loop and are still
active, just try to fill the tuple array if more tuples are
available."


Fixed.
 
+ * The heap is never spilled to disk, since we assume N is not very large. So
+ * this is much simple then cost_sort.

s/much simple then/much simpler than/


Fixed.
 
+ /*
+ * Avoid log(0)...
+ */
+ N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
+ logN = LOG2(N);
...
+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two?  The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor".  In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position.  There is no per-tuple "delete"
involved.  We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever!  The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

See cost_merge_append.
 
Also, shouldn't we add 1 to N to account for the leader?  Suppose
there are 2 workers.  There are 3 elements in the binary heap.  The
element to be sifted down must be compared against either 1 or 2
others to reorganise the heap.  Surely in that case we should estimate
log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.

Yes, good catch. For Gather Merge the leader always participates, so
we should use num_workers + 1.
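
For reference, a rough sketch of the heap maintenance cost with the leader
counted as a participant (illustrative only, modeled on the quoted
cost_merge_append-style formula; not necessarily the exact code in the
attached patch):

    /* Leader always participates, so the heap holds num_workers + 1 entries. */
    N = (double) (path->num_workers + 1);
    logN = LOG2(N);

    /* As elsewhere in costsize.c, assume two operator evals per comparison. */
    comparison_cost = 2.0 * cpu_operator_cost;

    /* Heap creation cost at startup. */
    startup_cost += comparison_cost * N * logN;

    /*
     * Per-tuple sift-down cost; whether an extra 2.0 factor belongs here is
     * the open question above.
     */
    run_cost += path->path.rows * comparison_cost * logN;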
 

I suspect that the leader's contribution will be equivalent to a whole
worker if the plan involves a sort: as soon as the leader pulls a
tuple in gather_merge_init, the sort node will pull all the tuples it
can in a tight loop.  It's unfortunate that cost_seqscan has to
estimate what the leader's contribution will be without knowing
whether it has a "greedy" high-startup-cost consumer like a sort or
hash node where the leader will contribute a whole backend's full
attention as soon as it executes the plan, or a lazy consumer where
the leader will probably not contribute much if there are enough
workers to keep it distracted.  In the case of a Gather Merge -> Sort
-> Parallel Seq Scan plan, I think we will overestimate the number of
rows (per participant), because cost_seqscan will guess that the
leader is spending 30% of its time per worker servicing the workers,
when in fact it will be sucking tuples into a sort node just as fast
as anyone else.  But I don't see what this patch can do about that...


Exactly. There is a very thin line when it comes to calculating the cost.
In general, while calculating the cost for GM, I just tried to stay similar
to Gather + MergeAppend.
 
+ * When force is true, function reads the tuple into wait mode. For gather
+ * merge we need to fill the slot from which we returned the earlier tuple, so
+ * this require tuple to be read into wait mode. During initialization phase,
+ * once we try to read the tuple into no-wait mode as we want to initialize all
+ * the readers. Refer gather_merge_init() for more details.
+ *
+ * Function returns true if found tuple for the reader, otherwise returns
+ * false.
+ */
+static bool
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)

s/into wait mode/in wait mode/

This appears throughout the comments; not sure if I can explain this
well but "in wait mode" describes a state of being which is wanted
here, "into wait mode" describes some kind of change or movement or
insertion.

Perhaps it would be better to say "reads the tuple _queue_ in wait
mode", just to make clearer that this is talking about the wait/nowait
feature of tuple queues, and perhaps also note that the leader always
waits since it executes the plan.


Fixed. I just chose to s/into wait mode/in wait mode/.

Maybe we should use "bool nowait" here anyway, mirroring the TupleQueue
interface?  Why introduce another terminology for the same thing with
inverted sense?


Agreed. Changed the gm_readnext_tuple() and gather_merge_readnext() APIs
accordingly.
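
Roughly, the reworked reader function now follows the TupleQueue sense of
the flag; a sketch only (field names such as gm_reader are approximate, the
attached patch is authoritative):

    static HeapTuple
    gm_readnext_tuple(GatherMergeState *gm_state, int nreader, bool nowait,
                      bool *done)
    {
        TupleQueueReader *reader = gm_state->gm_reader[nreader];

        /* Check for interrupts before potentially blocking on the queue. */
        CHECK_FOR_INTERRUPTS();

        /* nowait is passed straight through to the tuple queue machinery. */
        return TupleQueueReaderNext(reader, nowait, done);
    }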
 
+/*
+ * Read the tuple for given reader into nowait mode, and form the tuple array.
+ */
+static void
+form_tuple_array(GatherMergeState *gm_state, int reader)

This function is strangely named.  How about try_to_fill_tuple_buffer
or something?

+ GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];

I wonder if the purpose of gm_tuple, would be clearer if it were
called gm_tuple_buffers.  Plural because it holds one buffer per
reader.  Then in that variable on the left hand side there could be
called tuple_buffer (singular), because it's the buffer of tuples for
one single reader.


Yes, you are right. I renamed the variable as well as the structure.
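
Roughly, the per-reader buffer now looks like this (a sketch with
approximate field names; see the attached patch for the real thing):

    /* Buffer of pre-fetched tuples for one reader (worker or leader). */
    typedef struct GMReaderTupleBuffer
    {
        HeapTuple  *tuple;        /* array of buffered tuples */
        int         nTuples;      /* number of tuples currently buffered */
        int         readCounter;  /* index of the next tuple to return */
        bool        done;         /* reader has no more tuples */
    } GMReaderTupleBuffer;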

PFA the latest patch, which addresses the review comments as well as a few
other cleanups.

Apart from this, my colleague Rafia Sabih reported one regression with GM:
if we set work_mem high enough to accommodate the sort operation, the GM
path gets selected even though plain Sort performs much better.

Example:

create table t (i int);
insert into t values(generate_series(1,10000000));
set work_mem =1024000;
explain analyze select * from t order by i;
set enable_gathermerge =off;
explain analyze select * from t order by i;

postgres=# explain analyze select * from t order by i;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=335916.26..648415.76 rows=2499996 width=4) (actual time=2234.145..7628.555 rows=10000000 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Sort  (cost=334916.22..341166.21 rows=2499996 width=4) (actual time=2226.609..2611.041 rows=2000000 loops=5)
         Sort Key: i
         Sort Method: quicksort  Memory: 147669kB
         ->  Parallel Seq Scan on t  (cost=0.00..69247.96 rows=2499996 width=4) (actual time=0.034..323.129 rows=2000000 loops=5)
 Planning time: 0.061 ms
 Execution time: 8143.809 ms
(9 rows)

postgres=# set enable_gathermerge = off;
SET
postgres=# explain analyze select * from t order by i;
                                                      QUERY PLAN
----------------------------------------------------------------------
 Sort  (cost=1306920.83..1331920.79 rows=9999985 width=4) (actual time=3521.143..4854.148 rows=10000000 loops=1)
   Sort Key: i
   Sort Method: quicksort  Memory: 854075kB
   ->  Seq Scan on t  (cost=0.00..144247.85 rows=9999985 width=4) (actual time=0.113..1340.758 rows=10000000 loops=1)
 Planning time: 0.100 ms
 Execution time: 5535.560 ms
(6 rows)

Looking at the plan, I realized that this is happening because of wrong
costing for Gather Merge. In the plan above we can see that the row count
estimated by Gather Merge is wrong. This is because the earlier patch had
GM assuming rows = subpath->rows, which is not correct since the subpath is
a partial path, so we need to multiply it by the number of workers (see the
rough sketch below). The attached patch also fixes this issue. I also ran
the TPC-H benchmark with the patch and the results are the same as earlier.
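
The gist of that row-estimate fix, as a sketch (illustrative names only;
the attached patch is authoritative):

    /*
     * Sketch only: subpath->rows is a per-worker estimate for a partial
     * path, so the Gather Merge node must scale it back up for its own
     * output row estimate.
     */
    path->path.rows = subpath->rows * path->num_workers;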


Thanks,
Rushabh Lathia
