Discussion: Weird index or sort behaviour

From:
Matthew Wakeling
Date:

I'm seeing some interesting behaviour. I'm executing a query where I
perform a merge join between two copies of the same table, completely
symmetrically, and the two sides of the merge are sourced differently.

SELECT COUNT(*)
FROM
     (SELECT DISTINCT
         l1.objectid,
         l1.id AS id1,
         l1.intermine_start AS start1,
         l1.intermine_end AS end1,
         l2.id AS id2,
         l2.intermine_start AS start2,
         l2.intermine_end AS end2
     FROM
         locationbin8000 l1,
         locationbin8000 l2
     WHERE
             l1.subjecttype = 'GeneFlankingRegion'
         AND l2.subjecttype = 'GeneFlankingRegion'
         AND l1.objectid = l2.objectid
         AND l1.bin = l2.bin
     ) AS a
WHERE
         start1 <= end2
     AND start2 <= end1;

                       QUERY PLAN
---------------------------------------------------------
  Aggregate
    (cost=703459.72..703459.73 rows=1 width=0)
    (actual time=43673.526..43673.527 rows=1 loops=1)
    ->  HashAggregate
          (cost=657324.23..677828.89 rows=2050466 width=28)
          (actual time=33741.380..42187.885 rows=17564726 loops=1)
          ->  Merge Join
                (cost=130771.22..621441.07 rows=2050466 width=28)
                (actual time=456.970..15292.997 rows=21463106 loops=1)
                Merge Cond: ((l1.objectid = l2.objectid) AND (l1.bin = l2.bin))
                Join Filter: ((l1.intermine_start <= l2.intermine_end) AND (l2.intermine_start <= l1.intermine_end))
                ->  Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
                      (cost=0.00..72096.78 rows=670733 width=20)
                      (actual time=0.085..345.834 rows=664588 loops=1)
                      Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
                ->  Sort
                      (cost=130771.22..132448.05 rows=670733 width=20)
                      (actual time=456.864..3182.638 rows=38231659 loops=1)
                      Sort Key: l2.objectid, l2.bin
                      Sort Method:  quicksort  Memory: 81690kB
                      ->  Bitmap Heap Scan on locationbin8000 l2
                            (cost=12706.60..65859.76 rows=670733 width=20)
                            (actual time=107.259..271.026 rows=664588 loops=1)
                            Recheck Cond: (subjecttype = 'GeneFlankingRegion'::text)
                            ->  Bitmap Index Scan on locationbin8000__subjecttypeid
                                  (cost=0.00..12538.92 rows=670733 width=0)
                                  (actual time=106.327..106.327 rows=664588 loops=1)
                                  Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
  Total runtime: 44699.675 ms
(15 rows)

Here is the definition of the locationbin8000 table:

     Table "public.locationbin8000"
      Column      |  Type   | Modifiers
-----------------+---------+-----------
  id              | integer |
  objectid        | integer |
  intermine_start | integer |
  intermine_end   | integer |
  subjecttype     | text    |
  bin             | integer |
Indexes:
     "locationbin8000__subjectobjectbin" btree (subjecttype, objectid, bin)
     "locationbin8000__subjecttypeid" btree (subjecttype, id)

The table is clustered on the locationbin8000__subjectobjectbin index, and
has been analysed.

As you can see, the merge join requires two inputs, both ordered by
(objectid, bin), which is readily supplied by the
locationbin8000__subjectobjectbin index, given that I am restricting the
subjecttype on both sides (to the same value, I might add). I would
therefore expect the merge join to feed off two identical index scans.
This is what happens on one side of the merge join, but not the other,
even though the sides are symmetrical.

Does anyone know why it isn't doing two index scans? Given that the cost
of the index scan is half that of the alternative, I'm surprised that it
uses this plan.

I'm using Postgres 8.4.0.

Matthew

--
 "Interwoven alignment preambles are not allowed."
 If you have been so devious as to get this message, you will understand
 it, and you deserve no sympathy.  -- Knuth, in the TeXbook

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> I'm seeing some interesting behaviour. I'm executing a query where I
> perform a merge join between two copies of the same table, completely
> symmetrically, and the two sides of the merge are sourced differently.

This is not as surprising as you think.  A mergejoin is *not*
symmetrical between its two inputs: the inner side is subject to being
partially rewound and rescanned when the outer side is advanced to a new
row with the same merge key.  This means there is a premium on cheap
rescan for the inner side that doesn't exist for the outer ... and a
sort node is cheaper to rescan than a generic indexscan.  It's
impossible to tell from the data you provided whether the planner was
correct to pick a sort over an indexscan for the inner side, but the
fact that it did so is not prima facie evidence of a bug.  You could
force choice of the other plan via enable_sort = off and then compare
estimated and actual runtimes to see if the planner got it right.
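
The rewind-and-rescan asymmetry can be sketched in a few lines of Python (an illustration of the behaviour described above, not PostgreSQL's actual executor):

```python
# Illustrative merge join over two inputs sorted on the join key.
# When the outer side advances to another row with the SAME key, the
# inner side is rewound to its mark and the matching inner rows are
# fetched again -- hence the premium on cheap inner rescans.

def merge_join(outer, inner):
    """Join two lists of (key, value) pairs, each sorted by key."""
    results = []
    inner_fetches = 0   # total inner fetches, including re-fetches
    mark = 0            # position the inner side rewinds to
    for okey, oval in outer:
        # Advance the mark past inner keys below the current outer key.
        while mark < len(inner) and inner[mark][0] < okey:
            mark += 1
        # Scan (or, for a duplicate outer key, rescan) the match group.
        pos = mark
        while pos < len(inner) and inner[pos][0] == okey:
            inner_fetches += 1
            results.append((oval, inner[pos][1]))
            pos += 1
    return results, inner_fetches
```

With outer = [(1, 'a'), (1, 'b'), (2, 'c')] and inner = [(1, 'x'), (1, 'y'), (2, 'z')], the duplicate outer key forces a rescan of the inner match group: five inner fetches for only three inner tuples.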

            regards, tom lane

From:
Matthew Wakeling
Date:

On Tue, 18 Aug 2009, Tom Lane wrote:
> Matthew Wakeling <> writes:
>> I'm seeing some interesting behaviour. I'm executing a query where I
>> perform a merge join between two copies of the same table, completely
>> symmetrically, and the two sides of the merge are sourced differently.
>
> This is not as surprising as you think.  A mergejoin is *not*
> symmetrical between its two inputs: the inner side is subject to being
> partially rewound and rescanned when the outer side is advanced to a new
> row with the same merge key.  This means there is a premium on cheap
> rescan for the inner side that doesn't exist for the outer ... and a
> sort node is cheaper to rescan than a generic indexscan.

Very clever. Yes, that is what is happening. I'm surprised that the system
doesn't buffer the inner side to avoid having to rescan each time, but
then I guess you would have problems if the buffer grew larger than
memory.

> It's impossible to tell from the data you provided whether the planner
> was correct to pick a sort over an indexscan for the inner side, but the
> fact that it did so is not prima facie evidence of a bug.  You could
> force choice of the other plan via enable_sort = off and then compare
> estimated and actual runtimes to see if the planner got it right.

Yes, it gets almost unmeasurably slower if I force sorts off and nested
loops (its next choice) off.

Matthew

--
 $ rm core
 Segmentation Fault (core dumped)

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> Very clever. Yes, that is what is happening. I'm surprised that the system
> doesn't buffer the inner side to avoid having to rescan each time, but
> then I guess you would have problems if the buffer grew larger than
> memory.

Well, it does consider adding a Materialize node for that purpose,
but in this case it evidently thought a sort was cheaper.

            regards, tom lane

From:
Tom Lane
Date:

I wrote:
> Matthew Wakeling <> writes:
>> Very clever. Yes, that is what is happening. I'm surprised that the system
>> doesn't buffer the inner side to avoid having to rescan each time, but
>> then I guess you would have problems if the buffer grew larger than
>> memory.

> Well, it does consider adding a Materialize node for that purpose,
> but in this case it evidently thought a sort was cheaper.

Hmmm ... actually, after looking at the code, I notice that we only
consider adding a Materialize node to buffer an inner input that is a
Sort node.  The idea was suggested by Greg Stark, if memory serves.
I wonder now if it'd be worthwhile to generalize that to consider
adding a Materialize above *any* inner mergejoin input.

            regards, tom lane

From:
Greg Stark
Date:

On Tue, Aug 18, 2009 at 5:57 PM, Tom Lane <> wrote:
> Hmmm ... actually, after looking at the code, I notice that we only
> consider adding a Materialize node to buffer an inner input that is a
> Sort node.  The idea was suggested by Greg Stark, if memory serves.
> I wonder now if it'd be worthwhile to generalize that to consider
> adding a Materialize above *any* inner mergejoin input.

If my recollection is right the reason we put the materialize above
the sort node has to do with Simon's deferred final merge pass
optimization. The materialize was a way to lazily build the final
merge as we do the merge but still have the ability to rewind.

I would be more curious, in the poster's situation, to turn off
enable_seqscan, enable_sort, and/or enable_nestloop and see how the index
scan merge join plan runs. Rewinding an index scan is more expensive
than rewinding a materialize node, but would it really be so much more
expensive that it's worth copying the entire table into temporary
space?

--
greg
http://mit.edu/~gsstark/resume.pdf

From:
Tom Lane
Date:

Greg Stark <> writes:
> If my recollection is right the reason we put the materialize above
> the sort node has to do with Simon's deferred final merge pass
> optimization. The materialize was a way to lazily build the final
> merge as we do the merge but still have the ability to rewind.

> I would be more curious, in the poster's situation, to turn off
> enable_seqscan, enable_sort, and/or enable_nestloop and see how the index
> scan merge join plan runs. Rewinding an index scan is more expensive
> than rewinding a materialize node, but would it really be so much more
> expensive that it's worth copying the entire table into temporary
> space?

Absolutely not, but remember that what we're expecting the Materialize
to do is buffer only as far back as the last Mark, so that it's unlikely
ever to spill to disk.  It might well be a win to do that rather than
re-fetching from the indexscan.  The incremental win compared to not
having the materialize would be small compared to what it is for a sort,
but it could still be worthwhile I think.  In particular, in Matthew's
example the sort is being estimated at significantly higher cost than
the indexscan, which presumably means that we are estimating there will
be a *lot* of re-fetches, else we wouldn't have rejected the indexscan
on the inside.  Inserting a materialize would make the re-fetches
cheaper.  I'm fairly sure that this plan structure would cost out
cheaper than the sort according to cost_mergejoin's cost model.  As
noted in the comments therein, that cost model is a bit oversimplified,
so it might not be cheaper in reality ... but we ought to try it.
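
The trade-off can be put in rough cost-model terms. This is only a sketch: cpu_tuple_cost is PostgreSQL's default value, the 70000 inner run cost is a made-up stand-in for the indexscan's cost, and the row counts come from the plans in this thread:

```python
# Without a Material node, each re-fetch is charged like an original
# fetch, so the inner run cost is inflated by the rescan ratio.  With
# a Material node, the underlying scan runs once and every fetch or
# re-fetch through the buffer costs only cpu_tuple_cost.

CPU_TUPLE_COST = 0.01   # PostgreSQL's default cpu_tuple_cost

def rescan_ratio(inner_rows, rescanned_tuples):
    return 1.0 + rescanned_tuples / inner_rows

def bare_inner_cost(inner_run_cost, ratio):
    return inner_run_cost * ratio

def mat_inner_cost(inner_run_cost, inner_rows, ratio):
    return inner_run_cost + CPU_TUPLE_COST * inner_rows * ratio

# 664588 inner tuples, 38231659 total fetches (from the plans above).
ratio = rescan_ratio(664588, 38231659 - 664588)
bare = bare_inner_cost(70000.0, ratio)         # ~4.0 million
mat = mat_inner_cost(70000.0, 664588, ratio)   # ~0.45 million
```

Here mat comes out far below bare, consistent with the expectation that the materialized plan would cost out cheaper.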

            regards, tom lane

From:
Matthew Wakeling
Date:

On Tue, 18 Aug 2009, Tom Lane wrote:
>> I would be more curious, in the poster's situation, to turn off
>> enable_seqscan, enable_sort, and/or enable_nestloop and see how the
>> index scan merge join plan runs.

Like this:

                             QUERY PLAN
-----------------------------------------------------------------------
  Aggregate
    (cost=2441719.92..2441719.93 rows=1 width=0)
    (actual time=50087.537..50087.538 rows=1 loops=1)
    ->  HashAggregate
         (cost=2397366.95..2417079.38 rows=1971243 width=28)
         (actual time=40462.069..48634.713 rows=17564726 loops=1)
          ->  Merge Join
                (cost=0.00..2362870.20 rows=1971243 width=28)
                (actual time=0.095..22041.693 rows=21463106 loops=1)
                Merge Cond: ((l1.objectid = l2.objectid) AND (l1.bin = l2.bin))
                Join Filter: ((l1.intermine_start <= l2.intermine_end) AND (l2.intermine_start <= l1.intermine_end))
                ->  Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
                      (cost=0.00..71635.23 rows=657430 width=20)
                      (actual time=0.056..170.857 rows=664588 loops=1)
                      Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
                ->  Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l2
                      (cost=0.00..71635.23 rows=657430 width=20)
                      (actual time=0.020..9594.466 rows=38231659 loops=1)
                      Index Cond: (l2.subjecttype = 'GeneFlankingRegion'::text)
  Total runtime: 50864.569 ms
(10 rows)

>> Rewinding an index scan is more expensive than rewinding a materialize
>> node, but would it really be so much more expensive that it's worth
>> copying the entire table into temporary space?
>
> Absolutely not, but remember that what we're expecting the Materialize
> to do is buffer only as far back as the last Mark, so that it's unlikely
> ever to spill to disk.

If that's how it works, then that sounds very promising indeed.

> In particular, in Matthew's example the sort is being estimated at
> significantly higher cost than the indexscan, which presumably means
> that we are estimating there will be a *lot* of re-fetches, else we
> wouldn't have rejected the indexscan on the inside.

SELECT SUM(c * c) / SUM(c)
FROM
     (SELECT objectid, bin, COUNT(*) AS c
      FROM locationbin8000
      WHERE subjecttype = 'GeneFlankingRegion'
      GROUP BY objectid, bin) AS a;

       ?column?
---------------------
  57.5270393085641029

So on average, we will be rewinding by about 57 rows each time. A
materialise step really does sound like a win in this situation.
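
To see why sum(c * c) / sum(c) is the right statistic: a group of c rows sharing an (objectid, bin) key yields c outer tuples that each scan all c matching inner tuples, i.e. c * c fetches spread over c tuples. A toy check in Python, with made-up group sizes:

```python
from collections import Counter

# Made-up (objectid, bin) keys with group sizes c = 3, 2, 1.
keys = ['a', 'a', 'a', 'b', 'b', 'c']
sizes = Counter(keys).values()

# Average inner fetches per outer tuple, weighted by group size.
avg_fetches = sum(c * c for c in sizes) / sum(sizes)
print(avg_fetches)   # (9 + 4 + 1) / 6 = 2.333...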

Matthew

--
 Patron: "I am looking for a globe of the earth."
 Librarian: "We have a table-top model over here."
 Patron: "No, that's not good enough. Don't you have a life-size?"
 Librarian: (pause) "Yes, but it's in use right now."

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
>                 ->  Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
>                       (cost=0.00..71635.23 rows=657430 width=20)
>                       (actual time=0.056..170.857 rows=664588 loops=1)
>                       Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
>                 ->  Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l2
>                       (cost=0.00..71635.23 rows=657430 width=20)
>                       (actual time=0.020..9594.466 rows=38231659 loops=1)
>                       Index Cond: (l2.subjecttype = 'GeneFlankingRegion'::text)

>  ... So on average, we will be rewinding by 57 rows each time.

As indeed is reflected in those actual rowcounts.  (The estimated
counts and costs don't include re-fetching, but the actuals do.)

Even more interesting, the actual runtime is about 56x different too,
which implies that Matthew's re-fetches are not noticeably cheaper than
the original fetches.  I'd be surprised if that were true in an
indexscan pulling from disk (you'd expect recently-touched rows to stay
cached for awhile).  But it could easily be true if the whole table were
cached already.  Matthew, how big is this table compared to your RAM?
Were you testing a case in which it'd be in cache?
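
The ~56x figure can be checked against the EXPLAIN ANALYZE numbers quoted above (a quick arithmetic sketch):

```python
# Inner actual rows vs. the 664588 tuples actually in the table, and
# inner vs. outer actual time, from the index-scan plan in this thread.
rows_ratio = 38231659 / 664588    # re-fetch inflation of the row count
time_ratio = 9594.466 / 170.857   # matching inflation of the runtime
print(round(rows_ratio, 1), round(time_ratio, 1))   # 57.5 56.2
```

The two ratios are nearly equal, which is what implies the re-fetches cost about as much as original fetches.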

            regards, tom lane

From:
Matthew Wakeling
Date:

On Tue, 18 Aug 2009, Tom Lane wrote:
>>                 ->  Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
>>                       (cost=0.00..71635.23 rows=657430 width=20)
>>                       (actual time=0.056..170.857 rows=664588 loops=1)
>>                       Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
>>                 ->  Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l2
>>                       (cost=0.00..71635.23 rows=657430 width=20)
>>                       (actual time=0.020..9594.466 rows=38231659 loops=1)
>>                       Index Cond: (l2.subjecttype = 'GeneFlankingRegion'::text)
>
>>  ... So on average, we will be rewinding by 57 rows each time.
>
> As indeed is reflected in those actual rowcounts.  (The estimated
> counts and costs don't include re-fetching, but the actuals do.)
>
> Even more interesting, the actual runtime is about 56x different too,
> which implies that Matthew's re-fetches are not noticeably cheaper than
> the original fetches.  I'd be surprised if that were true in an
> indexscan pulling from disk (you'd expect recently-touched rows to stay
> cached for awhile).  But it could easily be true if the whole table were
> cached already.  Matthew, how big is this table compared to your RAM?
> Were you testing a case in which it'd be in cache?

Oh, definitely. I have run this test so many times, it's all going to be
in the cache. Luckily, that's what we are looking at as a normal situation
in production. Also, since the table is clustered on that index, I would
expect the performance when it is out of cache to be fairly snappy anyway.

For reference, the table is 350 MB, the index is 238 MB, and the RAM in
the machine is 4GB (although it's my desktop so it'll have all sorts of
other rubbish using that up). Our servers have 16GB to 32GB of RAM, so no
problem there.

Matthew

--
 I'm always interested when [cold callers] try to flog conservatories.
 Anyone who can actually attach a conservatory to a fourth floor flat
 stands a marginally better than average chance of winning my custom.
 (Seen on Usenet)

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> [ discussion about applying materialize to a mergejoin's inner indexscan ]

I have finally gotten round to doing something about this, and applied
the attached patch to CVS HEAD.  Could you test it on your problem case
to see what happens?  If it's not convenient to load your data into
8.5devel, I believe the patch would work all right in 8.4.x.  (A quick
check shows that it applies except for one deletion hunk that has a
conflict due to a comment change; you could easily do that deletion
manually.)

            regards, tom lane

Index: src/backend/nodes/outfuncs.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/nodes/outfuncs.c,v
retrieving revision 1.371
diff -c -r1.371 outfuncs.c
*** src/backend/nodes/outfuncs.c    28 Oct 2009 14:55:38 -0000    1.371
--- src/backend/nodes/outfuncs.c    15 Nov 2009 02:45:20 -0000
***************
*** 1501,1506 ****
--- 1501,1507 ----
      WRITE_NODE_FIELD(path_mergeclauses);
      WRITE_NODE_FIELD(outersortkeys);
      WRITE_NODE_FIELD(innersortkeys);
+     WRITE_BOOL_FIELD(materialize_inner);
  }

  static void
Index: src/backend/optimizer/path/allpaths.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/allpaths.c,v
retrieving revision 1.188
diff -c -r1.188 allpaths.c
*** src/backend/optimizer/path/allpaths.c    26 Oct 2009 02:26:33 -0000    1.188
--- src/backend/optimizer/path/allpaths.c    15 Nov 2009 02:45:20 -0000
***************
*** 1443,1455 ****
          {
              MergePath  *mp = (MergePath *) path;

!             if (mp->outersortkeys || mp->innersortkeys)
!             {
!                 for (i = 0; i < indent; i++)
!                     printf("\t");
!                 printf("  sortouter=%d sortinner=%d\n",
!                        ((mp->outersortkeys) ? 1 : 0),
!                        ((mp->innersortkeys) ? 1 : 0));
              }
          }

--- 1443,1454 ----
          {
              MergePath  *mp = (MergePath *) path;

!             for (i = 0; i < indent; i++)
!                 printf("\t");
!             printf("  sortouter=%d sortinner=%d materializeinner=%d\n",
!                    ((mp->outersortkeys) ? 1 : 0),
!                    ((mp->innersortkeys) ? 1 : 0),
!                    ((mp->materialize_inner) ? 1 : 0));
              }
          }

Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.211
diff -c -r1.211 costsize.c
*** src/backend/optimizer/path/costsize.c    12 Sep 2009 22:12:03 -0000    1.211
--- src/backend/optimizer/path/costsize.c    15 Nov 2009 02:45:20 -0000
***************
*** 1167,1189 ****
  }

  /*
-  * sort_exceeds_work_mem
-  *      Given a finished Sort plan node, detect whether it is expected to
-  *      spill to disk (ie, will need more than work_mem workspace)
-  *
-  * This assumes there will be no available LIMIT.
-  */
- bool
- sort_exceeds_work_mem(Sort *sort)
- {
-     double        input_bytes = relation_byte_size(sort->plan.plan_rows,
-                                                  sort->plan.plan_width);
-     long        work_mem_bytes = work_mem * 1024L;
-
-     return (input_bytes > work_mem_bytes);
- }
-
- /*
   * cost_material
   *      Determines and returns the cost of materializing a relation, including
   *      the cost of reading the input data.
--- 1167,1172 ----
***************
*** 1543,1549 ****
   *      Determines and returns the cost of joining two relations using the
   *      merge join algorithm.
   *
!  * 'path' is already filled in except for the cost fields
   * 'sjinfo' is extra info about the join for selectivity estimation
   *
   * Notes: path's mergeclauses should be a subset of the joinrestrictinfo list;
--- 1526,1543 ----
   *      Determines and returns the cost of joining two relations using the
   *      merge join algorithm.
   *
!  * Unlike other costsize functions, this routine makes one actual decision:
!  * whether we should materialize the inner path.  We do that either because
!  * the inner path can't support mark/restore, or because it's cheaper to
!  * use an interposed Material node to handle mark/restore.  When the decision
!  * is cost-based it would be logically cleaner to build and cost two separate
!  * paths with and without that flag set; but that would require repeating most
!  * of the calculations here, which are not all that cheap.  Since the choice
!  * will not affect output pathkeys or startup cost, only total cost, there is
!  * no possibility of wanting to keep both paths.  So it seems best to make
!  * the decision here and record it in the path's materialize_inner field.
!  *
!  * 'path' is already filled in except for the cost fields and materialize_inner
   * 'sjinfo' is extra info about the join for selectivity estimation
   *
   * Notes: path's mergeclauses should be a subset of the joinrestrictinfo list;
***************
*** 1561,1567 ****
      List       *innersortkeys = path->innersortkeys;
      Cost        startup_cost = 0;
      Cost        run_cost = 0;
!     Cost        cpu_per_tuple;
      QualCost    merge_qual_cost;
      QualCost    qp_qual_cost;
      double        outer_path_rows = PATH_ROWS(outer_path);
--- 1555,1564 ----
      List       *innersortkeys = path->innersortkeys;
      Cost        startup_cost = 0;
      Cost        run_cost = 0;
!     Cost        cpu_per_tuple,
!                 inner_run_cost,
!                 bare_inner_cost,
!                 mat_inner_cost;
      QualCost    merge_qual_cost;
      QualCost    qp_qual_cost;
      double        outer_path_rows = PATH_ROWS(outer_path);
***************
*** 1606,1615 ****
      /*
       * When there are equal merge keys in the outer relation, the mergejoin
       * must rescan any matching tuples in the inner relation. This means
!      * re-fetching inner tuples.  Our cost model for this is that a re-fetch
!      * costs the same as an original fetch, which is probably an overestimate;
!      * but on the other hand we ignore the bookkeeping costs of mark/restore.
!      * Not clear if it's worth developing a more refined model.
       *
       * For regular inner and outer joins, the number of re-fetches can be
       * estimated approximately as size of merge join output minus size of
--- 1603,1609 ----
      /*
       * When there are equal merge keys in the outer relation, the mergejoin
       * must rescan any matching tuples in the inner relation. This means
!      * re-fetching inner tuples; we have to estimate how often that happens.
       *
       * For regular inner and outer joins, the number of re-fetches can be
       * estimated approximately as size of merge join output minus size of
***************
*** 1641,1647 ****
          if (rescannedtuples < 0)
              rescannedtuples = 0;
      }
!     /* We'll inflate inner run cost this much to account for rescanning */
      rescanratio = 1.0 + (rescannedtuples / inner_path_rows);

      /*
--- 1635,1641 ----
          if (rescannedtuples < 0)
              rescannedtuples = 0;
      }
!     /* We'll inflate various costs this much to account for rescanning */
      rescanratio = 1.0 + (rescannedtuples / inner_path_rows);

      /*
***************
*** 1778,1809 ****
                    -1.0);
          startup_cost += sort_path.startup_cost;
          startup_cost += (sort_path.total_cost - sort_path.startup_cost)
!             * innerstartsel * rescanratio;
!         run_cost += (sort_path.total_cost - sort_path.startup_cost)
!             * (innerendsel - innerstartsel) * rescanratio;
!
!         /*
!          * If the inner sort is expected to spill to disk, we want to add a
!          * materialize node to shield it from the need to handle mark/restore.
!          * This will allow it to perform the last merge pass on-the-fly, while
!          * in most cases not requiring the materialize to spill to disk.
!          * Charge an extra cpu_tuple_cost per tuple to account for the
!          * materialize node.  (Keep this estimate in sync with similar ones in
!          * create_mergejoin_path and create_mergejoin_plan.)
!          */
!         if (relation_byte_size(inner_path_rows, inner_path->parent->width) >
!             (work_mem * 1024L))
!             run_cost += cpu_tuple_cost * inner_path_rows;
      }
      else
      {
          startup_cost += inner_path->startup_cost;
          startup_cost += (inner_path->total_cost - inner_path->startup_cost)
!             * innerstartsel * rescanratio;
!         run_cost += (inner_path->total_cost - inner_path->startup_cost)
!             * (innerendsel - innerstartsel) * rescanratio;
      }

      /* CPU costs */

      /*
--- 1772,1854 ----
                    -1.0);
          startup_cost += sort_path.startup_cost;
          startup_cost += (sort_path.total_cost - sort_path.startup_cost)
!             * innerstartsel;
!         inner_run_cost = (sort_path.total_cost - sort_path.startup_cost)
!             * (innerendsel - innerstartsel);
      }
      else
      {
          startup_cost += inner_path->startup_cost;
          startup_cost += (inner_path->total_cost - inner_path->startup_cost)
!             * innerstartsel;
!         inner_run_cost = (inner_path->total_cost - inner_path->startup_cost)
!             * (innerendsel - innerstartsel);
      }

+     /*
+      * Decide whether we want to materialize the inner input to shield it from
+      * mark/restore and performing re-fetches.  Our cost model for regular
+      * re-fetches is that a re-fetch costs the same as an original fetch,
+      * which is probably an overestimate; but on the other hand we ignore the
+      * bookkeeping costs of mark/restore.  Not clear if it's worth developing
+      * a more refined model.  So we just need to inflate the inner run cost
+      * by rescanratio.
+      */
+     bare_inner_cost = inner_run_cost * rescanratio;
+     /*
+      * When we interpose a Material node the re-fetch cost is assumed to be
+      * just cpu_tuple_cost per tuple, independently of the underlying plan's
+      * cost; but we have to charge an extra cpu_tuple_cost per original fetch
+      * as well.  Note that we're assuming the materialize node will never
+      * spill to disk, since it only has to remember tuples back to the last
+      * mark.  (If there are a huge number of duplicates, our other cost
+      * factors will make the path so expensive that it probably won't get
+      * chosen anyway.)  So we don't use cost_rescan here.
+      *
+      * Note: keep this estimate in sync with create_mergejoin_plan's labeling
+      * of the generated Material node.
+      */
+     mat_inner_cost = inner_run_cost +
+         cpu_tuple_cost * inner_path_rows * rescanratio;
+
+     /* Prefer materializing if it looks cheaper */
+     if (mat_inner_cost < bare_inner_cost)
+         path->materialize_inner = true;
+     /*
+      * Even if materializing doesn't look cheaper, we *must* do it if the
+      * inner path is to be used directly (without sorting) and it doesn't
+      * support mark/restore.
+      *
+      * Since the inner side must be ordered, and only Sorts and IndexScans can
+      * create order to begin with, and they both support mark/restore, you
+      * might think there's no problem --- but you'd be wrong.  Nestloop and
+      * merge joins can *preserve* the order of their inputs, so they can be
+      * selected as the input of a mergejoin, and they don't support
+      * mark/restore at present.
+      */
+     else if (innersortkeys == NIL &&
+              !ExecSupportsMarkRestore(inner_path->pathtype))
+         path->materialize_inner = true;
+     /*
+      * Also, force materializing if the inner path is to be sorted and the
+      * sort is expected to spill to disk.  This is because the final merge
+      * pass can be done on-the-fly if it doesn't have to support mark/restore.
+      * We don't try to adjust the cost estimates for this consideration,
+      * though.
+      */
+     else if (innersortkeys != NIL &&
+              relation_byte_size(inner_path_rows, inner_path->parent->width) >
+              (work_mem * 1024L))
+         path->materialize_inner = true;
+     else
+         path->materialize_inner = false;
+
+     /* Charge the right incremental cost for the chosen case */
+     if (path->materialize_inner)
+         run_cost += mat_inner_cost;
+     else
+         run_cost += bare_inner_cost;
+
      /* CPU costs */

      /*
Index: src/backend/optimizer/plan/createplan.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/plan/createplan.c,v
retrieving revision 1.266
diff -c -r1.266 createplan.c
*** src/backend/optimizer/plan/createplan.c    26 Oct 2009 02:26:33 -0000    1.266
--- src/backend/optimizer/plan/createplan.c    15 Nov 2009 02:45:21 -0000
***************
*** 1664,1672 ****
                               best_path->jpath.outerjoinpath->parent->relids);

      /*
!      * Create explicit sort nodes for the outer and inner join paths if
!      * necessary.  The sort cost was already accounted for in the path. Make
!      * sure there are no excess columns in the inputs if sorting.
       */
      if (best_path->outersortkeys)
      {
--- 1664,1671 ----
                               best_path->jpath.outerjoinpath->parent->relids);

      /*
!      * Create explicit sort nodes for the outer and inner paths if necessary.
!      * Make sure there are no excess columns in the inputs if sorting.
       */
      if (best_path->outersortkeys)
      {
***************
*** 1695,1717 ****
          innerpathkeys = best_path->jpath.innerjoinpath->pathkeys;

      /*
!      * If inner plan is a sort that is expected to spill to disk, add a
!      * materialize node to shield it from the need to handle mark/restore.
!      * This will allow it to perform the last merge pass on-the-fly, while in
!      * most cases not requiring the materialize to spill to disk.
!      *
!      * XXX really, Sort oughta do this for itself, probably, to avoid the
!      * overhead of a separate plan node.
       */
!     if (IsA(inner_plan, Sort) &&
!         sort_exceeds_work_mem((Sort *) inner_plan))
      {
          Plan       *matplan = (Plan *) make_material(inner_plan);

          /*
           * We assume the materialize will not spill to disk, and therefore
           * charge just cpu_tuple_cost per tuple.  (Keep this estimate in sync
!          * with similar ones in cost_mergejoin and create_mergejoin_path.)
           */
          copy_plan_costsize(matplan, inner_plan);
          matplan->total_cost += cpu_tuple_cost * matplan->plan_rows;
--- 1694,1710 ----
          innerpathkeys = best_path->jpath.innerjoinpath->pathkeys;

      /*
!      * If specified, add a materialize node to shield the inner plan from
!      * the need to handle mark/restore.
       */
!     if (best_path->materialize_inner)
      {
          Plan       *matplan = (Plan *) make_material(inner_plan);

          /*
           * We assume the materialize will not spill to disk, and therefore
           * charge just cpu_tuple_cost per tuple.  (Keep this estimate in sync
!          * with cost_mergejoin.)
           */
          copy_plan_costsize(matplan, inner_plan);
          matplan->total_cost += cpu_tuple_cost * matplan->plan_rows;
***************
*** 1887,1892 ****
--- 1880,1886 ----
                                 inner_plan,
                                 best_path->jpath.jointype);

+     /* Costs of sort and material steps are included in path cost already */
      copy_path_costsize(&join_plan->join.plan, &best_path->jpath.path);

      return join_plan;
Index: src/backend/optimizer/util/pathnode.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/util/pathnode.c,v
retrieving revision 1.154
diff -c -r1.154 pathnode.c
*** src/backend/optimizer/util/pathnode.c    17 Sep 2009 20:49:29 -0000    1.154
--- src/backend/optimizer/util/pathnode.c    15 Nov 2009 02:45:21 -0000
***************
*** 17,23 ****
  #include <math.h>

  #include "catalog/pg_operator.h"
- #include "executor/executor.h"
  #include "miscadmin.h"
  #include "optimizer/clauses.h"
  #include "optimizer/cost.h"
--- 17,22 ----
***************
*** 1414,1460 ****
          pathkeys_contained_in(innersortkeys, inner_path->pathkeys))
          innersortkeys = NIL;

-     /*
-      * If we are not sorting the inner path, we may need a materialize node to
-      * ensure it can be marked/restored.
-      *
-      * Since the inner side must be ordered, and only Sorts and IndexScans can
-      * create order to begin with, and they both support mark/restore, you
-      * might think there's no problem --- but you'd be wrong.  Nestloop and
-      * merge joins can *preserve* the order of their inputs, so they can be
-      * selected as the input of a mergejoin, and they don't support
-      * mark/restore at present.
-      *
-      * Note: Sort supports mark/restore, so no materialize is really needed in
-      * that case; but one may be desirable anyway to optimize the sort.
-      * However, since we aren't representing the sort step separately in the
-      * Path tree, we can't explicitly represent the materialize either. So
-      * that case is not handled here.  Instead, cost_mergejoin has to factor
-      * in the cost and create_mergejoin_plan has to add the plan node.
-      */
-     if (innersortkeys == NIL &&
-         !ExecSupportsMarkRestore(inner_path->pathtype))
-     {
-         Path       *mpath;
-
-         mpath = (Path *) create_material_path(inner_path->parent, inner_path);
-
-         /*
-          * We expect the materialize won't spill to disk (it could only do so
-          * if there were a whole lot of duplicate tuples, which is a case
-          * cost_mergejoin will avoid choosing anyway).    Therefore
-          * cost_material's cost estimate is bogus and we should charge just
-          * cpu_tuple_cost per tuple.  (Keep this estimate in sync with similar
-          * ones in cost_mergejoin and create_mergejoin_plan; also see
-          * cost_rescan.)
-          */
-         mpath->startup_cost = inner_path->startup_cost;
-         mpath->total_cost = inner_path->total_cost;
-         mpath->total_cost += cpu_tuple_cost * inner_path->parent->rows;
-
-         inner_path = mpath;
-     }
-
      pathnode->jpath.path.pathtype = T_MergeJoin;
      pathnode->jpath.path.parent = joinrel;
      pathnode->jpath.jointype = jointype;
--- 1413,1418 ----
***************
*** 1465,1470 ****
--- 1423,1429 ----
      pathnode->path_mergeclauses = mergeclauses;
      pathnode->outersortkeys = outersortkeys;
      pathnode->innersortkeys = innersortkeys;
+     /* pathnode->materialize_inner will be set by cost_mergejoin */

      cost_mergejoin(pathnode, root, sjinfo);

Index: src/include/nodes/relation.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/nodes/relation.h,v
retrieving revision 1.178
diff -c -r1.178 relation.h
*** src/include/nodes/relation.h    26 Oct 2009 02:26:43 -0000    1.178
--- src/include/nodes/relation.h    15 Nov 2009 02:45:21 -0000
***************
*** 835,840 ****
--- 835,848 ----
  /*
   * A mergejoin path has these fields.
   *
+  * Unlike other path types, a MergePath node doesn't represent just a single
+  * run-time plan node: it can represent up to four.  Aside from the MergeJoin
+  * node itself, there can be a Sort node for the outer input, a Sort node
+  * for the inner input, and/or a Material node for the inner input.  We could
+  * represent these nodes by separate path nodes, but considering how many
+  * different merge paths are investigated during a complex join problem,
+  * it seems better to avoid unnecessary palloc overhead.
+  *
   * path_mergeclauses lists the clauses (in the form of RestrictInfos)
   * that will be used in the merge.
   *
***************
*** 846,860 ****
   * outersortkeys (resp. innersortkeys) is NIL if the outer path
   * (resp. inner path) is already ordered appropriately for the
   * mergejoin.  If it is not NIL then it is a PathKeys list describing
!  * the ordering that must be created by an explicit sort step.
   */

  typedef struct MergePath
  {
      JoinPath    jpath;
      List       *path_mergeclauses;        /* join clauses to be used for merge */
!     List       *outersortkeys;    /* keys for explicit sort, if any */
!     List       *innersortkeys;    /* keys for explicit sort, if any */
  } MergePath;

  /*
--- 854,872 ----
   * outersortkeys (resp. innersortkeys) is NIL if the outer path
   * (resp. inner path) is already ordered appropriately for the
   * mergejoin.  If it is not NIL then it is a PathKeys list describing
!  * the ordering that must be created by an explicit Sort node.
!  *
!  * materialize_inner is TRUE if a Material node should be placed atop the
!  * inner input.  This may appear with or without an inner Sort step.
   */

  typedef struct MergePath
  {
      JoinPath    jpath;
      List       *path_mergeclauses;        /* join clauses to be used for merge */
!     List       *outersortkeys;            /* keys for explicit sort, if any */
!     List       *innersortkeys;            /* keys for explicit sort, if any */
!     bool        materialize_inner;        /* add Materialize to inner? */
  } MergePath;

  /*
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.98
diff -c -r1.98 cost.h
*** src/include/optimizer/cost.h    12 Sep 2009 22:12:04 -0000    1.98
--- src/include/optimizer/cost.h    15 Nov 2009 02:45:21 -0000
***************
*** 84,90 ****
  extern void cost_sort(Path *path, PlannerInfo *root,
            List *pathkeys, Cost input_cost, double tuples, int width,
            double limit_tuples);
- extern bool sort_exceeds_work_mem(Sort *sort);
  extern void cost_material(Path *path,
                Cost input_startup_cost, Cost input_total_cost,
                double tuples, int width);
--- 84,89 ----
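The mark/restore requirement the patch revolves around can be illustrated in miniature: when consecutive outer rows share a join key, a merge join must rewind the inner input to a previously marked position and rescan the matching run. The following toy sketch (plain Python, purely illustrative, not PostgreSQL code) shows the mark/restore step that a real inner plan node, or a Material node shielding it, has to support:

```python
def merge_join(outer, inner):
    """Toy merge join over two key-sorted lists of (key, payload) pairs.

    When consecutive outer rows share a key, the inner position is
    restored to the marked start of the matching inner run -- the
    mark/restore capability a real executor's inner node must provide.
    """
    results = []
    i = 0
    for okey, oval in outer:
        # Advance the inner input past keys smaller than the outer key.
        while i < len(inner) and inner[i][0] < okey:
            i += 1
        mark = i  # mark: start of the run of inner rows matching okey
        while i < len(inner) and inner[i][0] == okey:
            results.append((okey, oval, inner[i][1]))
            i += 1
        i = mark  # restore: the next outer row may carry the same key
    return results
```

A Material node makes this cheap by buffering the inner rows, so the underlying plan (e.g. a Sort finishing its final merge pass on the fly) never has to service the rewind itself.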

From:
Matthew Wakeling
Date:

On Sat, 14 Nov 2009, Tom Lane wrote:
> Matthew Wakeling <> writes:
>> [ discussion about applying materialize to a mergejoin's inner indexscan ]
>
> I have finally gotten round to doing something about this, and applied
> the attached patch to CVS HEAD.  Could you test it on your problem case
> to see what happens?  If it's not convenient to load your data into
> 8.5devel, I believe the patch would work all right in 8.4.x.  (A quick
> check shows that it applies except for one deletion hunk that has a
> conflict due to a comment change; you could easily do that deletion
> manually.)

Um, cool. I'm going to have to look back at the archives to even work out
what query I was complaining about though. May take a little while to
test, but I'll get back to you. Thanks,

Matthew

--
 The only secure computer is one that's unplugged, locked in a safe,
 and buried 20 feet under the ground in a secret location...and i'm not
 even too sure about that one.                         --Dennis Huges, FBI