Обсуждение: Should Oracle outperform PostgreSQL on a complex multidimensional query?
I'm currently benchmarking several RDBMSs with respect to analytical query performance on medium-sized multidimensional data sets. The data set contains 30,000,000 fact rows evenly distributed in a multidimensional space of 9 hierarchical dimensions. Each dimension has 8000 members.
The test query selects about one half of the members from each dimension, and calculates fact sums grouped by 5 high-level members from each dimensional hierarchy. (There are actually some additional complications that makes the query end up listing 20 table aliases in the from-clause, 18 of which are aliases for 2 physical tables.)
On Oracle the query runs in less than 3 seconds. All steps have been taken to ensure that Oracle will apply star schema optimization to the query (e.g. having loads of single-column bitmap indexes). The query plan reveals that a bitmap merge takes place before fact lookup.
There's a lot of RAM available, and series of queries have been run in advance to make sure the required data resides in the cache. This is confirmed by a very high CPU utilization and virtually no I/O during the query execution.
I have established similar conditions for the query in PostgreSQL, and it runs in about 30 seconds. Again the CPU utilization is high with no noticable I/O. The query plan is of course very different from that of Oracle, since PostgreSQL lacks the bitmap index merge operation. It narrows down the result one dimension at a time, using the single-column indexes provided. It is not an option for us to provide multi-column indexes tailored to the specific query, since we want full freedom as to which dimensions each query will use.
Are these the results we should expect when comparing PostgreSQL to Oracle for such queries, or are there special optimization options for PostgreSQL that we may have overlooked? (I wouldn't be suprised if there are, since I spent at least 2 full days trying to trigger the star optimization magic in my Oracle installation.)
=?iso-8859-1?Q?P=E5l_Stenslet?= <paal.stenslet@exie.com> writes:
> I have established similar conditions for the query in PostgreSQL, and =
> it runs in about 30 seconds. Again the CPU utilization is high with no =
> noticable I/O. The query plan is of course very different from that of =
> Oracle, since PostgreSQL lacks the bitmap index merge operation.
Perhaps you should be trying this on PG 8.1?  In any case, without
specific details of your schema or a look at EXPLAIN ANALYZE results,
it's unlikely that anyone is going to have any useful comments for you.
            regards, tom lane
			
		On Thu, 2005-12-08 at 12:26 +0100, Pål Stenslet wrote: > I'm currently benchmarking several RDBMSs with respect to analytical > query performance on medium-sized multidimensional data sets. The data > set contains 30,000,000 fact rows evenly distributed in a > multidimensional space of 9 hierarchical dimensions. Each dimension > has 8000 members. > I have established similar conditions for the query in PostgreSQL, and > it runs in about 30 seconds. Again the CPU utilization is high with no > noticable I/O. The query plan is of course very different from that of > Oracle, since PostgreSQL lacks the bitmap index merge operation. It > narrows down the result one dimension at a time, using the > single-column indexes provided. It is not an option for us to provide > multi-column indexes tailored to the specific query, since we want > full freedom as to which dimensions each query will use. > Are these the results we should expect when comparing PostgreSQL to > Oracle for such queries, or are there special optimization options for > PostgreSQL that we may have overlooked? (I wouldn't be suprised if > there are, since I spent at least 2 full days trying to trigger the > star optimization magic in my Oracle installation.) Yes, I'd expect something like this right now in 8.1; the numbers stack up to PostgreSQL doing equivalent join speeds, but w/o star join. You've confused the issue here since: - Oracle performs star joins using a bit map index transform. It is the star join that is the important bit here, not the just the bitmap part. - PostgreSQL does actually provide bitmap index merge, but not star join (YET!) [I've looked into this, but there seem to be multiple patent claims covering various aspects of this technique, yet at least other 3 vendors manage to achieve this. So far I've not dug too deeply, but I understand the optimizations we'd need to perform in PostgreSQL to do this.] Best Regards, Simon Riggs
How are star joins different from what we do now? --------------------------------------------------------------------------- Simon Riggs wrote: > On Thu, 2005-12-08 at 12:26 +0100, P?l Stenslet wrote: > > I'm currently benchmarking several RDBMSs with respect to analytical > > query performance on medium-sized multidimensional data sets. The data > > set contains 30,000,000 fact rows evenly distributed in a > > multidimensional space of 9 hierarchical dimensions. Each dimension > > has 8000 members. > > > I have established similar conditions for the query in PostgreSQL, and > > it runs in about 30 seconds. Again the CPU utilization is high with no > > noticable I/O. The query plan is of course very different from that of > > Oracle, since PostgreSQL lacks the bitmap index merge operation. It > > narrows down the result one dimension at a time, using the > > single-column indexes provided. It is not an option for us to provide > > multi-column indexes tailored to the specific query, since we want > > full freedom as to which dimensions each query will use. > > > Are these the results we should expect when comparing PostgreSQL to > > Oracle for such queries, or are there special optimization options for > > PostgreSQL that we may have overlooked? (I wouldn't be suprised if > > there are, since I spent at least 2 full days trying to trigger the > > star optimization magic in my Oracle installation.) > > Yes, I'd expect something like this right now in 8.1; the numbers stack > up to PostgreSQL doing equivalent join speeds, but w/o star join. > > You've confused the issue here since: > - Oracle performs star joins using a bit map index transform. It is the > star join that is the important bit here, not the just the bitmap part. > - PostgreSQL does actually provide bitmap index merge, but not star join > (YET!) > > [I've looked into this, but there seem to be multiple patent claims > covering various aspects of this technique, yet at least other 3 vendors > manage to achieve this. So far I've not dug too deeply, but I understand > the optimizations we'd need to perform in PostgreSQL to do this.] > > Best Regards, Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: > How are star joins different from what we do now? > > --------------------------------------------------------------------------- > Recall that a "star" query with n tables means a query where there are (n - 1) supposedly small tables (dimensions) and 1 large table (fact) - which has foreign keys to each dimension. As I understand it, the classic "tar join" is planned like this: 1) The result of the restriction clauses on each of the (small) dimension tables is computed. 2) The cartesian product of all the results of 1) is formed. 3) The fact (big) table is joined to the pseudo relation formed by 2). From what I have seen most vendor implementations do not (always) perform the full cartesion product of the dimensions, but do some of them, join to the fact, then join to the remaining dimensions afterwards. There is another join strategy called the "star transformation" where some of the dimension joins get rewritten as subqueries, then the above method is used again! This tends to be useful when the cartesion products would be stupidly large (e.g. "sparse" facts, or few restriction clauses) regards Mark P.s : Luke or Simon might have a better definition... but thought I'd chuck in my 2c... :-)
On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote: > How are star joins different from what we do now? Various ways of doing them, but all use plans that you wouldn't have come up with via normal join planning. Methods: 1. join all N small tables together in a cartesian product, then join to main Large table once (rather than N times) 2. transform joins into subselects, then return subselect rows via an index bitmap. Joins are performed via a bitmap addition process. You can fake (1) yourself with a temporary table, and the basics for (2) are now in place also. The characteristics of these joins make them suitable for large Data Warehouses with Fact-Dimension style designs. Many RDBMS have this, but we need to be careful of patent claims. I'm sure there's a way through that, but I'm not looking for it yet. Anybody else wishing to assist with a detailed analysis would be much appreciated. Best Regards, Simon Riggs
OK, so while our bitmap scan allows multiple indexes to be joined to get to heap rows, a star joins allows multiple dimensions _tables_ to be joined to index into a larger main fact table --- interesting. Added to TODO: * Allow star join optimizations While our bitmap scan allows multiple indexes to be joined to get to heap rows, a star joins allows multiple dimension _tables_ to be joined to index into a larger main fact table. The join is usually performed by either creating a cartesian product of all the dimmension tables and doing a single join on that product or using subselects to create bitmaps of each dimmension table match and merge the bitmaps to perform the join on the fact table. --------------------------------------------------------------------------- Simon Riggs wrote: > On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote: > > How are star joins different from what we do now? > > Various ways of doing them, but all use plans that you wouldn't have > come up with via normal join planning. > > Methods: > 1. join all N small tables together in a cartesian product, then join to > main Large table once (rather than N times) > 2. transform joins into subselects, then return subselect rows via an > index bitmap. Joins are performed via a bitmap addition process. > > You can fake (1) yourself with a temporary table, and the basics for (2) > are now in place also. > > The characteristics of these joins make them suitable for large Data > Warehouses with Fact-Dimension style designs. > > Many RDBMS have this, but we need to be careful of patent claims. I'm > sure there's a way through that, but I'm not looking for it yet. Anybody > else wishing to assist with a detailed analysis would be much > appreciated. > > Best Regards, Simon Riggs > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote:
>> How are star joins different from what we do now?
> Methods:
> 1. join all N small tables together in a cartesian product, then join to
> main Large table once (rather than N times)
Of course, the reason the current planner does not think of this is that
it does not consider clauseless joins unless there is no alternative.
However, I submit that it wouldn't pick such a plan anyway, and should
not, because the idea is utterly stupid.  The plan you currently get for
this sort of scenario is typically a nest of hash joins:
                               QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=2.25..4652.25 rows=102400 width=16)
   Hash Cond: ("outer".f1 = "inner".f1)
   ->  Hash Join  (cost=1.12..3115.12 rows=102400 width=12)
         Hash Cond: ("outer".f2 = "inner".f1)
         ->  Seq Scan on fact  (cost=0.00..1578.00 rows=102400 width=8)
         ->  Hash  (cost=1.10..1.10 rows=10 width=4)
               ->  Seq Scan on d2  (cost=0.00..1.10 rows=10 width=4)
   ->  Hash  (cost=1.10..1.10 rows=10 width=4)
         ->  Seq Scan on d1  (cost=0.00..1.10 rows=10 width=4)
(9 rows)
This involves only one scan of the fact table.  As each row is pulled up
through the nest of hash joins, we hash one dimension key and join to
one small table at each level.  This is at worst the same amount of work
as hashing all the keys at once and probing a single cartesian-product
hashtable, probably less work (fewer wasted key-comparisons).  And
definitely less memory.  You do have to keep your eye on the ball that
you don't waste a lot of overhead propagating the row up through
multiple join levels, but we got rid of most of the problem there in
8.1 via the introduction of "virtual tuple slots".  If this isn't fast
enough yet, it'd make more sense to invest effort in further cutting the
executor's join overhead (which of course benefits *all* plan types)
than in trying to make the planner choose a star join.
> 2. transform joins into subselects, then return subselect rows via an
> index bitmap. Joins are performed via a bitmap addition process.
This one might be interesting but it's not clear what you are talking
about.  "Bitmap addition"?
            regards, tom lane
			
		Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Added to TODO:
> * Allow star join optimizations
See my response to Simon for reasons why this doesn't seem like a
particularly good TODO item.
            regards, tom lane
			
		I wrote:
> However, I submit that it wouldn't pick such a plan anyway, and should
> not, because the idea is utterly stupid.
BTW, some experimentation suggests that in fact a star join is already
slower than the "regular" plan in 8.1.  You can force a star-join plan
to be generated like this:
regression=# set join_collapse_limit TO 1;
SET
regression=# explain select * from fact,d1 cross join d2 where fact.f1=d1.f1 and fact.f2=d2.f1;
                                QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=4.71..8238.71 rows=102400 width=16)
   Hash Cond: (("outer".f1 = "inner".f1) AND ("outer".f2 = "inner".f1))
   ->  Seq Scan on fact  (cost=0.00..1578.00 rows=102400 width=8)
   ->  Hash  (cost=4.21..4.21 rows=100 width=8)
         ->  Nested Loop  (cost=1.11..4.21 rows=100 width=8)
               ->  Seq Scan on d1  (cost=0.00..1.10 rows=10 width=4)
               ->  Materialize  (cost=1.11..1.21 rows=10 width=4)
                     ->  Seq Scan on d2  (cost=0.00..1.10 rows=10 width=4)
(8 rows)
and at least in the one test case I tried, this runs slower than the
nested-hash plan.  EXPLAIN ANALYZE misleadingly makes it look faster,
but that's just because of the excessive per-plan-node ANALYZE
overhead.  Try doing something like
    \timing
    select count(*) from fact, ...
to get realistic numbers.
            regards, tom lane
			
		Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Added to TODO: > > * Allow star join optimizations > > See my response to Simon for reasons why this doesn't seem like a > particularly good TODO item. Yes, TODO removed. I thought we were waiting for bitmap joins before trying star joins. I did not realize they might never be a win. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom, On 12/17/05 10:47 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > BTW, some experimentation suggests that in fact a star join is already > slower than the "regular" plan in 8.1. You can force a star-join plan > to be generated like this: Cool! We've got Paal's test case in the queue to run, it's taking us some time to get to it, possibly by next week we should be able to run some of these cases: 1) 8.1.1 btree with bitmap scan 2) 8.1.1 on-disk bitmap with direct AND operations 3) (2) with forced star transformation (materialize) We'll also be trying the same things with the CVS tip of Bizgres MPP, probably over X-mas. We should be able to handily beat Oracle's 3 second number. - Luke
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>
>>On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote:
>>
>>>How are star joins different from what we do now?
>
>
>>Methods:
>>1. join all N small tables together in a cartesian product, then join to
>>main Large table once (rather than N times)
>
>
> Of course, the reason the current planner does not think of this is that
> it does not consider clauseless joins unless there is no alternative.
>
> However, I submit that it wouldn't pick such a plan anyway, and should
> not, because the idea is utterly stupid.  The plan you currently get for
> this sort of scenario is typically a nest of hash joins:
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=2.25..4652.25 rows=102400 width=16)
>    Hash Cond: ("outer".f1 = "inner".f1)
>    ->  Hash Join  (cost=1.12..3115.12 rows=102400 width=12)
>          Hash Cond: ("outer".f2 = "inner".f1)
>          ->  Seq Scan on fact  (cost=0.00..1578.00 rows=102400 width=8)
>          ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>                ->  Seq Scan on d2  (cost=0.00..1.10 rows=10 width=4)
>    ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>          ->  Seq Scan on d1  (cost=0.00..1.10 rows=10 width=4)
> (9 rows)
>
> This involves only one scan of the fact table.  As each row is pulled up
> through the nest of hash joins, we hash one dimension key and join to
> one small table at each level.  This is at worst the same amount of work
> as hashing all the keys at once and probing a single cartesian-product
> hashtable, probably less work (fewer wasted key-comparisons).  And
> definitely less memory.  You do have to keep your eye on the ball that
> you don't waste a lot of overhead propagating the row up through
> multiple join levels, but we got rid of most of the problem there in
> 8.1 via the introduction of "virtual tuple slots".  If this isn't fast
> enough yet, it'd make more sense to invest effort in further cutting the
> executor's join overhead (which of course benefits *all* plan types)
> than in trying to make the planner choose a star join.
>
>
>>2. transform joins into subselects, then return subselect rows via an
>>index bitmap. Joins are performed via a bitmap addition process.
>
>
> This one might be interesting but it's not clear what you are talking
> about.  "Bitmap addition"?
Yeah - the quoted method of "make a cartesian product of the dimensions
and then join to the fact all at once" is not actually used (as written)
in many implementations - probably for the reasons you are pointing out.
I found these two papers whilst browsing:
http://www.cs.brown.edu/courses/cs227/Papers/Indexing/O'NeilGraefe.pdf
http://www.dama.upc.edu/downloads/jaguilar-2005-4.pdf
They seem to be describing a more subtle method making use of join
indexes and bitmapped indexes.
If I understand it correctly, the idea is to successively build up a
list (hash / bitmap) of fact RIDS that will satisfy the query, and when
complete actually perform the join and construct tuples. The goal being
  similar in intent to the star join method (i.e. access the fact table
as little and as "late" as possible), but avoiding the cost of actually
constructing the dimension cartesian product.
cheers
Mark
			
		Tom Lane wrote:
>>2. transform joins into subselects, then return subselect rows via an
>>index bitmap. Joins are performed via a bitmap addition process.
Looks like 8.1 pretty much does this right now:
First the basic star:
EXPLAIN ANALYZE
SELECT
         d0.dmth,
         d1.dat,
         count(f.fval )
FROM
         dim0 AS d0,
         dim1 AS d1,
         fact0 AS f
WHERE   d0.d0key = f.d0key
AND     d1.d1key = f.d1key
AND     d0.dyr BETWEEN 2010 AND 2015
AND     d1.dattyp BETWEEN '10th measure type' AND '14th measure type'
GROUP BY
         d0.dmth,
         d1.dat
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=334842.41..334846.53 rows=329 width=37) (actual
time=144317.960..144318.814 rows=120 loops=1)
    ->  Hash Join  (cost=145.72..334636.91 rows=27400 width=37) (actual
time=1586.363..142831.025 rows=201600 loops=1)
          Hash Cond: ("outer".d0key = "inner".d0key)
          ->  Hash Join  (cost=89.72..333279.41 rows=137001 width=37)
(actual time=1467.322..135585.317 rows=1000000 loops=1)
                Hash Cond: ("outer".d1key = "inner".d1key)
                ->  Seq Scan on fact0 f  (cost=0.00..281819.45
rows=10000045 width=12) (actual time=120.881..70364.473 rows=10000000
loops=1)
                ->  Hash  (cost=89.38..89.38 rows=137 width=33) (actual
time=24.822..24.822 rows=660 loops=1)
                      ->  Index Scan using dim1_dattyp on dim1 d1
(cost=0.00..89.38 rows=137 width=33) (actual time=0.502..19.374 rows=660
loops=1)
                            Index Cond: (((dattyp)::text >= '10th
measure type'::text) AND ((dattyp)::text <= '14th measure type'::text))
          ->  Hash  (cost=51.00..51.00 rows=2000 width=8) (actual
time=31.620..31.620 rows=2016 loops=1)
                ->  Index Scan using dim0_dyr on dim0 d0
(cost=0.00..51.00 rows=2000 width=8) (actual time=0.379..17.377
rows=2016 loops=1)
                      Index Cond: ((dyr >= 2010) AND (dyr <= 2015))
  Total runtime: 144320.588 ms
(13 rows)
Now apply the star transformation:
EXPLAIN ANALYZE
SELECT
         d0.dmth,
         d1.dat,
         count(f.fval )
FROM
         dim0 AS d0,
         dim1 AS d1,
         fact0 AS f
WHERE   d0.d0key = f.d0key
AND     d1.d1key = f.d1key
AND     d0.dyr BETWEEN 2010 AND 2015
AND     d1.dattyp BETWEEN '10th measure type' AND '14th measure type'
AND     f.d0key IN (SELECT cd0.d0key FROM dim0 cd0
                     WHERE cd0.dyr BETWEEN 2010 AND 2015)
AND     f.d1key IN (SELECT cd1.d1key FROM dim1 cd1
                     WHERE cd1.dattyp BETWEEN '10th measure type'
                                      AND '14th measure type')
GROUP BY
         d0.dmth,
         d1.dat
;
        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=129230.89..129231.83 rows=75 width=37) (actual
time=39798.192..39799.015 rows=120 loops=1)
    ->  Nested Loop IN Join  (cost=149.44..129230.33 rows=75 width=37)
(actual time=269.919..38125.520 rows=201600 loops=1)
          ->  Hash Join  (cost=147.43..128171.03 rows=375 width=45)
(actual time=269.516..27342.866 rows=201600 loops=1)
                Hash Cond: ("outer".d0key = "inner".d0key)
                ->  Nested Loop  (cost=91.43..128096.03 rows=2000
width=37) (actual time=152.084..19869.365 rows=1000000 loops=1)
                      ->  Hash Join  (cost=91.43..181.52 rows=2
width=37) (actual time=29.931..46.339 rows=660 loops=1)
                            Hash Cond: ("outer".d1key = "inner".d1key)
                            ->  Index Scan using dim1_dattyp on dim1 d1
  (cost=0.00..89.38 rows=137 width=33) (actual time=0.516..7.683
rows=660 loops=1)
                                  Index Cond: (((dattyp)::text >= '10th
measure type'::text) AND ((dattyp)::text <= '14th measure type'::text))
                            ->  Hash  (cost=91.09..91.09 rows=137
width=4) (actual time=29.238..29.238 rows=660 loops=1)
                                  ->  HashAggregate  (cost=89.72..91.09
rows=137 width=4) (actual time=20.940..24.900 rows=660 loops=1)
                                        ->  Index Scan using dim1_dattyp
on dim1 cd1  (cost=0.00..89.38 rows=137 width=4) (actual
time=0.042..14.841 rows=660 loops=1)
                                              Index Cond:
(((dattyp)::text >= '10th measure type'::text) AND ((dattyp)::text <=
'14th measure type'::text))
                      ->  Index Scan using fact0_d1key on fact0 f
(cost=0.00..62707.26 rows=100000 width=12) (actual time=0.205..12.691
rows=1515 loops=660)
                            Index Cond: ("outer".d1key = f.d1key)
                ->  Hash  (cost=51.00..51.00 rows=2000 width=8) (actual
time=31.264..31.264 rows=2016 loops=1)
                      ->  Index Scan using dim0_dyr on dim0 d0
(cost=0.00..51.00 rows=2000 width=8) (actual time=0.339..16.885
rows=2016 loops=1)
                            Index Cond: ((dyr >= 2010) AND (dyr <= 2015))
          ->  Bitmap Heap Scan on dim0 cd0  (cost=2.00..2.81 rows=1
width=4) (actual time=0.031..0.031 rows=1 loops=201600)
                Recheck Cond: ("outer".d0key = cd0.d0key)
                Filter: ((dyr >= 2010) AND (dyr <= 2015))
                ->  Bitmap Index Scan on dim0_d0key  (cost=0.00..2.00
rows=1 width=0) (actual time=0.015..0.015 rows=1 loops=201600)
                      Index Cond: ("outer".d0key = cd0.d0key)
  Total runtime: 39800.294 ms
(24 rows)
The real run times are more like 24s and 9s, but you get the idea.
Cheers
Mark
			
		Tom Lane <tgl@sss.pgh.pa.us> writes:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote:
> >> How are star joins different from what we do now?
>
> > Methods:
> > 1. join all N small tables together in a cartesian product, then join to
> > main Large table once (rather than N times)
>
> Of course, the reason the current planner does not think of this is that
> it does not consider clauseless joins unless there is no alternative.
>
> However, I submit that it wouldn't pick such a plan anyway, and should
> not, because the idea is utterly stupid.  The plan you currently get for
> this sort of scenario is typically a nest of hash joins:
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=2.25..4652.25 rows=102400 width=16)
>    Hash Cond: ("outer".f1 = "inner".f1)
>    ->  Hash Join  (cost=1.12..3115.12 rows=102400 width=12)
>          Hash Cond: ("outer".f2 = "inner".f1)
>          ->  Seq Scan on fact  (cost=0.00..1578.00 rows=102400 width=8)
>          ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>                ->  Seq Scan on d2  (cost=0.00..1.10 rows=10 width=4)
>    ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>          ->  Seq Scan on d1  (cost=0.00..1.10 rows=10 width=4)
> (9 rows)
I realize DSS systems often expect to run queries using sequential scans but
perhaps the point of this particular plan is to exploit indexes? (I think
particularly bitmap indexes but ...)
So in this case, you would expect an index scan of d1 to pull out just the
records that d1 says should be included, and an index scan of d2 to pull out
just the records that d2 says should be included, then finally a nested loop
index lookup of f1 for the primary keys that show up in both the d1 scan and
the d2 scan.
So in the following it would be nice if the index scan on f didn't have to
appear until *after* all the hashes were checked for the dimenions, not after
only one of them. This would be even nicer if instead of hashes a bitmap data
structure could be built and bitmap operations used to do the joins, since no
other columns from these dimension tables need to be preserved to be included
in the select list.
It would be even better if there were an on-disk representation of these
bitmap data structures but I don't see how to do that with MVCC at all.
slo=> explain select * from fact as f where fact_id in (select fact_id from d d1 where dim_id = 4) and fact_id in
(selectfact_id from d d2 where dim_id = 29) and fact_id in (select fact_id from d d3 where dim_id = 57); 
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash IN Join  (cost=15.77..21.86 rows=1 width=110)
   Hash Cond: ("outer".fact_id = "inner".fact_id)
   ->  Hash IN Join  (cost=10.51..16.59 rows=1 width=118)
         Hash Cond: ("outer".fact_id = "inner".fact_id)
         ->  Nested Loop  (cost=5.26..11.31 rows=2 width=114)
               ->  HashAggregate  (cost=5.26..5.26 rows=2 width=4)
                     ->  Index Scan using di on d d2  (cost=0.00..5.25 rows=3 width=4)
                           Index Cond: (dim_id = 29)
               ->  Index Scan using fact_pkey on fact f  (cost=0.00..3.01 rows=1 width=110)
                     Index Cond: (f.fact_id = "outer".fact_id)
         ->  Hash  (cost=5.25..5.25 rows=3 width=4)
               ->  Index Scan using di on d d1  (cost=0.00..5.25 rows=3 width=4)
                     Index Cond: (dim_id = 4)
   ->  Hash  (cost=5.25..5.25 rows=3 width=4)
         ->  Index Scan using di on d d3  (cost=0.00..5.25 rows=3 width=4)
               Index Cond: (dim_id = 57)
(16 rows)
> > 2. transform joins into subselects, then return subselect rows via an
> > index bitmap. Joins are performed via a bitmap addition process.
>
> This one might be interesting but it's not clear what you are talking
> about.  "Bitmap addition"?
Well "transform joins into subselects" is a red herring. Joins and subselects
are two ways of spelling special cases of the same thing and internally they
ought to go through the same codepaths. They don't in Postgres but judging by
the plans it produces I believe they do at least in a lot of cases in Oracle.
That's sort of the whole point of the phrase "star join". What the user really
wants is a single table joined to a bunch of small tables. There's no way to
write that in SQL due to the limitations of the language but a bunch of
subqueries expresses precisely the same concept (albeit with another set of
language limitations which luckily don't impact this particular application).
--
greg
			
		On Sun, 2005-12-18 at 15:02 +1300, Mark Kirkwood wrote: > Yeah - the quoted method of "make a cartesian product of the dimensions > and then join to the fact all at once" is not actually used (as written) > in many implementations But it is used in some, which is why I mentioned it. I gave two implementations, that is just (1) > - probably for the reasons you are pointing out. > I found these two papers whilst browsing: > > > http://www.cs.brown.edu/courses/cs227/Papers/Indexing/O'NeilGraefe.pdf > http://www.dama.upc.edu/downloads/jaguilar-2005-4.pdf > > > They seem to be describing a more subtle method making use of join > indexes and bitmapped indexes. Which is the option (2) I described. Best Regards, Simon Riggs
On Sat, 2005-12-17 at 13:13 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote:
> >> How are star joins different from what we do now?
>
> > Methods:
> > 1. join all N small tables together in a cartesian product, then join to
> > main Large table once (rather than N times)
>
> Of course, the reason the current planner does not think of this is that
> it does not consider clauseless joins unless there is no alternative.
Understood
>  The plan you currently get for
> this sort of scenario is typically a nest of hash joins:
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Join  (cost=2.25..4652.25 rows=102400 width=16)
>    Hash Cond: ("outer".f1 = "inner".f1)
>    ->  Hash Join  (cost=1.12..3115.12 rows=102400 width=12)
>          Hash Cond: ("outer".f2 = "inner".f1)
>          ->  Seq Scan on fact  (cost=0.00..1578.00 rows=102400 width=8)
>          ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>                ->  Seq Scan on d2  (cost=0.00..1.10 rows=10 width=4)
>    ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>          ->  Seq Scan on d1  (cost=0.00..1.10 rows=10 width=4)
> (9 rows)
>
> This involves only one scan of the fact table.  As each row is pulled up
> through the nest of hash joins, we hash one dimension key and join to
> one small table at each level.
Understood
> This is at worst the same amount of work
> as hashing all the keys at once and probing a single cartesian-product
> hashtable, probably less work (fewer wasted key-comparisons).  And
> definitely less memory.  You do have to keep your eye on the ball that
> you don't waste a lot of overhead propagating the row up through
> multiple join levels, but we got rid of most of the problem there in
> 8.1 via the introduction of "virtual tuple slots".  If this isn't fast
> enough yet, it'd make more sense to invest effort in further cutting the
> executor's join overhead (which of course benefits *all* plan types)
> than in trying to make the planner choose a star join.
That join type is used when an index-organised table is available, so
that a SeqScan of the larger table can be avoided.
I'd say the plan would make sense if the columns of the cartesian
product match a multi-column index on the larger table that would not
ever be used unless sufficient columns are restricted in each lookup.
That way you are able to avoid the SeqScan that occurs for the multiple
nested Hash Join case. (Clearly, normal selectivity rules apply on the
use of the index in this way).
So I think that plan type still can be effective in some circumstances.
Mind you: building an N-way index on a large table isn't such a good
idea, unless you can partition the tables and still use a join. Which is
why I've not centred on this case as being important before now.
My understanding: Teradata and DB2 use this.
This may be covered by patents.
> > 2. transform joins into subselects, then return subselect rows via an
> > index bitmap. Joins are performed via a bitmap addition process.
>
> This one might be interesting but it's not clear what you are talking
> about.  "Bitmap addition"?
Ref: "Re: [HACKERS] slow IN() clause for many cases"
Required Transforms: join -> IN (subselect) -> = ANY(ARRAY(subselect))
ending with the ability to use an bitmap index scan
(which clearly requires a run-time, not a plan-time evaluation - though
the distinction is minor if you go straight from plan->execute as is the
case with most Data Warehouse queries).
If you do this for all joins, you can then solve the problem with a
Bitmap And step, which is what I meant by "addition".
If you have need columns in the result set from the smaller tables you
can get them by joining the result set back to the smaller tables again.
My understanding: Oracle uses this.
This may be covered by patents.
Best Regards, Simon Riggs
			
		On Sun, 2005-12-18 at 17:07 +1300, Mark Kirkwood wrote: > Tom Lane wrote: > > >>2. transform joins into subselects, then return subselect rows via an > >>index bitmap. Joins are performed via a bitmap addition process. > > Looks like 8.1 pretty much does this right now: Good analysis. 8.1 doesn't do: - the transforms sufficiently well (you just performed them manually) - doesn't AND together multiple bitmaps to assist with N-way joins Those aren't criticisms, just observations. Pal's original example was a 9-dimension join, so I think PostgreSQL does very well on making that run in 30 seconds. That's a complex example and I think upholds just how good things are right now. Anyway, back to the starting point: IMHO there is an additional optimisation that can be performed to somehow speed up Single large table-many small table joins. And we have some clues as to how we might do that. Best Regards, Simon Riggs
Simon Riggs wrote: > On Sun, 2005-12-18 at 17:07 +1300, Mark Kirkwood wrote: > >>Tom Lane wrote: >> >> >>>>2. transform joins into subselects, then return subselect rows via an >>>>index bitmap. Joins are performed via a bitmap addition process. >> >>Looks like 8.1 pretty much does this right now: > > > Good analysis. > > 8.1 doesn't do: > - the transforms sufficiently well (you just performed them manually) Absolutely - I was intending to note that very point, but it got lost somewhere between brain and fingers :-) > - doesn't AND together multiple bitmaps to assist with N-way joins > Ah yes - I had overlooked that, good point! Cheers Mark
Simon Riggs wrote: > On Sun, 2005-12-18 at 15:02 +1300, Mark Kirkwood wrote: > > >>Yeah - the quoted method of "make a cartesian product of the dimensions >>and then join to the fact all at once" is not actually used (as written) >>in many implementations > > > But it is used in some, which is why I mentioned it. > > I gave two implementations, that is just (1) > > Sorry Simon, didn't mean to imply you shouldn't have mentioned it - was merely opining about its effectiveness.... >>- probably for the reasons you are pointing out. >>I found these two papers whilst browsing: >> >> >>http://www.cs.brown.edu/courses/cs227/Papers/Indexing/O'NeilGraefe.pdf >>http://www.dama.upc.edu/downloads/jaguilar-2005-4.pdf >> >> >>They seem to be describing a more subtle method making use of join >>indexes and bitmapped indexes. > > > Which is the option (2) I described. > Ok - I misunderstood you on this one, and thought you were describing the "star transformation" - upon re-reading, I see that yes, it's more or less a description of the O'Neil Graefe method. best wishes Mark
Simon Riggs wrote:
> On Sat, 2005-12-17 at 13:13 -0500, Tom Lane wrote:
>
>>Simon Riggs <simon@2ndquadrant.com> writes:
>>
>>>On Fri, 2005-12-16 at 23:28 -0500, Bruce Momjian wrote:
>>>
>>>>How are star joins different from what we do now?
>>
>>>Methods:
>>>1. join all N small tables together in a cartesian product, then join to
>>>main Large table once (rather than N times)
>>
>>Of course, the reason the current planner does not think of this is that
>>it does not consider clauseless joins unless there is no alternative.
>
>
> Understood
>
>
>> The plan you currently get for
>>this sort of scenario is typically a nest of hash joins:
>>
>>                               QUERY PLAN
>>------------------------------------------------------------------------
>> Hash Join  (cost=2.25..4652.25 rows=102400 width=16)
>>   Hash Cond: ("outer".f1 = "inner".f1)
>>   ->  Hash Join  (cost=1.12..3115.12 rows=102400 width=12)
>>         Hash Cond: ("outer".f2 = "inner".f1)
>>         ->  Seq Scan on fact  (cost=0.00..1578.00 rows=102400 width=8)
>>         ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>>               ->  Seq Scan on d2  (cost=0.00..1.10 rows=10 width=4)
>>   ->  Hash  (cost=1.10..1.10 rows=10 width=4)
>>         ->  Seq Scan on d1  (cost=0.00..1.10 rows=10 width=4)
>>(9 rows)
>>
>>This involves only one scan of the fact table.  As each row is pulled up
>>through the nest of hash joins, we hash one dimension key and join to
>>one small table at each level.
>
>
> Understood
>
>
>>This is at worst the same amount of work
>>as hashing all the keys at once and probing a single cartesian-product
>>hashtable, probably less work (fewer wasted key-comparisons).  And
>>definitely less memory.  You do have to keep your eye on the ball that
>>you don't waste a lot of overhead propagating the row up through
>>multiple join levels, but we got rid of most of the problem there in
>>8.1 via the introduction of "virtual tuple slots".  If this isn't fast
>>enough yet, it'd make more sense to invest effort in further cutting the
>>executor's join overhead (which of course benefits *all* plan types)
>>than in trying to make the planner choose a star join.
>
>
> That join type is used when an index-organised table is available, so
> that a SeqScan of the larger table can be avoided.
>
> I'd say the plan would make sense if the columns of the cartesian
> product match a multi-column index on the larger table that would not
> ever be used unless sufficient columns are restricted in each lookup.
> That way you are able to avoid the SeqScan that occurs for the multiple
> nested Hash Join case. (Clearly, normal selectivity rules apply on the
> use of the index in this way).
>
> So I think that plan type still can be effective in some circumstances.
> Mind you: building an N-way index on a large table isn't such a good
> idea, unless you can partition the tables and still use a join. Which is
> why I've not centred on this case as being important before now.
>
> My understanding: Teradata and DB2 use this.
>
FWIW - I think DB2 uses the successive fact RID buildup (i.e method 2),
unfortunately I haven't got a working copy of DB2 in front of me to test.
Cheers
Mark
			
		On Mon, 2005-12-19 at 11:10 +1300, Mark Kirkwood wrote: > >>I found these two papers whilst browsing: > >> > >> > >>http://www.cs.brown.edu/courses/cs227/Papers/Indexing/O'NeilGraefe.pdf > >>http://www.dama.upc.edu/downloads/jaguilar-2005-4.pdf > >> > >> > >>They seem to be describing a more subtle method making use of join > >>indexes and bitmapped indexes. > > > > > > Which is the option (2) I described. > > > > Ok - I misunderstood you on this one, and thought you were describing > the "star transformation" - upon re-reading, I see that yes, it's more > or less a description of the O'Neil Graefe method. Papers look interesting; I'd not seen them. My knowledge of this is mostly practical. O'Neil and Graefe seem to be talking about using join indexes, which is probably method (3)... oh lordy. Best Regards, Simon Riggs
On Mon, 2005-12-19 at 11:13 +1300, Mark Kirkwood wrote: > > My understanding: Teradata and DB2 use this. > > FWIW - I think DB2 uses the successive fact RID buildup (i.e method 2), > unfortunately I think you're right; I was thinking about that point too because DB2 doesn't have index-organised tables (well, sort of: MDC). I was confused because IBM seem to have a patent on (1), even though it seems exactly like the original NCR/Teradata implementation, which predates the patent filing by many years. Wierd. It's a minefield of patents.... > I haven't got a working copy of DB2 in front of me to test. True, not all copies work :-) Best Regards, Simon Riggs