Thread: Partitioned table performance

From: "Stacy White"
Date:

We're working with a Postgres database that includes a fairly large table
(100M rows, increasing at around 2M per day).

In some cases we've seen some increased performance in tests by splitting
the table into several smaller tables.  Both 'UNION ALL' views, and the
superclass/subclass scheme work well at pruning down the set of rows a query
uses, but they seem to introduce a large performance hit to the time to
process each row (~50% for superclass/subclass, and ~150% for union views).

Is this to be expected?  Or is this a problem with our test setup?

I've listed details on our tests at the end of this message.  The results
are similar with our larger tables; the overhead appears to be per record
returned from the subquery/subclass; it's not a constant overhead per query.
Our production instance is running 7.4.2, but the results are the same on
8.0.

For reference, I tested with this setup (for the superclass/subclass
partitioning scheme):

   CREATE TABLE super_foo ( partition NUMERIC, bar NUMERIC );
   ANALYZE super_foo ;

   CREATE TABLE sub_foo1 () INHERITS ( super_foo );
   INSERT INTO sub_foo1 VALUES ( 1, 1 );
   -- repeat insert until sub_foo1 has 1,000,000 rows
   CREATE INDEX idx_subfoo1_partition ON sub_foo1 ( partition );
   ANALYZE sub_foo1 ;

   CREATE TABLE sub_foo2 () INHERITS ( super_foo );
   INSERT INTO sub_foo2 VALUES ( 2, 1 );
   -- repeat insert until sub_foo2 has 1,000,000 rows
   CREATE INDEX idx_subfoo2_partition ON sub_foo2 ( partition );
   ANALYZE sub_foo2 ;
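
Incidentally, the "repeat insert" steps can be done in one statement; this
sketch assumes PostgreSQL 8.0, where generate_series() first appears (on 7.4
an INSERT ... SELECT from an existing table, or a plpgsql loop, works
instead):

   -- bulk-load sketch; generate_series() requires PostgreSQL 8.0
   INSERT INTO sub_foo1 SELECT 1, 1 FROM generate_series(1, 1000000);
   INSERT INTO sub_foo2 SELECT 2, 1 FROM generate_series(1, 1000000);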

and this setup for the union all scheme:

   CREATE TABLE union_foo1 ( bar NUMERIC );
   INSERT INTO union_foo1 VALUES ( 1 ) ;
   -- repeat insert until union_foo1 has 1,000,000 rows
   ANALYZE union_foo1 ;

   CREATE TABLE union_foo2 ( bar NUMERIC );
   INSERT INTO union_foo2 VALUES ( 1 ) ;
   -- repeat insert until union_foo2 has 1,000,000 rows
   ANALYZE union_foo2 ;

   CREATE VIEW union_foo AS
      SELECT 1 AS partition, * FROM union_foo1
      UNION ALL
      SELECT 2 AS partition, * FROM union_foo2 ;

The partition pruning works marvelously:

  EXPLAIN SELECT SUM(bar) FROM super_foo WHERE partition = 2 ;
                                                   QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21899.02..21899.02 rows=1 width=32)
   ->  Append  (cost=0.00..19399.01 rows=1000002 width=32)
         ->  Seq Scan on super_foo  (cost=0.00..0.00 rows=1 width=32)
               Filter: (partition = 2::numeric)
         ->  Index Scan using idx_subfoo1_partition on sub_foo1 super_foo  (cost=0.00..2.01 rows=1 width=10)
               Index Cond: (partition = 2::numeric)
         ->  Seq Scan on sub_foo2 super_foo  (cost=0.00..19397.00 rows=1000000 width=10)
               Filter: (partition = 2::numeric)

and

  EXPLAIN SELECT SUM(bar) FROM union_foo WHERE partition = 2 ;
                                                   QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=75819.15..75819.15 rows=1 width=32)
   ->  Subquery Scan union_foo  (cost=0.00..70818.60 rows=2000220 width=32)
         ->  Append  (cost=0.00..50816.40 rows=2000220 width=10)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..25408.20 rows=1000110 width=10)
                     ->  Result  (cost=0.00..15407.10 rows=1000110 width=10)
                           One-Time Filter: false
                           ->  Seq Scan on union_foo1  (cost=0.00..15407.10 rows=1000110 width=10)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..25408.20 rows=1000110 width=10)
                     ->  Seq Scan on union_foo2  (cost=0.00..15407.10 rows=1000110 width=10)


But you can see a fair amount of overhead, especially in the case of the
union view:

  SELECT SUM(bar) FROM sub_foo1 UNION ALL SELECT SUM(bar) FROM sub_foo2 ;
  Time: 2291.637 ms

  SELECT SUM(bar) FROM union_foo1 UNION ALL SELECT SUM(bar) FROM union_foo2 ;
  Time: 2248.225 ms

  SELECT SUM(bar) FROM super_foo ;
  Time: 3329.953 ms

  SELECT SUM(bar) FROM union_foo ;
  Time: 5267.742 ms


  SELECT SUM(bar) FROM sub_foo2 ;
  Time: 1124.496 ms

  SELECT SUM(bar) FROM union_foo2 ;
  Time: 1090.616 ms

  SELECT SUM(bar) FROM super_foo WHERE partition = 2 ;
  Time: 2137.767 ms

  SELECT SUM(bar) FROM union_foo WHERE partition = 2 ;
  Time: 2774.887 ms


From: Josh Berkus
Date:

Stacy,

Thanks for the stats!

> In some cases we've seen some increased performance in tests by splitting
> the table into several smaller tables.  Both 'UNION ALL' views, and the
> superclass/subclass scheme work well at pruning down the set of rows a
> query uses, but they seem to introduce a large performance hit to the time
> to process each row (~50% for superclass/subclass, and ~150% for union
> views).

This seems reasonable, actually, given your test.  Really, what you should be
comparing it against is not selecting from an individual partition, but
selecting from the whole business as one large table.

I also suspect that wider rows result in proportionally less overhead; note
that your test table contains *only* the indexed columns.  I should soon have
a test to prove this, hopefully.

However, I would be interested in seeing EXPLAIN ANALYZE from your tests
rather than just EXPLAIN.

--
Josh Berkus
Aglio Database Solutions
San Francisco

From: "Stacy White"
Date:

Thanks for the quick reply, Josh.  Here are some more, with wider tables and
'EXPLAIN ANALYZE' output.  These tests use the same basic structure as
before, but with 20 data columns rather than just the one:

   CREATE TABLE one_big_foo (
      partition INTEGER,
      bar1 INTEGER,
      ...
      bar20 INTEGER
   );

Each set of test tables holds 1,000,000 tuples with a partition value of
'1', and 1,000,000 with a partition value of '2'.  The bar* columns are all
set to non-null values.  The 'one_big_foo' table stores all 2M rows in one
table.  'super_foo' and 'union_foo' split the data into two tables, and use
inheritance and union views (respectively) to tie them together, as
described in my previous message.

Query timings and 'EXPLAIN ANALYZE' results for full table scans and for
partition scans follow:


vod=# SELECT COUNT(*), MAX(bar1) FROM one_big_foo ;
Time: 3695.274 ms

vod=# SELECT COUNT(*), MAX(bar1) FROM super_foo ;
Time: 4641.992 ms

vod=# SELECT COUNT(*), MAX(bar1) FROM union_foo ;
Time: 16035.025 ms


vod=# SELECT COUNT(*), MAX(bar1) FROM one_big_foo WHERE partition = 1 ;
Time: 4395.274 ms

vod=# SELECT COUNT(*), MAX(bar1) FROM super_foo WHERE partition = 1 ;
Time: 3050.920 ms

vod=# SELECT COUNT(*), MAX(bar1) FROM union_foo WHERE partition = 1 ;
Time: 7468.664 ms




vod=# EXPLAIN ANALYZE SELECT COUNT(*), MAX(bar1) FROM one_big_foo ;
                                                      QUERY PLAN
 -------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=61747.92..61747.92 rows=1 width=4) (actual time=18412.471..18412.474 rows=1 loops=1)
   ->  Seq Scan on one_big_foo  (cost=0.00..51747.61 rows=2000061 width=4) (actual time=0.097..10079.192 rows=2000000 loops=1)
 Total runtime: 18412.597 ms
(3 rows)

Time: 18413.919 ms


vod=# EXPLAIN ANALYZE SELECT COUNT(*), MAX(bar1) FROM super_foo ;
                                                      QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=61749.87..61749.87 rows=1 width=4) (actual time=30267.913..30267.916 rows=1 loops=1)
   ->  Append  (cost=0.00..51749.24 rows=2000125 width=4) (actual time=0.127..22830.610 rows=2000000 loops=1)
         ->  Seq Scan on super_foo  (cost=0.00..0.00 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Seq Scan on sub_foo1 super_foo  (cost=0.00..25874.62 rows=1000062 width=4) (actual time=0.113..5808.899 rows=1000000 loops=1)
         ->  Seq Scan on sub_foo2 super_foo  (cost=0.00..25874.62 rows=1000062 width=4) (actual time=0.075..5829.095 rows=1000000 loops=1)
 Total runtime: 30268.061 ms
(6 rows)

Time: 30303.271 ms


vod=# EXPLAIN ANALYZE SELECT COUNT(*), MAX(bar1) FROM union_foo ;
                                                      QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=98573.40..98573.40 rows=1 width=4) (actual time=62542.849..62542.852 rows=1 loops=1)
   ->  Subquery Scan union_foo  (cost=0.00..88573.20 rows=2000040 width=4) (actual time=0.130..55536.040 rows=2000000 loops=1)
         ->  Append  (cost=0.00..68572.80 rows=2000040 width=80) (actual time=0.122..43210.763 rows=2000000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..34286.40 rows=1000020 width=80) (actual time=0.118..16312.708 rows=1000000 loops=1)
                     ->  Seq Scan on union_foo1  (cost=0.00..24286.20 rows=1000020 width=80) (actual time=0.107..7763.460 rows=1000000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..34286.40 rows=1000020 width=80) (actual time=0.116..16610.387 rows=1000000 loops=1)
                     ->  Seq Scan on union_foo2  (cost=0.00..24286.20 rows=1000020 width=80) (actual time=0.095..7549.522 rows=1000000 loops=1)
 Total runtime: 62543.098 ms



vod=# EXPLAIN ANALYZE SELECT COUNT(*), MAX(bar1) FROM one_big_foo WHERE
partition = 1 ;
                                                      QUERY PLAN
 ------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=61711.25..61711.25 rows=1 width=4) (actual time=11592.135..11592.139 rows=1 loops=1)
   ->  Seq Scan on one_big_foo  (cost=0.00..56747.76 rows=992697 width=4) (actual time=0.106..7627.170 rows=1000000 loops=1)
         Filter: (partition = 1::numeric)
 Total runtime: 11592.264 ms
(4 rows)

Time: 11593.749 ms

vod=# EXPLAIN ANALYZE SELECT COUNT(*), MAX(bar1) FROM super_foo WHERE
partition = 1 ;

                                                      QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33377.11..33377.11 rows=1 width=4) (actual time=15670.309..15670.312 rows=1 loops=1)
   ->  Append  (cost=0.00..28376.79 rows=1000064 width=4) (actual time=6.699..12072.483 rows=1000000 loops=1)
         ->  Seq Scan on super_foo  (cost=0.00..0.00 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Filter: (partition = 1::numeric)
         ->  Seq Scan on sub_foo1 super_foo  (cost=0.00..28374.78 rows=1000062 width=4) (actual time=0.106..6688.812 rows=1000000 loops=1)
               Filter: (partition = 1::numeric)
         ->  Index Scan using idx_sub_foo2_partition on sub_foo2 super_foo  (cost=0.00..2.01 rows=1 width=4) (actual time=0.221..0.221 rows=0 loops=1)
               Index Cond: (partition = 1::numeric)
 Total runtime: 15670.463 ms
(9 rows)

Time: 15672.235 ms

vod=# EXPLAIN ANALYZE SELECT COUNT(*), MAX(bar1) FROM union_foo WHERE
partition = 1 ;
                                                      QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=98573.40..98573.40 rows=1 width=4) (actual time=31897.629..31897.632 rows=1 loops=1)
   ->  Subquery Scan union_foo  (cost=0.00..88573.20 rows=2000040 width=4) (actual time=0.134..28323.692 rows=1000000 loops=1)
         ->  Append  (cost=0.00..68572.80 rows=2000040 width=80) (actual time=0.125..21969.522 rows=1000000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..34286.40 rows=1000020 width=80) (actual time=0.120..16867.005 rows=1000000 loops=1)
                     ->  Seq Scan on union_foo1  (cost=0.00..24286.20 rows=1000020 width=80) (actual time=0.108..8017.931 rows=1000000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..34286.40 rows=1000020 width=80) (actual time=0.011..0.011 rows=0 loops=1)
                     ->  Result  (cost=0.00..24286.20 rows=1000020 width=80) (actual time=0.004..0.004 rows=0 loops=1)
                           One-Time Filter: false
                           ->  Seq Scan on union_foo2  (cost=0.00..24286.20 rows=1000020 width=80) (never executed)
 Total runtime: 31897.897 ms
(10 rows)

Time: 31900.204 ms



----- Original Message -----
From: "Josh Berkus" <>
To: <>
Cc: "Stacy White" <>
Sent: Sunday, December 05, 2004 3:06 PM
Subject: Re: [PERFORM] Partitioned table performance


[snip]


From: Josh Berkus
Date:

Stacy,

> Each set of test tables holds 1,000,000 tuples with a partition value of
> '1', and 1,000,000 with a partition value of '2'.  The bar* columns are all
> set to non-null values.  The 'one_big_foo' table stores all 2M rows in one
> table.  'super_foo' and 'union_foo' split the data into two tables, and use
> inheritance and union views (respectively) to tie them together, as
> described in my previous message.
>
> Query timings and 'EXPLAIN ANALYZE' results for full table scans and for
> partition scans follow:

Hmmm ... interesting.  I think you've demonstrated that pseudo-partitioning
doesn't pay when you have only 2 partitions.  Examine this:

         ->  Index Scan using idx_sub_foo2_partition on sub_foo2 super_foo  (cost=0.00..2.01 rows=1 width=4) (actual time=0.221..0.221 rows=0 loops=1)
               Index Cond: (partition = 1::numeric)
 Total runtime: 15670.463 ms

As you see, even though the aggregate operation requires a seq scan, the
planner is still able to scan, and discard, sub_foo2, using its index in 0.2
ms.  Unfortunately, super_foo still needs to contend with:

   ->  Append  (cost=0.00..28376.79 rows=1000064 width=4) (actual time=6.699..12072.483 rows=1000000 loops=1)

Right there, in the Append, you lose 6 seconds.   This means that
pseudo-partitioning via inheritance will become a speed gain once you can
"make up" that 6 seconds by being able to discard more partitions.   If you
want, do a test with 6 partitions instead of 2 and let us know how it comes
out.
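
For anyone trying that, a six-partition version just extends the pattern from
the original setup; sub_foo3 and the partition value 3 here are the obvious
extension, not tables from the tests above, and generate_series() assumes
PostgreSQL 8.0:

   -- one of the additional partitions; repeat for sub_foo3 .. sub_foo6
   CREATE TABLE sub_foo3 () INHERITS ( super_foo );
   INSERT INTO sub_foo3 SELECT 3, 1 FROM generate_series(1, 1000000);
   CREATE INDEX idx_subfoo3_partition ON sub_foo3 ( partition );
   ANALYZE sub_foo3 ;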

Also, keep in mind that there are other reasons to do pseudo-partitioning than
your example.  Data write performance, expiring partitions, and vacuum are
big reasons that can motivate partitioning even in cases when selects are
slower.

--
Josh Berkus
Aglio Database Solutions
San Francisco

From: "Stacy White"
Date:

Josh,

You're absolutely correct that the overhead becomes less significant as the
partitioning prunes more rows.  I can even see a two-partition table being
useful in some situations (e.g., a table divided into a relatively small
"recent data" partition and a much larger "historical data" partition).  The
break-even point is when your partitioning scheme prunes 20% of the rows,
assuming you're using the inheritance-based scheme: our full scans ran about
25% slower through the inheritance setup, and 0.8 * 1.25 = 1.

Thanks again for the reply.  So it sounds like the answer to my original
question is that it's expected that the pseudo-partitioning would introduce
a fairly significant amount of overhead.  Correct?



----- Original Message -----
From: "Josh Berkus" <>
To: <>
Cc: "Stacy White" <>
Sent: Friday, December 10, 2004 9:52 PM
Subject: Re: [PERFORM] Partitioned table performance


[snip]

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to  so that your
      message can get through to the mailing list cleanly


From: Josh Berkus
Date:

Stacy,

> Thanks again for the reply.  So it sounds like the answer to my original
> question is that it's expected that the pseudo-partitioning would introduce
> a fairly significant amount of overhead.  Correct?

Correct.   For that matter, Oracle table partitioning introduces significant
overhead, from what I've seen.  I don't think there's a way not to.

Generally, I counsel people that they only want to consider
pseudo-partitioning if they have one axis on the table which is used in 90%
or more of the queries against that table.

What would improve the situation significantly, and the utility of
pseudo-partitioning, is the ability to have a single index span multiple
partitions.   This would allow you to have a segmented index for the
partitioned axis, yet still use an unsegmented index for the other columns.
However, there's a *lot* of work to do to make that happen.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

From: Greg Stark
Date:

Josh Berkus <> writes:

> Stacy,
>
> > Thanks again for the reply.  So it sounds like the answer to my original
> > question is that it's expected that the pseudo-partitioning would introduce
> > a fairly significant amount of overhead.  Correct?
>
> Correct.   For that matter, Oracle table partitioning introduces significant
> overhead, from what I've seen.  I don't think there's a way not to.

Well Oracle has lots of partitioning intelligence pushed up to the planner to
avoid overhead.

If you have a query with something like "WHERE date = '2004-01-01'" and date
is your partition key (even if it's a range) then Oracle will figure out which
partition it will need at planning time.

Even if your query is something like "WHERE date = ?" then Oracle will still
recognize that it will only need a single partition at planning time, though
it has to decide which partition at execution time.

We didn't notice any run-time performance degradation when we went to
partitioned tables.  Maybe we were just blinded by the joy they brought us on
the maintenance side, though; I don't think we specifically checked for
run-time consequences.

But I'm a bit puzzled. Why would Append have any significant cost? It's just
taking the tuples from one plan node and returning them until they run out,
then taking the tuples from another plan node. It should have no i/o cost and
hardly any cpu cost. Where is the time going?

> What would improve the situation significantly, and the utility of
> pseudo-partitioning, is the ability to have a single index span multiple
> partitions.   This would allow you to have a segmented index for the
> partitioned axis, yet still use an unsegmented index for the other columns.
> However, there's a *lot* of work to do to make that happen.

In my experience "global indexes" defeat the whole purpose of having the
partitions. They make dropping and adding partitions expensive which was
always the reason we wanted to partition something anyways.

It is handy having a higher level interface to deal with partitioned tables.
You can create a single "local" or "segmented" index and not have to manually
deal with all the partitions as separate tables. But that's just syntactic
sugar.

--
greg

From: Josh Berkus
Date:

Greg,

> Well Oracle has lots of partitioning intelligence pushed up to the planner
> to avoid overhead.
>
> If you have a query with something like "WHERE date = '2004-01-01'" and
> date is your partition key (even if it's a range) then Oracle will figure
> out which partition it will need at planning time.

Hmmm ... well, we're looking at making a spec for Postgres Table Partitioning.
Maybe you could help?

> But I'm a bit puzzled. Why would Append have any significant cost? It's
> just taking the tuples from one plan node and returning them until they run
> out, then taking the tuples from another plan node. It should have no i/o
> cost and hardly any cpu cost. Where is the time going?

Beats me.  Tom?

> In my experience "global indexes" defeat the whole purpose of having the
> partitions. They make dropping and adding partitions expensive which was
> always the reason we wanted to partition something anyways.

Hmmm.  Possibly.  I was just thinking about the cost to partitioned tables
when you do a selection *not* on the partitioned axis.  Also, currently we
can't enforce UNIQUE constraints across partitions.

But maybe reducing the cost of Append is the answer to this.

> It is handy having a higher level interface to deal with partitioned
> tables. You can create a single "local" or "segmented" index and not have
> to manually deal with all the partitions as separate tables. But that's
> just syntactic sugar.

Right, and the easy part.

--
Josh Berkus
Aglio Database Solutions
San Francisco

From: Greg Stark
Date:

Josh Berkus <> writes:

> > But I'm a bit puzzled. Why would Append have any significant cost? It's
> > just taking the tuples from one plan node and returning them until they run
> > out, then taking the tuples from another plan node. It should have no i/o
> > cost and hardly any cpu cost. Where is the time going?
>
> Beats me.  Tom?
>
> > In my experience "global indexes" defeat the whole purpose of having the
> > partitions. They make dropping and adding partitions expensive which was
> > always the reason we wanted to partition something anyways.
>
> Hmmm.    Possibly, I was just thinking about the cost to partitioned tables
> when you do a selection *not* on the partitioned axis.   Also that currently
> we can't enforce UNIQUE constraints across partitions.

Like I said though, we found "global indexes" defeated the whole purpose. That
meant no global UNIQUE constraints for us when we went to partitioned tables.
It gave the DBAs the willies but it really wasn't a big deal.

You can still do unique local indexes on a specific partition. So as long as
your partition key is in the primary key you can have a trustworthy primary
key.

And even if not, you usually find you're loading data into only one
partition.  In most applications it's pretty hard to get a record from two
different partitions with conflicting IDs, and not hard to check for.  You
could easily put a constraint saying that all PO numbers in the new fiscal
year have to be greater than the last PO number from last year, for example.
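
A sketch of that constraint idea, with hypothetical table and column names
(po_2003, po_2004, po_number) and an arbitrary cutoff value:

   -- hypothetical yearly partitions with non-overlapping PO ranges
   ALTER TABLE po_2003 ADD CONSTRAINT po_2003_range
      CHECK ( po_number <  500000 );
   ALTER TABLE po_2004 ADD CONSTRAINT po_2004_range
      CHECK ( po_number >= 500000 );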

> But maybe reducing the cost of Append is the answer to this.

The problem with global indexes is that adding or removing an entire partition
becomes a large job. [Actually with Postgres MVCC I suppose removing might
not. But cleaning up would eventually be a large job, and the point remains
for adding a partition.]

Ideally adding and removing a partition should be a O(1) operation. No data
modification at all, purely catalog changes.
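
For reference, PostgreSQL 8.2 and later spell exactly this operation as ALTER
TABLE ... INHERIT / NO INHERIT, a pure catalog change; the sketch below
assumes that version:

   -- attach and detach partitions without touching the data (8.2+)
   CREATE TABLE sub_foo3 ( LIKE super_foo );
   ALTER TABLE sub_foo3 INHERIT super_foo;     -- attach
   ALTER TABLE sub_foo2 NO INHERIT super_foo;  -- detach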

> > It is handy having a higher level interface to deal with partitioned
> > tables. You can create a single "local" or "segmented" index and not have
> > to manually deal with all the partitions as separate tables. But that's
> > just syntactic sugar.
>
> Right, and the easy part.

I think the hard part lies in the optimizer actually. The semantics of the
operations to manipulate partitions might be tricky to get right but the
coding should be straightforward. Having the optimizer be able to recognize
when it can prune partitions will be a lot of work.

--
greg

From: Tom Lane
Date:

Greg Stark <> writes:
> But I'm a bit puzzled. Why would Append have any significant cost? It's just
> taking the tuples from one plan node and returning them until they run out,
> then taking the tuples from another plan node. It should have no i/o cost and
> hardly any cpu cost. Where is the time going?

As best I can tell by profiling, the cost of the Append node per se is
indeed negligible --- no more than a couple percent of the runtime in
CVS tip for a test case similar to Stacy White's example.

It looks bad in EXPLAIN ANALYZE, but you have to realize that passing
the tuples up through the Append node doubles the instrumentation
overhead of EXPLAIN ANALYZE, which is pretty sizable already.  (If you
turn on \timing in psql and try the query itself vs. EXPLAIN ANALYZE,
the actual elapsed time is about double, at least for me.)

The other effect, which I hadn't expected, is that the seqscans
themselves actually slow down.  I get

regression=# explain analyze SELECT COUNT(*), MAX(bar1) FROM super_foo ;
                                                               QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16414.32..16414.32 rows=1 width=4) (actual time=32313.980..32313.988 rows=1 loops=1)
   ->  Append  (cost=0.00..13631.54 rows=556555 width=4) (actual time=0.232..21848.401 rows=524289 loops=1)
         ->  Seq Scan on super_foo  (cost=0.00..0.00 rows=1 width=4) (actual time=0.020..0.020 rows=0 loops=1)
         ->  Seq Scan on sub_foo1 super_foo  (cost=0.00..6815.77 rows=278277 width=4) (actual time=0.187..6926.395 rows=262144 loops=1)
         ->  Seq Scan on sub_foo2 super_foo  (cost=0.00..6815.77 rows=278277 width=4) (actual time=0.168..7026.953 rows=262145 loops=1)
 Total runtime: 32314.993 ms
(6 rows)

regression=# explain analyze SELECT COUNT(*), MAX(bar1) FROM sub_foo1;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8207.16..8207.16 rows=1 width=4) (actual time=9850.420..9850.428 rows=1 loops=1)
   ->  Seq Scan on sub_foo1  (cost=0.00..6815.77 rows=278277 width=4) (actual time=0.202..4642.401 rows=262144 loops=1)
 Total runtime: 9851.423 ms
(3 rows)

Notice the actual times for the sub_foo1 seqscans.  That increase (when
counted for both input tables) almost exactly accounts for the
difference in non-EXPLAIN ANALYZE runtime.

After digging around, I find that the reason for the difference is that
the optimization to avoid a projection step (ExecProject) isn't applied
for scans of inheritance unions:

    /*
     * Can't do it with inheritance cases either (mainly because Append
     * doesn't project).
     */
    if (rel->reloptkind != RELOPT_BASEREL)
        return false;

So if you were to try the example in a pre-7.4 PG, which didn't have
that optimization, you'd probably find that the speeds were just about
the same.  (I'm too lazy to verify this though.)

I looked briefly at what it would take to cover this case, and decided
that it's a nontrivial change, so it's too late to do something about it
for 8.0.  I think it's probably possible to fix it though, at least for
cases where the child tables have rowtypes identical to the parent.

            regards, tom lane

From: "Jim C. Nasby"
Date:

Sorry for the late reply.  Should this be a TODO?

On Wed, Dec 15, 2004 at 08:30:08PM -0500, Tom Lane wrote:
> Greg Stark <> writes:
> > But I'm a bit puzzled. Why would Append have any significant cost? It's just
> > taking the tuples from one plan node and returning them until they run out,
> > then taking the tuples from another plan node. It should have no i/o cost and
> > hardly any cpu cost. Where is the time going?
>
> As best I can tell by profiling, the cost of the Append node per se is
> indeed negligible --- no more than a couple percent of the runtime in
> CVS tip for a test case similar to Stacy White's example.
>
> It looks bad in EXPLAIN ANALYZE, but you have to realize that passing
> the tuples up through the Append node doubles the instrumentation
> overhead of EXPLAIN ANALYZE, which is pretty sizable already.  (If you
> turn on \timing in psql and try the query itself vs. EXPLAIN ANALYZE,
> the actual elapsed time is about double, at least for me.)
>
> The other effect, which I hadn't expected, is that the seqscans
> themselves actually slow down.  I get
>
> regression=# explain analyze SELECT COUNT(*), MAX(bar1) FROM super_foo ;
>                                                                QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=16414.32..16414.32 rows=1 width=4) (actual time=32313.980..32313.988 rows=1 loops=1)
>    ->  Append  (cost=0.00..13631.54 rows=556555 width=4) (actual time=0.232..21848.401 rows=524289 loops=1)
>          ->  Seq Scan on super_foo  (cost=0.00..0.00 rows=1 width=4) (actual time=0.020..0.020 rows=0 loops=1)
>          ->  Seq Scan on sub_foo1 super_foo  (cost=0.00..6815.77 rows=278277 width=4) (actual time=0.187..6926.395
rows=262144loops=1) 
>          ->  Seq Scan on sub_foo2 super_foo  (cost=0.00..6815.77 rows=278277 width=4) (actual time=0.168..7026.953
rows=262145loops=1) 
>  Total runtime: 32314.993 ms
> (6 rows)
>
> regression=# explain analyze SELECT COUNT(*), MAX(bar1) FROM sub_foo1;
>                                                        QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=8207.16..8207.16 rows=1 width=4) (actual time=9850.420..9850.428 rows=1 loops=1)
>    ->  Seq Scan on sub_foo1  (cost=0.00..6815.77 rows=278277 width=4) (actual time=0.202..4642.401 rows=262144
loops=1)
>  Total runtime: 9851.423 ms
> (3 rows)
>
> Notice the actual times for the sub_foo1 seqscans.  That increase (when
> counted for both input tables) almost exactly accounts for the
> difference in non-EXPLAIN ANALYZE runtime.
>
> After digging around, I find that the reason for the difference is that
> the optimization to avoid a projection step (ExecProject) isn't applied
> for scans of inheritance unions:
>
>     /*
>      * Can't do it with inheritance cases either (mainly because Append
>      * doesn't project).
>      */
>     if (rel->reloptkind != RELOPT_BASEREL)
>         return false;
>
> So if you were to try the example in a pre-7.4 PG, which didn't have
> that optimization, you'd probably find that the speeds were just about
> the same.  (I'm too lazy to verify this though.)
>
> I looked briefly at what it would take to cover this case, and decided
> that it's a nontrivial change, so it's too late to do something about it
> for 8.0.  I think it's probably possible to fix it though, at least for
> cases where the child tables have rowtypes identical to the parent.
>
>             regards, tom lane
>

--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

From:
"Jim C. Nasby"
Date:

On Wed, Dec 15, 2004 at 11:56:40AM -0800, Josh Berkus wrote:
> Greg,
>
> > Well Oracle has lots of partitioning intelligence pushed up to the planner
> > to avoid overhead.
> >
> > If you have a query with something like "WHERE date = '2004-01-01'" and
> > date is your partition key (even if it's a range) then Oracle will figure
> > out which partition it will need at planning time.
>
> Hmmm ... well, we're looking at making a spec for Postgres Table Partitioning.
> Maybe you could help?

This is something I've been thinking about doing for
http://stats.distributed.net; is there a formal project for this
somewhere?

On a different note, has anyone looked at the savings you get by
omitting the partition field from the child tables? ISTM that the
savings would be substantial for narrow tables. Of course, that most
likely means doing a union view instead of inheritance, but I'm guessing
here. The table I'm thinking of partitioning is quite narrow (see
below), so I suspect that dropping project_id out would result in a
substantial savings (there's basically nothing that ever queries across
the whole table). With the data distribution, I suspect just breaking
project IDs 205, 5, and 25 into partitioned tables that didn't contain
project_id would save about 450MB (4 bytes * 95% * 130M rows).

(the table has ~130M rows)
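A rough sketch of what that might look like as SQL (the child-table and
view names here are invented, and only the three largest projects are
split out):

```sql
-- One child table per heavily-queried project, with project_id dropped:
CREATE TABLE email_contrib_p205 (
    id         integer NOT NULL,
    date       date    NOT NULL,
    team_id    integer,
    work_units bigint  NOT NULL
);
CREATE TABLE email_contrib_p5  ( LIKE email_contrib_p205 );
CREATE TABLE email_contrib_p25 ( LIKE email_contrib_p205 );

-- The view adds project_id back as a constant, so the 4 bytes per row
-- are paid only at query time, not in storage:
CREATE VIEW email_contrib_v AS
          SELECT 205 AS project_id, * FROM email_contrib_p205
UNION ALL SELECT 5,                 * FROM email_contrib_p5
UNION ALL SELECT 25,                * FROM email_contrib_p25;
```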

   Table "public.email_contrib"
   Column   |  Type   | Modifiers
------------+---------+-----------
 project_id | integer | not null
 id         | integer | not null
 date       | date    | not null
 team_id    | integer |
 work_units | bigint  | not null
Indexes:
    "email_contrib_pkey" primary key, btree (project_id, id, date)
    "email_contrib__pk24" btree (id, date) WHERE (project_id = 24)
    "email_contrib__pk25" btree (id, date) WHERE (project_id = 25)
    "email_contrib__pk8" btree (id, date) WHERE (project_id = 8)
    "email_contrib__project_date" btree (project_id, date)
Foreign-key constraints:
    "fk_email_contrib__id" FOREIGN KEY (id) REFERENCES stats_participant(id) ON UPDATE CASCADE
    "fk_email_contrib__team_id" FOREIGN KEY (team_id) REFERENCES stats_team(team) ON UPDATE CASCADE

stats=# select * from pg_stats where tablename='email_contrib' and attname='project_id';
 schemaname |   tablename   |  attname   | null_frac | avg_width | n_distinct | most_common_vals  |                    most_common_freqs                    | histogram_bounds | correlation
------------+---------------+------------+-----------+-----------+------------+-------------------+---------------------------------------------------------+------------------+-------------
 public     | email_contrib | project_id |         0 |         4 |          6 | {205,5,25,8,24,3} | {0.461133,0.4455,0.0444333,0.0418667,0.0049,0.00216667} |                  |    0.703936

From:
"Stacy White"
Date:

The discussion seems to have diverged a little, so I don't feel too bad
about making some semi-off-topic comments.

From: "Greg Stark" <>
> Like I said though, we found "global indexes" defeated the whole purpose.

First semi-off-topic comment: I think this depends on the index, the data,
and the goal of the partitioning.  We use partitioning on one of our Oracle
projects for performance rather than manageability.  In this case, a global
index on a non-partitioned field can be helpful.

Imagine an 'orders' table with 100 partitions, one per week.  Products have
a short life cycle, and are typically around for only a week or two.  A query
like 'SELECT * FROM orders WHERE product_no = ?' forces a lookup on 100
different local indexes, but only one lookup on a global index.
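In rough Oracle-ish syntax (an invented schema, just to illustrate; only
two of the ~100 weekly partitions are shown):

```sql
CREATE TABLE orders (
    order_week DATE   NOT NULL,
    product_no NUMBER NOT NULL,
    quantity   NUMBER
)
PARTITION BY RANGE ( order_week ) (
    PARTITION orders_2004w01 VALUES LESS THAN ( DATE '2004-01-08' ),
    PARTITION orders_2004w02 VALUES LESS THAN ( DATE '2004-01-15' )
);

-- A LOCAL index is one b-tree per partition, so this query has to
-- probe all ~100 of them, since product_no says nothing about the week:
CREATE INDEX idx_orders_product ON orders ( product_no ) LOCAL;

-- A GLOBAL index (the default, i.e. without LOCAL) is a single b-tree
-- over the whole table, so the same query is a single index lookup:
SELECT * FROM orders WHERE product_no = :prod;
```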


Second semi-off-topic comment: Josh mentioned that Oracle's partitioning
introduces its own overhead, so I re-ran my earlier benchmarks on one of
our Oracle machines.

I believe Oracle's licensing agreement prohibits me from posting any
benchmarks, so all I'll say is that Postgres' inheritance partitioning
implementation is _very_ low overhead, and even union views are competitive.