Обсуждение: Poor performance when joining against inherited tables

Поиск
Список
Период
Сортировка

Poor performance when joining against inherited tables

От
Lucas Madar
Дата:
I have a database that contains many tables, each with some common
characteristics. For legacy reasons, they have to be implemented in a
way so that they are *all* searchable by an older identifier to find the
newer identifier. To do this, we've used table inheritance.

Each entry has an id, as well as a legacyid1 and legacyid2. There's a
master table that the application uses, containing a base representation
and common characteristics:

objects ( id, ... )
item ( id, legacyid1, legacyid2 )
  | - itemXX
  | - itemYY

There is nothing at all in the item table, it's just used for
inheritance. However, weird things happen when this table is joined:

EXPLAIN ANALYZE SELECT * FROM objects INNER JOIN item f USING ( id );

  QUERY PLAN
------------
  Hash Join  (cost=457943.85..1185186.17 rows=8643757 width=506)
    Hash Cond: (f.id = objects.id)
    ->  Append  (cost=0.00..224458.57 rows=8643757 width=20)
          ->  Seq Scan on item f  (cost=0.00..26.30 rows=1630 width=20)
          ->  Seq Scan on itemXX f  (cost=0.00..1.90 rows=90 width=20)
          ->  Seq Scan on itemYY f  (cost=0.00..7.66 rows=266 width=20)
          ->  Seq Scan on itemZZ f  (cost=0.00..1.02 rows=2 width=20)
          ...
    ->  Hash  (cost=158447.49..158447.49 rows=3941949 width=490)
          ->  Seq Scan on objects  (cost=0.00..158447.49 rows=3941949
width=490)

This scans everything over everything, and obviously takes forever
(there are millions of rows in the objects table, and tens of thousands
in each itemXX table).

However, if I disable seqscan (set enable_seqscan=false), I get the
following plan:

  QUERY PLAN
------------
  Hash Join  (cost=10001298843.53..290002337961.71 rows=8643757 width=506)
    Hash Cond: (f.id = objects.id)
    ->  Append  (cost=10000000000.00..290000536334.43 rows=8643757 width=20)
          ->  Seq Scan on item f  (cost=10000000000.00..10000000026.30
rows=1630 width=20)
          ->  Index Scan using xxx_pkey on itemXX f  (cost=0.00..10.60
rows=90 width=20)
          ->  Index Scan using yyy_pkey on itemYY f  (cost=0.00..25.24
rows=266 width=20)
          ->  Index Scan using zzz_pkey on itemZZ f  (cost=0.00..9.28
rows=2 width=20)
          ...
    ->  Hash  (cost=999347.17..999347.17 rows=3941949 width=490)
          ->  Index Scan using objects_pkey on objects
(cost=0.00..999347.17 rows=3941949 width=490)

This seems like a much more sensible query plan. But it seems to think
doing a sequential scan on the *empty* item table is excessively
expensive in this case.

Aside from enable_seqscan=false, is there any way I can make the query
planner not balk over doing a seqscan on an empty table?

Thanks,
Lucas Madar


Re: Poor performance when joining against inherited tables

От
Shaun Thomas
Дата:
On 04/11/2011 03:11 PM, Lucas Madar wrote:

> EXPLAIN ANALYZE SELECT * FROM objects INNER JOIN item f USING ( id );
>
> This scans everything over everything, and obviously takes forever
> (there are millions of rows in the objects table, and tens of thousands
> in each itemXX table).

What is your constraint_exclusion setting? This needs to be 'ON' for the
check constraints you use to enforce your inheritance rules to work right.

You *do* have check constraints on all your child tables, right? Just in
case, please refer to the doc on table partitioning:

http://www.postgresql.org/docs/current/static/ddl-partitioning.html

Also, your example has no where clause. Without a where clause,
constraint exclusion won't even function. How is the database supposed
to know that matching a 4M row table against several partitioned tables
will result in few matches? All it really has are stats on your joined
id for this particular query, and you're basically telling to join all
of them. That usually calls for a sequence scan, because millions of
index seeks will almost always be slower than a few sequence scans.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Poor performance when joining against inherited tables

От
Robert Haas
Дата:
On Mon, Apr 11, 2011 at 4:11 PM, Lucas Madar <madar@samsix.com> wrote:
> I have a database that contains many tables, each with some common
> characteristics. For legacy reasons, they have to be implemented in a way so
> that they are *all* searchable by an older identifier to find the newer
> identifier. To do this, we've used table inheritance.
>
> Each entry has an id, as well as a legacyid1 and legacyid2. There's a master
> table that the application uses, containing a base representation and common
> characteristics:
>
> objects ( id, ... )
> item ( id, legacyid1, legacyid2 )
>  | - itemXX
>  | - itemYY
>
> There is nothing at all in the item table, it's just used for inheritance.
> However, weird things happen when this table is joined:
>
> EXPLAIN ANALYZE SELECT * FROM objects INNER JOIN item f USING ( id );
>
>  QUERY PLAN
> ------------
>  Hash Join  (cost=457943.85..1185186.17 rows=8643757 width=506)
>   Hash Cond: (f.id = objects.id)
>   ->  Append  (cost=0.00..224458.57 rows=8643757 width=20)
>         ->  Seq Scan on item f  (cost=0.00..26.30 rows=1630 width=20)
>         ->  Seq Scan on itemXX f  (cost=0.00..1.90 rows=90 width=20)
>         ->  Seq Scan on itemYY f  (cost=0.00..7.66 rows=266 width=20)
>         ->  Seq Scan on itemZZ f  (cost=0.00..1.02 rows=2 width=20)
>         ...
>   ->  Hash  (cost=158447.49..158447.49 rows=3941949 width=490)
>         ->  Seq Scan on objects  (cost=0.00..158447.49 rows=3941949
> width=490)
>
> This scans everything over everything, and obviously takes forever (there
> are millions of rows in the objects table, and tens of thousands in each
> itemXX table).
>
> However, if I disable seqscan (set enable_seqscan=false), I get the
> following plan:
>
>  QUERY PLAN
> ------------
>  Hash Join  (cost=10001298843.53..290002337961.71 rows=8643757 width=506)
>   Hash Cond: (f.id = objects.id)
>   ->  Append  (cost=10000000000.00..290000536334.43 rows=8643757 width=20)
>         ->  Seq Scan on item f  (cost=10000000000.00..10000000026.30
> rows=1630 width=20)
>         ->  Index Scan using xxx_pkey on itemXX f  (cost=0.00..10.60 rows=90
> width=20)
>         ->  Index Scan using yyy_pkey on itemYY f  (cost=0.00..25.24
> rows=266 width=20)
>         ->  Index Scan using zzz_pkey on itemZZ f  (cost=0.00..9.28 rows=2
> width=20)
>         ...
>   ->  Hash  (cost=999347.17..999347.17 rows=3941949 width=490)
>         ->  Index Scan using objects_pkey on objects (cost=0.00..999347.17
> rows=3941949 width=490)
>
> This seems like a much more sensible query plan.

I don't think so.  Scanning the index to extract all the rows in a
table is typically going to be a lot slower than a sequential scan.

A more interesting question is why you're not getting a plan like this:

Nested Loop
-> Seq Scan on objects
-> Append
   ->  Index Scan using xxx_pkey on itemXX
   ->  Index Scan using yyy_pkey on itemYY
   ->  Index Scan using zzz_pkey on itemZZ

> But it seems to think doing
> a sequential scan on the *empty* item table is excessively expensive in this
> case.
>
> Aside from enable_seqscan=false, is there any way I can make the query
> planner not balk over doing a seqscan on an empty table?

Why would you care?  A sequential scan of an empty table is very fast.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Poor performance when joining against inherited tables

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> A more interesting question is why you're not getting a plan like this:

> Nested Loop
> -> Seq Scan on objects
> -> Append
>    ->  Index Scan using xxx_pkey on itemXX
>    ->  Index Scan using yyy_pkey on itemYY
>    ->  Index Scan using zzz_pkey on itemZZ

Probably because there are 4 million rows in the objects table.

Or maybe it's a pre-8.2 database and can't even generate such a plan.
But if it did generate it, it would almost certainly have decided that
this was more expensive than a hash or merge join.

People have this weird idea that the existence of an index ought to make
enormous joins free ...

            regards, tom lane

Re: Poor performance when joining against inherited tables

От
Lucas Madar
Дата:
On 05/11/2011 09:38 AM, Robert Haas wrote:
>> However, if I disable seqscan (set enable_seqscan=false), I get the
>> following plan:
>>
>>   QUERY PLAN
>> ------------
>>   Hash Join  (cost=10001298843.53..290002337961.71 rows=8643757 width=506)
>>    Hash Cond: (f.id = objects.id)
>>    ->    Append  (cost=10000000000.00..290000536334.43 rows=8643757 width=20)
>>          ->    Seq Scan on item f  (cost=10000000000.00..10000000026.30
>> rows=1630 width=20)
>>          ->    Index Scan using xxx_pkey on itemXX f  (cost=0.00..10.60 rows=90
>> width=20)
>>          ->    Index Scan using yyy_pkey on itemYY f  (cost=0.00..25.24
>> rows=266 width=20)
>>          ->    Index Scan using zzz_pkey on itemZZ f  (cost=0.00..9.28 rows=2
>> width=20)
>>          ...
>>    ->    Hash  (cost=999347.17..999347.17 rows=3941949 width=490)
>>          ->    Index Scan using objects_pkey on objects (cost=0.00..999347.17
>> rows=3941949 width=490)
>>
>> This seems like a much more sensible query plan.
> I don't think so.  Scanning the index to extract all the rows in a
> table is typically going to be a lot slower than a sequential scan.
>
> A more interesting question is why you're not getting a plan like this:
>
> Nested Loop
> ->  Seq Scan on objects
> ->  Append
>     ->   Index Scan using xxx_pkey on itemXX
>     ->   Index Scan using yyy_pkey on itemYY
>     ->   Index Scan using zzz_pkey on itemZZ

Compared to the previous query plan (omitted in this e-mail, in which
the planner was scanning all the item tables sequentially), the second
query is much more desirable. It takes about 12 seconds to complete,
versus the other query which I canceled after six hours. However, what
you propose seems to make even more sense.

>> But it seems to think doing
>> a sequential scan on the *empty* item table is excessively expensive in this
>> case.
>>
>> Aside from enable_seqscan=false, is there any way I can make the query
>> planner not balk over doing a seqscan on an empty table?
> Why would you care?  A sequential scan of an empty table is very fast.
>
My issue is that it looks like it's avoiding the sequential scan:

Seq Scan on item f  (cost=10000000000.00..10000000026.30 rows=1630 width=20)

It says the sequential scan has a cost that's way too high, and I'm
presuming that's why it's choosing the extremely slow plan over the much
faster plan. I don't know very much about plans, but I'm assuming the
planner chooses the plan with the lowest cost.

I'd much prefer it *does* the sequential scan of the empty table and
goes with the other parts of the plan.

Thanks,
Lucas Madar

Re: Poor performance when joining against inherited tables

От
Maciek Sakrejda
Дата:
> It says the sequential scan has a cost that's way too high, and I'm
> presuming that's why it's choosing the extremely slow plan over the much
> faster plan.

Well, not exactly. It's giving you that cost because you disabled
seqscan, which actually just bumps the cost really high:

postgres=# create temporary table foo as select generate_series(1,3);
SELECT
postgres=# explain analyze select * from foo;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..34.00 rows=2400 width=4) (actual
time=0.010..0.012 rows=3 loops=1)
 Total runtime: 2.591 ms
(2 rows)

postgres=# set enable_seqscan to false;
SET
postgres=# explain analyze select * from foo;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=10000000000.00..10000000034.00 rows=2400
width=4) (actual time=0.004..0.007 rows=3 loops=1)
 Total runtime: 0.037 ms
(2 rows)


As far as I know, there is no hard way to disable any given plan
option, since sometimes that may be the only choice.

The (estimated) cost of the seq scan chosen here is *not* the same as
the cost of the scan when the planner actually considers this plan (in
fact, that will the same as the one in the first plan).

However, note the cost of the Index Scan nodes in the second plan:
they are *higher* than their corresponding Seq Scan nodes (in the
first plan), which is why you get the first plan when seq can *is*
enabled.

Also, your plan output looks like plain EXPLAIN and not EXPLAIN
ANALYZE (i.e., the "actual time" nodes are missing).

Other than that, I think Shaun's comments apply.

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

Re: Poor performance when joining against inherited tables

От
Robert Haas
Дата:
On Wed, May 11, 2011 at 4:47 PM, Lucas Madar <madar@samsix.com> wrote:
> On 05/11/2011 09:38 AM, Robert Haas wrote:
>>>
>>> However, if I disable seqscan (set enable_seqscan=false), I get the
>>> following plan:
>>>
>>>  QUERY PLAN
>>> ------------
>>>  Hash Join  (cost=10001298843.53..290002337961.71 rows=8643757 width=506)
>>>   Hash Cond: (f.id = objects.id)
>>>   ->    Append  (cost=10000000000.00..290000536334.43 rows=8643757
>>> width=20)
>>>         ->    Seq Scan on item f  (cost=10000000000.00..10000000026.30
>>> rows=1630 width=20)
>>>         ->    Index Scan using xxx_pkey on itemXX f  (cost=0.00..10.60
>>> rows=90
>>> width=20)
>>>         ->    Index Scan using yyy_pkey on itemYY f  (cost=0.00..25.24
>>> rows=266 width=20)
>>>         ->    Index Scan using zzz_pkey on itemZZ f  (cost=0.00..9.28
>>> rows=2
>>> width=20)
>>>         ...
>>>   ->    Hash  (cost=999347.17..999347.17 rows=3941949 width=490)
>>>         ->    Index Scan using objects_pkey on objects
>>> (cost=0.00..999347.17
>>> rows=3941949 width=490)
>>>
>>> This seems like a much more sensible query plan.
>>
>> I don't think so.  Scanning the index to extract all the rows in a
>> table is typically going to be a lot slower than a sequential scan.
>>
>
> Compared to the previous query plan (omitted in this e-mail, in which the
> planner was scanning all the item tables sequentially), the second query is
> much more desirable. It takes about 12 seconds to complete, versus the other
> query which I canceled after six hours. However, what you propose seems to
> make even more sense.

I was just looking at this email again, and had another thought:
perhaps the tables in question are badly bloated.  In your situation,
it seems that the plan didn't change much when you set
enable_seqscan=off: it just replaced full-table seq-scans with
full-table index-scans, which should be slower.  But if you have a
giant table that's mostly empty space, then following the index
pointers to the limited number of blocks that contain any useful data
might be faster than scanning all the empty space.  If you still have
these tables around somewhere, you could test this hypothesis by
running CLUSTER on all the tables and see whether the seq-scan gets
faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company