Обсуждение: Joining two large tables on a tiny subset of rows

Поиск
Список
Период
Сортировка

Joining two large tables on a tiny subset of rows

От
"Dmitri Bichko"
Дата:
Hello,

I have two tables, one has a foreing key from the other (only showing
the relevant columns and indices here):
            Table "expresso.probes"  Column    |          Type          | Modifiers
-------------+------------------------+-----------platform_id | integer                | not nullprobe_num   | integer
             | not nullmrna_acc    | character varying(50)  | 
Indexes:   "idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
IS NOT NULL
          Table "expresso.mrna_info"  Column    |         Type          | Modifiers
-------------+-----------------------+-----------mrna_acc    | character varying(25) | not nullsymbol      | character
varying(50)| not null 
Indexes:   "idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
NOT NULL

I want to select all the records in "probes" which have an "mrna_acc"
that is associated with a specific symbol in "mrna_info".

This works fine if I do it in two steps: select from "mrna_info" (12
records out of ~250 thousand) and paste those into an IN clause to
select from "probes" (83 records out of ~4.5 million).

But if I try to do this as one query I always end up with a seq scan on
"probes".  I've tried doing it with joins, subselects, joining with a
dynamic table (select from () as foo), nothing seems to get the desired
result.

Here are the explains for the two step process:

expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
upper(symbol) = upper('pparg') AND symbol IS NOT NULL;
QUERYPLAN 
------------------------------------------------------------------------
---------------------------------------------------------------Index Scan using idx_mrna_info_symbol on mrna_info
(cost=0.00..2934.78
rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)  Index Cond: (upper((symbol)::text) = 'PPARG'::text)
Filter:(symbol IS NOT NULL)Total runtime: 0.123 ms 
(4 rows)

EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
IN
('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
NOT NULL;
QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------------------------------Index Scan using idx_probes_mrna_acc,
idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc on probes  (cost=0.00..14710.63 rows=4151 width=8)
(actual time=0.040..0.719 rows=142 loops=1)  Index Cond: (((mrna_acc)::text = 'U10374'::text) OR ((mrna_acc)::text
= 'U09138'::text) OR ((mrna_acc)::text = 'U01841'::text) OR
((mrna_acc)::text = 'U01664'::text) OR ((mrna_acc)::text =
'NM_015869'::text) OR ((mrna_acc)::text = 'NM_013124'::text) OR
((mrna_acc)::text = 'NM_011146'::text) OR ((mrna_acc)::text =
'NM_005037'::text) OR ((mrna_acc)::text = 'D83233'::text) OR
((mrna_acc)::text = 'BC021798'::text) OR ((mrna_acc)::text =
'BC006811'::text) OR ((mrna_acc)::text = 'AB011365'::text))  Filter: (mrna_acc IS NOT NULL)Total runtime: 0.844 ms
(4 rows)

And here is the explain of the join (it's essentially the same plan as
the subselect and all the other ways I've tried):

expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;                                               QUERY PLAN
------------------------------------------------------------------------
----------------------------------Hash Join  (cost=2937.10..133392.04 rows=9939 width=8)  Hash Cond:
(("outer".mrna_acc)::text= ("inner".mrna_acc)::text)  ->  Seq Scan on probes p  (cost=0.00..115434.22 rows=2984265 
width=19)        Filter: (mrna_acc IS NOT NULL)  ->  Hash  (cost=2934.78..2934.78 rows=930 width=12)        ->  Index
Scanusing idx_mrna_info_symbol on mrna_info m 
(cost=0.00..2934.78 rows=930 width=12)              Index Cond: (upper((symbol)::text) = 'PPARG'::text)
Filter:(symbol IS NOT NULL) 
(8 rows)

What am I doing wrong?

Dmitri

The information transmitted is intended only for the person or entity to which it is addressed and may contain
confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any
actionin reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you
receivedthis in error, please contact the sender and delete the material from any computer 


Re: Joining two large tables on a tiny subset of rows

От
Gnanavel S
Дата:


On 7/27/05, Dmitri Bichko <dbichko@aveopharma.com> wrote:
Hello,

I have two tables, one has a foreing key from the other (only showing
the relevant columns and indices here):

             Table "expresso.probes"
   Column    |          Type          | Modifiers
-------------+------------------------+-----------
platform_id | integer                | not null
probe_num   | integer                | not null
mrna_acc    | character varying(50)  |
Indexes:
    "idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
IS NOT NULL

           Table "expresso.mrna_info"
   Column    |         Type          | Modifiers
-------------+-----------------------+-----------
mrna_acc    | character varying(25) | not null
symbol      | character varying(50) | not null
Indexes:
    "idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
NOT NULL

I want to select all the records in "probes" which have an "mrna_acc"
that is associated with a specific symbol in "mrna_info".

This works fine if I do it in two steps: select from "mrna_info" (12
records out of ~250 thousand) and paste those into an IN clause to
select from "probes" (83 records out of ~4.5 million).

But if I try to do this as one query I always end up with a seq scan on
"probes".  I've tried doing it with joins, subselects, joining with a
dynamic table (select from () as foo), nothing seems to get the desired
result.

Here are the explains for the two step process:

expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
upper(symbol) = upper('pparg') AND symbol IS NOT NULL;
                                                              QUERY PLAN
------------------------------------------------------------------------
---------------------------------------------------------------
Index Scan using idx_mrna_info_symbol on mrna_info  (cost=0.00..2934.78
rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)
   Index Cond: (upper((symbol)::text) = 'PPARG'::text)
   Filter: (symbol IS NOT NULL)
Total runtime: 0.123 ms
(4 rows)

Whether any records are returned by this query. Moreover "symbol is not null" condition is not needed as that column cannot be null by table definition.

EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
IN
('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
NOT NULL;

QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------------------------------
Index Scan using idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc on probes  (cost=0.00..14710.63 rows=4151 width=8)
(actual time=0.040..0.719 rows=142 loops=1)
   Index Cond: (((mrna_acc)::text = 'U10374'::text) OR ((mrna_acc)::text
= 'U09138'::text) OR ((mrna_acc)::text = 'U01841'::text) OR
((mrna_acc)::text = 'U01664'::text) OR ((mrna_acc)::text =
'NM_015869'::text) OR ((mrna_acc)::text = 'NM_013124'::text) OR
((mrna_acc)::text = 'NM_011146'::text) OR ((mrna_acc)::text =
'NM_005037'::text) OR ((mrna_acc)::text = 'D83233'::text) OR
((mrna_acc)::text = 'BC021798'::text) OR ((mrna_acc)::text =
'BC006811'::text) OR ((mrna_acc)::text = 'AB011365'::text))
   Filter: (mrna_acc IS NOT NULL)
Total runtime: 0.844 ms
(4 rows)

And here is the explain of the join (it's essentially the same plan as
the subselect and all the other ways I've tried):

expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;

Here m.mrna_acc cannot be null, so condition on p.mrna_acc is not null is useless as m.mrna_acc = p.mrna_acc (for null) will not give any records
Try to  use outer join.

                                                QUERY PLAN
------------------------------------------------------------------------
----------------------------------
Hash Join  (cost=2937.10..133392.04 rows=9939 width=8)
   Hash Cond: (("outer".mrna_acc)::text = ("inner".mrna_acc)::text)
   ->  Seq Scan on probes p  (cost=0.00..115434.22 rows=2984265
width=19)
         Filter: (mrna_acc IS NOT NULL)
   ->  Hash  (cost=2934.78..2934.78 rows=930 width=12)
         ->  Index Scan using idx_mrna_info_symbol on mrna_info m
(cost=0.00..2934.78 rows=930 width=12)
               Index Cond: (upper((symbol)::text) = 'PPARG'::text)
               Filter: (symbol IS NOT NULL)
(8 rows)

What am I doing wrong?

Dmitri

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org



--
with regards,
S.Gnanavel
Satyam Computer Services Ltd.

Re: Joining two large tables on a tiny subset of rows

От
Richard Huxton
Дата:
Dmitri Bichko wrote:
> Hello,
> 
> I have two tables, one has a foreing key from the other (only showing
> the relevant columns and indices here):
> 
>              Table "expresso.probes"
>    Column    |          Type          | Modifiers
> -------------+------------------------+-----------
>  platform_id | integer                | not null
>  probe_num   | integer                | not null
>  mrna_acc    | character varying(50)  |
> Indexes:
>     "idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
> IS NOT NULL
> 
>            Table "expresso.mrna_info"
>    Column    |         Type          | Modifiers
> -------------+-----------------------+-----------
>  mrna_acc    | character varying(25) | not null
>  symbol      | character varying(50) | not null
> Indexes:
>     "idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
> NOT NULL

1. The two types of mrna_acc don't match - one has a max length of 25, 
one 50. Why?
2. With idx_probes_mrna_acc, why WHERE mrna_acc IS NOT NULL? NULLs 
aren't indexed anyway.
3. You say there is a foreign key, but I don't even see a primary key 
anywhere. I'm guessing mrna_info.mrna_acc is the primary key for that table.

> Here are the explains for the two step process:
> 
> expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
> upper(symbol) = upper('pparg') AND symbol IS NOT NULL;

As someone else mentions, the IS NOT NULL is redundant.

>  Index Scan using idx_mrna_info_symbol on mrna_info  (cost=0.00..2934.78
> rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)

Note that the estimated number of rows is wrong though (930 rather than 
the actual 12).

> EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
> IN
> ('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
> ,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
> NOT NULL;

Again, a redundant IS NOT NULL, which presumably you're putting in to 
use the index.

>  Index Scan using idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc on probes  (cost=0.00..14710.63 rows=4151 width=8)
> (actual time=0.040..0.719 rows=142 loops=1)

Again, it's getting the row estimate badly wrong (4151 vs 142).

> And here is the explain of the join (it's essentially the same plan as
> the subselect and all the other ways I've tried):
> 
> expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
> probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
> UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;

Well, firstly get rid of the redundant "IS NOT NULL"s in the query and 
the indexes, then vacuum analyse the tables and post an EXPLAIN ANALYSE.

The problem will probably turn out to be poor row estimates (you can 
increase the statistics gathered on the mrna_acc values) or poor 
configuration settings (making indexes look expensive compared to 
sequential scans).

--  Richard Huxton  Archonet Ltd