Joining two large tables on a tiny subset of rows

Поиск
Список
Период
Сортировка
От Dmitri Bichko
Тема Joining two large tables on a tiny subset of rows
Дата
Msg-id F18A6F7CF1661F46920F2CF713122FED46CC35@mail.aveo.aveopharma.com
обсуждение исходный текст
Ответы Re: Joining two large tables on a tiny subset of rows  (Gnanavel S <s.gnanavel@gmail.com>)
Re: Joining two large tables on a tiny subset of rows  (Richard Huxton <dev@archonet.com>)
Список pgsql-sql
Hello,

I have two tables, one has a foreing key from the other (only showing
the relevant columns and indices here):
            Table "expresso.probes"  Column    |          Type          | Modifiers
-------------+------------------------+-----------platform_id | integer                | not nullprobe_num   | integer
             | not nullmrna_acc    | character varying(50)  | 
Indexes:   "idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
IS NOT NULL
          Table "expresso.mrna_info"  Column    |         Type          | Modifiers
-------------+-----------------------+-----------mrna_acc    | character varying(25) | not nullsymbol      | character
varying(50)| not null 
Indexes:   "idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
NOT NULL

I want to select all the records in "probes" which have an "mrna_acc"
that is associated with a specific symbol in "mrna_info".

This works fine if I do it in two steps: select from "mrna_info" (12
records out of ~250 thousand) and paste those into an IN clause to
select from "probes" (83 records out of ~4.5 million).

But if I try to do this as one query I always end up with a seq scan on
"probes".  I've tried doing it with joins, subselects, joining with a
dynamic table (select from () as foo), nothing seems to get the desired
result.

Here are the explains for the two step process:

expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
upper(symbol) = upper('pparg') AND symbol IS NOT NULL;
QUERYPLAN 
------------------------------------------------------------------------
---------------------------------------------------------------Index Scan using idx_mrna_info_symbol on mrna_info
(cost=0.00..2934.78
rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)  Index Cond: (upper((symbol)::text) = 'PPARG'::text)
Filter:(symbol IS NOT NULL)Total runtime: 0.123 ms 
(4 rows)

EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
IN
('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
NOT NULL;
QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------------------------------Index Scan using idx_probes_mrna_acc,
idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc on probes  (cost=0.00..14710.63 rows=4151 width=8)
(actual time=0.040..0.719 rows=142 loops=1)  Index Cond: (((mrna_acc)::text = 'U10374'::text) OR ((mrna_acc)::text
= 'U09138'::text) OR ((mrna_acc)::text = 'U01841'::text) OR
((mrna_acc)::text = 'U01664'::text) OR ((mrna_acc)::text =
'NM_015869'::text) OR ((mrna_acc)::text = 'NM_013124'::text) OR
((mrna_acc)::text = 'NM_011146'::text) OR ((mrna_acc)::text =
'NM_005037'::text) OR ((mrna_acc)::text = 'D83233'::text) OR
((mrna_acc)::text = 'BC021798'::text) OR ((mrna_acc)::text =
'BC006811'::text) OR ((mrna_acc)::text = 'AB011365'::text))  Filter: (mrna_acc IS NOT NULL)Total runtime: 0.844 ms
(4 rows)

And here is the explain of the join (it's essentially the same plan as
the subselect and all the other ways I've tried):

expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;                                               QUERY PLAN
------------------------------------------------------------------------
----------------------------------Hash Join  (cost=2937.10..133392.04 rows=9939 width=8)  Hash Cond:
(("outer".mrna_acc)::text= ("inner".mrna_acc)::text)  ->  Seq Scan on probes p  (cost=0.00..115434.22 rows=2984265 
width=19)        Filter: (mrna_acc IS NOT NULL)  ->  Hash  (cost=2934.78..2934.78 rows=930 width=12)        ->  Index
Scanusing idx_mrna_info_symbol on mrna_info m 
(cost=0.00..2934.78 rows=930 width=12)              Index Cond: (upper((symbol)::text) = 'PPARG'::text)
Filter:(symbol IS NOT NULL) 
(8 rows)

What am I doing wrong?

Dmitri

The information transmitted is intended only for the person or entity to which it is addressed and may contain
confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any
actionin reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you
receivedthis in error, please contact the sender and delete the material from any computer 


В списке pgsql-sql по дате отправления:

Предыдущее
От: Chris Browne
Дата:
Сообщение: Re: REINDEX DATABASE
Следующее
От: Gnanavel S
Дата:
Сообщение: Re: Joining two large tables on a tiny subset of rows