Обсуждение: Indexes performance

Поиск
Список
Период
Сортировка

Indexes performance

От
charavay
Дата:
Hello !
We have difficulties with the use of indexes. For example, we have two
tables :

    * table lnk :

Table "public.lnk"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 index  | integer               | not null
 sgaccn | character varying(12) | not null
Indexes:
    "pkey1" primary key, btree ("index", sgaccn)
Foreign-key constraints:
    "fk_sgaccn1" FOREIGN KEY (sgaccn) REFERENCES main_tbl(sgaccn) ON UPDATE
CASCADE ON DELETE CASCADE

    * table dic :

Table "public.dic"
 Column |         Type          |                             Modifiers

--------+-----------------------+--------------------------------------------------------------------
 index  | integer               | not null default
nextval('public.dic_index_seq'::text)
 word   | character varying(60) | not null
Indexes:
    "dic_pkey" primary key, btree ("index")
    "dic_word_idx" unique, btree (word)
    "dic_word_key" unique, btree (word)



The table lnk contains 33 000 000 tuples and table dic contains 303 000
tuples.

When we try to execute a join between these two tables, the planner
proposes to excute a hash-join plan :

explain select sgaccn from dic, lnk where dic.index=lnk.index;
                                    QUERY PLAN

-----------------------------------------------------------------------------------
 Hash Join  (cost=6793.29..1716853.80 rows=33743101 width=11)
   Hash Cond: ("outer"."index" = "inner"."index")
   ->  Seq Scan on lnk  (cost=0.00..535920.00 rows=33743100 width=15)
   ->  Hash  (cost=4994.83..4994.83 rows=303783 width=4)
         ->  Seq Scan dic  (cost=0.00..4994.83 rows=303783 width=4)
(5 rows)

So the planner decides to scan 33 000 000 of tuples and we would like to
force it to scan the table dic (303 000 tuples) and to use
the index on the integer index to execute the join. So we have set the
parameters enable_hashjoin and enable_mergejoin to off. So the planner
proposes the following query :

                                            QUERY PLAN

--------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..102642540.60 rows=33743101 width=11)
   ->  Seq Scan on refs_ra_lnk1  (cost=0.00..535920.00 rows=33743100
width=15)
   ->  Index Scan using refs_ra_dic_new_pkey on refs_ra_dic_new
(cost=0.00..3.01 rows=1 width=4)
         Index Cond: (refs_ra_dic_new."index" = "outer"."index")
(4 rows)

We were surprised of this response because the planner continues to
propose us to scan the 33 000 000 of tuples instead of the smaller
table. Is there any way to force it to scan the smaller table ?

Thanks

Celine Charavay


Re: Indexes performance

От
Josh Berkus
Дата:
Charavay,

> ---------------------------------------------------------------------------
>-------- Hash Join  (cost=6793.29..1716853.80 rows=33743101 width=11)
>    Hash Cond: ("outer"."index" = "inner"."index")
>    ->  Seq Scan on lnk  (cost=0.00..535920.00 rows=33743100 width=15)
>    ->  Hash  (cost=4994.83..4994.83 rows=303783 width=4)
>          ->  Seq Scan dic  (cost=0.00..4994.83 rows=303783 width=4)
> (5 rows)

According to the estimate, you are selecting all of the rows in the database.
This is going to require a Seq Scan no matter what.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Indexes performance

От
Tom Lane
Дата:
charavay <c.charavay@ibcp.fr> writes:
> ... So the planner decides to scan 33 000 000 of tuples and we would like to
> force it to scan the table dic (303 000 tuples) and to use
> the index on the integer index to execute the join.

I'm mystified why you think that that will be a superior plan.  It still
requires visiting every row of the larger table (I assume that all of
the larger table's rows do join to some row of the smaller table).
All that it accomplishes is to force those visits to occur in a
quasi-random order; which not only loses any chance of kernel read-ahead
optimizations, but very likely causes each page of the table to be read
more than once.

AFAICT the planner made exactly the right choice by picking a hashjoin.
Have you tried comparing its estimates against actual runtimes for the
different plans?  (See EXPLAIN ANALYZE.)

Offhand the only way I can think of to force it to do the nestloop the
other way around from what it wants to is to temporarily drop the
index it wants to use.  You can do that conveniently like so:

    begin;
    alter table dic drop constraint dic_pkey;
    explain analyze select ...;
    rollback;

which of course would be no good for production, but it should at least
serve to destroy your illusions about wanting to do it in production.

            regards, tom lane