Обсуждение: Another index "buglet"?

Поиск
Список
Период
Сортировка

Another index "buglet"?

От
The Hermit Hacker
Дата:
Query is:

SELECT url.status,url2.url,url.url  FROM url,url url2 WHERE url.referrer=url2.rec_id;

There is an index on rec_id and one on referrer ... shouldn't one of the
be used?  Like, I can see it having to go through every url2.rec_id, but
shouldn't the url.referrer= be abe to make use of an index?  I thought
about changing the above to something like:

explain SELECT url.status,url2.url,url.url         FROM url,url url2        WHERE url.referrer IN ( SELECT rec_id FROM
url);
 

but that didn't win me anything else :)  
======

udmsearch=> create index url_rec_id on url using btree ( rec_id );
CREATE
udmsearch=> create index url_referrer on url using btree ( referrer ); 
CREATE
udmsearch=> explain SELECT url.status,url2.url,url.url FROM url,url url2 WHERE
udmsearch->  url.referrer=url2.rec_id;
NOTICE:  QUERY PLAN:

Hash Join  (cost=2045.81 rows=4544 width=36) ->  Seq Scan on url  (cost=863.95 rows=4544 width=20) ->  Hash
(cost=863.95rows=4544 width=16)       ->  Seq Scan on url url2  (cost=863.95 rows=4544 width=16)
 

EXPLAIN
udmsearch=> \d url
Table    = url
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| rec_id                           | int4 not null default nextval (  |     4 |
| status                           | int4 not null default 0          |     4 |
| url                              | varchar() not null               |   128 |
| content_type                     | varchar() not null default ''    |    32 |
| last_modified                    | varchar() not null default ''    |    32 |
| title                            | varchar() not null default ''    |   128 |
| txt                              | varchar() not null default ''    |   255 |
| docsize                          | int4 not null default 0          |     4 |
| last_index_time                  | int4 not null                    |     4 |
| next_index_time                  | int4 not null                    |     4 |
| referrer                         | int4 not null default 0          |     4 |
| tag                              | int4 not null default 0          |     4 |
| hops                             | int4 not null default 0          |     4 |
| keywords                         | varchar() not null default ''    |   255 |
| description                      | varchar() not null default ''    |   100 |
| crc                              | varchar() not null default ''    |    33 |
+----------------------------------+----------------------------------+-------+
Indices:  url_crc         url_pkey         url_rec_id         url_referrer         url_url



Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] Another index "buglet"?

От
Tom Lane
Дата:
The Hermit Hacker <scrappy@hub.org> writes:
> SELECT url.status,url2.url,url.url 
>   FROM url,url url2 
>  WHERE url.referrer=url2.rec_id;

> There is an index on rec_id and one on referrer ... shouldn't one of the
> be used?

Not necessarily --- hash join is a perfectly respectable alternative
choice.  I'd expect to see either a hash or a merge join here (the
merge *would* use both indexes).

Now it could be that the optimizer is misestimating the relative costs
of merge and hash join.  If you're interested in checking that, do
this (*after* running VACUUM ANALYZE, ahem):

1. Start psql with environment variable PGOPTIONS="-fh" (forbid hash).  Do the EXPLAIN --- it'll probably give a
mergejoinplan now.  Note  the estimated total cost.  Run the query itself, and note the runtime.
 

2. Start psql with environment variable PGOPTIONS="-fm" (forbid merge),  and repeat the experiment to get the estimated
costand actual time  for the hash join.
 

I'd be interested to know what you find out.  I'm in the middle of
rejiggering the optimizer's cost estimates right now, so more data
points would be helpful.
        regards, tom lane


Re: [HACKERS] Another index "buglet"?

От
The Hermit Hacker
Дата:
After the VACUUM ANALYZE:

Straight start up:
Hash Join  (cost=9994.31 rows=2740488 width=36) ->  Seq Scan on url  (cost=3368.58 rows=37866 width=20) ->  Hash
(cost=3368.58rows=37866 width=16)       ->  Seq Scan on url url2  (cost=3368.58 rows=37866 width=16)
 

788u 0.327s 0:03.89 28.2%     104+14868k 0+179io 0pf+0w


Forbid merge:
Hash Join  (cost=9994.31 rows=2740488 width=36) ->  Seq Scan on url  (cost=3368.58 rows=37866 width=20) ->  Hash
(cost=3368.58rows=37866 width=16)       ->  Seq Scan on url url2  (cost=3368.58 rows=37866 width=16)
 

0.900u 0.217s 0:04.19 26.4%     103+14638k 0+175io 0pf+0w


Forbid Hash:
Merge Join  (cost=11188.76 rows=2740488 width=36) ->  Index Scan using url_pkey on url url2  (cost=4347.30 rows=37866
width=16)->  Index Scan using url_referrer on url  (cost=4342.30 rows=37866 width=20)
 

0.897u 0.210s 0:03.19 34.4%     106+15120k 0+179io 0pf+0w

On Sat, 8 Jan 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > SELECT url.status,url2.url,url.url 
> >   FROM url,url url2 
> >  WHERE url.referrer=url2.rec_id;
> 
> > There is an index on rec_id and one on referrer ... shouldn't one of the
> > be used?
> 
> Not necessarily --- hash join is a perfectly respectable alternative
> choice.  I'd expect to see either a hash or a merge join here (the
> merge *would* use both indexes).
> 
> Now it could be that the optimizer is misestimating the relative costs
> of merge and hash join.  If you're interested in checking that, do
> this (*after* running VACUUM ANALYZE, ahem):
> 
> 1. Start psql with environment variable PGOPTIONS="-fh" (forbid hash).
>    Do the EXPLAIN --- it'll probably give a mergejoin plan now.  Note
>    the estimated total cost.  Run the query itself, and note the runtime.
> 
> 2. Start psql with environment variable PGOPTIONS="-fm" (forbid merge),
>    and repeat the experiment to get the estimated cost and actual time
>    for the hash join.
> 
> I'd be interested to know what you find out.  I'm in the middle of
> rejiggering the optimizer's cost estimates right now, so more data
> points would be helpful.
> 
>             regards, tom lane
> 

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] Another index "buglet"?

От
Tom Lane
Дата:
The Hermit Hacker <scrappy@hub.org> writes:
> Forbid merge:
> Hash Join  (cost=9994.31 rows=2740488 width=36)
>   -> Seq Scan on url  (cost=3368.58 rows=37866 width=20)
>   -> Hash  (cost=3368.58 rows=37866 width=16)
>     -> Seq Scan on url url2  (cost=3368.58 rows=37866 width=16)

> 0.900u 0.217s 0:04.19 26.4%     103+14638k 0+175io 0pf+0w

> Forbid Hash:
> Merge Join  (cost=11188.76 rows=2740488 width=36)
>   -> Index Scan using url_pkey on url url2  (cost=4347.30 rows=37866 width=16)
>   -> Index Scan using url_referrer on url  (cost=4342.30 rows=37866 width=20)

> 0.897u 0.210s 0:03.19 34.4%     106+15120k 0+179io 0pf+0w

Thanks, but I'm confused about what I'm looking at here.  Are those
time outputs for the backend, or for psql?

Also, how large are these two tables, and how many rows do you actually
get from the query?
        regards, tom lane


Re: [HACKERS] Another index "buglet"?

От
The Hermit Hacker
Дата:
On Sat, 8 Jan 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > Forbid merge:
> > Hash Join  (cost=9994.31 rows=2740488 width=36)
> >   -> Seq Scan on url  (cost=3368.58 rows=37866 width=20)
> >   -> Hash  (cost=3368.58 rows=37866 width=16)
> >     -> Seq Scan on url url2  (cost=3368.58 rows=37866 width=16)
> 
> > 0.900u 0.217s 0:04.19 26.4%     103+14638k 0+175io 0pf+0w
> 
> > Forbid Hash:
> > Merge Join  (cost=11188.76 rows=2740488 width=36)
> >   -> Index Scan using url_pkey on url url2  (cost=4347.30 rows=37866 width=16)
> >   -> Index Scan using url_referrer on url  (cost=4342.30 rows=37866 width=20)
> 
> > 0.897u 0.210s 0:03.19 34.4%     106+15120k 0+179io 0pf+0w
> 
> Thanks, but I'm confused about what I'm looking at here.  Are those
> time outputs for the backend, or for psql?

just from psql ...

> Also, how large are these two tables, and how many rows do you actually
> get from the query?

pgsql> grep http query.out | wc -l  37825

Can't give you a count on the tables though, since I've since had to
rebuiild them :(  Or, rather, the two tables are the same table...