Обсуждение: Another index "buglet"?
Query is: SELECT url.status,url2.url,url.url FROM url,url url2 WHERE url.referrer=url2.rec_id; There is an index on rec_id and one on referrer ... shouldn't one of the be used? Like, I can see it having to go through every url2.rec_id, but shouldn't the url.referrer= be abe to make use of an index? I thought about changing the above to something like: explain SELECT url.status,url2.url,url.url FROM url,url url2 WHERE url.referrer IN ( SELECT rec_id FROM url); but that didn't win me anything else :) ====== udmsearch=> create index url_rec_id on url using btree ( rec_id ); CREATE udmsearch=> create index url_referrer on url using btree ( referrer ); CREATE udmsearch=> explain SELECT url.status,url2.url,url.url FROM url,url url2 WHERE udmsearch-> url.referrer=url2.rec_id; NOTICE: QUERY PLAN: Hash Join (cost=2045.81 rows=4544 width=36) -> Seq Scan on url (cost=863.95 rows=4544 width=20) -> Hash (cost=863.95rows=4544 width=16) -> Seq Scan on url url2 (cost=863.95 rows=4544 width=16) EXPLAIN udmsearch=> \d url Table = url +----------------------------------+----------------------------------+-------+ | Field | Type | Length| +----------------------------------+----------------------------------+-------+ | rec_id | int4 not null default nextval ( | 4 | | status | int4 not null default 0 | 4 | | url | varchar() not null | 128 | | content_type | varchar() not null default '' | 32 | | last_modified | varchar() not null default '' | 32 | | title | varchar() not null default '' | 128 | | txt | varchar() not null default '' | 255 | | docsize | int4 not null default 0 | 4 | | last_index_time | int4 not null | 4 | | next_index_time | int4 not null | 4 | | referrer | int4 not null default 0 | 4 | | tag | int4 not null default 0 | 4 | | hops | int4 not null default 0 | 4 | | keywords | varchar() not null default '' | 255 | | description | varchar() not null default '' | 100 | | crc | varchar() not null default '' | 33 | +----------------------------------+----------------------------------+-------+ Indices: url_crc url_pkey url_rec_id url_referrer url_url Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
The Hermit Hacker <scrappy@hub.org> writes: > SELECT url.status,url2.url,url.url > FROM url,url url2 > WHERE url.referrer=url2.rec_id; > There is an index on rec_id and one on referrer ... shouldn't one of the > be used? Not necessarily --- hash join is a perfectly respectable alternative choice. I'd expect to see either a hash or a merge join here (the merge *would* use both indexes). Now it could be that the optimizer is misestimating the relative costs of merge and hash join. If you're interested in checking that, do this (*after* running VACUUM ANALYZE, ahem): 1. Start psql with environment variable PGOPTIONS="-fh" (forbid hash). Do the EXPLAIN --- it'll probably give a mergejoinplan now. Note the estimated total cost. Run the query itself, and note the runtime. 2. Start psql with environment variable PGOPTIONS="-fm" (forbid merge), and repeat the experiment to get the estimated costand actual time for the hash join. I'd be interested to know what you find out. I'm in the middle of rejiggering the optimizer's cost estimates right now, so more data points would be helpful. regards, tom lane
After the VACUUM ANALYZE: Straight start up: Hash Join (cost=9994.31 rows=2740488 width=36) -> Seq Scan on url (cost=3368.58 rows=37866 width=20) -> Hash (cost=3368.58rows=37866 width=16) -> Seq Scan on url url2 (cost=3368.58 rows=37866 width=16) 788u 0.327s 0:03.89 28.2% 104+14868k 0+179io 0pf+0w Forbid merge: Hash Join (cost=9994.31 rows=2740488 width=36) -> Seq Scan on url (cost=3368.58 rows=37866 width=20) -> Hash (cost=3368.58rows=37866 width=16) -> Seq Scan on url url2 (cost=3368.58 rows=37866 width=16) 0.900u 0.217s 0:04.19 26.4% 103+14638k 0+175io 0pf+0w Forbid Hash: Merge Join (cost=11188.76 rows=2740488 width=36) -> Index Scan using url_pkey on url url2 (cost=4347.30 rows=37866 width=16)-> Index Scan using url_referrer on url (cost=4342.30 rows=37866 width=20) 0.897u 0.210s 0:03.19 34.4% 106+15120k 0+179io 0pf+0w On Sat, 8 Jan 2000, Tom Lane wrote: > The Hermit Hacker <scrappy@hub.org> writes: > > SELECT url.status,url2.url,url.url > > FROM url,url url2 > > WHERE url.referrer=url2.rec_id; > > > There is an index on rec_id and one on referrer ... shouldn't one of the > > be used? > > Not necessarily --- hash join is a perfectly respectable alternative > choice. I'd expect to see either a hash or a merge join here (the > merge *would* use both indexes). > > Now it could be that the optimizer is misestimating the relative costs > of merge and hash join. If you're interested in checking that, do > this (*after* running VACUUM ANALYZE, ahem): > > 1. Start psql with environment variable PGOPTIONS="-fh" (forbid hash). > Do the EXPLAIN --- it'll probably give a mergejoin plan now. Note > the estimated total cost. Run the query itself, and note the runtime. > > 2. Start psql with environment variable PGOPTIONS="-fm" (forbid merge), > and repeat the experiment to get the estimated cost and actual time > for the hash join. > > I'd be interested to know what you find out. I'm in the middle of > rejiggering the optimizer's cost estimates right now, so more data > points would be helpful. > > regards, tom lane > Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
The Hermit Hacker <scrappy@hub.org> writes: > Forbid merge: > Hash Join (cost=9994.31 rows=2740488 width=36) > -> Seq Scan on url (cost=3368.58 rows=37866 width=20) > -> Hash (cost=3368.58 rows=37866 width=16) > -> Seq Scan on url url2 (cost=3368.58 rows=37866 width=16) > 0.900u 0.217s 0:04.19 26.4% 103+14638k 0+175io 0pf+0w > Forbid Hash: > Merge Join (cost=11188.76 rows=2740488 width=36) > -> Index Scan using url_pkey on url url2 (cost=4347.30 rows=37866 width=16) > -> Index Scan using url_referrer on url (cost=4342.30 rows=37866 width=20) > 0.897u 0.210s 0:03.19 34.4% 106+15120k 0+179io 0pf+0w Thanks, but I'm confused about what I'm looking at here. Are those time outputs for the backend, or for psql? Also, how large are these two tables, and how many rows do you actually get from the query? regards, tom lane
On Sat, 8 Jan 2000, Tom Lane wrote: > The Hermit Hacker <scrappy@hub.org> writes: > > Forbid merge: > > Hash Join (cost=9994.31 rows=2740488 width=36) > > -> Seq Scan on url (cost=3368.58 rows=37866 width=20) > > -> Hash (cost=3368.58 rows=37866 width=16) > > -> Seq Scan on url url2 (cost=3368.58 rows=37866 width=16) > > > 0.900u 0.217s 0:04.19 26.4% 103+14638k 0+175io 0pf+0w > > > Forbid Hash: > > Merge Join (cost=11188.76 rows=2740488 width=36) > > -> Index Scan using url_pkey on url url2 (cost=4347.30 rows=37866 width=16) > > -> Index Scan using url_referrer on url (cost=4342.30 rows=37866 width=20) > > > 0.897u 0.210s 0:03.19 34.4% 106+15120k 0+179io 0pf+0w > > Thanks, but I'm confused about what I'm looking at here. Are those > time outputs for the backend, or for psql? just from psql ... > Also, how large are these two tables, and how many rows do you actually > get from the query? pgsql> grep http query.out | wc -l 37825 Can't give you a count on the tables though, since I've since had to rebuiild them :( Or, rather, the two tables are the same table...