Discussion: Query improvement
Hi, I have 3 tables: page - revision - pagecontent.

CREATE TABLE mediawiki.page
(
  page_id serial NOT NULL,
  page_namespace smallint NOT NULL,
  page_title text NOT NULL,
  page_restrictions text,
  page_counter bigint NOT NULL DEFAULT 0,
  page_is_redirect smallint NOT NULL DEFAULT 0,
  page_is_new smallint NOT NULL DEFAULT 0,
  page_random numeric(15,14) NOT NULL DEFAULT random(),
  page_touched timestamp with time zone,
  page_latest integer NOT NULL,
  page_len integer NOT NULL,
  titlevector tsvector,
  page_type integer NOT NULL DEFAULT 0,
  CONSTRAINT page_pkey PRIMARY KEY (page_id)
);

CREATE TABLE mediawiki.revision
(
  rev_id serial NOT NULL,
  rev_page integer,
  rev_text_id integer,
  rev_comment text,
  rev_user integer NOT NULL,
  rev_user_text text NOT NULL,
  rev_timestamp timestamp with time zone NOT NULL,
  rev_minor_edit smallint NOT NULL DEFAULT 0,
  rev_deleted smallint NOT NULL DEFAULT 0,
  rev_len integer,
  rev_parent_id integer,
  CONSTRAINT revision_rev_page_fkey FOREIGN KEY (rev_page)
      REFERENCES mediawiki.page (page_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE,
  CONSTRAINT revision_rev_id_key UNIQUE (rev_id)
);

CREATE TABLE mediawiki.pagecontent
(
  old_id integer NOT NULL DEFAULT nextval('mediawiki.text_old_id_seq'::regclass),
  old_text text,
  old_flags text,
  textvector tsvector,
  CONSTRAINT pagecontent_pkey PRIMARY KEY (old_id)
);

and this query:

SELECT pa.page_id, pa.page_title,
       ts_rank(pc.textvector, to_tsquery('fotbal'))
       + ts_rank(pa.titlevector, to_tsquery('fotbal')) * 10 AS totalrank
FROM mediawiki.page pa, mediawiki.revision re, mediawiki.pagecontent pc
WHERE pa.page_id IN
    (SELECT page_id FROM mediawiki.page
     WHERE page_id IN
         (SELECT page_id FROM mediawiki.page
          WHERE titlevector @@ to_tsquery('fotbal'))
        OR page_id IN
         (SELECT p.page_id
          FROM mediawiki.page p, mediawiki.revision r,
               (SELECT old_id FROM mediawiki.pagecontent
                WHERE textvector @@ to_tsquery('fotbal')) ss
          WHERE p.page_id = r.rev_page AND r.rev_id = ss.old_id))
  AND pa.page_id = re.rev_page
  AND re.rev_id = pc.old_id
ORDER BY totalrank
LIMIT 100;

This query finds page titles in page and content in pagecontent by full-text search (@@), and then ranks the resulting ids for relevance with ts_rank.

Now the problem. When I try EXPLAIN it shows:

Limit  (cost=136568.00..136568.25 rows=100 width=185)
  ->  Sort  (cost=136568.00..137152.26 rows=233703 width=185)
        Sort Key: ((ts_rank(pc.textvector, to_tsquery('fotbal'::text)) + (ts_rank(pa.titlevector, to_tsquery('fotbal'::text)) * 10::double precision)))
        ->  Hash Join  (cost=61707.99..127636.04 rows=233703 width=185)
              Hash Cond: (re.rev_id = pc.old_id)
              ->  Merge Join  (cost=24098.90..71107.48 rows=233703 width=66)
                    Merge Cond: (pa.page_id = re.rev_page)
                    ->  Merge Semi Join  (cost=24096.98..55665.69 rows=233703 width=66)
                          Merge Cond: (pa.page_id = mediawiki.page.page_id)
                          ->  Index Scan using page_btree_id on page pa  (cost=0.00..13155.20 rows=311604 width=62)
                          ->  Index Scan using page_btree_id on page  (cost=24096.98..38810.19 rows=233703 width=4)
                                Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
                                SubPlan 1
                                  ->  Bitmap Heap Scan on page  (cost=10.41..900.33 rows=270 width=4)
                                        Recheck Cond: (titlevector @@ to_tsquery('fotbal'::text))
                                        ->  Bitmap Index Scan on gin_index  (cost=0.00..10.34 rows=270 width=0)
                                              Index Cond: (titlevector @@ to_tsquery('fotbal'::text))
                                SubPlan 2
                                  ->  Nested Loop  (cost=1499.29..23192.08 rows=1558 width=4)
                                        ->  Nested Loop  (cost=1499.29..15967.11 rows=1558 width=4)
                                              ->  Bitmap Heap Scan on pagecontent  (cost=1499.29..6448.12 rows=1558 width=4)
                                                    Recheck Cond: (textvector @@ to_tsquery('fotbal'::text))
                                                    ->  Bitmap Index Scan on gin_index2  (cost=0.00..1498.90 rows=1558 width=0)
                                                          Index Cond: (textvector @@ to_tsquery('fotbal'::text))
                                              ->  Index Scan using page_btree_rev_content_id on revision r  (cost=0.00..6.10 rows=1 width=8)
                                                    Index Cond: (r.rev_id = pagecontent.old_id)
                                        ->  Index Scan using page_btree_id on page p  (cost=0.00..4.62 rows=1 width=4)
                                              Index Cond: (p.page_id = r.rev_page)
                    ->  Index Scan using page_btree_rev_page_id on revision re  (cost=0.00..11850.52 rows=311604 width=8)
              ->  Hash  (cost=27932.04..27932.04 rows=311604 width=127)
                    ->  Seq Scan on pagecontent pc  (cost=0.00..27932.04 rows=311604 width=127)

Is there some possibility to speed up the hash join, which takes a lot of time? I have tried to find a solution, but was not successful. Thanks a lot.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-improvement-tp4362578p4362578.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
On Sun, May 1, 2011 at 12:23 PM, Mark <Marek.Balgar@seznam.cz> wrote:
> Now the problem.
> When I try EXPLAIN it shows:

That's a regular EXPLAIN... can you post an EXPLAIN ANALYZE?

Hash joins are very inefficient if they require big temporary files. I usually work around that by disabling hash joins for the problematic queries:

set enable_hashjoin = false;
<query>
set enable_hashjoin = true;

But an EXPLAIN ANALYZE would confirm or deny that theory.
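[Editor's note: a session-wide toggle like the above affects every query on the connection until it is reset, and a failure between the two SET statements can leave it disabled. A sketch using standard PostgreSQL SET LOCAL keeps the change scoped to one transaction; the SELECT shown is a placeholder for the OP's query:]

```sql
BEGIN;
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK,
-- so hash joins cannot stay disabled for the whole session.
SET LOCAL enable_hashjoin = false;
SELECT 1;  -- run the problematic query here
COMMIT;
```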
Here is EXPLAIN ANALYZE:

Limit  (cost=136568.00..136568.25 rows=100 width=185) (actual time=1952.174..1952.215 rows=100 loops=1)
  ->  Sort  (cost=136568.00..137152.26 rows=233703 width=185) (actual time=1952.172..1952.188 rows=100 loops=1)
        Sort Key: ((ts_rank(pc.textvector, to_tsquery('fotbal'::text)) + (ts_rank(pa.titlevector, to_tsquery('fotbal'::text)) * 10::double precision)))
        Sort Method:  top-N heapsort  Memory: 23kB
        ->  Hash Join  (cost=61707.99..127636.04 rows=233703 width=185) (actual time=1046.838..1947.815 rows=3278 loops=1)
              Hash Cond: (re.rev_id = pc.old_id)
              ->  Merge Join  (cost=24098.90..71107.48 rows=233703 width=66) (actual time=200.884..859.453 rows=3278 loops=1)
                    Merge Cond: (pa.page_id = re.rev_page)
                    ->  Merge Semi Join  (cost=24096.98..55665.69 rows=233703 width=66) (actual time=200.843..629.821 rows=3278 loops=1)
                          Merge Cond: (pa.page_id = mediawiki.page.page_id)
                          ->  Index Scan using page_btree_id on page pa  (cost=0.00..13155.20 rows=311604 width=62) (actual time=0.027..145.989 rows=311175 loops=1)
                          ->  Index Scan using page_btree_id on page  (cost=24096.98..38810.19 rows=233703 width=4) (actual time=200.779..429.219 rows=3278 loops=1)
                                Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
                                SubPlan 1
                                  ->  Bitmap Heap Scan on page  (cost=10.41..900.33 rows=270 width=4) (actual time=0.748..9.845 rows=280 loops=1)
                                        Recheck Cond: (titlevector @@ to_tsquery('fotbal'::text))
                                        ->  Bitmap Index Scan on gin_index  (cost=0.00..10.34 rows=270 width=0) (actual time=0.586..0.586 rows=280 loops=1)
                                              Index Cond: (titlevector @@ to_tsquery('fotbal'::text))
                                SubPlan 2
                                  ->  Nested Loop  (cost=1499.29..23192.08 rows=1558 width=4) (actual time=2.032..185.743 rows=3250 loops=1)
                                        ->  Nested Loop  (cost=1499.29..15967.11 rows=1558 width=4) (actual time=1.980..109.491 rows=3250 loops=1)
                                              ->  Bitmap Heap Scan on pagecontent  (cost=1499.29..6448.12 rows=1558 width=4) (actual time=1.901..36.583 rows=3250 loops=1)
                                                    Recheck Cond: (textvector @@ to_tsquery('fotbal'::text))
                                                    ->  Bitmap Index Scan on gin_index2  (cost=0.00..1498.90 rows=1558 width=0) (actual time=1.405..1.405 rows=3250 loops=1)
                                                          Index Cond: (textvector @@ to_tsquery('fotbal'::text))
                                              ->  Index Scan using page_btree_rev_content_id on revision r  (cost=0.00..6.10 rows=1 width=8) (actual time=0.020..0.021 rows=1 loops=3250)
                                                    Index Cond: (r.rev_id = pagecontent.old_id)
                                        ->  Index Scan using page_btree_id on page p  (cost=0.00..4.62 rows=1 width=4) (actual time=0.022..0.022 rows=1 loops=3250)
                                              Index Cond: (p.page_id = r.rev_page)
                    ->  Index Scan using page_btree_rev_page_id on revision re  (cost=0.00..11850.52 rows=311604 width=8) (actual time=0.012..166.042 rows=311175 loops=1)
              ->  Hash  (cost=27932.04..27932.04 rows=311604 width=127) (actual time=801.000..801.000 rows=311604 loops=1)
                    Buckets: 1024  Batches: 64  Memory Usage: 744kB
                    ->  Seq Scan on pagecontent pc  (cost=0.00..27932.04 rows=311604 width=127) (actual time=0.018..465.686 rows=311604 loops=1)
Total runtime: 1952.962 ms

I have tried

set enable_hashjoin = false;
<query>
set enable_hashjoin = true;

but the result was worse than before. By the way, is there a possibility to create a better query with the same effect? I have tried more queries, but this one has the best performance yet.
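[Editor's note: the Hash node in this plan reports "Batches: 64", meaning the hash table did not fit in work_mem and was split into 64 batches spilled to temporary files, which is exactly the situation Claudio described. One thing worth trying, an editorial suggestion not tested in the thread, is raising work_mem for the session so the ~311k-row hash fits in a single batch:]

```sql
-- Per-session override; 64 batches at ~744kB each suggests the
-- full hash table wants on the order of 50MB.
SET work_mem = '64MB';
-- Re-run EXPLAIN ANALYZE on the query and check the Hash node
-- for "Batches: 1", which means no temp-file spill.
RESET work_mem;
```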
On Mon, May 2, 2011 at 10:54 PM, Mark <Marek.Balgar@seznam.cz> wrote:
> but the result was worse than before. By the way, is there a possibility
> to create a better query with the same effect?
> I have tried more queries, but this one has the best performance yet.

Well, this seems to be the worst part:

(SELECT page_id FROM mediawiki.page
 WHERE page_id IN
     (SELECT page_id FROM mediawiki.page
      WHERE titlevector @@ to_tsquery('fotbal'))
    OR page_id IN
     (SELECT p.page_id
      FROM mediawiki.page p, mediawiki.revision r,
           (SELECT old_id FROM mediawiki.pagecontent
            WHERE textvector @@ to_tsquery('fotbal')) ss
      WHERE p.page_id = r.rev_page AND r.rev_id = ss.old_id))

If you're running a new enough pg (8.4+), you could try using CTEs for that. I haven't used CTEs much, but it goes something like this (note that the CTE must be referenced as a subquery, "IN (SELECT ... FROM someids)", not "in someids"):

WITH someids AS (
    SELECT page_id FROM mediawiki.page
    WHERE page_id IN
        (SELECT page_id FROM mediawiki.page
         WHERE titlevector @@ to_tsquery('fotbal'))
       OR page_id IN
        (SELECT p.page_id
         FROM mediawiki.page p, mediawiki.revision r,
              (SELECT old_id FROM mediawiki.pagecontent
               WHERE textvector @@ to_tsquery('fotbal')) ss
         WHERE p.page_id = r.rev_page AND r.rev_id = ss.old_id)
)
SELECT pa.page_id, pa.page_title,
       ts_rank(pc.textvector, to_tsquery('fotbal'))
       + ts_rank(pa.titlevector, to_tsquery('fotbal')) * 10 AS totalrank
FROM mediawiki.page pa, mediawiki.revision re, mediawiki.pagecontent pc
WHERE pa.page_id IN (SELECT page_id FROM someids)
  AND pa.page_id = re.rev_page
  AND re.rev_id = pc.old_id
ORDER BY totalrank
LIMIT 100;
> On Mon, May 2, 2011 at 10:54 PM, Mark <Marek.Balgar@seznam.cz> wrote:
> > but the result was worse than before. By the way, is there a possibility
> > to create a better query with the same effect?
> > I have tried more queries, but this one has the best performance yet.
>
> Well, this seems to be the worst part:
>
> (SELECT page_id FROM mediawiki.page
>  WHERE page_id IN
>      (SELECT page_id FROM mediawiki.page
>       WHERE titlevector @@ to_tsquery('fotbal'))
>     OR page_id IN
>      (SELECT p.page_id
>       FROM mediawiki.page p, mediawiki.revision r,
>            (SELECT old_id FROM mediawiki.pagecontent
>             WHERE textvector @@ to_tsquery('fotbal')) ss
>       WHERE p.page_id = r.rev_page AND r.rev_id = ss.old_id))

'OR' statements often generate complicated plans. You should try to rewrite your query with a UNION clause. Using explicit joins may also help the planner:

SELECT page_id FROM mediawiki.page
WHERE titlevector @@ to_tsquery('fotbal')
UNION
SELECT p.page_id
FROM mediawiki.page p
JOIN mediawiki.revision r ON (p.page_id = r.rev_page)
JOIN mediawiki.pagecontent ss ON (r.rev_id = ss.old_id)
WHERE ss.textvector @@ to_tsquery('fotbal');

HTH,

Marc Mamin
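[Editor's note: combining Marc's UNION with the original ranking query gives something like the sketch below. This is a reconstruction from the thread, not the OP's actual final query; each UNION branch can use its own GIN index (gin_index on titles, gin_index2 on content), and the planner no longer has to cope with the OR:]

```sql
SELECT pa.page_id, pa.page_title,
       ts_rank(pc.textvector, to_tsquery('fotbal'))
       + ts_rank(pa.titlevector, to_tsquery('fotbal')) * 10 AS totalrank
FROM mediawiki.page pa
JOIN mediawiki.revision re    ON pa.page_id = re.rev_page
JOIN mediawiki.pagecontent pc ON re.rev_id = pc.old_id
WHERE pa.page_id IN (
    -- UNION replaces the OR: title matches and content matches are
    -- fetched separately via their GIN indexes, then deduplicated.
    SELECT page_id FROM mediawiki.page
    WHERE titlevector @@ to_tsquery('fotbal')
    UNION
    SELECT p.page_id
    FROM mediawiki.page p
    JOIN mediawiki.revision r     ON p.page_id = r.rev_page
    JOIN mediawiki.pagecontent ss ON r.rev_id = ss.old_id
    WHERE ss.textvector @@ to_tsquery('fotbal')
)
ORDER BY totalrank
LIMIT 100;
```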
Thanks for the replies. Finally I have used UNION and JOINs, which helped. Mainly the UNION helped a lot. Now the query takes 1 sec max. Thanks a lot.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-improvement-tp4362578p4378163.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
Thanks a lot for the reply. Finally I have used UNION, but thanks for your help.
On Mon, May 2, 2011 at 3:58 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Hash joins are very inefficient if they require big temporary files.

Hmm, that's not been my experience. What have you seen? I've seen a 64-batch hash join beat out a nested-loop-with-inner-indexscan, which I never woulda believed, but...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company