Обсуждение: tuning SQL
Hi, I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2 drive disk array). select count(*) from contact a, contact_discard b where a.contacthandle <> b.contacthandle; Table contact has over 9 million rows, contact_diacard has around 259,000 rows, both tables define contacthandle as primary key. Here is the query execution plan: Aggregate (cost=732021.97..732021.97 rows=1 width=24) -> Hash Join (cost=10035.10..731397.95 rows=249608 width=24) -> Seq Scan on contact a (cost=0.00..345002.95 rows=9330995 width=12) -> Hash (cost=9411.08..9411.08 rows=249608 width=12) -> Seq Scan on contact_disacrd b (cost=0.00..9411.08 rows=249608 width=12) I started to run this query at 5:00pm yesterday, it still running!!! My question is Why query plan doesn't use index scan for join, Can we force it to use index? Or any idea to improve the performance? We have more tables bigger than contact, and need to join them among? Am I pushing the postgres to the limit? Help!!! Shared_buffer = 65536 sort_mem = 32768 Anna Zhang
Anna, I'm not sure that this query is doing what you think it's doing. Since your tables aren't linked you'll end up with count(contact) * count(contact_discard) rows to evaluate, or 9,000,000*259,00=2,331,000,000,000 (2.3 trillion) rows. Assuming that you want to find the number of records in contact without a corrisponding record in contact_diacard, you'll probably want something like: select count(*) from contact where contacthandle not in (select contacthandle from contact_discard); This will also be painfully slow (but way faster than your first query), since it's using in, but I'm sure that someone can suggest something faster. Thanks, Peter Darley -----Original Message----- From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Zhang, Anna Sent: Tuesday, January 29, 2002 7:57 AM To: pgsql-admin@postgresql.org Subject: [ADMIN] tuning SQL Hi, I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2 drive disk array). select count(*) from contact a, contact_discard b where a.contacthandle <> b.contacthandle; Table contact has over 9 million rows, contact_diacard has around 259,000 rows, both tables define contacthandle as primary key. Here is the query execution plan: Aggregate (cost=732021.97..732021.97 rows=1 width=24) -> Hash Join (cost=10035.10..731397.95 rows=249608 width=24) -> Seq Scan on contact a (cost=0.00..345002.95 rows=9330995 width=12) -> Hash (cost=9411.08..9411.08 rows=249608 width=12) -> Seq Scan on contact_disacrd b (cost=0.00..9411.08 rows=249608 width=12) I started to run this query at 5:00pm yesterday, it still running!!! My question is Why query plan doesn't use index scan for join, Can we force it to use index? Or any idea to improve the performance? We have more tables bigger than contact, and need to join them among? Am I pushing the postgres to the limit? Help!!! Shared_buffer = 65536 sort_mem = 32768 Anna Zhang ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org
On Tue, Jan 29, 2002 at 10:57:01AM -0500, Zhang, Anna wrote: > Hi, > I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2 > drive disk array). > select count(*) from contact a, contact_discard b where a.contacthandle <> > b.contacthandle; What are you trying to do with this query? As written, it'll give you a (roughly) cartesian product between the two tables. Here's a (small) example from one of my dbs: bioinfo=# select count(*) from people; count ------- 91 (1 row) bioinfo=# select count(*) from new_people; count ------- 70 (1 row) bioinfo=# select count(*) from people p, new_people n where p.peid=n.peid; count ------- 69 (1 row) bioinfo=# select count(*) from people p, new_people n where p.peid <> n.peid; count ------- 6301 (1 row) if what you want is the number of contacts not in contact_discard, that'd be something like: bioinfo=# select count(*) from people p where not exists (select peid from new_people where peid=p.peid); count ------- 22 (1 row) Ross -- Ross Reedstrom, Ph.D. reedstrm@rice.edu Executive Director phone: 713-348-6166 Gulf Coast Consortium for Bioinformatics fax: 713-348-6182 Rice University MS-39 Houston, TX 77005
"Zhang, Anna" <azhang@verisign.com> writes: > select count(*) from contact a, contact_discard b where a.contacthandle <> > b.contacthandle; Did you really mean "<>" here? If so, the plan you showed us is not for this query. The above query is going to take approximately forever to execute :-( If you meant "=", then the hash join that the system is using seems like a perfectly fine plan to me. regards, tom lane
Thanks Peter Darley, Ross J. Reedstrom and Tom lane!! How silly am I! Your messages reminds me. Actually I want to insert rows of contact_discard table which are not exists in contact table to contact table (some duplicates in two tables), first I run insert into contact select * from contact_discard a where not exists ( select 1 from contact b where b.contacthandle = a.contacthandle); It seems takes forever, I killed it after hours(note: contact table has 4 indexes). Then I tried to figure out how many rows that are not duplicated. Now my problem turns to insert performance, in oracle it takes only a few minues. Thanks! Anna Zhang -----Original Message----- From: Ross J. Reedstrom [mailto:reedstrm@rice.edu] Sent: Tuesday, January 29, 2002 11:39 AM To: Zhang, Anna Cc: pgsql-admin@postgresql.org Subject: Re: [ADMIN] tuning SQL On Tue, Jan 29, 2002 at 10:57:01AM -0500, Zhang, Anna wrote: > Hi, > I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2 > drive disk array). > select count(*) from contact a, contact_discard b where a.contacthandle <> > b.contacthandle; What are you trying to do with this query? As written, it'll give you a (roughly) cartesian product between the two tables. Here's a (small) example from one of my dbs: bioinfo=# select count(*) from people; count ------- 91 (1 row) bioinfo=# select count(*) from new_people; count ------- 70 (1 row) bioinfo=# select count(*) from people p, new_people n where p.peid=n.peid; count ------- 69 (1 row) bioinfo=# select count(*) from people p, new_people n where p.peid <> n.peid; count ------- 6301 (1 row) if what you want is the number of contacts not in contact_discard, that'd be something like: bioinfo=# select count(*) from people p where not exists (select peid from new_people where peid=p.peid); count ------- 22 (1 row) Ross -- Ross Reedstrom, Ph.D. reedstrm@rice.edu Executive Director phone: 713-348-6166 Gulf Coast Consortium for Bioinformatics fax: 713-348-6182 Rice University MS-39 Houston, TX 77005
On Tue, Jan 29, 2002 at 12:23:17PM -0500, Zhang, Anna wrote: > Thanks Peter Darley, Ross J. Reedstrom and Tom lane!! > How silly am I! Your messages reminds me. Actually I want to insert rows of > contact_discard table which are not exists in contact table to contact table > (some duplicates in two tables), first I run > > insert into contact > select * from contact_discard a > where not exists ( select 1 from contact b where b.contacthandle = > a.contacthandle); > > It seems takes forever, I killed it after hours(note: contact table has 4 > indexes). Then I tried to figure out how many rows that are not duplicated. > Now my problem turns to insert performance, in oracle it takes only a few > minues. So, how long does the just the select take? You might be running into something odd with selecting from and inserting into the same table: I'm not sure of the visibility rules here. You might be better off selecting into a new temp table (see: SELECT INTO) then inserting from that. Ross
On Tue, 29 Jan 2002 17:45:34 +0000 (UTC), azhang@verisign.com ("Zhang, Anna") wrote: >Thanks Peter Darley, Ross J. Reedstrom and Tom lane!! >How silly am I! Your messages reminds me. Actually I want to insert rows of >contact_discard table which are not exists in contact table to contact table >(some duplicates in two tables), first I run > >insert into contact >select * from contact_discard a >where not exists ( select 1 from contact b where b.contacthandle = >a.contacthandle); > >It seems takes forever, I killed it after hours(note: contact table has 4 >indexes). Then I tried to figure out how many rows that are not duplicated. >Now my problem turns to insert performance, in oracle it takes only a few >minues. > >Thanks! > >Anna Zhang > With millions of rows, you also might want to create a temporary (real) table with just the select statement, then drop ALL the indexes on contact, do the update, then recreate the indexes. Bob Hairgrove rhairgroveNoSpam@Pleasebigfoot.com