Обсуждение: tuning SQL

Поиск
Список
Период
Сортировка

tuning SQL

От
"Zhang, Anna"
Дата:
Hi,
I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2
drive disk array).
select count(*) from contact a, contact_discard b where a.contacthandle <>
b.contacthandle;
Table contact has over 9 million rows, contact_diacard has around 259,000
rows, both tables define contacthandle as primary key.
Here is the query execution plan:
Aggregate (cost=732021.97..732021.97 rows=1 width=24)
 -> Hash Join (cost=10035.10..731397.95 rows=249608 width=24)
      -> Seq Scan on contact a (cost=0.00..345002.95 rows=9330995 width=12)
      -> Hash (cost=9411.08..9411.08 rows=249608 width=12)
         -> Seq Scan on contact_disacrd b (cost=0.00..9411.08 rows=249608
width=12)
 I started to run this query at 5:00pm yesterday, it still running!!! My
question is Why query plan doesn't use index scan for join, Can we force it
to use index? Or any idea to improve the performance? We have more tables
bigger than contact, and need to join them among? Am I pushing the postgres
to the limit? Help!!!
Shared_buffer = 65536
sort_mem = 32768

Anna Zhang


Re: tuning SQL

От
"Peter Darley"
Дата:
Anna,
    I'm not sure that this query is doing what you think it's doing.
    Since your tables aren't linked you'll end up with count(contact) *
count(contact_discard) rows to evaluate, or
9,000,000*259,00=2,331,000,000,000 (2.3 trillion) rows.

    Assuming that you want to find the number of records in contact without a
corrisponding record in contact_diacard, you'll probably want something
like:

select count(*) from contact where contacthandle not in (select
contacthandle from contact_discard);

    This will also be painfully slow (but way faster than your first query),
since it's using in, but I'm sure that someone can suggest something faster.

Thanks,
Peter Darley


-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Zhang, Anna
Sent: Tuesday, January 29, 2002 7:57 AM
To: pgsql-admin@postgresql.org
Subject: [ADMIN] tuning SQL


Hi,
I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2
drive disk array).
select count(*) from contact a, contact_discard b where a.contacthandle <>
b.contacthandle;
Table contact has over 9 million rows, contact_diacard has around 259,000
rows, both tables define contacthandle as primary key.
Here is the query execution plan:
Aggregate (cost=732021.97..732021.97 rows=1 width=24)
 -> Hash Join (cost=10035.10..731397.95 rows=249608 width=24)
      -> Seq Scan on contact a (cost=0.00..345002.95 rows=9330995 width=12)
      -> Hash (cost=9411.08..9411.08 rows=249608 width=12)
         -> Seq Scan on contact_disacrd b (cost=0.00..9411.08 rows=249608
width=12)
 I started to run this query at 5:00pm yesterday, it still running!!! My
question is Why query plan doesn't use index scan for join, Can we force it
to use index? Or any idea to improve the performance? We have more tables
bigger than contact, and need to join them among? Am I pushing the postgres
to the limit? Help!!!
Shared_buffer = 65536
sort_mem = 32768

Anna Zhang


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org


Re: tuning SQL

От
"Ross J. Reedstrom"
Дата:
On Tue, Jan 29, 2002 at 10:57:01AM -0500, Zhang, Anna wrote:
> Hi,
> I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2
> drive disk array).
> select count(*) from contact a, contact_discard b where a.contacthandle <>
> b.contacthandle;

What are you trying to do with this query? As written, it'll give you
a (roughly) cartesian product between the two tables. Here's a (small) example
from one of my dbs:


bioinfo=# select count(*) from people;
 count
-------
    91
(1 row)

bioinfo=# select count(*) from new_people;
 count
-------
    70
(1 row)

bioinfo=# select count(*) from people p, new_people n where p.peid=n.peid;
 count
-------
    69
(1 row)

bioinfo=# select count(*) from people p, new_people n where p.peid <> n.peid;
 count
-------
  6301
(1 row)

if what you want is the number of contacts not in contact_discard, that'd
be something like:

bioinfo=# select count(*) from people p where not exists (select peid from new_people where peid=p.peid);
 count
-------
    22
(1 row)

Ross

--
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Executive Director                                  phone: 713-348-6166
Gulf Coast Consortium for Bioinformatics              fax: 713-348-6182
Rice University MS-39
Houston, TX 77005

Re: tuning SQL

От
Tom Lane
Дата:
"Zhang, Anna" <azhang@verisign.com> writes:
> select count(*) from contact a, contact_discard b where a.contacthandle <>
> b.contacthandle;

Did you really mean "<>" here?  If so, the plan you showed us is not for
this query.  The above query is going to take approximately forever to
execute :-(

If you meant "=", then the hash join that the system is using seems like
a perfectly fine plan to me.

            regards, tom lane

Re: tuning SQL

От
"Zhang, Anna"
Дата:
Thanks Peter Darley, Ross J. Reedstrom and Tom lane!!
How silly am I! Your messages reminds me. Actually I want to insert rows of
contact_discard table which are not exists in contact table to contact table
(some duplicates in two tables), first I run

insert into contact
select * from contact_discard a
where not exists ( select 1 from contact b where b.contacthandle =
a.contacthandle);

It seems takes forever, I killed it after hours(note: contact table has 4
indexes). Then I tried to figure out how many rows that are not duplicated.
Now my problem turns to insert performance, in oracle it takes only a few
minues.

Thanks!

Anna Zhang


-----Original Message-----
From: Ross J. Reedstrom [mailto:reedstrm@rice.edu]
Sent: Tuesday, January 29, 2002 11:39 AM
To: Zhang, Anna
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] tuning SQL


On Tue, Jan 29, 2002 at 10:57:01AM -0500, Zhang, Anna wrote:
> Hi,
> I am running a query on postgres 7.1.3 at Red Hat 7.2 (2 CPUs, 1.5G RAM, 2
> drive disk array).
> select count(*) from contact a, contact_discard b where a.contacthandle <>
> b.contacthandle;

What are you trying to do with this query? As written, it'll give you
a (roughly) cartesian product between the two tables. Here's a (small)
example
from one of my dbs:


bioinfo=# select count(*) from people;
 count
-------
    91
(1 row)

bioinfo=# select count(*) from new_people;
 count
-------
    70
(1 row)

bioinfo=# select count(*) from people p, new_people n where p.peid=n.peid;
 count
-------
    69
(1 row)

bioinfo=# select count(*) from people p, new_people n where p.peid <>
n.peid;
 count
-------
  6301
(1 row)

if what you want is the number of contacts not in contact_discard, that'd
be something like:

bioinfo=# select count(*) from people p where not exists (select peid from
new_people where peid=p.peid);
 count
-------
    22
(1 row)

Ross

--
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Executive Director                                  phone: 713-348-6166
Gulf Coast Consortium for Bioinformatics              fax: 713-348-6182
Rice University MS-39
Houston, TX 77005

Re: tuning SQL

От
"Ross J. Reedstrom"
Дата:
On Tue, Jan 29, 2002 at 12:23:17PM -0500, Zhang, Anna wrote:
> Thanks Peter Darley, Ross J. Reedstrom and Tom lane!!
> How silly am I! Your messages reminds me. Actually I want to insert rows of
> contact_discard table which are not exists in contact table to contact table
> (some duplicates in two tables), first I run
>
> insert into contact
> select * from contact_discard a
> where not exists ( select 1 from contact b where b.contacthandle =
> a.contacthandle);
>
> It seems takes forever, I killed it after hours(note: contact table has 4
> indexes). Then I tried to figure out how many rows that are not duplicated.
> Now my problem turns to insert performance, in oracle it takes only a few
> minues.

So, how long does the just the select take? You might be running into
something odd with selecting from and inserting into the same table: I'm
not sure of the visibility rules here. You might be better off selecting
into a new temp table (see: SELECT INTO) then inserting from that.

Ross

Re: tuning SQL

От
rhairgroveNoSpam@Pleasebigfoot.com (Bob Hairgrove)
Дата:
On Tue, 29 Jan 2002 17:45:34 +0000 (UTC), azhang@verisign.com ("Zhang,
Anna") wrote:

>Thanks Peter Darley, Ross J. Reedstrom and Tom lane!!
>How silly am I! Your messages reminds me. Actually I want to insert rows of
>contact_discard table which are not exists in contact table to contact table
>(some duplicates in two tables), first I run
>
>insert into contact
>select * from contact_discard a
>where not exists ( select 1 from contact b where b.contacthandle =
>a.contacthandle);
>
>It seems takes forever, I killed it after hours(note: contact table has 4
>indexes). Then I tried to figure out how many rows that are not duplicated.
>Now my problem turns to insert performance, in oracle it takes only a few
>minues.
>
>Thanks!
>
>Anna Zhang
>

With millions of rows, you also might want to create a temporary
(real) table with just the select statement, then drop ALL the indexes
on contact, do the update, then recreate the indexes.


Bob Hairgrove
rhairgroveNoSpam@Pleasebigfoot.com