Обсуждение: Re: [HACKERS] [6.5.2] join problems ...

Поиск
Список
Период
Сортировка

Re: [HACKERS] [6.5.2] join problems ...

От
Mike Mascari
Дата:
The query you've presented is rather convoluted, but
if I'm reading your query correctly, it should reduce
to a simple, three-way join:

SELECT c.id, c.name, c.url 
FROM aecEntMain a, aecWebEntry b, aecCategory c
WHERE a.status LIKE 'active:ALL%'
AND a.representation LIKE '%:ALL%'
AND b.status LIKE 'active:ALL%'
AND b.indid='$indid' 
AND b.divid='$divid' 
AND (a.id,a.mid = b.id,b.mid) 
AND (b.catid,b.indid,b.divid = c.id,c.ppid,c.pid);

with the following indexes:
aecEntMain:  (status) and (id,mid)
aecWebEntry: (status), (indid), (divid), and            (catid,indid,divid)
aecCategory: (id,ppid,pid)

Now, there are some differences between the above and
what you wrote. For example, the above requires that 
the status begins with 'active:ALL'. Your query 
requires the status begin with 'active' and must also
contain the pattern 'active:ALL'. So for the above 
to be equivalent, you can't have a status such as
'active <some stuff> active:ALL'.

With respect to subqueries and PostgreSQL, as you 
know, the IN clause requires a nested scan. If you
are going to use subqueries, correlated subqueries
using EXISTS clauses can use indexes:

SELECT c.id, c.name, c.url 
FROM aecCategory c
WHERE EXISTS (
SELECT a.status 
FROM aecEntMain a, aecWebEntry b
WHERE a.status LIKE 'active:ALL%'
AND a.representation LIKE '%:ALL%'
AND b.status LIKE 'active:ALL%'
AND b.indid='$indid' 
AND b.divid='$divid' 
AND (a.id,a.mid = b.id,b.mid) 
AND (b.catid,b.indid,b.divid = c.id,c.ppid,c.pid));

Unfortunately, the lack of index support in IN
subqueries affects more than just the IN subquery 
clause, since INTERSECT/EXCEPT uses the rewriter to
rewrite such queries as UNIONS of two queries with
an IN/NOT IN subquery, respectively. This makes the
INTERSECT/EXCEPT feature functionally useless except
on very small tables.

Hope that helps (and is equivalent),

Mike Mascari (mascarim@yahoo.com)

--- The Hermit Hacker <scrappy@hub.org> wrote:
> 
> Morning...
> 
>     This weekend, up at a clients site working with
> them on improving
> database performance.  They are currently running
> MySQL and I'm trying to
> convince them to switch over to PostgreSQL, for
> various features that they
> just don't have with MySQL...
> 
>     One of the 'queries' that they are currently doing
> with MySQL
> consists of two queries that I can reduce down to
> one using subqueries,
> but its so slow that its ridiculous...so I figured
> I'd throw it out as a
> problem to hopefully solve?
> 
>     The query I'm starting out with is works out as:
> 
>                 SELECT id, name, url \
>                   FROM aecCategory \
>                  WHERE ppid='$indid' \
>                    AND pid='$divid'";
> 
>     The results of this get fed into a while look that
> takes the id
> returned and pushes them into:
> 
>             SELECT distinct b.indid, b.divid, b.catid,
> a.id, a.mid \
>           FROM aecEntMain a, aecWebEntry b \
>          WHERE (a.id=b.id AND a.mid=b.mid) \
>            AND (a.status like 'active%' and b.status
> like 'active%')
>            AND (a.status like '%active:ALL%' and
> b.status like '%active:ALL%')
>            AND (a.representation like '%:ALL%')
>            AND (b.indid='$indid' and
> b.divid='$divid' and b.catid='$catid')";
> 
>     Now, I can/have rewritten this as:
> 
> SELECT id, name, url 
>   FROM aecCategory 
>  WHERE ppid='$indid' 
>    AND pid='$divid' 
>    AND id IN ( 
> SELECT distinct c.id 
>   FROM aecEntMain a, aecWebEntry b, aecCategory c 
>  WHERE (a.id=b.id AND a.mid=b.mid and b.catid=c.id) 
>    AND (a.status like 'active%' and b.status like
> 'active%') 
>    AND (a.status like '%active:ALL%' and b.status
> like '%active:ALL%') 
>    AND (a.representation like '%:ALL%') 
>    AND (b.indid='$indid' and b.divid='$divid' and
> b.catid IN ( 
>         SELECT id FROM aecCategory WHERE
> ppid='$indid' AND pid='$divid' ) 
>    ));";
> 
>     An explain of the above shows:
> 
> Index Scan using aeccategory_primary on aeccategory 
> (cost=8.28 rows=1 width=36)
>  SubPlan
>   ->  Unique  (cost=1283.70 rows=21 width=72)
>     ->  Sort  (cost=1283.70 rows=21 width=72)
>       ->  Nested Loop  (cost=1283.70 rows=21
> width=72)
>         ->  Nested Loop  (cost=1280.70 rows=1
> width=60)
>           ->  Index Scan using aecwebentry_primary
> on aecwebentry b  
>                     (cost=1278.63 rows=1 width=36)
>                 SubPlan
>                   ->  Index Scan using
> aeccategory_primary on aeccategory  
>                             (cost=8.28 rows=1
> width=12)
>           ->  Index Scan using aecentmain_primary on
> aecentmain a  
>                     (cost=2.07 rows=348 width=24)
>         ->  Index Scan using aeccategory_id on
> aeccategory c  
>                   (cost=3.00 rows=1170 width=12)
> 
>     Now, a few things bother me with the above explain
> output, based on me 
> hopefully reading this right...
> 
>     The innermost SubPlan reports an estimated rows
> returned of 1...the 
> actual query returns 59 rows...slightly off?
> 
>     The one that bothers me is the one that reports
> 1170 rows returned...if you
> look at the query, the only thing that would/should
> use aeccategory_id is the
> line that goes "SELECT distinct c.id"...if I run
> just that section of the 
> query, it yields a result of 55 rows...way off??
> 
>     All of my queries are currently on static data,
> after a vacuum analyze has 
> been performed...everything is faster if I split
> things up and do a SELECT
> on a per id basis on return values, but, as the list
> of 'ids' grow, the
> number of iterations of the while loop required will
> slow down the query...
> 
>     I'm not sure what else to look at towards
> optimizing the query further,
> or is this something that we still are/need to look
> at in the server itself?
> 
>     The machine we are working off of right now is an
> idle Dual-PIII 450Mhz with
> 512Meg of RAM, very fast SCSI hard drives on a UW
> controller...and that query
> is the only thing running while we test things...so
> we aren't under-powered :)
> 
>     ideas?  
> 
> Marc G. Fournier                   ICQ#7615664      
>         IRC Nick: Scrappy
> Systems Administrator @ hub.org 
> primary: scrappy@hub.org           secondary:
> scrappy@{freebsd|postgresql}.org 

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


Re: [HACKERS] [6.5.2] join problems ...

От
The Hermit Hacker
Дата:
On Sun, 19 Sep 1999, Mike Mascari wrote:

> SELECT c.id, c.name, c.url 
> FROM aecEntMain a, aecWebEntry b, aecCategory c
> WHERE a.status LIKE 'active:ALL%'
> AND a.representation LIKE '%:ALL%'
> AND b.status LIKE 'active:ALL%'
> AND b.indid='$indid' 
> AND b.divid='$divid' 
> AND (a.id,a.mid = b.id,b.mid) 
> AND (b.catid,b.indid,b.divid = c.id,c.ppid,c.pid);

Only point I'd like to make (thanks for all the details, gives me alot to
work with...) is that the above is not valid in PostgreSQL, it seems...I
changed the last two AND lines to be:

AND (a.id=b.id AND a.mid=b.mid) 
AND (b.catid=c.id AND b.indid=c.ppid AND b.divid=c.pid)

and it eliminiated the error, but gave me zero results...

Please note, in my own defence...I'm working on cleaning up a mess created
by someone else using MySQL...what has to be done is a cleanup of the
tables themselves, but trying to fix some of the SQL first appears to be
the "route of least resistance" :(  Or, at least, it appeared to
be...starting to change my mind on that one heavily :)


Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] [6.5.2] join problems ...

От
Thomas Lockhart
Дата:
<snip>
> > AND (a.id,a.mid = b.id,b.mid)
> > AND (b.catid,b.indid,b.divid = c.id,c.ppid,c.pid);
> ... the above is not valid in PostgreSQL, it seems...

I have to resort to looking at gram.y for this, since I currently have
the Postgres parser in bits and pieces all over the garage floor ;)

The expressions are *almost* valid for Postgres. The difference is
that you need to put parens around each side of the "row expression":
       | '(' row_descriptor ')' row_op '(' row_descriptor ')'               {                   $$ = makeRowExpr($4,
$2,$6);               }       ;
 

I had implemented this using Date and Darwen as a reference, and afaik
the SQL standard (and any sensible parser) *requires* parens around
the row expression, referred to in gram.y as a "row descriptor".

So, the following should work:
 AND ((a.id,a.mid) = (b.id,b.mid)) AND ((b.catid,b.indid,b.divid) = (c.id,c.ppid,c.pid));

                   - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] [6.5.2] join problems ...

От
Bruce Momjian
Дата:
> With respect to subqueries and PostgreSQL, as you 
> know, the IN clause requires a nested scan. If you
> are going to use subqueries, correlated subqueries
> using EXISTS clauses can use indexes:
> 
> SELECT c.id, c.name, c.url 
> FROM aecCategory c
> WHERE EXISTS (
> SELECT a.status 
> FROM aecEntMain a, aecWebEntry b
> WHERE a.status LIKE 'active:ALL%'
> AND a.representation LIKE '%:ALL%'
> AND b.status LIKE 'active:ALL%'
> AND b.indid='$indid' 
> AND b.divid='$divid' 
> AND (a.id,a.mid = b.id,b.mid) 
> AND (b.catid,b.indid,b.divid = c.id,c.ppid,c.pid));
> 
> Unfortunately, the lack of index support in IN
> subqueries affects more than just the IN subquery 
> clause, since INTERSECT/EXCEPT uses the rewriter to
> rewrite such queries as UNIONS of two queries with
> an IN/NOT IN subquery, respectively. This makes the
> INTERSECT/EXCEPT feature functionally useless except
> on very small tables.

Yes, we are aware of that IN limitation, and I keep trying to get it
fixed.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] [6.5.2] join problems ...

От
The Hermit Hacker
Дата:
Comparing my original query against yours, idle machine:

Mine: 0.000u 0.023s 0:07.78 0.2%      48+132k 0+0io 0pf+0w (55 rows)
Your: 0.006u 0.018s 0:12.16 0.0%      408+904k 0+0io 0pf+0w (55 rows)

Takes longer to run, less CPU resources, but, if I'm reading this right,
more memory resources?

On Sun, 19 Sep 1999, Mike Mascari wrote:

> The query you've presented is rather convoluted, but
> if I'm reading your query correctly, it should reduce
> to a simple, three-way join:
> 
> SELECT c.id, c.name, c.url 
> FROM aecEntMain a, aecWebEntry b, aecCategory c
> WHERE a.status LIKE 'active:ALL%'
> AND a.representation LIKE '%:ALL%'
> AND b.status LIKE 'active:ALL%'
> AND b.indid='$indid' 
> AND b.divid='$divid' 
> AND (a.id,a.mid = b.id,b.mid) 
> AND (b.catid,b.indid,b.divid = c.id,c.ppid,c.pid);
> 
> with the following indexes:
> aecEntMain:  (status) and (id,mid)
> aecWebEntry: (status), (indid), (divid), and
>              (catid,indid,divid)
> aecCategory: (id,ppid,pid)
> 
> Now, there are some differences between the above and
> what you wrote. For example, the above requires that 
> the status begins with 'active:ALL'. Your query 
> requires the status begin with 'active' and must also
> contain the pattern 'active:ALL'. So for the above 
> to be equivalent, you can't have a status such as
> 'active <some stuff> active:ALL'.
> 
> With respect to subqueries and PostgreSQL, as you 
> know, the IN clause requires a nested scan. If you
> are going to use subqueries, correlated subqueries
> using EXISTS clauses can use indexes:
> 
> SELECT c.id, c.name, c.url 
> FROM aecCategory c
> WHERE EXISTS (
> SELECT a.status 
> FROM aecEntMain a, aecWebEntry b
> WHERE a.status LIKE 'active:ALL%'
> AND a.representation LIKE '%:ALL%'
> AND b.status LIKE 'active:ALL%'
> AND b.indid='$indid' 
> AND b.divid='$divid' 
> AND (a.id,a.mid = b.id,b.mid) 
> AND (b.catid,b.indid,b.divid = c.id,c.ppid,c.pid));
> 
> Unfortunately, the lack of index support in IN
> subqueries affects more than just the IN subquery 
> clause, since INTERSECT/EXCEPT uses the rewriter to
> rewrite such queries as UNIONS of two queries with
> an IN/NOT IN subquery, respectively. This makes the
> INTERSECT/EXCEPT feature functionally useless except
> on very small tables.
> 
> Hope that helps (and is equivalent),
> 
> Mike Mascari (mascarim@yahoo.com)
> 
> --- The Hermit Hacker <scrappy@hub.org> wrote:
> > 
> > Morning...
> > 
> >     This weekend, up at a clients site working with
> > them on improving
> > database performance.  They are currently running
> > MySQL and I'm trying to
> > convince them to switch over to PostgreSQL, for
> > various features that they
> > just don't have with MySQL...
> > 
> >     One of the 'queries' that they are currently doing
> > with MySQL
> > consists of two queries that I can reduce down to
> > one using subqueries,
> > but its so slow that its ridiculous...so I figured
> > I'd throw it out as a
> > problem to hopefully solve?
> > 
> >     The query I'm starting out with is works out as:
> > 
> >                 SELECT id, name, url \
> >                   FROM aecCategory \
> >                  WHERE ppid='$indid' \
> >                    AND pid='$divid'";
> > 
> >     The results of this get fed into a while look that
> > takes the id
> > returned and pushes them into:
> > 
> >             SELECT distinct b.indid, b.divid, b.catid,
> > a.id, a.mid \
> >           FROM aecEntMain a, aecWebEntry b \
> >          WHERE (a.id=b.id AND a.mid=b.mid) \
> >            AND (a.status like 'active%' and b.status
> > like 'active%')
> >            AND (a.status like '%active:ALL%' and
> > b.status like '%active:ALL%')
> >            AND (a.representation like '%:ALL%')
> >            AND (b.indid='$indid' and
> > b.divid='$divid' and b.catid='$catid')";
> > 
> >     Now, I can/have rewritten this as:
> > 
> > SELECT id, name, url 
> >   FROM aecCategory 
> >  WHERE ppid='$indid' 
> >    AND pid='$divid' 
> >    AND id IN ( 
> > SELECT distinct c.id 
> >   FROM aecEntMain a, aecWebEntry b, aecCategory c 
> >  WHERE (a.id=b.id AND a.mid=b.mid and b.catid=c.id) 
> >    AND (a.status like 'active%' and b.status like
> > 'active%') 
> >    AND (a.status like '%active:ALL%' and b.status
> > like '%active:ALL%') 
> >    AND (a.representation like '%:ALL%') 
> >    AND (b.indid='$indid' and b.divid='$divid' and
> > b.catid IN ( 
> >         SELECT id FROM aecCategory WHERE
> > ppid='$indid' AND pid='$divid' ) 
> >    ));";
> > 
> >     An explain of the above shows:
> > 
> > Index Scan using aeccategory_primary on aeccategory 
> > (cost=8.28 rows=1 width=36)
> >  SubPlan
> >   ->  Unique  (cost=1283.70 rows=21 width=72)
> >     ->  Sort  (cost=1283.70 rows=21 width=72)
> >       ->  Nested Loop  (cost=1283.70 rows=21
> > width=72)
> >         ->  Nested Loop  (cost=1280.70 rows=1
> > width=60)
> >           ->  Index Scan using aecwebentry_primary
> > on aecwebentry b  
> >                     (cost=1278.63 rows=1 width=36)
> >                 SubPlan
> >                   ->  Index Scan using
> > aeccategory_primary on aeccategory  
> >                             (cost=8.28 rows=1
> > width=12)
> >           ->  Index Scan using aecentmain_primary on
> > aecentmain a  
> >                     (cost=2.07 rows=348 width=24)
> >         ->  Index Scan using aeccategory_id on
> > aeccategory c  
> >                   (cost=3.00 rows=1170 width=12)
> > 
> >     Now, a few things bother me with the above explain
> > output, based on me 
> > hopefully reading this right...
> > 
> >     The innermost SubPlan reports an estimated rows
> > returned of 1...the 
> > actual query returns 59 rows...slightly off?
> > 
> >     The one that bothers me is the one that reports
> > 1170 rows returned...if you
> > look at the query, the only thing that would/should
> > use aeccategory_id is the
> > line that goes "SELECT distinct c.id"...if I run
> > just that section of the 
> > query, it yields a result of 55 rows...way off??
> > 
> >     All of my queries are currently on static data,
> > after a vacuum analyze has 
> > been performed...everything is faster if I split
> > things up and do a SELECT
> > on a per id basis on return values, but, as the list
> > of 'ids' grow, the
> > number of iterations of the while loop required will
> > slow down the query...
> > 
> >     I'm not sure what else to look at towards
> > optimizing the query further,
> > or is this something that we still are/need to look
> > at in the server itself?
> > 
> >     The machine we are working off of right now is an
> > idle Dual-PIII 450Mhz with
> > 512Meg of RAM, very fast SCSI hard drives on a UW
> > controller...and that query
> > is the only thing running while we test things...so
> > we aren't under-powered :)
> > 
> >     ideas?  
> > 
> > Marc G. Fournier                   ICQ#7615664      
> >         IRC Nick: Scrappy
> > Systems Administrator @ hub.org 
> > primary: scrappy@hub.org           secondary:
> > scrappy@{freebsd|postgresql}.org 
> 
> __________________________________________________
> Do You Yahoo!?
> Bid and sell for free at http://auctions.yahoo.com
> 

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org