Discussion: SQL Query Optimization
Hello,

I am using PostgreSQL to house chemical informatics data, which consists of several interlinked tables with tens of thousands (maximum) of rows. When doing search queries against these tables (which always require multiple joins), I have noticed that semantically equivalent SQL queries can differ vastly in performance depending on the order of the clauses ANDed together ("WHERE cond1 AND cond2" takes forever, but "WHERE cond2 AND cond1" comes right back). So it appears I need to do some pre-optimization of the SQL query generated by the user before submitting it to PostgreSQL, in order to guarantee (or at least increase the likelihood of) the fastest results.

I've tried STFW and RTFM but haven't found any good pointers on where to start with this, although I feel that there must be some published algorithms or theories. Can anyone point me to a URL or other source to get me on my way? Also, I wonder if this sort of query optimization is built into other databases such as Oracle?

I did find this URL: http://redbook.cs.berkeley.edu/lec7.html which seems interesting, but honestly I'm far from a DB expert, so I can't follow most of it, and I can't tell whether it is talking about optimization that can be done in application space (query rewriting) or something that has to be done in the database engine itself. I'm going to try to find the book it references, though.

Basically I feel a bit in over my head, which is OK, but I don't want to waste time paddling in the wrong direction, so I'm hoping someone can recognize where I need to look and nudge me in that direction. Maybe I just need the proper terminology to plug into Google.

Thanks,
Dav

--
Dav Coleman
http://www.danger-island.com/dav/
Dav,

> I am using postgresql to house chemical informatics data which consists
> of several interlinked tables with tens of thousands (maximum) of rows.
> When doing search queries against these tables (which always requires
> multiple joins) I have noticed that the semantically equivalent SQL
> queries can differ vastly in speed performance depending on the order
> of clauses ANDed together ("WHERE cond1 AND cond2" takes forever, but
> "WHERE cond2 AND cond1" comes right back).

In most cases, this kind of performance difference is due to how you indexed the table. If, for example, you have an index on (field2, field1), and you do a "WHERE field1 = y AND field2 = x", then the query parser probably won't use the index because the field order is different.

Fortunately, in Postgres 7.2 you now get index usage statistics. Hopefully another user will follow up this e-mail by explaining how to access them.

The idea is that, if you find that certain views and queries are very slow, you check what tables they all have in common. Then check the indexes and statistics for each table. If you see a large table with only 3 indexes, none of which are getting much use, then they are probably the wrong indexes, or you need to change the structure of your WHERE clause. Also, EXPLAIN can be a big help here.

See http://techdocs.postgresql.org for more material on optimization.

I can understand that you'd like a tool to make all this easier for you, but I haven't seen any such thing, unless it ships with the Enterprise version of Oracle.

-Josh Berkus
"Josh Berkus" <josh@agliodbs.com> writes:
>> ("WHERE cond1 AND cond2" takes forever, but "WHERE cond2
>> AND cond1" comes right back).

> In most cases, the above kind of optimization difference is due to how
> you indexed the table. If, for example, you have an index on (field2,
> field1), and you do a "WHERE field1 = y and field2 = x" then the query
> parser probably won't use the index because the field order is
> different.

Not at all. Postgres understands very well that it's allowed to rearrange AND'ed clauses. Using current sources (so that you can see the index condition in EXPLAIN):

regression=# create table foo (f1 int, f2 int, unique(f1,f2));
NOTICE:  CREATE TABLE / UNIQUE will create implicit index 'foo_f1_key' for table 'foo'
CREATE
regression=# explain select * from foo where f1 = 1 and f2 = 42;
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using foo_f1_key on foo  (cost=0.00..4.83 rows=1 width=8)
   Index Cond: ((f1 = 1) AND (f2 = 42))
(2 rows)

regression=# explain select * from foo where f2 = 42 and f1 = 1;
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using foo_f1_key on foo  (cost=0.00..4.83 rows=1 width=8)
   Index Cond: ((f1 = 1) AND (f2 = 42))
(2 rows)

I was curious about the details of Dav's query because it wasn't obvious why he'd be getting a different result. Perhaps the two query plans are mistakenly estimated to have exactly the same cost? (Although WHERE clause order doesn't affect the set of plans considered, it can affect the order in which they're considered, which might result in a different choice between two plans that are estimated to have identical costs.) Another possibility: perhaps neither condition is indexable, but cond1 is vastly more expensive to compute than cond2? (Maybe it's a sub-SELECT.) Right now I don't believe there's any code in there that will rearrange AND-clause order strictly on the basis of cost-to-compute-the-clauses-themselves.
regards, tom lane
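[Editorial aside: Tom's point, that the planner is free to rearrange ANDed clauses, is easy to cross-check. The minimal sketch below uses Python's stdlib sqlite3 module as a stand-in, since no live Postgres server is assumed here; the table mirrors Tom's foo example, and SQLite's planner behaves the same way for this case.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE (f1, f2) creates an implicit two-column index, as in Tom's example.
conn.execute("CREATE TABLE foo (f1 INTEGER, f2 INTEGER, UNIQUE (f1, f2))")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); the
    # 'detail' column names the index (or scan) the planner chose.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

p1 = plan("SELECT * FROM foo WHERE f1 = 1 AND f2 = 42")
p2 = plan("SELECT * FROM foo WHERE f2 = 42 AND f1 = 1")

# Both orderings of the ANDed clauses yield the identical index plan.
print(p1 == p2)                          # True
print(any("INDEX" in d for d in p1))     # True
```

So, as in Tom's psql session, swapping the order of the ANDed conditions does not change the chosen plan at all.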
I should be more clear: the problem is that the application user can basically construct the SQL query dynamically, so I have no control over how the original SQL query will be formed or what it will consist of. It can be any possible query in practice. Because of this, it is not just a matter of analyzing any specific queries, and I don't want to start creating every possible index (although I might, if I have to).

But I can see where I was heading in the wrong direction already. I was thinking that what I needed was to find theories/algorithms on how to rewrite the SQL before submitting it to PostgreSQL, and maybe I still need to do that, but I guess I also need to EXPLAIN and analyze the bad vs. good forms of the queries so I'll know what makes a 'good' vs. 'bad' query (so I'll get a sense of how to rewrite queries). Perhaps with that understanding, an algorithm for rewriting the queries will be apparent.

I just figured I couldn't be the first person to run into this problem, but I can't find it mentioned anywhere.

-Dav

btw, I'm running postgresql-7.1.2 (compiled from source) on rh7.0

Josh Berkus [josh@agliodbs.com] wrote:
> In most cases, the above kind of optimization difference is due to how
> you indexed the table. If, for example, you have an index on (field2,
> field1), and you do a "WHERE field1 = y and field2 = x" then the query
> parser probably won't use the index because the field order is
> different.
--
Dav Coleman
http://www.danger-island.com/dav/
Dav,

> I should be more clear, the problem is that the application user can
> basically construct the SQL query dynamically, so I have no control on
> how the original SQL query will be formed or what it will consist of.
> It can be any possible query in practice. Because of this, it is not
> just a matter of analyzing any specific queries, and I don't want to
> start creating every possible index (although I might, if I have to).

See Tom's response. He's the expert.

However, if the user is allowed to write any query they wish, it does sound like you'll need to construct every reasonable index. This will make UPDATEs on your tables very expensive, but you have no way of anticipating what the user will ask.

You'll also need to take a really hard look at the relational structure of your database. A seemingly trivial lack of normal form in your table structures can become very costly as far as subqueries are concerned. Also, DISTINCT can be very costly on large tables.

> I just figured I couldn't be the first person to run into this
> problem, but I can't find it mentioned anywhere.

Good luck. I can't even think of any books I've reviewed that would address this issue. Part of the problem, I think, is that optimization is so circumstantial.

> btw, I'm running postgresql-7.1.2 (compiled from source) on rh7.0

I very much suggest that you upgrade to 7.2.1. Tom and company have made substantial improvements to the query planner and the tracking of index statistics in 7.2.

-Josh
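[Editorial aside: Josh's "construct every reasonable index" fallback can at least be mechanized crudely. The sketch below emits one single-column CREATE INDEX statement per searchable column; the table and column names are hypothetical, not taken from Dav's actual schema.]

```python
# Crude sketch: generate single-column index DDL for every searchable
# column of a table. The names below are invented for illustration.
def index_ddl(table, columns):
    return [
        f"CREATE INDEX idx_{table}_{col} ON {table} ({col});"
        for col in columns
    ]

for stmt in index_ddl("compound", ["formula", "mol_weight", "cas_no"]):
    print(stmt)
# CREATE INDEX idx_compound_formula ON compound (formula);
# CREATE INDEX idx_compound_mol_weight ON compound (mol_weight);
# CREATE INDEX idx_compound_cas_no ON compound (cas_no);
```

Bear Josh's warning in mind before running such DDL wholesale: every extra index makes UPDATEs and INSERTs on that table more expensive.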
On Thursday 18 April 2002 17:35, Dav Coleman wrote:
> I should be more clear, the problem is that the application user can
> basically construct the SQL query dynamically
>
> But I can see where I was heading in the wrong direction already. I was
> thinking that what I needed was to find theories/algorithms on how to
> rewrite the SQL before submitting it to postgresql, and I maybe still
> need to do that

Sort clauses alphabetically (or whatever makes sense to you) so you always get

  SELECT * FROM a, b WHERE c AND d

rather than "b, a" or "d AND c". That way at least you're not getting variations.

> but I guess I also need to EXPLAIN and analyze the

Record the queries and times, either in PG's log or in the application.

> bad vs good forms of the queries so I'll know what makes a 'good' vs
> 'bad' query (so I'll get a sense on how to rewrite queries). Perhaps
> with that understanding, an algorithm for rewriting the queries will
> be apparent.
>
> I just figured I couldn't be the first person to run into this problem,
> but I can't find it mentioned anywhere.

After the basics (indexes on the fields involved in joins, etc.) it all gets a bit specific to the size of the tables/indexes involved and the quirks of the planner. If you logged the query plan and cost estimates for each query processed, it shouldn't be too difficult to automatically add indexes where required and see if it makes any difference. That assumes you have good clean patterns of usage in your queries. We're getting a bit AI there, mind.

- Richard Huxton
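[Editorial aside: Richard's canonicalization idea can be sketched in a few lines. This naive version assumes the generated WHERE clause is a flat conjunction, i.e. plain conditions joined by AND with no ORs or parenthesized groups; the helper name normalize_where is made up for illustration.]

```python
# Naive sketch of Richard's idea: sort the top-level ANDed conditions of
# a generated WHERE clause so that equivalent queries always arrive at
# the database in a single canonical spelling. Assumes a flat list of
# conditions joined by " AND " (no ORs, no nested parentheses).
def normalize_where(where_clause):
    conds = [c.strip() for c in where_clause.split(" AND ")]
    return " AND ".join(sorted(conds))

print(normalize_where("d = 4 AND c = 3"))   # c = 3 AND d = 4
print(normalize_where("c = 3 AND d = 4"))   # c = 3 AND d = 4
```

Both spellings collapse to one canonical form, which removes the "cond1 AND cond2" vs. "cond2 AND cond1" variation Dav observed, though it does nothing by itself to make that one form fast.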