Discussion: Controlling complexity in queries
I have a very long query. Due to the planner and good indexing, it runs quite fast. But it's so long, it's quite hard to follow.

I'm trying to break it up into pieces, but am running up against limits of SQL. Can you help me with any of these problems?

1.

SELECT
    <complicated expression A with sub expression B> AS A,
    <complicated expression C with sub expression B> AS C,
    <complicated expression D with sub expression B> AS D
...

I'd like to be able to extract the common subexpression B and give it a name (called "output_name" in the docs). But there's no way then to reference it from the SELECT clause. Any workarounds?

2. complicated join and subquery

I'd like to extract subparts of this which are conceptually cohesive and make them VIEWs. The problem is that they depend on parameters (in the ON and WHERE clauses), and VIEWs don't allow parameters. I could use set-returning functions, but, besides the headache involved, I've found that these tend to stop the planner from peering inside them, and hence ruin performance.

Is there a recommended solution?
Inlined.

David J.

On Dec 11, 2011, at 19:46, Robert James <srobertjames@gmail.com> wrote:

> I have a very long query. Due to the planner and good indexing, it
> runs quite fast. But it's so long, it's quite hard to follow.
>
> I'm trying to break it up into pieces, but am running up against
> limits of SQL. Can you help me with any of these problems?
>
> 1.
> SELECT
>     <complicated expression A with sub expression B> AS A,
>     <complicated expression C with sub expression B> AS C,
>     <complicated expression D with sub expression B> AS D
> ...
>
> I'd like to be able to extract the common subexpression B and give it
> a name (called "output_name" in the docs). But there's no way then to
> reference it from the SELECT clause. Any workarounds?

Use a WITH clause on the SELECT statement.

> 2. complicated join and subquery
> I'd like to extract subparts of this which are conceptually cohesive
> and make them VIEWs. The problem is that they depend on parameters
> (in the ON and WHERE clauses), and VIEWs don't allow parameters. I
> could use set returning functions, but, besides the headache involved,
> I've found that these tend to stop the planner from peering inside
> them, and hence ruin performance.
>
> Is there a recommend solution?

Use the VIEWs and let the planner optimize based upon the calling statement's WHERE clause. ON clause parameters are a different story than the WHERE clause, so see if you can avoid them. In the end, if it is a critical area, some degree of trial-and-error is useful; custom materialized views may also work.
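[Editor's note] For problem 1, a minimal sketch of the WITH approach David suggests, keeping the OP's placeholder notation (all names here are hypothetical):

```sql
-- Compute the common subexpression B once in the CTE, give it a
-- name, and reference that name from the outer SELECT list:
WITH base AS (
    SELECT t.*, <sub expression B> AS b
    FROM some_table AS t
)
SELECT
    <complicated expression using b> AS a,
    <complicated expression using b> AS c,
    <complicated expression using b> AS d
FROM base;
```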
On 12/12/2011 09:15 AM, David Johnston wrote:
> Use a WITH clause on the SELECT statement.

Note that WITH is an optimisation fence, so if you're relying on Pg pushing WHERE clauses down into subqueries or anything like that, you may find that your query runs a LOT slower when broken up as WITH expressions.

There's been talk of a Pg extension that allows optimisation through WITH, but it's not currently possible.

Another option is to wrap things up in SQL functions or views.

--
Craig Ringer
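[Editor's note] A sketch of the fence Craig describes, with hypothetical table names. (For later readers: PostgreSQL 12 began inlining non-recursive, side-effect-free CTEs referenced once, so WITH is no longer an unconditional fence unless you write `AS MATERIALIZED`.)

```sql
-- Written as a plain subquery, the planner can push the filter
-- down and use an index on big_table.id:
SELECT * FROM (SELECT * FROM big_table) AS sub WHERE id = 1;

-- Written as a CTE (at the time of this thread), WITH is a fence:
-- big_table may be evaluated in full before the filter applies.
WITH sub AS (SELECT * FROM big_table)
SELECT * FROM sub WHERE id = 1;
```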
On Sun, Dec 11, 2011 at 9:10 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 12/12/2011 09:15 AM, David Johnston wrote:
>>
>> Use a WITH clause on the SELECT statement.
>
> Note that WITH is an optimisation fence, so if you're relying on Pg pushing
> WHERE clauses down into subqueries or anything like that you may find that
> your query runs a LOT slower when broken up as WITH expressions.
>
> There's been talk of a Pg extension that allows optimisation through WITH,
> but it's not currently possible.
>
> Another option is to wrap things up in SQL functions or views.

A note about that: abstracting via views vs functions is a completely different approach. Views will not significantly change the way your query works now -- they are inlined as macros, and the final query is going to be more or less the same as your hand-rolled one. Breaking your large queries into functions OTOH can make significant changes to the plan, often to the worse. This is because functions, especially complicated plpgsql set-returning ones with procedural logic, are black boxes to the sql optimizer. The upshot of this is that functions tend to encourage nestloop-style plans because the function has to be serially executed.

Functions (also WITH) are great in that they can provide very high levels of abstraction when composing complex queries, but there is a price in the sense that you are taking away some of the database's ability to plan and optimize the query. I prefer views unless there is a good reason not to use them.

In the end, the performance of your queries is going to be directly related to how well you map the problem into relational logic... the database thinks relationally, so you (the OP) should learn to do so as well.

merlin
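[Editor's note] A hypothetical illustration of the view-vs-function contrast Merlin draws (table and column names are made up):

```sql
-- A view is expanded like a macro; the planner sees the whole
-- query and can use an index on orders.customer_id:
CREATE VIEW recent_orders AS
    SELECT *
    FROM orders
    WHERE created_at > now() - interval '30 days';

SELECT * FROM recent_orders WHERE customer_id = 42;

-- A plpgsql set-returning function is a black box: it runs to
-- completion and its output is filtered afterwards, which tends
-- to force nestloop-style plans around it.
CREATE FUNCTION recent_orders_fn() RETURNS SETOF orders AS $$
BEGIN
    RETURN QUERY
        SELECT *
        FROM orders
        WHERE created_at > now() - interval '30 days';
END;
$$ LANGUAGE plpgsql;

SELECT * FROM recent_orders_fn() WHERE customer_id = 42;
```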
Merlin Moncure wrote:
> Breaking your large queries into functions OTOH can make significant
> changes to the plan, often to the worse.

As an end-user, I think this is an area where PostgreSQL could really stand out (that and the moon launch). In Rails-land, you don't have The DBA that writes queries. You have a developer and an ORM, and when they step outside that ORM to do the cool things SQL can do, all their DRY habits fall apart, because it's 1990 again and you can either write clear code or fast code but not both.

Postgres has made huge inroads with its optimizer. A few years ago, for web startups picking their relational DB, the performant answer was "MySQL for fast, simple, mostly-read queries, PostgreSQL when you grow up." That's become "Probably Postgres", and with 9.2's COUNT(*), it's become "Definitely Postgres". But having to write one big query for performance feels exactly like having to write one big C function with unrolled loops. I'm currently taking a well-factored, function-based query and turning it INTO what Robert James is trying to get OUT of: a monolithic query.

> In the end, the performance of your queries is going to be directly
> related to how well you map the problem into relational logic

It's not just that, though; it's quite possible to think relationally and still fall down. There are plenty of cases where the human eye can see that a modular function can be inlined, but the optimizer can't. I have a pathological case: a query against a database with just a few thousand users takes 1.5 seconds on fast hardware, because it ends up scanning a cartesian product to get 16 rows, even before you get to the nested loops. In fact, most of the time the optimizer does a great job of inlining all my set-returning functions, once 9.0.6/9.1.2 rolled out.

I've seen at least three equally ominous pieces that would have to happen to allow DRY, composable SQL:

1. Optional optimization of non-recursive WITH
2. Optional pushdown of WHERE clauses into GROUP BY [1]
3. LATERAL

AFAIK, none of these are on anyone's short-term to-do list, and I'm sure none are easy.

[1] Since this is my current favorite problem, the pathological case is:

select questions.id
from questions
join (
    select u.id
    from users as u
    group by u.id
) as s on s.id = questions.user_id
where questions.id = 1;

With users.id as a primary key, it's obvious that this can return only one row, but it has to scan the users table to get there. See the "Subjquery in a JOIN not getting restricted?" thread on pgsql-performance for Tom's explanation of why that's a hard problem to solve.
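[Editor's note] Item 3 on Jay's list did land later, in PostgreSQL 9.3. A minimal sketch of what LATERAL allows (table names hypothetical):

```sql
-- LATERAL lets a subquery in FROM reference columns of earlier
-- FROM items -- effectively a parameterized, per-row subquery:
SELECT u.id, best.score
FROM users AS u,
     LATERAL (SELECT r.score
              FROM ratings AS r
              WHERE r.user_id = u.id
              ORDER BY r.score DESC
              LIMIT 1) AS best;
```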
On Tue, Dec 13, 2011 at 4:27 PM, Jay Levitt <jay.levitt@gmail.com> wrote:
> Merlin Moncure wrote:
>>
>> Breaking your large queries into functions OTOH can make significant
>> changes to the plan, often to the worse.
>
> As an end-user, I think this is an area where PostgreSQL could really stand
> out (that and the moon launch). In Rails-land, you don't have The DBA that
> writes queries. You have a developer and an ORM, and when they step outside
> that ORM to do the cool things SQL can do, all their DRY habits fall apart,
> because it's 1990 again and you can either write clear code or fast code but
> not both.

As far as ORMs are concerned, I'll take 1990 over 2011 all day long. The heavy emphasis on procedural/OO coding tactics to model business applications IMNSHO is and always has been a total catastrophe. The reasons for that are numerous: insane defect rates, brittle functional relationships, poor concurrency model, etc. Enterprises move off SQL because they hate paying the dollars top SQL talent demands, only to pay the ultimate price when development gets crushed under the weight of maintaining all that crappy code. This sad state of affairs has been encouraged by some of the top software vendors.

> But having to write one big query for performance feels exactly like having
> to write one big C function with unrolled loops. I'm currently taking a
> well-factored, function-based query and turning it INTO what Robert James is
> trying to get OUT of: a monolithic query.

SQL has a very powerful abstraction feature: it's called a view. Good use of views is a key design feature for complex databases.

Functions are generally not a good choice for query abstraction unless:

*) you are working with scalars (string manipulation etc)
*) you need non relational features like plpgsql exception handling/notice printing, etc
*) this particular operation is not really optimizable anyways and you want to wrap it (WITH RECURSIVE for example)
*) your function is inline-able (generally, a one-liner sql function that is stable or immutable)

etc

>> In the end, the performance of your queries is going to be directly
>> related to how well you map the problem into relational logic
>
> It's not just that, though; it's quite possible to think relationally and
> still fall down. There are plenty of cases where the human eye can see that
> a modular function can be inlined, but the optimizer can't. I have a
> pathological case: a query against a database with just a few thousand users
> takes 1.5 seconds on fast hardware, because it ends up scanning a cartesian
> product to get 16 rows, even before you get to the nested loops. In fact,
> most of the time the optimizer does a great job of inlining all my
> set-returning functions, once 9.0.6/9.1.2 rolled out.
>
> I've seen at least three equally ominous pieces that would have to happen to
> allow DRY, composable SQL:
>
> 1. Optional optimization of non-recursive WITH
> 2. Optional pushdown of WHERE clauses into GROUP BY[1]
> 3. LATERAL
>
> AFAIK, none of these are on anyone's short-term to-do list, and I'm sure
> none are easy.
>
> [1] Since this is my current favorite problem, the pathological case is:
>
> select questions.id
> from questions
> join (
>     select u.id
>     from users as u
>     group by u.id
> ) as s on s.id = questions.user_id
> where questions.id = 1;
>
> With users.id as a primary key, it's obvious that this can return only one
> row, but it has to scan the users table to get there. See the "Subjquery in
> a JOIN not getting restricted?" thread on pgsql-performance for Tom's
> explanation of why that's a hard problem to solve.

Yeah -- here and there you run into difficult to optimize queries. (For my part, I'd just have converted that to WHERE EXISTS for the semi-join).

merlin
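[Editor's note] A sketch of Merlin's last bullet, with hypothetical names: a single-statement LANGUAGE sql function declared STABLE, which the planner can inline like a view rather than treat as a black box:

```sql
-- One SELECT, LANGUAGE sql, STABLE: eligible for inlining, so the
-- outer WHERE clause can reach the indexes on questions.
CREATE FUNCTION questions_by_user(uid integer)
RETURNS SETOF questions AS $$
    SELECT * FROM questions WHERE user_id = $1
$$ LANGUAGE sql STABLE;

SELECT * FROM questions_by_user(42) WHERE id = 1;
```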
Merlin Moncure wrote:
> SQL has a very powerful abstraction feature: it's called a view. Good
> use of views is a key design feature for complex databases.
>
> Functions are generally not a good choice for query abstraction unless:

One more:

*) You want contextual queries.

(I guess this is a special case of "you need non relational features".)

In my case, I want all queries against content to be filtered by their relevance to the current user. That can't go into a view, because views don't have parameters; I need a computed column that may be different every time I run the query, and depends on a piece of information (the current user ID) that Postgres can't know. Relevance itself is a weighted average of a bunch of factors which *also* need the user ID, like "how similar are you to the user who authored that content?"

As far as I can tell, the only way to accomplish this is through pure-SQL functions, or by hand-crafting the ultimate SQL query that accomplishes those functions. It's much easier to work with

select content.*, relevance(content.id, current_user.id) as r
order by r

than the underlying query. I'm not doing fancy OO stuff; I'm compartmentalizing the knowledge of "what is relevance", "what is similarity", and "how do I fetch a collection of content". I'm not even in 1990. I'm in (quick Google) 1946, when subroutines were invented: "We also wish to be able to arrange for the splitting up of operations into subsidiary operations."

If views had parameters, I'd use views, but I suspect that a parameterized view would be very much like a pure-SQL set-returning function, ya?

I'd love to find a better way to do this. Having just read Thinking In Sets, I am sure a *real* SQL programmer would create a view that cross joins every content row with every user, materialize it, and then restrict your eventual query with the user id. But reading that book always leaves me with the same two questions: "Why is Joe Celko yelling at me? And what's SNOBOL?"
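[Editor's note] A hypothetical sketch of the pattern Jay describes: relevance as a pure-SQL function, written in the single-SELECT STABLE form that keeps it inline-able per Merlin's earlier criteria. All table and column names are made up:

```sql
-- Weighted average of per-user relevance factors; $1 is the
-- content id, $2 the viewing user's id.
CREATE FUNCTION relevance(content_id integer, viewer_id integer)
RETURNS numeric AS $$
    SELECT sum(f.score * f.weight) / nullif(sum(f.weight), 0)
    FROM relevance_factors AS f
    WHERE f.content_id = $1
      AND f.user_id = $2
$$ LANGUAGE sql STABLE;

-- Usage, with the current user id supplied by the application:
-- SELECT content.*, relevance(content.id, $1) AS r
-- FROM content
-- ORDER BY r DESC;
```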
>> [1] Since this is my current favorite problem, the pathological case is:
>>
>> select questions.id
>> from questions
>> join (
>>     select u.id
>>     from users as u
>>     group by u.id
>> ) as s on s.id = questions.user_id
>> where questions.id = 1;
>>
>> With users.id as a primary key, it's obvious that this can return only one
>> row, but it has to scan the users table to get there. See the "Subjquery in
>> a JOIN not getting restricted?" thread on pgsql-performance for Tom's
>> explanation of why that's a hard problem to solve.
>
> Yeah -- here and there you run into difficult to optimize queries.
> (For my part, I'd just have converted that to WHERE EXISTS for the
> semi-join).

I think I'm about to learn a very important relational-algebra equivalence... could you elaborate?

Jay
>>> [1] Since this is my current favorite problem, the pathological case is:
>>>
>>> select questions.id
>>> from questions
>>> join (
>>>     select u.id
>>>     from users as u
>>>     group by u.id
>>> ) as s on s.id = questions.user_id
>>> where questions.id = 1;
>>>
>>> With users.id as a primary key, it's obvious that this can return only one
>>> row, but it has to scan the users table to get there. See the "Subjquery in
>>> a JOIN not getting restricted?" thread on pgsql-performance for Tom's
>>> explanation of why that's a hard problem to solve.
>>
>> Yeah -- here and there you run into difficult to optimize queries.
>> (For my part, I'd just have converted that to WHERE EXISTS for the
>> semi-join).
>
> I think I'm about to learn a very important relational-algebra equivalence... could you elaborate?

You could write that as:

select questions.id
from questions as q
where exists (select 1 from users as u where u.id = q.user_id)
  and questions.id = 1;

That's basically what you are doing: checking that a user with a given id from the questions table exists in the users table. Writing it as WHERE EXISTS is a matter of "phrasing the question" more accurately, which gives the query planner a hint that for your answer a single hit is sufficient - no need to check whether there are other matches after the first one.

That said, wouldn't a foreign key constraint help you even better? If questions.user_id is required to refer to an existing users.id (by an FK constraint), then the check in the query becomes moot.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.
Alban Hertroys wrote:
>> select questions.id
>> from questions
>> join (
>>     select u.id
>>     from users as u
>>     group by u.id
>> ) as s on s.id = questions.user_id
>> where questions.id = 1;
>
> You could write that as:
>
> select questions.id
> from questions as q
> where exists (select 1 from users as u where u.id = q.user_id)
> and questions.id = 1;
>
> That's basically what you are doing, checking that a user with a given id
> from the questions table exists in the users table.
>
> That said, wouldn't a foreign key constraint help you even better? If
> questions.user_id is required to refer to an existing users.id (by an FK
> constraint), than the check in the query becomes moot.

Ahh, I see.. yes, this query is just the smallest possible query that exhibits the same not-using-the-index behavior as the real query, which needs columns from both questions and users, and thus needs the join. (And it has aggregates, and needs the GROUP BY too.) There already is a constraint; questions.user_id always refers to a real users.id, etc.

This is actually a great case where relational thinking does NOT map well to functional composability; as Tom Lane pointed out, the solution is just "add the WHERE clause to the subquery too." But the subquery is in a function that doesn't *know* it's being restricted, and (to me) shouldn't have to know; that's what the optimizer does for a living.

FWIW, and this may help the OP, my plan for tackling the "but I want readability AND performance" issue is to:

1. Write a monolithic, optimized, incomprehensible version of the query
2. Maintain the pretty functions alongside it
3. Write unit tests that confirm that the output of #1 and #2 is identical

Kinda like how gcc builds gcc and verifies that the output is the same as gcc building gcc building gcc.

Jay
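[Editor's note] Step 3 of Jay's plan can be sketched in SQL itself: an empty symmetric difference means the two versions agree. The view names are hypothetical, and since EXCEPT is set-based, duplicate rows would need an ordered row-by-row comparison instead:

```sql
-- Zero rows returned means the optimized and factored versions
-- produce the same set of rows:
(SELECT * FROM monolithic_query EXCEPT SELECT * FROM factored_query)
UNION ALL
(SELECT * FROM factored_query EXCEPT SELECT * FROM monolithic_query);
```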
Jay Levitt <jay.levitt@gmail.com> writes:
> * You want contextual queries.
>
> (I guess this is a special case of "you need non relational features".)
>
> In my case, I want all queries against content to be filtered by their
> relevance to the current user. That can't go into a view, because
> views don't have parameters; I need a computed column that may be
> different every time I run the query, and depends on a piece of
> information (the current user ID) that Postgres can't know.

How about the following:

CREATE TABLE test1 (
    id serial NOT NULL,
    username text NOT NULL,
    value text NOT NULL,
    PRIMARY KEY (id)
);

COPY test1 (username, value) FROM stdin DELIMITER '|';
user1|user1_1
user1|user1_2
user2|user2_1
user2|user2_2
user2|user2_3
\.

CREATE VIEW test1v AS
    SELECT id, username, value
    FROM test1
    WHERE username = current_user;

Here the result of "SELECT * FROM test1v" depends on who issued the query.
Hi,

> Here the result of "SELECT * FROM test1v" depends on who issued the query.

As a more general case, I sometimes load parameters into a utility table, and use them to dynamically restrict the view's output. Downside: it's a multistatement operation... however, when wrapping complex queries into functions gives the planner a bad time, this will solve the issue.
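[Editor's note] A hypothetical sketch of the utility-table pattern Bèrto describes (all names made up). The UPDATE and the SELECT are separate statements, hence "multistatement":

```sql
-- One row per parameter:
CREATE TABLE session_params (
    name  text PRIMARY KEY,
    value text NOT NULL
);

-- The view reads its "parameter" from the utility table:
CREATE VIEW relevant_content AS
    SELECT c.*
    FROM content AS c
    WHERE c.user_id = (SELECT value::integer
                       FROM session_params
                       WHERE name = 'current_user_id');

-- Usage: set the parameter, then query the view.
-- UPDATE session_params SET value = '42'
--     WHERE name = 'current_user_id';
-- SELECT * FROM relevant_content;
```

Note that a shared table like this is not session-safe; a real implementation would want a per-session temporary table or similar isolation.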
Bèrto
==============================
If Pac-Man had affected us as kids, we'd all be running around in a darkened room munching pills and listening to repetitive music.