Discussion: prepared statements suboptimal?
Hi, I'm planning to use prepared statements of indefinite lifetime in a daemon that will execute the same statements rather frequently in reply to client requests.

This link:
http://www.postgresql.org/docs/8.3/static/sql-prepare.html
has a note on performance:

    In some situations, the query plan produced for a prepared statement will be
    inferior to the query plan that would have been chosen if the statement had
    been submitted and executed normally. This is because when the statement is
    planned and the planner attempts to determine the optimal query plan, the
    actual values of any parameters specified in the statement are unavailable.
    PostgreSQL collects statistics on the distribution of data in the table, and
    can use constant values in a statement to make guesses about the likely
    result of executing the statement. Since this data is unavailable when
    planning prepared statements with parameters, the chosen plan might be
    suboptimal.

I don't understand why postgres couldn't plan this:

SELECT foo.e, foo.f
FROM foo
WHERE pk=$1 AND b=$2 AND status='1' AND c <= $3;

to be later executed any slower than

SELECT foo.e, foo.f
FROM foo
WHERE pk='abcabc' AND b='13' AND status='1' AND c <= '2007-11-20 13:14:15';

Can I help it make more educated guesses? In what scenarios could prepared statements turn around and bite me, being slower than simple queries? Is this a real problem in practice? Should I "refresh" prepared statements from time to time? If so, how? Only by deallocating them and preparing anew? Any knob to tweak for that?

Okay, enough questions :)

Thank you for any insights.
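For reference, the PREPARE/EXECUTE/DEALLOCATE cycle being asked about looks like this (the table foo, its columns, and the statement name are the hypothetical ones from the query above; this needs a live session to run):

```sql
-- Plan once; the plan is cached for the lifetime of the session.
PREPARE get_foo(text, text, timestamp) AS
    SELECT foo.e, foo.f
    FROM foo
    WHERE pk = $1 AND b = $2 AND status = '1' AND c <= $3;

-- Execute repeatedly with concrete values; planning is skipped.
EXECUTE get_foo('abcabc', '13', '2007-11-20 13:14:15');

-- "Refreshing" a prepared statement means discarding the cached
-- plan and planning again, presumably after fresh ANALYZE stats:
DEALLOCATE get_foo;
PREPARE get_foo(text, text, timestamp) AS
    SELECT foo.e, foo.f
    FROM foo
    WHERE pk = $1 AND b = $2 AND status = '1' AND c <= $3;
```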
On Wednesday 07 November 2007 12:25:46, rihad wrote:
> I don't understand why postgres couldn't plan this:
> SELECT foo.e, foo.f
> FROM foo
> WHERE pk=$1 AND b=$2 AND status='1' AND c <= $3;
>
> to be later executed any slower than
>
> SELECT foo.e, foo.f
> FROM foo
> WHERE pk='abcabc' AND b='13' AND status='1' AND c <= '2007-11-20 13:14:15';

The problem is that the planner does its work when it first sees the query, and at that time the comparisons are all against unknown values. Try the same with dynamic SQL, that is, build the query dynamically, placing the current values where the $1, $2 and $3 placeholders are. In that case the planner sees the query with all the current real values. Then execute it and compare the timings. You may get better numbers, but there's no guarantee of better performance, because you are sending the whole query to the planner again and again. Of course you need a PL/pgSQL function for this.

-- Reg me Please
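A sketch of that dynamic-SQL approach in PL/pgSQL; the function name, parameter names, and the foo table are the hypothetical ones from the thread, and quote_literal() is used so the interpolated values are safely quoted:

```sql
CREATE OR REPLACE FUNCTION get_foo(p_pk text, p_b text, p_c timestamp)
RETURNS SETOF foo AS $$
DECLARE
    r foo%ROWTYPE;
BEGIN
    -- The query string is rebuilt with the actual values on every call,
    -- so the planner always sees real constants (and re-plans each time).
    FOR r IN EXECUTE 'SELECT * FROM foo WHERE pk = '
                     || quote_literal(p_pk)
                     || ' AND b = ' || quote_literal(p_b)
                     || ' AND status = ''1'' AND c <= '
                     || quote_literal(p_c)
    LOOP
        RETURN NEXT r;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;
```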
rihad wrote:
> Hi, I'm planning to use prepared statements of indefinite lifetime in a
> daemon that will execute same statements rather frequently in reply to
> client requests.
>
> [...]
>
> Can I help it make more educated guesses? In what scenarios could
> prepared statements turn around and bite me, being slower than simple
> queries? Is this a real problem in practice? Should I "refresh" prepared
> statements from time to time? If so, how? Only by deallocating them and
> preparing anew? Any knob to tweak for that?

From http://www.postgresql.org/docs/8.3/static/protocol-flow.html I just read that "This possible penalty is avoided when using the unnamed statement, since it is not planned until actual parameter values are available."
Since I'm using Perl's DBI/DBD::Pg, in postmaster's logs I can see that DBI's prepare() seems to be using named prepared statements:

Nov 7 15:57:46 sol postgres[1685]: [2-1] LOG: execute dbdpg_1:
Nov 7 15:57:46 sol postgres[1685]: [2-2] SELECT ...

Is there any way to tell it to use unnamed prepared statements? I understand this is not a strictly PostgreSQL question, so sorry if I'm off topic.
rihad wrote:
> Hi, I'm planning to use prepared statements of indefinite
> lifetime in a daemon that will execute same statements
> rather frequently in reply to client requests.
>
> This link:
> http://www.postgresql.org/docs/8.3/static/sql-prepare.html
> has a note on performance:
>
> In some situations, the query plan produced for a prepared statement
> will be inferior to the query plan that would have been chosen if the
> statement had been submitted and executed normally. [...]
>
> I don't understand why postgres couldn't plan this:
> SELECT foo.e, foo.f
> FROM foo
> WHERE pk=$1 AND b=$2 AND status='1' AND c <= $3;
>
> to be later executed any slower than
>
> SELECT foo.e, foo.f
> FROM foo
> WHERE pk='abcabc' AND b='13' AND status='1' AND c <=
> '2007-11-20 13:14:15';

For example, if the table contains almost no rows in the beginning, the planner will choose a full table scan even if, say, 'pk' is the primary key. If you use the same execution plan later when the table is big, the full table scan will hurt considerably, and you would be much better off with an index lookup.

Other scenarios are certainly conceivable, but this one is easy to understand.

> Can I help it make more educated guesses? In what scenarios could
> prepared statements turn around and bite me, being slower than simple
> queries? Is this a real problem in practice? Should I
> "refresh" prepared statements from time to time? If so, how? Only by
> deallocating them and preparing anew? Any knob to tweak for that?

You'll probably have to deallocate them and prepare them anew.

Yours,
Laurenz Albe
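Laurenz's empty-table scenario can be reproduced by hand in psql, roughly like this (table and statement names are hypothetical, and the plans shown by EXPLAIN will depend on the installation):

```sql
CREATE TABLE foo (pk text PRIMARY KEY, e int, f int);
ANALYZE foo;                       -- statistics say: table is empty

PREPARE by_pk(text) AS SELECT e, f FROM foo WHERE pk = $1;
EXPLAIN EXECUTE by_pk('abcabc');   -- likely a seq scan: the table looked tiny

-- ... bulk-load millions of rows, then:
ANALYZE foo;
DEALLOCATE by_pk;                  -- throw away the stale cached plan
PREPARE by_pk(text) AS SELECT e, f FROM foo WHERE pk = $1;
EXPLAIN EXECUTE by_pk('abcabc');   -- now expect an index scan on the PK
```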
rihad <rihad@mail.ru> writes:
> I don't understand why postgres couldn't plan this:
> SELECT foo.e, foo.f
> FROM foo
> WHERE pk=$1 AND b=$2 AND status='1' AND c <= $3;
>
> to be later executed any slower than
>
> SELECT foo.e, foo.f
> FROM foo
> WHERE pk='abcabc' AND b='13' AND status='1' AND c <= '2007-11-20 13:14:15';

The reason is that without knowing the parameter values, the planner has to pick a "generic" plan that will hopefully not be too awful regardless of what the actual values end up being. When it has the actual values it can make much tighter estimates of the number of matching rows, and possibly choose a much better but special-purpose plan.

As an example, if the available indexes are on b and c then the best query plan for the first case is probably a bitmap indexscan on b. But in the second case, the planner might be able to determine (by consulting the ANALYZE stats) that there are many rows matching b='13' but very few rows with c <= '2007-11-20 13:14:15', so for those specific parameter values an indexscan on c would be better. It would be folly to choose that as the generic plan, though, since on the average a one-sided inequality on c could be expected to not be very selective at all.

regards, tom lane
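The two cases described here can be compared directly with EXPLAIN; this is a sketch assuming the hypothetical foo table with indexes on b and c, as in the example:

```sql
PREPARE q(text, timestamp) AS
    SELECT e, f FROM foo WHERE b = $1 AND c <= $2;

-- Generic plan, chosen without knowing the parameter values
-- (per the example, probably a bitmap indexscan on b):
EXPLAIN EXECUTE q('13', '2007-11-20 13:14:15');

-- Plan chosen with the constants visible to the planner
-- (may differ, e.g. an indexscan on c for these values):
EXPLAIN SELECT e, f FROM foo
WHERE b = '13' AND c <= '2007-11-20 13:14:15';
```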
Tom Lane wrote:
> rihad <rihad@mail.ru> writes:
>> I don't understand why postgres couldn't plan this:
>> SELECT foo.e, foo.f
>> FROM foo
>> WHERE pk=$1 AND b=$2 AND status='1' AND c <= $3;
>
>> to be later executed any slower than
>
>> SELECT foo.e, foo.f
>> FROM foo
>> WHERE pk='abcabc' AND b='13' AND status='1' AND c <= '2007-11-20 13:14:15';
>
> The reason is that without knowing the parameter values, the planner has
> to pick a "generic" plan that will hopefully not be too awful regardless
> of what the actual values end up being. When it has the actual values
> it can make much tighter estimates of the number of matching rows, and
> possibly choose a much better but special-purpose plan. [...]

Aha, thanks for a thorough explanation. Now I understand that while looking for a way to fulfill the query, postgres will try hard to pick the plan requiring the fewest row visits. I've skimmed over my queries: almost all of them use the primary key as the first thing in the WHERE clause (say, a username, which is the only pk in the table). Shouldn't that be enough for postgres to *always* decide to scan the pk's index (since a query on a pk always returns either one or zero rows)? Same question for any number of joins where bar.id or baz.id is always a PK:

SELECT ... FROM foo
JOIN bar ON (foo.bar_id = bar.id)
JOIN baz ON (foo.baz_id = baz.id)
WHERE asd=? AND dsa=?;
rihad <rihad@mail.ru> writes:
> Aha, thanks for a thorough explanation. Now I understand that while
> looking for a way to fulfill the query postgres will try hard to pick
> the one requiring the least number of rows visits. I've skimmed over my
> queries: almost all of them make use of the primary key as the first
> thing in the WHERE clause (say, a username, which is the only pk in the
> table): shouldn't that be enough for postgres to *always* decide to scan
> the pk's index (since a query on a pk always returns either one or zero
> results)?

Yeah, if there's always a PK equality constraint then the dependence on specific parameter values is much weaker, so you could probably use a prepared statement without worrying.

The cases where prepared statements tend to suck usually involve either inequalities, or equalities on non-unique columns where the number of matches varies wildly for different data values. In cases like that, knowing the exact value being compared to makes a very large difference in the rowcount estimate.

regards, tom lane
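So the safe case from this exchange, sketched with the thread's hypothetical table, is a prepared statement whose WHERE clause is anchored on a PK equality:

```sql
-- PK equality dominates the plan: at most one row can match,
-- so the generic plan and the value-specific plan coincide.
PREPARE by_user(text) AS
    SELECT e, f FROM foo WHERE pk = $1 AND status = '1';

EXECUTE by_user('abcabc');
```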