Discussion: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization
Hi,

I originally brought up this issue on the pgsql-performance mailing list [^] to no avail, so I am trying again here.

During implementation of a runtime environment and the adjoining database abstraction layer I noticed (like many before me [0], and as correctly mentioned in the documentation) the sizeable performance impact of declaring a cursor "with hold" for queries with large result sets.

Our use case very often looks like this:

open cursor for select from table1
loop
{ fetch some entries from cursor
  update table2
  commit
}

During iteration of the result set we commit changes to the database, so we must make sure to keep the cursor alive. One option is to use "with hold". Unfortunately the result set is then instantly materialized, which is a huge performance burden. In our use case the "commit" of changes often does not affect the iteration set. Also, the loop might be aborted before the result set was fully read, so we never needed the whole materialized set anyway.

To work around these problems, we already employ some static analysis to avoid "with hold" in all situations where there are no commits during the lifetime of the cursor or portal. For the other cursors we use a different database connection inside the same application, which protects the cursors from commit operations and avoids the costly copy operations they would incur "with hold" on the main database connection.

In an attempt to further minimize the performance impact I am thinking about employing a lazy "with hold", where I would fetch all the remaining result rows from a cursor or portal just before a commit statement. This way I could at least have great performance in all conflict-free situations, until one arrives at an impasse. Naturally I am now wondering why the Postgres cursor/portal is not also employing the same trick (at least as an option): postpone materialization of "with hold" cursors until it is required (e.g. when a commit operation is dispatched).
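Spelled out as a PostgreSQL session, the pattern above corresponds to something like the following; the cursor name, tables, and columns are invented for illustration:

    -- cur, table1, table2 and all columns are hypothetical.
    BEGIN;
    DECLARE cur CURSOR WITH HOLD FOR
        SELECT id FROM table1;
    FETCH 100 FROM cur;              -- fetch some entries from cursor
    UPDATE table2 SET done = true
        WHERE ref_id = 42;           -- update table2
    COMMIT;  -- WITH HOLD keeps the cursor open across this commit, but at the
             -- price of materializing the remaining result set at that point
    FETCH 100 FROM cur;              -- now served from the held copy
    CLOSE cur;

Without WITH HOLD, the first COMMIT would close the cursor; with it, the commit pays the full materialization cost up front.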
Probably I am also missing many (internal) aspects, but at that point it might be possible to optimize further. When, for instance, no changes were made to the result set of the "with hold" cursor, it need not be materialized at all. From previous discussions [1] I gather that one can in fact accomplish that by using a different database connection, which is one workaround we are using.

I am not sure whether this kind of workaround/optimization work should be done in the database abstraction/interface layer or in the database itself. Since a lot of people seem to run into this performance issue, many might profit from some optimization magic in the database for such use cases. We are very invested in this performance issue and are happy to resolve it on either level.

Regards,
Leon

[^] https://www.postgresql.org/message-id/20171010122039.2xp4ipqokoke45zk%40bfw-online.de
[0] https://trac.osgeo.org/qgis/ticket/1175
    https://stackoverflow.com/questions/33635405/postgres-cursor-with-hold
    https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/
[1] https://bytes.com/topic/postgresql/answers/420717-cursors-transactions-why
    http://www.postgresql-archive.org/setFetchSize-td4935215.html

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 19, 2017 at 03:20:48PM +0200, Leon Winter wrote:
> Hi,
>
> I originally brought up this issue on the pgsql-performance mailing list [^] to
> no avail so I am trying again here.
>
> During implementation of a runtime environment and the adjoining database
> abstraction layer I noticed (like many before me [0] and as correctly mentioned
> in the documentation) the sizeable performance impact of declaring a cursor
> "with hold" for queries with large result sets.
>
> Our use case very often looks like this:
>
> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   update table2
>   commit
> }

This seems like a very odd construct based on ideas about databases that aren't actually true of PostgreSQL, e.g. that joins are expensive, or that some substantial benefit comes of committing at some higher frequency than the logical transaction.

What other things did you try, and how did they fail? In particular, what happened when you used

UPDATE table2
SET [things based on table1]
FROM table1 [qualified] JOIN table2 ON ([conditions])

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
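Filled in with invented table and column names, the set-based rewrite sketched above might look like this (note that in PostgreSQL the target table is not repeated in the FROM list):

    -- table1/table2 and all columns are hypothetical, purely for illustration.
    UPDATE table2 AS t2
    SET    done  = true,
           total = t2.total + t1.amount   -- [things based on table1]
    FROM   table1 AS t1                   -- table1 [qualified]
    WHERE  t1.active                      -- the qualification
      AND  t1.id = t2.ref_id;            -- ([conditions]) joining the tables

A single statement like this replaces the whole fetch/update/commit loop, so no long-lived cursor is needed at all.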
> What other things did you try, and how did they fail? In particular,
> what happened when you used
>
> UPDATE table2
> SET [things based on table1]
> FROM table1 [qualified] JOIN table2 ON ([conditions])

Well, it is not the ideal way of doing things, but then again this SQL is merely a consequence of the legacy 4GL language and runtime environment we are running (and trying to migrate to Postgres). We have a lot of those SQL structures and would prefer not to change all of them to avoid this situation. Currently there are also two database backends, one being the old legacy database and the other being Postgres, and we are severely limited by the (lacking) capabilities of the old database. We are certainly planning to change many SQL statements to make better use of the database, but not at this point and not for this issue.

The loops are more complex in reality, of course, more like:

open cursor for select from table1
loop
{ fetch some entries from cursor
  call some external application
  do some crazy complicated calculations based on some user input in the UI
  update table2
  commit
}

The calculations inside the loop are written in some dynamic high-level language and cannot easily be translated into SQL.

tl;dr: The existing code base has a lot of these patterns. A general solution on the database (interfacing) level is required.
On Thu, Oct 19, 2017 at 04:06:47PM +0200, Leon Winter wrote:
> > What other things did you try, and how did they fail? In particular,
> > what happened when you used
> >
> > UPDATE table2
> > SET [things based on table1]
> > FROM table1 [qualified] JOIN table2 ON ([conditions])
>
> well, it is not the ideal way of doing things but then again this SQL is merely
> a consequence of the legacy 4GL language and runtime environment we are running
> (and trying to migrate to Postgres). We have a lot of those SQL structures and
> would prefer not to change all of them to avoid this situation. Currently there
> are also two database backends, one being the old legacy database and the other
> being Postgres and we are extremely limited from the (lacking) capabilities of
> the old database. We are surely planning to change many SQL statements to make
> better use of the database but not at this point for this issue.
>
> The loops are more complex in reality of course, more like:
>
> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   call some external application
>   do some crazy complicated calculations based on some user input in the UI
>   update table2
>   commit
> }
>
> The calculations inside the loop are written in some dynamic
> high-level language and cannot easily be translated into SQL.
>
> tl;dr: Existing code base has a lot of these patterns. General
> solution on database (interfacing) level is required.

I don't know quite how to put this, but it's not clear to me that the difficulties in this situation are things PostgreSQL could resolve even with much larger development resources than are currently available.

If you're updating what are perforce small batches of records in the UI, there's excellent reason to pull only those batches, mark them as being "in process," process them, then update the marked ones as "done" or whatever other states they can get to.
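The batch-claiming approach suggested above could be sketched like this; the status column, batch size, and table layout are assumptions, and FOR UPDATE SKIP LOCKED requires PostgreSQL 9.5 or later:

    -- Claim a small batch and mark it "in process" atomically.
    WITH batch AS (
        SELECT id
        FROM   table1
        WHERE  status = 'pending'
        ORDER  BY id
        LIMIT  100
        FOR UPDATE SKIP LOCKED   -- concurrent workers claim disjoint batches
    )
    UPDATE table1
    SET    status = 'in process'
    FROM   batch
    WHERE  table1.id = batch.id
    RETURNING table1.id;

    -- The application processes the returned ids, then, per finished id:
    -- UPDATE table1 SET status = 'done' WHERE id = <processed id>;

Each claim/process/finish cycle is a short transaction, so no cursor has to survive a commit at all.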
As to "crazy complicated calculations," this is what active databases are all about. SQL is Turing complete, so you really can do it.

Would you want something that compiles from the user inputs to SQL? Might that have a more general utility?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
> The calculations inside the loop are written in some dynamic high-level
> language and cannot easily be translated into SQL.
Can you not simply create a second connection to perform the updates?
Geoff
Leon Winter <winter-pg@bfw-online.de> writes:
> The loops are more complex in reality of course, more like:
>
> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   call some external application
>   do some crazy complicated calculations based on some user input in the UI
>   update table2
>   commit
> }

Hm, somehow it's pretty hard to credit that the materialized cursor is the principal performance bottleneck in this configuration.

> The calculations inside the loop are written in some dynamic high-level
> language and cannot easily be translated into SQL.

You don't really have to --- PG supports functions written in non-SQL languages. Not sure if your problem is big enough to justify developing a new PL interface for $random-4GL-language, but that might be something to consider.

But, to get back to the original point: exactly what "sizeable performance impact of declaring a cursor 'with hold'" do you think there is? It shouldn't be noticeably more expensive than just selecting the rows would be. Especially not for a number of rows that wouldn't make the above-depicted application structure completely untenable. And for sure I'm not understanding why you think that materializing the result on the client side instead would be better.

> Naturally I am now wondering why the postgres cursor/portal is not also
> employing the same trick (at least as an option): Postpone
> materialization of "with hold" cursors until it is required (like a
> commit operation is dispatched).

We already do that, and have done it since the feature was invented, AFAIR.

FWIW, the primary reason for materializing the cursor contents at commit, rather than just holding onto the active query as you seem to think would be better, is that materializing allows us to release locks on the underlying table(s).
If we kept the active query we'd have to keep those locks, as well as the query's active snapshot, thus certainly blocking VACUUM cleanup, and possibly blocking subsequent DDL. The approach of using a separate connection to read the cursor suffers from exactly those same problems. Postgres isn't that happy with very long-lived transactions (neither is any other RDBMS I'm familiar with). So really I think that materializing the cursor right away and then doing your application calculations in whatever chunk size seems convenient is probably your best bet here.

regards, tom lane
> I don't know quite how to put this, but it's not clear to me that the
> difficulties in this situation are things PostgreSQL could resolve
> even with much larger development resources than are currently
> available.

There does not seem to exist a cursor/portal/pointer semantic that can survive unrelated changes to the database inside a single connection (and is not super expensive like With_Hold). To some extent a similar behavior can be simulated by using a second connection. I assume most people avoid this situation altogether by changing their implementation (for example to a more dynamic UPDATE statement like you suggested).

> If you're updating what are perforce small batches of records in the
> UI, there's excellent reason to pull only those batches, mark them as
> being "in process," process them, then update the marked ones as
> "done" or whatever other states they can get to.
>
> As to "crazy complicated calculations," this is what active databases
> are all about. SQL is Turing complete, so you really can do it.

Of course all the things we do *could* be done in SQL itself, which would be the best solution, but there is a huge legacy code base in 4GL that one cannot automatically translate into semantically equivalent SQL statements. During such a loop, user input can also be requested, for example.

> Would you want something that compiles from the user inputs to SQL?
> Might that have a more general utility?

Well, like I said, a 4GL-to-SQL conversion would be desirable but would require a lot of effort. Thus one wants to mix the languages: currently one loops in 4GL, holding a SQL cursor/portal, and does some work (which might include SQL update statements). One could also imagine looping in SQL and calling the 4GL runtime for each result row to do the computation. I am not sure that is ideal if such an operation waits on user input. Also, one would need to analyze every loop looking for update statements and then automatically restructure them into update statements with dependencies.
> > The calculations inside the loop are written in some dynamic high-level
> > language and cannot easily be translated into SQL.
>
> Can you not simply create a second connection to perform the updates?

That would be possible, but I can see some problems:

loop
{ update table1;
  select something from table1;
  update table2 based on updated table1;
  commit;
}

If we do all the "update" statements in their own transaction, the select statements would not be able to see the changes. What we also tried was to give every loop its own connection, but that did not scale too well.
> > The calculations inside the loop are written in some dynamic high-level
> > language and cannot easily be translated into SQL.
>
> You don't really have to --- PG supports functions written in non-SQL
> languages. Not sure if your problem is big enough to justify developing
> a new PL interface for $random-4GL-language, but that might be something
> to consider.

Surely it would be a possibility to loop in SQL and call the other language for each row, but I am not sure Postgres would be too happy if the callee waits for user input and/or wants to do some SQL update operations.

> But, to get back to the original point: exactly what "sizeable performance
> impact of declaring a cursor "with hold"" do you think there is? It
> shouldn't be noticeably more expensive than just selecting the rows would
> be. Especially not for a number of rows that wouldn't make the
> above-depicted application structure completely untenable. And for sure
> I'm not understanding why you think that materializing the result on the
> client side instead would be better.

For small tables materialization is a non-issue, but we have large tables where a single select statement over all the rows causes Postgres to create 8 GB temp files and stay busy for multiple seconds, which is very noticeable.

> > Naturally I am now wondering why the postgres cursor/portal is not also
> > employing the same trick (at least as an option): Postpone
> > materialization of "with hold" cursors until it is required (like a
> > commit operation is dispatched).
>
> We already do that, and have done it since the feature was invented,
> AFAIR.

When we declare a cursor for a select on the mentioned big table, it takes multiple seconds and a big temp file is created, which to me seems like the materialization took place immediately.
> FWIW, the primary reason for materializing the cursor contents at commit,
> rather than just holding onto the active query as you seem to think would
> be better, is that materializing allows us to release locks on the
> underlying table(s). If we kept the active query we'd have to keep those
> locks, as well as the query's active snapshot, thus certainly blocking
> VACUUM cleanup, and possibly blocking subsequent DDL.

Of course, providing such a long-lived cursor would be more costly in terms of the resources you describe, but currently the cost of instant materialization at the opening of the cursor is higher still.

> The approach of using a separate connection to read the cursor suffers
> from exactly those same problems. Postgres isn't that happy with very
> long-lived transactions (neither is any other RDBMS I'm familiar with).
> So really I think that materializing the cursor right away and then
> doing your application calculations in whatever chunk size seems
> convenient is probably your best bet here.

In fact we are fetching cursor results in bigger chunks to avoid communication overhead, and many cursors reach end of scan immediately, but there are cursors on big tables, and fetching all their records immediately is very expensive. Meanwhile I have implemented a "lazy hold" in my database abstraction layer which pulls the records in at commit, which is an okayish trade-off. Better yet, but with the disadvantages you outlined, would be a "with hold" cursor that would hold onto the locks and the snapshot, which would avoid materialization entirely (and which can be emulated with a second database connection).
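The "lazy hold" strategy described here can be pictured as the abstraction layer issuing something like the following on the application's behalf; the cursor name, query, and fetch size are illustrative:

    BEGIN;
    DECLARE cur NO SCROLL CURSOR FOR   -- note: no WITH HOLD
        SELECT id FROM table1;
    FETCH 1000 FROM cur;               -- cheap chunked fetching, no copy made
    -- The application now asks the layer to commit. Before passing the
    -- commit on, the layer drains the cursor into client-side memory:
    FETCH ALL FROM cur;
    CLOSE cur;
    COMMIT;                            -- remaining rows now live in the client

Conflict-free loops that never hit a commit pay nothing; only loops that do commit pay the one-time drain, and only for the rows not yet consumed.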
> When we declare a cursor for a select on the mentioned big table, it takes
> multiple seconds and a big temp file is created which to me seems like the
> materialization took place immediately.

Since you mentioned that Postgres already postponed materialization until commit, we checked again, and you were right. When we checked manually, we had executed the "declare" statement without opening a transaction block first, which causes instant materialization. When a transaction is opened, materialization is in fact postponed. Unfortunately we do not use cursors but portals, so we can (re-)use prepared statements in multiple instances, and there is no "hold" feature for portals, so we cannot benefit from the lazy "with hold" of the cursor.
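The difference described above can be reproduced in psql (big_table is a placeholder):

    -- Outside a transaction block: the implicit single-statement transaction
    -- commits at once, so the WITH HOLD cursor is materialized immediately
    -- (slow, large temp file for a big result set).
    DECLARE c1 CURSOR WITH HOLD FOR SELECT * FROM big_table;

    -- Inside an explicit transaction block: declaring is cheap; the copy
    -- is only made when the transaction commits.
    BEGIN;
    DECLARE c2 CURSOR WITH HOLD FOR SELECT * FROM big_table;
    FETCH 100 FROM c2;  -- served by the live query, no materialization yet
    COMMIT;             -- remaining result set is materialized here
    FETCH 100 FROM c2;  -- served from the held store
    CLOSE c2;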