Discussion: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization


[HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Leon Winter
Date:
Hi,

I originally brought up this issue on the pgsql-performance mailing list [^] to
no avail so I am trying again here.

During implementation of a runtime environment and the adjoining database
abstraction layer I noticed (like many before me [0] and as correctly mentioned
in the documentation) the sizeable performance impact of declaring a cursor
"with hold" for queries with large result sets.

Our use case very often looks like this:

open cursor for select from table1
loop
{ fetch some entries from cursor
  update table2
  commit
}

During iteration of the result set we commit changes to the database so we
must make sure to keep the cursor alive. One option is to use "with hold".
Unfortunately the result set is then instantly materialized, which is a huge
performance burden.
In our use case the "commit" of changes often does not affect the iteration set.
Also, the loop might be aborted before the result set has been fully read, so we
never needed the whole materialized set anyway.
To work around these problems, we already employ some static analysis to avoid
"with hold" in all situations where there are no commits during the lifetime of
the cursor or portal. For other cursors we use a different database connection
inside the same application, protecting those cursors from commit operations and
avoiding the costly copy operations they would incur if declared "with hold" on
the main database connection.
In an attempt to further minimize the performance impact I am thinking about
employing a lazy "with hold", where I would fetch all the remaining result rows
from a cursor or portal just before a commit statement. This way I could at
least have great performance in all conflict-free situations until arriving at
an impasse. Naturally I am now wondering why the Postgres cursor/portal does not
also employ the same trick (at least as an option): postpone materialization of
"with hold" cursors until it is actually required (e.g. when a commit operation
is dispatched).
Probably I am missing many (internal) aspects, but at that point it might be
possible to optimize further: when, for instance, no changes were made to the
result set of the "with hold" cursor, it need not be materialized at all. From
previous discussions [1] I heard that one can in fact accomplish that by using a
different database connection, which is one workaround we are using.
I am not sure whether this kind of workaround/optimization work should be done
in the database abstraction/interface layer or in the database itself. Since a
lot of people seem to run into this performance issue, many might profit from
some optimization magic in the database for such use cases. We are very invested
in this performance issue and are happy to help resolve it at either level.

Regards,
Leon

[^] https://www.postgresql.org/message-id/20171010122039.2xp4ipqokoke45zk%40bfw-online.de
[0] https://trac.osgeo.org/qgis/ticket/1175
    https://stackoverflow.com/questions/33635405/postgres-cursor-with-hold
    https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/
[1] https://bytes.com/topic/postgresql/answers/420717-cursors-transactions-why
    http://www.postgresql-archive.org/setFetchSize-td4935215.html


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
David Fetter
Date:
On Thu, Oct 19, 2017 at 03:20:48PM +0200, Leon Winter wrote:
> Hi,
> 
> I originally brought up this issue on the pgsql-performance mailing list [^] to
> no avail so I am trying again here.
> 
> During implementation of a runtime environment and the adjoining database
> abstraction layer I noticed (like many before me [0] and as correctly mentioned
> in the documentation) the sizeable performance impact of declaring a cursor
> "with hold" for queries with large result sets.
> 
> Our use case very often looks like this:
> 
> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   update table2
>   commit
> }

This seems like a very odd construct based on ideas about databases
that aren't actually true of PostgreSQL, e.g. that joins are
expensive, or that some substantial benefit comes of committing at
some higher frequency than the logical transaction.

What other things did you try, and how did they fail?  In particular,
what happened when you used
    UPDATE table2
    SET [things based on table1]
    FROM table1 [qualified] JOIN table2 ON ([conditions])

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Leon Winter
Date:
> What other things did you try, and how did they fail?  In particular,
> what happened when you used
> 
>     UPDATE table2
>     SET [things based on table1]
>     FROM table1 [qualified] JOIN table2 ON ([conditions])

well, it is not the ideal way of doing things, but then again this SQL is merely
a consequence of the legacy 4GL language and runtime environment we are running
(and trying to migrate to Postgres). We have a lot of those SQL structures and
would prefer not to change all of them to avoid this situation. Currently there
are also two database backends, one being the old legacy database and the other
being Postgres, and we are extremely limited by the (lacking) capabilities of
the old database. We are certainly planning to change many SQL statements to
make better use of the database, but not at this point and not for this issue.

The loops are more complex in reality of course, more like:

open cursor for select from table1
loop
{ fetch some entries from cursor
  call some external application
  do some crazy complicated calculations based on some user input in the UI *
  update table2
  commit
}

The calculations inside the loop are written in some dynamic high-level
language and cannot easily be translated into SQL.

tl;dr: Existing code base has a lot of these patterns. General solution on
database (interfacing) level is required.



Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
David Fetter
Date:
On Thu, Oct 19, 2017 at 04:06:47PM +0200, Leon Winter wrote:
> > What other things did you try, and how did they fail?  In particular,
> > what happened when you used
> > 
> >     UPDATE table2
> >     SET [things based on table1]
> >     FROM table1 [qualified] JOIN table2 ON ([conditions])
> 
> well, it is not the ideal way of doing things but then again this SQL is merely
> a consequence of the legacy 4GL language and runtime environment we are running
> (and trying to migrate to Postgres). We have a lot of those SQL structures and
> would prefer not to change all of them to avoid this situation. Currently there
> are also two database backends, one being the old legacy database and the other
> being Postgres and we are extremely limited from the (lacking) capabilities  of
> the old database. We are surely planning to change many SQL statements to make
> better use of the database but not at this point for this issue.
> 
> The loops are more complex in reality of course, more like:
> 
> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   call some external application
>   do some crazy complicated calculations based on some user input in the UI *
>   update table2
>   commit
> }
> 
> The calculations inside the loop are written in some dynamic
> high-level language and cannot easily be translated into SQL.
> 
> tl;dr: Existing code base has a lot of these patterns. General
> solution on database (interfacing) level is required.

I don't know quite how to put this, but it's not clear to me that the
difficulties in this situation are things PostgreSQL could resolve
even with much larger development resources than are currently
available.

If you're updating what are perforce small batches of records in the
UI, there's excellent reason to pull only those batches, mark them as
being "in process," process them, then update the marked ones as
"done" or whatever other states they can get to.

As to "crazy complicated calculations," this is what active databases
are all about.  SQL is Turing complete, so you really can do it.

Would you want something that compiles from the user inputs to SQL?
Might that have a more general utility?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Geoff Winkless
Date:
On 19 October 2017 at 15:06, Leon Winter <winter-pg@bfw-online.de> wrote:
> The calculations inside the loop are written in some dynamic high-level
> language and cannot easily be translated into SQL.

Can you not simply create a second connection to perform the updates?

Geoff

Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Tom Lane
Date:
Leon Winter <winter-pg@bfw-online.de> writes:
> The loops are more complex in reality of course, more like:

> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   call some external application
>   do some crazy complicated calculations based on some user input in the UI *
>   update table2
>   commit
> }

Hm, somehow it's pretty hard to credit that the materialized cursor
is the principal performance bottleneck in this configuration.

> The calculations inside the loop are written in some dynamic high-level
> language and cannot easily be translated into SQL.

You don't really have to --- PG supports functions written in non-SQL
languages.  Not sure if your problem is big enough to justify developing
a new PL interface for $random-4GL-language, but that might be something
to consider.

But, to get back to the original point: exactly what "sizeable performance
impact of declaring a cursor "with hold"" do you think there is?  It
shouldn't be noticeably more expensive than just selecting the rows would
be.  Especially not for a number of rows that wouldn't make the
above-depicted application structure completely untenable.  And for sure
I'm not understanding why you think that materializing the result on the
client side instead would be better.

> Naturally I am now wondering why the postgres cursor/portal is not also
> employing the same trick (at least as an option): Postpone
> materialization of "with hold" cursors until it is required (like a
> commit operation is dispatched).

We already do that, and have done it since the feature was invented,
AFAIR.

FWIW, the primary reason for materializing the cursor contents at commit,
rather than just holding onto the active query as you seem to think would
be better, is that materializing allows us to release locks on the
underlying table(s).  If we kept the active query we'd have to keep those
locks, as well as the query's active snapshot, thus certainly blocking
VACUUM cleanup, and possibly blocking subsequent DDL.

The approach of using a separate connection to read the cursor suffers
from exactly those same problems.  Postgres isn't that happy with very
long-lived transactions (neither is any other RDBMS I'm familiar with).
So really I think that materializing the cursor right away and then
doing your application calculations in whatever chunk size seems
convenient is probably your best bet here.
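This chunked-processing suggestion can be sketched in plain Python; chunked() is
a generic helper for the sake of illustration, not a driver API:

```python
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Materialize once (which is what WITH HOLD does at commit anyway), then
# walk the result in whatever chunk size the application loop finds
# convenient.
materialized = list(range(25))
sizes = [len(c) for c in chunked(materialized, 10)]
```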
        regards, tom lane



Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Leon Winter
Date:
> I don't know quite how to put this, but it's not clear to me that the
> difficulties in this situation are things PostgreSQL could resolve
> even with much larger development resources than are currently
> available.

There does not seem to exist a cursor/portal/pointer semantic that can survive
unrelated changes to the database inside a single connection (and is not super
expensive like With_Hold). To some extent a similar behavior can be simulated by
using a second connection.
I assume most people avoid this situation altogether by changing their
implementation (for example to a more dynamic UPDATE statement like the one you
suggested).

> If you're updating what are perforce small batches of records in the
> UI, there's excellent reason to pull only those batches, mark them as
> being "in process," process them, then update the marked ones as
> "done" or whatever other states they can get to.
> 
> As to "crazy complicated calculations," this is what active databases
> are all about.  SQL is Turing complete, so you really can do it.

Of course, everything we do *could* be done in SQL itself, which would be the
best solution, but there is a huge legacy code base in 4GL that one cannot
automatically translate into semantically equivalent SQL statements. During such
a loop, user input can also be requested, for example.

> Would you want something that compiles from the user inputs to SQL?
> Might that have a more general utility?

Well, like I said, a 4GL-to-SQL conversion would be desirable but would require
a lot of effort. Thus one wants to mix the languages: currently one loops in
4GL, holding an SQL cursor/portal, and does some work (which might include SQL
update statements). One could also imagine looping in SQL and calling the 4GL
runtime for each result row to do the computation, but I am not sure that is
ideal if such an operation waits on user input. Also, one would need to analyze
every loop looking for update statements and then automatically restructure them
into update statements with dependencies.



Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Leon Winter
Date:
> > The calculations inside the loop are written in some dynamic high-level
> > language and cannot easily be translated into SQL.
> >
> 
> Can you not simply create a second connection to perform the updates?

That would be possible, but I can see some problems:

loop
{ update table1
  select something from table1
  update table2 based on the updated table1
  commit
}

If we do all the "update" statements in their own transaction on a second
connection, the select statements would not be able to see those changes. We
also tried giving every loop its own connection, but that did not scale too
well.



Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Leon Winter
Date:
> > The calculations inside the loop are written in some dynamic high-level
> > language and cannot easily be translated into SQL.
> 
> You don't really have to --- PG supports functions written in non-SQL
> languages.  Not sure if your problem is big enough to justify developing
> a new PL interface for $random-4GL-language, but that might be something
> to consider.

Surely it would be a possibility to loop in SQL and call the other language for
each row, but I am not sure Postgres would be too happy if the callee waits for
user input and/or wants to do some SQL update operations.

> But, to get back to the original point: exactly what "sizeable performance
> impact of declaring a cursor "with hold"" do you think there is?  It
> shouldn't be noticeably more expensive than just selecting the rows would
> be.  Especially not for a number of rows that wouldn't make the
> above-depicted application structure completely untenable.  And for sure
> I'm not understanding why you think that materializing the result on the
> client side instead would be better.

For small tables materialization is a non-issue, but we have a large table where
a single select statement over all the rows causes Postgres to create 8 GB of
temp files and stay busy for multiple seconds, which is very noticeable.

> > Naturally I am now wondering why the postgres cursor/portal is not also
> > employing the same trick (at least as an option): Postpone
> > materialization of "with hold" cursors until it is required (like a
> > commit operation is dispatched).
> 
> We already do that, and have done it since the feature was invented,
> AFAIR.

When we declare a cursor for a select on the mentioned big table, it takes
multiple seconds and a big temp file is created, which to me seems like the
materialization takes place immediately.

> FWIW, the primary reason for materializing the cursor contents at commit,
> rather than just holding onto the active query as you seem to think would
> be better, is that materializing allows us to release locks on the
> underlying table(s).  If we kept the active query we'd have to keep those
> locks, as well as the query's active snapshot, thus certainly blocking
> VACUUM cleanup, and possibly blocking subsequent DDL.

Of course, providing such a long-lived cursor would be more costly in terms of
the resources you describe, but currently the instant materialization at cursor
open is even more expensive for us.

> The approach of using a separate connection to read the cursor suffers
> from exactly those same problems.  Postgres isn't that happy with very
> long-lived transactions (neither is any other RDBMS I'm familiar with).
> So really I think that materializing the cursor right away and then
> doing your application calculations in whatever chunk size seems
> convenient is probably your best bet here.

In fact we are already fetching cursor results in bigger chunks to avoid
communication overhead, and many cursors reach end of scan immediately, but
there are also cursors on big tables, and fetching all of their records up front
is very expensive.
Meanwhile I have implemented a "lazy hold" in my database abstraction layer
which pulls the records in at commit, which is an okay-ish trade-off. Better
yet, though with the disadvantages you outlined, would be a "with hold" cursor
that holds onto the locks and the snapshot, avoiding materialization entirely
(which can be emulated with a second database connection).
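One way such a "lazy hold" can be wired into an abstraction layer is for the
connection wrapper to remember its open cursors and drain them just before
forwarding the COMMIT. All class and method names here are illustrative, not an
actual driver interface:

```python
class LazyCursor:
    def __init__(self, rows):
        self._source = iter(rows)  # stands in for the server-side portal
        self._held = []            # client-side buffer, filled at commit

    def drain(self):
        self._held.extend(self._source)

    def fetchmany(self, n):
        out = []
        while len(out) < n:
            if self._held:
                out.append(self._held.pop(0))
            else:
                try:
                    out.append(next(self._source))
                except StopIteration:
                    break
        return out

class LazyHoldConnection:
    def __init__(self):
        self._cursors = []

    def open_cursor(self, rows):
        cur = LazyCursor(rows)
        self._cursors.append(cur)
        return cur

    def commit(self):
        # Materialize every open cursor client-side; only then would the
        # real COMMIT be forwarded to the database, so no rows are lost.
        for cur in self._cursors:
            cur.drain()

conn = LazyHoldConnection()
cur = conn.open_cursor(range(6))
head = cur.fetchmany(2)  # lazy while no commit has happened
conn.commit()            # remaining four rows pulled client-side
tail = cur.fetchmany(10)
```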



Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

From:
Leon Winter
Date:
> When we declare a cursor for a select on the mentioned big table, it takes
> multiple seconds and a big temp file is created which to me seems like the
> materialization took place immediately.

Since you mentioned that Postgres already postpones materialization until the
commit operation, we checked again and you were right. When we checked manually,
we had executed the "declare" statement without opening a transaction block
first, which causes instant materialization (the implicit single-statement
transaction commits immediately). Inside an explicit transaction block,
materialization is in fact postponed.
Unfortunately we do not use cursors but portals, so that we can (re-)use
prepared statements in multiple instances, and there is no "hold" feature for
portals, so we cannot benefit from the lazy "with hold" of the cursor.

