Обсуждение: Streaming ResultSet

Поиск
Список
Период
Сортировка

Streaming ResultSet

От
"Kevin Grittner"
Дата:
Hello all.

Within the PostgreSQL JDBC driver a ResultSet is, by default, streamed
back to the client as the rows are generated by the server, and are
stored in a collection in RAM.  An attempt to get the first row from the
ResultSet blocks until every last row is stored in RAM.  If various
conditions are met, a cursor is used instead, which allows a large
result set to be processed without running out of RAM.

I'm curious about the reasons that this appoach was taken -- why not
return the rows as they are retrieved from the server?  I can think of
some possible explanations:

(1)  The time locks are held on the server is minimized by the default
behavior.

(2)  If an exception occurs, you do not receive that data up to the
point of the problem.  (For example, divide by zero, serialization
(deadlock) problems, or broken connections.)

(3)  The connection can be used for other purposes without conflict.
(Other drivers I've used, which normall stream the results fall back on
reading the whole result set into RAM when a statement is issued which
cannot be run while the result set is pending.)


The disadvantages of the current default approach are:

(1)  The entire result set might not fit in the client RAM.

(2)  You might want to see partial results from a query which ultimately
generates an exception.

(3)  You can't overlap retrieval of the result set with processing of
the rows.


The disadvantages of the cursored approach are:

(1)  It might not be easy to create all of the right conditions for the
cursored approach to work.

(2)  You might think you've met all the conditions for the cursored
approach and be wrong.  The default behavior might kick in and cause an
OutOfMemoryError.

(3)  There is overhead to creating a cursor for a select statement.
Granted, in my tests it was only about 20 microseconds, but that boosted
runtime for my simple test case by 3%.


If there was a connection property which allowed result sets to return
rows as they are retrieved, I would use it.  I'm not lobbying very hard
for it, however, since it is only 20 microseconds per SELECT statement
to use cursors, and I can easily modify my code to use them, although it
will have to be an "all or nothing" change in the framework, so the cost
is paid on every SELECT statement.  I'm just expressing interest, should
the feature be under consideration.

-Kevin



Re: Streaming ResultSet

От
Dave Cramer
Дата:
Kevin,

It's mostly because of the way the server returns result sets, and
also the
fact that java doesn't (didn't) support non-blocking I/O

Dave
On 19-Jul-05, at 4:22 PM, Kevin Grittner wrote:

> Hello all.
>
> Within the PostgreSQL JDBC driver a ResultSet is, by default, streamed
> back to the client as the rows are generated by the server, and are
> stored in a collection in RAM.  An attempt to get the first row
> from the
> ResultSet blocks until every last row is stored in RAM.  If various
> conditions are met, a cursor is used instead, which allows a large
> result set to be processed without running out of RAM.
>
> I'm curious about the reasons that this appoach was taken -- why not
> return the rows as they are retrieved from the server?  I can think of
> some possible explanations:
>
> (1)  The time locks are held on the server is minimized by the default
> behavior.
>
> (2)  If an exception occurs, you do not receive that data up to the
> point of the problem.  (For example, divide by zero, serialization
> (deadlock) problems, or broken connections.)
>
> (3)  The connection can be used for other purposes without conflict.
> (Other drivers I've used, which normall stream the results fall
> back on
> reading the whole result set into RAM when a statement is issued which
> cannot be run while the result set is pending.)
>
>
> The disadvantages of the current default approach are:
>
> (1)  The entire result set might not fit in the client RAM.
>
> (2)  You might want to see partial results from a query which
> ultimately
> generates an exception.
>
> (3)  You can't overlap retrieval of the result set with processing of
> the rows.
>
>
> The disadvantages of the cursored approach are:
>
> (1)  It might not be easy to create all of the right conditions for
> the
> cursored approach to work.
>
> (2)  You might think you've met all the conditions for the cursored
> approach and be wrong.  The default behavior might kick in and
> cause an
> OutOfMemoryError.
>
> (3)  There is overhead to creating a cursor for a select statement.
> Granted, in my tests it was only about 20 microseconds, but that
> boosted
> runtime for my simple test case by 3%.
>
>
> If there was a connection property which allowed result sets to return
> rows as they are retrieved, I would use it.  I'm not lobbying very
> hard
> for it, however, since it is only 20 microseconds per SELECT statement
> to use cursors, and I can easily modify my code to use them,
> although it
> will have to be an "all or nothing" change in the framework, so the
> cost
> is paid on every SELECT statement.  I'm just expressing interest,
> should
> the feature be under consideration.
>
> -Kevin
>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that
> your
>        message can get through to the mailing list cleanly
>
>


Re: Streaming ResultSet

От
Oliver Jowett
Дата:
Kevin Grittner wrote:

> I'm curious about the reasons that this appoach was taken -- why not
> return the rows as they are retrieved from the server?

The short answer is because to stream on the fly is a much more complex
approach and no-one has stepped forward to implement it.

> (1)  It might not be easy to create all of the right conditions for the
> cursored approach to work.

Another solution here is to work on making the cursor support code more
flexible. Again the main reason that this is not done is that no-one has
come up with a robust implementation yet.

> (2)  You might think you've met all the conditions for the cursored
> approach and be wrong.  The default behavior might kick in and cause an
> OutOfMemoryError.

The same is true with a streaming approach. The driver likes to issue
its own queries internally in many places; if you happen to trip up on
one of those while processing your resultset, boom!

Your best bet to see this implemented is to implement it yourself and
send us a patch. It's a fairly large chunk of work with debatable
benefit, and personally I'd rather spend the time on better cursor
support. (if I *had* time!)

-O

Re: Streaming ResultSet

От
Oliver Jowett
Дата:
Kevin Grittner wrote:

> (3)  There is overhead to creating a cursor for a select statement.
> Granted, in my tests it was only about 20 microseconds, but that boosted
> runtime for my simple test case by 3%.

This seems a bit high; what's your query and which driver version are
you using? The current drivers just name the result portal at the
protocol level (which is essentially a FETCH_FORWARD_ONLY cursor, but
not in so many words) and ask for N rows at a time; otherwise, all the
processing should be the same. I'm suprised that the act of naming the
portal and splitting the output into batches adds as much as 20ms to
query execution time..

-O

Re: Streaming ResultSet

От
"Kevin Grittner"
Дата:
20 microseconds (1/50th of a millisecond), not 20 milliseconds.  I could
run it a few more times to make sure that my results are statistically
significant, and not a run of bad luck for the cursored tests, but that
will have to wait for tomorrow.

-Kevin


>>> Oliver Jowett <oliver@opencloud.com> 07/19/05 6:12 PM >>>
Kevin Grittner wrote:

> Granted, in my tests it was only about 20 microseconds

I'm suprised that the act of naming the portal and splitting the output
into batches adds as much as 20ms

Re: Streaming ResultSet

От
Oliver Jowett
Дата:
Kevin Grittner wrote:
> 20 microseconds (1/50th of a millisecond), not 20 milliseconds.

Oh, sorry, misread your email. That sounds more reasonable :)

You are considering using resultset streaming in a query that takes only
~600us (from your 3% number)? How big is that resultset anyway?

-O

Re: Streaming ResultSet

От
Oliver Jowett
Дата:
Kevin Grittner wrote:

> The disadvantages of the current default approach are:

> (2)  You might want to see partial results from a query which ultimately
> generates an exception.

I don't see this is something that any sort of portable application is
ever going to need. It's very unpredictable where you get an error if an
error occurs, since it's so dependent on backend / driver / protocol
implementations.

> The disadvantages of the cursored approach are:

> (1)  It might not be easy to create all of the right conditions for the
> cursored approach to work.

It occurs to me that the only condition that streaming would mitigate is
having to have autocommit off.

You can't usefully use streaming with TYPE_SCROLLABLE_* resultsets as
even though you can stream results just fine while the resultset
current-row is moving strictly forward, you have to keep all previously
retrieved results in memory anyway in case the application asks to move
backwards (perfectly legal with a SCROLLABLE resultset). Keeping the
whole resultset in memory defeats the purpose of streaming in the first
place..

-O

Re: Streaming ResultSet

От
"Kevin Grittner"
Дата:
I'm considering using PostgreSQL in an environment where there is a
middle tier framework running the queries, which has no way of knowing
whether the query it is about to run will return one row or millions of
rows.

When it comes to returning millions of rows, I don't much care about
20us.  Some of the one row queries run in as little as 265us (and I'm
thinking of submitting changes which would bring that down), so 20us
might creep into the range of significance.  The test which showed a
20us difference were running across the LAN, while the production middle
tier will be running on the same box as the database; so, if much of the
20us is in network latency, it becomes much less significant.

-Kevin


>>> Oliver Jowett <oliver@opencloud.com> 07/19/05 6:38 PM >>>

You are considering using resultset streaming in a query that takes only
~600us (from your 3% number)? How big is that resultset anyway?


Re: [JDBC] Streaming ResultSet

От
albertwang
Дата:
Hi Kevin,

How is your question / suggestion going?

I met the same problem. Our company uses PostgreSQL as the OLTP system. I am
working on an ETL project which reads data from a PostgreSQL db, transforms
the data and then load the data into another PostgreSQL db. The ETL tool
uses JDBC.

My problem is that the ETL won't start to transform the data before the
whole data result set has been loaded into memory.

I would like to do the data transforming and loading while extracting data.

Do you know if there is any progress reagrding the streaming result set? As
I know, MySQL does support it.

Thank you.

Regards,
Albert



--
View this message in context: http://www.postgresql-archive.org/Streaming-ResultSet-tp2168704p5968275.html
Sent from the PostgreSQL - jdbc mailing list archive at Nabble.com.


Re: Streaming ResultSet

От
albertwang
Дата:
Hi Kevin,

How is your question / suggestion going?

I met the same problem. Our company uses PostgreSQL as the OLTP system. I am
working on an ETL project which reads data from a PostgreSQL db, transforms
the data and then load the data into another PostgreSQL db. The ETL tool
uses JDBC.

My problem is that the ETL won't start to transform the data before the
whole data result set has been loaded into memory.

I would like to do the data transforming and loading while extracting data.

Do you know if there is any progress reagrding the streaming result set? As
I know, MySQL does support it.

Thank you.

Regards,
Albert



--
View this message in context: http://www.postgresql-archive.org/Streaming-ResultSet-tp2168704p5968275.html
Sent from the PostgreSQL - jdbc mailing list archive at Nabble.com.


Re: Streaming ResultSet

От
Thomas Kellerer
Дата:
albertwang schrieb am 23.06.2017 um 02:08:
> I met the same problem. Our company uses PostgreSQL as the OLTP system. I am
> working on an ETL project which reads data from a PostgreSQL db, transforms
> the data and then load the data into another PostgreSQL db. The ETL tool
> uses JDBC.
>
> My problem is that the ETL won't start to transform the data before the
> whole data result set has been loaded into memory.
>
> I would like to do the data transforming and loading while extracting data.
>
> Do you know if there is any progress reagrding the streaming result set? As
> I know, MySQL does support it.

Not sure what you mean with "streaming resultset", but if you don't want the driver to buffer the complete result, then
thereis a solution to it:  
retrieving based on a cursor.

See the documentation for details:

https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor

Thomas




Re: [JDBC] Streaming ResultSet

От
Thomas Kellerer
Дата:
albertwang schrieb am 23.06.2017 um 02:08:
> I met the same problem. Our company uses PostgreSQL as the OLTP system. I am
> working on an ETL project which reads data from a PostgreSQL db, transforms
> the data and then load the data into another PostgreSQL db. The ETL tool
> uses JDBC.
>
> My problem is that the ETL won't start to transform the data before the
> whole data result set has been loaded into memory.
>
> I would like to do the data transforming and loading while extracting data.
>
> Do you know if there is any progress reagrding the streaming result set? As
> I know, MySQL does support it.

Not sure what you mean with "streaming resultset", but if you don't want the driver to buffer the complete result, then
thereis a solution to it:  
retrieving based on a cursor.

See the documentation for details:

https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor

Thomas




Re: [JDBC] Streaming ResultSet

От
danap
Дата:
albertwang wrote:
> Hi Kevin,
>
> How is your question / suggestion going?
>
> I met the same problem. Our company uses PostgreSQL as the OLTP system. I am
> working on an ETL project which reads data from a PostgreSQL db, transforms
> the data and then load the data into another PostgreSQL db. The ETL tool
> uses JDBC.
>
> My problem is that the ETL won't start to transform the data before the
> whole data result set has been loaded into memory.
>
> I would like to do the data transforming and loading while extracting data.

How about instead of using the JDBC intermediary, extract the data into a local
file/memory db, then transform and load from it to the destination PostgreSQL
db. Set up three seperate parallel processes to perform the task.

danap.

>
> Do you know if there is any progress reagrding the streaming result set? As
> I know, MySQL does support it.
>
> Thank you.
>
> Regards,
> Albert




Re: Streaming ResultSet

От
danap
Дата:
albertwang wrote:
> Hi Kevin,
>
> How is your question / suggestion going?
>
> I met the same problem. Our company uses PostgreSQL as the OLTP system. I am
> working on an ETL project which reads data from a PostgreSQL db, transforms
> the data and then load the data into another PostgreSQL db. The ETL tool
> uses JDBC.
>
> My problem is that the ETL won't start to transform the data before the
> whole data result set has been loaded into memory.
>
> I would like to do the data transforming and loading while extracting data.

How about instead of using the JDBC intermediary, extract the data into a local
file/memory db, then transform and load from it to the destination PostgreSQL
db. Set up three seperate parallel processes to perform the task.

danap.

>
> Do you know if there is any progress reagrding the streaming result set? As
> I know, MySQL does support it.
>
> Thank you.
>
> Regards,
> Albert