Discussion: Streaming ResultSet rows instead of buffering them in memory
I have now finished the PR https://github.com/pgjdbc/pgjdbc/pull/1735 and addressed all the comments from the earlier discussions.
Would you be open to merging this at some point? At least a new review would be appreciated.
-Mikko
From: Mikko Tiihonen <mikko.tiihonen@nitor.com>
Sent: 15 March 2020 14:57
To: pgsql-jdbc@lists.postgresql.org <pgsql-jdbc@lists.postgresql.org>
Subject: Streaming ResultSet rows instead of buffering them in memory
Hi,
I'd like to propose that pgjdbc be modified to avoid buffering ResultSets in memory by default, whenever it is safe to do so.
I created a WIP PR that works for many basic scenarios https://github.com/pgjdbc/pgjdbc/pull/1735
I tried to make a very minimal change that should still be maintainable in the long term. Basically I just moved the protocol parsing variables into a state object and pass it along when processing the next row of the ResultSet. The effect on applications is that queries have an effective fetch size of 1, without any of the downsides of such a setting.
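To illustrate the idea of a state object advanced one row at a time (a toy sketch with a made-up row encoding, not the actual pgjdbc wire format or the PR's code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Toy sketch: the parser state lives in an object and each call decodes one
// row from the buffer on demand, instead of materializing all rows up front.
// The int32-length + UTF-8 row encoding here is invented for the demo.
public class StreamingParser {
    private final ByteBuffer buf;   // protocol parsing state lives here

    public StreamingParser(ByteBuffer buf) {
        this.buf = buf;
    }

    // Decode the next row on demand; null when the buffer is exhausted.
    public String nextRow() {
        if (!buf.hasRemaining()) {
            return null;
        }
        int len = buf.getInt();            // each row: int32 length + payload
        byte[] payload = new byte[len];
        buf.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    // Helper to build a buffer of length-prefixed rows for the demo.
    public static ByteBuffer encode(List<String> rows) {
        int size = 0;
        for (String r : rows) size += 4 + r.getBytes(StandardCharsets.UTF_8).length;
        ByteBuffer b = ByteBuffer.allocate(size);
        for (String r : rows) {
            byte[] p = r.getBytes(StandardCharsets.UTF_8);
            b.putInt(p.length).put(p);
        }
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        StreamingParser p = new StreamingParser(encode(List.of("row1", "row2")));
        // the application pulls rows one at a time; nothing else is buffered
        for (String row = p.nextRow(); row != null; row = p.nextRow()) {
            System.out.println(row);
        }
    }
}
```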
The motivation for the patch is to speed up queries and lower memory usage. The queries become faster indirectly because the query processing is parallelized. Currently the steps happen in sequence:
1) the backend returns results and the JDBC driver parses and buffers the result set (which can take a while)
2) the application consumes the result set
With this patch the steps occur concurrently:
1) the backend returns results
2) the application consumes the result set, with the JDBC driver parsing rows from the TCP buffer on demand
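The overlap of the two steps can be sketched with a producer thread standing in for the backend filling a small bounded buffer (the TCP buffer), while the application consumes concurrently. This is a minimal illustration of the pipelining, not driver code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal illustration of the pipelining: the "backend" produces rows into a
// small bounded buffer (standing in for the TCP buffer) while the application
// consumes them concurrently, instead of waiting for the full result set.
public class Pipelined {
    private static final String END = "\0END";   // sentinel marking end of results

    public static List<String> run(int rows) throws InterruptedException {
        BlockingQueue<String> tcpBuffer = new ArrayBlockingQueue<>(4);

        Thread backend = new Thread(() -> {
            try {
                for (int i = 0; i < rows; i++) {
                    tcpBuffer.put("row" + i);    // blocks when the buffer is full
                }
                tcpBuffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        backend.start();

        // The application consumes rows as soon as they arrive.
        List<String> consumed = new ArrayList<>();
        for (String row = tcpBuffer.take(); !row.equals(END); row = tcpBuffer.take()) {
            consumed.add(row);
        }
        backend.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(10).size());
    }
}
```

Note that the bounded buffer also models the back-pressure mentioned below: if the consumer stalls, the producer blocks once the buffer is full.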
The cursors used by enabling a fetch size do a similar thing, but:
- they tell the backend to optimize the query plan for fetching only the first few rows instead of the full result set
- each batch of rows requires a round-trip between the backend and the application, during which neither does any useful work
- a small fetch size causes many round-trips, while a large fetch size uses more memory and increases latency
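The trade-off in the last point is simple arithmetic. Assuming one round-trip per fetched batch, for a fixed result size the round-trip count and the peak number of rows buffered in the driver both follow directly from the fetch size:

```java
// Back-of-the-envelope trade-off for cursor-based fetching: assuming one
// round-trip per batch, round-trips and peak driver-side buffering both
// follow from the fetch size.
public class FetchTradeoff {
    // Number of backend round-trips needed to fetch `rows` rows.
    static int roundTrips(int rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize;   // ceil(rows / fetchSize)
    }

    // Peak number of rows buffered in driver memory at once.
    static int peakBufferedRows(int rows, int fetchSize) {
        return Math.min(rows, fetchSize);
    }

    public static void main(String[] args) {
        int rows = 1_000_000;
        // small fetch size: many round-trips, little memory
        System.out.println(roundTrips(rows, 10) + " trips, "
                + peakBufferedRows(rows, 10) + " rows buffered");
        // large fetch size: few round-trips, lots of memory
        System.out.println(roundTrips(rows, 100_000) + " trips, "
                + peakBufferedRows(rows, 100_000) + " rows buffered");
    }
}
```

With streaming there is no such knob to tune: the effective fetch size is 1, but no extra round-trips are added.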
The downside is that if the application does not consume the result set fast enough, the backend cannot release its locks as quickly. But this only affects large result sets that do not fit entirely into the TCP buffers.
The WIP code sometimes fails if the ResultSet is not properly closed before the next statement is executed. For many tests this happens in the tearDown phase. I also think that things might currently not work correctly if the application triggers queries on the connection while processing the results, for example by requesting metadata. In that case we have no option other than to go back to buffering, to clear the connection for reuse.
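That fallback could look roughly like this (a toy sketch of the control flow only; the class and method names are hypothetical, not the PR's actual API): rows normally stream from the connection, but if the connection is needed for another query, the remaining rows are drained into memory first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Toy sketch of the fallback described above: rows normally stream from the
// connection, but if the connection must be reused (e.g. a metadata query),
// the rest of the pending rows are buffered in memory first.
// All names here are hypothetical illustrations.
public class StreamingRows implements Iterator<String> {
    private final Deque<String> pending;     // stands in for rows still on the wire
    private List<String> buffered;           // filled only when releasing the connection

    public StreamingRows(List<String> rowsOnWire) {
        this.pending = new ArrayDeque<>(rowsOnWire);
    }

    // Called before another statement may use the connection: buffer the rest.
    public void detachFromConnection() {
        buffered = new ArrayList<>(pending);
        pending.clear();
    }

    public boolean isStreaming() {
        return buffered == null;
    }

    @Override public boolean hasNext() {
        return buffered != null ? !buffered.isEmpty() : !pending.isEmpty();
    }

    @Override public String next() {
        return buffered != null ? buffered.remove(0) : pending.poll();
    }

    public static void main(String[] args) {
        StreamingRows rs = new StreamingRows(List.of("a", "b", "c"));
        System.out.println(rs.next());        // streamed directly from the wire
        rs.detachFromConnection();            // e.g. a metadata query arrives
        System.out.println(rs.isStreaming()); // now fully buffered
        while (rs.hasNext()) System.out.println(rs.next());
    }
}
```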
-Mikko
Hi Mikko,
Yes to both. Let me see what I can do.
Dave Cramer
www.postgres.rocks
On Sat, 9 Jan 2021 at 12:10, Mikko Tiihonen <mikko.tiihonen@nitor.com> wrote:
I have now finished the PR https://github.com/pgjdbc/pgjdbc/pull/1735 and addressed all the comments from the earlier discussions. Would you be open to merging this at some point? At least a new review would be appreciated.
-Mikko