Обсуждение: Help with bulk read performance

От:
Dan Schaffer
Дата:

Hello

We have an application that needs to do bulk reads of ENTIRE Postgres tables very quickly (i.e. select * from table).
Wehave  
observed that such sequential scans run two orders of magnitude slower than observed raw disk reads (5 MB/s versus 100
MB/s). Part  
of this is due to the storage overhead we have observed in Postgres.  In the example below, it takes 1 GB to store 350
MBof nominal  
data.  However that suggests we would expect to get 35 MB/s bulk read rates.

Observations using iostat and top during these bulk reads suggest that the queries are CPU bound, not I/O bound.  In
fact,repeating  
the queries yields similar response times.  Presumably if it were an I/O issue the response times would be much shorter
thesecond  
time through with the benefit of caching.

We have tried these simple queries using psql, JDBC, pl/java stored procedures, and libpq.  In all cases the client
coderan on the  
same box as the server.
We have experimented with Postgres 8.1, 8.3 and 9.0.

We also tried playing around with some of the server tuning parameters such as shared_buffers to no avail.

Here is uname -a for a machine we have tested on:

Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

A sample dataset that reproduces these results looks like the following (there are no indexes):

Table "bulk_performance.counts"
  Column |  Type   | Modifiers
--------+---------+-----------
  i1     | integer |
  i2     | integer |
  i3     | integer |
  i4     | integer |

There are 22 million rows in this case.

We HAVE observed that summation queries run considerably faster.  In this case,

select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts

runs at 35 MB/s.


Our business logic does operations on the resulting data such that the output is several orders of magnitude smaller
thanthe input.  
  So we had hoped that by putting our business logic into stored procedures (and thus drastically reducing the amount
ofdata  
flowing to the client) our throughput would go way up.  This did not happen.

So our questions are as follows:

Is there any way using stored procedures (maybe C code that calls SPI directly) or some other approach to get close to
theexpected  
35 MB/s doing these bulk reads?  Or is this the price we have to pay for using SQL instead of some NoSQL solution.  (We
actually 
tried Tokyo Cabinet and found it to perform quite well. However it does not measure up to Postgres in terms of
replication,data  
interrogation, community support, acceptance, etc).

Thanks

Dan Schaffer
Paul Hamer
Nick Matheson

От:
Jim Nasby
Дата:

On Nov 1, 2010, at 9:15 AM, Dan Schaffer wrote:
> We have an application that needs to do bulk reads of ENTIRE Postgres tables very quickly (i.e. select * from table).
We have observed that such sequential scans run two orders of magnitude slower than observed raw disk reads (5 MB/s
versus100 MB/s).  Part of this is due to the storage overhead we have observed in Postgres.  In the example below, it
takes1 GB to store 350 MB of nominal data.  However that suggests we would expect to get 35 MB/s bulk read rates. 
>
> Observations using iostat and top during these bulk reads suggest that the queries are CPU bound, not I/O bound.  In
fact,repeating the queries yields similar response times.  Presumably if it were an I/O issue the response times would
bemuch shorter the second time through with the benefit of caching. 
>
> We have tried these simple queries using psql, JDBC, pl/java stored procedures, and libpq.  In all cases the client
coderan on the same box as the server. 
> We have experimented with Postgres 8.1, 8.3 and 9.0.
>
> We also tried playing around with some of the server tuning parameters such as shared_buffers to no avail.
>
> Here is uname -a for a machine we have tested on:
>
> Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> A sample dataset that reproduces these results looks like the following (there are no indexes):
>
> Table "bulk_performance.counts"
> Column |  Type   | Modifiers
> --------+---------+-----------
> i1     | integer |
> i2     | integer |
> i3     | integer |
> i4     | integer |
>
> There are 22 million rows in this case.
>
> We HAVE observed that summation queries run considerably faster.  In this case,
>
> select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts
>
> runs at 35 MB/s.
>
>
> Our business logic does operations on the resulting data such that the output is several orders of magnitude smaller
thanthe input.  So we had hoped that by putting our business logic into stored procedures (and thus drastically
reducingthe amount of data flowing to the client) our throughput would go way up.  This did not happen. 
>
> So our questions are as follows:
>
> Is there any way using stored procedures (maybe C code that calls SPI directly) or some other approach to get close
tothe expected 35 MB/s doing these bulk reads?  Or is this the price we have to pay for using SQL instead of some NoSQL
solution. (We actually tried Tokyo Cabinet and found it to perform quite well. However it does not measure up to
Postgresin terms of replication, data interrogation, community support, acceptance, etc). 

Have you by chance tried EXPLAIN ANALYZE SELECT * FROM bulk_performance.counts? That will throw away the query results,
whichremoves client-server considerations. 

Also, when you tested raw disk IO, did you do it with an 8k block size? That's the default size of a Postgres block, so
allof it's IO is done that way. 

What does iostat show you? Are you getting a decent number of read requests/second?
--
Jim C. Nasby, Database Architect                   
512.569.9461 (cell)                         http://jim.nasby.net



От:
Andy Colson
Дата:

On 11/1/2010 9:15 AM, Dan Schaffer wrote:
> Hello
>
> We have an application that needs to do bulk reads of ENTIRE Postgres
> tables very quickly (i.e. select * from table). We have observed that
> such sequential scans run two orders of magnitude slower than observed
> raw disk reads (5 MB/s versus 100 MB/s). Part of this is due to the
> storage overhead we have observed in Postgres. In the example below, it
> takes 1 GB to store 350 MB of nominal data. However that suggests we
> would expect to get 35 MB/s bulk read rates.
>
> Observations using iostat and top during these bulk reads suggest that
> the queries are CPU bound, not I/O bound. In fact, repeating the queries
> yields similar response times. Presumably if it were an I/O issue the
> response times would be much shorter the second time through with the
> benefit of caching.
>
> We have tried these simple queries using psql, JDBC, pl/java stored
> procedures, and libpq. In all cases the client code ran on the same box
> as the server.
> We have experimented with Postgres 8.1, 8.3 and 9.0.
>
> We also tried playing around with some of the server tuning parameters
> such as shared_buffers to no avail.
>
> Here is uname -a for a machine we have tested on:
>
> Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20
> 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> A sample dataset that reproduces these results looks like the following
> (there are no indexes):
>
> Table "bulk_performance.counts"
> Column | Type | Modifiers
> --------+---------+-----------
> i1 | integer |
> i2 | integer |
> i3 | integer |
> i4 | integer |
>
> There are 22 million rows in this case.
>
> We HAVE observed that summation queries run considerably faster. In this
> case,
>
> select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts
>
> runs at 35 MB/s.
>
>
> Our business logic does operations on the resulting data such that the
> output is several orders of magnitude smaller than the input. So we had
> hoped that by putting our business logic into stored procedures (and
> thus drastically reducing the amount of data flowing to the client) our
> throughput would go way up. This did not happen.
>
> So our questions are as follows:
>
> Is there any way using stored procedures (maybe C code that calls SPI
> directly) or some other approach to get close to the expected 35 MB/s
> doing these bulk reads? Or is this the price we have to pay for using
> SQL instead of some NoSQL solution. (We actually tried Tokyo Cabinet and
> found it to perform quite well. However it does not measure up to
> Postgres in terms of replication, data interrogation, community support,
> acceptance, etc).
>
> Thanks
>
> Dan Schaffer
> Paul Hamer
> Nick Matheson
>
>
>
>

Whoa... Deja Vu

Is this the same thing Nick is working on?  How'd he get along?

http://archives.postgresql.org/message-id/


-Andy

От:
Jim Nasby
Дата:

On Dec 14, 2010, at 9:27 AM, Andy Colson wrote:
> Is this the same thing Nick is working on?  How'd he get along?
>
> http://archives.postgresql.org/message-id/

So it is. The one I replied to stood out because no one had replied to it; I didn't see the earlier email.
--
Jim C. Nasby, Database Architect                   
512.569.9461 (cell)                         http://jim.nasby.net



От:
Andy Colson
Дата:

On 12/14/2010 9:41 AM, Jim Nasby wrote:
> On Dec 14, 2010, at 9:27 AM, Andy Colson wrote:
>> Is this the same thing Nick is working on?  How'd he get along?
>>
>> http://archives.postgresql.org/message-id/
>
> So it is. The one I replied to stood out because no one had replied to it; I didn't see the earlier email.
> --
> Jim C. Nasby, Database Architect                   
> 512.569.9461 (cell)                         http://jim.nasby.net
>
>
>

Oh.. I didn't even notice the date... I thought it was a new post.

But still... (and I'll cc Nick on this)  I'd love to hear an update on
how this worked out.

Did you get it to go fast?  What'd you use?  Did the project go over
budget and did you all get fired?  COME ON MAN!  We need to know! :-)

-Andy

От:
Jim Nasby
Дата:

BTW, have you tried prepared statements? bytea is most likely faster (in part) due to less parsing in the backend.
Preparedstatements would eliminate that parsing step. 

On Dec 14, 2010, at 10:07 AM, Nick Matheson wrote:

> Hey all-
>
> Glad to know you are still interested... ;)
>
> Didn't mean to leave you hanging, the holiday and all have put some bumps in the road.
>
> Dan my co-worker might be able to post some more detailed information here, but here is a brief summary of what I am
awareof: 
>
> 1. We have not tested any stored procedure/SPI based solutions to date.
> 2. The COPY API has been the best of the possible solutions explored to date.
> 3. We were able to get rates on the order of 35 MB/s with the original problem this way.
> 4. Another variant of the problem we were working on included some metadata fields and 300 float values (for this we
triedthree variants) 
>   a. 300 float values as columns
>   b. 300 float in a float array column
>   c. 300 floats packed into a bytea column
> Long story short on these three variants a and b largely performed the same. C was the winner and seems to have
improvedthe throughput on multiple counts. 1. it reduces the data transmitted over the wire by a factor of two (float
columnsand float arrays have a 2x overhead over the raw data requirement.) 2. this reduction seems to have reduced the
cpuburdens on the server side thus producing a better than the expected 2x speed. I think the final numbers left us
somewherein the 80-90 MB/s. 
>
> Thanks again for all the input. If you have any other questions let us know. Also if we get results for the stored
procedure/SPIroute we will try and post, but the improvements via standard JDBC are such that we aren't really pressed
atthis point in time to get more throughput so it may not happen. 
>
> Cheers,
>
> Nick
>> On 12/14/2010 9:41 AM, Jim Nasby wrote:
>>> On Dec 14, 2010, at 9:27 AM, Andy Colson wrote:
>>>> Is this the same thing Nick is working on?  How'd he get along?
>>>>
>>>> http://archives.postgresql.org/message-id/
>>>
>>> So it is. The one I replied to stood out because no one had replied to it; I didn't see the earlier email.
>>> --
>>> Jim C. Nasby, Database Architect                   
>>> 512.569.9461 (cell)                         http://jim.nasby.net
>>>
>>>
>>>
>>
>> Oh.. I didn't even notice the date... I thought it was a new post.
>>
>> But still... (and I'll cc Nick on this)  I'd love to hear an update on how this worked out.
>>
>> Did you get it to go fast?  What'd you use?  Did the project go over budget and did you all get fired?  COME ON MAN!
We need to know! :-) 
>>
>> -Andy
>

--
Jim C. Nasby, Database Architect                   
512.569.9461 (cell)                         http://jim.nasby.net



От:
Nick Matheson
Дата:

Hey all-

Glad to know you are still interested... ;)

Didn't mean to leave you hanging, the holiday and all have put some
bumps in the road.

Dan my co-worker might be able to post some more detailed information
here, but here is a brief summary of what I am aware of:

1. We have not tested any stored procedure/SPI based solutions to date.
2. The COPY API has been the best of the possible solutions explored to
date.
3. We were able to get rates on the order of 35 MB/s with the original
problem this way.
4. Another variant of the problem we were working on included some
metadata fields and 300 float values (for this we tried three variants)
    a. 300 float values as columns
    b. 300 float in a float array column
    c. 300 floats packed into a bytea column
Long story short on these three variants a and b largely performed the
same. C was the winner and seems to have improved the throughput on
multiple counts. 1. it reduces the data transmitted over the wire by a
factor of two (float columns and float arrays have a 2x overhead over
the raw data requirement.) 2. this reduction seems to have reduced the
cpu burdens on the server side thus producing a better than the expected
2x speed. I think the final numbers left us somewhere in the 80-90 MB/s.

Thanks again for all the input. If you have any other questions let us
know. Also if we get results for the stored procedure/SPI route we will
try and post, but the improvements via standard JDBC are such that we
aren't really pressed at this point in time to get more throughput so it
may not happen.

Cheers,

Nick
> On 12/14/2010 9:41 AM, Jim Nasby wrote:
>> On Dec 14, 2010, at 9:27 AM, Andy Colson wrote:
>>> Is this the same thing Nick is working on?  How'd he get along?
>>>
>>> http://archives.postgresql.org/message-id/
>>
>> So it is. The one I replied to stood out because no one had replied
>> to it; I didn't see the earlier email.
>> --
>> Jim C. Nasby, Database Architect                   
>> 512.569.9461 (cell)                         http://jim.nasby.net
>>
>>
>>
>
> Oh.. I didn't even notice the date... I thought it was a new post.
>
> But still... (and I'll cc Nick on this)  I'd love to hear an update on
> how this worked out.
>
> Did you get it to go fast?  What'd you use?  Did the project go over
> budget and did you all get fired?  COME ON MAN!  We need to know! :-)
>
> -Andy


От:
Dan Schaffer
Дата:

Hi,
My name is Dan and I'm a co-worker of Nick Matheson who initially submitted this question (because the mail group had
meblacklisted  
for awhile for some reason).


Thank you for all of the suggestions.  We were able to improve out bulk read performance from 3 MB/s to 60 MB/s
(assumingthe data  
are NOT in cache in both cases) by doing the following:

1. Storing the data in a "bytea" column instead of an "array" column.
2. Retrieving the data via the Postgres 9 CopyManager#copyOut(String sql, OutputStream stream)  method

The key to the dramatic improvement appears to be the reduction in packing and unpacking time on the server and client,

respectively.  The server packing occurs when the retrieved data are packed into a bytestream for sending across the
network. 
Storing the data as a simple byte array reduces this time substantially.  The client-side unpacking time is spent
generatinga  
ResultSet object. By unpacking the bytestream into the desired arrays of floats by hand instead, this time became close
tonegligible. 

The only downside of storing the data in byte arrays is the loss of transparency.  That is, a simple "select *" of a
fewrows shows  
bytes instead of floats.  We hope to mitigate this by writing a simple stored procedures that unpacks the bytes into
floats.

A couple of other results:

If the data are stored as a byte array but retrieve into a ResultSet, the unpacking time goes up by an order of
magnitudeand the  
observed total throughput is 25 MB/s.  If the data are stored in a Postgres float array and unpacked into a byte
stream,the  
observed throughput is 20 MB/s.

Dan (and Nick)

Andy Colson wrote:
> On 12/14/2010 9:41 AM, Jim Nasby wrote:
>> On Dec 14, 2010, at 9:27 AM, Andy Colson wrote:
>>> Is this the same thing Nick is working on?  How'd he get along?
>>>
>>> http://archives.postgresql.org/message-id/
>>
>> So it is. The one I replied to stood out because no one had replied to
>> it; I didn't see the earlier email.
>> --
>> Jim C. Nasby, Database Architect                   
>> 512.569.9461 (cell)                         http://jim.nasby.net
>>
>>
>>
>
> Oh.. I didn't even notice the date... I thought it was a new post.
>
> But still... (and I'll cc Nick on this)  I'd love to hear an update on
> how this worked out.
>
> Did you get it to go fast?  What'd you use?  Did the project go over
> budget and did you all get fired?  COME ON MAN!  We need to know! :-)
>
> -Andy
>

От:
Krzysztof Nienartowicz
Дата:

Hello Daniel,
We have the same scenario for the native Java arrays, so we are storing bytea and doing conversion at the client side,
butfor the server side SQL,  plJava comes very handy: 

No sure how you want to create stored procedures to convert internally but this is how we do this:

One has to define conversion routines in Java then deploy them to plJava. Scanning though this field would be still CPU
bound,around 2x slower than with native arrays and 6x slower than with blobs,  but at least one has this ability. It's
evenpossible to pass them to plR to do some statistical processing directly, so depending on the operations you do it
maybe still cheaper then streaming out  over the wire to the regular JDBC client. 

1. deploy class like this within plJava (null handling left out for brevity)

import java.io.File;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class myArrayConversion
{

    public myArrayConversion() {}

    /** Serialize double array to blob */
    public static byte[] convertDoubleArrayToBytea(double[] obj) throws IOException {
     ByteArrayOutputStream baos =  new ByteArrayOutputStream();
     ObjectOutputStream oos = new ObjectOutputStream(baos);
     oos.writeObject(obj);
     return baos.toByteArray();
    }
    /** Serialize int array to blob */
    public static byte[] convertIntToBytea(int[] obj) throws IOException {
         ByteArrayOutputStream baos =  new ByteArrayOutputStream();
         ObjectOutputStream oos = new ObjectOutputStream(baos);
         oos.writeObject(obj);
         return baos.toByteArray();
    }

    /** Deserialize blob to double array */
    public static double[] convertToDoubleArray(byte[] obj) throws IOException,
                                                             ClassNotFoundException {
        // Deserialize from a byte array
      ObjectInputStream ios = new ObjectInputStream(new ByteArrayInputStream(obj));
      return (double[])ios.readObject();
    }

    /** Deserialize blob to it array */
    public static int[] convertIntToArray(byte[] obj) throws IOException,
                                                             ClassNotFoundException {
        // Deserialize from a byte array
               ObjectInputStream ios = new ObjectInputStream(new ByteArrayInputStream(obj));
               return (int[])ios.readObject();
    }


// other types arrays streaming...
//...
}

2. then create a mapping functions as a db owner:

<sql>
CREATE OR REPLACE FUNCTION public.convertDoubleArrayToBytea(double precision[])
  RETURNS bytea AS
    'mappingPkg.convertDoubleArrayToBytea(double[])'
  LANGUAGE 'javau' IMMUTABLE
  COST 50;

GRANT EXECUTE ON FUNCTION public.convertDoubleArrayToBytea(double precision[]) TO public;



CREATE OR REPLACE FUNCTION public.convertToDoubleArray(bytea)
  RETURNS double precision[] AS
    'mappingPkg.convertToDoubleArray(byte[])'
  LANGUAGE 'javau' IMMUTABLE
  COST 50;

GRANT EXECUTE ON FUNCTION public.convertToDoubleArray(bytea) TO public;
</sql>


then you can have conversion either way:

select convertToDoubleArray(convertDoubleArrayToBytea(array[i::float8,1.1,100.1,i*0.1]::float8[])) from
generate_series(1,100)i; 

so you'd be also able to create bytea objects from native SQL arrays within SQL.

PLJava seems to be enjoying revival last days thanks to Johann 'Myrkraverk' Oskarsson who fixed several long-standing
bugs.Check out the plJava list for details. 


    Krzysztof



On Dec 16, 2010, at 10:22 AM,  wrote:

> From: Dan Schaffer <>
> Date: December 15, 2010 9:15:14 PM GMT+01:00
> To: Andy Colson <>
> Cc: Jim Nasby <>, , Nick Matheson <>
> Subject: Re: Help with bulk read performance
> Reply-To: 
>
>
> Hi,
> My name is Dan and I'm a co-worker of Nick Matheson who initially submitted this question (because the mail group had
meblacklisted for awhile for some reason). 
>
>
> Thank you for all of the suggestions.  We were able to improve out bulk read performance from 3 MB/s to 60 MB/s
(assumingthe data are NOT in cache in both cases) by doing the following: 
>
> 1. Storing the data in a "bytea" column instead of an "array" column.
> 2. Retrieving the data via the Postgres 9 CopyManager#copyOut(String sql, OutputStream stream)  method
>
> The key to the dramatic improvement appears to be the reduction in packing and unpacking time on the server and
client,respectively.  The server packing occurs when the retrieved data are packed into a bytestream for sending across
thenetwork. Storing the data as a simple byte array reduces this time substantially.  The client-side unpacking time is
spentgenerating a ResultSet object. By unpacking the bytestream into the desired arrays of floats by hand instead, this
timebecame close to negligible. 
>
> The only downside of storing the data in byte arrays is the loss of transparency.  That is, a simple "select *" of a
fewrows shows bytes instead of floats.  We hope to mitigate this by writing a simple stored procedures that unpacks the
bytesinto floats. 
>
> A couple of other results:
>
> If the data are stored as a byte array but retrieve into a ResultSet, the unpacking time goes up by an order of
magnitudeand the observed total throughput is 25 MB/s.  If the data are stored in a Postgres float array and unpacked
intoa byte stream, the observed throughput is 20 MB/s. 
>
> Dan (and Nick)


От:
"Pierre C"
Дата:

> If the data are stored as a byte array but retrieve into a ResultSet,
> the unpacking time goes up by an order of magnitude and the
> observed total throughput is 25 MB/s.  If the data are stored in a
> Postgres float array and unpacked into a byte stream, the
> observed throughput is 20 MB/s.


float <-> text conversions are very slow, this is in fact due to the
mismatch between base-2 (IEEE754) and base-10 (text) floating point
representation, which needs very very complex calculations.

От:
Nick Matheson
Дата:

Pierre-

I agree with your observation of float <-> text conversion costs, but in
this case Dan is talking about storing the raw float data (ie: 4 bytes
per float) in a bytea array so there is only the conversion from java
float[n] to java byte[4*n] which is not nearly as costly as float <->
text conversion (especially if you leave it in architecture byte order).

Nick
>
>> If the data are stored as a byte array but retrieve into a ResultSet,
>> the unpacking time goes up by an order of magnitude and the
>> observed total throughput is 25 MB/s.  If the data are stored in a
>> Postgres float array and unpacked into a byte stream, the
>> observed throughput is 20 MB/s.
>
>
> float <-> text conversions are very slow, this is in fact due to the
> mismatch between base-2 (IEEE754) and base-10 (text) floating point
> representation, which needs very very complex calculations.