On 11/03/2010 04:52 PM, Nick Matheson wrote:
> We have an application that needs to do bulk reads of ENTIRE
> Postgres tables very quickly (i.e. select * from table). We have
> observed that such sequential scans run roughly 20 times slower than
> observed raw disk reads (5 MB/s versus 100 MB/s). Part of this is
> due to the storage overhead we have observed in Postgres: in the
> example below, it takes 1 GB to store 350 MB of nominal data. Even
> accounting for that overhead, we would expect bulk read rates of
> roughly 35 MB/s (about a third of the raw 100 MB/s).
> Our business logic does operations on the resulting data such that
> the output is several orders of magnitude smaller than the input. So
> we had hoped that by putting our business logic into stored procedures
> (and thus drastically reducing the amount of data flowing to the
> client) our throughput would go way up. This did not happen.
Can you disclose what kinds of manipulations you want to do on the data?
I am asking because maybe there is a fancy query (possibly using window
functions and/or aggregate functions) that gets you the speed you need
without transferring the whole data set to the client.
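
To make that concrete, here is a sketch of the kind of query I have in
mind. The table and column names (measurements, station_id, observed_at,
reading) are made up, since I don't know your schema; the point is only
that the reduction happens inside the server and just the small result
set crosses the wire:

  SELECT station_id,
         date_trunc('day', observed_at) AS day,
         min(reading) AS min_reading,
         max(reading) AS max_reading,
         avg(reading) AS avg_reading
  FROM   measurements
  GROUP  BY station_id, date_trunc('day', observed_at)
  ORDER  BY station_id, day;

A query like this still has to scan the whole table, but the output is
a handful of rows per station and day instead of the raw data, so the
transfer to the client should no longer dominate.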
> So our questions are as follows:
>
> Is there any way using stored procedures (maybe C code that calls
> SPI directly) or some other approach to get close to the expected 35
> MB/s doing these bulk reads? Or is this the price we have to pay for
> using SQL instead of some NoSQL solution? (We actually tried Tokyo
> Cabinet and found it to perform quite well. However, it does not
> measure up to Postgres in terms of replication, data interrogation,
> community support, acceptance, etc.)
Kind regards
robert