Thread: [GENERAL] storing large files in database - performance


[GENERAL] storing large files in database - performance

From: Eric Hill
Date:

Hey,

 

I searched and found a few discussions of storing large files in the database in the archives, but none that specifically address performance and how large of files can realistically be stored in the database.

 

I have a node.js application using PostgreSQL to store uploaded files.  The column in which I am storing the file contents is of type “bytea” with “Storage” type set to “EXTENDED”.  Storing a 12.5 MB file is taking 10 seconds, and storing a 25MB file is taking 37 seconds.  Two notable things about those numbers:  It seems like a long time, and the time seems to grow exponentially with file size rather than linearly.

 

Do these numbers surprise you?  Are these files just too large for storage in PostgreSQL to be practical?  Could there be something about my methodology that is slowing things down?

 

I do have the Sequelize ORM and the pg driver in between my code and the database. 

 

Thanks,

 

Eric

Re: [GENERAL] storing large files in database - performance

From: "Daniel Verite"
Date:
    Eric Hill wrote:

> I am storing the file contents is of type "bytea" with "Storage" type set to
> "EXTENDED".  Storing a 12.5 MB file is taking 10 seconds

That seems really slow indeed.
Can you import the same file to the same server with psql's
\lo_import command and see how much time it takes?
That would give a baseline to compare against.
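
For example, with \timing the elapsed time is reported directly (the path
below is a placeholder; run this in psql connected to the same database):

\timing on
\lo_import '/path/to/the/same/25MB/file'

Afterwards, \lo_unlink with the OID printed by \lo_import removes the test
object again.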


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: [GENERAL] storing large files in database - performance

From: Thomas Kellerer
Date:
Eric Hill wrote on 16.05.2017 at 14:25:
> I have a node.js application using PostgreSQL to store uploaded
> files. The column in which I am storing the file contents is of type
> “bytea” with “Storage” type set to “EXTENDED”. Storing a 12.5 MB file
> is taking 10 seconds, and storing a 25MB file is taking 37 seconds.
> Two notable things about those numbers: It seems like a long time,
> and the time seems to grow exponentially with file size rather than
> linearly.
>
> Do these numbers surprise you?

When my (JDBC based) SQL client and the database server are on the same computer, I can store a 45MB file in about 4 seconds, a 240MB file in about 20 seconds.
When I do the same with a server on the (same) network, the 45MB takes about 9 seconds, the 240MB takes about 60 seconds.

So yes, these numbers sound a bit surprising.

> Could there be something about my methodology that is slowing things down?

How far are your app server and the database server apart?
Maybe it's just a case of a slow network.

Thomas

Re: [GENERAL] storing large files in database - performance

From: Adrian Klaver
Date:
On 05/16/2017 05:25 AM, Eric Hill wrote:
> Hey,
>
> I searched and found a few discussions of storing large files in the
> database in the archives, but none that specifically address performance
> and how large of files can realistically be stored in the database.
>
> I have a node.js application using PostgreSQL to store uploaded files.
> The column in which I am storing the file contents is of type “bytea”
> with “Storage” type set to “EXTENDED”.  Storing a 12.5 MB file is taking
> 10 seconds, and storing a 25MB file is taking 37 seconds.  Two notable
> things about those numbers:  It seems like a long time, and the time
> seems to grow exponentially with file size rather than linearly.
>
> Do these numbers surprise you?  Are these files just too large for
> storage in PostgreSQL to be practical?  Could there be something about
> my methodology that is slowing things down?

Yes, it does surprise me. I just tested inserting an 11MB file using
psycopg2 (Python) and it took less than a second.

>
> I do have the Sequelize ORM and the pg driver in between my code and the
> database.
>
> Thanks,
>
> Eric
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] storing large files in database - performance

From: John R Pierce
Date:
On 5/16/2017 7:35 AM, Thomas Kellerer wrote:
> When my (JDBC based) SQL client and the database server are on the same computer...

node.js is Javascript, not java w/ jdbc


--
john r pierce, recycling bits in santa cruz



Re: [GENERAL] storing large files in database - performance

From: John R Pierce
Date:
On 5/16/2017 5:25 AM, Eric Hill wrote:
> I do have the Sequelize ORM and the pg driver in between my code and
> the database.


Can you try a similar test without the ORM, just going straight from
node.js to sql ?
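
Something along these lines would take Sequelize out of the picture (a rough
sketch only; connection settings and the table/column names are placeholders,
and node-postgres sends a Buffer parameter as a bytea value):

// minimal sketch: time a bytea INSERT using only the pg driver, no ORM
var fs = require('fs');
var pg = require('pg');

var pool = new pg.Pool({
   host:     'localhost',
   port:     5432,
   database: 'mydb',
   user:     'postgres',
   password: 'secret'
});

// read the test file into a Node Buffer
var contents = fs.readFileSync('/path/to/testfile.bin');

console.time('bytea insert');
pool.query('INSERT INTO "indexFiles" ("_id", "contents") VALUES ($1, $2)',
           ['test-id', contents],
           function (err) {
              console.timeEnd('bytea insert');
              if (err) {
                 console.error(err);
              }
              pool.end();
           });

If that is fast, the time is going into the layers above the driver; if it is
just as slow, the problem is further down.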


--
john r pierce, recycling bits in santa cruz



Re: [GENERAL] storing large files in database - performance

From: Adrian Klaver
Date:
On 05/16/2017 07:44 AM, John R Pierce wrote:
> On 5/16/2017 7:35 AM, Thomas Kellerer wrote:
>> When my (JDBC based) SQL client and the database server are on the
>> same computer...
>
> node.js is Javascript, not java w/ jdbc

I think it was more a point of comparison, like my using a Python
example. Something to show that the problem is probably not on the Postgres end.

>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] storing large files in database - performance

From: Thomas Kellerer
Date:
John R Pierce wrote on 16.05.2017 at 16:44:
> On 5/16/2017 7:35 AM, Thomas Kellerer wrote:
>> When my (JDBC based) SQL client and the database server are on the same computer...
>
> node.js is Javascript, not java w/ jdbc

I know that.

I mentioned JDBC so that it's clear that the timings were done using a different technology.

Maybe it's Node.js or the JavaScript "driver" that causes the problems.


Re: [GENERAL] storing large files in database - performance

From: Eric Hill
Date:
OK, thanks very much.  It seems like my process is somehow flawed.  I'll try removing some layers and see if I can figure out what is killing the performance.

Eric



>
> Do these numbers surprise you?  Are these files just too large for
> storage in PostgreSQL to be practical?  Could there be something about
> my methodology that is slowing things down?

Yes, it does surprise me. I just tested inserting an 11MB file using
psycopg2 (Python) and it took less than a second.



Re: [GENERAL] storing large files in database - performance

From: "Peter J. Holzer"
Date:
On 2017-05-16 12:25:03 +0000, Eric Hill wrote:
> I searched and found a few discussions of storing large files in the database
> in the archives, but none that specifically address performance and how large
> of files can realistically be stored in the database.
>
>
>
> I have a node.js application using PostgreSQL to store uploaded files.  The
> column in which I am storing the file contents is of type “bytea” with
> “Storage” type set to “EXTENDED”.

I have mentioned this little experiment before, but I finally put the
results on my web site: https://www.hjp.at/databases/blob-bench/

(Please note that so far I have run this only on one system.
Generalizing to other systems might be premature).


> Storing a 12.5 MB file is taking 10 seconds, and storing a 25MB file
> is taking 37 seconds.  Two notable things about those numbers:  It
> seems like a long time, and the time seems to grow exponentially with
> file size rather than linearly.
>
>
>
> Do these numbers surprise you?

Yes. On my system, storing a 25 MB bytea value takes well under 1 second.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: [GENERAL] storing large files in database - performance

From: Merlin Moncure
Date:
On Tue, May 16, 2017 at 9:51 AM, Thomas Kellerer <spam_eater@gmx.net> wrote:
> John R Pierce wrote on 16.05.2017 at 16:44:
>> On 5/16/2017 7:35 AM, Thomas Kellerer wrote:
>>> When my (JDBC based) SQL client and the database server are on the same computer...
>>
>> node.js is Javascript, not java w/ jdbc
>
> I know that.
>
> I mentioned JDBC so that it's clear that the timings were done using a different technology
>
> Maybe it's Node.js or the JavaScript "driver" that causes the problems.

When writing large objects to the database, method of transmission
will very much determine performance until you start hitting the
natural boundaries imposed by the database.

via (hastily written):
#include "libpqtypes.h"
#include "stdlib.h"
#include "string.h"

int main()
{
  int s = 1024 * 1024 * 256;

  char *p = malloc(s);
  memset(p, 'x', s);
  p[s-1] = 0;

  PGconn *conn = PQconnectdb("");
  PQinitTypes(conn);

  PGresult *res = PQexecf(conn, "insert into foo values(1,%text)", p);

  if(!res)
    fprintf(stderr, "*ERROR: %s\n", PQgeterror());

  PQclear(res);
}

mmoncure@mernix2 09:13 AM /tmp$ gcc -otest test.c -lpq -lpqtypes -I
/home/mmoncure/src/libpqtypes-1.5.1/src -I
/home/mmoncure/pg94/include/ -L
/home/mmoncure/src/libpqtypes-1.5.1/.libs/
mmoncure@mernix2 09:13 AM /tmp$ psql -c "create table foo(i int, f text)"
CREATE TABLE
mmoncure@mernix2 09:13 AM /tmp$ psql -c "alter table foo alter f set
storage external"
ALTER TABLE
mmoncure@mernix2 09:14 AM /tmp$ time
LD_LIBRARY_PATH=/home/mmoncure/src/libpqtypes-1.5.1/.libs ./test

real 0m3.245s
user 0m0.092s
sys 0m0.232s
mmoncure@mernix2 09:15 AM /tmp$ psql -c "select
pg_size_pretty(pg_table_size('foo'))"
 pg_size_pretty
----------------
 266 MB
(1 row)


...that's over 76mb/sec (to local server) for 256mb transfer.  That's
pretty good I think.   We have a 1GB barrier on bytea/text and (at
least in C, with certain reasonable precautions) you can work
comfortably under that limit.  There might be other better strategies
but it can be done.

merlin


Re: [GENERAL] storing large files in database - performance

From: Eric Hill
Date:
I would be thrilled to get 76 MB per second, and it is comforting to know that we have that as a rough upper bound on performance.  I've got work to do to figure out how to approach that upper bound from Node.js.

In the meantime, I've been looking at performance on the read side.  For that, I can bypass all my Node.js layers and just run a query from pgAdmin 4.  I ran this query, where indexFile.contents for the row in question is 25MB in size.  The query itself took 4 seconds in pgAdmin 4.  Better than the 12 seconds I'm getting in Node.js, but still on the order of 6MB per second, not 76.  Do you suppose pgAdmin 4 and I are doing similarly inefficient things in querying bytea values?

Thanks,

Eric

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure@gmail.com] 
Sent: Wednesday, May 17, 2017 10:21 AM
To: Thomas Kellerer <spam_eater@gmx.net>
Cc: PostgreSQL General <pgsql-general@postgresql.org>
Subject: Re: storing large files in database - performance

On Tue, May 16, 2017 at 9:51 AM, Thomas Kellerer <spam_eater@gmx.net> wrote:
> John R Pierce wrote on 16.05.2017 at 16:44:
>> On 5/16/2017 7:35 AM, Thomas Kellerer wrote:
>>> When my (JDBC based) SQL client and the database server are on the same computer...
>>
>> node.js is Javascript, not java w/ jdbc
>
> I know that.
>
> I mentioned JDBC so that it's clear that the timings were done using a 
> different technology
>
> Maybe it's Node.js or the JavaScript "driver" that causes the problems.

When writing large objects to the database, method of transmission will very much determine performance until you start hitting the natural boundaries imposed by the database.
 

via (hastily written):
#include "libpqtypes.h"
#include "stdlib.h"
#include "string.h"

int main()
{
  int s = 1024 * 1024 * 256;

  char *p = malloc(s);
  memset(p, 'x', s);
  p[s-1] = 0;

  PGconn *conn = PQconnectdb("");
  PQinitTypes(conn);

  PGresult *res = PQexecf(conn, "insert into foo values(1,%text)", p);

  if(!res)
    fprintf(stderr, "*ERROR: %s\n", PQgeterror());

  PQclear(res);
}

mmoncure@mernix2 09:13 AM /tmp$ gcc -otest test.c -lpq -lpqtypes -I /home/mmoncure/src/libpqtypes-1.5.1/src -I
/home/mmoncure/pg94/include/ -L /home/mmoncure/src/libpqtypes-1.5.1/.libs/
 
mmoncure@mernix2 09:13 AM /tmp$ psql -c "create table foo(i int, f text)"
CREATE TABLE
mmoncure@mernix2 09:13 AM /tmp$ psql -c "alter table foo alter f set storage external"
ALTER TABLE
mmoncure@mernix2 09:14 AM /tmp$ time
LD_LIBRARY_PATH=/home/mmoncure/src/libpqtypes-1.5.1/.libs ./test

real 0m3.245s
user 0m0.092s
sys 0m0.232s
mmoncure@mernix2 09:15 AM /tmp$ psql -c "select pg_size_pretty(pg_table_size('foo'))"
 pg_size_pretty
----------------
 266 MB
(1 row)


...that's over 76mb/sec (to local server) for 256mb transfer.  That's
pretty good I think.   We have a 1GB barrier on bytea/text and (at
least in C, with certain reasonable precautions) you can work comfortably under that limit.  There might be other better strategies but it can be done.
 

merlin


Re: [GENERAL] storing large files in database - performance

From: Merlin Moncure
Date:
On Thu, May 18, 2017 at 7:34 AM, Eric Hill <Eric.Hill@jmp.com> wrote:
> I would be thrilled to get 76 MB per second, and it is comforting to know that we have that as a rough upper bound on
> performance.  I've got work to do to figure out how to approach that upper bound from Node.js.
>
> In the meantime, I've been looking at performance on the read side.  For that, I can bypass all my Node.js layers and
> just run a query from pgAdmin 4.  I ran this query, where indexFile.contents for the row in question is 25MB in size.
> The query itself took 4 seconds in pgAdmin 4.  Better than the 12 seconds I'm getting in Node.js, but still on the order
> of 6MB per second, not 76.  Do you suppose pgAdmin 4 and I are doing similarly inefficient things in querying bytea
> values?

Probably.  I haven't spent a lot of time with pgAdmin 4 so I'm not
entirely sure.  If you want a quick and dirty comparison, try running
your query in psql unaligned mode for a comparison point.  You can
also do \copy BINARY in the case of byte transfers.
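
For instance, something like this (table, column, and id are placeholders;
\o /dev/null discards the output so you are mostly timing the transfer and
formatting):

\timing on
\a
\t
\o /dev/null
select contents from index_files where id = '<your id>';
\o

and for the binary route, roughly:

\copy (select contents from index_files where id = '<your id>') to '/tmp/contents.bin' with (format binary)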

The basic problem is not really the database, it's that database
interaction APIs tend not to be directed to this kind of problem.
The big picture issues are:

*) Driver overhead marshaling from wire format to managed types

*) Driver overhead for memory management

*) Wire format issues.  Certain types are *much* faster with the
binary wire format and are additionally much more memory efficient.
Your bytea transfers are probably being serialized to text and back in
both directions which is very wasteful, especially for very large
transfers since it's wasteful in terms of memory.

If I were to seriously look at node.js performance, my rough thinking
is that I'd want to be setting up the javascript variables directly in
C somehow using plv8 internal routines.  Short of that, I would
probably be querying all data out of postgres in json rather than
serializing individual fields (which is what I generally do in
practice).

Another point, some googling turned up
https://www.npmjs.com/package/pg-large-object which is definitely
something to consider trying.

merlin


Re: [GENERAL] storing large files in database - performance

From: Eric Hill
Date:
My apologies: I said I ran "this query" but failed to include the query.  It was merely this:

SELECT "indexFile"."_id", "indexFile"."contents"
FROM "mySchema"."indexFiles" AS "indexFile"
WHERE "indexFile"."_id" = '591c609bb56d0849404e4720';

Eric

-----Original Message-----
From: Eric Hill [mailto:Eric.Hill@jmp.com] 
Sent: Thursday, May 18, 2017 8:35 AM
To: Merlin Moncure <mmoncure@gmail.com>; Thomas Kellerer <spam_eater@gmx.net>
Cc: PostgreSQL General <pgsql-general@postgresql.org>
Subject: Re: storing large files in database - performance

I would be thrilled to get 76 MB per second, and it is comforting to know that we have that as a rough upper bound on performance.  I've got work to do to figure out how to approach that upper bound from Node.js.

In the meantime, I've been looking at performance on the read side.  For that, I can bypass all my Node.js layers and just run a query from pgAdmin 4.  I ran this query, where indexFile.contents for the row in question is 25MB in size.  The query itself took 4 seconds in pgAdmin 4.  Better than the 12 seconds I'm getting in Node.js, but still on the order of 6MB per second, not 76.  Do you suppose pgAdmin 4 and I are doing similarly inefficient things in querying bytea values?

Thanks,

Eric

Re: [GENERAL] storing large files in database - performance

From: Eric Hill
Date:
Thanks, Merlin - lots of good information here, and I had not yet stumbled across pg-large-object - I will look into
it.

Eric

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure@gmail.com] 
Sent: Thursday, May 18, 2017 9:49 AM
To: Eric Hill <Eric.Hill@jmp.com>
Cc: Thomas Kellerer <spam_eater@gmx.net>; PostgreSQL General <pgsql-general@postgresql.org>
Subject: Re: storing large files in database - performance


On Thu, May 18, 2017 at 7:34 AM, Eric Hill <Eric.Hill@jmp.com> wrote:
> I would be thrilled to get 76 MB per second, and it is comforting to know that we have that as a rough upper bound on
> performance.  I've got work to do to figure out how to approach that upper bound from Node.js.
>
> In the meantime, I've been looking at performance on the read side.  For that, I can bypass all my Node.js layers and
> just run a query from pgAdmin 4.  I ran this query, where indexFile.contents for the row in question is 25MB in size.
> The query itself took 4 seconds in pgAdmin 4.  Better than the 12 seconds I'm getting in Node.js, but still on the order
> of 6MB per second, not 76.  Do you suppose pgAdmin 4 and I are doing similarly inefficient things in querying bytea
> values?

Probably.  I haven't spent a lot of time with pgAdmin 4 so I'm not entirely sure.  If you want a quick and dirty comparison, try running your query in psql unaligned mode for a comparison point.  You can also do \copy BINARY in the case of byte transfers.
 

The basic problem is not really the database, it's that database interaction APIs tend not to be directed to this kind of problem.
 
The big picture issues are:

*) Driver overhead marshaling from wire format to managed types

*) Driver overhead for memory management

*) Wire format issues.  Certain types are *much* faster with the binary wire format and are additionally much more memory efficient.  Your bytea transfers are probably being serialized to text and back in both directions which is very wasteful, especially for very large transfers since it's wasteful in terms of memory.
 

If I were to seriously look at node.js performance, my rough thinking is that I'd want to be setting up the javascript variables directly in C somehow using plv8 internal routines.  Short of that, I would probably be querying all data out of postgres in json rather than serializing individual fields (which is what I generally do in practice).
 

Another point, some googling turned up
https://www.npmjs.com/package/pg-large-object which is definitely something to consider trying.

merlin

Re: [GENERAL] storing large files in database - performance

From: Eric Hill
Date:
I am pleased to report that with Merlin's suggestion of using the pg-large-object middleware, I have a test case now showing that I can write a 25MB buffer from Node.js to Postgres in roughly 700 milliseconds.  Here is the JavaScript code, which is nearly verbatim from the example in the pg-large-object doc:
 

packages.testLargeObjects = function(callback) {
   var pgp = require('pg-promise')();
   var LargeObjectManager = require('pg-large-object').LargeObjectManager;
   var PassThrough = require('stream').PassThrough;

   var bufSize = 1024 * 1024 * 25;
   var buf = new Buffer(bufSize);
   buf.fill("pgrocks");

   var connInfo = {
      host:       'localhost',
      port:       5432,
      database:    'mydb',
      user:       'postgres',
      password:    'secret'
   };

   var db = pgp(connInfo);

   db.tx(function(tx) {
      const lObjMgr = new LargeObjectManager({pgPromise: tx});
      const bufferSize = 16384;

      return lObjMgr.createAndWritableStreamAsync(bufferSize)
         .then( ([oid, stream]) => {
            let bufferStream = new PassThrough();
            bufferStream.end(buf);
            bufferStream.pipe(stream);
            return new Promise(function(resolve, reject) {
               stream.on('finish', resolve);
               stream.on('error', reject);
            });
         });
   })
   .then(function() {
      callback();
      pgp.end();
   })
   .catch(function(err) {
      callback(err);
      pgp.end();
   });
};
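
I have not yet timed the read side with pg-large-object, but it should
presumably be the mirror image.  Here is an untested sketch based on the
openAndReadableStreamAsync call in the same doc (the oid argument would be
the one returned by the write above):

packages.testLargeObjectRead = function(oid, callback) {
   var pgp = require('pg-promise')();
   var LargeObjectManager = require('pg-large-object').LargeObjectManager;

   var connInfo = {
      host:       'localhost',
      port:       5432,
      database:    'mydb',
      user:       'postgres',
      password:    'secret'
   };

   var db = pgp(connInfo);

   db.tx(function(tx) {
      const lObjMgr = new LargeObjectManager({pgPromise: tx});
      const bufferSize = 16384;

      // open the large object and collect the stream back into one Buffer
      return lObjMgr.openAndReadableStreamAsync(oid, bufferSize)
         .then( ([size, stream]) => {
            let chunks = [];
            stream.on('data', (chunk) => chunks.push(chunk));
            return new Promise(function(resolve, reject) {
               stream.on('end', () => resolve(Buffer.concat(chunks)));
               stream.on('error', reject);
            });
         });
   })
   .then(function(buf) {
      callback(null, buf);
      pgp.end();
   })
   .catch(function(err) {
      callback(err);
      pgp.end();
   });
};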

Thanks very much!

Eric

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure@gmail.com] 

Another point, some googling turned up
https://www.npmjs.com/package/pg-large-object which is definitely something to consider trying.

merlin

Re: [GENERAL] storing large files in database - performance

From: Merlin Moncure
Date:
On Fri, May 19, 2017 at 2:04 PM, Eric Hill <Eric.Hill@jmp.com> wrote:
> I am pleased to report that with Merlin's suggestion of using the pg-large-object middleware, I have a test case now
> showing that I can write a 25MB buffer from Node.js to Postgres in roughly 700 milliseconds.  Here is the JavaScript
> code, which is nearly verbatim from the example in the pg-large-object doc:
>
> packages.testLargeObjects = function(callback) {
>    var pgp = require('pg-promise')();
>    var LargeObjectManager = require('pg-large-object').LargeObjectManager;
>    var PassThrough = require('stream').PassThrough;
>
>    var bufSize = 1024 * 1024 * 25;
>    var buf = new Buffer(bufSize);
>    buf.fill("pgrocks");
>
>    var connInfo = {
>       host:       'localhost',
>       port:       5432,
>       database:    'mydb',
>       user:       'postgres',
>       password:    'secret'
>    };
>
>    var db = pgp(connInfo);
>
>    db.tx(function(tx) {
>       const lObjMgr = new LargeObjectManager({pgPromise: tx});
>       const bufferSize = 16384;
>
>       return lObjMgr.createAndWritableStreamAsync(bufferSize)
>          .then( ([oid, stream]) => {
>             let bufferStream = new PassThrough();
>             bufferStream.end(buf);
>             bufferStream.pipe(stream);
>             return new Promise(function(resolve, reject) {
>                stream.on('finish', resolve);
>                stream.on('error', reject);
>             });
>          });
>    })
>    .then(function() {
>       callback();
>       pgp.end();
>    })
>    .catch(function(err) {
>       callback(err);
>       pgp.end();
>    });
> };
>
> Thanks very much!

well done sir! that's probably as fast as you're going to get in node,
at least without a large investment at the driver level.

merlin