Discussion: Tweaking bytea / large object block sizes?
Hi. I'm a new Postgres user. If I happen to ask stupid questions, please feel free to point me to any documentation I should read or guidelines for asking questions.

I'm looking into storing binary data in Postgres and trying to understand how the data is actually stored in the database. The dataset I'm looking at consists of images, photos, and PDF documents, which should commonly be at a minimum 100kb, on average 10mb, and can scale up to 100mb per document. I want to store this data in the database, as I need transactional integrity and want to avoid the extra complexity of managing shared filesystems between a number of frontend application servers and database backends.

The binary data will only be accessed as a whole. So either a complete new file is written to the DB, or a complete file is read and cached on a frontend server. I don't need streaming access or the ability to stream partial data. The use case seems to be well supported by Oracle 11g with the introduction of "secure files" (pdf: [1]).

But from what I've read of Postgres, my best bet is to store the data as large objects [2]. Going all the way down, this means storing the binary data as 2kb chunks and adding table row overhead for each of those chunks. Using the bytea type and the TOAST backend [3], it seems to come down to the same thing: data is actually stored in 2kb chunks for a page size of 8kb.

I'm assuming I'll be able to get ~8kb jumbo-frame packets over a gigabit network connection, and would be able to use an ext4 volume with a block size of either 32kb or 64kb for the volume housing the binary data, but a smaller block size for the one housing the relational data.

Given those other constraints, it seems silly to split the data up into 2kb chunks at the database level. Is there any way the chunk size for binary data can be increased here, independent of the one for "normal relational" data?
Thanks,
Hanno

[1] http://www.oracle.com/us/dm/h2fy11/securefiles-362607.pdf?evite=EMEAFM10041984MPP017
[2] http://www.postgresql.org/docs/9.0/static/catalog-pg-largeobject.html
[3] http://www.postgresql.org/docs/9.0/static/storage-toast.html
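To put rough numbers on the chunking question above: plain Python, nothing Postgres-specific. The 2048-byte figure is the default large object chunk size discussed in the thread, and 64kb is the ext4 block size proposed for the binary volume.

```python
import math

LO_CHUNK = 2048          # default LO chunk size with 8 KB pages, per the discussion
FS_BLOCK = 64 * 1024     # the ext4 block size proposed for the binary volume

def chunks(size: int, unit: int) -> int:
    """Number of fixed-size units needed to hold `size` bytes."""
    return math.ceil(size / unit)

for label, size in [("100 KB", 100 * 1024),
                    ("10 MB", 10 * 1024 * 1024),
                    ("100 MB", 100 * 1024 * 1024)]:
    print(f"{label}: {chunks(size, LO_CHUNK)} pg_largeobject rows "
          f"vs {chunks(size, FS_BLOCK)} filesystem blocks")
```

So a 10mb document means 5120 rows at 2kb chunks, versus 160 blocks on a 64kb filesystem, which is the overhead gap the question is about.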
On Sunday, 12 June 2011 at 18:00 +0200, Hanno Schlichting wrote:
> I'm looking into storing binary data in Postgres and trying to
> understand how data is actually stored in the database. The dataset
> I'm looking at is images, photos, pdf documents which should commonly
> be at a minimum 100kb, on average 10mb and can scale up to 100mb for
> each document. I want to store this data in the database, as I need
> transactional integrity and want to avoid the extra complexity of
> managing shared filesystems between a number of frontend application
> servers and database backends.

I had a similar requirement for the app that's in my sig. It uses a PostgreSQL database, but the binary content of the documents resides in an SQLite database. That way, my PostgreSQL database remains very small and is easy to manage and back up. I have a separate procedure to do incremental backups of the documents as they are added.

The SQLite db looks like this:

CREATE TABLE tbldocument_content (
    id_courrier INTEGER NOT NULL PRIMARY KEY,
    content blob);

CREATE TABLE tbldocument_state (
    id_courrier INTEGER NOT NULL,
    backed_up integer default 0,
    date_created date default (date('now')),
    FOREIGN KEY(id_courrier) REFERENCES tbldocument_content(id_courrier));

CREATE INDEX tbldocument_state_backed_up_idx ON tbldocument_state(backed_up);
CREATE INDEX tbldocument_state_id_courrier_idx ON tbldocument_state(id_courrier);

CREATE TRIGGER create_document_state AFTER INSERT ON tbldocument_content
BEGIN
    INSERT INTO tbldocument_state (id_courrier) VALUES (NEW.id_courrier);
END;

CREATE TRIGGER drop_document_state AFTER DELETE ON tbldocument_content
BEGIN
    DELETE FROM tbldocument_state WHERE id_courrier=OLD.id_courrier;
END;

id_courrier is generated by the PostgreSQL db. Works great. You can't see it in action on the web site with the demo account, though.

--
Vincent Veyron
http://marica.fr/
Claims and litigation management software for legal departments
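Since SQLite ships with Python's standard library, the schema above can be exercised directly. This sketch recreates the content/state tables and insert trigger in an in-memory database and shows the trigger creating the matching state row (the id and payload are stand-ins, not real application data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbldocument_content (
    id_courrier INTEGER NOT NULL PRIMARY KEY,
    content blob);
CREATE TABLE tbldocument_state (
    id_courrier INTEGER NOT NULL,
    backed_up integer default 0,
    date_created date default (date('now')),
    FOREIGN KEY(id_courrier) REFERENCES tbldocument_content(id_courrier));
CREATE TRIGGER create_document_state AFTER INSERT ON tbldocument_content
BEGIN
    INSERT INTO tbldocument_state (id_courrier) VALUES (NEW.id_courrier);
END;
""")

# id_courrier would come from the PostgreSQL side in the real application
conn.execute("INSERT INTO tbldocument_content VALUES (?, ?)",
             (42, b"\x00fake document bytes"))
state = conn.execute(
    "SELECT id_courrier, backed_up FROM tbldocument_state").fetchone()
print(state)  # (42, 0) -- the trigger created the state row, not backed up yet
```

The `backed_up` flag defaulting to 0 is what drives the incremental backup procedure mentioned in the message.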
On 6/12/11 12:00:19 PM, Hanno Schlichting wrote:
> Given those other constraints, it seems silly to split data up into
> 2kb chunks on the database level. Is there any way the chunk size for
> binary data can be increased here independent of the one for "normal
> relational" data?
You could redefine LOBLKSIZE and/or BLCKSZ such that the result is larger chunks stored for large objects, and then build PG from source. I assume that LOBLKSIZE is defined in terms of BLCKSZ for a reason, and that adjusting BLCKSZ is the better way to go.

Before doing this, I would set yourself up a performance test case so that you can be sure your changes are actually leading to an improvement. Since an RDBMS isn't typically used to simulate a large-scale filestore, it's likely that the 8K page size and 2K LO block size are inefficient for that usage, but I wouldn't assume that larger sizes automatically mean more performance until you actually test it.

--
Bill Moran
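For reference, in the PostgreSQL sources LOBLKSIZE is defined in src/include/pg_config_manual.h as BLCKSZ / 4, and the page size itself can be chosen at build time via `configure --with-blocksize` (in kB, up to 32). So raising the page size raises the LO chunk size in lockstep. A quick sketch of that relationship, assuming the BLCKSZ/4 definition still holds in the version being built:

```python
def lo_chunk_size(blcksz: int) -> int:
    # LOBLKSIZE is BLCKSZ / 4 in pg_config_manual.h
    # (assumption: unchanged in the version you build)
    return blcksz // 4

for page_kb in (8, 16, 32):
    print(f"--with-blocksize={page_kb}: "
          f"LOBLKSIZE = {lo_chunk_size(page_kb * 1024)} bytes")
```

With 32 kB pages the LO chunk size would become 8 kB, still well short of the 32-64 kB filesystem blocks discussed above.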
On 06/13/2011 12:00 AM, Hanno Schlichting wrote:
> But from what I read of Postgres, my best bet is to store data as
> large objects [2]. Going all the way down this means storing the
> binary data as 2kb chunks and adding table row overhead for each of
> those chunks. Using the bytea type and the toast backend [3] it seems
> to come down to the same: data is actually stored in 2kb chunks for a
> page size of 8kb.

This is probably much less of a concern than you expect. Consider that your file system almost certainly stores file data in chunks of between 512 bytes and 4kb (the block size) and performs just fine.

Given the file sizes you're working with, I'd try using `bytea' and see how you go. Put together a test or simulation that you can use to evaluate performance if you're concerned.

Maybe one day Linux systems will have a file system capable of transactional behaviour like NTFS, so Pg could integrate with the file system for transactional file management. In the meantime, `bytea' or `lo' seems to be your best bet.

--
Craig Ringer
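In that spirit, a minimal round-trip test could look like the sketch below. It is only a starting point under assumptions not given in the thread: it presumes the psycopg2 driver (mentioned later in the thread) and an already-open connection `conn`; the table and helper names are made up for illustration.

```python
import os

def make_payloads(sizes=(100 * 1024, 10 * 1024 * 1024)):
    """Pseudo-random payloads at the document sizes under discussion."""
    return [os.urandom(n) for n in sizes]

def roundtrip_bytea(conn, payload):
    """Store one document as bytea and read it back; True if it survived intact."""
    import psycopg2  # assumed driver; import deferred so the sketch loads without it
    with conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE docs (id serial PRIMARY KEY, body bytea)")
        cur.execute("INSERT INTO docs (body) VALUES (%s) RETURNING id",
                    (psycopg2.Binary(payload),))
        doc_id = cur.fetchone()[0]
        cur.execute("SELECT body FROM docs WHERE id = %s", (doc_id,))
        return bytes(cur.fetchone()[0]) == payload
```

Timing `roundtrip_bytea` over the payload sizes would give the kind of evidence Craig suggests gathering before tuning anything.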
On Sun, Jun 12, 2011 at 11:00 AM, Hanno Schlichting <hanno@hannosch.eu> wrote:
> Given those other constraints, it seems silly to split data up into
> 2kb chunks on the database level.
> Is there any way the chunk size for
> binary data can be increased here independent of the one for "normal
> relational" data?

I would not even consider tweaking the internal block sizes until you've determined there is a problem you expect you might solve by doing so. The single most important factor affecting blob performance in postgres is how you send and receive the data -- you absolutely want to use the binary protocol mode (especially for postgres versions that don't support hex mode). The next thing to look at is bytea vs. large objects -- large objects are a bit faster and have a higher theoretical limit on size, but bytea is a standard type and offers a lot of conveniences -- I'd say stick with bytea unless you've determined there is a reason not to. That said, if you are not writing C, some client-side drivers might only allow binary transmission through the lo interface, so that's something to think about.

merlin
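The transfer-format point is easy to quantify: with 9.0's default `bytea_output = 'hex'`, a text-mode result ships each byte as two ASCII characters behind a `\x` prefix, while binary protocol mode carries the raw bytes:

```python
payload = bytes(range(256)) * 4          # 1024 raw bytes

text_wire = "\\x" + payload.hex()        # the text-mode hex representation
print(len(payload), len(text_wire))      # 1024 vs 2050: text mode roughly doubles the traffic
```

For the older escape format the expansion is even worse for arbitrary binary data, which is why the advice above to use binary mode matters most on pre-hex versions.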
On 13/06/11 09:27, Merlin Moncure wrote:
> want to use the binary protocol mode (especially for postgres versions
> that don't support hex mode)

Allowing myself to get a wee bit sidetracked:

I've been wondering lately why hex was chosen as the new input/output format when the bytea_output change went in. The base64 encoding is trivial to implement, already supported by standard libraries for many languages and by add-ons for the rest, fast to encode/decode, and much more compact than a hex encoding, so it seems like a more attractive option. PostgreSQL already supports base64 in explicit encode()/decode() calls. Was concern about input format ambiguity a motivator for avoiding base64?

Checking the archives:

http://archives.postgresql.org/pgsql-hackers/2009-05/msg00238.php
http://archives.postgresql.org/pgsql-hackers/2009-05/msg00192.php

... it was considered, but knocked back because it's sufficiently more complex to encode that it could matter on big dumps, and because standards-compliant base64 appears to require newlines - something that was viewed as ugly and problematic. Reliably detecting the input format was also raised, but as the same solution used for hex input would apply to base64 input too, it doesn't look like that was a big factor. Personally, even with the newline 'ick factor', I think it'd be pretty nice to have as an option for dumps and COPY.

Ascii85 (base85) would be another alternative. It's used in PostScript and PDF, but isn't anywhere near as widespread as base64. It's still trivial to implement and is 7-8% more space-efficient than base64.

After a bit of digging, though, I can't help wondering if a binary dump format that's machine-representation independent, fast and compact isn't more practical. Tools like Thrift (http://thrift.apache.org), Protocol Buffers, etc. might make it less painful. Maybe an interesting GSoC project? Supporting binary COPY with a machine-independent format would be a natural extension of that, too.

--
Craig Ringer
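The compactness claim above is simple to verify with the standard library: base64 expands binary data by a factor of 4/3 (before any mandated line wrapping), against exactly 2x for hex:

```python
import base64
import binascii

payload = bytes(range(256)) * 12   # 3072 raw bytes (a multiple of 3, so no base64 padding)

hex_len = len(binascii.hexlify(payload))
b64_len = len(base64.b64encode(payload))
print(hex_len / len(payload), round(b64_len / len(payload), 3))  # 2.0 vs 1.333
```

Wrapping at 76 characters per line, as MIME base64 requires, adds roughly another 1.3% on top of the 4/3 ratio, which is the "newline ick factor" weighed in the archived discussion.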
On Mon, Jun 13, 2011 at 1:58 AM, Craig Ringer <craig@postnewspapers.com.au> wrote: > On 13/06/11 09:27, Merlin Moncure wrote: > >> want to use the binary protocol mode (especially for postgres versions >> that don't support hex mode) > > Allowing myself to get a wee bit sidetracked: > > I've been wondering lately why hex was chosen as the new input/output > format when the bytea_output change went in. The Base64 encoding is > trivial to implement, already supported by standard libraries for many > languages and add-ons for the rest, fast to encode/decode, and much more > compact than a hex encoding, so it seems like a more attractive option. > PostgreSQL already supports base64 in explicit 'escape()' calls. yeah -- I remember the discussion. I think the bottom line is that hex is just simpler all around. This conveys a number of small advantages that, when added up, outweigh the slightly better space efficiency. merlin
On Mon, Jun 13, 2011 at 3:27 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I would not even consider tweaking the internal block sizes until
> you've determined there is a problem you expect you might solve by
> doing so.

It's not a problem as such, but managing data chunks of 2000 bytes plus the hundreds of rows per object in the large_object table for 10mb objects seems like a lot of wasted overhead, especially if the underlying filesystem manages 32kb or 64kb blocks. My impression of those values was that they are a bit antiquated, or are tuned for storing small variable-character objects, but not anything I'd call "binary large objects" these days.

> The single most important factor affecting blob performance
> in postgres is how you send and receive the data -- you absolutely
> want to use the binary protocol mode (especially for postgres versions
> that don't support hex mode). The next thing to look at is using
> bytea/large object -- large objects are a bit faster and have a higher
> theoretical limit on size but bytea is a standard type and this offers
> a lot of conveniences -- I'd say stick with bytea unless you've
> determined there is a reason not to. That said, if you are not
> writing C some client side drivers might only allow binary
> transmission through the lo interface so that's something to think
> about.

Thanks, I got as much from the docs and the blogosphere. We are going to use the large object interface. That seems to be the least we can do - especially to avoid any encoding overhead. We are storing bytes, after all, and not ascii characters, so there should be no encoding at all. We aren't using SQL as the query interface as such, but the Python bindings (http://www.initd.org/psycopg/), so we can take full advantage of the underlying large object APIs and do pretty direct lo_import / lo_export calls. We are targeting at least Postgres 9.0, potentially going for 9.1 soon after it hits final.
Once we get further in the project, we'll of course do some intensive benchmarking for the various options with our specific data and configuration. I'm just trying to understand what bits and bytes are actually stored and transferred behind all those API's. Hanno
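For completeness, a sketch of the psycopg2 large object calls mentioned above. This is hedged: it assumes psycopg2's `connection.lobject()` interface (where passing a filename triggers a server-side lo_import) and an open connection `conn` that is not shown; check the psycopg2 docs for the exact signatures before relying on it.

```python
def store_file(conn, path):
    """Import a file as a large object (server-side lo_import); returns its OID."""
    lob = conn.lobject(0, "wb", 0, path)   # the new_file argument triggers lo_import
    oid = lob.oid
    lob.close()
    conn.commit()
    return oid

def fetch_file(conn, oid, chunk_size=64 * 1024):
    """Yield a large object's contents in filesystem-block-sized chunks."""
    lob = conn.lobject(oid, "rb")
    try:
        while True:
            chunk = lob.read(chunk_size)
            if not chunk:
                return
            yield chunk
    finally:
        lob.close()
```

Reading in 64 kB chunks on the client side is independent of the server's internal 2 kB chunking, which is part of why the wire-level behaviour is worth measuring separately from the storage layout.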
On Tue, Jun 14, 2011 at 6:48 AM, Hanno Schlichting <hanno@hannosch.eu> wrote: > On Mon, Jun 13, 2011 at 3:27 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> I would not even consider tweaking the internal block sizes until >> you've determined there is a problem you expect you might solve by >> doing so. > > It's not a problem as such, but managing data chunks of 2000 bytes + > the hundreds of rows per object in the large_object table for 10mb > objects seems like a lot of wasted overhead, especially if the > underlying filesystem manages 32kb or 64kb blocks. My impression of > those values was that they are a bit antiquated or are tuned for > storing small variable character objects, but not anything I'd call > "binary large objects" these days. That very well may be the case, and 10mb is approaching the upper limit of what is sane to store inside the database. Still, if you're going through the trouble to adjust the setting and recompile, I'd definitely benchmark the changes and post your findings here. Point being, all else being equal, it's always better to run with stock postgres if you can manage it. merlin