Discussion: storing binary files / memory limit
Hi,

I'm kind of struggling with storing binary files inside the database. I'd like to store them in BYTEA columns, but the problem is that the files are quite large (a few megabytes, usually), so the PHP memory_limit is reached when escaping the data. It does not matter whether I use the old-fashioned pg_* functions or the new PDO extension - with BYTEA columns both behave the same.

For example, with a 16MB file (called input.data) and an 8MB memory_limit, the following fails with an 'Allowed memory size exhausted' fatal error, because all the data read from the file has to be escaped at once:

------------------------------------------------------------------------
$pdo = new PDO('pgsql: ... ');

// data_table (id INTEGER, data BYTEA)
$stmt = $pdo->prepare('INSERT INTO data_table VALUES (:id, :data)');

$id = 1;
$input = fopen('/tmp/input.data', 'rb');

$stmt->bindParam(':id', $id, PDO::PARAM_INT);
$stmt->bindValue(':data', $input, PDO::PARAM_LOB);
$stmt->execute();
------------------------------------------------------------------------

So far I've found two solutions, but neither meets all my requirements:

1) Using LOBs - with LOBs, 'streaming' of the data is possible, i.e. you can do this:

---------------------------------------------------------------------
$pdo->beginTransaction();

$input = fopen('/tmp/input.data', 'rb');

$oid = $pdo->pgsqlLOBCreate();
$stream = $pdo->pgsqlLOBOpen($oid, 'w');
stream_copy_to_stream($input, $stream);

$pdo->commit();
---------------------------------------------------------------------

This is nice because it saves memory, but a serious disadvantage (for me) is the lack of referential integrity - you can delete a LOB even if it's referenced from some table (i.e. its OID is stored there). So it's basically the same as storing the data directly in the filesystem and storing just a path to it.

2) Using BYTEA columns and doing the 'streaming' on my own - store the data as usual, but upload it 'in pieces' (say 100kB).
This is achieved by an initial INSERT (with, say, the first 100kB of data), followed by a number of UPDATEs with 'data = data || ... new data ...' appending the rest. When reading the data, you have to do the same - read it piece by piece. Yes, it's kind of dirty (especially for large files this may mean a lot of queries), but it preserves all the nice BYTEA column features (no dangling / missing LOBs, etc.).

Is there any other way to solve storing of large files in PostgreSQL? These are the most important requirements for the solution:

- I do want to store the files inside the database; I don't want to store just the paths. It's quite difficult to combine transactional (database) and non-transactional (filesystem) resources properly.
- I want to preserve as many 'nice' PostgreSQL features as possible (referential integrity, for example).
- Storing / retrieving the files has to work with a quite small PHP memory_limit (say 8MB). I can't (and don't want to) modify this option - it's a shared server.
- Optimization is a serious criterion, as is reliability.

Thanks for all your recommendations.

regards
Tomas
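[The chunked approach Tomas describes in (2) might look roughly like this in PHP - a sketch only, assuming the pg_* functions and the data_table(id, data) schema from the first example; the chunk size and loop structure are illustrative, not from the thread. It needs a live PostgreSQL server to run.]

```php
<?php
// Sketch: upload a file into a BYTEA column in ~100kB pieces, so only one
// piece is read and escaped in memory at a time and memory_limit is never hit.
$conn = pg_connect('dbname=test');
$chunkSize = 100 * 1024;

$input = fopen('/tmp/input.data', 'rb');
$first = true;

while (!feof($input)) {
    $chunk = fread($input, $chunkSize);
    if ($chunk === false || $chunk === '') {
        break;
    }
    $escaped = pg_escape_bytea($conn, $chunk);
    if ($first) {
        pg_query($conn, "INSERT INTO data_table (id, data) VALUES (1, '$escaped')");
        $first = false;
    } else {
        // Each append rewrites the row, leaving a dead row behind (see below).
        pg_query($conn, "UPDATE data_table SET data = data || '$escaped' WHERE id = 1");
    }
}
fclose($input);

// Reading back in pieces, slicing on the server side with substring(),
// so the client never holds more than one piece at a time:
$offset = 1;
do {
    $res   = pg_query($conn, "SELECT substring(data from $offset for $chunkSize) AS piece
                              FROM data_table WHERE id = 1");
    $piece = pg_unescape_bytea(pg_fetch_result($res, 0, 'piece'));
    echo $piece;                 // or fwrite() into an output stream
    $offset += $chunkSize;
} while (strlen($piece) == $chunkSize);
```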
On Sat, 2008-10-11 at 18:26 +0200, Tomas Vondra wrote:
>
> Is there any other way to solve storing of large files in PostgreSQL?

No, not until there are functions that let you fopen() on the bytea column.

Also, your "... || more_data" solution will generate large numbers of dead rows and require frequent vacuuming.

> - Optimization is a serious criterion, as is reliability.

If you're using tables with very large columns, make sure you index on every other column you're going to access them by. If PostgreSQL has to resort to full-table scans on this table, especially with a low memory constraint, you could easily end up with it doing an on-disk sort on a copy of the data.

If you *have* to store it in a table column (and it really isn't the most efficient way of doing it), then create a separate table for it which is just SERIAL + data.

Cheers,
                    Andrew McMillan.

------------------------------------------------------------------------
Andrew @ McMillan .Net .NZ                      Porirua, New Zealand
http://andrew.mcmillan.net.nz/                  Phone: +64(272)DEBIAN

  It is often easier to tame a wild idea than to breathe life into a
  dull one.  -- Alex Osborn
------------------------------------------------------------------------
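[The separate SERIAL + data table Andrew suggests could look like this - a sketch only; the table and column names are illustrative, not from the thread:]

---------------------------------------------------------------------
-- Keep metadata and the blob in separate tables, so scans and sorts
-- on document metadata never drag the bytea data along.
CREATE TABLE document_data (
    id   SERIAL PRIMARY KEY,
    data BYTEA
);

CREATE TABLE documents (
    id      SERIAL PRIMARY KEY,
    title   VARCHAR(128),
    data_id INTEGER REFERENCES document_data (id)
);
---------------------------------------------------------------------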
> On Sat, 2008-10-11 at 18:26 +0200, Tomas Vondra wrote:
>> Is there any other way to solve storing of large files in PostgreSQL?
>
> No, not until there are functions that let you fopen() on the bytea
> column.

:-(

> Also, your "... || more_data" solution will generate large numbers of
> dead rows and require frequent vacuuming.

Yes, I'm aware of this, and it's one more deficiency of that solution. But inserting data is not very frequent (a few INSERTs a day), especially compared to reading (several hundred SELECTs a day).

>> - Optimization is a serious criterion, as is reliability.
>
> If you're using tables with very large columns, make sure you index on
> every other column you're going to access them by. If PostgreSQL has to
> resort to full-table scans on this table, especially with a low memory
> constraint, you could easily end up with it doing an on-disk sort on a
> copy of the data.

Well, I believe the schema is quite well optimized, so this shouldn't be a problem. All the necessary indexes exist, and the queries execute quite fast (we have comprehensive performance logging of queries, so I'm quite sure about this).

But I'm not sure what you mean by 'low memory constraint' - the memory limit I've been talking about is purely a PHP feature, so it only affects inserting / reading and escaping / unescaping the data. Anyway, 99% of the queries that return multiple rows do not return BYTEA columns - those columns are referenced only in queries returning a single row, so there is no problem with sorting / memory occupied by the postmaster process.

> If you *have* to store it in a table column (and it really isn't the
> most efficient way of doing it), then create a separate table for it
> which is just SERIAL + data.

I *want* to store it in a table column, because I'm not able to come up with a better solution. As I understand it, creating a separate table for the binary data is almost the same as using plain bytea columns.
We don't have performance problems (thanks to proper indexes and the TOAST architecture), and a separate table does not solve the problem I described in my original post (hitting the PHP memory limit). OK, it would save a little space when using the 'append' strategy described in my previous post (data = data || new_data), but such a table usually consists of a small amount of metadata plus a large amount of binary data, so the space wasted on metadata in dead rows is negligible compared to the space wasted on the bytea data in those dead rows.

For example, one of our tables is used to store documents, so it has roughly this structure:

Documents (
    id          SERIAL,
    created     DATE,
    title       VARCHAR(128),
    author      VARCHAR(128),
    description TEXT,    -- short (usually less than 2000 characters)
    data        BYTEA
)

The 'data' column may hold several megabytes, so the metadata occupies less than 1% of the row.

If you know a better way to store binary data, please describe it here. The only other way I'm aware of is LOBs - they solve the problem of inserting the data (by streaming), but have other disadvantages (no referential integrity, etc.).

Creating a separate table for the binary data does look interesting under one condition - the file is stored split into pieces. This basically mimics the LOB storage (the pg_largeobject table). It still has no support for streaming the data, but it solves the PHP memory limit problem and does not create a large number of dead rows.

regards
Tomas
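[The 'split' variant Tomas sketches at the end could be a chunk table keyed by document id and piece number, much like pg_largeobject's (loid, pageno, data) layout. A sketch only - the names and chunk size are illustrative, and it assumes Documents.id is a primary key:]

---------------------------------------------------------------------
-- One row per ~100kB piece; ON DELETE CASCADE gives the referential
-- integrity that plain LOBs lack, and per-piece INSERTs avoid the
-- dead-row churn of the 'data = data || ...' append strategy.
CREATE TABLE document_chunks (
    document_id INTEGER NOT NULL REFERENCES documents (id) ON DELETE CASCADE,
    chunk_no    INTEGER NOT NULL,
    data        BYTEA   NOT NULL,
    PRIMARY KEY (document_id, chunk_no)
);

-- Write each piece as its own row; read them back in order:
--   SELECT data FROM document_chunks WHERE document_id = 1 ORDER BY chunk_no;
---------------------------------------------------------------------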
On Sat, 2008-10-11 at 23:41 +0200, Tomas Vondra wrote:
> >
> > If you're using tables with very large columns, make sure you index on
> > every other column you're going to access them by. If PostgreSQL has to
> > resort to full-table scans on this table, especially with a low memory
> > constraint, you could easily end up with it doing an on-disk sort on a
> > copy of the data.
>
> But I'm not sure what you mean by 'low memory constraint' - the memory
> limit I've been talking about is purely a PHP feature, so it only affects
> inserting / reading and escaping / unescaping the data.

In this case I'm not referring to PHP memory, but to PostgreSQL memory. If you're on a memory-constrained shared system, then it's not just PHP that will be configured for a smaller memory footprint...

> I *want* to store it in a table column

Yes, that's certainly what you seem to be saying. Personally I would steer clear of storing many megabytes in a bytea column on a memory-constrained system, but you're closer to the application and will make your own decision.

> If you know a better way to store binary data, please describe it here.
> The only other way I'm aware of is LOBs - they solve the problem of
> inserting the data (by streaming), but have other disadvantages (no
> referential integrity, etc.).

Yes, your trade-off is essentially efficiency vs. referential integrity. This is a common trade-off, and if you have tight control over how rows are inserted into / deleted from your table, then referential integrity is merely a nice-to-have. If people will be creating / deleting these things all over the application, without the benefit of an API to do so, then referential integrity obviously becomes much more important.

Cheers,
                    Andrew.

------------------------------------------------------------------------
andrew (AT) morphoss (DOT) com                          +64(272)DEBIAN

   Writing is turning one's worst moments into money.  -- J.P. Donleavy
------------------------------------------------------------------------
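[One note on the LOB side of that trade-off: the contrib 'lo' module can at least prevent orphaned large objects - its lo_manage() trigger unlinks the LOB when the referencing row is updated or deleted, and the contrib vacuumlo utility removes LOBs no row references anymore. It does not stop someone from deleting a LOB that is still referenced, though. A sketch only; the table and trigger names are illustrative:]

---------------------------------------------------------------------
-- Requires the contrib 'lo' module; 'lo' is a domain over oid.
CREATE TABLE documents_lo (
    id   SERIAL PRIMARY KEY,
    data lo
);

-- Unlink the large object automatically when its row is updated or
-- deleted, so no orphaned LOBs accumulate.
CREATE TRIGGER t_documents_lo
    BEFORE UPDATE OR DELETE ON documents_lo
    FOR EACH ROW
    EXECUTE PROCEDURE lo_manage(data);
---------------------------------------------------------------------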