Discussion: more than 2GB data string save
How can I save a text string of 2 GB or more in PostgreSQL?
Which data type should I use?
Any suggestions, please.
On Tue, Feb 9, 2010 at 9:38 PM, AI Rumman <rummandba@gmail.com> wrote:
> How to save 2 GB or more text string in Postgresql?
> Which data type should I use?

If you have to, you can use either the lo interface or bytea. Large Object (i.e. lo) allows access much like fopen / fseek etc. in C, but the actual data are not stored in a row with other data; they live alone in the lo space. Bytea is a legit type that you can have as one of many columns in a row, but you retrieve the whole thing at once when you fetch the row.

The preferred way to store 2GB of data is to put it into a file and put the name of the file into the database.
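To make the "file on disk, name in the database" option concrete, here is a minimal Python sketch of that pattern. It uses the stdlib sqlite3 module purely as a stand-in for PostgreSQL so the example is self-contained, and `store_document` / `load_document` are hypothetical helper names, not part of any library:

```python
import os
import sqlite3
import tempfile

# Stand-in schema; in PostgreSQL this would be something like
#   CREATE TABLE documents (id serial PRIMARY KEY, path text NOT NULL);
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")

def store_document(payload: bytes, directory: str) -> int:
    """Write the payload to a file on disk and record only its path in the DB."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".dat")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    cur = db.execute("INSERT INTO documents (path) VALUES (?)", (path,))
    db.commit()
    return cur.lastrowid

def load_document(doc_id: int) -> bytes:
    """Look up the path in the DB, then read the bytes from the filesystem."""
    (path,) = db.execute("SELECT path FROM documents WHERE id = ?", (doc_id,)).fetchone()
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    doc_id = store_document(b"x" * 1_000_000, d)  # pretend this is the 2 GB string
    assert load_document(doc_id) == b"x" * 1_000_000
```

The row stays small, so normal SQL access patterns are unaffected; the trade-off is that the file and the row are not covered by one transaction, so deletions and backups need extra care.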
On Feb 9, 2010, at 9:52 PM, Scott Marlowe wrote:
> If you have to you can use either the lo interface, or you can use
> bytea. Large Object (i.e. lo) allows for access much like fopen /
> fseek etc in C, but the actual data are not stored in a row with
> other data, but alone in the lo space. Bytea is a legit type that you
> can have as one of many in a row, but you retrieve the whole thing at
> once when you get the row.

Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2GB.

> Preferred way to store 2GB data is to put it into a file and put the
> name of the file into the database.

This.

Cheers,
Steve
On Tue, Feb 9, 2010 at 11:26 PM, Steve Atkins <steve@blighty.com> wrote:
> Bytea definitely won't handle more than 1 GB. I don't think the lo
> interface will handle more than 2GB.

That really depends on how compressible it is, doesn't it?
Thanks for your quick answers.

But if I use a file and then store its name in the database, is it possible to use text search tsvector and tsquery indexing on these external files?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
2010/2/10 Scott Marlowe <scott.marlowe@gmail.com>:
>> Bytea definitely won't handle more than 1 GB. I don't think the lo
>> interface will handle more than 2GB.
>
> That really depends on how compressible it is, doesn't it?

No. 1 GB is the maximum length of a varlena value; TOAST is the next possible step.

Regards
Pavel Stehule

p.s.

Processing very large SQL values - like bytea, or text longer than tens of megabytes - is very expensive in memory. When you process a 100MB bytea, you need about 300MB of RAM. Using a bytea over 100MB is not a good idea. The LO interface is better and much faster.
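The ~300 MB figure is plausible as back-of-envelope arithmetic. The sketch below assumes the server holds both the raw value and a text-encoded copy at once (bytea's hex output format uses two characters per byte); the exact multiplier varies by server version and driver, so treat this as an illustration, not a measurement:

```python
# Back-of-envelope illustration of the ~3x memory cost of shipping a large
# bytea value (assumption: hex output format, two characters per byte).
payload = bytes(range(256)) * 4          # 1 KiB of sample binary data

hex_encoded = payload.hex()              # analogous to bytea's hex wire format
assert len(hex_encoded) == 2 * len(payload)

# Scale the same ratio up to Pavel's example: a 100 MB bytea needs the raw
# value plus an encoded copy about twice its size in memory at once.
raw_mb = 100
estimated_peak_mb = raw_mb + 2 * raw_mb
assert estimated_peak_mb == 300          # matches the ~300 MB figure above
```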
On Tue, Feb 9, 2010 at 11:51 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Processing very large SQL values - like bytea, or text longer than
> tens of megabytes - is very expensive in memory. When you process a
> 100MB bytea, you need about 300MB of RAM. Using a bytea over 100MB is
> not a good idea. The LO interface is better and much faster.

Agreed. I wonder what kind of strings the OP is storing, and whether there isn't a more efficient way to do what he wants to do. But we're learning by pieces: first the size requirements, then the need for full text search. So, OP, what are you trying to do? (i.e. the big answer, not the small one.) It could be that there are basic inefficiencies in some part of your method that are more easily recognised when we know what you're trying to do.
On Feb 9, 2010, at 10:38 PM, AI Rumman wrote:
> Thanks for your quick answers.
>
> But if I use a file and then store the name in the database, is it possible to use TEXT search tsvector and tsquery indexing on these external files?

No, no way at all.

A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for.

Are you sure that the documents are that big? A single document of that size would be 400 times the size of the Bible. That's a ridiculously large amount of text, most of a small library.

If the answer is "yes, it's really that big and it's really text" then look at clucene or, better, hiring a specialist.

Cheers,
Steve
On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins <steve@blighty.com> wrote:
> A database isn't really the right way to do full text search for
> single files that big. Even if they'd fit in the database, they're way
> bigger than the underlying index types tsquery uses are designed for.
>
> If the answer is "yes, it's really that big and it's really text" then
> look at clucene or, better, hiring a specialist.

I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

But if that's what they are, then I'd think you'd need to be willing to step up and design a new type of pg object that would hold these long strings and be able to run hand-written C that does cool things to your data without killing your machine. 2 gigabytes is a lot, but it's not so big on a machine with 128G of RAM as it is on a machine with 4G. If both 2G+ objects can fit in memory and be compared or operate on each other in odd ways, that could prove useful. But postgresql doesn't really have anything built in to do that.

I'd think it would be cheaper to write a simple program that reads two text files and does the same thing. With kernel file caching it should load quickly after the first access, and on RAID arrays that read at 400 to 500M/sec it's only 4 seconds of load time on the first access. If there's some part of doing this that needs to be transactionally sane, then write a simple control program that uses the database to keep track of completed jobs, and do the rest outside the database in some other language if it's better suited to this.
On Feb 9, 2010, at 11:21 PM, Scott Marlowe wrote:
> I'm betting it's something like gene sequences or geological samples,
> or something other than straight text. But even those bear breaking
> down into some kind of simple normalization scheme don't they?

An entire human is a shade over 3 billion base pairs, with an information content of well under a gigabyte. The earth is about 4 billion years old, so if you were sampling every couple of years and you had the perfect core sample... maybe. I'm not sure that any form of full text search that includes stemming would be terribly useful for either.

Cheers,
Steve
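Steve's "well under a gigabyte" figure checks out with simple arithmetic (the numbers below are the usual rough figures, not measurements):

```python
# Rough figures only: ~3 billion base pairs, 4 possible bases (A/C/G/T),
# hence 2 bits of information per base before any compression.
base_pairs = 3_000_000_000
bits_per_base = 2                        # log2(4)
raw_bytes = base_pairs * bits_per_base // 8

assert raw_bytes == 750_000_000          # 750 MB -- "well under a gigabyte"

# Stored naively as one character per base (as FASTA files do), the same
# genome takes ~3 GB, which may be where "2 GB or more" strings come from.
fasta_bytes = base_pairs * 1
assert fasta_bytes == 3_000_000_000
```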
On Wed, Feb 10, 2010 at 10:11 AM, Steve Atkins <steve@blighty.com> wrote:
> A database isn't really the right way to do full text search for
> single files that big. Even if they'd fit in the database, they're way
> bigger than the underlying index types tsquery uses are designed for.
>
> If the answer is "yes, it's really that big and it's really text" then
> look at clucene or, better, hiring a specialist.

Some simple (and possibly wrong) solutions are presented below.

a) Store only the file name in the DB and leave the file on the file system, where you can access it after reading the file name from the DB to determine its location. This solution may offer the possibility of using many of the very fast and efficient file content search and manipulation algorithms and tools that are part of most OSes (for instance grep, sed, ...).

b) If you really need to store the file contents in the DB, you may write a small method that reads the file contents in batches of n bytes and writes each batch into a new table row. One thing to remember is to capture and store the batch number (which should be sequential) in another field of the respective row, and to indicate the last batch, perhaps by setting another field in the row to 'true'. Doing so lets you parallelize persisting the batches to the DB while still having a way to correctly put the sections of the file contents (the batches) back together when reconstructing the file. It also ensures that each transaction is only as large as a single batch.

Allan.
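Allan's option (b) can be sketched in a few lines of Python. The helper names below are illustrative, and the actual DB writes are left out; each tuple corresponds to one would-be row of (batch number, last-batch flag, data):

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per row; pick to taste

def split_into_batches(stream, chunk_size=CHUNK_SIZE):
    """Read the stream in fixed-size batches; return (n, is_last, data) tuples."""
    batches = []
    n = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        batches.append([n, False, data])
        n += 1
    if batches:
        batches[-1][1] = True  # mark the final batch, as suggested above
    return [tuple(b) for b in batches]

def reassemble(rows):
    """Rows may arrive in any order (e.g. written in parallel); sort by batch number."""
    ordered = sorted(rows, key=lambda r: r[0])
    assert ordered and ordered[-1][1], "last batch missing"
    return b"".join(data for _, _, data in ordered)

original = b"abc" * 1000
rows = split_into_batches(io.BytesIO(original), chunk_size=512)
assert reassemble(rows[::-1]) == original  # order-independent reassembly
```

The sequential batch number plus a last-batch marker is what makes out-of-order, parallel inserts safe to reconstruct, exactly as described above.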
On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> I'm betting it's something like gene sequences or geological samples,
> or something other than straight text. But even those bear breaking
> down into some kind of simple normalization scheme don't they?

A single genome is ~1.3GB as chars, or half that size if you use 4 bits per nucleotide (which should work for at least 90% of the use cases). The simplest design is to store a single reference and then, for everything else, store deltas from it. On average that should require about 3-5% of your reference sequence per comparative sample (not counting FKs and indexes).

As I mentioned on the list a couple of months ago, we are in the middle of stuffing a bunch of molecular data (including entire genomes) into Postgres. If anyone else is doing this I would welcome the opportunity to discuss the issues off list...

--
Peter Hunsberger
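Peter's 4-bits-per-nucleotide figure works because the 16 IUPAC nucleotide codes fit in a nibble, so two bases pack into one byte. A minimal sketch, assuming a fixed 16-symbol alphabet (the `pack`/`unpack` helpers are hypothetical, not from any library):

```python
# 16 symbols (A/C/G/T, IUPAC ambiguity codes, gap) -> 4 bits each.
CODE = {c: i for i, c in enumerate("ACGTRYSWKMBDHVN-")}
BASE = {i: c for c, i in CODE.items()}

def pack(seq: str) -> bytes:
    """Pack two bases per byte: high nibble first, low nibble second."""
    if len(seq) % 2:
        seq += "-"  # pad to an even length with the gap symbol
    return bytes((CODE[seq[i]] << 4) | CODE[seq[i + 1]]
                 for i in range(0, len(seq), 2))

def unpack(packed: bytes, length: int) -> str:
    """Invert pack(); length tells us whether to drop a padding nibble."""
    out = []
    for b in packed:
        out.append(BASE[b >> 4])
        out.append(BASE[b & 0x0F])
    return "".join(out[:length])

seq = "ACGTACGTN"
packed = pack(seq)
assert len(packed) == 5                  # 9 bases -> 5 bytes, half the chars
assert unpack(packed, len(seq)) == seq
```

At this density a ~3-billion-base genome drops from ~3 GB of chars to ~1.5 GB, matching the "half that size" estimate.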
> As I mentioned on the list a couple of months ago we are in the middle
> of stuffing a bunch of molecular data (including entire genomes) into
> Postgres. If anyone else is doing this I would welcome the
> opportunity to discuss the issues off list...
I do not stuff molecules, genomes, or genome data into PostgreSQL myself, but I sure would love to read a case study about it, or listen to a talk about this stuff at pgday.eu.
Best wishes,
Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality