Discussion: more than 2GB data string save
How can I save a text string of 2 GB or more in PostgreSQL?
Which data type should I use?
Any suggestions, please.
On Tue, Feb 9, 2010 at 9:38 PM, AI Rumman <rummandba@gmail.com> wrote:
> How to save 2 GB or more text string in Postgresql?
> Which data type should I use?

If you have to, you can use either the lo interface or bytea. Large Object (i.e. lo) allows access much like fopen / fseek etc. in C, but the actual data are not stored in a row with other data; they live alone in the lo space. Bytea is a legit type that you can have as one of many columns in a row, but you retrieve the whole thing at once when you fetch the row.

The preferred way to store 2GB of data is to put it into a file and put the name of the file into the database.
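To make the "file on disk, name in the database" option concrete, here is a minimal Python sketch of that pattern. It uses the stdlib sqlite3 module purely as a stand-in for PostgreSQL so the example is self-contained, and `store_document` / `load_document` are hypothetical helper names, not part of any library:

```python
import os
import sqlite3
import tempfile

# Stand-in schema; in PostgreSQL this would be something like
#   CREATE TABLE documents (id serial PRIMARY KEY, path text NOT NULL);
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")

def store_document(payload: bytes, directory: str) -> int:
    """Write the payload to a file on disk and record only its path in the DB."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".dat")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    cur = db.execute("INSERT INTO documents (path) VALUES (?)", (path,))
    db.commit()
    return cur.lastrowid

def load_document(doc_id: int) -> bytes:
    """Look up the path in the DB, then read the bytes from the filesystem."""
    (path,) = db.execute("SELECT path FROM documents WHERE id = ?", (doc_id,)).fetchone()
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    doc_id = store_document(b"x" * 1_000_000, d)  # pretend this is the 2 GB string
    assert load_document(doc_id) == b"x" * 1_000_000
```

The row stays small, so normal SQL access patterns are unaffected; the trade-off is that the file and the row are not covered by one transaction, so deletions and backups need extra care.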
On Feb 9, 2010, at 9:52 PM, Scott Marlowe wrote:
> If you have to you can use either the lo interface, or you can use
> bytea. Large Object (i.e. lo) allows for access much like fopen /
> fseek etc in C, but the actual data are not stored in a row with
> other data, but alone in the lo space. Bytea is a legit type that you
> can have as one of many in a row, but you retrieve the whole thing at
> once when you get the row.

Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2GB.

> Preferred way to store 2GB data is to put it into a file and put the
> name of the file into the database.

This.

Cheers,
Steve
On Tue, Feb 9, 2010 at 11:26 PM, Steve Atkins <steve@blighty.com> wrote:
> Bytea definitely won't handle more than 1 GB. I don't think the lo
> interface will handle more than 2GB.

That really depends on how compressible it is, doesn't it?
Thanks for your quick answers.

But if I use a file and then store its name in the database, is it possible to use text search tsvector and tsquery indexing on these external files?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
2010/2/10 Scott Marlowe <scott.marlowe@gmail.com>:
>> Bytea definitely won't handle more than 1 GB. I don't think the lo
>> interface will handle more than 2GB.
>
> That really depends on how compressible it is, doesn't it?

No. 1 GB is the maximum length of a varlena value; TOAST is the next possible step.

Regards
Pavel Stehule

p.s.

Processing very large SQL values - like bytea, or text longer than tens of megabytes - is very expensive in memory. When you process a 100MB bytea, you need about 300MB of RAM. Using a bytea over 100MB is not a good idea. The LO interface is better and much faster.
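The ~300 MB figure is plausible as back-of-envelope arithmetic. The sketch below assumes the server holds both the raw value and a text-encoded copy at once (bytea's hex output format uses two characters per byte); the exact multiplier varies by server version and driver, so treat this as an illustration, not a measurement:

```python
# Back-of-envelope illustration of the ~3x memory cost of shipping a large
# bytea value (assumption: hex output format, two characters per byte).
payload = bytes(range(256)) * 4          # 1 KiB of sample binary data

hex_encoded = payload.hex()              # analogous to bytea's hex wire format
assert len(hex_encoded) == 2 * len(payload)

# Scale the same ratio up to Pavel's example: a 100 MB bytea needs the raw
# value plus an encoded copy about twice its size in memory at once.
raw_mb = 100
estimated_peak_mb = raw_mb + 2 * raw_mb
assert estimated_peak_mb == 300          # matches the ~300 MB figure above
```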
On Tue, Feb 9, 2010 at 11:51 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Processing very large SQL values - like bytea, or text longer than
> tens of megabytes - is very expensive in memory. When you process a
> 100MB bytea, you need about 300MB of RAM. Using a bytea over 100MB is
> not a good idea. The LO interface is better and much faster.

Agreed. I wonder what kind of strings the OP is storing, and whether there isn't a more efficient way to do what he wants to do. But we're learning by pieces: first the size requirements, then the need for full text search. So, OP, what are you trying to do? (i.e. the big answer, not the small one.) It could be that there are basic inefficiencies in some part of your method that are more easily recognised when we know what you're trying to do.
On Feb 9, 2010, at 10:38 PM, AI Rumman wrote:
> Thanks for your quick answers.
>
> But if I use a file and then store the name in the database, is it possible to use TEXT search tsvector and tsquery indexing on these external files?

No, no way at all.

A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for.

Are you sure that the documents are that big? A single document of that size would be 400 times the size of the Bible. That's a ridiculously large amount of text, most of a small library.

If the answer is "yes, it's really that big and it's really text" then look at clucene or, better, hiring a specialist.

Cheers,
Steve
On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins <steve@blighty.com> wrote:
> A database isn't really the right way to do full text search for
> single files that big. Even if they'd fit in the database, they're way
> bigger than the underlying index types tsquery uses are designed for.
>
> If the answer is "yes, it's really that big and it's really text" then
> look at clucene or, better, hiring a specialist.

I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

But if that's what they are, then I'd think you'd need to be willing to step up and design a new type of pg object that would hold these long strings and be able to run hand-written C that does cool things to your data without killing your machine. 2 gigabytes is a lot, but it's not so big on a machine with 128G of RAM as it is on a machine with 4G. If both 2G+ objects can fit in memory and be compared or operate on each other in odd ways, that could prove useful. But postgresql doesn't really have anything built in to do that.

I'd think it would be cheaper to write a simple program that reads two text files and does the same thing. With kernel file caching it should load quickly after the first access, and on RAID arrays that read at 400 to 500M/sec it's only 4 seconds of load time on the first access. If there's some part of doing this that needs to be transactionally sane, then write a simple control program that uses the database to keep track of completed jobs, and do the rest outside the database in some other language if it's better suited to this.
On Feb 9, 2010, at 11:21 PM, Scott Marlowe wrote:
> I'm betting it's something like gene sequences or geological samples,
> or something other than straight text. But even those bear breaking
> down into some kind of simple normalization scheme don't they?

An entire human is a shade over 3 billion base pairs, with an information content of well under a gigabyte. The earth is about 4 billion years old, so if you were sampling every couple of years and you had the perfect core sample... maybe. I'm not sure that any form of full text search that includes stemming would be terribly useful for either.

Cheers,
Steve
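Steve's "well under a gigabyte" figure checks out with simple arithmetic (the numbers below are the usual rough figures, not measurements):

```python
# Rough figures only: ~3 billion base pairs, 4 possible bases (A/C/G/T),
# hence 2 bits of information per base before any compression.
base_pairs = 3_000_000_000
bits_per_base = 2                        # log2(4)
raw_bytes = base_pairs * bits_per_base // 8

assert raw_bytes == 750_000_000          # 750 MB -- "well under a gigabyte"

# Stored naively as one character per base (as FASTA files do), the same
# genome takes ~3 GB, which may be where "2 GB or more" strings come from.
fasta_bytes = base_pairs * 1
assert fasta_bytes == 3_000_000_000
```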
On Wed, Feb 10, 2010 at 10:11 AM, Steve Atkins <steve@blighty.com> wrote:
> A database isn't really the right way to do full text search for
> single files that big. Even if they'd fit in the database, they're way
> bigger than the underlying index types tsquery uses are designed for.
>
> If the answer is "yes, it's really that big and it's really text" then
> look at clucene or, better, hiring a specialist.

Some simple (and possibly wrong) solutions are presented below.

a) Store only the file name in the DB and leave the file on the file system, where you can access it after reading the file name from the DB to determine its location. This solution may offer the possibility of using many of the very fast and efficient file content search and manipulation algorithms and tools that are part of most OSes (for instance grep, sed, ...).

b) If you really need to store the file contents in the DB, you may write a small method that reads the file contents in batches of n bytes and writes each batch into a new table row. One thing to remember is to capture and store the batch number (which should be sequential) in another field of the respective row, and to indicate the last batch, perhaps by setting another field in the row to 'true'. Doing so lets you parallelize persisting the batches to the DB while still having a way to correctly put the sections of the file contents (the batches) back together when reconstructing the file. It also ensures that each transaction is only as large as a single batch.

Allan.
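Allan's option (b) can be sketched in a few lines of Python. The helper names below are illustrative, and the actual DB writes are left out; each tuple corresponds to one would-be row of (batch number, last-batch flag, data):

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per row; pick to taste

def split_into_batches(stream, chunk_size=CHUNK_SIZE):
    """Read the stream in fixed-size batches; return (n, is_last, data) tuples."""
    batches = []
    n = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        batches.append([n, False, data])
        n += 1
    if batches:
        batches[-1][1] = True  # mark the final batch, as suggested above
    return [tuple(b) for b in batches]

def reassemble(rows):
    """Rows may arrive in any order (e.g. written in parallel); sort by batch number."""
    ordered = sorted(rows, key=lambda r: r[0])
    assert ordered and ordered[-1][1], "last batch missing"
    return b"".join(data for _, _, data in ordered)

original = b"abc" * 1000
rows = split_into_batches(io.BytesIO(original), chunk_size=512)
assert reassemble(rows[::-1]) == original  # order-independent reassembly
```

The sequential batch number plus a last-batch marker is what makes out-of-order, parallel inserts safe to reconstruct, exactly as described above.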
On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> I'm betting it's something like gene sequences or geological samples,
> or something other than straight text. But even those bear breaking
> down into some kind of simple normalization scheme don't they?

A single genome is ~1.3GB as chars, or half that size if you use 4 bits per nucleotide (which should work for at least 90% of the use cases). The simplest design is to store a single reference and then, for everything else, store deltas from it. On average that should require about 3-5% of your reference sequence per comparative sample (not counting FKs and indexes).

As I mentioned on the list a couple of months ago, we are in the middle of stuffing a bunch of molecular data (including entire genomes) into Postgres. If anyone else is doing this I would welcome the opportunity to discuss the issues off list...

--
Peter Hunsberger
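Peter's 4-bits-per-nucleotide figure works because the 16 IUPAC nucleotide codes fit in a nibble, so two bases pack into one byte. A minimal sketch, assuming a fixed 16-symbol alphabet (the `pack`/`unpack` helpers are hypothetical, not from any library):

```python
# 16 symbols (A/C/G/T, IUPAC ambiguity codes, gap) -> 4 bits each.
CODE = {c: i for i, c in enumerate("ACGTRYSWKMBDHVN-")}
BASE = {i: c for c, i in CODE.items()}

def pack(seq: str) -> bytes:
    """Pack two bases per byte: high nibble first, low nibble second."""
    if len(seq) % 2:
        seq += "-"  # pad to an even length with the gap symbol
    return bytes((CODE[seq[i]] << 4) | CODE[seq[i + 1]]
                 for i in range(0, len(seq), 2))

def unpack(packed: bytes, length: int) -> str:
    """Invert pack(); length tells us whether to drop a padding nibble."""
    out = []
    for b in packed:
        out.append(BASE[b >> 4])
        out.append(BASE[b & 0x0F])
    return "".join(out[:length])

seq = "ACGTACGTN"
packed = pack(seq)
assert len(packed) == 5                  # 9 bases -> 5 bytes, half the chars
assert unpack(packed, len(seq)) == seq
```

At this density a ~3-billion-base genome drops from ~3 GB of chars to ~1.5 GB, matching the "half that size" estimate.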
> As I mentioned on the list a couple of months ago we are in the middle
> of stuffing a bunch of molecular data (including entire genomes) into
> Postgres. If anyone else is doing this I would welcome the
> opportunity to discuss the issues off list...
I do not stuff molecules, genomes, or genome data into PostgreSQL myself, but I sure would love to read a case study about it, or listen to a talk about this stuff at pgday.eu.
Best wishes,
Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality