Re: more than 2GB data string save

From: Scott Marlowe
Subject: Re: more than 2GB data string save
Date:
Msg-id: dcc563d11002092321i374afc2bub6c18734d3f1ad71@mail.gmail.com
In reply to: Re: more than 2GB data string save  (Steve Atkins <steve@blighty.com>)
Responses: Re: more than 2GB data string save  (Steve Atkins <steve@blighty.com>)
Re: more than 2GB data string save  (Peter Hunsberger <peter.hunsberger@gmail.com>)
List: pgsql-general
On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins <steve@blighty.com> wrote:
> A database isn't really the right way to do full text search for single
> files that big. Even if they'd fit in the database it's way bigger than
> the underlying index types tsquery uses are designed for.
>
> Are you sure that the documents are that big? A single document of that
> size would be 400 times the size of the bible. That's a ridiculously
> large amount of text, most of a small library.
>
> If the answer is "yes, it's really that big and it's really text" then
> look at clucene or, better, hiring a specialist.

I'm betting it's something like gene sequences or geological samples,
or something other than straight text.  But even those bear breaking
down into some kind of simple normalization scheme, don't they?

But if that's what they are, then I'd think you'd need to be willing
to step up and design a new kind of pg object that would hold these
long strings and let you run hand-written C that does cool things to
your data without killing your machine.

2 Gigabytes is a lot.  But it's not as big on a machine with 128G of
ram as it is on a machine with 4G.  If both 2G+ objects can fit in
memory and be compared or operated on together in odd ways, that
could prove useful.

But postgresql doesn't really have anything built in to do that.  I'd
think it would be cheaper to write a simple program that reads two text
files and does the same thing.  With kernel file caching it should
load quickly after the first access.  And on RAID arrays that read at
400 to 500M/sec it's only 4 or 5 seconds load time on that first access.
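A minimal sketch of what I mean, assuming two plain text files on disk
(file names come from the command line); mmap() hands the whole thing
to the kernel page cache, so runs after the first are served from RAM:

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only; return its address and length. */
static char *map_file(const char *path, off_t *len)
{
    struct stat st;
    char *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *len = st.st_size;
    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

int main(int argc, char **argv)
{
    off_t la, lb, i, same = 0;
    char *a, *b;

    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 1;
    }
    a = map_file(argv[1], &la);
    b = map_file(argv[2], &lb);
    if (a == NULL || b == NULL) {
        perror("map_file");
        return 1;
    }
    /* Stand-in for whatever odd operation you actually need:
     * count the byte positions where the two files agree. */
    for (i = 0; i < la && i < lb; i++)
        if (a[i] == b[i])
            same++;
    printf("matching bytes: %lld\n", (long long) same);
    return 0;
}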

If there's some part of doing this that needs to be transactionally
sane, then write a simple control program that uses the database to
keep track of completed jobs, and do the rest outside the database in
some other language that's better suited to it.
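Something like this libpq sketch is all the control program really
needs (the jobs table layout and connection string here are made up
for illustration):

/* Assumes a table like:
 *   CREATE TABLE jobs (id int PRIMARY KEY, filename text,
 *                      done boolean NOT NULL DEFAULT false);
 * Build with: cc worker.c -lpq
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn = PQconnectdb("dbname=bigstrings");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect: %s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "BEGIN"));

    /* Claim one pending job; FOR UPDATE keeps two workers from
     * grabbing the same row. */
    res = PQexec(conn,
                 "SELECT id, filename FROM jobs "
                 "WHERE NOT done LIMIT 1 FOR UPDATE");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1) {
        char sql[256];

        /* ... run the external comparison program on
         * PQgetvalue(res, 0, 1) here ... */

        snprintf(sql, sizeof(sql),
                 "UPDATE jobs SET done = true WHERE id = %s",
                 PQgetvalue(res, 0, 0));
        PQclear(res);
        res = PQexec(conn, sql);
    }
    PQclear(res);
    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return 0;
}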
