Re: TEXT column > 1Gb

From Joe Carlson
Subject Re: TEXT column > 1Gb
Date
Msg-id A29ABFF1-8014-46B2-9425-34F59812581A@lbl.gov
In response to Re: TEXT column > 1Gb  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: TEXT column > 1Gb  (Benedict Holland <benedict.m.holland@gmail.com>)
Re: TEXT column > 1Gb  (Ron <ronljohnsonjr@gmail.com>)
List pgsql-general
I’ve certainly thought about using a different representation. A factor of 2x would be good, for a while anyway. For
nucleotide sequence, we’d need to consider a 10 character alphabet (A, C, G, T, N and the lower case forms when
representing ‘soft masked’ sequence*). So it would be 2 bases/byte. (Proteins are not nearly so long, so straight
storage is simpler.) But these would be bigger changes on the client side than storing in chunks, so I think chunking
is the way to go.
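
For anyone wondering what the 2 bases/byte option would look like, here is a minimal sketch of 4-bit packing for the 10 character alphabet. The code assignments, the use of 0 as padding, and the names `pack` and `unpack` are assumptions made for illustration, not anything we have implemented.

```python
# Hypothetical 4-bit codes for the 10-letter alphabet
# (upper case plus the soft-masked lower case forms).
ALPHABET = "ACGTNacgtn"
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is reserved as padding
DECODE = {code: ch for ch, code in CODE.items()}


def pack(seq: str) -> bytes:
    """Pack a sequence into bytes, two bases per byte (high nibble first)."""
    out = bytearray((len(seq) + 1) // 2)
    for i, ch in enumerate(seq):
        shift = 4 if i % 2 == 0 else 0
        out[i // 2] |= CODE[ch] << shift
    return bytes(out)


def unpack(data: bytes) -> str:
    """Reverse pack(); a zero nibble is padding and is skipped."""
    chars = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble:
                chars.append(DECODE[nibble])
    return "".join(chars)
```

An odd-length sequence leaves the low nibble of the last byte as padding, so the original length is recovered without storing it separately. The cost is that every client reading the column has to know this encoding, which is the client-side change mentioned above.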

We’re working with plant genomes, which, compared to human chromosomes, are HUGE. One chromosome of fava bean is over a
gig. And pine tree is another monster. This, together with the fact that sequence data collection and assembly have
improved so much in the past couple of years, has forced us to rethink a lot of our data storage assumptions.

* For those curious: especially in plants, much of the sequence consists of repetitive elements that are remnants of
ancient viruses, simple repeats and the like. People who want to identify particular functional components in a genome
typically do not want to search against this sequence, but restrict searching to coding regions. But the repetitive
sequence is still important and we need to keep it.

> On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarlson@lbl.gov> wrote:
>>
>> The use case is genomics. Extracting substrings is common. So going to chunked storage makes sense.
>
> Are you storing nucleotide sequences as text strings?  If using the simple 4-character (A,C,G,T) alphabet, you can
> store four bases per byte.  If using a nucleotide code 16-character alphabet you can still get two bases per byte.  An
> amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and so forth.  Such a representation might allow you
> to store sequences two or four times longer than the limit you currently hit, but then you are still at an impasse.
> Would a factor of 2x or 4x be enough for your needs?
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
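
The packing densities quoted above follow from fixed-width codes: an alphabet of n symbols needs ceil(log2(n)) bits per symbol. A small sketch of that arithmetic, with the hypothetical helper names `bits_per_symbol` and `packed_bytes` introduced only for illustration:

```python
def bits_per_symbol(alphabet_size: int) -> int:
    """Smallest fixed code width, in bits, that can distinguish every symbol."""
    return max(1, (alphabet_size - 1).bit_length())


def packed_bytes(n_symbols: int, alphabet_size: int) -> int:
    """Bytes needed to store n_symbols at that fixed width, rounded up."""
    total_bits = n_symbols * bits_per_symbol(alphabet_size)
    return (total_bits + 7) // 8


# 4-letter alphabet: 2 bits each, so four bases fit in one byte.
# 10- or 16-letter alphabet: 4 bits each, so two bases fit in one byte.
# 20-letter alphabet: 5 bits each, so eight symbols fit in five bytes.
```

Under this scheme the 10 character alphabet lands in the same 4-bit bucket as the 16 character one, which is why it comes out at 2 bases/byte.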



