Re: Postgresql as a dictionary coder backend?

From: Ben Chobot
Subject: Re: Postgresql as a dictionary coder backend?
Date:
Msg-id: CA1DD4AA-29C4-41C8-B71A-DDC1717513D4@silentmedia.com
In response to: Postgresql as a dictionary coder backend?  (Attila Nagy <bra@fsn.hu>)
Responses: Re: Postgresql as a dictionary coder backend?
List: pgsql-general
On Jan 23, 2011, at 3:29 AM, Attila Nagy wrote:

> Hello,
>
> I'm looking for a database backend for a dictionary coder project. It would have three major tasks:
> - take a text corpus, get its words, and substitute each word with a 64-bit integer (the word:integer mapping is always constant), then store the result (encoding)
> - take the previous result and substitute the integers with words (decoding)
> - the words should be reference counted, so if a word can no longer be found in any of the encoded messages, delete it (and optionally free its integer ID; 64 bits is believed to be enough for a long time, although smaller IDs result in smaller encoded files). This could be achieved by informing the database of the words of a deleted message, so it could decrement those refcounts and delete the records if needed.
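
(For concreteness, the refcount bookkeeping described in the last point might look roughly like this, against the three-column table described further down; the ids here are purely illustrative:)

BEGIN;

-- When a message is deleted, decrement the refcounts of its words...
UPDATE words
SET    refcount = refcount - 1
WHERE  id = ANY (ARRAY[17, 42, 99]::bigint[]);  -- word ids of the deleted message

-- ...and drop any word that is no longer referenced anywhere.
DELETE FROM words WHERE refcount <= 0;

COMMIT;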

Just curious, is this your end goal, or would some existing text search engine (tsearch2, Lucene, etc.) fit the bill?


> I can easily do this with any RDBMS, with a table of three columns: an auto-incremented ID, word, and refcount, with a unique index on word.
> The challenge could be:
> - that it should scale to several TBs of size and several (hundred) billion records. One scenario would be to store about 40 TBs of words, with an average word length of about 50-60 bytes (that's about 800*10^9 records). It should work well both for inserting and searching (encoding and decoding) words.

Yes, Postgres can do this, but you're probably going to want to look into partitioning your tables. For instance, instead of a single word table, you can make word tables that store 100M words each - or if 100M ids per table doesn't sound right, pick some other number. You can still access the table as if it's a single table, and under the covers PG will use constraint exclusion to know which table partition to use.
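
For instance, a rough sketch with inheritance-based partitioning (table and partition names illustrative; you'd create one child per id range):

CREATE TABLE words (
    id       bigserial,
    word     text   NOT NULL,
    refcount bigint NOT NULL DEFAULT 1
);

-- One child per 100M ids. CHECK constraints and indexes are not
-- inherited, so each child needs its own.
CREATE TABLE words_0 (
    CHECK (id >= 0 AND id < 100000000)
) INHERITS (words);
CREATE UNIQUE INDEX words_0_id_key   ON words_0 (id);
CREATE UNIQUE INDEX words_0_word_key ON words_0 (word);
-- (Caveat: a per-child unique index enforces uniqueness only within
-- that child, not across the whole set.)

-- With constraint exclusion on, an id-filtered query against the
-- parent only scans the matching child.
SET constraint_exclusion = partition;
SELECT word FROM words WHERE id = 12345;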

> - I need atomicity and durability, but having these at the word (record) level takes too many IOPS and has no use, so it would be good to have an interface for inserting about 1000-500000 words in one call, assigning a unique ID to each unique word and storing them (if the word already has an ID, increment its refcount), and giving back the IDs for each word. This transaction could be committed as one, so the transactions could be big, sparing IOPS.

I must be missing something, because it's unclear why standard transactions don't give you this. Except rolling back IDs in the case of an abort, of course. Why do you need that?
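
For example, something along these lines pays the commit cost once per batch instead of once per word (a sketch only, assuming an unpartitioned words table with a serial id; unnest() of an array stands in for however the word list actually arrives):

BEGIN;

-- Bump refcounts of words we already have, counting duplicates
-- within the batch.
UPDATE words w
SET    refcount = w.refcount + b.n
FROM  (SELECT word, count(*) AS n
       FROM unnest(ARRAY['foo','bar','foo']::text[]) AS t(word)
       GROUP BY word) b
WHERE  w.word = b.word;

-- Insert the genuinely new words.
INSERT INTO words (word, refcount)
SELECT b.word, b.n
FROM  (SELECT word, count(*) AS n
       FROM unnest(ARRAY['foo','bar','foo']::text[]) AS t(word)
       GROUP BY word) b
WHERE NOT EXISTS (SELECT 1 FROM words WHERE words.word = b.word);

-- Hand the ids back to the encoder.
SELECT id, word FROM words WHERE word = ANY (ARRAY['foo','bar','foo']::text[]);

COMMIT;

(Two such transactions racing on the same new word will still conflict, which is exactly the concurrency point below.)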

> - I need concurrency, so when the above happens from two sources at the same time, the same word in the two transactions must get the same ID

You can't have this and the previous constraint, at least not with PG. I don't know of any system that will give you durable words but still detect when the same word is being inserted in parallel transactions, AND save you the iops of word-by-word writes.

Unless, of course, you're ok with one transaction failing due to trying to add a word that wasn't there at the start? If you did something with a unique constraint on the word, you might be able to achieve this.
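
The usual shape of that is a retry loop around the unique violation, e.g. (illustrative only; assumes a unique index on word and a serial id):

CREATE OR REPLACE FUNCTION intern_word(w text) RETURNS bigint AS $$
DECLARE
    wid bigint;
BEGIN
    LOOP
        -- Common case: the word exists, so bump it and return its id.
        UPDATE words SET refcount = refcount + 1
        WHERE  word = w
        RETURNING id INTO wid;
        IF FOUND THEN
            RETURN wid;
        END IF;
        -- Not there yet: try to insert. If a parallel transaction
        -- beat us to it, we get unique_violation and loop back to
        -- the UPDATE, so both callers end up with the same id.
        BEGIN
            INSERT INTO words (word) VALUES (w) RETURNING id INTO wid;
            RETURN wid;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- lost the race; retry
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Calling this once per word inside one big batch transaction keeps the per-word fsync cost amortized, though each BEGIN/EXCEPTION block does add subtransaction overhead.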


