Re: Postgresql as a dictionary coder backend?

From: Fredric Fredricson
Subject: Re: Postgresql as a dictionary coder backend?
Date:
Msg-id: 4D3CFF9C.7010108@bonetmail.com
In reply to: Postgresql as a dictionary coder backend?  (Attila Nagy <bra@fsn.hu>)
Responses: Re: Postgresql as a dictionary coder backend?
List: pgsql-general
On 01/23/2011 12:29 PM, Attila Nagy wrote:
>  Hello,
>
> I'm looking for a database backend for a dictionary coder project. It
> would have three major tasks:
> - take a text corpus, extract its words and substitute each word with
> a 64 bit integer (the word:integer mapping is always constant) and
> store the result (encoding)
> - take the previous result and substitute the integers with words
> (decoding)
> - the words should be reference counted, so if a word can no longer be
> found in any of the encoded messages, delete it (and optionally free
> its integer ID, but 64 bits is believed to be enough for a long time,
> although having smaller IDs results in smaller encoded files). This
> could be achieved by informing the database of the words of a deleted
> message, so it could decrement those refcounts and delete the records
> if needed.
Why do you need 64 bits? A language contains somewhere around 1,000,000
words, so 32 bits should be enough for more than 2000 languages. I read
somewhere that the Oxford Dictionary contains 150,000 words with a total
of 600,000 word forms.
Anyway, 64 bits seems like overkill to me, or am I missing something?
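
For what it's worth, the three-column table you describe further down
would presumably look something like this (just an untested sketch, and
I am making up the name "words"; serial would give you 32 bit IDs,
bigserial 64 bit):

-- Untested sketch of the dictionary table.
CREATE TABLE words (
    id       bigserial PRIMARY KEY,        -- auto incremented ID
    word     text      NOT NULL UNIQUE,    -- the unique index on word
    refcount bigint    NOT NULL DEFAULT 1  -- reference count
);

-- Decoding is then just a lookup of the stored integers, e.g.:
--   SELECT id, word FROM words WHERE id = ANY (ARRAY[42, 4711]);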
>
> I can easily do this with any RDBMS, with a table of three columns:
> auto incremented ID, word and refcount, with a unique index on word.
> The challenge could be:
> - that it should scale to several TBs of size and several (hundred)
> billion records. One scenario would be to store about 40 TBs of words
> with an average word length of about 50-60 bytes (that's about
> 800*10^9 records). It should work well both for inserting and
> searching (encoding and decoding) words.
50-60 bytes! Oh, so we are not talking about natural language here.
Sorry, I just assumed that.
Still, I think performance will be a big issue here. I have never tried
PostgreSQL on anything faster than a fast PC, but in my humble
experience a single-row insert takes at least one millisecond. At that
speed, 800*10^9 records would take 800*10^9 ms, or roughly 8*10^8
seconds, which is about 25 years.
I think you have to think bigger than a single server (ok, that was
stating the obvious).
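
That one millisecond is only a rule of thumb for individual inserts,
and batched inserts are faster, so if you want a real number for your
own hardware you could time something like this (untested sketch in
psql, against the "words" table I sketched above; md5 strings stand in
for your 50-60 byte words):

-- Untested: bulk-insert 100000 dummy 32-byte "words" and let psql
-- report how long it took.
\timing
INSERT INTO words (word)
SELECT md5(i::text) FROM generate_series(1, 100000) AS i;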
> - I need atomicity and durability, but having these at the word
> (record) level takes too many IOPS and is of no use, so it would be
> good to have an interface for inserting about 1000-500000 words in one
> call, assigning a unique ID to each unique word and storing them (if
> the word already has an ID, incrementing its refcount) and giving back
> the ID for each word. This transaction could be committed as one, so
> the transactions could be big, sparing IOPS.
> - I need concurrency, so when the above happens from two sources at
> the same time, the same word in the two transactions must get the
> same ID
Unless the ID is some kind of hash, you will have to serialize inserts
of new words.
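
One common way to get that right in PostgreSQL is a retry loop in a
PL/pgSQL function, along the lines of the untested sketch below (one
word per call, against the hypothetical "words" table from my earlier
sketch; your batch interface would call it for every unique word inside
one big transaction):

-- Untested sketch: return the ID for a word, bumping its refcount if it
-- already exists and inserting it otherwise. Two sessions hitting the
-- same new word at the same time end up with the same ID.
CREATE OR REPLACE FUNCTION upsert_word(w text) RETURNS bigint AS $$
DECLARE
    wid bigint;
BEGIN
    LOOP
        -- Common case: the word is already there, just bump the refcount.
        UPDATE words SET refcount = refcount + 1 WHERE word = w
            RETURNING id INTO wid;
        IF FOUND THEN
            RETURN wid;
        END IF;
        -- Not there yet: try to insert it. If another session inserts the
        -- same word first, the unique constraint raises unique_violation
        -- and we retry the UPDATE.
        BEGIN
            INSERT INTO words (word, refcount) VALUES (w, 1)
                RETURNING id INTO wid;
            RETURN wid;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- lost the race, loop around and UPDATE instead
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Note that this only solves the "same word must get the same ID" part; it
does nothing for the throughput problem above, since every word still
costs at least one index lookup.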
>
> Is PostgreSQL a good choice for doing this and, if so, what would be
> the optimal (for both time and space efficiency at encoding and
> decoding) use case?
I might be wrong, but this kind of project should probably not rely on a
"standard" RDBMS. The data structure itself does not seem too complex,
and the performance requirements are quite demanding.

But then, I'm no expert, so don't take my word for it.
/Fredric
>
> Thanks,
>
>

