Re: Using hashtext and unique constraints together

From: Mason Hale
Subject: Re: Using hashtext and unique constraints together
Date:
Msg-id: 8bca3aa10712111600w70fe53bfg202ae2d55e7818cd@mail.gmail.com
In reply to: Using hashtext and unique constraints together ("Mason Hale" <masonhale@gmail.com>)
List: pgsql-general


I recently discovered the hashtext() function, and I'm interested in using it to reduce the size of one of the biggest indexes in my database. 
...
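For reference, the setup I have in mind looks something like this (table and index names are illustrative). hashtext() returns a 4-byte integer, so an expression index on hashtext(url) is much smaller than a b-tree over the full url text:

```sql
-- Illustrative sketch: a small non-unique index on the hash of the url.
CREATE INDEX page_url_hash_idx ON page (hashtext(url));

-- Lookups must repeat the hash expression *and* compare the url itself,
-- since two different urls can share the same hash value:
SELECT *
  FROM page
 WHERE hashtext(url) = hashtext('http://www.postgresql.org')
   AND url = 'http://www.postgresql.org';
```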

The problem with either MD5 or hashtext() is that neither can guarantee unique output even if the inputs are all unique.
...

The problem I need help with is guaranteeing uniqueness of the URL on insert, with a non-unique index on hashtext(url) and *without* creating a unique index on page(url).

I'm thinking that an insert trigger that checks (SELECT count(*) FROM page WHERE hashtext(url) = hashtext('http://www.postgresql.org') AND url = 'http://www.postgresql.org') = 0 won't work under MVCC, as two transactions could insert the same URL simultaneously, each seeing a count of zero.
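To make the problem concrete, here is a sketch of the trigger I mean (names are illustrative). Under MVCC, neither concurrent transaction sees the other's uncommitted row, so both can pass this check and insert the same url:

```sql
-- Illustrative sketch only: this check is NOT race-free under MVCC.
CREATE OR REPLACE FUNCTION page_url_unique_check() RETURNS trigger AS $$
BEGIN
    IF EXISTS (SELECT 1
                 FROM page
                WHERE hashtext(url) = hashtext(NEW.url)  -- uses the hash index
                  AND url = NEW.url) THEN
        RAISE EXCEPTION 'url % already exists', NEW.url;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER page_url_unique_check
    BEFORE INSERT ON page
    FOR EACH ROW EXECUTE PROCEDURE page_url_unique_check();
```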

Can anyone think of a way to guarantee uniqueness of the URL without adding a unique constraint on the page(url) column?

I gather from the lack of response that the answer is "no"?

Sorry to respond to my own post... but I had an idea -- what if we hash the same URL in two different ways, and create a unique compound index on both hashes?

For example

CREATE UNIQUE INDEX ON page (hashtext(url), hashtext(md5(url)));

The idea here is that the odds of two different strings hashing to the same value with both hashtext *and* md5 are remote enough to rely on for uniqueness. I'm running the md5 output through hashtext to save space.

or even

CREATE UNIQUE INDEX ON page (hashtext(url), hashtext(url || 'SALT'));

This alternative approach assumes that two different strings that hash to the same value will not also collide once some additional (constant) data is concatenated onto them. Again, this doesn't absolutely guarantee uniqueness, but it seems like it should be safe enough. I guess it depends on how truly random the hashtext output is.
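As a rough sanity check (my own back-of-envelope figures, assuming the two 32-bit hashes behave as independent, uniform functions -- which hashtext and hashtext(md5(...)) only approximate), the birthday bound for n distinct URLs over a combined 64-bit hash space is:

    P(any collision) ≈ 1 - exp(-n(n-1) / 2^65) ≈ n^2 / 2^65

So for n = 10^8 URLs, P ≈ 10^16 / 3.7×10^19 ≈ 3×10^-4. Remote, but not zero, and a false collision here would wrongly reject a genuinely new URL rather than silently corrupt anything.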

Both of the above would require 8 bytes for each index entry (plus overhead), which is still a big savings over our current implementation.

Any thoughts on these ideas?

thanks in advance,
Mason
