Thread: Hash Function: MD5 or other?


Hash Function: MD5 or other?

From
Peter Fein
Date:
Hi-

I wanted to use a partially unique index (dependent on a flag) on a TEXT
column, but the index row size was too big for btrees.  See the thread
"index row size 2728 exceeds btree maximum, 2713" from the beginning of
this month for someone with a similar problem.  In it, someone suggests
indexing on a hash of the text.  I'm fine with this, as the texts in
question are similar enough to each other to make collisions unlikely
and a collision won't really cause any serious problems.

My question is: is the built-in MD5 appropriate for this use, or should I
be using a function from pl/something?  Figures on collision rates would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!

--
Peter Fein                 pfein@pobox.com                 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

Re: Hash Function: MD5 or other?

From
Shelby Cain
Date:
--- Peter Fein <pfein@pobox.com> wrote:

> Hi-
>
> I wanted to use a partially unique index (dependent on a flag) on a
> TEXT
> column, but the index row size was too big for btrees.  See the
> thread
> "index row size 2728 exceeds btree maximum, 2713" from the beginning
> of
> this month for someone with a similar problem.  In it, someone
> suggests
> indexing on a hash of the text.  I'm fine with this, as the texts in
> question are similar enough to each other to make collisions unlikely
> and a collision won't really cause any serious problems.
>
> My question is: is the builtin MD5 appropriate for this use or should
> I
> be using a function from pl/something?  Figures on collision rates
> would
> be nice as well - the typical chunk of text is probably 1k-8k.
>
> Thanks!
>
> --

I believe the odds of two arbitrary inputs yielding the same MD5 hash
would be 1 in 2^128.  Even though the odds of collision are small,
you'll want to write your query so that it uses the index on the hash
and then filters on the text field to guarantee you get the result you
are interested in.
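A sketch of that pattern (the table and column names here are assumptions
for illustration, not from the original post):

```sql
-- Index the hash rather than the full text:
CREATE INDEX idx_sometext_md5 ON mytable (md5(sometext));

-- Look up via the hash (which can use the index), then re-check the
-- full text so a hash collision can never return the wrong row:
SELECT *
  FROM mytable
 WHERE md5(sometext) = md5('some text value')
   AND sometext = 'some text value';
```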

Regards,

Shelby Cain


Re: Hash Function: MD5 or other?

From
Greg Stark
Date:
Shelby Cain <alyandon@yahoo.com> writes:

> > My question is: is the builtin MD5 appropriate for this use or should I be
> > using a function from pl/something? Figures on collision rates would be
> > nice as well - the typical chunk of text is probably 1k-8k.

Note that MD5 is slow and CPU-intensive. By design.

If you want a quick way to find matching records then you might find something
like CRC to be more useful. With MD5 it's supposed to be hard for someone to
come up with inputs that hash to a target value, but if you're not too worried
about people trying to do that then MD5 is probably overkill.


--
greg

Re: Hash Function: MD5 or other?

From
Tino Wildenhain
Date:
On Tuesday, 14.06.2005 at 02:40 -0400, Greg Stark wrote:
> Shelby Cain <alyandon@yahoo.com> writes:
>
> > > My question is: is the builtin MD5 appropriate for this use or should I be
> > > using a function from pl/something? Figures on collision rates would be
> > > nice as well - the typical chunk of text is probably 1k-8k.
>
> Note that MD5 is slow and CPU-intensive. By design.
>
> If you want a quick way to find matching records then you might find something
> like CRC to be more useful. With MD5 it's supposed to be hard for someone to
> come up with inputs that hash to a target value, but if you're not too worried
> about people trying to do that then MD5 is probably overkill.
>
SHA-1 is said to be faster
(and less prone to collisions).
--
Tino Wildenhain <tino@wildenhain.de>


Re: Hash Function: MD5 or other?

From
Alex Stapleton
Date:
On 13 Jun 2005, at 23:49, Peter Fein wrote:


> Hi-
>
> I wanted to use a partially unique index (dependent on a flag) on a
> TEXT
> column, but the index row size was too big for btrees.  See the thread
> "index row size 2728 exceeds btree maximum, 2713" from the
> beginning of
> this month for someone with a similar problem.  In it, someone
> suggests
> indexing on a hash of the text.  I'm fine with this, as the texts in
> question are similar enough to each other to make collisions unlikely
> and a collision won't really cause any serious problems.
>
> My question is: is the builtin MD5 appropriate for this use or
> should I
> be using a function from pl/something?  Figures on collision rates
> would
> be nice as well - the typical chunk of text is probably 1k-8k.
>
> Thanks!
>

As others have said, MD5 isn't the fastest one out there, but no
cryptographically secure hash really is. You could probably get away
with using a CRC hash that is long enough to reduce your chances of
collision a lot. However, PostgreSQL doesn't have a built-in CRC
function, which is a bit of a pain unless you're prepared to implement
one, or use pl/* to do it, which sounds like overkill. I suggest you
run some benchmarks on MD5 and see if it's fast enough to meet your
current (and perhaps future) needs.
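A crude way to run such a benchmark from psql (the table and column names
here are hypothetical) might be:

```sql
\timing on
-- Force MD5 to be computed over every row and see how long it takes:
SELECT count(md5(sometext)) FROM mytable;
```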

You could, of course, just use a hash index on your text field! I
think that would probably cope with larger data sets OK, and it has
the advantage of handling collisions for you as well :) However, it
means you have to transfer large amounts of data around, so if network
speed ever becomes a limitation, MD5 hashing (or enabling compression
on your PgSQL connection) may help.


> --
> Peter Fein                 pfein@pobox.com
> 773-575-0694
>
> Basically, if you're not a utopianist, you're a schmuck. -J. Feldman
>



Re: Hash Function: MD5 or other?

From
Peter Fein
Date:
Alex Stapleton wrote:
>
> On 13 Jun 2005, at 23:49, Peter Fein wrote:
>
>
>> Hi-
>>
>> I wanted to use a partially unique index (dependent on a flag) on a  TEXT
>> column, but the index row size was too big for btrees.  See the thread
>> "index row size 2728 exceeds btree maximum, 2713" from the  beginning of
>> this month for someone with a similar problem.  In it, someone  suggests
>> indexing on a hash of the text.  I'm fine with this, as the texts in
>> question are similar enough to each other to make collisions unlikely
>> and a collision won't really cause any serious problems.
>>
>> My question is: is the builtin MD5 appropriate for this use or  should I
>> be using a function from pl/something?  Figures on collision rates  would
>> be nice as well - the typical chunk of text is probably 1k-8k.
>>
>> Thanks!
>>
>
> As others have said MD5 isn't the fastest one out there. However no
> cryptographically  secure hashes are really that fast. However you  can
> probably get away with using a CRC hash which is long enough to  reduce
> your chances of collision a lot. However, PostgreSQL doesn't  have a
> built in CRC function, which is a bit of a pain unless your  prepared to
> implement one, or use pl/* to do it, which sounds like  overkill. I
> suggest you run some benchmarks on MD5 and see if it's  fast enough to
> meet your current (and perhaps future) needs.
>
> You could of course, just use a hash index on your text field! I  think
> that would probably cope with larger data sets OK. It has the  advantage
> of handling collisions for you as well :) However it means  you have to
> transfer large amounts of data around, so if network  speed ever becomes
> a limitation, MD5 hashing (or enabling compression  on your PgSQL
> connection) may help.

Let me clarify my problem:

I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with identical
sometext. Finding the appropriate group when inserting is very expensive
and is calculated on the client side. I want to mark *one* of each of
the exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:

(group_id, sometext) WHERE group_representative=TRUE.
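As a sketch, the direct form of that index - the one that runs into the
btree row-size limit on long texts - would be (assuming the table is
called mytable, as in the queries below):

```sql
-- Partial unique index: only one representative row per (group, text)
CREATE UNIQUE INDEX idx_group_representative ON mytable (group_id, sometext)
  WHERE group_representative = TRUE;
```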

The motivation is then to be able to do:

SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE

instead of:

SELECT DISTINCT ON (group_id, sometext) * FROM mytable
WHERE <some other conditions>

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that running
a DISTINCT in this case would be more expensive than updating the index
when inserting.

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as btrees? I
thought hash indexes were slow...

Hope that's clearer...

--Pete

--
Peter Fein                 pfein@pobox.com                 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

Re: Hash Function: MD5 or other?

From
Alex Stapleton
Date:
On 14 Jun 2005, at 14:33, Peter Fein wrote:

> Alex Stapleton wrote:
>
>>
>> On 13 Jun 2005, at 23:49, Peter Fein wrote:
>>
>>
>>
>>> Hi-
>>>
>>> I wanted to use a partially unique index (dependent on a flag) on
>>> a  TEXT
>>> column, but the index row size was too big for btrees.  See the
>>> thread
>>> "index row size 2728 exceeds btree maximum, 2713" from the
>>> beginning of
>>> this month for someone with a similar problem.  In it, someone
>>> suggests
>>> indexing on a hash of the text.  I'm fine with this, as the texts in
>>> question are similar enough to each other to make collisions
>>> unlikely
>>> and a collision won't really cause any serious problems.
>>>
>>> My question is: is the builtin MD5 appropriate for this use or
>>> should I
>>> be using a function from pl/something?  Figures on collision
>>> rates  would
>>> be nice as well - the typical chunk of text is probably 1k-8k.
>>>
>>> Thanks!
>>>
>>>
>>
>> As others have said MD5 isn't the fastest one out there. However no
>> cryptographically  secure hashes are really that fast. However
>> you  can
>> probably get away with using a CRC hash which is long enough to
>> reduce
>> your chances of collision a lot. However, PostgreSQL doesn't  have a
>> built in CRC function, which is a bit of a pain unless your
>> prepared to
>> implement one, or use pl/* to do it, which sounds like  overkill. I
>> suggest you run some benchmarks on MD5 and see if it's  fast
>> enough to
>> meet your current (and perhaps future) needs.
>>
>> You could of course, just use a hash index on your text field! I
>> think
>> that would probably cope with larger data sets OK. It has the
>> advantage
>> of handling collisions for you as well :) However it means  you
>> have to
>> transfer large amounts of data around, so if network  speed ever
>> becomes
>> a limitation, MD5 hashing (or enabling compression  on your PgSQL
>> connection) may help.
>>
>
> Let me clarify my problem:
>
> I've got a bunch of texts, stored in column sometext. These texts are
> assigned to groups, group_id. Each group may contain rows with
> identical
> sometext. Finding the appropriate group when inserting is very
> expensive
> and is calculated on the client side. I want to mark *one* of each of
> the exact duplicates in a group with a flag, group_representative.
> Basically, a partial index on:
>
> (group_id, sometext) WHERE group_representative=TRUE.
>
> The motivation is then to be able to do:
>
> SELECT * FROM mytable WHERE <some other conditions> AND
> group_representative=TRUE
>
> instead of:
>
> SELECT * FROM mytable WHERE <some other conditions> DISTINCT ON
> (group_id, sometext)
>
> since there are more of these queries than inserts (and they return a
> lot more rows than belong to a single group_id). I suspect that
> running
> a DISTINCT in this case would be more expensive than updating the
> index
> when inserting.
>
> Knowing the specifics of the data I'm putting in sometext, a halfway
> decent hash function would make collisions so rare as to make the
> chance
> insignificant (and collisions wouldn't break anything anyway). Is this
> approach reasonable, or should I use a hash index on (group_id,
> sometext) - does this suffer from the same size limitation as
> btrees? I
> thought hash indexes were slow...
>
> Hope that's clearer...

From the manual:

Testing has shown PostgreSQL's hash indexes to perform no better than
B-tree indexes, and the index size and build time for hash indexes is
much worse. For these reasons, hash index use is presently discouraged.

So yes, they are slower, but not by an enormous amount. (The lack of
any data to back up the manual's conclusions worries me, but never
mind.) It is certainly worth trying.

The only built-in hashing function PG actually exposes is MD5. You
might be able to get better performance from that if data transmission
takes a long time, as you will only need to send the hash to the
database. You could then build a B-tree index on the MD5 hashes, which
would certainly do what you want, as the likelihood of a collision is
low. So basically it all boils down to just how slow MD5 is, and just
how fast your network connection to your DB server(s) is.

> --Pete
>
> --
> Peter Fein                 pfein@pobox.com
> 773-575-0694
>
> Basically, if you're not a utopianist, you're a schmuck. -J. Feldman
>


Re: Hash Function: MD5 or other?

From
Shelby Cain
Date:
--- Greg Stark <gsstark@mit.edu> wrote:
>
> Note that MD5 is slow and CPU-intensive. By design.
>
> If you want a quick way to find matching records then you might find
> something
> like CRC to be more useful. With MD5 it's supposed to be hard for
> someone to
> come up with inputs that hash to a target value, but if you're not
> too worried
> about people trying to do that then MD5 is probably overkill.
>

$ ./hash -b

CRC32:      302.78 MB/sec
HAVAL 128:  165.33 MB/sec
HAVAL 160:  178.69 MB/sec
HAVAL 192:  124.74 MB/sec
HAVAL 224:  123.05 MB/sec
HAVAL 256:  98.14 MB/sec
MD2:        9.03 MB/sec
MD4:        233.36 MB/sec
MD5:        105.39 MB/sec
Panama:     311.21 MB/sec
RIPEMD-128: 129.88 MB/sec
RIPEMD-160: 76.75 MB/sec
SHA1:       135.40 MB/sec
SHA256:     49.42 MB/sec
SHA384:     32.77 MB/sec
SHA512:     31.58 MB/sec
Tiger:      54.02 MB/sec
Whirlpool:  17.51 MB/sec

Elapsed time:       3.56 seconds
Average throughput: 121.06 MB/s

Granted, MD5 isn't the quickest hashing algorithm out there but it is
certainly fast enough for general use IMO.

Regards,

Shelby Cain






Re: Hash Function: MD5 or other?

From
Bruno Wolff III
Date:
On Tue, Jun 14, 2005 at 08:33:34 -0500,
  Peter Fein <pfein@pobox.com> wrote:
>
> Knowing the specifics of the data I'm putting in sometext, a halfway
> decent hash function would make collisions so rare as to make the chance
> insignificant (and collisions wouldn't break anything anyway). Is this
> approach reasonable, or should I use a hash index on (group_id,
> sometext) - does this suffer from the same size limitation as btrees? I
> thought hash indexes were slow...

The hash value should be saved as a separate column. Then it sounds
like you want a partial btree index of (group_id, hash) where the
flag is set.
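A sketch of that suggestion - a stored hash column kept in sync by a
trigger, plus the partial index (names are illustrative, and this assumes
pl/pgsql is installed):

```sql
ALTER TABLE mytable ADD COLUMN sometext_md5 text;

-- Keep the stored hash in sync with the text automatically:
CREATE FUNCTION set_sometext_md5() RETURNS trigger AS $$
BEGIN
  NEW.sometext_md5 := md5(NEW.sometext);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sometext_md5
  BEFORE INSERT OR UPDATE ON mytable
  FOR EACH ROW EXECUTE PROCEDURE set_sometext_md5();

-- Partial btree index on the (small, fixed-size) hash:
CREATE UNIQUE INDEX idx_group_hash ON mytable (group_id, sometext_md5)
  WHERE group_representative = TRUE;
```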

Re: Hash Function: MD5 or other?

From
Peter Fein
Date:
Bruno Wolff III wrote:
> On Tue, Jun 14, 2005 at 08:33:34 -0500,
>   Peter Fein <pfein@pobox.com> wrote:
>
>>Knowing the specifics of the data I'm putting in sometext, a halfway
>>decent hash function would make collisions so rare as to make the chance
>>insignificant (and collisions wouldn't break anything anyway). Is this
>>approach reasonable, or should I use a hash index on (group_id,
>>sometext) - does this suffer from the same size limitation as btrees? I
>>thought hash indexes were slow...
>
>
> The hash value should be saved as a separate column. Then it sounds
> like you want a partial btree index of (group_id, hash) where the
> flag is set.

I'm not clear on why I'd need to store the hash in a separate column. I
suppose I could have a hash column populated by a trigger on inserts,
but that seems to give me the same functionality and is less obvious.

For the archives, I did:

CREATE UNIQUE INDEX idx_md5_sometext  ON mytable USING btree
  (group_id, md5(sometext))
  WHERE group_representative = true;

I then basically replicate this in a SELECT on the client side
(including calculating the MD5 by the client) to figure out the correct
value for group_representative before inserting a new row. This is the
only way I use the MD5, so I don't much care about retrieving it in
other contexts.
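For completeness, the pre-insert check described above might look
something like this (the literal values are placeholders, not from the
thread):

```sql
-- Does this group already have a representative row with identical text?
SELECT 1
  FROM mytable
 WHERE group_id = 42
   AND md5(sometext) = md5('...the new text...')
   AND group_representative = TRUE;

-- If no row comes back, insert the new row with
-- group_representative = TRUE; otherwise insert it with FALSE.
```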

--
Peter Fein                 pfein@pobox.com                 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

Re: Hash Function: MD5 or other?

From
Bruno Wolff III
Date:
On Tue, Jun 14, 2005 at 15:54:50 -0500,
  Peter Fein <pfein@pobox.com> wrote:
>
> I'm unclear why I'd need to store the hash in a column. I suppose I
> could have the hash column populated by a trigger on inserts, but this
> seems to get me the same functionality & is less obvious.
>
> For the archives, I did:
>
> CREATE UNIQUE INDEX idx_md5_sometext  ON mytable USING btree
>   (group_id, md5(sometext))
>   WHERE group_representative = true;
>
> I then basically replicate this in a SELECT on the client side
> (including calculating the MD5 by the client) to figure out the correct
> value for group_representative before inserting a new row. This is the
> only way I use the MD5, so I don't much care about retrieving it in
> other contexts.

That should work fine.

I wasn't sure whether you were going to want to use the hash for joins.
And because you used the phrase "hash index", I was a little concerned
that you might be considering a hash (as opposed to btree) index.