Обсуждение: Hash Indexes

Поиск
Список
Период
Сортировка

Hash Indexes

От
Naz Gassiep
Дата:
Hi there,
    I am creating functionality where there are blocks of text that are
being stored in the DB and that need to be searched for. No like or
pattern matching, just a plain old WHERE clause. Obviously, hash indexes
would be best here, however I've been warned away from PG's hash
implementation. Would it be better to manually create a column storing
the hashes (maintained with a CHECK (md5(textcol) = hashcol) type
constraint) and generate the hashes in the app?

- Naz.

Re: Hash Indexes

От
Martijn van Oosterhout
Дата:
On Sat, Jan 05, 2008 at 03:20:54AM +1100, Naz Gassiep wrote:
> Hi there,
>    I am creating functionality where there are blocks of text that are
> being stored in the DB and that need to be searched for. No like or
> pattern matching, just a plain old WHERE clause. Obviously, hash indexes
> would be best here, however I've been warned away from PG's hash
> implementation.

Why are hash indexes "obviously" best? In an ideal world with a good
implementation maybe, but postgresql b-trees are really quite good.

> Would it be better to manually create a column storing
> the hashes (maintained with a CHECK (md5(textcol) = hashcol) type
> constraint) and generate the hashes in the app?

You could always do something like:

CREATE INDEX foo ON table((md5(textcol)));

Then it will get used in queries like:
SELECT * FROM table WHERE md5(textcol) = md5('text');

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Вложения

Re: Hash Indexes

От
Naz Gassiep
Дата:
> Why are hash indexes "obviously" best? In an ideal world with a good
> implementation maybe, but postgresql b-trees are really quite good.
>
Because doing normal queries on a table where there are large text
blocks is unlikely to be a good idea. E.g.,:

SELECT * FROM table WHERE textcol = 'a 4kb block of text';

> You could always do something like:
>
> CREATE INDEX foo ON table((md5(textcol)));
>
> Then it will get used in queries like:
> SELECT * FROM table WHERE md5(textcol) = md5('text');
>
That's exactly what I was considering doing, however there is always the
change of a hash collision. Yes, this is a very remote chance, however
the ramifications of a collision under those circumstances is
potentially catastrophic. Think a user being delivered text that
contains confidential and sensitive material as opposed to the latest
memo about the cleaning of toilets.

I would assume that hash indexes have inbuilt mechanisms for collision
checking before returning the row as a match. Am I correct in this
assumption?

Best regards,
- Naz.

Re: Hash Indexes

От
Andrew Sullivan
Дата:
On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote:
> Because doing normal queries on a table where there are large text
> blocks is unlikely to be a good idea. E.g.,:
>
> SELECT * FROM table WHERE textcol = 'a 4kb block of text';

I suggest you look at the tsearch stuff instead.

> I would assume that hash indexes have inbuilt mechanisms for collision
> checking before returning the row as a match. Am I correct in this
> assumption?

I think you should avoid any assumptions about the hash index implementation
in PostgreSQL.  The general consensus seems to be that the code has a number
of problems.  Most importantly, hash index operations are _not_ currently
WAL-logged, which means you probably need to REINDEX in the event of a
database crash.  I don't know whether the collision issues are present in
hash indexes, though.

A


Re: Hash Indexes

От
Martijn van Oosterhout
Дата:
On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote:
> I would assume that hash indexes have inbuilt mechanisms for collision
> checking before returning the row as a match. Am I correct in this
> assumption?

Well, they do in the sense that it does the equivalent of:

SELECT * FROM table WHERE md5(textcol) = md5('text') and textcol = 'text';

But hash indexes can't be used for uniqueness checks, so if you need to
stop the same value being entered twice, hash indexes are way out.

The fact that hash indexes don't survive a DB crash is probably your
biggest concern.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Вложения

Re: Hash Indexes

От
Sam Mason
Дата:
On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote:
> >You could always do something like:
> >
> >CREATE INDEX foo ON table((md5(textcol)));
> >
> >Then it will get used in queries like:
> >SELECT * FROM table WHERE md5(textcol) = md5('text');
>
> That's exactly what I was considering doing, however there is always the
> change of a hash collision. Yes, this is a very remote chance, however
> the ramifications of a collision under those circumstances is
> potentially catastrophic.

You could make it a UNIQUE index (i.e. CREATE UNIQUE INDEX and the rest
like above) if you wanted, or you could perform the query as:

  SELECT * FROM table
  WHERE md5(textcol) = md5('text')
    AND textcol = 'text';

this should use the index to do the initial lookup and then filter out
colliding entries.

> I would assume that hash indexes have inbuilt mechanisms for collision
> checking before returning the row as a match. Am I correct in this
> assumption?

The above isn't using hash indexes in any way.  You're creating a b-tree
index on top of the md5-hash of a column.  The only index type that
support uniqueness constraints at the moment are b-tree indexes[1].


  Sam

 [1] http://www.postgresql.org/docs/current/static/sql-createindex.html#AEN47593

Re: Hash Indexes

От
Tom Lane
Дата:
Naz Gassiep <naz@mira.net> writes:
>> Why are hash indexes "obviously" best? In an ideal world with a good
>> implementation maybe, but postgresql b-trees are really quite good.
>>
> Because doing normal queries on a table where there are large text
> blocks is unlikely to be a good idea. E.g.,:
> SELECT * FROM table WHERE textcol = 'a 4kb block of text';

You seem to be harboring some rather severe conceptual errors about
how hash indexes work, or at least how Postgres' hash indexes work.
I get the impression you think that a hash index stores only a hash
code and not the actual field value, but that's not so.

            regards, tom lane