Обсуждение: Hash Indexes
Hi there, I am creating functionality where there are blocks of text that are being stored in the DB and that need to be searched for. No like or pattern matching, just a plain old WHERE clause. Obviously, hash indexes would be best here, however I've been warned away from PG's hash implementation. Would it be better to manually create a column storing the hashes (maintained with a CHECK (md5(textcol) = hashcol) type constraint) and generate the hashes in the app? - Naz.
On Sat, Jan 05, 2008 at 03:20:54AM +1100, Naz Gassiep wrote: > Hi there, > I am creating functionality where there are blocks of text that are > being stored in the DB and that need to be searched for. No like or > pattern matching, just a plain old WHERE clause. Obviously, hash indexes > would be best here, however I've been warned away from PG's hash > implementation. Why are hash indexes "obviously" best? In an ideal world with a good implementation maybe, but postgresql b-trees are really quite good. > Would it be better to manually create a column storing > the hashes (maintained with a CHECK (md5(textcol) = hashcol) type > constraint) and generate the hashes in the app? You could always do something like: CREATE INDEX foo ON table((md5(textcol))); Then it will get used in queries like: SELECT * FROM table WHERE md5(textcol) = md5('text'); Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Those who make peaceful revolution impossible will make violent revolution inevitable. > -- John F Kennedy
Вложения
> Why are hash indexes "obviously" best? In an ideal world with a good > implementation maybe, but postgresql b-trees are really quite good. > Because doing normal queries on a table where there are large text blocks is unlikely to be a good idea. E.g.,: SELECT * FROM table WHERE textcol = 'a 4kb block of text'; > You could always do something like: > > CREATE INDEX foo ON table((md5(textcol))); > > Then it will get used in queries like: > SELECT * FROM table WHERE md5(textcol) = md5('text'); > That's exactly what I was considering doing, however there is always the change of a hash collision. Yes, this is a very remote chance, however the ramifications of a collision under those circumstances is potentially catastrophic. Think a user being delivered text that contains confidential and sensitive material as opposed to the latest memo about the cleaning of toilets. I would assume that hash indexes have inbuilt mechanisms for collision checking before returning the row as a match. Am I correct in this assumption? Best regards, - Naz.
On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote: > Because doing normal queries on a table where there are large text > blocks is unlikely to be a good idea. E.g.,: > > SELECT * FROM table WHERE textcol = 'a 4kb block of text'; I suggest you look at the tsearch stuff instead. > I would assume that hash indexes have inbuilt mechanisms for collision > checking before returning the row as a match. Am I correct in this > assumption? I think you should avoid any assumptions about the hash index implementation in PostgreSQL. The general consensus seems to be that the code has a number of problems. Most importantly, hash index operations are _not_ currently WAL-logged, which means you probably need to REINDEX in the event of a database crash. I don't know whether the collision issues are present in hash indexes, though. A
On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote: > I would assume that hash indexes have inbuilt mechanisms for collision > checking before returning the row as a match. Am I correct in this > assumption? Well, they do in the sense that it does the equivalent of: SELECT * FROM table WHERE md5(textcol) = md5('text') and textcol = 'text'; But hash indexes can't be used for uniqueness checks, so if you need to stop the same value being entered twice, hash indexes are way out. The fact that hash indexes don't survive a DB crash is probably your biggest concern. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Those who make peaceful revolution impossible will make violent revolution inevitable. > -- John F Kennedy
Вложения
On Tue, Jan 08, 2008 at 01:49:53AM +1100, Naz Gassiep wrote: > >You could always do something like: > > > >CREATE INDEX foo ON table((md5(textcol))); > > > >Then it will get used in queries like: > >SELECT * FROM table WHERE md5(textcol) = md5('text'); > > That's exactly what I was considering doing, however there is always the > change of a hash collision. Yes, this is a very remote chance, however > the ramifications of a collision under those circumstances is > potentially catastrophic. You could make it a UNIQUE index (i.e. CREATE UNIQUE INDEX and the rest like above) if you wanted, or you could perform the query as: SELECT * FROM table WHERE md5(textcol) = md5('text') AND textcol = 'text'; this should use the index to do the initial lookup and then filter out colliding entries. > I would assume that hash indexes have inbuilt mechanisms for collision > checking before returning the row as a match. Am I correct in this > assumption? The above isn't using hash indexes in any way. You're creating a b-tree index on top of the md5-hash of a column. The only index type that support uniqueness constraints at the moment are b-tree indexes[1]. Sam [1] http://www.postgresql.org/docs/current/static/sql-createindex.html#AEN47593
Naz Gassiep <naz@mira.net> writes: >> Why are hash indexes "obviously" best? In an ideal world with a good >> implementation maybe, but postgresql b-trees are really quite good. >> > Because doing normal queries on a table where there are large text > blocks is unlikely to be a good idea. E.g.,: > SELECT * FROM table WHERE textcol = 'a 4kb block of text'; You seem to be harboring some rather severe conceptual errors about how hash indexes work, or at least how Postgres' hash indexes work. I get the impression you think that a hash index stores only a hash code and not the actual field value, but that's not so. regards, tom lane