Re: Index on TEXT versus CHAR(32)... fast exact TEXT matching
От | NTPT |
---|---|
Тема | Re: Index on TEXT versus CHAR(32)... fast exact TEXT matching |
Дата | |
Msg-id | 006301c49460$3399df70$74beebd5@wbp1 обсуждение исходный текст |
Ответ на | Index on TEXT versus CHAR(32)... fast exact TEXT matching (Jon Lapham <lapham@jandr.org>) |
Список | pgsql-general |
what about to use a CRC32 checksum of the text, computed by client application an then make index of crc32 data ? ie add column crc int4, add column md5 varchar(255 ) create index blabla on mytable (crc) or even create index blabla2 on mytable (crc,md5) and query like ......... where crc='crc32 of your searched text' AND md5='md5 hash' i am not shure, but this should work pretty fast. ----- Original Message ----- From: "Jon Lapham" <lapham@jandr.org> To: <pgsql-general@postgresql.org> Sent: Saturday, September 04, 2004 3:04 PM Subject: [GENERAL] Index on TEXT versus CHAR(32)... fast exact TEXT matching > I have a table that stores TEXT information. I need query this table to > find *exact* matches to the TEXT... no regular expressions, no LIKE > queries, etc. The TEXT could be from 1 to 10000+ characters in length, > quite variable. If it matters, the TEXT may contain UNICODE characters... > > Example: > CREATE TABLE a (id SERIAL, thetext TEXT); > SELECT id FROM a WHERE thetext='Some other text'; > > One way I thought to optimize this process would be to store an MD5 hash > of the "thetext" column and query on that: > > CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32)); > SELECT id FROM a WHERE thetext_m5d=md5('Some other text'); > > Now, obviously I would want to build an INDEX on either "thetext" or > "thetext_md5", depending on which way I decide to make the table. > > My question is, what is the absolute fastest way to find the exact match > to a TEXT column? Any amount of pre-processing is fine (such as > calculating the MD5's of all the TEXT tuples), but the query must be > extremely fast. > > Has anyone compared (theoretical or practical) the performance of > querying a TEXT-based INDEX versus a CHAR(32)-based INDEX? Is my MD5 > idea a waste of time? Is there something better than MD5? Would it be > better to store the "fingerprint" of the TEXT as an integer somehow, so > that the INDEX could be based on a INT* column? > > Thanks for any help! > > -Jon > > -- > -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*--- > Jon Lapham <lapham@jandr.org> Rio de Janeiro, Brasil > Personal: http://www.jandr.org/ > ***-*--*----*-------*------------*--------------------*--------------- > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html > >
В списке pgsql-general по дате отправления: