Re: Searching for "bare" letters
От | Eduardo Morras |
---|---|
Тема | Re: Searching for "bare" letters |
Дата | |
Msg-id | 4E5EF2B1008AF013@ обсуждение исходный текст |
Ответ на | Searching for "bare" letters ("Reuven M. Lerner" <reuven@lerner.co.il>) |
Список | pgsql-general |
At 01:25 02/10/2011, Reuven M. Lerner wrote: >Hi, everyone.=C2 I'm working on a project on=20 >PostgreSQL 9.0 (soon to be upgraded to 9.1,=20 >given that we haven't yet launched).=C2 The=20 >project will involve numerous text fields=20 >containing English, Spanish, and=20 >Portuguese.=C2 Some of those text fields will be=20 >searchable by the user.=C2 That's easy enough to=20 >do; for our purposes, I was planning to use some=20 >combination of LIKE searches; the database is=20 >small enough that this doesn't take very much=20 >time, and we don't expect the number of=20 >searchable records (or columns within those records) to be all that large. > >The thing is, the people running the site want=20 >searches to work on what I'm calling (for lack=20 >of a better term) "bare" letters.=C2 That is, if=20 >the user searches for "n", then the search=20 >should also match Spanish words containing=20 >"=C3=B1".=C2 I'm told by Spanish-speaking members of=20 >the team that this is how they would expect=20 >searches to work.=C2 However, when I just did a=20 >quick test using a UTF-8 encoded 9.0 database, I=20 >found that PostgreSQL didn't=C2 see the two=20 >characters as identical.=C2 (I must say, this is=20 >the behavior that I would have expected, had the=20 >Spanish-speaking team member not said anything on the subject.) > >So my question is whether I can somehow wrangle=20 >PostgreSQL into thinking that "n" and "=C3=B1" are=20 >the same character for search purposes, or if I=20 >need to do something else -- use regexps, keep a=20 >"naked," searchable version of each column=20 >alongside the native one, or something else entirely -- to get this to wor= k. > >Any ideas? You can use perceptual hashing for that. There=20 are multiple algorithms, some of them can be tuned for specific languages. See this documentation: http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description, http://en.wikipedia.org/wiki/Soundex is the first one developed, very old, http://en.wikipedia.org/wiki/Metaphone is a=20 family of several modern algorithms. Remember that they are hashing algorithms, some=20 words can collide because they have the same pronunciation but write differ= ent. I remember that datapark search engine uses them=20 with dictionaries. You can check it too. http://www.dataparksearch.org/ >Thanks, > >Reuven HTH=20
В списке pgsql-general по дате отправления: