Thread: Enhancing phonetic search support for more languages - GSoC 2010


Enhancing phonetic search support for more languages - GSoC 2010

From
Dhiraj Lohiya
Date:
Hello

I am Dhiraj Lohiya, a Computer Science undergraduate from BITS Pilani. I would like to propose an idea for improving phonetic search support, initially for some Indian languages such as Hindi and Marathi, with a framework that makes it easy to extend to other languages by contributing rules in a simple format. I would like to take this forward as a GSoC project. Please have a look and see whether you find it interesting:

I plan to customize the Soundex algorithm so that each language can have its own set of phonetic equivalence-class rules (generally around 20 rules for most of the Indian languages I have worked with). I would keep the approach layered so that rule sets for additional languages can easily be contributed by others.

Moreover, once someone has defined a base set of rules, it is important that the rules can be extended and can evolve based on user input and usage.
For instance, if many users (above a threshold set by us) enter a search string that retrieves none of the results they want, we could track what they finally select and then append to or modify our phonetic rules based on the phonetic mismatch between the query entered and the result wanted. In this way the rule sets could evolve by themselves as we collect usage statistics from users. This feature would add a new dimension to the search functionality and would surely stand out.
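
The feedback loop described above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the class name `RuleSuggester`, the `threshold` default, and the toy `vowelless_key` function are all made up for the example and are not part of any existing implementation.

```python
from collections import Counter


def vowelless_key(word):
    """Toy phonetic key (assumption): lowercase the word and drop vowels."""
    return "".join(ch for ch in word.lower() if ch not in "aeiou")


class RuleSuggester:
    """Counts query/selection pairs whose phonetic keys disagree."""

    def __init__(self, key_func, threshold=3):
        self.key_func = key_func
        self.threshold = threshold  # hypothetical cut-off "set by us"
        self.mismatches = Counter()

    def record(self, query, selected):
        """Log that a user typed `query` and finally picked `selected`."""
        if self.key_func(query) != self.key_func(selected):
            self.mismatches[(query.lower(), selected.lower())] += 1

    def candidates(self):
        """Pairs seen at least `threshold` times: candidates for a new rule."""
        return sorted(p for p, n in self.mismatches.items() if n >= self.threshold)


suggester = RuleSuggester(vowelless_key, threshold=3)
for _ in range(3):
    suggester.record("fool", "phool")   # keys "fl" vs "phl": a mismatch
suggester.record("pyar", "pyaar")       # keys agree, nothing to learn
print(suggester.candidates())           # [('fool', 'phool')]
```

The sketch only collects candidate pairs; turning a pair such as ("fool", "phool") into an actual rule (for example "f" and "ph" belong to the same class) would still need a separate step.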

Initially I plan to implement this for a few Indian languages such as Hindi and Marathi, and to define a simple way of adding rules for other languages directly, so that people who know those languages can contribute. This would probably be a GUI based on the Google Image Labeler concept, in which two words that sound similar are paired in order to improve the rule set.


Samples (from searching for Hindi songs):
  • If I search for a song containing the word "naiyya" but spell it "nayya", no result is currently returned, since that spelling is not in the playlist.
  • Similarly, searching for "pyar" gives different results from searching for "pyaar", although it is easy to see that both are the same word and should give the same results.
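
To make the samples concrete, here is a minimal Python sketch of a Soundex-style key driven by a per-language rule table. The rule set shown under `"hi"` is a hypothetical placeholder (it simply reuses the classic Soundex consonant classes), and the names `RULE_SETS`, `build_table`, and `phonetic_key` are assumptions for illustration. As a simplification, the sketch encodes the first letter by its class instead of keeping it verbatim as classic Soundex does.

```python
# Hypothetical rule sets: each rule maps a group of letters to one class code.
# The "hi" entry is illustrative only, not a real rule set for romanized Hindi.
RULE_SETS = {
    "hi": [
        ("bpfv", "1"),
        ("cgjkqsxz", "2"),
        ("dt", "3"),
        ("l", "4"),
        ("mn", "5"),
        ("r", "6"),
    ],
}


def build_table(rules):
    """Flatten (letters, code) rules into a per-letter lookup table."""
    return {ch: code for letters, code in rules for ch in letters}


def phonetic_key(word, lang, length=4):
    """Return a fixed-length key; letters without a class are skipped."""
    table = build_table(RULE_SETS[lang])
    codes = []
    prev = None
    for ch in word.lower():
        code = table.get(ch)          # vowels, h, w, y have no class here
        if code is not None and code != prev:
            codes.append(code)        # collapse directly adjacent duplicates
        prev = code
    return "".join(codes)[:length].ljust(length, "0")


print(phonetic_key("pyar", "hi"), phonetic_key("pyaar", "hi"))    # 1600 1600
print(phonetic_key("naiyya", "hi"), phonetic_key("nayya", "hi"))  # 5000 5000
```

With keys like these, both spellings in each sample map to the same value, so a search on the key would return the same songs for either spelling.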
Some background on this:
I have already worked out a basic customized version of the Soundex algorithm as part of my intern project at PennyWiseSolutions and implemented it in Java (it can also improve its own rule set when given two phonetically similar words). Right now, rule sets exist only for Hindi and Marathi. For both languages the results are narrowed down well, with far fewer false positives. Since the algorithm itself stays the same (almost equivalent to Soundex) and only the rule sets for other languages need to be contributed, I think this approach could work. One specific customization was to drop Soundex's handling of silent letters, since users don't really use silent letters when spelling a Hindi word in English.

I would be glad to have more input on this.

--
Regards
Dhiraj Lohiya

Re: Enhancing phonetic search support for more languages - GSoC 2010

From
Josh Berkus
Date:
Dhiraj,

> For instance, if many users(above a threshold set by us) insert some
> search string for which no wanted search result is retrieved, we
> could track what he finally selects and then accordingly append/modify
> our set of phonetic rules based on the phonetic mismatch amongst the
> query inserted and result wanted according to our set of rules. Using
> this, the rule sets it could evolve itself when we collect usage
> statistics from users based on their experience. This feature would
> add a new dimension to the search functionality and would surely stand
> out.

You're mixing two completely different kinds of features here.  One is a
backend function and the other is an application for building soundex
rules.  While both of these are interesting projects, it is unlikely you
can complete both in one summer.

What I'd suggest focussing on for SoC is creating a new soundex function
(suggested name: soundex_ml) which includes a facility for loadable
algorithms and callability on a per-language basis.  That would be
plenty of work by itself.  From there, you could then continue your
undergraduate work on the tool to build the algorithms in the first place.
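
As a rough model of what "loadable algorithms and callability on a per-language basis" could look like, here is a small Python sketch. It is not PostgreSQL code: only the name `soundex_ml` comes from the suggestion above, while the registry `_ALGORITHMS`, the decorator `register_language`, and the placeholder `"demo"` algorithm are assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry: language code -> function producing a phonetic key.
_ALGORITHMS: Dict[str, Callable[[str], str]] = {}


def register_language(lang: str):
    """Decorator that loads an algorithm into the registry under `lang`."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        _ALGORITHMS[lang] = func
        return func
    return decorator


def soundex_ml(text: str, lang: str) -> str:
    """Dispatch to whichever algorithm was registered for `lang`."""
    try:
        algorithm = _ALGORITHMS[lang]
    except KeyError:
        raise ValueError(f"no phonetic algorithm loaded for language {lang!r}") from None
    return algorithm(text)


@register_language("demo")
def _demo_key(text: str) -> str:
    """Placeholder algorithm: keep the first letter, drop later vowels."""
    text = text.lower()
    return (text[:1] + "".join(c for c in text[1:] if c not in "aeiou")).upper()


print(soundex_ml("pyaar", "demo"))  # PYR
print(soundex_ml("pyar", "demo"))   # PYR
```

The point of the design is that callers only ever see `soundex_ml(text, lang)`; adding a language means registering one more function, without touching the dispatcher.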

I'm also curious why you chose to focus on the extremely imprecise
soundex instead of the more discriminating metaphone.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Enhancing phonetic search support for more languages - GSoC 2010

From
Robert Haas
Date:
On Wed, Apr 7, 2010 at 4:24 PM, Dhiraj Lohiya <lohiya.dhiraj@gmail.com> wrote:
> For instance, if many users(above a threshold set by us) insert
> some search string for which no wanted search result is retrieved, we could
> track what he finally selects and then accordingly append/modify our set of
> phonetic rules based on the phonetic mismatch amongst the  query inserted
> and result wanted according to our set of rules. Using this, the rule
> sets it could evolve itself when we collect usage statistics from users
> based on their experience. This feature would add a new dimension to
> the search functionality and would surely stand out.

This is really more of an application than something you're going to
be able to build into the database.  It might be an interesting
project, but it isn't really a PostgreSQL project (though you might
choose to use PostgreSQL to implement it).

...Robert


Re: Enhancing phonetic search support for more languages - GSoC 2010

From
Dhiraj Lohiya
Date:


> I'm also curious why you chose to focus on the extremely imprecise
> soundex instead of the more discriminating metaphone.


The main reason for choosing Soundex over Metaphone/Double Metaphone is that, for Indian languages, Soundex with some customizations already works well. Double Metaphone only adds processing overhead, along with the need to store two hashes, while the matching performance stays the same. This is because pronunciation in Indian languages follows the phonology of the Devanagari script, which has no silent letters or other accent-related features (a major consideration in the design of Double Metaphone). One more customization required for Indian languages is that the characters of a word are not taken one by one; instead the word is broken into substrings of consecutive vowels and consecutive consonants, and each substring is mapped to its equivalence class. Also, one rule from Metaphone needs to be incorporated: Soundex does not encode the first letter of the word, but we would encode it too, using its equivalence class.

With this approach, Soundex (ignoring silent letters and splitting the word into substrings rather than processing it character by character) delivers almost the same performance with much less overhead than Double Metaphone, whose handling of silent letters, accents, etc. has little impact on Indian languages. Hence it would be more efficient.
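
The substring-based encoding described above could look roughly like the following Python sketch. The run-level classes in `RUN_CLASSES` are hypothetical examples for romanized Hindi, not the actual rule set, and the function names `split_runs` and `run_key` are assumptions. Runs without a rule are passed through unchanged, and the first run goes through the same mapping as every other run, reflecting the point that the first letter is encoded too.

```python
from itertools import groupby

VOWELS = set("aeiou")

# Hypothetical run-level equivalence classes (illustrative only).
RUN_CLASSES = {
    "a": "A", "aa": "A",
    "i": "I", "ee": "I",
    "u": "U", "oo": "U",
    "f": "F", "ph": "F",
}


def split_runs(word):
    """Split a word into maximal runs of vowels and of non-vowels."""
    return ["".join(group)
            for _, group in groupby(word.lower(), key=lambda ch: ch in VOWELS)]


def run_key(word):
    """Map every run, including the first, to its class; keep unknown runs."""
    return "-".join(RUN_CLASSES.get(run, run) for run in split_runs(word))


print(split_runs("pyaar"))                   # ['py', 'aa', 'r']
print(run_key("pyar"), run_key("pyaar"))     # py-A-r py-A-r
print(run_key("phool"), run_key("ful"))      # F-U-l F-U-l
```

Because "ph" and "f" are treated as whole runs, a rule can equate them directly, which a character-by-character scheme cannot express as a single class.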

For Western languages, Double Metaphone is known to give very good results, so it could be used there.

My previous mail concentrated on Soundex because I had also considered how it would self-improve its rule set of equivalence classes. That is a little trickier in Double Metaphone, whereas in Soundex we can derive new rules from the mappings that are already present. Whether to include this functionality as well could be decided later.

So for the SoC project, as proposed, I could concentrate on the algorithmic part for multilingual support. Once the framework is ready, with tutorials and a wiki explaining how to add rules for a new language, the community could contribute rules for more languages, and these could be incorporated after being tested against a quality threshold.

Thanks for the inputs. More suggestions/reviews please!

--
Regards
Dhiraj Lohiya

Re: Enhancing phonetic search support for more languages - GSoC 2010

From
Dhiraj Lohiya
Date:
Hello

Please find my project proposal at the following link:

I would be glad to have your review/feedback on the same.

--
Regards
Dhiraj Lohiya