Обсуждение: fuzzystrmatch module buggy? observations
			
				The fuzzystrmatch module (http://www.postgresql.org/docs/9.2/static/fuzzystrmatch.html) is currently, as of 9.2.1, documented with the caution "At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8)". 
			
		
		
	While the venerable algorithms contained in the module seem to generally work for Latin strings from European languages which all have accented/diacritic characters such as äöüñáéíóúàèìòù, for languages with non-Latin characters such as Kyrillic, Hebrew, Arabic, Chinese, these venerable algorithms return NULL (empty) or plain weirdness. 
Some examples:
dmetaphone ('Новости') = 'NN'
soundex ('Новости') = NULL
dmetaphone ('לפחות') = NULL
 soundex ('לפחות') = NULL
soundex ('相关搜索') = NULL
dmetaphone ('相关搜索') = NULL
metaphone() crashes with SQL state: 42883 for all these strings (it tells me I should cast the 'unknown' input).
The string 'äöüñáéíóúàèìòù' causes metaphone(), dmetaphone(), dmetaphone_alt, soundex() to fail.
Only levenshtein() appears to function correctly with all above inputs, even when I let it compare Hebrew against Chinese strings.
Summarizing my experience:
* for english (ASCII equivalent), the module works, 
* for the rest of the Latin charsets (equivalent to ISO 8859-x) the module works unreliably,
 * for non-latin chars (UTF8 with 2-4 bytes per char) the module does not work
Note: My DB and the OS are set up for UTF-8.
This would appear to be less a problem of Postgresql and the fuzzystrmach module itself but because there
appear to exist no replacement algorithms adequate for a multilingual world - at least that is my impression 
after looking at the IPA and http://www.lt-world.org websites and branching out from there.
Given all this I have no idea of this is a bug at all or the state-of-the-art around this topic is inadequate.
Questions (to the developers):
- Is there anything in work or planned for the fuzzystrmatch module?
- Does anybody know about adequate replacements or upgrades of the soundex, metaphone etc. algorithms from academia?
On Tue, Oct 30, 2012 at 02:29:09PM +0100, r d wrote: > The fuzzystrmatch module (http://www.postgresql.org/docs/9.2/static/ > fuzzystrmatch.html) is currently, as of 9.2.1, documented with the caution "At > present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do > not work well with multibyte encodings (such as UTF-8)". > > While the venerable algorithms contained in the module seem to generally work > for Latin strings from European languages which all have accented/diacritic > characters such as äöüñáéíóúàèìòù, for languages with non-Latin characters such > as Kyrillic, Hebrew, Arabic, Chinese, these venerable algorithms return NULL > (empty) or plain weirdness. > > Some examples: > > dmetaphone ('Новости') = 'NN' > soundex ('Новости') = NULL > > dmetaphone ('לפחות') = NULL > soundex ('לפחות') = NULL > > soundex ('相关搜索') = NULL > dmetaphone ('相关搜索') = NULL > > metaphone() crashes with SQL state: 42883 for all these strings (it tells me I > should cast the 'unknown' input). > > The string 'äöüñáéíóúàèìòù' causes metaphone(), dmetaphone(), dmetaphone_alt, > soundex() to fail. > > Only levenshtein() appears to function correctly with all above inputs, even > when I let it compare Hebrew against Chinese strings. > > Summarizing my experience: > * for english (ASCII equivalent), the module works, > * for the rest of the Latin charsets (equivalent to ISO 8859-x) the module > works unreliably, > * for non-latin chars (UTF8 with 2-4 bytes per char) the module does not work > > Note: My DB and the OS are set up for UTF-8. > > This would appear to be less a problem of Postgresql and the fuzzystrmach > module itself but because there > appear to exist no replacement algorithms adequate for a multilingual world - > at least that is my impression > after looking at the IPA and http://www.lt-world.org websites and branching out > from there. This is a very good summary. I was not aware of all these behaviors. > Given all this I have no idea of this is a bug at all or the state-of-the-art > around this topic is inadequate. I have no idea either. > Questions (to the developers): > - Is there anything in work or planned for the fuzzystrmatch module? > - Does anybody know about adequate replacements or upgrades of the soundex, > metaphone etc. algorithms from academia? I have not heard of anyone working in this area. What usually happens is some expert in the field shows up and submits a patch to improve it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +