Re: Remaining dependency on setlocale()
| From | Peter Eisentraut |
|---|---|
| Subject | Re: Remaining dependency on setlocale() |
| Date | |
| Msg-id | 108e07a2-0632-4f00-984d-fe0e0d0ec726@eisentraut.org |
| In response to | Re: Remaining dependency on setlocale() (Jeff Davis <pgsql@j-davis.com>) |
| Responses | Re: Remaining dependency on setlocale() |
| List | pgsql-hackers |
On 23.12.25 21:09, Jeff Davis wrote:
> On Wed, 2025-12-17 at 11:39 +0100, Peter Eisentraut wrote:
>> For Metaphone, I found the reference implementation linked from its
>> Wikipedia page, and it looks like our implementation is pretty closely
>> aligned to that. That reference implementation also contains the
>> C-with-cedilla case explicitly. The correct fix here would probably be
>> to change the implementation to work on wide characters. But I think
>> for the moment you could try a shortcut like, use pg_ascii_toupper(),
>> but if the encoding is LATIN1 (or LATIN9 or whichever other encodings
>> also contain C-with-cedilla at that code point), then explicitly
>> uppercase that one as well. This would preserve the existing behavior.
>
> Done, attached new patches.
>
> Interestingly, WIN1256 encodes only the SMALL LETTER C WITH CEDILLA. I
> think, for the purposes here, we can still consider it to "uppercase"
> to \xc7, so that it can still be treated as the same sound. Technically
> I think that would be an improvement over the current code in this edge
> case, and suggests that case folding would be a better approach than
> uppercasing.

On further reflection, it seems just as easy to have dmetaphone() take the input collation and use that to do a proper collation-aware upper-casing. This has the same effect (that is, it will still only support certain single-byte encodings), but it avoids elaborately hard-coding a bunch of things, and if we ever want to make this multibyte-aware, then we'll have to go this way anyway, I think. See attached patch.
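
For readers without the patches at hand, the hard-coded shortcut discussed in the quoted text could look roughly like the sketch below. It is not taken from any of the attached patches. The `sketch_*` names and the encoding enum are hypothetical stand-ins (PostgreSQL has its own encoding identifiers and its own `pg_ascii_toupper()`), and the byte values assume that LATIN1, LATIN9 and WIN1256 all place small c-with-cedilla at 0xE7. It illustrates only the earlier shortcut, not the collation-aware approach proposed at the end of this message.

```c
#include <stdbool.h>

/* Hypothetical encoding tags for this sketch only. */
typedef enum
{
    SKETCH_LATIN1,
    SKETCH_LATIN9,
    SKETCH_WIN1256,
    SKETCH_OTHER
} sketch_encoding;

/* Locale-independent uppercasing of ASCII letters; other bytes pass through. */
unsigned char
sketch_ascii_toupper(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - 'a' + 'A');
    return ch;
}

/* Assumption: these single-byte encodings put small c-with-cedilla at 0xE7. */
bool
sketch_has_c_cedilla(sketch_encoding enc)
{
    return enc == SKETCH_LATIN1 ||
           enc == SKETCH_LATIN9 ||
           enc == SKETCH_WIN1256;
}

/*
 * ASCII uppercasing plus one hard-coded special case: map 0xE7 to 0xC7.
 * In WIN1256 the byte 0xC7 is not a capital C-with-cedilla, but mapping
 * to it still lets the caller treat both bytes as the same sound.
 */
unsigned char
sketch_metaphone_toupper(unsigned char ch, sketch_encoding enc)
{
    if (ch == 0xE7 && sketch_has_c_cedilla(enc))
        return 0xC7;
    return sketch_ascii_toupper(ch);
}
```

The encoding list is exactly the kind of hard-coding that the collation-aware approach avoids: every additional encoding that carries the character would need another entry here.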