lower and upper not UTF-8 safe
From | Julian Satchell |
---|---|
Subject | lower and upper not UTF-8 safe |
Date | |
Msg-id | 1060004637.28875.3215.camel@jsatchell.eris.qinetiq.com |
Responses | Re: lower and upper not UTF-8 safe |
List | pgsql-hackers |
The implementations of lower and upper in src/backend/utils/adt/oracle_compat.c use the single-byte macros from ctype.h to alter individual bytes in the text string. If the text is UTF-8 encoded, this is wrong: depending on the locale, bytes that belong to a multibyte sequence can be altered, and the result is an invalid string that is no longer UTF-8. The code is essentially the same in 7.3.4 and in CVS tip.

I can see two options: remove access to these functions if the database is running UNICODE, or rewrite/extend them so that the correct thing happens.

The easiest way to do the latter is probably to convert the UTF-8 to a fixed-width encoding (say UCS-4), perform the lower operation on each character to get a new set of character indices, then convert back to UTF-8. The byte length of the output might even be different from the input, although I don't know of an example where this happens.

At the very least, the documentation for lower and upper in the manual should warn the user not to use them in a UNICODE database.

--
Julian Satchell <j.satchell@eris.qinetiq.com>
QinetiQ
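As a rough illustration of the problem and of the decode, map, re-encode approach described in the message, here is a minimal Python sketch. It is not PostgreSQL code. The helper names (`bytewise_lower`, `codepoint_lower`, `is_valid_utf8`) are hypothetical, and the byte-wise version assumes a Latin-1 locale to stand in for what a single-byte `tolower()` might do; the actual behaviour of ctype.h depends on the server's locale.

```python
def bytewise_lower(data: bytes) -> bytes:
    """Lowercase every byte on its own.

    Assumption: this mimics a single-byte tolower() running under a
    Latin-1 locale, where bytes 0xC0-0xDE are treated as capital letters.
    """
    return bytes(ord(chr(b).lower()) for b in data)


def codepoint_lower(data: bytes) -> bytes:
    """Decode UTF-8 to code points, lowercase them, then re-encode."""
    return data.decode("utf-8").lower().encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    original = "\u00c9COLE".encode("utf-8")  # "ÉCOLE"; É is the bytes C3 89

    broken = bytewise_lower(original)   # the lead byte C3 is rewritten to E3
    fixed = codepoint_lower(original)   # É becomes é (C3 A9)

    print(original.hex(" "), is_valid_utf8(original))
    print(broken.hex(" "), is_valid_utf8(broken))
    print(fixed.hex(" "), is_valid_utf8(fixed))

    # Byte length can change: U+023A takes two bytes in UTF-8, while its
    # lowercase form U+2C65 takes three.
    stroke_a = "\u023a".encode("utf-8")
    print(len(stroke_a), len(codepoint_lower(stroke_a)))
```

In this sketch the byte-wise version turns the lead byte of É into the lead byte of a three-byte sequence, so the string stops being valid UTF-8, while the code-point version stays valid. The last lines show one case where the output is longer than the input, so a rewritten function could not assume it can modify the buffer in place.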