Thread: upper and UTF-8

upper and UTF-8

From
"Benjamin Krajmalnik"
Date:

I just used the upper(text) function on a database which is UTF8 encoded and which has Spanish text.

All of the regular characters were properly converted, except for characters which had accents.

 

Re: upper and UTF-8

From
Scott Marlowe
Date:
On Mon, Jul 26, 2010 at 3:03 PM, Benjamin Krajmalnik <kraj@servoyant.com> wrote:
> I just used the upper(text) function on a database which is utf8 encoded and
> which has spanish text.
>
> All of the regular characters were properly converted, except for characters
> which had accents.

What are your various LC_* variables for that database?

--
To understand recursion, one must first understand recursion.
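
A minimal sketch of one way to answer this from inside psql, assuming PostgreSQL 8.4 or later, where the collation and character-classification locales are stored per database in pg_database:

```sql
-- Show the encoding and locale settings of the database you are connected to.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,   -- LC_COLLATE: sort order
       datctype      -- LC_CTYPE: character classification and case conversion
FROM pg_database
WHERE datname = current_database();
```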

Re: upper and UTF-8

From
"Benjamin Krajmalnik"
Date:
CREATE DATABASE ishield
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       LC_COLLATE = 'C'
       LC_CTYPE = 'C'
       CONNECTION LIMIT = -1;



Re: upper and UTF-8

From
Scott Marlowe
Date:
I'd try creating a db with en_US, or even better a Spanish locale,
for lc_collate and see what happens.


Re: upper and UTF-8

From
"Benjamin Krajmalnik"
Date:
Unfortunately, the database has to accept data in multiple languages, since it is a SaaS offering.
It is not a big deal - I just found it interesting that it did not uppercase the accented letters.
The reason I came across it is that I created a table of all the ISO countries.  I had found a MySQL script which created it, and it had the fields in both upper case and mixed case.  Since our platform is multi-lingual, we expanded the table to add the language code and started adding the translation.  After I finished the translation, I figured for consistency I would upper case the one field into the other, and this is where I saw the inconsistency.
Operationally, it does not affect me in any way - but I found it strange that it did not handle the accented characters.
For now we are keeping the column to facilitate the translation to other languages - ultimately it will be dropped.



Re: upper and UTF-8

From
Scott Marlowe
Date:
On Mon, Jul 26, 2010 at 3:47 PM, Benjamin Krajmalnik <kraj@servoyant.com> wrote:
> Unfortunately, the database has to accept data in multiple languages, since it is a SaaS offering.

The encoding determines that, not the collation.  UTF-8 allows you to
insert various languages in that encoding.

> It is not a big deal - I just found it interesting that it did not uppercase the accented letters.

Just tested it and the lc_collate seems to make the difference.

Re: upper and UTF-8

From
Scott Marlowe
Date:
On Mon, Jul 26, 2010 at 3:51 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Jul 26, 2010 at 3:47 PM, Benjamin Krajmalnik <kraj@servoyant.com> wrote:
>> Unfortunately, the database has to accept data in multiple languages, since it is a SaaS offering.
>
> The encoding determines that, not the collation.  UTF-8 allows you to
> insert various languages in that encoding.
>
>> It is not a big deal - I just found it interesting that it did not uppercase the accented letters.
>
> Just tested it and the lc_collate seems to make the difference.

To be more specific, when my lc_collate is en_US, it works properly.
I didn't have to use a spanish collation to make it work.  Note that
changing collation will change sort order, and some matching rules and
things like that.  Also, a db is usually noticeably faster working
with text in locale of C, because it then treats the data mostly as
though it's in byte order.

Re: upper and UTF-8

From
Alvaro Herrera
Date:
Excerpts from Benjamin Krajmalnik's message of lun jul 26 17:03:54 -0400 2010:
> I just used the upper(text) function on a database which is utf8 encoded
> and which has spanish text.
>
> All of the regular characters were properly converted, except for
> characters which had accents.

FWIW it works fine for me:

alvherre=# show lc_collate ;
 lc_collate
------------
 es_CL.utf8
(1 fila)

alvherre=# select upper('benjamín');
  upper
----------
 BENJAMÍN
(1 fila)



I suspect that the problem is an incorrect client_encoding setting.
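
To rule this out, the client-side and server-side encodings can be compared in the same session. A short sketch; the SET line is only an example and assumes the terminal or application really sends UTF-8:

```sql
SHOW client_encoding;   -- encoding the client claims to send and expects back
SHOW server_encoding;   -- encoding the database stores

-- Assumption: the terminal or application uses UTF-8.
SET client_encoding = 'UTF8';
```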

Re: upper and UTF-8

From
Scott Marlowe
Date:
On Mon, Jul 26, 2010 at 8:09 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Benjamin Krajmalnik's message of lun jul 26 17:03:54 -0400 2010:
>> I just used the upper(text) function on a database which is utf8 encoded
>> and which has spanish text.
>>
>> All of the regular characters were properly converted, except for
>> characters which had accents.
>
> FWIW it works fine for me:
>
> alvherre=# show lc_collate ;
>  lc_collate
> ------------
>  es_CL.utf8
> (1 fila)
>
> alvherre=# select upper('benjamín');
>  upper
> ----------
>  BENJAMÍN
> (1 fila)
>
> I suspect that the problem is an incorrect client_encoding setting.

Yeah, OP had set lc_collate to C under the mistaken impression that
collation controlled the character sets you could insert into the
database.  If you create a db with lc_collate='C' then upper() only
works on basic ASCII characters, as near as I can tell.

Re: upper and UTF-8

From
Alvaro Herrera
Date:
Excerpts from Scott Marlowe's message of lun jul 26 23:12:08 -0400 2010:
> On Mon, Jul 26, 2010 at 8:09 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:

> > I suspect that the problem is an incorrect client_encoding setting.
>
> Yeah, OP had set lc_collate to C under the mistaken impression that
> collation controlled the character sets you could insert into the
> database.  If you create a db with lc_collate='C' then the upper only
> works on basic ascii characters near as I can tell.

Makes sense.  The code seems to say that it's lc_ctype that's important
though, see str_toupper in formatting.c.  So I think you could still set
collation to C and use a language-specific lc_ctype.
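
A sketch of that idea, assuming PostgreSQL 8.4 or later on a system where a locale named en_US.UTF-8 is installed. Locale names vary by operating system, and ishield_ctype_test is a made-up database name. TEMPLATE = template0 is needed whenever the requested locale differs from that of the template database:

```sql
CREATE DATABASE ishield_ctype_test
  WITH OWNER = postgres
       TEMPLATE = template0
       ENCODING = 'UTF8'
       LC_COLLATE = 'C'            -- keep byte-order sorting and comparisons
       LC_CTYPE = 'en_US.UTF-8';   -- locale-aware character classification

-- After connecting to the new database:
SELECT upper('benjamín');
```

If lc_ctype is indeed what str_toupper consults, the accented letter should be uppercased here even though the collation stays C.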

Re: upper and UTF-8

From
Michael Gould
Date:

Benjamin,

We're using the contrib module citext for all text columns so that we can do case-insensitive searches, and so far we haven't come across any text that it fails to match.

Best Regards

Mike Gould
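
For reference, a minimal citext sketch. It assumes PostgreSQL 9.1 or later, where contrib modules are installed with CREATE EXTENSION (older releases load the module's SQL script instead); the table and column names are made up. The citext documentation notes that its case folding depends on the database's LC_CTYPE setting, so with LC_CTYPE = 'C' accented letters would likely run into the same limitation discussed in this thread:

```sql
CREATE EXTENSION IF NOT EXISTS citext;

-- Hypothetical table for illustration.
CREATE TABLE country_names (
    language_code text,
    name          citext
);

INSERT INTO country_names VALUES ('es', 'España');

-- citext compares case-insensitively without an explicit upper() or lower().
SELECT * FROM country_names WHERE name = 'ESPAÑA';
```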

 
