Обсуждение: ORDER BY different locales

Поиск
Список
Период
Сортировка

ORDER BY different locales

От
Karel Zak
Дата:
Hi,
a lot  of people  sometimes need  order same  data in  same DB  by moredifferent locales. For  example multi-language
webapplication  with DBin UTF-8.  It's  problem in PostgreSQL, because  PostgreSQL require setLC_COLLATE by initdb.
 
I  think possible  solution is  special function  used ORDER  BY clausewhich knows to switch by safe  way to wanted
locales,convert string bystrxfrm() and switch back to backend locales.
 
Is this  function interesting  for PostgreSQL  contrib or  main tree? Ithink it's very useful for a lot of users.  I
canprepare a patch.
 
Note, the original idea and patch is from Honza Pazdziora <adelton@informatics.muni.cz>.
For example, the Czech alphabet has between 'h' and 'i' letter 'ch':
# SHOW LC_COLLATE; lc_collate ------------    C
# SELECT data FROM str ORDER BY nls_string(data,'en_US'); data  ------- aaaa cccc chccc dddd hhhh iiii zzzz
# SELECT data FROM str ORDER BY nls_string(data,'cs_CZ'); data  ------- aaaa cccc dddd hhhh chccc iiii zzzz
The function returns result encoded in unsigned octal:
# SELECT nls_string('pg','en_US');    nls_string        -------------------------- 033022001010010001002002

Source:

static char *lc_collate_cache = NULL;

PG_FUNCTION_INFO_V1(nls_string);

Datum
nls_string(PG_FUNCTION_ARGS) 
{text *locale = PG_GETARG_TEXT_P(1);char *locale_str;int locale_len;text *txt = PG_GETARG_TEXT_P(0);char *txt_str;int
txt_len;text*txt_out;char *txt_tmp;size_t size = 0;size_t rest = 0;int i;
 
if ((VARSIZE(locale) - VARHDRSZ) <= 0 || (VARSIZE(txt) - VARHDRSZ) <= 0)    PG_RETURN_NULL();/* * Save original locale
setting*/if (!lc_collate_cache){    if ((lc_collate_cache = setlocale(LC_COLLATE, NULL)))        /* cached independent
onPostgreSQL mmgr */        lc_collate_cache = strdup(lc_collate_cache);}if (!lc_collate_cache)    elog(ERROR, "invalid
systemLC_COLLATE setting");
 
/* * Conversion to standard strings */locale_len = VARSIZE(locale) - VARHDRSZ;locale_str = palloc(locale_len +
1);memcpy(locale_str,VARDATA(locale), locale_len);*(locale_str + locale_len) = '\0';
 
txt_len = VARSIZE(txt) - VARHDRSZ;txt_str = palloc(txt_len + 1);memcpy(txt_str, VARDATA(txt), txt_len);*(txt_str +
txt_len)= '\0';
 
/* * Set wanted locale */if (!setlocale(LC_COLLATE, locale_str)){    setlocale(LC_COLLATE, lc_collate_cache);    /*
paranoid?*/    elog(ERROR, "invalid LC_COLLATE setting: %s", locale_str);}pfree(locale_str);/* * Text transformation
*/size= txt_len * 2;txt_tmp = palloc(size);memset(txt_tmp, 0, size);
 
rest = strxfrm(txt_tmp, txt_str, size) + 1;if (rest >= size) {    pfree(txt_tmp);    txt_tmp = palloc(rest);
memset(txt_tmp,0, rest);    rest = strxfrm(txt_tmp, txt_str, rest);}/* * Transformation to unsigned octal */txt_out =
(text*) palloc(3 * rest + VARHDRSZ);memset(txt_out, 0, 3 * rest + VARHDRSZ);for (i = 0; i < rest; i++) {
sprintf(VARDATA(txt_out)+ 3 * i, "%03o",        (int)(unsigned char)*(txt_tmp + i));}pfree(txt_tmp);
 
VARATT_SIZEP(txt_out) = 3 * rest + VARHDRSZ;
/* * Set original locale */if (!setlocale(LC_COLLATE, lc_collate_cache))    elog(ERROR, "invalid LC_COLLATE setting:
%s",lc_collate_cache);PG_RETURN_TEXT_P(txt_out);
 
}


-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: ORDER BY different locales

От
Tom Lane
Дата:
Karel Zak <zakkr@zf.jcu.cz> writes:
>  I  think possible  solution is  special function  used ORDER  BY clause
>  which knows to switch by safe  way to wanted locales, convert string by
>  strxfrm() and switch back to backend locales.

This function breaks the whole backend if an elog() failure occurs while
it's got the wrong locale set.  I believe it would also be remarkably
slow --- doesn't setlocale() involve reading a new locale definition
file from whereever those are stored?

I think the ultimate solution to our multi-locale problems will have to
involve abandoning the C library's support functions and writing locale
support that allows multiple locale-defining structures referenced by
pointers.  It's a big task though :-(.  Peter was looking at it awhile
back but I don't know how far he's gotten.
        regards, tom lane


Re: ORDER BY different locales

От
Karel Zak
Дата:
On Thu, Feb 26, 2004 at 09:16:03AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
> >  I  think possible  solution is  special function  used ORDER  BY clause
> >  which knows to switch by safe  way to wanted locales, convert string by
> >  strxfrm() and switch back to backend locales.
> 
> This function breaks the whole backend if an elog() failure occurs while
I  don't think  so. There  is setlocale()  to  original locales  beforeelog(). But important  is idea of  this
function.We can rewrite  it tofix some minor problems...
 

> it's got the wrong locale set.  I believe it would also be remarkably
> slow --- doesn't setlocale() involve reading a new locale definition
> file from whereever those are stored?
Yes, speed can be problem. I will test it. But I hope libc read localesone time  only. The common usage  is with
SELECTwhere you  apply samelocales to all lines of result.
 

> I think the ultimate solution to our multi-locale problems will have to
> involve abandoning the C library's support functions and writing locale
Yes, but I think nls_string() is nice solution for now. Butter than say"no way"... :-)
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: ORDER BY different locales

От
Greg Stark
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> This function breaks the whole backend if an elog() failure occurs while
> it's got the wrong locale set.  I believe it would also be remarkably
> slow --- doesn't setlocale() involve reading a new locale definition
> file from whereever those are stored?

I posted a similar function a while back using strxfrm and someone else
refined to eliminate a similar problem with elog(). I tested the speed under
glibc and it blazingly fast. I think the time difference was hardly even
noticeable over any other string function.

Certainly I expect there would be some platforms that would perform poorly,
but then there are lots of things various platforms do poorly. I don't think
that means postgres should reimplement everything itself for consistency. That
way lies Oracle.

-- 
greg



Re: ORDER BY different locales

От
Karel Zak
Дата:
On Thu, Feb 26, 2004 at 09:16:03AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
> >  I  think possible  solution is  special function  used ORDER  BY clause
> >  which knows to switch by safe  way to wanted locales, convert string by
> >  strxfrm() and switch back to backend locales.
>
> This function breaks the whole backend if an elog() failure occurs while

 Fixed by sigsetjmp(Warn_restart..). I hope it's more safe now.

> it's got the wrong locale set.  I believe it would also be remarkably
> slow --- doesn't setlocale() involve reading a new locale definition

 You're right, it's slow. But sometimes  is more important that it works
 and not all queries work with thousands records like my test below.

> I think the ultimate solution to our multi-locale problems will have to
> involve abandoning the C library's support functions and writing locale
> support that allows multiple locale-defining structures referenced by

 Agree. But as you said it's huge task and I think if it won't implement
 in 7.5 we  can add nls_string() to the  contrib tree. BTW, nls_string()
 is  "product" of  Czech database  list  where Oracle  users have  still
 problems with PostgreSQL ;-)

 Latest version:
 ftp://ftp2.zf.jcu.cz/users/zakkr/pg/postgresql-nls-string-0.52.tar.gz

 Note, I  add "CC:"  to pgsql-general, maybe  it's interesting  for some
 normal users too.

 Tests:

        # SELECT count(*) FROM nlstest;
         count
        --------
         100000

        # SELECT data FROM nlstest ORDER BY upper(data) DESC LIMIT 1;
        Time: 1213.87 ms


        # SELECT data FROM nlstest ORDER BY nls_string(data, 'en_US') LIMIT 1;
        Time: 4269.00 ms


    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/