Обсуждение: Sorting Problem

Поиск

Список

Период

Сортировка

Sorting Problem

От

"Tim Edwards"

Дата:

12 августа 2003 г., 18:58:51

Currently I'm running PostgreSQL 7.2.3 and having a problem sorting.

I've got two colums of data, one Int4 one Varchar.

When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-. I've checked for spaces or tabs or anything else odd and
cant find anything that would effect the sort.

63   RM-791
38   RM-AV2100
9   RM-PP401
10   RM-PP402
8   RM-PP404
3   RMT-D10
4   RMT-D108A
5   RMT-D109A
6   RMT-D116A
34   RMT-V402
40   RM-V10
41   RM-V11
45   RM-V12

Here's the sort string I'm using:

SELECT "remote_id", "remote_model" FROM "remote" ORDER BY "remote_model"
ASC;

Any idea or suggestions why it would do this ?

_________________________________________________________________
MSN 8 with e-mail virus protection service: 2 months FREE*
http://join.msn.com/?page=features/virus

Re: Sorting Problem

От

Tom Lane

Дата:

12 августа 2003 г., 19:15:25

"Tim Edwards" <mor4321@hotmail.com> writes:
> When I sort ASC on the varchar I get some strange results. Here a section of
> data cut after running a sort. It starts with RM- then does RMT- Then goes
> back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

            regards, tom lane

Re: Sorting Problem

От

Kathy zhu

Дата:

12 августа 2003 г., 19:30:41

Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??

thanks,
kathy

Tom Lane wrote:
> "Tim Edwards" <mor4321@hotmail.com> writes:
>
>>When I sort ASC on the varchar I get some strange results. Here a section of
>>data cut after running a sort. It starts with RM- then does RMT- Then goes
>>back for more RM-.
>
>
> Sounds like you're in en_US locale, or at least something other than C
> locale.
>
> Unfortunately this can only be fixed by re-initdb'ing :-(
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)


--

Re: Sorting Problem

От

Dennis Gearon

Дата:

12 августа 2003 г., 19:34:31

To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running
witha different LOCALE than the machine's? 

Tom Lane wrote:

> "Tim Edwards" <mor4321@hotmail.com> writes:
>
>>When I sort ASC on the varchar I get some strange results. Here a section of
>>data cut after running a sort. It starts with RM- then does RMT- Then goes
>>back for more RM-.
>
>
> Sounds like you're in en_US locale, or at least something other than C
> locale.
>
> Unfortunately this can only be fixed by re-initdb'ing :-(
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

Re: Sorting Problem

От

"Maksim Likharev"

Дата:

12 августа 2003 г., 19:35:44

en_US locale skips? punctuation from sorting index,
so in your case

RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402

==>

RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem


Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??

thanks,
kathy

Tom Lane wrote:
> "Tim Edwards" <mor4321@hotmail.com> writes:
>
>>When I sort ASC on the varchar I get some strange results. Here a
section of
>>data cut after running a sort. It starts with RM- then does RMT- Then
goes
>>back for more RM-.
>
>
> Sounds like you're in en_US locale, or at least something other than C
> locale.
>
> Unfortunately this can only be fixed by re-initdb'ing :-(
>
>             regards, tom lane
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to
majordomo@postgresql.org)


--



---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if
your
      joining column's datatypes do not match

Re: Sorting Problem

От

Tom Lane

Дата:

12 августа 2003 г., 19:47:59

Dennis Gearon <gearond@cvc.net> writes:
> To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running
witha different LOCALE than the machine's? 

Yeah, he needs to set LANG and/or LC_ALL to be 'C' instead of whatever
it's defaulting to on his machine.  Then run initdb with that
environment, and he's set.

            regards, tom lane

Re: Sorting Problem

От

Tom Lane

Дата:

12 августа 2003 г., 19:48:43

Kathy zhu <Kathy.zhu@Sun.COM> writes:
> Do you mean that soring doesn't work for en_US locale ???

Oh it works all right, it just doesn't agree with Tim's idea of what
sorted order is ;-)

            regards, tom lane

Re: Sorting Problem

От

Kathy zhu

Дата:

12 августа 2003 г., 20:08:02

If it skips "-", then RMT-* should come before RM-V*, but they don't, why  ??


Maksim Likharev wrote:
> en_US locale skips? punctuation from sorting index,
> so in your case
>
> RM-791
> RM-AV2100
> RM-PP401
> RM-PP402
> RM-PP404
> RM-V10
> RM-V11
> RM-V12
> RMT-D10
> RMT-D108A
> RMT-D109A
> RMT-D116A
> RMT-V402
>
> ==>
>
> RM791
> RMAV2100
> RMPP401
> RMPP402
> RMPP404
> RMV10
> RMV11
> RMV12
> RMTD10
> RMTD108A
> RMTD109A
> RMTD116A
> RMTV402
>
> -----Original Message-----
> From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
> Sent: Tuesday, August 12, 2003 3:30 PM
> To: Tom Lane
> Cc: Tim Edwards; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Sorting Problem
>
>
> Do you mean that soring doesn't work for en_US locale ???
> And, does encoding affect sorting at all ??
>
> thanks,
> kathy
>
> Tom Lane wrote:
>
>>"Tim Edwards" <mor4321@hotmail.com> writes:
>>
>>
>>>When I sort ASC on the varchar I get some strange results. Here a
>>
> section of
>
>>>data cut after running a sort. It starts with RM- then does RMT- Then
>>
> goes
>
>>>back for more RM-.
>>
>>
>>Sounds like you're in en_US locale, or at least something other than C
>>locale.
>>
>>Unfortunately this can only be fixed by re-initdb'ing :-(
>>
>>            regards, tom lane
>>
>>---------------------------(end of
>
> broadcast)---------------------------
>
>>TIP 2: you can get off all lists at once with the unregister command
>>    (send "unregister YourEmailAddressHere" to
>
> majordomo@postgresql.org)
>
>


--

Re: Sorting Problem

От

Dennis Gearon

Дата:

12 августа 2003 г., 20:17:41

You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or
'LANG/LC_ALL'for Posgres specifically? 

Tom Lane wrote:

> Dennis Gearon <gearond@cvc.net> writes:
>
>>To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running
witha different LOCALE than the machine's? 
>
>
> Yeah, he needs to set LANG and/or LC_ALL to be 'C' instead of whatever
> it's defaulting to on his machine.  Then run initdb with that
> environment, and he's set.
>
>             regards, tom lane
>

Re: Sorting Problem

От

Jeffrey Melloy

Дата:

12 августа 2003 г., 20:28:38

It does if you look at the original email.  Maksim must've just
transposed a couple letters when he was writing his demo.

Jeff

Kathy zhu wrote:

> If it skips "-", then RMT-* should come before RM-V*, but they don't,
> why  ??
>
>
> Maksim Likharev wrote:
>
>> en_US locale skips? punctuation from sorting index,
>> so in your case
>>
>> RM-791
>> RM-AV2100
>> RM-PP401
>> RM-PP402
>> RM-PP404
>> RM-V10
>> RM-V11
>> RM-V12
>> RMT-D10
>> RMT-D108A
>> RMT-D109A
>> RMT-D116A
>> RMT-V402
>>
>> ==>
>>
>> RM791
>> RMAV2100
>> RMPP401
>> RMPP402
>> RMPP404
>> RMV10
>> RMV11
>> RMV12
>> RMTD10
>> RMTD108A
>> RMTD109A
>> RMTD116A
>> RMTV402
>>
>> -----Original Message-----
>> From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
>> Sent: Tuesday, August 12, 2003 3:30 PM
>> To: Tom Lane
>> Cc: Tim Edwards; pgsql-general@postgresql.org
>> Subject: Re: [GENERAL] Sorting Problem
>>
>>
>> Do you mean that soring doesn't work for en_US locale ???
>> And, does encoding affect sorting at all ??
>>
>> thanks,
>> kathy
>>
>> Tom Lane wrote:
>>
>>> "Tim Edwards" <mor4321@hotmail.com> writes:
>>>
>>>
>>>> When I sort ASC on the varchar I get some strange results. Here a
>>>
>>>
>> section of
>>
>>>> data cut after running a sort. It starts with RM- then does RMT- Then
>>>
>>>
>> goes
>>
>>>> back for more RM-.
>>>
>>>
>>>
>>> Sounds like you're in en_US locale, or at least something other than C
>>> locale.
>>>
>>> Unfortunately this can only be fixed by re-initdb'ing :-(
>>>
>>>             regards, tom lane
>>>
>>> ---------------------------(end of
>>
>>
>> broadcast)---------------------------
>>
>>> TIP 2: you can get off all lists at once with the unregister command
>>>    (send "unregister YourEmailAddressHere" to
>>
>>
>> majordomo@postgresql.org)
>>
>>

Re: Sorting Problem

От

Dennis Gearon

Дата:

12 августа 2003 г., 20:28:47

Danke, Spacibo, gracias, thanks.

Tom Lane wrote:

> Dennis Gearon <gearond@cvc.net> writes:
>
>>You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or
'LANG/LC_ALL'for Posgres specifically? 
>
>
> I mean he needs to run initdb with C as the selected locale.  It has
> nothing to do with what environment his other programs run in.
>
>             regards, tom lane
>

Re: Sorting Problem

От

Tom Lane

Дата:

12 августа 2003 г., 20:29:16

Dennis Gearon <gearond@cvc.net> writes:
> You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or
'LANG/LC_ALL'for Posgres specifically? 

I mean he needs to run initdb with C as the selected locale.  It has
nothing to do with what environment his other programs run in.

            regards, tom lane

Re: Sorting Problem

От

"Maksim Likharev"

Дата:

12 августа 2003 г., 20:33:55

and they come:

sorted asc:
 RM-791
 RM-AV2100
 RM-PP401
 RM-PP402
 RM-PP404
 RMT-D10
 RMT-D108A
 RMT-D109A
 RMT-D116A
 RMT-V402
 RM-V10
 RM-V11
 RM-V12

sorted asc:
 RM791
 RMAV2100
 RMPP401
 RMPP402
 RMPP404
 RMTD10
 RMTD108A
 RMTD109A
 RMTD116A
 RMTV402
 RMV10
 RMV11
 RMV12

I think I copied and pasted wrogn, sorry.

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:08 PM
To: Maksim Likharev
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem


If it skips "-", then RMT-* should come before RM-V*, but they don't,
why  ??


Maksim Likharev wrote:
> en_US locale skips? punctuation from sorting index,
> so in your case
>
> RM-791
> RM-AV2100
> RM-PP401
> RM-PP402
> RM-PP404
> RM-V10
> RM-V11
> RM-V12
> RMT-D10
> RMT-D108A
> RMT-D109A
> RMT-D116A
> RMT-V402
>
> ==>
>
> RM791
> RMAV2100
> RMPP401
> RMPP402
> RMPP404
> RMV10
> RMV11
> RMV12
> RMTD10
> RMTD108A
> RMTD109A
> RMTD116A
> RMTV402
>
> -----Original Message-----
> From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
> Sent: Tuesday, August 12, 2003 3:30 PM
> To: Tom Lane
> Cc: Tim Edwards; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Sorting Problem
>
>
> Do you mean that soring doesn't work for en_US locale ???
> And, does encoding affect sorting at all ??
>
> thanks,
> kathy
>
> Tom Lane wrote:
>
>>"Tim Edwards" <mor4321@hotmail.com> writes:
>>
>>
>>>When I sort ASC on the varchar I get some strange results. Here a
>>
> section of
>
>>>data cut after running a sort. It starts with RM- then does RMT- Then
>>
> goes
>
>>>back for more RM-.
>>
>>
>>Sounds like you're in en_US locale, or at least something other than C
>>locale.
>>
>>Unfortunately this can only be fixed by re-initdb'ing :-(
>>
>>            regards, tom lane
>>
>>---------------------------(end of
>
> broadcast)---------------------------
>
>>TIP 2: you can get off all lists at once with the unregister command
>>    (send "unregister YourEmailAddressHere" to
>
> majordomo@postgresql.org)
>
>


--

Re: Sorting Problem

От

Kathy zhu

Дата:

12 августа 2003 г., 20:43:28

This brings up another question:

Say initdb with en_US locale, and we have localized strings for different
languages store in the db.

If we have a client in Germany, and want to see the text sorted in german. I
mean that we want do db soring for german strings and display the result in the
browser. How are we going to handle that ??

thanks,
kathy

Dennis Gearon wrote:
> Danke, Spacibo, gracias, thanks.
>
> Tom Lane wrote:
>
>> Dennis Gearon <gearond@cvc.net> writes:
>>
>>> You mean in his own local environment? So all his programs, console
>>> operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
>>> Posgres specifically?
>>
>>
>>
>> I mean he needs to run initdb with C as the selected locale.  It has
>> nothing to do with what environment his other programs run in.
>>
>>             regards, tom lane
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

--

Re: Sorting Problem

От

"Maksim Likharev"

Дата:

12 августа 2003 г., 20:54:41

If you are talking about everything that lies under so called LATIN-1 (
ISO-8859-1 )
en_US encapsulates ( at least suppose to ) all those sorting rules,
do not remember about accents tho.

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:43 PM
To: gearond@cvc.net
Cc: Tom Lane; Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

This brings up another question:

Say initdb with en_US locale, and we have localized strings for
different
languages store in the db.

If we have a client in Germany, and want to see the text sorted in
german. I
mean that we want do db soring for german strings and display the result
in the
browser. How are we going to handle that ??

thanks,
kathy

Dennis Gearon wrote:
> Danke, Spacibo, gracias, thanks.
>
> Tom Lane wrote:
>
>> Dennis Gearon <gearond@cvc.net> writes:
>>
>>> You mean in his own local environment? So all his programs, console
>>> operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
>>> Posgres specifically?
>>
>>
>>
>> I mean he needs to run initdb with C as the selected locale.  It has
>> nothing to do with what environment his other programs run in.
>>
>>             regards, tom lane
>>
>
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

--

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

               http://archives.postgresql.org

Re: Sorting Problem

От

Kathy

Дата:

13 августа 2003 г., 02:10:08

Does including sorting rules for asian chars too ??
Do you know where I can find docuemt on this subject ??
thanks,
kathy

Maksim Likharev wrote:

>If you are talking about everything that lies under so called LATIN-1 (
>ISO-8859-1 )
>en_US encapsulates ( at least suppose to ) all those sorting rules,
>do not remember about accents tho.
>
>
>-----Original Message-----
>From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
>Sent: Tuesday, August 12, 2003 4:43 PM
>To: gearond@cvc.net
>Cc: Tom Lane; Tim Edwards; pgsql-general@postgresql.org
>Subject: Re: [GENERAL] Sorting Problem
>
>
>This brings up another question:
>
>Say initdb with en_US locale, and we have localized strings for
>different
>languages store in the db.
>
>If we have a client in Germany, and want to see the text sorted in
>german. I
>mean that we want do db soring for german strings and display the result
>in the
>browser. How are we going to handle that ??
>
>thanks,
>kathy
>
>
>
>
>Dennis Gearon wrote:
>
>
>>Danke, Spacibo, gracias, thanks.
>>
>>Tom Lane wrote:
>>
>>
>>
>>>Dennis Gearon <gearond@cvc.net> writes:
>>>
>>>
>>>
>>>>You mean in his own local environment? So all his programs, console
>>>>operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
>>>>Posgres specifically?
>>>>
>>>>
>>>
>>>I mean he needs to run initdb with C as the selected locale.  It has
>>>nothing to do with what environment his other programs run in.
>>>
>>>            regards, tom lane
>>>
>>>
>>>
>>---------------------------(end of
>>
>>
>broadcast)---------------------------
>
>
>>TIP 7: don't forget to increase your free space map settings
>>
>>
>
>
>
>

Re: Sorting Problem

От

Dennis Björklund

Дата:

13 августа 2003 г., 06:00:51

On Tue, 12 Aug 2003, Maksim Likharev wrote:

> If you are talking about everything that lies under so called LATIN-1 (
> ISO-8859-1 ) en_US encapsulates ( at least suppose to ) all those
> sorting rules, do not remember about accents tho.

It does not work like that. Different countries in europe that all use
latin1 use different sort order for the same characters. There is no way
to have one set of rules work for all. If we take swedish for example,
then v and w is treatead as the same letter when sorting. So in a
dictionary all words beginning with v or w is mixed together.

In the future we need indexes that depend on the locale (and a lot of
other changes).

--
/Dennis

Re: Sorting Problem

От

"Tim Edwards"

Дата:

13 августа 2003 г., 10:32:20

I was afraid of something like that. I found a sort term fix by using the
following:

select remote_model,
translate(remote_model, '-', 'a')
from remote
order by translate(remote_model, '-', 'a') asc

But it's an ugly answer to this, but it works for now. Guess I'm going to be
readin up on Locales

_________________________________________________________________
Help STOP SPAM with the new MSN 8 and get 2 months FREE*
http://join.msn.com/?page=features/junkmail

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 13:05:24

Dennis Björklund wrote:

> In the future we need indexes that depend on the locale (and a lot of other changes).
>

I agree. I've been looking at the web on this subject a lot lately. I am **NOT** a microslop fan, but SQL-SERVER even
letsa user define a language(maybe encoding) down to the column level!

I've been reading on GNU-C and on languages, encoding, and localization.

http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html

There are three basic approaches to doing different langauges in computerized text:

A/ various adaptations of the 8 bit character set, I.E. the ISO-8859-x series.
One byte per character.
Easy storing, small size for a string.
Easy storing, if english characters, 100% efficient use of storage space.
Easy processing between applications, works well in the stream model of *nix
Easy processing in applications, a byte is a character.
Easy string handling, NOY NULL bytes in a string, except end of string.
NOT easy to know encoding from inherently in the document.
This is not the way of the future.

B/ wide characters
UTF16, UTF32, SHIFT-JIS-16, others
each character the same width, 2 or 4 bytes (2 bytes handles 99% of all languages)
Not so easy storing, if english characters, 50% to 75% loss of storage space.
Difficult processing between applications, does NOT work well in the stream model of *nix
Easy processing in applications, a set width of bits/bytes is a character.
Difficult string handling, MANY NULL bytes in a string, especially if in English.
Moderately easy to tell encoding/language in the document.
********This should be how Postgress stores data internally.********

C/ Multibyte characters
UTF8
variable width for different characters 1-5
Not so easy storing, if non english characters, 50% to 80% loss of storage space,
(in reality, most common western languages hover aournd 5-20% loss of storage space
most common non western languages hover aournd 40-60%% loss of storage space)
Easy processing between applications, works well in the stream model of *nix
Difficult processing in applications, a variable number of bytes is a character.
Easy string string handling, ONE NULL byte in a string.
Moderately easy to tell encoding/language in the document.
********This is how Postgress should default to sending data OUT of the application,
i.e. to the display or the web, or other system applications********

Re: Sorting Problem

От

Stephan Szabo

Дата:

13 августа 2003 г., 13:31:43

On Wed, 13 Aug 2003, Dennis Gearon wrote:

> Dennis Bj�rklund wrote:
>
> > In the future we need indexes that depend on the locale (and a lot of other changes).
> >
>
> I agree. I've been looking at the web on this subject a lot lately. I
> am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
> language(maybe encoding) down to the column level!
>
> I've been reading on GNU-C and on languages, encoding, and localization.
>
> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>
>
> There are three basic approaches to doing different langauges in computerized text:
>
>     A/ various adaptations of the 8 bit character set, I.E. the ISO-8859-x series.
>     B/ wide characters
>     ********This should be how Postgress stores data internally.********
>     C/ Multibyte characters
>     ********This is how Postgress should default to sending data OUT of the application,
>             i.e. to the display or the web, or other system applications********

SQL has a system for defining character set specifications, collations and
such (per column/literal in some cases).  We should probably look at it
before making decisions on how to do things.

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 13:48:32

Got a link to that section of the standard, or better yet, to a 'interpreted' version of the standard? :-)

Stephan Szabo wrote:

> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>
>
>>Dennis Bj?rklund wrote:
>>
>>
>>>In the future we need indexes that depend on the locale (and a lot of other changes).
>>>
>>
>>I agree. I've been looking at the web on this subject a lot lately. I
>>am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>language(maybe encoding) down to the column level!
>>
>>I've been reading on GNU-C and on languages, encoding, and localization.
>>
>>http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>
>>
>>There are three basic approaches to doing different langauges in computerized text:
>>
>>    A/ various adaptations of the 8 bit character set, I.E. the ISO-8859-x series.
>>    B/ wide characters
>>    ********This should be how Postgress stores data internally.********
>>    C/ Multibyte characters
>>    ********This is how Postgress should default to sending data OUT of the application,
>>            i.e. to the display or the web, or other system applications********
>
>
> SQL has a system for defining character set specifications, collations and
> such (per column/literal in some cases).  We should probably look at it
> before making decisions on how to do things.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>

Re: Sorting Problem

От

"Maksim Likharev"

Дата:

13 августа 2003 г., 14:05:09

Not only SQL Server has all those possible cases for collate,
but it does work very well.

Assigning collate/encoding per column probably too much ( I never used
it in my life )
but assigning collate/encoding per table very much helpful.

Case insensitive collation another very useful thing, which is on by
default ( in SQL Server )
and I really missing it in postgres, although I might probably initdb
with some wacky
case insensitive collate but that's more theoretical then practical.

And of cause on top of that, ability to store UTF-16 data in a column (
build in data type )
independent of db collate would be priceless which SQL Server allow to
do.
Not much of dithyrambs for SQL Server but rather reality.

Thanks.

-----Original Message-----
From: Stephan Szabo [mailto:sszabo@megazone.bigpanda.com]
Sent: Wednesday, August 13, 2003 9:32 AM
To: Dennis Gearon
Cc: Dennis Björklund; Maksim Likharev; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

On Wed, 13 Aug 2003, Dennis Gearon wrote:

> Dennis Björklund wrote:
>
> > In the future we need indexes that depend on the locale (and a lot
of other changes).
> >
>
> I agree. I've been looking at the web on this subject a lot lately. I
> am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
> language(maybe encoding) down to the column level!
>
> I've been reading on GNU-C and on languages, encoding, and
localization.
>
>
http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-
1.html
>
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2
366,1222,00.html
>
>
> There are three basic approaches to doing different langauges in
computerized text:
>
>     A/ various adaptations of the 8 bit character set, I.E. the
ISO-8859-x series.
>     B/ wide characters
>     ********This should be how Postgress stores data
internally.********
>     C/ Multibyte characters
>     ********This is how Postgress should default to sending data OUT
of the application,
>             i.e. to the display or the web, or other system
applications********

SQL has a system for defining character set specifications, collations
and
such (per column/literal in some cases).  We should probably look at it
before making decisions on how to do things.

Re: Sorting Problem

От

Stephan Szabo

Дата:

13 августа 2003 г., 14:06:04

On Wed, 13 Aug 2003, Dennis Gearon wrote:

> Got a link to that section of the standard, or better yet, to a
> 'interpreted' version of the standard? :-)

The standard draft yes, an interpreted version, unfortunately not (unless
Date's book covers it and I can find where my copy is.

Here are some of the highlights

----
         k) form-of-use: A convention (or encoding) for representing
characters (in character strings). Some forms-of-use are
fixed-length codings and others are variable-length codings.

         l) form-of-use conversion: A method of converting character
strings from one form-of-use to another form-of-use.

----
         A character set is described by a character set descriptor. A
character set descriptor includes:

         -  the name of the character set or character repertoire,

         -  if the character set is a character repertoire, then the name of
            the form-of-use,

         -  an indication of what characters are in the character set, and

         -  the name of the default collation of the character set.

         For every character set, there is at least one collation. A
collation is described by a collation descriptor. A collation descriptor
         includes:

         -  the name of the collation,

         -  the name of the character set on which the collation operates,

         -  whether the collation has the NO PAD or the PAD SPACE attribute,
            and

         -  an indication of how the collation is performed.

---

The character data types and literals can include a character set
definition. Character type columns can include a collation.  There's a
COLLATE <blah> clause that looks like it can be used in expressions as
well.

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 14:27:08

If we really want to do it, we won't have to do the grunt work ourselves, just the tie-in, and Postgres specific
implementation:

    http://www.unicode.org/reports/tr10/

Stephan Szabo wrote:

> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>
>
>>Got a link to that section of the standard, or better yet, to a
>>'interpreted' version of the standard? :-)
>
>
> The standard draft yes, an interpreted version, unfortunately not (unless
> Date's book covers it and I can find where my copy is.
>
> Here are some of the highlights
>
> ----
>          k) form-of-use: A convention (or encoding) for representing
> characters (in character strings). Some forms-of-use are
> fixed-length codings and others are variable-length codings.
>
>          l) form-of-use conversion: A method of converting character
> strings from one form-of-use to another form-of-use.
>
> ----
>          A character set is described by a character set descriptor. A
> character set descriptor includes:
>
>          -  the name of the character set or character repertoire,
>
>          -  if the character set is a character repertoire, then the name of
>             the form-of-use,
>
>          -  an indication of what characters are in the character set, and
>
>          -  the name of the default collation of the character set.
>
>          For every character set, there is at least one collation. A
> collation is described by a collation descriptor. A collation descriptor
>          includes:
>
>          -  the name of the collation,
>
>          -  the name of the character set on which the collation operates,
>
>          -  whether the collation has the NO PAD or the PAD SPACE attribute,
>             and
>
>          -  an indication of how the collation is performed.
>
> ---
>
> The character data types and literals can include a character set
> definition. Character type columns can include a collation.  There's a
> COLLATE <blah> clause that looks like it can be used in expressions as
> well.
>
>

Re: Sorting Problem

От

Gianni Mariani

Дата:

13 августа 2003 г., 14:30:57

Dennis Gearon wrote:

> Got a link to that section of the standard, or better yet, to a
> 'interpreted' version of the standard? :-)
>
> Stephan Szabo wrote:
>
>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>
>>
>>> Dennis Bj?rklund wrote:
>>>
>>>
>>>> In the future we need indexes that depend on the locale (and a lot
>>>> of other changes).
>>>>
>>>
>>> I agree. I've been looking at the web on this subject a lot lately. I
>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>> language(maybe encoding) down to the column level!
>>>
>>> I've been reading on GNU-C and on languages, encoding, and
>>> localization.
>>>
>>> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>>
>>> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>>
>>>
>>>
>>> There are three basic approaches to doing different langauges in
>>> computerized text:
>>>
>>>    A/ various adaptations of the 8 bit character set, I.E. the
>>> ISO-8859-x series.
>>>    B/ wide characters
>>>    ********This should be how Postgress stores data internally.********
>>>    C/ Multibyte characters
>>>    ********This is how Postgress should default to sending data OUT
>>> of the application,
>>>            i.e. to the display or the web, or other system
>>> applications********
>>
>>
>>
>> SQL has a system for defining character set specifications,
>> collations and
>> such (per column/literal in some cases).  We should probably look at it
>> before making decisions on how to do things.
>

I thought UNIX (SCOTM) systems also had a way of being able to define
collation order.

see:
    ftp://dkuug.dk/i18n/WG15-collection/locales

for a collection of all ISO standardized locales (the WG15 ISO work
group's stuff).

Do a "man localedef" on most Linuxen or UNIXen.

As for wide characters vs multibyte, there is no clear winner.  The
right answer DEPENDS on the situation.

Wide characters on some platforms are 16 bit which means that when you
do Unicode you'll still have problems with surrogate pairs (meaning that
it's still multi (wide) char) so you still have all the problems of
multi-byte encodings.

You could decide to process everything in a PG specific 4 byte wide char
and do all text in Unicode but the overhead in processing 4 times the
data is quite significant.  The other option is to store all data in
utf-8 and have all text code become utf-8 aware.

I have found in practice that the utf-8 option is significantly easier
to implement, 100% Unicode compliant and the best performer (because of
reduced memory requirements).

The Posix API's for locales are not very good for modern day programs,
I'm not sure where the "mbr*" and the "wcr*" apis are in the
standardization process but if these are not well supported, you're on
your own and will need to implement similar functionality from scratch
and for that matter, the collation functions all operate on a "current"
locate which is really difficult to work with on multi-locale applications.

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 14:56:19

total agreement, ?????? (Maksim).

Maksim Likharev wrote:

> Not only SQL Server has all those possible cases for collate,
> but it does work very well.
>
> Assigning collate/encoding per column probably too much ( I never used
> it in my life )
> but assigning collate/encoding per table very much helpful.
>
> Case insensitive collation another very useful thing, which is on by
> default ( in SQL Server )
> and I really missing it in postgres, although I might probably initdb
> with some wacky
> case insensitive collate but that's more theoretical then practical.
>
> And of cause on top of that, ability to store UTF-16 data in a column (
> build in data type )
> independent of db collate would be priceless which SQL Server allow to
> do.
> Not much of dithyrambs for SQL Server but rather reality.
>
> Thanks.
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo@megazone.bigpanda.com]
> Sent: Wednesday, August 13, 2003 9:32 AM
> To: Dennis Gearon
> Cc: Dennis Björklund; Maksim Likharev; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Sorting Problem
>
>
>
> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>
>
>>Dennis Björklund wrote:
>>
>>
>>>In the future we need indexes that depend on the locale (and a lot
>
> of other changes).
>
>>I agree. I've been looking at the web on this subject a lot lately. I
>>am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>language(maybe encoding) down to the column level!
>>
>>I've been reading on GNU-C and on languages, encoding, and
>
> localization.
>
>>
> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-
> 1.html
>
> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2
> 366,1222,00.html
>
>>
>>There are three basic approaches to doing different langauges in
>
> computerized text:
>
>>    A/ various adaptations of the 8 bit character set, I.E. the
>
> ISO-8859-x series.
>
>>    B/ wide characters
>>    ********This should be how Postgress stores data
>
> internally.********
>
>>    C/ Multibyte characters
>>    ********This is how Postgress should default to sending data OUT
>
> of the application,
>
>>            i.e. to the display or the web, or other system
>
> applications********
>
> SQL has a system for defining character set specifications, collations
> and
> such (per column/literal in some cases).  We should probably look at it
> before making decisions on how to do things.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match
>

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 15:00:32

I agree with all of that except for one caveat:

    all my reading, and just general off the cuff thinking, says that processing variable width characters
SIGNIFICANTLYslows an application. It seems better to PROCESS fixed width characters (1,2,4 byte), and TRANSMIT
variablewidth characters (avoiding the null problem.) 

Gianni Mariani wrote:

> Dennis Gearon wrote:
>
>> Got a link to that section of the standard, or better yet, to a
>> 'interpreted' version of the standard? :-)
>>
>> Stephan Szabo wrote:
>>
>>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>>
>>>
>>>> Dennis Bj?rklund wrote:
>>>>
>>>>
>>>>> In the future we need indexes that depend on the locale (and a lot
>>>>> of other changes).
>>>>>
>>>>
>>>> I agree. I've been looking at the web on this subject a lot lately. I
>>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>>> language(maybe encoding) down to the column level!
>>>>
>>>> I've been reading on GNU-C and on languages, encoding, and
>>>> localization.
>>>>
>>>> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>>>
>>>> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>>>
>>>>
>>>>
>>>> There are three basic approaches to doing different langauges in
>>>> computerized text:
>>>>
>>>>    A/ various adaptations of the 8 bit character set, I.E. the
>>>> ISO-8859-x series.
>>>>    B/ wide characters
>>>>    ********This should be how Postgress stores data internally.********
>>>>    C/ Multibyte characters
>>>>    ********This is how Postgress should default to sending data OUT
>>>> of the application,
>>>>            i.e. to the display or the web, or other system
>>>> applications********
>>>
>>>
>>>
>>>
>>> SQL has a system for defining character set specifications,
>>> collations and
>>> such (per column/literal in some cases).  We should probably look at it
>>> before making decisions on how to do things.
>>
>>
>
> I thought UNIX (SCOTM) systems also had a way of being able to define
> collation order.
>
> see:
>    ftp://dkuug.dk/i18n/WG15-collection/locales
>
> for a collection of all ISO standardized locales (the WG15 ISO work
> group's stuff).
>
> Do a "man localedef" on most Linuxen or UNIXen.
>
> As for wide characters vs multibyte, there is no clear winner.  The
> right answer DEPENDS on the situation.
>
> Wide characters on some platforms are 16 bit which means that when you
> do Unicode you'll still have problems with surrogate pairs (meaning that
> it's still multi (wide) char) so you still have all the problems of
> multi-byte encodings.
>
> You could decide to process everything in a PG specific 4 byte wide char
> and do all text in Unicode but the overhead in processing 4 times the
> data is quite significant.  The other option is to store all data in
> utf-8 and have all text code become utf-8 aware.
>
> I have found in practice that the utf-8 option is significantly easier
> to implement, 100% Unicode compliant and the best performer (because of
> reduced memory requirements).
> The Posix API's for locales are not very good for modern day programs,
> I'm not sure where the "mbr*" and the "wcr*" apis are in the
> standardization process but if these are not well supported, you're on
> your own and will need to implement similar functionality from scratch
> and for that matter, the collation functions all operate on a "current"
> locate which is really difficult to work with on multi-locale applications.
>
>
>
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

Re: Sorting Problem

От

Bruce Momjian

Дата:

13 августа 2003 г., 15:14:00

I think the question is how often are you passing data around/storing it
_in_ your application and how often are you processing it.

---------------------------------------------------------------------------

Dennis Gearon wrote:
> I agree with all of that except for one caveat:
>
>     all my reading, and just general off the cuff thinking, says that processing variable width characters
SIGNIFICANTLYslows an application. It seems better to PROCESS fixed width characters (1,2,4 byte), and TRANSMIT
variablewidth characters (avoiding the null problem.) 
>
> Gianni Mariani wrote:
>
> > Dennis Gearon wrote:
> >
> >> Got a link to that section of the standard, or better yet, to a
> >> 'interpreted' version of the standard? :-)
> >>
> >> Stephan Szabo wrote:
> >>
> >>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
> >>>
> >>>
> >>>> Dennis Bj?rklund wrote:
> >>>>
> >>>>
> >>>>> In the future we need indexes that depend on the locale (and a lot
> >>>>> of other changes).
> >>>>>
> >>>>
> >>>> I agree. I've been looking at the web on this subject a lot lately. I
> >>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
> >>>> language(maybe encoding) down to the column level!
> >>>>
> >>>> I've been reading on GNU-C and on languages, encoding, and
> >>>> localization.
> >>>>
> >>>> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
> >>>>
> >>>> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
> >>>>
> >>>>
> >>>>
> >>>> There are three basic approaches to doing different langauges in
> >>>> computerized text:
> >>>>
> >>>>    A/ various adaptations of the 8 bit character set, I.E. the
> >>>> ISO-8859-x series.
> >>>>    B/ wide characters
> >>>>    ********This should be how Postgress stores data internally.********
> >>>>    C/ Multibyte characters
> >>>>    ********This is how Postgress should default to sending data OUT
> >>>> of the application,
> >>>>            i.e. to the display or the web, or other system
> >>>> applications********
> >>>
> >>>
> >>>
> >>>
> >>> SQL has a system for defining character set specifications,
> >>> collations and
> >>> such (per column/literal in some cases).  We should probably look at it
> >>> before making decisions on how to do things.
> >>
> >>
> >
> > I thought UNIX (SCOTM) systems also had a way of being able to define
> > collation order.
> >
> > see:
> >    ftp://dkuug.dk/i18n/WG15-collection/locales
> >
> > for a collection of all ISO standardized locales (the WG15 ISO work
> > group's stuff).
> >
> > Do a "man localedef" on most Linuxen or UNIXen.
> >
> > As for wide characters vs multibyte, there is no clear winner.  The
> > right answer DEPENDS on the situation.
> >
> > Wide characters on some platforms are 16 bit which means that when you
> > do Unicode you'll still have problems with surrogate pairs (meaning that
> > it's still multi (wide) char) so you still have all the problems of
> > multi-byte encodings.
> >
> > You could decide to process everything in a PG specific 4 byte wide char
> > and do all text in Unicode but the overhead in processing 4 times the
> > data is quite significant.  The other option is to store all data in
> > utf-8 and have all text code become utf-8 aware.
> >
> > I have found in practice that the utf-8 option is significantly easier
> > to implement, 100% Unicode compliant and the best performer (because of
> > reduced memory requirements).
> > The Posix API's for locales are not very good for modern day programs,
> > I'm not sure where the "mbr*" and the "wcr*" apis are in the
> > standardization process but if these are not well supported, you're on
> > your own and will need to implement similar functionality from scratch
> > and for that matter, the collation functions all operate on a "current"
> > locate which is really difficult to work with on multi-locale applications.
> >
> >
> >
> >
> >
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 8: explain analyze is your friend
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Sorting Problem

От

"Maksim Likharev"

Дата:

13 августа 2003 г., 15:15:13

Agreement about what,
necessity of collate per table and Unicode build in data type, hail
yeah...
but that's really change nothing unless I get big shovel and start
"digging on" by my own
producing in reasonable period of time something...
and even for that, nobody guarantee that my exercise will be even
considered by core team...
So guess not really an answer


-----Original Message-----
From: Dennis Gearon [mailto:gearond@cvc.net]
Sent: Wednesday, August 13, 2003 11:00 AM
To: Maksim Likharev
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem


total agreement, ?????? (Maksim).

Maksim Likharev wrote:

> Not only SQL Server has all those possible cases for collate,
> but it does work very well.
>
> Assigning collate/encoding per column probably too much ( I never used
> it in my life )
> but assigning collate/encoding per table very much helpful.
>
> Case insensitive collation another very useful thing, which is on by
> default ( in SQL Server )
> and I really missing it in postgres, although I might probably initdb
> with some wacky
> case insensitive collate but that's more theoretical then practical.
>
> And of cause on top of that, ability to store UTF-16 data in a column
(
> build in data type )
> independent of db collate would be priceless which SQL Server allow to
> do.
> Not much of dithyrambs for SQL Server but rather reality.
>
> Thanks.
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo@megazone.bigpanda.com]
> Sent: Wednesday, August 13, 2003 9:32 AM
> To: Dennis Gearon
> Cc: Dennis Björklund; Maksim Likharev; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Sorting Problem
>
>
>
> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>
>
>>Dennis Björklund wrote:
>>
>>
>>>In the future we need indexes that depend on the locale (and a lot
>
> of other changes).
>
>>I agree. I've been looking at the web on this subject a lot lately. I
>>am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>language(maybe encoding) down to the column level!
>>
>>I've been reading on GNU-C and on languages, encoding, and
>
> localization.
>
>>
>
http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-
> 1.html
>
>
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2
> 366,1222,00.html
>
>>
>>There are three basic approaches to doing different langauges in
>
> computerized text:
>
>>    A/ various adaptations of the 8 bit character set, I.E. the
>
> ISO-8859-x series.
>
>>    B/ wide characters
>>    ********This should be how Postgress stores data
>
> internally.********
>
>>    C/ Multibyte characters
>>    ********This is how Postgress should default to sending data OUT
>
> of the application,
>
>>            i.e. to the display or the web, or other system
>
> applications********
>
> SQL has a system for defining character set specifications, collations
> and
> such (per column/literal in some cases).  We should probably look at
it
> before making decisions on how to do things.
>
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if
your
>       joining column's datatypes do not match
>

Re: Sorting Problem

От

Gianni Mariani

Дата:

13 августа 2003 г., 15:15:20

Dennis Gearon wrote:

> I agree with all of that except for one caveat:
>
>     all my reading, and just general off the cuff thinking, says that
> processing variable width characters SIGNIFICANTLY slows an
> application. It seems better to PROCESS fixed width characters (1,2,4
> byte), and TRANSMIT variable width characters (avoiding the null
> problem.)

I can and have solved that problem.  If you can assume utf-8 encoding
then there are available to you a bunch o tricks that takes this problem
away.

The other problem with memory (and hence cache) utilization of a wide
char only solution it far more significant.
Cache effects are the primary killer for performance in an app like a
database.

Anyhow, before making any "decisions" one should do a bunch of analysis.

Cheers
G

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 15:23:33

I agree, mostly. In the case of a database, I would bet that the INTERNAL, IN-APPLICATION processing
FAAAAAAAAAAARRRRRRRRRRRexceeds that of sending and receiving it. 

    i.e. comparisons, sorts, triggers, indexes, views, functions, logging to tables, ordering by,
        grouping, etc.

    except backups, restores, logging to text files (these would be good in UTF8)

Bruce Momjian wrote:

> I think the question is how often are you passing data around/storing it
> _in_ your application and how often are you processing it.
>
> ---------------------------------------------------------------------------
>
> Dennis Gearon wrote:
>
>>I agree with all of that except for one caveat:
>>
>>    all my reading, and just general off the cuff thinking, says that processing variable width characters
SIGNIFICANTLYslows an application. It seems better to PROCESS fixed width characters (1,2,4 byte), and TRANSMIT
variablewidth characters (avoiding the null problem.) 
>>
>>Gianni Mariani wrote:
>>
>>
>>>Dennis Gearon wrote:
>>>
>>>
>>>>Got a link to that section of the standard, or better yet, to a
>>>>'interpreted' version of the standard? :-)
>>>>
>>>>Stephan Szabo wrote:
>>>>
>>>>
>>>>>On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Dennis Bj?rklund wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>In the future we need indexes that depend on the locale (and a lot
>>>>>>>of other changes).
>>>>>>>
>>>>>>
>>>>>>I agree. I've been looking at the web on this subject a lot lately. I
>>>>>>am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>>>>>language(maybe encoding) down to the column level!
>>>>>>
>>>>>>I've been reading on GNU-C and on languages, encoding, and
>>>>>>localization.
>>>>>>
>>>>>>http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>>>>>
>>>>>>http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>There are three basic approaches to doing different langauges in
>>>>>>computerized text:
>>>>>>
>>>>>>   A/ various adaptations of the 8 bit character set, I.E. the
>>>>>>ISO-8859-x series.
>>>>>>   B/ wide characters
>>>>>>   ********This should be how Postgress stores data internally.********
>>>>>>   C/ Multibyte characters
>>>>>>   ********This is how Postgress should default to sending data OUT
>>>>>>of the application,
>>>>>>           i.e. to the display or the web, or other system
>>>>>>applications********
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>SQL has a system for defining character set specifications,
>>>>>collations and
>>>>>such (per column/literal in some cases).  We should probably look at it
>>>>>before making decisions on how to do things.
>>>>
>>>>
>>>I thought UNIX (SCOTM) systems also had a way of being able to define
>>>collation order.
>>>
>>>see:
>>>   ftp://dkuug.dk/i18n/WG15-collection/locales
>>>
>>>for a collection of all ISO standardized locales (the WG15 ISO work
>>>group's stuff).
>>>
>>>Do a "man localedef" on most Linuxen or UNIXen.
>>>
>>>As for wide characters vs multibyte, there is no clear winner.  The
>>>right answer DEPENDS on the situation.
>>>
>>>Wide characters on some platforms are 16 bit which means that when you
>>>do Unicode you'll still have problems with surrogate pairs (meaning that
>>>it's still multi (wide) char) so you still have all the problems of
>>>multi-byte encodings.
>>>
>>>You could decide to process everything in a PG specific 4 byte wide char
>>>and do all text in Unicode but the overhead in processing 4 times the
>>>data is quite significant.  The other option is to store all data in
>>>utf-8 and have all text code become utf-8 aware.
>>>
>>>I have found in practice that the utf-8 option is significantly easier
>>>to implement, 100% Unicode compliant and the best performer (because of
>>>reduced memory requirements).
>>>The Posix API's for locales are not very good for modern day programs,
>>>I'm not sure where the "mbr*" and the "wcr*" apis are in the
>>>standardization process but if these are not well supported, you're on
>>>your own and will need to implement similar functionality from scratch
>>>and for that matter, the collation functions all operate on a "current"
>>>locate which is really difficult to work with on multi-locale applications.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>---------------------------(end of broadcast)---------------------------
>>>TIP 8: explain analyze is your friend
>>>
>>
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 7: don't forget to increase your free space map settings
>>
>
>

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 15:26:31

How did you solve the problem .......... :-)

Gianni Mariani wrote:

> Dennis Gearon wrote:
>
>> I agree with all of that except for one caveat:
>>
>>     all my reading, and just general off the cuff thinking, says that
>> processing variable width characters SIGNIFICANTLY slows an
>> application. It seems better to PROCESS fixed width characters (1,2,4
>> byte), and TRANSMIT variable width characters (avoiding the null
>> problem.)
>
>
>
> I can and have solved that problem.  If you can assume utf-8 encoding
> then there are available to you a bunch o tricks that takes this problem
> away.
>
> The other problem with memory (and hence cache) utilization of a wide
> char only solution it far more significant.
> Cache effects are the primary killer for performance in an app like a
> database.
>
> Anyhow, before making any "decisions" one should do a bunch of analysis.
>
> Cheers
> G
>
>
>
>

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 15:29:58

Perhaps the only way to get around the cache problem is to use an ISO-8859-x 8bit character set, but to have per table,
orper column encoding attributes. And of course, ways to access what those are, in the Postgres API. Good for speed,
butnot for easy storing of multiple language/encodings per column/table. 

Gianni Mariani wrote:

> Dennis Gearon wrote:
>
>> I agree with all of that except for one caveat:
>>
>>     all my reading, and just general off the cuff thinking, says that
>> processing variable width characters SIGNIFICANTLY slows an
>> application. It seems better to PROCESS fixed width characters (1,2,4
>> byte), and TRANSMIT variable width characters (avoiding the null
>> problem.)
>
>
>
> I can and have solved that problem.  If you can assume utf-8 encoding
> then there are available to you a bunch o tricks that takes this problem
> away.
>
> The other problem with memory (and hence cache) utilization of a wide
> char only solution it far more significant.
> Cache effects are the primary killer for performance in an app like a
> database.
>
> Anyhow, before making any "decisions" one should do a bunch of analysis.
>
> Cheers
> G
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>

Re: Sorting Problem

От

Gianni Mariani

Дата:

13 августа 2003 г., 15:55:06

Dennis Gearon wrote:

> How did you solve the problem .......... :-)

inlining - most chars are just ascii and there are trivial optimizations
that can lead to just as fast as moving 4x the data around.

Re: Sorting Problem

От

Kathy zhu

Дата:

13 августа 2003 г., 18:13:21

Ok, thanks for all the discussion followed, vey educational :-))
But nobody really followed up my question :-(

For example, you have a table that is displayed in the browser. You want to let
the user do sorting on one or multible columns, including those which contain
localized strings.

If the db supports sorting for multi-linguages, you can retreive all the rows
and do "ORDER BY" on the columns directly.

But if db doesn't support that, in Java, you will have to retrieve all the rows
first, sort the objects on multiple columnes listed in the "ORDER BY" in java.
It is much slower than the previous one.

My problem is that if initdb is done with en_US, if a customer in Japan request
the page with that sorted table, the text is not sorted in a Japanese one. Then
I have a problem.

Any suggestions on that ??


Dennis Gearon wrote:
> I agree, mostly. In the case of a database, I would bet that the
> INTERNAL, IN-APPLICATION processing FAAAAAAAAAAARRRRRRRRRRR exceeds that
> of sending and receiving it.
>
>     i.e. comparisons, sorts, triggers, indexes, views, functions,
> logging to tables, ordering by,
>         grouping, etc.
>
>     except backups, restores, logging to text files (these would be good
> in UTF8)
>
> Bruce Momjian wrote:
>
>> I think the question is how often are you passing data around/storing it
>> _in_ your application and how often are you processing it.
>>
>> ---------------------------------------------------------------------------
>>
>>
>> Dennis Gearon wrote:
>>
>>> I agree with all of that except for one caveat:
>>>
>>>     all my reading, and just general off the cuff thinking, says that
>>> processing variable width characters SIGNIFICANTLY slows an
>>> application. It seems better to PROCESS fixed width characters (1,2,4
>>> byte), and TRANSMIT variable width characters (avoiding the null
>>> problem.)
>>>
>>> Gianni Mariani wrote:
>>>
>>>
>>>> Dennis Gearon wrote:
>>>>
>>>>
>>>>> Got a link to that section of the standard, or better yet, to a
>>>>> 'interpreted' version of the standard? :-)
>>>>>
>>>>> Stephan Szabo wrote:
>>>>>
>>>>>
>>>>>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Dennis Bj?rklund wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> In the future we need indexes that depend on the locale (and a
>>>>>>>> lot of other changes).
>>>>>>>>
>>>>>>>
>>>>>>> I agree. I've been looking at the web on this subject a lot
>>>>>>> lately. I
>>>>>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>>>>>> language(maybe encoding) down to the column level!
>>>>>>>
>>>>>>> I've been reading on GNU-C and on languages, encoding, and
>>>>>>> localization.
>>>>>>>
>>>>>>> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>>>>>>
>>>>>>> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> There are three basic approaches to doing different langauges in
>>>>>>> computerized text:
>>>>>>>
>>>>>>>   A/ various adaptations of the 8 bit character set, I.E. the
>>>>>>> ISO-8859-x series.
>>>>>>>   B/ wide characters
>>>>>>>   ********This should be how Postgress stores data
>>>>>>> internally.********
>>>>>>>   C/ Multibyte characters
>>>>>>>   ********This is how Postgress should default to sending data
>>>>>>> OUT of the application,
>>>>>>>           i.e. to the display or the web, or other system
>>>>>>> applications********
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> SQL has a system for defining character set specifications,
>>>>>> collations and
>>>>>> such (per column/literal in some cases).  We should probably look
>>>>>> at it
>>>>>> before making decisions on how to do things.
>>>>>
>>>>>
>>>>>
>>>> I thought UNIX (SCOTM) systems also had a way of being able to
>>>> define collation order.
>>>>
>>>> see:
>>>>   ftp://dkuug.dk/i18n/WG15-collection/locales
>>>>
>>>> for a collection of all ISO standardized locales (the WG15 ISO work
>>>> group's stuff).
>>>>
>>>> Do a "man localedef" on most Linuxen or UNIXen.
>>>>
>>>> As for wide characters vs multibyte, there is no clear winner.  The
>>>> right answer DEPENDS on the situation.
>>>>
>>>> Wide characters on some platforms are 16 bit which means that when
>>>> you do Unicode you'll still have problems with surrogate pairs
>>>> (meaning that it's still multi (wide) char) so you still have all
>>>> the problems of multi-byte encodings.
>>>>
>>>> You could decide to process everything in a PG specific 4 byte wide
>>>> char and do all text in Unicode but the overhead in processing 4
>>>> times the data is quite significant.  The other option is to store
>>>> all data in utf-8 and have all text code become utf-8 aware.
>>>>
>>>> I have found in practice that the utf-8 option is significantly
>>>> easier to implement, 100% Unicode compliant and the best performer
>>>> (because of reduced memory requirements).
>>>> The Posix API's for locales are not very good for modern day
>>>> programs, I'm not sure where the "mbr*" and the "wcr*" apis are in
>>>> the standardization process but if these are not well supported,
>>>> you're on your own and will need to implement similar functionality
>>>> from scratch and for that matter, the collation functions all
>>>> operate on a "current" locate which is really difficult to work with
>>>> on multi-locale applications.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 8: explain analyze is your friend
>>>>
>>>
>>>
>>> ---------------------------(end of broadcast)---------------------------
>>> TIP 7: don't forget to increase your free space map settings
>>>
>>
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>      joining column's datatypes do not match


--

Re: Sorting Problem

От

Dennis Gearon

Дата:

13 августа 2003 г., 18:39:26

From what I've read, the db will ONLY sort on the language it was 'initdb'd on. I guess the 'C' locale should would
sortit sort of, in other languages. 

Kathy zhu wrote:

> Ok, thanks for all the discussion followed, vey educational :-))
> But nobody really followed up my question :-(
>
> For example, you have a table that is displayed in the browser. You want
> to let the user do sorting on one or multible columns, including those
> which contain localized strings.
>
> If the db supports sorting for multi-linguages, you can retreive all the
> rows and do "ORDER BY" on the columns directly.
>
> But if db doesn't support that, in Java, you will have to retrieve all
> the rows first, sort the objects on multiple columnes listed in the
> "ORDER BY" in java. It is much slower than the previous one.
>
> My problem is that if initdb is done with en_US, if a customer in Japan
> request the page with that sorted table, the text is not sorted in a
> Japanese one. Then I have a problem.
>
> Any suggestions on that ??
>
>
> Dennis Gearon wrote:
>
>> I agree, mostly. In the case of a database, I would bet that the
>> INTERNAL, IN-APPLICATION processing FAAAAAAAAAAARRRRRRRRRRR exceeds
>> that of sending and receiving it.
>>
>>     i.e. comparisons, sorts, triggers, indexes, views, functions,
>> logging to tables, ordering by,
>>         grouping, etc.
>>
>>     except backups, restores, logging to text files (these would be
>> good in UTF8)
>>
>> Bruce Momjian wrote:
>>
>>> I think the question is how often are you passing data around/storing it
>>> _in_ your application and how often are you processing it.
>>>
>>> ---------------------------------------------------------------------------
>>>
>>>
>>> Dennis Gearon wrote:
>>>
>>>> I agree with all of that except for one caveat:
>>>>
>>>>     all my reading, and just general off the cuff thinking, says
>>>> that processing variable width characters SIGNIFICANTLY slows an
>>>> application. It seems better to PROCESS fixed width characters
>>>> (1,2,4 byte), and TRANSMIT variable width characters (avoiding the
>>>> null problem.)
>>>>
>>>> Gianni Mariani wrote:
>>>>
>>>>
>>>>> Dennis Gearon wrote:
>>>>>
>>>>>
>>>>>> Got a link to that section of the standard, or better yet, to a
>>>>>> 'interpreted' version of the standard? :-)
>>>>>>
>>>>>> Stephan Szabo wrote:
>>>>>>
>>>>>>
>>>>>>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Dennis Bj?rklund wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> In the future we need indexes that depend on the locale (and a
>>>>>>>>> lot of other changes).
>>>>>>>>>
>>>>>>>>
>>>>>>>> I agree. I've been looking at the web on this subject a lot
>>>>>>>> lately. I
>>>>>>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user
>>>>>>>> define a
>>>>>>>> language(maybe encoding) down to the column level!
>>>>>>>>
>>>>>>>> I've been reading on GNU-C and on languages, encoding, and
>>>>>>>> localization.
>>>>>>>>
>>>>>>>> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>>>>>>>
>>>>>>>> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> There are three basic approaches to doing different langauges in
>>>>>>>> computerized text:
>>>>>>>>
>>>>>>>>   A/ various adaptations of the 8 bit character set, I.E. the
>>>>>>>> ISO-8859-x series.
>>>>>>>>   B/ wide characters
>>>>>>>>   ********This should be how Postgress stores data
>>>>>>>> internally.********
>>>>>>>>   C/ Multibyte characters
>>>>>>>>   ********This is how Postgress should default to sending data
>>>>>>>> OUT of the application,
>>>>>>>>           i.e. to the display or the web, or other system
>>>>>>>> applications********
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> SQL has a system for defining character set specifications,
>>>>>>> collations and
>>>>>>> such (per column/literal in some cases).  We should probably look
>>>>>>> at it
>>>>>>> before making decisions on how to do things.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> I thought UNIX (SCOTM) systems also had a way of being able to
>>>>> define collation order.
>>>>>
>>>>> see:
>>>>>   ftp://dkuug.dk/i18n/WG15-collection/locales
>>>>>
>>>>> for a collection of all ISO standardized locales (the WG15 ISO work
>>>>> group's stuff).
>>>>>
>>>>> Do a "man localedef" on most Linuxen or UNIXen.
>>>>>
>>>>> As for wide characters vs multibyte, there is no clear winner.  The
>>>>> right answer DEPENDS on the situation.
>>>>>
>>>>> Wide characters on some platforms are 16 bit which means that when
>>>>> you do Unicode you'll still have problems with surrogate pairs
>>>>> (meaning that it's still multi (wide) char) so you still have all
>>>>> the problems of multi-byte encodings.
>>>>>
>>>>> You could decide to process everything in a PG specific 4 byte wide
>>>>> char and do all text in Unicode but the overhead in processing 4
>>>>> times the data is quite significant.  The other option is to store
>>>>> all data in utf-8 and have all text code become utf-8 aware.
>>>>>
>>>>> I have found in practice that the utf-8 option is significantly
>>>>> easier to implement, 100% Unicode compliant and the best performer
>>>>> (because of reduced memory requirements).
>>>>> The Posix API's for locales are not very good for modern day
>>>>> programs, I'm not sure where the "mbr*" and the "wcr*" apis are in
>>>>> the standardization process but if these are not well supported,
>>>>> you're on your own and will need to implement similar functionality
>>>>> from scratch and for that matter, the collation functions all
>>>>> operate on a "current" locate which is really difficult to work
>>>>> with on multi-locale applications.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------(end of
>>>>> broadcast)---------------------------
>>>>> TIP 8: explain analyze is your friend
>>>>>
>>>>
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 7: don't forget to increase your free space map settings
>>>>
>>>
>>>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 9: the planner will ignore your desire to choose an index scan if
>> your
>>      joining column's datatypes do not match
>
>
>

Re: Sorting Problem

От

"Maksim Likharev"

Дата:

13 августа 2003 г., 19:04:27

When I said use en_US as collate, I meant that you language set
fall into "Western European" group, another word "Latin1 General"
collate,
still as Dennis Björklund pointed that Swedish has different sorting
rules,
( That's why I guess they have separate Swedish collate )

Generally speaking if you language set is unlimited you effectively
stuck?
with sorting on client side by using particular language rules.
So then use 'C' collate for DB,
I do not know what kind of encoding will be good for you tho, Unicode??.

So you can sort on a client using xsl:sort and specifing language or any
other means that you
feel comfortable with.
Or just use binary sorting rules, that what 'C' collate does.
Choices, choices...






-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Wednesday, August 13, 2003 2:13 PM
To: gearond@cvc.net
Cc: Bruce Momjian; Gianni Mariani; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem


Ok, thanks for all the discussion followed, vey educational :-))
But nobody really followed up my question :-(

For example, you have a table that is displayed in the browser. You want
to let
the user do sorting on one or multible columns, including those which
contain
localized strings.

If the db supports sorting for multi-linguages, you can retreive all the
rows
and do "ORDER BY" on the columns directly.

But if db doesn't support that, in Java, you will have to retrieve all
the rows
first, sort the objects on multiple columnes listed in the "ORDER BY" in
java.
It is much slower than the previous one.

My problem is that if initdb is done with en_US, if a customer in Japan
request
the page with that sorted table, the text is not sorted in a Japanese
one. Then
I have a problem.

Any suggestions on that ??


Dennis Gearon wrote:
> I agree, mostly. In the case of a database, I would bet that the
> INTERNAL, IN-APPLICATION processing FAAAAAAAAAAARRRRRRRRRRR exceeds
that
> of sending and receiving it.
>
>     i.e. comparisons, sorts, triggers, indexes, views, functions,
> logging to tables, ordering by,
>         grouping, etc.
>
>     except backups, restores, logging to text files (these would be
good
> in UTF8)
>
> Bruce Momjian wrote:
>
>> I think the question is how often are you passing data around/storing
it
>> _in_ your application and how often are you processing it.
>>
>>
------------------------------------------------------------------------
---
>>
>>
>> Dennis Gearon wrote:
>>
>>> I agree with all of that except for one caveat:
>>>
>>>     all my reading, and just general off the cuff thinking, says
that
>>> processing variable width characters SIGNIFICANTLY slows an
>>> application. It seems better to PROCESS fixed width characters
(1,2,4
>>> byte), and TRANSMIT variable width characters (avoiding the null
>>> problem.)
>>>
>>> Gianni Mariani wrote:
>>>
>>>
>>>> Dennis Gearon wrote:
>>>>
>>>>
>>>>> Got a link to that section of the standard, or better yet, to a
>>>>> 'interpreted' version of the standard? :-)
>>>>>
>>>>> Stephan Szabo wrote:
>>>>>
>>>>>
>>>>>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Dennis Bj?rklund wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> In the future we need indexes that depend on the locale (and a
>>>>>>>> lot of other changes).
>>>>>>>>
>>>>>>>
>>>>>>> I agree. I've been looking at the web on this subject a lot
>>>>>>> lately. I
>>>>>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user
define a
>>>>>>> language(maybe encoding) down to the column level!
>>>>>>>
>>>>>>> I've been reading on GNU-C and on languages, encoding, and
>>>>>>> localization.
>>>>>>>
>>>>>>>
http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-
1.html
>>>>>>>
>>>>>>>
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2
366,1222,00.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> There are three basic approaches to doing different langauges in

>>>>>>> computerized text:
>>>>>>>
>>>>>>>   A/ various adaptations of the 8 bit character set, I.E. the
>>>>>>> ISO-8859-x series.
>>>>>>>   B/ wide characters
>>>>>>>   ********This should be how Postgress stores data
>>>>>>> internally.********
>>>>>>>   C/ Multibyte characters
>>>>>>>   ********This is how Postgress should default to sending data
>>>>>>> OUT of the application,
>>>>>>>           i.e. to the display or the web, or other system
>>>>>>> applications********
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> SQL has a system for defining character set specifications,
>>>>>> collations and
>>>>>> such (per column/literal in some cases).  We should probably look

>>>>>> at it
>>>>>> before making decisions on how to do things.
>>>>>
>>>>>
>>>>>
>>>> I thought UNIX (SCOTM) systems also had a way of being able to
>>>> define collation order.
>>>>
>>>> see:
>>>>   ftp://dkuug.dk/i18n/WG15-collection/locales
>>>>
>>>> for a collection of all ISO standardized locales (the WG15 ISO work

>>>> group's stuff).
>>>>
>>>> Do a "man localedef" on most Linuxen or UNIXen.
>>>>
>>>> As for wide characters vs multibyte, there is no clear winner.  The

>>>> right answer DEPENDS on the situation.
>>>>
>>>> Wide characters on some platforms are 16 bit which means that when
>>>> you do Unicode you'll still have problems with surrogate pairs
>>>> (meaning that it's still multi (wide) char) so you still have all
>>>> the problems of multi-byte encodings.
>>>>
>>>> You could decide to process everything in a PG specific 4 byte wide

>>>> char and do all text in Unicode but the overhead in processing 4
>>>> times the data is quite significant.  The other option is to store
>>>> all data in utf-8 and have all text code become utf-8 aware.
>>>>
>>>> I have found in practice that the utf-8 option is significantly
>>>> easier to implement, 100% Unicode compliant and the best performer
>>>> (because of reduced memory requirements).
>>>> The Posix API's for locales are not very good for modern day
>>>> programs, I'm not sure where the "mbr*" and the "wcr*" apis are in
>>>> the standardization process but if these are not well supported,
>>>> you're on your own and will need to implement similar functionality

>>>> from scratch and for that matter, the collation functions all
>>>> operate on a "current" locate which is really difficult to work
with
>>>> on multi-locale applications.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 8: explain analyze is your friend
>>>>
>>>
>>>
>>> ---------------------------(end of
broadcast)---------------------------
>>> TIP 7: don't forget to increase your free space map settings
>>>
>>
>>
>
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if
your
>      joining column's datatypes do not match


--



---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: Sorting Problem

От

Martijn van Oosterhout

Дата:

13 августа 2003 г., 20:44:45

On Wed, Aug 13, 2003 at 11:33:31AM -0700, Dennis Gearon wrote:
> Perhaps the only way to get around the cache problem is to use an
> ISO-8859-x 8bit character set, but to have per table, or per column
> encoding attributes. And of course, ways to access what those are, in the
> Postgres API. Good for speed, but not for easy storing of multiple
> language/encodings per column/table.

Well, each column will have to have a native encoding/collation order. This
is the one stored on disk and the one used for indexes. Remember, if your
index is collated by en_US and your query asks for nl_NL, you can't use your
query for sorting the result. I guess you could have the same column indexed
twice with the different collation orders.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> "All that is needed for the forces of evil to triumph is for enough good
> men to do nothing." - Edmond Burke
> "The penalty good people pay for not being interested in politics is to be
> governed by people worse than themselves." - Plato

Вложения

msg-32091-64922.dat

Re: Sorting Problem

От

Peter Eisentraut

Дата:

19 августа 2003 г., 05:27:44

Dennis Gearon writes:

> If we really want to do it, we won't have to do the grunt work ourselves, just the tie-in, and Postgres specific
implementation:
>
>     http://www.unicode.org/reports/tr10/

It's already in the works.

--
Peter Eisentraut   peter_e@gmx.net

Re: Sorting Problem

От

Tom Lane

Дата:

19 августа 2003 г., 10:27:46

Peter Eisentraut <peter_e@gmx.net> writes:
> It's already in the works.

Oh?  Please clue us in, if there's work going on that we don't know
about.  Otherwise someone might be duplicating effort.

            regards, tom lane

Re: Sorting Problem

От

Dennis Gearon

Дата:

19 августа 2003 г., 12:53:03

Is there a list on this, that I could be part of ?

Peter Eisentraut wrote:

> Dennis Gearon writes:
>
>
>>If we really want to do it, we won't have to do the grunt work ourselves, just the tie-in, and Postgres specific
implementation:
>>
>>    http://www.unicode.org/reports/tr10/
>
>
> It's already in the works.
>

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Sorting Problem

Вложения