Re: Unicode support

From Andrew Gierth
Subject Re: Unicode support
Date
Msg-id 87r5zualin.fsf@news-spur.riddles.org.uk
In response to Re: Unicode support  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
>>>>> "Peter" == Peter Eisentraut <peter_e@gmx.net> writes:
> On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
>> FWIW, the SQL spec puts the onus of normalization squarely on the
>> application; the database is allowed to assume that Unicode
>> strings are already normalized, is allowed to behave in
>> implementation-defined ways when presented with strings that
>> aren't normalized, and provision of normalization functions and
>> predicates is just another optional feature.
 
Peter> Can you name chapter and verse on that?

4.2.8 Universal character sets

 A UCS string is a character string whose character repertoire is UCS
 and whose character encoding form is one of UTF8, UTF16, or UTF32.
 Any two UCS strings are comparable.

 An SQL-implementation may assume that all UCS strings are normalized
 in one of Normalization Form C (NFC), Normalization Form D (NFD),
 Normalization Form KC (NFKC), or Normalization Form KD (NFKD), as
 specified by [Unicode15]. <normalized predicate> may be used to
 verify the normalization form to which a particular UCS string
 conforms. Applications may also use <normalize function> to enforce
 a particular <normal form>. With the exception of <normalize
 function> and <normalized predicate>, the result of any operation on
 an unnormalized UCS string is implementation-defined.

 Conversion of UCS strings from one character set to another is
 automatic.

 Detection of a noncharacter in a UCS-string causes an exception
 condition to be raised. The detection of an unassigned code point
 does not.
 

[Obviously there are things here that we don't conform to anyway (we
don't raise exceptions for noncharacters, for example). We don't claim
conformance to T061.]

<normalized predicate> ::= <row value predicand> <normalized predicate part 2>
<normalized predicate part 2> ::= IS [ NOT ] [ <normal form> ] NORMALIZED

1) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalized predicate>.

2) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.

<normalize function> ::= NORMALIZE <left paren> <character value expression>
    [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
 

<normal form> ::=   NFC | NFD | NFKC | NFKD

7) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalize function>.

9) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.
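
As a rough illustration of what <normalize function> and <normalized
predicate> provide, here is a minimal Python sketch built on the
standard unicodedata module. The helper names normalize_ucs and
is_normalized_ucs are invented for this example; they are not part of
the spec or of PostgreSQL, and unicodedata.is_normalized needs Python
3.8 or later.

```python
import unicodedata

# The same visible "é" spelled two ways.
COMPOSED = "\u00e9"     # single precomposed code point
DECOMPOSED = "e\u0301"  # base letter + COMBINING ACUTE ACCENT


def normalize_ucs(s, form="NFC"):
    """Hypothetical analogue of NORMALIZE(s, <normal form>)."""
    return unicodedata.normalize(form, s)


def is_normalized_ucs(s, form="NFC"):
    """Hypothetical analogue of s IS [<normal form>] NORMALIZED."""
    return unicodedata.is_normalized(form, s)


if __name__ == "__main__":
    # The two spellings differ as raw strings but agree once normalized.
    print(COMPOSED == DECOMPOSED)
    print(normalize_ucs(DECOMPOSED, "NFC") == COMPOSED)
    print(is_normalized_ucs(DECOMPOSED, "NFC"))
```

Comparing the raw strings gives a different answer from comparing the
normalized ones, which is the kind of result the spec leaves
implementation-defined for unnormalized input.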
Peter> I see this, for example,

Peter> 6.27 <numeric value function>
Peter> [...]

Peter> So SQL redirects the question of character length the Unicode
Peter> standard. I have not been able to find anything there on a
Peter> quick look, but I'm sure the Unicode standard has some very
Peter> specific ideas on this.  Note that the matter of normalization
Peter> is not mentioned here.
 

I've taken a not-so-quick look at the Unicode standard (though I don't
claim to be any sort of expert on it), and I certainly can't see any
definitive indication what the length is supposed to be; however, the
use of terminology such as "combining character sequence" (meaning a
series of codepoints that combine to make a single glyph) certainly
seems to strongly imply that our interpretation is correct and that
the OP's is not.
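
To make "combining character sequence" concrete, the sketch below
lists the code points in one such sequence using Python's unicodedata
module. The helper name describe is made up for this illustration. A
nonzero canonical combining class marks a code point as a combining
mark attached to the preceding base character.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: one glyph, two code points.
SEQ = "e\u0301"


def describe(s):
    """Return (code point, Unicode name, canonical combining class) per code point."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.combining(c))
        for c in s
    ]


if __name__ == "__main__":
    for entry in describe(SEQ):
        print(entry)
```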

Other indications: the units used by length() must be the same as the
units used by position() and substring() (in the spec, when USING
CHARACTERS is specified), and it would not make sense to use a
definition of "character" that did not allow you to look inside a
combining sequence.
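
The point about consistent units can be sketched in Python, whose str
type counts code points. The three helpers below are hypothetical
stand-ins for length(), position() and substring() with 1-based
positions; they are an illustration only, not PostgreSQL's
implementation.

```python
# "cafés" with the é spelled as "e" + COMBINING ACUTE ACCENT.
S = "cafe\u0301s"


def char_length(s):
    """Length in code points."""
    return len(s)


def position(needle, haystack):
    """1-based position of needle in haystack, or 0 if absent."""
    return haystack.find(needle) + 1


def substring(s, start, length):
    """Substring of `length` code points starting at 1-based `start`."""
    return s[start - 1:start - 1 + length]


if __name__ == "__main__":
    pos = position("\u0301", S)
    # The position returned can be fed straight back into substring(),
    # because both count the same unit.
    print(char_length(S), pos, substring(S, pos, 1) == "\u0301")
```

If length counted the combining sequence as one character, the
combining accent would have no position of its own and could not be
selected by substring().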

I've also failed so far to find any examples of other programming
languages in which a combining character sequence is taken to be a
single character for purposes of length or position specification.

-- 
Andrew (irc:RhodiumToad)

