Re: Unicode support
From | Andrew Gierth |
---|---|
Subject | Re: Unicode support |
Date | |
Msg-id | 87r5zualin.fsf@news-spur.riddles.org.uk |
In response to | Re: Unicode support (Peter Eisentraut <peter_e@gmx.net>) |
List | pgsql-hackers |
>>>>> "Peter" == Peter Eisentraut <peter_e@gmx.net> writes:

 > On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
 >> FWIW, the SQL spec puts the onus of normalization squarely on the
 >> application; the database is allowed to assume that Unicode
 >> strings are already normalized, is allowed to behave in
 >> implementation-defined ways when presented with strings that
 >> aren't normalized, and provision of normalization functions and
 >> predicates is just another optional feature.

 Peter> Can you name chapter and verse on that?

4.2.8 Universal character sets

  A UCS string is a character string whose character repertoire is
  UCS and whose character encoding form is one of UTF8, UTF16, or
  UTF32. Any two UCS strings are comparable.

  An SQL-implementation may assume that all UCS strings are
  normalized in one of Normalization Form C (NFC), Normalization
  Form D (NFD), Normalization Form KC (NFKC), or Normalization Form
  KD (NFKD), as specified by [Unicode15]. <normalized predicate> may
  be used to verify the normalization form to which a particular UCS
  string conforms. Applications may also use <normalize function> to
  enforce a particular <normal form>. With the exception of
  <normalize function> and <normalized predicate>, the result of any
  operation on an unnormalized UCS string is implementation-defined.

  Conversion of UCS strings from one character set to another is
  automatic.

  Detection of a noncharacter in a UCS-string causes an exception
  condition to be raised. The detection of an unassigned codepoint
  does not.

[Obviously there are things here that we don't conform to anyway (we
don't raise exceptions for noncharacters, for example). We don't
claim conformance to T061.]

  <normalized predicate> ::=
      <row value predicand> <normalized predicate part 2>

  <normalized predicate part 2> ::=
      IS [ NOT ] [ <normal form> ] NORMALIZED

  1) Without Feature T061, "UCS support", conforming SQL language
     shall not contain a <normalized predicate>.

  2) Without Feature F394, "Optional normal form specification",
     conforming SQL language shall not contain <normal form>.

  <normalize function> ::=
      NORMALIZE <left paren> <character value expression>
      [ <comma> <normal form>
        [ <comma> <normalize function result length> ] ]
      <right paren>

  <normal form> ::=
      NFC | NFD | NFKC | NFKD

  7) Without Feature T061, "UCS support", conforming SQL language
     shall not contain a <normalize function>.

  9) Without Feature F394, "Optional normal form specification",
     conforming SQL language shall not contain <normal form>.

 Peter> I see this, for example,

 Peter> 6.27 <numeric value function>
 Peter> [...]

 Peter> So SQL redirects the question of character length the Unicode
 Peter> standard. I have not been able to find anything there on a
 Peter> quick look, but I'm sure the Unicode standard has some very
 Peter> specific ideas on this. Note that the matter of normalization
 Peter> is not mentioned here.

I've taken a not-so-quick look at the Unicode standard (though I don't
claim to be any sort of expert on it), and I certainly can't see any
definitive indication what the length is supposed to be; however, the
use of terminology such as "combining character sequence" (meaning a
series of codepoints that combine to make a single glyph) certainly
seems to strongly imply that our interpretation is correct and that
the OP's is not.

Other indications: the units used by length() must be the same as the
units used by position() and substring() (in the spec, when USING
CHARACTERS is specified), and it would not make sense to use a
definition of "character" that did not allow you to look inside a
combining sequence.

I've also failed so far to find any examples of other programming
languages in which a combining character sequence is taken to be a
single character for purposes of length or position specification.

-- 
Andrew (irc:RhodiumToad)
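
For illustration only (this is not part of the original message): a minimal Python sketch of the codepoint-based interpretation discussed above. Python is used purely as a convenient example of a language whose string length counts codepoints. `is_in_normal_form` is a hypothetical helper that loosely mirrors the spec's <normalized predicate>, and `unicodedata.normalize` stands in for <normalize function>; neither corresponds to anything in PostgreSQL.

```python
import unicodedata

# The same visible glyph spelled two ways.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one codepoint)
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT (a combining character sequence)

# Length is measured in codepoints, so the two spellings differ.
print(len(composed))     # 1
print(len(decomposed))   # 2

# Indexing uses the same units as length, so it can look inside the sequence.
print(unicodedata.combining(decomposed[1]) != 0)   # True: second codepoint is a combining mark

# Unnormalized strings compare unequal; normalizing first makes them equal.
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True


def is_in_normal_form(s: str, form: str = "NFC") -> bool:
    """Hypothetical helper: True if s is already in the given normal form."""
    return unicodedata.normalize(form, s) == s


print(is_in_normal_form(decomposed, "NFC"))   # False
print(is_in_normal_form(decomposed, "NFD"))   # True
```

An application that follows the spec's division of labour would run something like the `normalize` call on its input before handing strings to the database, since the database is allowed to assume normalized input.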