Discussion: sql92 character sets
For my own amusement I'm reading the sql 92 spec about character sets.
There are some concepts I find a bit difficult; maybe someone can
explain them to me:
  character set
  character repertoire
for example in 4.2.1 it says:
A character set is described by a character set descriptor. A character set descriptor includes:
- the name of the character set or character repertoire,
- if the character set is a character repertoire, then the name of the form-of-use,
- an indication of what characters are in the character set, and
- the name of the default collation of the character set.
What I have understood so far is that form-of-use is the encoding. So if
the character set is UNICODE then the form-of-use could be UTF-8, UTF-16
and so on.
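
To make the repertoire/form-of-use split concrete, here is a minimal Python sketch (not from the spec; the sample string is only an illustration). It encodes the same characters from the Unicode repertoire under two forms-of-use and checks that only the bytes differ:

```python
# Same abstract characters (repertoire: Unicode), two forms-of-use.
text = "Björklund"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # explicit endianness, so no BOM is prepended

# The byte sequences differ in content and in length...
assert utf8 != utf16
assert len(utf8) != len(utf16)

# ...but both decode back to exactly the same characters.
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == text
```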
The character repertoire, however, I don't have any intuition about at all.
Then we have this little section:
The <implementation-defined character repertoire name> SQL_TEXT specifies the name of a character repertoire and
implied form-of-use that can represent every character that is in <SQL language character> and all other characters
that are in character sets supported by the implementation.
Had Unicode been a superset of all character sets, then one could just
have used Unicode for SQL_TEXT. Exactly how do we create a character
repertoire that can store any character from any character set? Storing
the character set for each character is not such a cool thing to do
even if it would work :-)
SQL_ASCII in pg is similar, it's basically a number of bytes. But the spec
seems to say that one should be able to count the characters as well (not
the bytes) so SQL_ASCII is not the same as SQL_TEXT.
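
A small Python sketch of why "just a number of bytes" is not enough to count characters (an illustration only, not how PostgreSQL handles SQL_ASCII internally): the same bytes give different character counts depending on which encoding you assume.

```python
# Bytes as they might sit in a column whose encoding is unknown.
raw = "Björklund".encode("utf-8")

byte_count = len(raw)

# The character count depends entirely on the encoding we assume.
chars_if_utf8 = len(raw.decode("utf-8"))         # "ö" is one character
chars_if_latin1 = len(raw.decode("iso-8859-1"))  # "ö" bytes become two characters

assert chars_if_utf8 < byte_count
assert chars_if_latin1 == byte_count
```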
ps. This is not me volunteering to implement all this :-)
--
/Dennis Björklund

Dennis Bjorklund wrote:
> What I have understood so far is that form-of-use is the encoding. So
> if the character set is UNICODE then the form-of-use could be UTF-8,
> UTF-16 and so on.

Exactly.

> The character repertoire, however, I don't have any intuition about
> at all.

A character repertoire is basically an abstract bag of characters (say,
"a to z" or "all modern Greek characters") that you plan to represent
using a character set.

In SQL 99, this terminology was altered a little (unfortunately not
quite compatibly). There, a character repertoire is an abstract set of
characters whose internal representation is irrelevant. Add to that an
encoding (how to convert characters to bits) and a form-of-use (how to
assemble characters into a string; for stateful encodings? endianness?),
and that together makes a character set. And then they say that
"character repertoire" and "character set" are used interchangeably
except where communication with external systems is concerned.

The only real consequence of this difference is that character strings
of the same repertoire but possibly using different encodings or
forms-of-use should still be comparable and assignable. But that should
only concern us if we allowed different character sets per datum and we
actually had cases of different encodings for the same repertoire.

> Had Unicode been a superset of all character sets, then one could
> just have used Unicode for SQL_TEXT. Exactly how do we create a
> character repertoire that can store any character from any character
> set? Storing the character set for each character is not such a cool
> thing to do even if it would work :-)

Actually that's exactly what "Mule Internal Code" does.

> SQL_ASCII in pg is similar, it's basically a number of bytes. But the
> spec seems to say that one should be able to count the characters as
> well (not the bytes) so SQL_ASCII is not the same as SQL_TEXT.

SQL_ASCII is a kludge, albeit a practical one. We should not design
further extensions around it.
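
The "store the character set with each character" idea discussed above can be sketched as a toy model in Python. This is only an illustration of the concept: the `tag` and `untag` helpers and the tuple layout are hypothetical, and this is not the actual byte layout of Mule Internal Code.

```python
from typing import List, Tuple

# Hypothetical representation: (charset name, code of one character in that charset).
TaggedChar = Tuple[str, bytes]

def tag(text: str, charset: str) -> List[TaggedChar]:
    """Encode each character separately and remember which charset it came from."""
    return [(charset, ch.encode(charset)) for ch in text]

def untag(chars: List[TaggedChar]) -> str:
    """Decode each character using its own charset tag."""
    return "".join(code.decode(charset) for charset, code in chars)

# One string mixing characters from three different character sets.
mixed = tag("abc", "ascii") + tag("αβγ", "iso-8859-7") + tag("日本", "euc_jp")

# Characters can be counted even though their byte widths differ.
char_count = len(mixed)
byte_widths = [len(code) for _, code in mixed]
```

With a representation like this, the character count is simply the number of tagged entries, which is the property the spec seems to ask for and that a plain byte string cannot offer.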