On 2019-03-17 15:01:40 +0000, Warner, Gary, Jr wrote:
> Many of us have faced character encoding issues because we are not in control
> of our input sources and made the common assumption that UTF-8 covers
> everything.
UTF-8 covers "everything" in the sense that there is a round-trip from
each character in every commonly-used charset/encoding to Unicode and
back.
The actual code values may of course differ. For example, the € sign is
0xA4 in iso-8859-15, but U+20AC in Unicode. So you need an
encoding/decoding step.
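
A minimal sketch of that step in Python, assuming the input is known to
be iso-8859-15 (the variable names are only illustrative):

```python
# The euro sign as it arrives from an iso-8859-15 source: one byte, 0xA4.
raw = b"\xa4"

# Decoding maps the byte to the Unicode code point U+20AC.
text = raw.decode("iso-8859-15")

# Encoding as UTF-8 yields a three-byte sequence for the same character.
utf8 = text.encode("utf-8")

# Round trip: back from UTF-8 to the original iso-8859-15 byte.
back = utf8.decode("utf-8").encode("iso-8859-15")

# Guessing the wrong source encoding silently yields a different character:
# in iso-8859-1 the byte 0xA4 is the generic currency sign U+00A4.
wrong_guess = raw.decode("iso-8859-1")

print(hex(ord(text)), utf8, back == raw, hex(ord(wrong_guess)))
```

The round trip only works if you know (or correctly guess) which
encoding the bytes were in to begin with.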
And "commonly-used" means just that. Unicode covers a lot of character
sets, but it can't cover every character set ever invented (I invented
my own character sets when I was sixteen. Nobody except me ever used
them and they have long succumbed to bit rot).
> In my lab, as an example, some of our social media posts have included ZawGyi
> Burmese character sets rather than Unicode Burmese. (Because Myanmar developed
> technology In a closed to the world environment, they made up their own
> non-standard character set which is very common still in Mobile phones.).
I'd be surprised if there was a character set which is "very common in
Mobile phones", even in a relatively poor country like Myanmar. Does
ZawGyi actually include characters which aren't in Unicode, or are they
just encoded differently?
hp
--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | hjp@hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>