Discussion: Do we still need MULE_INTERNAL?


Do we still need MULE_INTERNAL?

From:
Thomas Munro
Date:
Hi,

MULE_INTERNAL solved a really hard problem years ago and must have
been extremely useful, but I think we might be able to drop it now,
and I have a patch.  If I am wrong about that and there are users who
would object, then we should probably improve it instead, and I have
some ideas (part of larger reworkings), but first I'd like to
establish whether it is already completely obsolete.

This history may be very well known to hackers in Japan, but I had to
start from zero with my archeologist hat on, and I suspect this is as
obscure to many others as it was to me, so here's what I have come up
with:

In the early nineties (perhaps beginning in the late 80s?),
researchers at AIST developed the MULE "meta-encoding" for Nemacs
(Nihon Emacs), later merged into Xemacs and GNU Emacs.  Unlike early
UTF-16-only versions of Unicode, Emacs' internal encoding was
multi-byte and backward-compatible with ASCII and traditional
in-memory and on-disk representations of text.  Aside from lacking a
multi-byte encoding, early versions of Unicode also apparently failed
to cover all the CJK characters needed by information systems of the
time.

It's a simple and clever idea, just messy in the details and a little
inefficient: each byte was either ASCII or a lead byte that said which
encoding followed (perhaps with light reencoding/escaping in some
cases, IDK), so except for ASCII it was always at least one byte less
efficient than whatever it wrapped, but there was nothing it couldn't
handle.  It could mix around 41 encodings this way, so for the first
time you could have (say) Chinese and Arabic in one document in a
multi-byte format compatible with traditional conventions.
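The lead-byte idea can be sketched in a few lines of Python.  To be
clear, this is a toy illustration of the concept only, not PostgreSQL's
actual MULE_INTERNAL byte layout; the LEAD_* values and the helper
function are invented for the example.

```python
# Toy sketch of the MULE-style lead-byte idea (NOT the real MULE_INTERNAL
# byte layout; the lead-byte values below are invented for illustration).
# ASCII bytes pass through unchanged; any other character is prefixed
# with a lead byte naming its source charset, costing one extra byte.

LEAD_LATIN1 = 0x81      # hypothetical lead byte for a Latin-1-like charset
LEAD_CYRILLIC = 0x8C    # hypothetical lead byte for a KOI8-like charset

def toy_mule_encode(pieces):
    """pieces: list of (lead_byte_or_None, payload_bytes) tuples."""
    out = bytearray()
    for lead, payload in pieces:
        if lead is None:
            out += payload          # plain ASCII, stored as-is
        else:
            out.append(lead)        # announce the charset...
            out += payload          # ...then the native byte(s)
    return bytes(out)

# "A" (ASCII), then e-acute as a Latin-1 byte, then Cyrillic "д" as a
# KOI8-R-style byte, mixed freely in one stream:
stream = toy_mule_encode([(None, b"A"),
                          (LEAD_LATIN1, b"\xe9"),
                          (LEAD_CYRILLIC, b"\xc4")])
```

Three characters cost five bytes here: every non-ASCII character pays
the one-byte lead-byte tax on top of its native encoding.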

The idea doesn't seem to have been adopted by any other software
except PostgreSQL (at least that I could find in quick searches, I'd
be interested to hear of any others).  That's probably because Unicode
gained UTF-8 only a bit later in 1993, providing the missing
multi-byte encoding.  Instead of referencing 41 other moving
standards, it was one unified standard with full international
industry support, and it fitted neatly into C strings and existing
text file conventions (not to mention other design goals like
self-synchronisation).  The rest is history.

Our implementation of MULE_INTERNAL only supports a few sub-encodings,
for Latin, Cyrillic, Chinese, Japanese and Korean, and hasn't been
updated to support modern versions of the CJK ones (ie when we got
EUC_JIS_2004, we didn't handle the corresponding MULE_INTERNAL lead
byte, and I haven't checked the Chinese or Korean situation), which I
suspect might be an actionable clue that it is not in use... but I
lack the context to say; that's a hypothesis.  Our code references the
Xemacs project's internals documentation, last published in 1997, with
a note added in 2012 that we'd started following GNU's implementation
instead, which I think means that mule-conf.el[1] is the closest thing
to a standard.  We added some more IDs as they were assigned, but they
remain unimplemented.  (If we actually do need to keep this, perhaps
our implementation could dispatch to our "direct" encoding routines
instead of open-coding the sub-encodings?  That might be hopelessly
naive, and I can see the combinatorial problem we have that they
don't: since Emacs 23 they only convert to/from Unicode.  It's just a
thought, but I think something like that would be closer to what Emacs
is doing, IIUC.)

Modern GNU Emacs switched to using UTF-8 internally[2] as of Emacs 23
(2009).  It can still convert what it calls "Emacs 21 internal format"
when loading a file, but I suspect we might be the last ones to
support the idea directly as an internal representation.

Emacs' internal representation (both old and new) is technically a
superset of Unicode, as they are proud to say, but AFAICT that just
means you're free to map your made-up script's made-up encoding into
the 5-byte UTF-8 sequence space not used by Unicode (or in the old
system, using private lead bytes), not anything actually useful for
our purposes.  And if you just want to put your Klingon or Tolkien
elvish homework into PostgreSQL, see the ConScript Unicode Registry;
it'd use less disk space!  More seriously, I think there have been
periods when eg JIS rolled out a new standard with characters that
Unicode didn't have yet.  Unicode simply added them to a minor release
(eg 3.2), but for a short time you could have said that Unicode was
not a superset or theoretically sufficient. On the other hand,
PostgreSQL wouldn't stop you using such hypothetical characters
anyway: our UTF-8 validation is for well-formedness, not definedness.
There may of course be implications for sorting and classifying, but
all of that seems a bit bogus: we stopped updating MULE_INTERNAL even
for Japanese, we routinely upgrade Unicode, and locales never worked
for MULE_INTERNAL anyway.  I also doubt very much that Unicode would
be out of the loop on new character assignments in modern times.
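The well-formedness-versus-definedness distinction can be demonstrated
with any strict UTF-8 codec; here Python's codec is used as a stand-in
for the kind of check a server-side validator does (an analogy, not
PostgreSQL's actual code path):

```python
# Well-formedness vs. definedness, shown with Python's strict UTF-8
# codec as a stand-in for a database-side validity check.

# A structurally valid encoding of a code point Unicode has not
# assigned any character to (plane 13, chosen as an example) round-trips
# without complaint -- validators check structure, not assignment:
unassigned = chr(0xD0000)
round_trip = unassigned.encode("utf-8").decode("utf-8")

# ...whereas a malformed byte sequence (an overlong encoding of "/")
# is rejected outright:
try:
    b"\xc0\xaf".decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```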

As for interchange and system boundaries, (1) standard locales on real
systems don't come in MULE_INTERNAL encodings so none of that stuff
works, (2) the JDBC driver and presumably any driver/language that has
its own firm ideas about strings can't support it either, (3) even
applications using libpq would be hard pressed to know what text
actually means outside ASCII, if they choose it as a client encoding,
except perhaps Emacs if you're lucky.

The motivation for removing it would be the unnecessary security
risks, and maintenance burden for future development in our encoding
and locale support.  The motivation for keeping it would be that there
are users with important data trapped in it.  In the absence of hard
data, I tried to imagine why you'd want to use it, other than perhaps
just "we needed it in 199x and haven't migrated yet".  I don't know
too much about CJK computing but I am aware of the space issue:
commonly used CJK characters take 3 UTF-8 bytes to represent, one more
than the national EUC_* encodings.  That's a motivation for preferring
EUC_*, but let's see how MULE_INTERNAL compares:

                                     kanji kana
MULE_INTERNAL-wrapped-JISX0208/0212: 3     3
MULE_INTERNAL-wrapped-JISX0201K:     N/A   2
UTF8:                                3     3
EUC_JP:                              2     2
EUC_JIS_2004:                        2     2
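The UTF-8 row is easy to check with any UTF-8 encoder; note in passing
that half-width katakana, which are 2 bytes when MULE-wrapped as
JISX0201K, still cost 3 bytes in UTF-8:

```python
# Checking the UTF-8 column of the table: common kanji and kana all sit
# in the BMP above U+0800, so each takes 3 bytes in UTF-8.
samples = {
    "kanji":              "\u6f22",  # 漢
    "hiragana":           "\u3042",  # あ
    "katakana":           "\u30a2",  # ア
    "half-width katakana": "\uff71", # ｱ -- also 3 bytes in UTF-8
}
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
```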

Since there are two encodings for kana characters and MULE's
superpower is to switch, I guess it depends how you chose to encode it
and what your ratio of kana to kanji is.  Google gives me a first
guess of 50/50.  I see that the sjis2mic() conversion is clever enough
to use JISX0201K for kana, so if your client is speaking SJIS then I
suppose you might actually finish up with around 2.5 bytes per
character.  That's smaller than UTF-8, and larger than EUC_*.  On the
other hand, EUC_JIS_2004 handles more Japanese characters, and UTF-8
handles all of the world's scripts.  So *maybe* there is a small
motivation there, depending on what you think about JIS 2004.  I
somehow doubt the trade-off makes sense in practice, though: you'd be
forever dealing with weird problems whenever some guy called "凜" (to
pick an example character I googled that is common but missing in the
older standard) needs to appear in your data, if I understood all of
that correctly.
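The ~2.5 figure is just the weighted average under the stated
assumptions (a 50/50 kana/kanji ratio, kana stored as MULE-wrapped
JISX0201K, kanji as wrapped JISX0208):

```python
# Back-of-envelope expected bytes per character, assuming a 50/50
# kana/kanji mix and that kana go via MULE-wrapped JISX0201K (2 bytes)
# while kanji go via wrapped JISX0208 (3 bytes).
kana_ratio = 0.5
avg_mule = kana_ratio * 2 + (1 - kana_ratio) * 3   # MULE_INTERNAL mix
avg_utf8 = 3.0                                     # everything is 3 bytes
avg_euc = 2.0                                      # EUC_JP / EUC_JIS_2004
```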

For Chinese, the calculus is simpler, as they only use hànzì (~=
kanji), with nothing potentially smaller like kana to affect the average.
For Korean, I have no clue.

Can any Japanese (or other) experts offer any clues?  Concrete questions:

* Is anyone actually using MULE_INTERNAL today?
* If so, what prevented migration?
* Was it ever actually used outside Japan?
* Is the lack of interest in the new (22-year-old) JIS standard in
MULE_INTERNAL meaningful?

[1] https://github.com/emacs-mirror/emacs/blob/master/lisp/international/mule-conf.el
[2] https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html



Re: Do we still need MULE_INTERNAL?

From:
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> MULE_INTERNAL solved a really hard problem years ago and must have
> been extremely useful, but I think we might be able to drop it now,
> and I have a patch.

FWIW, I am on board with dropping it, and I have another reason
you didn't list: AFAICS there are multiple ways to represent the
same string in MULE.  Any character available in more than one
encoding has more than one equally-legitimate MULE representation,
which is catastrophic for functions as basic as text equality.
You could argue that this is no worse than the situation for
combining characters in Unicode, but there, there's at least an
agreed-on normal form.

> This history may be very well known to hackers in Japan, but I had to
> start from zero with my archeologist hat on, and I suspect this is as
> obscure to many others as it was to me, so here's what I have come up
> with:

Thanks for doing that research, BTW.  This was mostly new to me.

            regards, tom lane



Re: Do we still need MULE_INTERNAL?

From:
Tatsuo Ishii
Date:
Hi Thomas,

Thank you for the report. I find it quite useful, especially the
Emacs 23 internal format (new to me). I agree that MULE_INTERNAL has
fulfilled its historic role.

Small comments.

>                                      kanji kana
> MULE_INTERNAL-wrapped-JISX0208/0212: 3     3
> MULE_INTERNAL-wrapped-JISX0201K:     N/A   2
> UTF8:                                3     3
> EUC_JP:                              2     2
> EUC_JIS_2004:                        2     2
> 
> Since there are two encodings for kana characters and MULE's
> superpower is to switch, I guess it depends how you chose to encode it
> and what your ratio of kana to kanji is.

The reason 2 encodings for "kana" exist in MULE is the nature of the
character sets MULE supports. In Japanese there are 2 types of "kana":
one is "hiragana" and the other is "katakana". JIS X0208/0212 include
both types of "kana", while JIS X0201 includes only "katakana". So why
does "katakana" appear in those two encodings? Katakana in JIS X0201
is often rendered on screen at half width compared with JIS X 0208
and 0212. Some users find this beneficial.

> UTF8:                                3     3

I thought some of the JIS 2004 kanji are mapped to 4-byte UTF8 characters.

> Can any Japanese (or other) experts offer any clues?  Concrete questions:

I am not an expert but let me try to answer your questions.

> * Is anyone actually using MULE_INTERNAL today?
> * If so, what prevented migration?

As far as I know, MULE_INTERNAL is not used in production PostgreSQL
databases today. Of course this does not necessarily mean nobody is
using MULE_INTERNAL. My perspective is limited.

> * Was it ever actually used outside Japan?

I don't know.

> * Is the lack of interest in the new (22 year old) JIS standard in
> MULE_INTERNAL meaningful?

Maybe because at the time when JIS 2004 came out, very few users were
trying to adopt it.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Do we still need MULE_INTERNAL?

From:
Thomas Munro
Date:
On Wed, Feb 11, 2026 at 7:52 PM Tatsuo Ishii <ishii@postgresql.org> wrote:
> Thank you for the report. I find it is quite useful, especially the
> Emacs 23 internal (new to me). I agree that MULE_INTERNAL has
> fulfilled its historic role.

Thanks Ishii-san and Tom.  Here's a patch.  Obviously it mostly just
deletes thousands of lines, but also: I had to preserve the encoding
number, so there's a hole in the table, and I had to think of a new
name for cyrillic_and_mic.c, so I went with cyrillic.c because it
handles 4 single-byte encodings and it wasn't clear how to fit into
the existing x_and_y pattern (ie which two to highlight arbitrarily in
the name).

> > Since there are two encodings for kana characters and MULE's
> > superpower is to switch, I guess it depends how you chose to encode it
> > and what your ratio of kana to kanji is.
>
> The reason 2 encodings for "kana" exist in MULE is the nature of the
> character sets MULE supports. In Japanese there are 2 types of "kana":
> one is "hiragana" and the other is "katakana". JIS X0208/0212 include
> both types of "kana", while JIS X0201 includes only "katakana". So why
> does "katakana" appear in those two encodings? Katakana in JIS X0201
> is often rendered on screen at half width compared with JIS X 0208
> and 0212. Some users find this beneficial.

Ah, right, I see.  And judging by Wikipedia's article on half-width
katakana, it sounds like any scenario where it's mixed with hiragana
and kanji would probably not use them anyway, so perhaps 3 is a better
guess.  In other words, MULE_INTERNAL databases would probably not get
bigger if reloaded as UTF-8.

> > UTF8:                                3     3
>
> I thought some of JIS 2004 kanji are mapped to 4-byte UTF8 character.

Looks like it:

grep 'U+[0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F].*\[200[04]\]' \
./src/backend/utils/mb/Unicode/euc-jis-2004-std.txt

They are in "CJK Unified Ideographs Extension B" for "rare and
historic CJK ideographs", so I guess they wouldn't matter much, but in
any case we're talking about a hypothetical user moving from
MULE_INTERNAL, which *doesn't* have JIS 2004.  I think the older
standards are entirely in the basic plane, so only 1-3-byte UTF-8
sequences.
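The plane split is easy to confirm with a UTF-8 encoder: BMP code
points (up to U+FFFF) need at most 3 bytes, while Extension B starts
at U+20000 and therefore always needs 4:

```python
# BMP vs. supplementary-plane UTF-8 lengths: JIS 2004 additions placed
# in CJK Unified Ideographs Extension B (U+20000 and up) need 4 bytes,
# while kanji from the older standards, which sit in the BMP, need at
# most 3.
ext_b_first = chr(0x20000)   # first Extension B code point
bmp_kanji = "\u51dc"         # 凜, a BMP kanji
len_ext_b = len(ext_b_first.encode("utf-8"))
len_bmp = len(bmp_kanji.encode("utf-8"))
```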

. o O ( UTF-16 would probably be the ideal storage for CJK text if we
could do it... )

Attachments