Discussion: Unicode grapheme clusters
Just my luck, I had to dig into a two-"character" emoji that came to me
as part of a Google Calendar entry --- here it is:
    👩🏼⚕️🩺
    
                          libc
    Unicode     UTF8      len
    U+1F469  f0 9f 91 a9   2   woman
    U+1F3FC  f0 9f 8f bc   2   emoji modifier fitzpatrick type-3 (skin tone)
    U+200D   e2 80 8d      0   zero width joiner (ZWJ)
    U+2695   e2 9a 95      1   staff with snake
    U+FE0F   ef b8 8f      0   variation selector-16 (VS16) (previous character as emoji)
    U+1FA7A  f0 9f a9 ba   2   stethoscope
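As an illustration of the table above (not part of the original message), a short Python sketch can list the same code points and their UTF-8 bytes. The helper name `describe` is hypothetical, and the names printed come from whatever Unicode data ships with the Python in use, so they may be worded differently from the descriptions in the table.

```python
import unicodedata

# The emoji sequence from the table, written with escapes so that the
# invisible code points (ZWJ, VS16) are explicit.
EMOJI = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

def describe(text):
    """Return (code point, UTF-8 bytes in hex, Unicode name) for each code point."""
    rows = []
    for ch in text:
        utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        # Fall back to a placeholder if this Python's Unicode data lacks a name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", utf8_hex, name))
    return rows

for code_point, utf8_hex, name in describe(EMOJI):
    print(f"{code_point:<8} {utf8_hex:<12} {name}")

print("code points:", len(EMOJI))
print("UTF-8 bytes:", len(EMOJI.encode("utf-8")))
```

The two totals are the number of code points and the number of encoded bytes, which are different from the number of glyphs a renderer ends up drawing.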
Now, in Debian 11 character apps like vi, I see:
  a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)
Display widths are in parentheses.  I also see '<200d>' in blue.
In current Firefox, I see a woman with a stethoscope around her neck,
and then a stethoscope.  Copying the Unicode string above into a browser
URL bar should show you the same thing, though it might be too small to
see.
For those looking for details on how these should be handled, see this
for an explanation of grapheme clusters that use things like skin tone
modifiers and zero-width joiners:
    https://tonsky.me/blog/emoji/
These comments explain the confusion of the term character:
https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
and I think this comment summarizes it well:
    https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237
    This is by design. wcwidth() is utterly broken. Any terminal or terminal
    application that uses it is also utterly broken. Forget about emoji
    wcwidth() doesn't even work with combining characters, zero width
    joiners, flags, and a whole bunch of other things.
I decided to see how Postgres, without ICU, handles it:
    show lc_ctype;
      lc_ctype
    -------------
     en_US.UTF-8
    select octet_length('👩🏼⚕️🩺');
     octet_length
    --------------
               21
    
    select character_length('👩🏼⚕️🩺');
     character_length
    ------------------
                    6
The octet_length() is verified as correct by counting the UTF8 bytes
above.  I think character_length() is correct if we consider the number
of Unicode characters, display and non-display.
I then started looking at how Postgres computes and uses _display_
width.  The display width, when processed properly as Firefox does, is 4
(two double-wide displayed characters).  Based on the libc display
lengths above and the incorrect displayed character lengths in Debian 11,
it would be 7.
libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
the per-encoding width function stored in pg_wchar_table.dsplen --- for
UTF8, the function is pg_utf_dsplen().
There is no SQL API for display length, but PQdsplen() can be exercised
on a whole string by calling pg_wcswidth() from the gdb debugger:
    pg_wcswidth(const char *pwcs, size_t len, int encoding)
    UTF8 encoding == 6
    (gdb) print (int)pg_wcswidth("abcd", 4, 6)
    $8 = 4
    (gdb) print (int)pg_wcswidth("👩🏼⚕️🩺", 21, 6)
    $9 = 7
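The 7 above is what you get when a width is assigned to each code point independently and the results are summed. A rough Python sketch of that style of computation follows. It is not pg_utf_dsplen() or libc's wcwidth(): the function names are hypothetical, and the rules (zero columns for categories Mn, Me and Cf, two columns for East Asian Wide or Fullwidth) are simplifying assumptions. The result also depends on the Unicode data bundled with the Python in use.

```python
import unicodedata

EMOJI = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

def naive_width(ch):
    """Approximate column width of one code point, ignoring its neighbors.

    Assumption: a simplified rule set, not the table used by libc or Postgres.
    """
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0          # combining marks and format controls such as ZWJ
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2          # wide and fullwidth characters, including most emoji
    return 1

def naive_string_width(text):
    """Sum per-code-point widths, the way a wcwidth()-style loop would."""
    return sum(naive_width(ch) for ch in text)

print([naive_width(ch) for ch in EMOJI])
print(naive_string_width(EMOJI))
print(naive_string_width("abcd"))
```

Because each code point is measured in isolation, the skin tone modifier and the staff symbol are counted as separate visible columns even though a cluster-aware renderer folds them into the first glyph.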
Here is the psql output:
    SELECT octet_length('👩🏼⚕️🩺'), '👩🏼⚕️🩺', character_length('👩🏼⚕️🩺');
     octet_length | ?column? | character_length
    --------------+----------+------------------
               21 | 👩🏼⚕️🩺  |                6
More often called from psql are pg_wcssize() and pg_wcsformat(), which
also call PQdsplen().
I think the question is whether we want to report a string width that
assumes the display doesn't understand the more complex UTF8
controls/"characters" listed above.
 
tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
pg_wchar_table.dsplen.  p_isspecial() also has a small table of what it
calls "strange_letter".
Here is a report about Unicode variation selector and combining
characters from May, 2022:
    https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp
Is this something people want improved?
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com
Embrace your flaws.  They make you human, rather than perfect,
which you will never be.
			
On Thu, Jan 19, 2023 at 1:20, Bruce Momjian <bruce@momjian.us> wrote:
> Is this something people want improved?
Surely it should be fixed. Unfortunately - all the terminals that I can use don't support it. So at this moment it may be premature to fix it, because the visual form will still be broken.
Regards
Pavel
On Thu, Jan 19, 2023 at 02:44:57PM +0100, Pavel Stehule wrote:
> Surely it should be fixed. Unfortunately - all the terminals that I can use
> don't support it. So at this moment it may be premature to fix it, because the
> visual form will still be broken.
Yes, none of my terminal emulators handle grapheme clusters either.  In
fact, viewing this email messed up my screen and I had to use control-L
to fix it.
I think one big problem is that our Unicode library doesn't have any way
I know of to query the display device to determine how it
supports/renders Unicode characters, so any display width we report
could be wrong.
Oddly, it seems grapheme clusters were added in Unicode 3.2, which came
out in 2002:
    https://www.unicode.org/reports/tr28/tr28-3.html
    https://www.quora.com/What-is-graphemeCluster
but somehow I am only studying them now.
Anyway, I added a psql item for this so we don't forget about it:
    https://wiki.postgresql.org/wiki/Todo#psql
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com
			
This is how we've always documented it. Postgres treats code points as "characters", not graphemes.
You don't need to go to anything as esoteric as emojis to see this either. Accented characters like é have canonical forms that are multiple code points, and in some character sets some accented characters can only be represented that way.
But I don't think there's any reason to consider changing the existing functions. They have to be consistent with substr and the other string manipulation functions.
We could add new functions to work with graphemes but it might bring more pain keeping it up to date....
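To make the é example concrete (an illustration, not part of the original message), Python's standard `unicodedata.normalize` can convert between the precomposed and decomposed canonical forms. The two strings render identically but differ in code point count, which is the quantity a code-point-based length function reports.

```python
import unicodedata

composed = "\u00E9"                                   # é as one precomposed code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)   # 'e' followed by U+0301 combining acute

print(len(composed), len(decomposed))                 # code point counts of the two forms
print(composed == decomposed)                         # plain comparison sees different strings
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Any function that counts or slices by code point will treat these two spellings of the same visible character differently unless the input is normalized first.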
On Thu, Jan 19, 2023 at 07:37:48PM -0500, Greg Stark wrote:
> This is how we've always documented it. Postgres treats code points as
> "characters" not graphemes.
>
> You don't need to go to anything as esoteric as emojis to see this either.
> Accented characters like é have canonical forms that are multiple code
> points and in some character sets some accented characters can only be
> represented that way.
>
> But I don't think there's any reason to consider changing the existing functions.
> They have to be consistent with substr and the other string manipulation
> functions.
>
> We could add new functions to work with graphemes but it might bring more pain
> keeping it up to date....
I am not sure what you are referring to above?  character_length?  I was
talking about display length, and psql uses that --- at some point, our
lack of support for graphemes will cause psql to not align columns.
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com
Bruce Momjian <bruce@momjian.us> writes:
> I am not sure what you are referring to above?  character_length?  I was
> talking about display length, and psql uses that --- at some point, our
> lack of support for graphemes will cause psql to not align columns.
That's going to happen regardless, as long as we can't be sure
what the display will do with the characters --- and that's a
problem that will persist for a very long time.
Ideally, yeah, it'd be great if all this stuff rendered perfectly;
but IMO it's so far outside mainstream usage of psql that it's
not something that could possibly repay the investment of time
to get even a partial solution.
            regards, tom lane
			
On Thu, Jan 19, 2023 at 07:53:43PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I am not sure what you are referring to above?  character_length?  I was
> > talking about display length, and psql uses that --- at some point, our
> > lack of support for graphemes will cause psql to not align columns.
>
> That's going to happen regardless, as long as we can't be sure
> what the display will do with the characters --- and that's a
> problem that will persist for a very long time.
>
> Ideally, yeah, it'd be great if all this stuff rendered perfectly;
> but IMO it's so far outside mainstream usage of psql that it's
> not something that could possibly repay the investment of time
> to get even a partial solution.
We have a few options:
* TODO item
* document that psql works that way
* do nothing
I think the big question is how common such cases will be in the future.
The report from 2022 and one from 2019 didn't seem to clearly outline
the issue, so it would be good to have something documented somewhere.
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com
On Fri, Jan 20, 2023 at 2:55, Bruce Momjian <bruce@momjian.us> wrote:
> We have a few options:
> * TODO item
> * document that psql works that way
> * do nothing
> I think the big question is how common such cases will be in the future.
> The report from 2022 and one from 2019 didn't seem to clearly outline
> the issue, so it would be good to have something documented somewhere.
I partially follow the progress in VTE, one of the widely used terminal libs, and I am very skeptical that there will be support in the next two years.
Maybe the new Microsoft terminal will give this area a new dynamic, but currently only a few people on the planet are working on fixing or enhancing terminal technologies. Unfortunately there is too much historical ballast.
Regards
Pavel
On Fri, 20 Jan 2023 at 00:07, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> I partially follow the progress in VTE, one of the widely used terminal libs, and I am very skeptical that there will be support in the next two years.
>
> Maybe the new Microsoft terminal will give this area a new dynamic, but currently only a few people on the planet are working on fixing or enhancing terminal technologies. Unfortunately there is too much historical ballast.
Fwiw this isn't really about terminal emulators. psql is also used to
generate text files for reports or for display in various ways.
I think it's worth using whatever APIs we have available to implement
better alignment for grapheme clusters and just assume whatever will
eventually be used to display the output will display it "properly".
I do not think it's worth trying to implement this ourselves if the
libraries aren't there yet. And I don't think it's worth trying to
adapt to the current state of the current terminal. We don't know that
that's the only place the output will be viewed and it'll all be
wasted effort when the terminals eventually implement full support.
(If we were really crazy about this we could use terminal escape codes
to query the current cursor position after emitting multicharacter
graphemes. But as I said, I don't even think that would be useful,
even if there weren't other reasons it would be a bad idea)
--
greg
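For readers unfamiliar with the escape-code idea mentioned in passing, here is a minimal Python sketch (an illustration, not something proposed in the thread). The request is the standard Device Status Report sequence ESC [ 6 n, and terminals that support it answer with ESC [ row ; col R. The helper names `parse_cursor_report` and `columns_used` are hypothetical, and the sketch only parses replies. Actually reading one needs an interactive terminal in raw mode, and it cannot work when output goes to a file or a pipe.

```python
import re

CURSOR_QUERY = "\x1b[6n"   # DSR request: ask the terminal where the cursor is

# Expected reply shape: ESC [ <row> ; <col> R
_REPORT = re.compile(r"\x1b\[(\d+);(\d+)R")

def parse_cursor_report(reply):
    """Parse a cursor position report into (row, column), or None if malformed."""
    match = _REPORT.fullmatch(reply)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def columns_used(before, after):
    """Columns the cursor advanced between two reports taken on the same row."""
    (row_a, col_a), (row_b, col_b) = before, after
    if row_a != row_b:
        return None        # output wrapped or scrolled; the difference is meaningless
    return col_b - col_a

print(parse_cursor_report("\x1b[12;5R"))
print(columns_used((3, 1), (3, 5)))
```

Measuring a string this way means one query before and one after writing it, which ties the result to one specific terminal at one specific moment.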
On Sat, Jan 21, 2023 at 17:21, Greg Stark <stark@mit.edu> wrote:
> Fwiw this isn't really about terminal emulators. psql is also used to
> generate text files for reports or for display in various ways.
>
> I think it's worth using whatever APIs we have available to implement
> better alignment for grapheme clusters and just assume whatever will
> eventually be used to display the output will display it "properly".
>
> I do not think it's worth trying to implement this ourselves if the
> libraries aren't there yet. And I don't think it's worth trying to
> adapt to the current state of the current terminal. We don't know that
> that's the only place the output will be viewed and it'll all be
> wasted effort when the terminals eventually implement full support.
>
> (If we were really crazy about this we could use terminal escape codes
> to query the current cursor position after emitting multicharacter
> graphemes. But as I said, I don't even think that would be useful,
> even if there weren't other reasons it would be a bad idea)
+1
Pavel
Greg Stark <stark@mit.edu> writes:
> (If we were really crazy about this we could use terminal escape codes
> to query the current cursor position after emitting multicharacter
> graphemes. But as I said, I don't even think that would be useful,
> even if there weren't other reasons it would be a bad idea)
Yeah, use of a pager would be enough to break that.
            regards, tom lane
			
		On Sat, Jan 21, 2023 at 11:20:39AM -0500, Greg Stark wrote:
> On Fri, 20 Jan 2023 at 00:07, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >
> > I partially follow the progress in VTE, one of the widely used terminal libs, and I am very skeptical that there will be support in the next two years.
> >
> > Maybe the new Microsoft terminal will give this area a new dynamic, but currently only a few people on the planet are working on fixing or enhancing terminal technologies. Unfortunately there is too much historical ballast.
> 
> Fwiw this isn't really about terminal emulators. psql is also used to
> generate text files for reports or for display in various ways.
> 
> I think it's worth using whatever APIs we have available to implement
> better alignment for grapheme clusters and just assume whatever will
> eventually be used to display the output will display it "properly".
> 
> I do not think it's worth trying to implement this ourselves if the
> libraries aren't there yet. And I don't think it's worth trying to
> adapt to the current state of the current terminal. We don't know that
> that's the only place the output will be viewed and it'll all be
> wasted effort when the terminals eventually implement full support.
Well, as one of the URLs I quoted said:
    This is by design. wcwidth() is utterly broken. Any terminal or
    terminal application that uses it is also utterly broken. Forget
    about emoji wcwidth() doesn't even work with combining characters,
    zero width joiners, flags, and a whole bunch of other things.
So, either we have to find a function in the library that will do the
looping over the string for us, or we need to identify the special
Unicode characters that create grapheme clusters and handle them in our
code.
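As a rough illustration of the second option (handling the cluster-forming code points ourselves), here is a Python sketch. It is a deliberately simplified heuristic and an assumption throughout, not the Unicode text segmentation algorithm: it only knows about ZWJ, variation selectors, skin tone modifiers and combining marks, and ignores flags, Hangul and other cases. All function names are hypothetical, and the result depends on the Unicode data bundled with the Python in use.

```python
import unicodedata

ZWJ = "\u200D"
VS16 = "\uFE0F"
EMOJI = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

def is_extender(ch):
    """True if ch attaches to the preceding code point (simplified assumption)."""
    return (
        ch == ZWJ
        or "\uFE00" <= ch <= "\uFE0F"             # variation selectors
        or "\U0001F3FB" <= ch <= "\U0001F3FF"     # emoji skin tone modifiers
        or unicodedata.category(ch) in ("Mn", "Me")
    )

def clusters(text):
    """Split text into approximate grapheme clusters."""
    result = []
    for ch in text:
        # A code point joins the previous cluster if it is an extender or
        # if the previous cluster ends with a zero width joiner.
        if result and (is_extender(ch) or result[-1].endswith(ZWJ)):
            result[-1] += ch
        else:
            result.append(ch)
    return result

def cluster_width(cluster):
    """Guess the column width of one cluster on a cluster-aware display."""
    if VS16 in cluster:
        return 2           # VS16 requests emoji presentation, assumed double-wide
    if any(unicodedata.east_asian_width(ch) in ("W", "F") for ch in cluster):
        return 2
    return 0 if unicodedata.category(cluster[0]) in ("Mn", "Me", "Cf") else 1

def display_width(text):
    """Sum one width per cluster instead of one width per code point."""
    return sum(cluster_width(c) for c in clusters(text))

print(len(clusters(EMOJI)), display_width(EMOJI))
```

For the emoji sequence in this thread the sketch finds two clusters of two columns each, matching what a cluster-aware renderer shows, but a display that does not combine the sequence will still use more columns than that.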
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com
			
		On Sat, Jan 21, 2023 at 12:37:30PM -0500, Bruce Momjian wrote:
> Well, as one of the URLs I quoted said:
> 
>     This is by design. wcwidth() is utterly broken. Any terminal or
>     terminal application that uses it is also utterly broken. Forget
>     about emoji wcwidth() doesn't even work with combining characters,
>     zero width joiners, flags, and a whole bunch of other things.
> 
> So, either we have to find a function in the library that will do the
> looping over the string for us, or we need to identify the special
> Unicode characters that create grapheme clusters and handle them in our
> code.
I just checked if wcswidth() would honor grapheme clusters, though
wcwidth() does not, but it seems wcswidth() treats characters just like
wcwidth():
    $ LANG=en_US.UTF-8 grapheme_test
    wcswidth len=7
    
    bytes_consumed=4, wcwidth len=2
    bytes_consumed=4, wcwidth len=2
    bytes_consumed=3, wcwidth len=0
    bytes_consumed=3, wcwidth len=1
    bytes_consumed=3, wcwidth len=0
    bytes_consumed=4, wcwidth len=2
C test program attached.  This is on Debian 11.
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com
			
Bruce Momjian <bruce@momjian.us> writes:
> I just checked if wcswidth() would honor grapheme clusters, though
> wcwidth() does not, but it seems wcswidth() treats characters just like
> wcwidth():
Well, that's at least potentially fixable within libc, while wcwidth
clearly can never do this right.
Probably our long-term answer is to avoid depending on wcwidth
and use wcswidth instead.  But it's hard to get excited about
doing the legwork for that until popular libc implementations
get it right.
            regards, tom lane
			
On Sat, Jan 21, 2023 at 01:17:27PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I just checked if wcswidth() would honor grapheme clusters, though
> > wcwidth() does not, but it seems wcswidth() treats characters just like
> > wcwidth():
>
> Well, that's at least potentially fixable within libc, while wcwidth
> clearly can never do this right.
>
> Probably our long-term answer is to avoid depending on wcwidth
> and use wcswidth instead.  But it's hard to get excited about
> doing the legwork for that until popular libc implementations
> get it right.
Agreed.
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com
On Sat, 21 Jan 2023 at 13:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Probably our long-term answer is to avoid depending on wcwidth
> and use wcswidth instead.  But it's hard to get excited about
> doing the legwork for that until popular libc implementations
> get it right.
Here's an interesting blog post about trying to do this in Rust:
https://tomdebruijn.com/posts/rust-string-length-width-calculations/
TL;DR... Even counting the number of graphemes isn't enough because
terminals typically (but not always) display emoji graphemes using two
columns.
At the end of the day Unicode kind of assumes a variable-width display
where the rendering is handled by something that has access to the
actual font metrics. So anything trying to line things up in columns
in a way that works with any rendering system down the line using any
font is going to be making a best guess.
--
greg
On Tue, 24 Jan 2023 at 11:40, Greg Stark <stark@mit.edu> wrote:
> At the end of the day Unicode kind of assumes a variable-width display
> where the rendering is handled by something that has access to the
> actual font metrics. So anything trying to line things up in columns
> in a way that works with any rendering system down the line using any
> font is going to be making a best guess.
Really what is needed is another Unicode attribute: how many columns of a monospaced display each character (or grapheme cluster) should take up. The standard should include a precisely defined function that can take any sequence of characters and give back its width in monospaced display character spaces. Typefaces should only qualify as monospaced if they respect this standard-defined computation.
Note that this is not actually a new thing: this was included in ASCII implicitly, with a value of 1 for every character, and a value of n for every n-character string. It has always been possible to line up values displayed on monospaced displays by adding spaces, and it is only the omission of this feature from Unicode which currently makes it impossible.
On Tue, Jan 24, 2023 at 11:40:01AM -0500, Greg Stark wrote:
> On Sat, 21 Jan 2023 at 13:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Probably our long-term answer is to avoid depending on wcwidth
> > and use wcswidth instead.  But it's hard to get excited about
> > doing the legwork for that until popular libc implementations
> > get it right.
>
> Here's an interesting blog post about trying to do this in Rust:
>
> https://tomdebruijn.com/posts/rust-string-length-width-calculations/
>
> TL;DR... Even counting the number of graphemes isn't enough because
> terminals typically (but not always) display emoji graphemes using two
> columns.
>
> At the end of the day Unicode kind of assumes a variable-width display
> where the rendering is handled by something that has access to the
> actual font metrics. So anything trying to line things up in columns
> in a way that works with any rendering system down the line using any
> font is going to be making a best guess.
Yes, good article, though I am still surprised this is not discussed
more often.  Anyway, for psql, we assume a fixed width output device, so
we can just assume that for computation.  You are right that Unicode
just doesn't seem to consider fixed width output cases and doesn't
provide much guidance.
Beyond psql, should we update our docs to say that character_length()
for Unicode returns the number of Unicode code points, and not
necessarily the number of displayed characters if grapheme clusters are
present?
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com