Thread: UTF16 surrogate pairs in UTF8 encoding

UTF16 surrogate pairs in UTF8 encoding

From
Tom Lane
Date:
I just noticed that we are now advertising the ability to insert UTF16
surrogate pairs in strings and identifiers (see section 4.1.2.2 in
current docs, in particular).  Is this really wise?  I thought that
surrogate pairs were specifically prohibited in UTF8 strings, because
of the security hazards implicit in having more than one way to
represent the same code point.
        regards, tom lane


Re: UTF16 surrogate pairs in UTF8 encoding

From
Peter Eisentraut
Date:
On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
> I just noticed that we are now advertising the ability to insert UTF16
> surrogate pairs in strings and identifiers (see section 4.1.2.2 in
> current docs, in particular).  Is this really wise?  I thought that
> surrogate pairs were specifically prohibited in UTF8 strings, because
> of the security hazards implicit in having more than one way to
> represent the same code point.

We combine the surrogate pair components to a single code point and
encode that in UTF-8.  We don't encode the components separately; that
would be wrong.
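
The combining step described above follows the standard UTF-16 arithmetic. The following is a minimal Python sketch of that arithmetic, not PostgreSQL's actual code; the function name `combine_surrogates` is made up for illustration.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE00)
utf8_bytes = chr(code_point).encode("utf-8")
print(hex(code_point), utf8_bytes.hex())  # 0x1f600 f09f9880
```

The pair D83D/DE00 ends up as one 4-byte UTF-8 sequence rather than two 3-byte sequences.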



Re: UTF16 surrogate pairs in UTF8 encoding

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
>> I just noticed that we are now advertising the ability to insert UTF16
>> surrogate pairs in strings and identifiers (see section 4.1.2.2 in
>> current docs, in particular).  Is this really wise?  I thought that
>> surrogate pairs were specifically prohibited in UTF8 strings, because
>> of the security hazards implicit in having more than one way to
>> represent the same code point.

> We combine the surrogate pair components to a single code point and
> encode that in UTF-8.  We don't encode the components separately; that
> would be wrong.

Oh, OK.  Should the docs make that a bit clearer?
        regards, tom lane


Re: UTF16 surrogate pairs in UTF8 encoding

From
Florian Weimer
Date:
* Tom Lane:

> I just noticed that we are now advertising the ability to insert UTF16
> surrogate pairs in strings and identifiers (see section 4.1.2.2 in
> current docs, in particular).  Is this really wise?  I thought that
> surrogate pairs were specifically prohibited in UTF8 strings, because
> of the security hazards implicit in having more than one way to
> represent the same code point.

There is relatively little risk because surrogate pairs cannot encode
characters in the BMP, and presumably, most of the critical characters
are located there.

However, if this is converted to regular UTF-8, I really question the
sense of this.  Usually, people want CESU-8 to preserve ordering
between languages such as C# and Java and their database, and
conversion destroys this property.
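
The ordering point can be illustrated with a small Python sketch. The helper `cesu8` is a hypothetical name written for this example; it encodes each UTF-16 code unit separately, which is what CESU-8 does, and the two sample characters are arbitrary choices.

```python
def cesu8(text: str) -> bytes:
    """Encode text as CESU-8: every UTF-16 code unit is encoded on its own."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = "\uff5e"          # U+FF5E, a BMP character above the surrogate range
astral = "\U00010000"   # first code point outside the BMP

# Does the BMP character sort before the non-BMP one in each encoding?
utf16_order = bmp.encode("utf-16-be") < astral.encode("utf-16-be")
utf8_order = bmp.encode("utf-8") < astral.encode("utf-8")
cesu8_order = cesu8(bmp) < cesu8(astral)
print(utf16_order, utf8_order, cesu8_order)  # False True False
```

Bytewise comparison of CESU-8 agrees with UTF-16 code unit order, while regular UTF-8 follows code point order, so the two disagree for this pair of characters.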

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: UTF16 surrogate pairs in UTF8 encoding

From
Marko Kreen
Date:
On 8/22/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
>  > I just noticed that we are now advertising the ability to insert UTF16
>  > surrogate pairs in strings and identifiers (see section 4.1.2.2 in
>  > current docs, in particular).  Is this really wise?  I thought that
>  > surrogate pairs were specifically prohibited in UTF8 strings, because
>  > of the security hazards implicit in having more than one way to
>  > represent the same code point.
>
>
> We combine the surrogate pair components to a single code point and
>  encode that in UTF-8.  We don't encode the components separately; that
>  would be wrong.

AFAICS our UTF8 validator (pg_utf8_islegal) detects and rejects
such sequences if they are inserted by any means, e.g. \x

Although it's not very obvious...
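
As a rough illustration of what such a check looks for (a Python sketch, not the logic of pg_utf8_islegal itself; both function names are hypothetical), a surrogate encoded directly as three bytes always falls in the range ED A0 80 .. ED BF BF, and a strict UTF-8 decoder refuses it:

```python
def is_encoded_surrogate(seq: bytes) -> bool:
    """True if seq starts with a 3-byte encoding of U+D800..U+DFFF."""
    return (
        len(seq) >= 3
        and seq[0] == 0xED
        and 0xA0 <= seq[1] <= 0xBF
        and 0x80 <= seq[2] <= 0xBF
    )


def strict_utf8_ok(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two halves of a pair encoded separately (invalid UTF-8) ...
separately = "\ud83d\ude00".encode("utf-8", "surrogatepass")
# ... versus the combined code point encoded once (valid UTF-8).
combined = "\U0001F600".encode("utf-8")

print(is_encoded_surrogate(separately), strict_utf8_ok(separately))  # True False
print(is_encoded_surrogate(combined), strict_utf8_ok(combined))      # False True
```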

--
marko


Re: UTF16 surrogate pairs in UTF8 encoding

From
Peter Eisentraut
Date:
On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> > We combine the surrogate pair components to a single code point and
> > encode that in UTF-8.  We don't encode the components separately; that
> > would be wrong.
> 
> Oh, OK.  Should the docs make that a bit clearer?

Done.



Re: UTF16 surrogate pairs in UTF8 encoding

From
Marko Kreen
Date:
On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
>  > > We combine the surrogate pair components to a single code point and
>  > > encode that in UTF-8.  We don't encode the components separately; that
>  > > would be wrong.
>  >
>  > Oh, OK.  Should the docs make that a bit clearer?
>
>
> Done.

This is confusing:
 (When surrogate
 pairs are used when the server encoding is <literal>UTF8</>, they
 are first combined into a single code point that is then encoded
 in UTF-8.)

So something else happens if encoding is not UTF8?

I think this part can be simply removed, it does not add anything.

Or say that surrogate pairs are only allowed in UTF8 encoding.
Reason is that you cannot encode 0..7F codepoints with them,
and only those are allowed to be given numerically.  But this is
already mentioned before.

--
marko


Re: UTF16 surrogate pairs in UTF8 encoding

From
Peter Eisentraut
Date:
On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
> On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> > On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> >  > > We combine the surrogate pair components to a single code point and
> >  > > encode that in UTF-8.  We don't encode the components separately; that
> >  > > would be wrong.
> >  >
> >  > Oh, OK.  Should the docs make that a bit clearer?
> >
> >
> > Done.
> 
> This is confusing:
> 
>  (When surrogate
>  pairs are used when the server encoding is <literal>UTF8</>, they
>  are first combined into a single code point that is then encoded
>  in UTF-8.)
> 
> So something else happens if encoding is not UTF8?

Then you can't specify surrogate pairs because they are outside of the
ASCII range, per constraint mentioned earlier in the paragraph.

> I think this part can be simply removed, it does not add anything.
> 
> Or say that surrogate pairs are only allowed in UTF8 encoding.
> Reason is that you cannot encode 0..7F codepoints with them,
> and only those are allowed to be given numerically.  But this is
> already mentioned before.

Well, Tom wanted an additional explanation.  I personally agree with
you; this is not the place to explain encoding and Unicode internals,
when really the code only does what it's supposed to.



Re: UTF16 surrogate pairs in UTF8 encoding

From
Marko Kreen
Date:
On 9/8/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
>  > On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:
>  > > On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
>  > >  > > We combine the surrogate pair components to a single code point and
>  > >  > > encode that in UTF-8.  We don't encode the components separately; that
>  > >  > > would be wrong.
>  > >  >
>  > >  > Oh, OK.  Should the docs make that a bit clearer?
>  > >
>  > >
>  > > Done.
>  >
>  > This is confusing:
>  >
>  >  (When surrogate
>  >  pairs are used when the server encoding is <literal>UTF8</>, they
>  >  are first combined into a single code point that is then encoded
>  >  in UTF-8.)
>  >
>  > So something else happens if encoding is not UTF8?
>
>
> Then you can't specify surrogate pairs because they are outside of the
>  ASCII range, per constraint mentioned earlier in the paragraph.
>
>
>  > I think this part can be simply removed, it does not add anything.
>  >
>  > Or say that surrogate pairs are only allowed in UTF8 encoding.
>  > Reason is that you cannot encode 0..7F codepoints with them,
>  > and only those are allowed to be given numerically.  But this is
>  > already mentioned before.
>
>
> Well, Tom wanted an additional explanation.  I personally agree with
>  you; this is not the place to explain encoding and Unicode internals,
>  when really the code only does what it's supposed to.

Ah OK, I had the impression you changed wording before that too,
so then this addition seemed unnecessary.  But seems you only changed
formatting.

Anyway, this "when" makes it weird.  Maybe more concise version:
 To repeat, surrogate pairs are combined to single character and then encoded, not stored separately.

Although it does seem unnecessary.

--
marko


Re: UTF16 surrogate pairs in UTF8 encoding

From
Tom Lane
Date:
Marko Kreen <markokr@gmail.com> writes:
> Although it does seem unnecessary.

The reason I asked for this to be spelled out is that ordinarily,
a backslash escape \nnn is a very low-level thing that will insert
exactly what you say.  To me it's quite unexpected that the system
would editorialize on that to the extent of replacing two UTF16
surrogate characters by a single code point.  That's necessary for
correctness because our underlying storage is UTF8, but it's not
obvious that it will happen.  (As a counterexample, if our underlying
storage were UTF16, then very different things would need to happen
for the exact same SQL input.)
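
To make the counterexample concrete, here is a hedged Python sketch (the helper `utf16_units` is a made-up name, and U+1F600 is an arbitrary sample character) showing how the same code point looks under each storage form:

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


ch = "\U0001F600"
units = utf16_units(ch)
utf8 = list(ch.encode("utf-8"))

# UTF-16 storage would hold the surrogate pair exactly as the escapes spell it.
print([hex(u) for u in units])  # ['0xd83d', '0xde00']
# UTF-8 storage holds a single 4-byte sequence with no surrogates in it.
print([hex(b) for b in utf8])   # ['0xf0', '0x9f', '0x98', '0x80']
```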

I think a lot of people will have this same question when reading
this para, which is why I asked for an explanation there.
        regards, tom lane


Re: UTF16 surrogate pairs in UTF8 encoding

From
Marko Kreen
Date:
On 9/8/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marko Kreen <markokr@gmail.com> writes:
>  > Although it does seem unnecessary.
>
>
> The reason I asked for this to be spelled out is that ordinarily,
>  a backslash escape \nnn is a very low-level thing that will insert
>  exactly what you say.  To me it's quite unexpected that the system
>  would editorialize on that to the extent of replacing two UTF16
>  surrogate characters by a single code point.  That's necessary for
>  correctness because our underlying storage is UTF8, but it's not
>  obvious that it will happen.  (As a counterexample, if our underlying
>  storage were UTF16, then very different things would need to happen
>  for the exact same SQL input.)
>
>  I think a lot of people will have this same question when reading
>  this para, which is why I asked for an explanation there.

Ok, but I still don't like the "when"s.  How about:

-    6-digit form technically makes this unnecessary.  (When surrogate
-    pairs are used when the server encoding is <literal>UTF8</>, they
-    are first combined into a single code point that is then encoded
-    in UTF-8.)
+    6-digit form technically makes this unnecessary.  (Surrogate
+    pairs are not stored directly, but combined into a single
+    code point that is then encoded in UTF-8.)

-- 
marko


Re: UTF16 surrogate pairs in UTF8 encoding

From
Bruce Momjian
Date:
Marko Kreen wrote:
> On 9/8/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Marko Kreen <markokr@gmail.com> writes:
> >  > Although it does seem unnecessary.
> >
> >
> > The reason I asked for this to be spelled out is that ordinarily,
> >  a backslash escape \nnn is a very low-level thing that will insert
> >  exactly what you say.  To me it's quite unexpected that the system
> >  would editorialize on that to the extent of replacing two UTF16
> >  surrogate characters by a single code point.  That's necessary for
> >  correctness because our underlying storage is UTF8, but it's not
> >  obvious that it will happen.  (As a counterexample, if our underlying
> >  storage were UTF16, then very different things would need to happen
> >  for the exact same SQL input.)
> >
> >  I think a lot of people will have this same question when reading
> >  this para, which is why I asked for an explanation there.
> 
> Ok, but I still don't like the "when"s.  How about:
> 
> -    6-digit form technically makes this unnecessary.  (When surrogate
> -    pairs are used when the server encoding is <literal>UTF8</>, they
> -    are first combined into a single code point that is then encoded
> -    in UTF-8.)
> +    6-digit form technically makes this unnecessary.  (Surrogate
> +    pairs are not stored directly, but combined into a single
> +    code point that is then encoded in UTF-8.)

Applied, thanks.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +