Thread: UTF16 surrogate pairs in UTF8 encoding

UTF16 surrogate pairs in UTF8 encoding

From: Tom Lane
Date: 22 August 2010, 15:29:30

I just noticed that we are now advertising the ability to insert UTF16
surrogate pairs in strings and identifiers (see section 4.1.2.2 in
current docs, in particular).  Is this really wise?  I thought that
surrogate pairs were specifically prohibited in UTF8 strings, because
of the security hazards implicit in having more than one way to
represent the same code point.
        regards, tom lane

Re: UTF16 surrogate pairs in UTF8 encoding

From: Peter Eisentraut
Date: 22 August 2010, 16:12:55

On sön, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
> I just noticed that we are now advertising the ability to insert UTF16
> surrogate pairs in strings and identifiers (see section 4.1.2.2 in
> current docs, in particular).  Is this really wise?  I thought that
> surrogate pairs were specifically prohibited in UTF8 strings, because
> of the security hazards implicit in having more than one way to
> represent the same code point.

We combine the surrogate pair components to a single code point and
encode that in UTF-8.  We don't encode the components separately; that
would be wrong.
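
For readers following along, the combination step described above can be sketched in a few lines. This is only an illustration of the arithmetic in Python, not PostgreSQL's actual code; the function names `combine_surrogates` and `surrogate_pair_to_utf8` are made up for the example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each half contributes 10 bits; the result starts at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Encode the combined code point (not the two halves) in UTF-8."""
    return chr(combine_surrogates(high, low)).encode("utf-8")


# U+1D11E MUSICAL SYMBOL G CLEF is the pair D834 DD1E in UTF-16.
print(hex(combine_surrogates(0xD834, 0xDD1E)))      # 0x1d11e
print(surrogate_pair_to_utf8(0xD834, 0xDD1E).hex())  # f09d849e
```

The result is one 4-byte UTF-8 sequence. Because the smallest value the formula can produce is U+10000, a surrogate pair can never name a code point that also has a shorter representation.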

Re: UTF16 surrogate pairs in UTF8 encoding

From: Tom Lane
Date: 22 August 2010, 16:15:14

Peter Eisentraut <peter_e@gmx.net> writes:
> On sön, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
>> I just noticed that we are now advertising the ability to insert UTF16
>> surrogate pairs in strings and identifiers (see section 4.1.2.2 in
>> current docs, in particular).  Is this really wise?  I thought that
>> surrogate pairs were specifically prohibited in UTF8 strings, because
>> of the security hazards implicit in having more than one way to
>> represent the same code point.

> We combine the surrogate pair components to a single code point and
> encode that in UTF-8.  We don't encode the components separately; that
> would be wrong.

Oh, OK.  Should the docs make that a bit clearer?
        regards, tom lane

Re: UTF16 surrogate pairs in UTF8 encoding

From: Florian Weimer
Date: 23 August 2010, 03:50:47

* Tom Lane:

> I just noticed that we are now advertising the ability to insert UTF16
> surrogate pairs in strings and identifiers (see section 4.1.2.2 in
> current docs, in particular).  Is this really wise?  I thought that
> surrogate pairs were specifically prohibited in UTF8 strings, because
> of the security hazards implicit in having more than one way to
> represent the same code point.

There is relatively little risk because surrogate pairs cannot encode
characters in the BMP, and presumably, most of the critical characters
are located there.

However, if this is converted to regular UTF-8, I really question the
sense of this.  Usually, people want CESU-8 to preserve ordering
between languages such as C# and Java and their database, and
conversion destroys this property.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
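
The ordering concern raised in this message can be made concrete with a small sketch. It assumes that the client language compares strings by UTF-16 code unit (an ordinal comparison) and that the database compares UTF-8 bytes; the helpers `utf16_code_units` and `cesu8` are hypothetical names written for this illustration, and `cesu8` is a toy encoder rather than a library API.

```python
def utf16_code_units(s: str) -> list:
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cesu8(s: str) -> bytes:
    """Toy CESU-8 encoder: encode every UTF-16 code unit separately."""
    out = bytearray()
    for unit in utf16_code_units(s):
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_char = "\uFF5E"         # a BMP character above the surrogate range
astral_char = "\U00010000"  # the first code point that needs a surrogate pair
words = [bmp_char, astral_char]

by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=utf16_code_units)
by_cesu8 = sorted(words, key=cesu8)

print(by_utf8 == [bmp_char, astral_char])   # True: UTF-8 follows code point order
print(by_utf16 == [astral_char, bmp_char])  # True: 0xD800 sorts below 0xFF5E
print(by_cesu8 == by_utf16)                 # True: CESU-8 keeps the UTF-16 order
```

Under those assumptions, the two sort orders disagree only when a supplementary character is compared with a BMP character in the range U+E000..U+FFFF.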

Re: UTF16 surrogate pairs in UTF8 encoding

From: Marko Kreen
Date: 23 August 2010, 07:21:14

On 8/22/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> On sön, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
>  > I just noticed that we are now advertising the ability to insert UTF16
>  > surrogate pairs in strings and identifiers (see section 4.1.2.2 in
>  > current docs, in particular).  Is this really wise?  I thought that
>  > surrogate pairs were specifically prohibited in UTF8 strings, because
>  > of the security hazards implicit in having more than one way to
>  > represent the same code point.
>
>
> We combine the surrogate pair components to a single code point and
>  encode that in UTF-8.  We don't encode the components separately; that
>  would be wrong.

AFAICS our UTF8 validator (pg_utf8_islegal) detects and rejects
such sequences, if they are inserted via any means, eg. \x

Although it's not very obvious...

--
marko
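
The rule mentioned in this message, that a strict UTF-8 validator refuses surrogates encoded as separate 3-byte sequences, can be observed with any strict decoder. The sketch below uses Python's built-in UTF-8 codec as a stand-in; it is not `pg_utf8_islegal`, and `is_strict_utf8` is a name invented for the example.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 according to Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+1D11E as a single 4-byte sequence: well-formed.
print(is_strict_utf8(b"\xf0\x9d\x84\x9e"))          # True

# The same character as two separately encoded surrogates
# (D834 -> ED A0 B4, DD1E -> ED B4 9E): rejected.
print(is_strict_utf8(b"\xed\xa0\xb4\xed\xb4\x9e"))  # False
```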

Re: UTF16 surrogate pairs in UTF8 encoding

From: Peter Eisentraut
Date: 07 September 2010, 15:54:59

On sön, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> > We combine the surrogate pair components to a single code point and
> > encode that in UTF-8.  We don't encode the components separately; that
> > would be wrong.
> 
> Oh, OK.  Should the docs make that a bit clearer?

Done.

Re: UTF16 surrogate pairs in UTF8 encoding

From: Marko Kreen
Date: 08 September 2010, 04:18:45

On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> On sön, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
>  > > We combine the surrogate pair components to a single code point and
>  > > encode that in UTF-8.  We don't encode the components separately; that
>  > > would be wrong.
>  >
>  > Oh, OK.  Should the docs make that a bit clearer?
>
>
> Done.

This is confusing:
 (When surrogate
 pairs are used when the server encoding is <literal>UTF8</>, they
 are first combined into a single code point that is then encoded
 in UTF-8.)

So something else happens if encoding is not UTF8?

I think this part can be simply removed, it does not add anything.

Or say that surrogate pairs are only allowed in UTF8 encoding.
Reason is that you cannot encode 0..7F codepoints with them,
and only those are allowed to be given numerically.  But this is
already mentioned before.

--
marko

Re: UTF16 surrogate pairs in UTF8 encoding

From: Peter Eisentraut
Date: 08 September 2010, 07:36:33

On ons, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
> On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> > On sön, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> >  > > We combine the surrogate pair components to a single code point and
> >  > > encode that in UTF-8.  We don't encode the components separately; that
> >  > > would be wrong.
> >  >
> >  > Oh, OK.  Should the docs make that a bit clearer?
> >
> >
> > Done.
> 
> This is confusing:
> 
>  (When surrogate
>  pairs are used when the server encoding is <literal>UTF8</>, they
>  are first combined into a single code point that is then encoded
>  in UTF-8.)
> 
> So something else happens if encoding is not UTF8?

Then you can't specify surrogate pairs because they are outside of the
ASCII range, per constraint mentioned earlier in the paragraph.

> I think this part can be simply removed, it does not add anything.
> 
> Or say that surrogate pairs are only allowed in UTF8 encoding.
> Reason is that you cannot encode 0..7F codepoints with them,
> and only those are allowed to be given numerically.  But this is
> already mentioned before.

Well, Tom wanted an additional explanation.  I personally agree with
you; this is not the place to explain encoding and Unicode internals,
when really the code only does what it's supposed to.

Re: UTF16 surrogate pairs in UTF8 encoding

From: Marko Kreen
Date: 08 September 2010, 07:45:50

On 9/8/10, Peter Eisentraut <peter_e@gmx.net> wrote:
> On ons, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
>  > On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:
>  > > On sön, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
>  > >  > > We combine the surrogate pair components to a single code point and
>  > >  > > encode that in UTF-8.  We don't encode the components separately; that
>  > >  > > would be wrong.
>  > >  >
>  > >  > Oh, OK.  Should the docs make that a bit clearer?
>  > >
>  > >
>  > > Done.
>  >
>  > This is confusing:
>  >
>  >  (When surrogate
>  >  pairs are used when the server encoding is <literal>UTF8</>, they
>  >  are first combined into a single code point that is then encoded
>  >  in UTF-8.)
>  >
>  > So something else happens if encoding is not UTF8?
>
>
> Then you can't specify surrogate pairs because they are outside of the
>  ASCII range, per constraint mentioned earlier in the paragraph.
>
>
>  > I think this part can be simply removed, it does not add anything.
>  >
>  > Or say that surrogate pairs are only allowed in UTF8 encoding.
>  > Reason is that you cannot encode 0..7F codepoints with them,
>  > and only those are allowed to be given numerically.  But this is
>  > already mentioned before.
>
>
> Well, Tom wanted an additional explanation.  I personally agree with
>  you; this is not the place to explain encoding and Unicode internals,
>  when really the code only does what it's supposed to.

Ah OK, I had the impression you changed wording before that too,
so then this addition seemed unnecessary.  But seems you only changed
formatting.

Anyway, this "when" makes it weird.  Maybe more concise version:
 To repeat, surrogate pairs are combined to single character and then encoded, not stored separately.

Although it does seem unnecessary.

--
marko

Re: UTF16 surrogate pairs in UTF8 encoding

From: Tom Lane
Date: 08 September 2010, 11:01:48

Marko Kreen <markokr@gmail.com> writes:
> Although it does seem unnecessary.

The reason I asked for this to be spelled out is that ordinarily,
a backslash escape \nnn is a very low-level thing that will insert
exactly what you say.  To me it's quite unexpected that the system
would editorialize on that to the extent of replacing two UTF16
surrogate characters by a single code point.  That's necessary for
correctness because our underlying storage is UTF8, but it's not
obvious that it will happen.  (As a counterexample, if our underlying
storage were UTF16, then very different things would need to happen
for the exact same SQL input.)

I think a lot of people will have this same question when reading
this para, which is why I asked for an explanation there.
        regards, tom lane
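
The counterexample in this message can be sketched as follows. The code assumes a hypothetical `\uXXXX` escape syntax parsed by a toy regular expression; it is not PostgreSQL's lexer, and `escapes_to_code_units`, `store_as_utf16` and `store_as_utf8` are illustrative names only.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def escapes_to_code_units(text: str) -> list:
    """Return the 16-bit values named by the escapes in text, in order."""
    return [int(m.group(1), 16) for m in ESCAPE.finditer(text)]


def store_as_utf16(units: list) -> bytes:
    """With UTF-16 storage, each escape maps directly to one code unit."""
    return b"".join(u.to_bytes(2, "big") for u in units)


def store_as_utf8(units: list) -> bytes:
    """With UTF-8 storage, surrogate pairs must be merged before encoding.

    Decoding as UTF-16 performs the merge; an unpaired surrogate
    raises UnicodeDecodeError.
    """
    return store_as_utf16(units).decode("utf-16-be").encode("utf-8")


units = escapes_to_code_units(r"\uD834\uDD1E")
print(store_as_utf16(units).hex())  # d834dd1e
print(store_as_utf8(units).hex())   # f09d849e
```

The same two escapes are written out verbatim in the UTF-16 case but are replaced by one 4-byte sequence in the UTF-8 case, which is the behaviour the documentation note is meant to explain.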

Re: UTF16 surrogate pairs in UTF8 encoding

From: Marko Kreen
Date: 08 September 2010, 11:23:54

On 9/8/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marko Kreen <markokr@gmail.com> writes:
>  > Although it does seem unnecessary.
>
>
> The reason I asked for this to be spelled out is that ordinarily,
>  a backslash escape \nnn is a very low-level thing that will insert
>  exactly what you say.  To me it's quite unexpected that the system
>  would editorialize on that to the extent of replacing two UTF16
>  surrogate characters by a single code point.  That's necessary for
>  correctness because our underlying storage is UTF8, but it's not
>  obvious that it will happen.  (As a counterexample, if our underlying
>  storage were UTF16, then very different things would need to happen
>  for the exact same SQL input.)
>
>  I think a lot of people will have this same question when reading
>  this para, which is why I asked for an explanation there.

Ok, but I still don't like the "when"s.  How about:

-    6-digit form technically makes this unnecessary.  (When surrogate
-    pairs are used when the server encoding is <literal>UTF8</>, they
-    are first combined into a single code point that is then encoded
-    in UTF-8.)
+    6-digit form technically makes this unnecessary.  (Surrogate
+    pairs are not stored directly, but combined into a single
+    code point that is then encoded in UTF-8.)

-- 
marko

Re: UTF16 surrogate pairs in UTF8 encoding

From: Bruce Momjian
Date: 19 February 2011, 20:00:57

Marko Kreen wrote:
> On 9/8/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Marko Kreen <markokr@gmail.com> writes:
> >  > Although it does seem unnecessary.
> >
> >
> > The reason I asked for this to be spelled out is that ordinarily,
> >  a backslash escape \nnn is a very low-level thing that will insert
> >  exactly what you say.  To me it's quite unexpected that the system
> >  would editorialize on that to the extent of replacing two UTF16
> >  surrogate characters by a single code point.  That's necessary for
> >  correctness because our underlying storage is UTF8, but it's not
> >  obvious that it will happen.  (As a counterexample, if our underlying
> >  storage were UTF16, then very different things would need to happen
> >  for the exact same SQL input.)
> >
> >  I think a lot of people will have this same question when reading
> >  this para, which is why I asked for an explanation there.
> 
> Ok, but I still don't like the "when"s.  How about:
> 
> -    6-digit form technically makes this unnecessary.  (When surrogate
> -    pairs are used when the server encoding is <literal>UTF8</>, they
> -    are first combined into a single code point that is then encoded
> -    in UTF-8.)
> +    6-digit form technically makes this unnecessary.  (Surrogate
> +    pairs are not stored directly, but combined into a single
> +    code point that is then encoded in UTF-8.)

Applied, thanks.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
