Thread: Implementing full UTF-8 support (aka supporting 0x00)


Implementing full UTF-8 support (aka supporting 0x00)

From
Álvaro Hernández Tortosa
Date:
    Hi list.
    As has been previously discussed (see 
https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl 
for instance) varlena fields cannot accept the literal 0x00 value. Sure, 
you can use bytea, but this is hardly a good solution. The problem seems 
to be hitting some use cases, like:

- People migrating data from other databases (apart from PostgreSQL, I 
don't know of any other database which suffers the same problem).
- People using drivers which use UTF-8 or equivalent encodings by 
default (Java for example)
    Given that 0x00 is a perfectly legal UTF-8 character, I conclude 
we're strictly non-compliant. And given the general Postgres policy 
regarding standards compliance and the people being hit by this, I think 
it should be addressed. Especially since all the usual fixes are a real 
PITA (re-parsing and re-generating strings, which is very expensive, or 
dropping data).
    What would it take to support it? Isn't the varlena header 
propagated everywhere, which could help infer the real length of the 
string? Any pointers or suggestions would be welcome.
    Thanks,
    Álvaro


-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Tom Lane
Date:
Álvaro Hernández Tortosa <aht@8kdata.com> writes:
>      As has been previously discussed (see 
> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl 
> for instance) varlena fields cannot accept the literal 0x00 value.

Yup.

>      What would it take to support it?

One key reason why that's hard is that datatype input and output
functions use nul-terminated C strings as the representation of the
text form of any datatype.  We can't readily change that API without
breaking huge amounts of code, much of it not under the core project's
control.

There may be other places where nul-terminated strings would be a hazard
(mumble fgets mumble), but offhand that API seems like the major problem
so far as the backend is concerned.

There would be a slew of client-side problems as well.  For example this
would assuredly break psql and pg_dump, along with every other client that
supposes that it can treat PQgetvalue() as returning a nul-terminated
string.  This end of it would possibly be even worse than fixing the
backend, because so little of the affected code is under our control.

In short, the problem is not with having an embedded nul in a stored
text value.  The problem is the reams of code that suppose that the
text representation of any data value is a nul-terminated C string.
        regards, tom lane



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Kevin Grittner
Date:
On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:

>     What would it take to support it?

Would it be of any value to support "Modified UTF-8"?

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Peter Eisentraut
Date:
On 8/3/16 11:47 AM, Kevin Grittner wrote:
> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> 
>>     What would it take to support it?
> 
> Would it be of any value to support "Modified UTF-8"?
> 
> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Will this work with OS libraries?


-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Geoff Winkless
Date:
On 3 August 2016 at 15:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>     Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
> strictly non-compliant.

It's perhaps worth mentioning that 0x00 is valid ASCII too, and
PostgreSQL has never stored that either.

If you want to start quoting standards, the ANSI spec does in fact
specifically mention null terminators when passing strings to host
languages, so if PostgreSQL stored NULs that way we would end up with
parameters that we couldn't pass to UDFs in a standards-compliant way.

Geoff



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Craig Ringer
Date:
On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:

>     Hi list.
>
>     As has been previously discussed (see https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl for instance) varlena fields cannot accept the literal 0x00 value. Sure, you can use bytea, but this is hardly a good solution. The problem seems to be hitting some use cases, like:
>
> - People migrating data from other databases (apart from PostgreSQL, I don't know of any other database which suffers the same problem).
> - People using drivers which use UTF-8 or equivalent encodings by default (Java for example)
>
>     Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're strictly non-compliant. And given the general Postgres policy regarding standards compliance and the people being hit by this, I think it should be addressed. Especially since all the usual fixes are a real PITA (re-parsing, re-generating strings, which is very expensive, or dropping data).
>
>     What would it take to support it? Isn't the varlena header propagated everywhere, which could help infer the real length of the string? Any pointers or suggestions would be welcome.

One of the bigger pain points is that our interaction with C library collation routines for sorting uses NULL-terminated C strings.  strcoll, strxfrm, etc. 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Álvaro Hernández Tortosa
Date:

On 03/08/16 17:23, Tom Lane wrote:
> Álvaro Hernández Tortosa <aht@8kdata.com> writes:
>>       As has been previously discussed (see
>> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
>> for instance) varlena fields cannot accept the literal 0x00 value.
> Yup.
>
>>       What would it take to support it?
> One key reason why that's hard is that datatype input and output
> functions use nul-terminated C strings as the representation of the
> text form of any datatype.  We can't readily change that API without
> breaking huge amounts of code, much of it not under the core project's
> control.
>
> There may be other places where nul-terminated strings would be a hazard
> (mumble fgets mumble), but offhand that API seems like the major problem
> so far as the backend is concerned.
>
> There would be a slew of client-side problems as well.  For example this
> would assuredly break psql and pg_dump, along with every other client that
> supposes that it can treat PQgetvalue() as returning a nul-terminated
> string.  This end of it would possibly be even worse than fixing the
> backend, because so little of the affected code is under our control.
>
> In short, the problem is not with having an embedded nul in a stored
> text value.  The problem is the reams of code that suppose that the
> text representation of any data value is a nul-terminated C string.
>
>             regards, tom lane
    Wow. That seems like a daunting task.
    I guess, then, that even implementing a new datatype based on bytea 
but using the text I/O functions to show up as text (not send/recv) 
wouldn't work either, right?
    Thanks for the input,
    Álvaro


-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Álvaro Hernández Tortosa
Date:

On 03/08/16 17:47, Kevin Grittner wrote:
> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>
>>      What would it take to support it?
> Would it be of any value to support "Modified UTF-8"?
>
> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
>
    That's nice, but I don't think so.
    The problem is that you cannot predict how people will send you 
data, for example when importing from other databases. I guess it might 
work if Postgres implemented such a UTF-8 variant and the drivers did 
too, but that would still require an encoding conversion (i.e., parsing 
every string) to change the 0x00, which seems like a serious performance 
hit.
    It could be better than nothing, though!
    Thanks,
    Álvaro

-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Álvaro Hernández Tortosa
Date:

On 03/08/16 18:35, Geoff Winkless wrote:
> On 3 August 2016 at 15:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>>      Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
>> strictly non-compliant.
> It's perhaps worth mentioning that 0x00 is valid ASCII too, and
> PostgreSQL has never stored that either.
    Then yes, it could also be a problem. But as of today I believe 
solving the problem for UTF-8 would solve the great majority of these 
embedded NUL problems.
    Álvaro


-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Álvaro Hernández Tortosa
Date:

On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:
>
>
> On 03/08/16 17:47, Kevin Grittner wrote:
>> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa 
>> <aht@8kdata.com> wrote:
>>
>>>      What would it take to support it?
>> Would it be of any value to support "Modified UTF-8"?
>>
>> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
>>
>
>     That's nice, but I don't think so.
>
>     The problem is that you cannot predict how people would send you 
> data, like when importing from other databases. I guess it may work if 
> Postgres would implement such UTF-8 variant and also the drivers, but 
> that would still require an encoding conversion (i.e., parsing every 
> string) to change the 0x00, which seems like a serious performance hit.
>
>     It could be better than nothing, though!
>
>     Thanks,
>
>     Álvaro
>
    It may indeed work.
    According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout 
the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 
sequence. In short, the U+0000 NUL is represented (overlong encoding) by 
the two-byte, one-character sequence \uc080. These two bytes are invalid 
UTF-8 and so should not appear in an otherwise valid UTF-8 string. Yet 
they are accepted by Postgres (as if Postgres supported Modified UTF-8 
intentionally). The character in psql does not render as a NUL but as 
this symbol: "삀".
    Given that this works, the process would look like this:

- Parse all input data looking for bytes with hex value 0x00. If they 
appear in the string, they are the NUL byte.
- Replace that byte with the two bytes 0xC0 0x80.
- Reverse the operation when reading.
    This is OK but of course a performance hit (searching for 0x00 and 
then growing the byte[] or whatever data structure to account for the 
extra byte). A little bit of a PITA, but I guess better than fixing it 
all :)

    Álvaro


-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Geoff Winkless
Date:
On 3 August 2016 at 20:13, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> Yet they are accepted by Postgres
> (like if Postgres would support Modified UTF-8 intentionally). The character
> in psql does not render as a nul but as this symbol: "삀".

Not accepted as valid utf8:

# select E'\xc0\x80';
ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x80

You would need a "modified utf8" encoding, I think.

Geoff



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Álvaro Hernández Tortosa
Date:

On 03/08/16 21:31, Geoff Winkless wrote:
> On 3 August 2016 at 20:13, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>> Yet they are accepted by Postgres
>> (like if Postgres would support Modified UTF-8 intentionally). The character
>> in psql does not render as a nul but as this symbol: "삀".
> Not accepted as valid utf8:
>
> # select E'\xc0\x80';
> ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x80
>
> You would need a "modified utf8" encoding, I think.
>
> Geoff
    Isn't the correct syntax something like:

select E'\uc080', U&'\c080';

?
    It is a single-character, 16-bit Unicode escape sequence (see 
https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

    Álvaro

-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Tom Lane
Date:
Álvaro Hernández Tortosa <aht@8kdata.com> writes:
>      According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout 
> the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code 
> point. In short, the \u00 nul is represented (overlong encoding) by the 
> two-byte, 1 character sequence \uc080. These two bytes are invalid UTF-8 
> so should not appear in an otherwise valid UTF-8 string. Yet they are 
> accepted by Postgres (like if Postgres would support Modified UTF-8 
> intentionally).

Really?  It sure looks to me like pg_utf8_islegal() would reject this.

We could hack it to allow the case, no doubt, but I concur with Peter's
concern that we'd have trouble with OS-level code that is strict about
what UTF8 allows.  glibc, for example, is known to do very strange things
with strings that it thinks are invalid in the active encoding.
        regards, tom lane



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Geoff Winkless
Date:
On 3 August 2016 at 20:36, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>     Isn't the correct syntax something like:
>
> select E'\uc080', U&'\c080';
>
> ?
>
>     It is a single character, 16 bit unicode sequence (see
> https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

No, what you've done there is created the three-byte utf8 sequence \xec8280:

# select U&'\c080'::bytea;
  bytea
----------
 \xec8280

It's not a UCS2 c080, it's utf8 c080.

Geoff



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Thomas Munro
Date:
On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>>     What would it take to support it? Isn't the varlena header propagated
>> everywhere, which could help infer the real length of the string? Any
>> pointers or suggestions would be welcome.
>
>
> One of the bigger pain points is that our interaction with C library
> collation routines for sorting uses NULL-terminated C strings.  strcoll,
> strxfrm, etc.

That particular bit of the problem would go away if this ever happened:

https://wiki.postgresql.org/wiki/Todo:ICU

ucol_strcoll takes explicit lengths (though it optionally accepts -1 for
NUL-terminated mode).

http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters

--
Thomas Munro
http://www.enterprisedb.com



Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Álvaro Hernández Tortosa
Date:

On 03/08/16 21:42, Geoff Winkless wrote:
> On 3 August 2016 at 20:36, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>>      Isn't the correct syntax something like:
>>
>> select E'\uc080', U&'\c080';
>>
>> ?
>>
>>      It is a single character, 16 bit unicode sequence (see
>> https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).
> No, what you've done there is created the three-byte utf8 sequence \xec8280
>
> # select U&'\c080'::bytea;
>    bytea
> ----------
>   \xec8280
>
> It's not a UCS2 c080, it's utf8 c080.
>
> Geoff
    Yes, you're absolutely right ^_^
    Álvaro


-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Craig Ringer
Date:


On 4 August 2016 at 05:00, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>>>     What would it take to support it? Isn't the varlena header propagated
>>> everywhere, which could help infer the real length of the string? Any
>>> pointers or suggestions would be welcome.
>>
>> One of the bigger pain points is that our interaction with C library
>> collation routines for sorting uses NUL-terminated C strings.  strcoll,
>> strxfrm, etc.
>
> That particular bit of the problem would go away if this ever happened:
>
> https://wiki.postgresql.org/wiki/Todo:ICU
>
> ucol_strcoll takes explicit lengths (though optionally accepts -1 for
> null terminated mode).
>
> http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters

Yep, it does. But we've made little to no progress on integrating ICU
support, and AFAIK nobody's working on it right now.

I wonder how MySQL implements their collation and encoding support?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Implementing full UTF-8 support (aka supporting 0x00)

From
Bruce Momjian
Date:
On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
> Yep, it does. But we've made little to no progress on integration of ICU
> support and AFAIK nobody's working on it right now. 

Uh, this email from July says Peter Eisentraut will submit it in
September  :-)
https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From
Palle Girgensohn
Date:
> On 4 Aug 2016, at 02:40, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
>
>     https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.


https://github.com/girgen/postgres/


in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.
In this latest version, I've removed support for encodings other than UTF-8, mostly since I don't know how to test
them, but also because I see little point in supporting anything else using ICU.



I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from the regression tests
when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out    2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out    2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  f
  (1 row)
  
  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  t
  (1 row)
  
  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  t
  (1 row)
  
  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  f
  (1 row)
  
  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas whether one is more correct
than the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using
lower(both) or upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?


As Thomas points out, using ucol_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle



[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html


Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From
Peter Geoghegan
Date:
On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen@pingpong.net> wrote:
> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.
> In this latest version, I've removed support for encodings other than UTF-8, mostly since I don't know how to test
> them, but also because I see little point in supporting anything else using ICU.

Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
is not the release that introduced its use, it did expand it
significantly. I think you need to fix this, even though it isn't
actually used to sort text at present, since presumably FreeBSD builds
of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
reasonably trust the ICU equivalent of strxfrm(), so that's a missed
opportunity. (See the wiki page on the abbreviated keys issue [1] if
you don't know what I'm talking about.)

Shouldn't you really have a strxfrm() wrapper, used across the board,
including for callers outside of varlena.c? convert_string_datum() has
been calling strxfrm() for many releases now. Those calls would still be
used in FreeBSD builds, I would think, which seems like a bug that is
not dodged by simply not defining TRUST_STRXFRM. Doesn't its assumption
that strxfrm() matches the ordering used elsewhere fail to hold on
FreeBSD builds?

[1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
--
Peter Geoghegan



Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From
Palle Girgensohn
Date:
> On 11 Aug 2016, at 03:05, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen@pingpong.net> wrote:
>> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them
>> regularly. In this latest version, I've removed support for encodings other than UTF-8, mostly since I don't know
>> how to test them, but also because I see little point in supporting anything else using ICU.
>
> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
> is not the release that introduced its use, it did expand it
> significantly. I think you need to fix this, even though it isn't
> actually used to sort text at present, since presumably FreeBSD builds
> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
> opportunity. (See the wiki page on the abbreviated keys issue [1] if
> you don't know what I'm talking about.)

My plan was to get it working without TRUST_STRXFRM first, and then add that functionality. I've made some preliminary
tests using ICU's ucol_getSortKey, but I will have to test it a bit more. For now, I just expect not to trust strxfrm.
This is the first iteration wrt strxfrm; the plan is to use that code base.

>
> Shouldn't you really have a strxfrm() wrapper, used across the board,
> including for callers outside of varlena.c? convert_string_datum() has
> been calling strxfrm() for many releases now. These calls are still
> used in FreeBSD builds, I would think, which seems like a bug that is
> not dodged by simply not defining TRUST_STRXFRM. Isn't its assumption
> that that matching the ordering used elsewhere not really hold on
> FreeBSD builds?

I was not aware of convert_string_datum, I will check that, thanks! Using a wrapper across the board seems like a good
idea for refactoring.

>
> [1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
> --
> Peter Geoghegan


Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From
Palle Girgensohn
Date:
> On 11 Aug 2016, at 11:15, Palle Girgensohn <girgen@pingpong.net> wrote:
>
>> On 11 Aug 2016, at 03:05, Peter Geoghegan <pg@heroku.com> wrote:
>>
>> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen@pingpong.net> wrote:
>>> They've been used for the FreeBSD ports since 2005, and have served us
>>> well. I have of course updated them regularly. In this latest version,
>>> I've removed support for encodings other than UTF-8, mostly since I
>>> don't know how to test them, but also because I see little point in
>>> supporting anything else using ICU.
>>
>> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
>> is not the release that introduced its use, it did expand it
>> significantly. I think you need to fix this, even though it isn't
>> actually used to sort text at present, since presumably FreeBSD builds
>> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
>> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
>> opportunity. (See the wiki page on the abbreviated keys issue [1] if
>> you don't know what I'm talking about.)
>
> My plan was to get it working without TRUST_STRXFRM first, and then add
> that functionality. I've made some preliminary tests using ICU's
> ucol_getSortKey, but I will have to test it a bit more. For now, I just
> expect not to trust strxfrm. This is the first iteration wrt strxfrm;
> the plan is to use that code base.

Here are some preliminary results running 10000 times comparing the same
two strings in a tight loop.

              ucol_strcollUTF8: -1    0.002448
                       strcoll:  1    0.060711
              ucol_strcollIter: -1    0.009221
                 direct memcmp:  1    0.000457
                 memcpy memcmp:  1    0.001706
                memcpy strcoll:  1    0.068425
               nextSortKeyPart: -1    0.041011
    ucnv_toUChars + getSortKey: -1    0.050379

The correct answer is -1, but since we compare åasdf and äasdf with a
Swedish locale, memcmp and strcoll fail of course, as expected. Direct
memcmp is 5 times faster than ucol_strcollUTF8 (used in my patch), but
sadly the best implementation using sort keys with ICU, nextSortKeyPart,
is way slower.

    startTime = getRealTime();
    for ( int i = 0; i < loop; i++) {
        result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, &status);
    }
    endTime = getRealTime();
    printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);

vs

    int sortkeysize = 8;

    startTime = getRealTime();
    uint8_t key1[sortkeysize], key2[sortkeysize];
    uint32_t sState[2], tState[2];
    UCharIterator sIter, tIter;

    for ( int i = 0; i < loop; i++) {
        uiter_setUTF8(&sIter, arg1, len1);
        uiter_setUTF8(&tIter, arg2, len2);
        sState[0] = 0; sState[1] = 0;
        tState[0] = 0; tState[1] = 0;
        ucol_nextSortKeyPart(collator, &sIter, sState, key1, sortkeysize, &status);
        ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize, &status);
        result = memcmp(key1, key2, sortkeysize);
    }
    endTime = getRealTime();
    printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);

But in your strxfrm code in PostgreSQL, the keys are cached, and
represented as int64:s if I remember correctly, so perhaps there is
still a benefit to using the abbreviated keys? More testing is required,
I guess...

Palle

Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From
Peter Geoghegan
Date:
On Thu, Aug 11, 2016 at 4:22 AM, Palle Girgensohn <girgen@pingpong.net> wrote:
> But in your strxfrm code in PostgreSQL, the keys are cached, and represented
> as int64:s if I remember correctly, so perhaps there is still a benefit
> using the abbreviated keys? More testing is required, I guess...

ICU's ucol_nextSortKeyPart() interface is faster for this, I believe,
and works because you only need the first sizeof(Datum) bytes (4 or 8
bytes). So, you have the right idea about using it (at least for the
abbreviated key stuff), but the second-to-last argument needs to be
sizeof(Datum).

The whole point of the abbreviated key optimization is that you can
avoid pointer chasing during each and every comparison. Your test here
is invalid because it doesn't involve reuse of the keys. Often, during
a sort, each item has to be compared about 20 times.

I've experimented with ICU, and know it works well with this. You
really need to create a scenario with a real sort, and all the
conditions I describe.

-- 
Peter Geoghegan