Discussion: Unicode normalization SQL functions


Unicode normalization SQL functions

From
Peter Eisentraut
Date:
Here are patches to add SQL support for Unicode normalization, per the 
SQL standard:

     normalize($string [, form])
     $string is [form] normalized

(The predicate syntax is a bit silly, but that is what the SQL standard 
specifies.)

We already have all the infrastructure for Unicode normalization for the 
SASLprep functionality.  The first patch extends the internal APIs to 
support all four normal forms instead of only NFKC used by SASLprep. 
The second patch adds the SQL layer on top of it.

This could be used to preprocess or check strings before using them with 
deterministic collations or locale implementations that don't deal with 
non-NFC data correctly, perhaps using triggers, generated columns, or 
domains.  The NFKC and NFKD normalizations could also be used for 
general data cleaning, similar to what SASLprep does.
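
For illustration, usage could look something like this once the patches 
are in (the domain, table, and column names below are made up for the 
example, and a UTF-8 database is assumed):

SELECT normalize(U&'\00E9', NFD) = U&'\0065\0301';   -- true: é decomposes to e + U+0301
SELECT U&'\0065\0301' IS NFC NORMALIZED;             -- false: NFC uses the precomposed é

-- enforce NFC on input via a domain ...
CREATE DOMAIN text_nfc AS text CHECK (VALUE IS NFC NORMALIZED);

-- ... or keep a normalized copy in a generated column
CREATE TABLE docs (
    body     text,
    body_nfc text GENERATED ALWAYS AS (normalize(body, NFC)) STORED
);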

As a future idea, I think we could also hook Unicode normalization into 
the protocol-level encoding conversion.

Also, there is a way to optimize the "is normalized" test for common 
cases, described in UTR #15.  For that we'll need an additional data 
file from Unicode.  In order to simplify that, I would like my patch 
"Add support for automatically updating Unicode derived files" 
integrated first.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: Unicode normalization SQL functions

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> Also, there is a way to optimize the "is normalized" test for common
> cases, described in UTR #15.  For that we'll need an additional data
> file from Unicode.  In order to simplify that, I would like my patch
> "Add support for automatically updating Unicode derived files"
> integrated first.

Would that explain that the NFC/NFKC normalization and "is normalized"
check seem abnormally slow with the current patch, or should
it be regarded independently of the other patch?

For instance, testing 10000 short ASCII strings:

postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfc normalized ;
 count
-------
 10000
(1 row)

Time: 2573,859 ms (00:02,574)

By comparison, the NFD/NFKD case is faster by two orders of magnitude:

postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfd normalized ;
 count
-------
 10000
(1 row)

Time: 29,962 ms

Although NFC/NFKC has a recomposition step that NFD/NFKD
doesn't have, such a difference is surprising.

I've tried an alternative implementation based on ICU's
unorm2_isNormalized() /unorm2_normalize() functions (which I'm
currently adding to the icu_ext extension to be exposed in SQL).
With these, the 4 normal forms are in the 20ms ballpark with the above
test case, without a clear difference between composed and decomposed
forms.

Independently of the performance, I've compared the results
of the ICU implementation vs this patch on large series of strings
with all normal forms and could not find any difference.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-01-06 17:00, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>> Also, there is a way to optimize the "is normalized" test for common
>> cases, described in UTR #15.  For that we'll need an additional data
>> file from Unicode.  In order to simplify that, I would like my patch
>> "Add support for automatically updating Unicode derived files"
>> integrated first.
> 
> Would that explain that the NFC/NFKC normalization and "is normalized"
> check seem abnormally slow with the current patch, or should
> it be regarded independently of the other patch?

That's unrelated.

> For instance, testing 10000 short ASCII strings:
> 
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfc normalized ;
>   count
> -------
>   10000
> (1 row)
> 
> Time: 2573,859 ms (00:02,574)
> 
> By comparison, the NFD/NFKD case is faster by two orders of magnitude:
> 
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfd normalized ;
>   count
> -------
>   10000
> (1 row)
> 
> Time: 29,962 ms
> 
> Although NFC/NFKC has a recomposition step that NFD/NFKD
> doesn't have, such a difference is surprising.

It's very likely that this is because the recomposition calls 
recompose_code() which does a sequential scan of UnicodeDecompMain for 
each character.  To optimize that, we should probably build a bespoke 
reverse mapping table that can be accessed more efficiently.
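
To sketch the idea (not actual code; the UnicodeRecompMain table name 
and entry layout here are invented), the reverse mapping could be a 
sorted array searched with bsearch():

/*
 * Hypothetical reverse table, sorted on (first, second), so that
 * recomposition can use bsearch() instead of scanning all of
 * UnicodeDecompMain for every character.
 */
typedef struct
{
    pg_wchar    first;      /* starter code point */
    pg_wchar    second;     /* combining code point */
    pg_wchar    composed;   /* canonical composition of the pair */
} pg_unicode_recomp;

static int
recomp_compare(const void *p1, const void *p2)
{
    const pg_unicode_recomp *a = (const pg_unicode_recomp *) p1;
    const pg_unicode_recomp *b = (const pg_unicode_recomp *) p2;

    if (a->first != b->first)
        return (a->first < b->first) ? -1 : 1;
    if (a->second != b->second)
        return (a->second < b->second) ? -1 : 1;
    return 0;
}

/* returns true and sets *result if the pair recomposes */
static bool
recompose_code_fast(pg_wchar start, pg_wchar code, pg_wchar *result)
{
    pg_unicode_recomp key = {start, code, 0};
    const pg_unicode_recomp *entry;

    entry = bsearch(&key, UnicodeRecompMain,        /* hypothetical table */
                    lengthof(UnicodeRecompMain),
                    sizeof(pg_unicode_recomp), recomp_compare);
    if (entry == NULL)
        return false;
    *result = entry->composed;
    return true;
}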

> I've tried an alternative implementation based on ICU's
> unorm2_isNormalized() /unorm2_normalize() functions (which I'm
> currently adding to the icu_ext extension to be exposed in SQL).
> With these, the 4 normal forms are in the 20ms ballpark with the above
> test case, without a clear difference between composed and decomposed
> forms.

That's good feedback.

> Independently of the performance, I've compared the results
> of the ICU implementation vs this patch on large series of strings
> with all normal forms and could not find any difference.

And that too.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
Here is an updated patch set that now also implements the "quick check" 
algorithm from UTR #15 for making IS NORMALIZED very fast in many cases, 
which I had mentioned earlier in the thread.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: Unicode normalization SQL functions

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> Here is an updated patch set that now also implements the "quick check"
> algorithm from UTR #15 for making IS NORMALIZED very fast in many cases,
> which I had mentioned earlier in the thread.

I found a bug in unicode_is_normalized_quickcheck() which is
triggered when the last codepoint of the string is beyond
U+10000. On encountering it, it does:
+        if (is_supplementary_codepoint(ch))
+            p++;
When ch is the last codepoint, it makes p point to
the ending zero, but the subsequent p++ done by
the for loop makes it miss the exit and go into over-reading.

But anyway, what's the reason for skipping the codepoint
following a codepoint outside of the BMP?
I figured out that it comes from porting the Java code in UAX #15:

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codePointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

source.length() is the length in UTF-16 code units, in which a surrogate
pair counts as 2.  That would be why it does
   if (Character.isSupplementaryCodePoint(ch)) ++i;
it's meant to skip the second UTF-16 code unit of the pair.
Since this does not apply to the 32-bit pg_wchar, I think the two lines
above in the C implementation should simply be removed.
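
For illustration, the loop would then look roughly like this (assuming
PostgreSQL's pg_wchar and the UNICODE_NORM_QC_* values from the patch;
get_canonical_class() and qc_is_allowed() are placeholders for whatever
the patch actually uses):

/*
 * Rough shape of the corrected function.  The input is a 0-terminated
 * array of 32-bit code points, so there are no surrogate pairs to skip
 * and the two quoted lines can simply go away.
 */
UnicodeNormalizationQC
unicode_is_normalized_quickcheck(UnicodeNormalizationForm form,
                                 const pg_wchar *input)
{
    uint8       lastCanonicalClass = 0;
    UnicodeNormalizationQC result = UNICODE_NORM_QC_YES;
    const pg_wchar *p;

    for (p = input; *p; p++)
    {
        pg_wchar    ch = *p;
        uint8       canonicalClass = get_canonical_class(ch);
        UnicodeNormalizationQC check = qc_is_allowed(form, ch);

        /* canonical ordering violation means the string is not normalized */
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0)
            return UNICODE_NORM_QC_NO;

        if (check == UNICODE_NORM_QC_NO)
            return UNICODE_NORM_QC_NO;
        if (check == UNICODE_NORM_QC_MAYBE)
            result = UNICODE_NORM_QC_MAYBE;

        lastCanonicalClass = canonicalClass;
    }
    return result;
}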


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-01-28 10:48, Daniel Verite wrote:
> I found a bug in unicode_is_normalized_quickcheck() which is
> triggered when the last codepoint of the string is beyond
> U+10000. On encountering it, it does:
> +        if (is_supplementary_codepoint(ch))
> +            p++;
> When ch is the last codepoint, it makes p point to
> the ending zero, but the subsequent p++ done by
> the for loop makes it miss the exit and go into over-reading.
> 
> But anyway, what's the reason for skipping the codepoint
> following a codepoint outside of the BMP?

You're right, this didn't make any sense.  Here is a new patch set with 
that fixed.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: Unicode normalization SQL functions

From
Andreas Karlsson
Date:
On 1/28/20 9:21 PM, Peter Eisentraut wrote:
> You're right, this didn't make any sense.  Here is a new patch set with 
> that fixed.

Thanks for this patch. This is a feature which has been on my personal 
todo list for a while and something which I have wished to have a couple 
of times.

I took a quick look at the patch and here is some feedback:

A possible concern is the increased binary size from the new tables for 
the quick check, but personally I think they are worth it.

A potential optimization would be to merge utf8_to_unicode() and 
pg_utf_mblen() into one function in unicode_normalize_func(), since 
utf8_to_unicode() already knows the length of the character. Probably 
not worth it though.

It feels a bit wasteful to measure output_size in 
unicode_is_normalized(), since unicode_normalize() actually already 
knows the length of the buffer; it just does not return it.

A potential optimization for the normalized case would be to abort the 
quick check on the first MAYBE and only normalize from that point on. 
If I can find the time I might try this out and benchmark it.

Nitpick: "split/\s*;\s*/, $line" in generate-unicode_normprops_table.pl 
should be "split /\s*;\s*/, $line".

What about using else if in the code below for clarity?

+        if (check == UNICODE_NORM_QC_NO)
+            return UNICODE_NORM_QC_NO;
+        if (check == UNICODE_NORM_QC_MAYBE)
+            result = UNICODE_NORM_QC_MAYBE;

Remove extra space in the line below.

+    else if (quickcheck == UNICODE_NORM_QC_NO )

Andreas



Re: Unicode normalization SQL functions

From
Michael Paquier
Date:
On Thu, Feb 13, 2020 at 01:23:41AM +0100, Andreas Karlsson wrote:
> On 1/28/20 9:21 PM, Peter Eisentraut wrote:
>> You're right, this didn't make any sense.  Here is a new patch set with
>> that fixed.
>
> Thanks for this patch. This is a feature which has been on my personal todo
> list for a while and something which I have wished to have a couple of
> times.

(The size of the patch set may justify compressing it)
--
Michael

Attachments

Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-02-13 01:23, Andreas Karlsson wrote:
> A potential optimization would be to merge utf8_to_unicode() and
> pg_utf_mblen() into one function in unicode_normalize_func(), since
> utf8_to_unicode() already knows the length of the character. Probably
> not worth it though.

This would also require untangling the entire encoding API.

> It feels a bit wasteful to measure output_size in
> unicode_is_normalized(), since unicode_normalize() actually already
> knows the length of the buffer; it just does not return it.

Sure, but really most string APIs work like that.  They surely know the 
string length internally, but afterwards you often have to call strlen() 
again.

> A potential optimization for the normalized case would be to abort the
> quick check on the first MAYBE and only normalize from that point on.
> If I can find the time I might try this out and benchmark it.

Are you sure this would always be valid?  The fact that this wasn't 
mentioned in UTR #15 makes me suspicious.

> Nitpick: "split/\s*;\s*/, $line" in generate-unicode_normprops_table.pl
> should be "split /\s*;\s*/, $line".

done

> What about using else if in the code below for clarity?
> 
> +        if (check == UNICODE_NORM_QC_NO)
> +            return UNICODE_NORM_QC_NO;
> +        if (check == UNICODE_NORM_QC_MAYBE)
> +            result = UNICODE_NORM_QC_MAYBE;

done

> Remove extra space in the line below.
> 
> +    else if (quickcheck == UNICODE_NORM_QC_NO )

I didn't find this in my local copy.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
"Daniel Verite"
Date:
  Hi,

I've checked the v3 patch against the results of the normalization
done by ICU [1] on my test data again, and they're identical
(as they were with v1; v2 had the bug discussed upthread, now fixed).

Concerning execution speed, there's excessive CPU usage when
normalizing to NFC or NFKC. Looking at the pre-existing code, it looks
like recompose_code() in unicode_norm.c looping over the
UnicodeDecompMain array might be very costly.

Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.
To illustrate this, here are the execution times reported by psql for
the queries below exercising the normalization code, both with the
functions provided by the patch and with the equivalent functions
implemented with ICU.
The dataset is ~10 million unique short strings extracted from real
data; each number is the median execution time in milliseconds over
10 successive runs with query parallelism off (standard deviation in
parentheses).

 operation  |  core (ms)   | icu (ms)
------------+--------------+-----------
 nfc check  | 4398 (20)    | 3088 (27)
 nfc conv   | 771502 (414) | 5503 (19)
 nfd check  | 4510 (10)    | 2898 (8)
 nfd conv   | 9102 (1)     | 5569 (6)
 nfkc check | 4825 (51)    | 3273 (4)
 nfkc conv  | 772240 (340) | 5763 (8)
 nfkd check | 4794 (4)     | 3170 (39)
 nfkd conv  | 9229 (4)     | 5824 (9)

The queries:

check w/core:
  select count(*) from words where w is $NORM normalized;

conversion w/core:
  select sum(length(normalize(w, $NORM))) from words;

check w/icu:
  select count(*) from words where icu_is_normalized(w, '$NORM');

conversion w/icu:
  select sum(length(icu_normalize(w, '$NORM'))) from words;


[1] https://github.com/dverite/icu_ext/blob/master/icu_normalize.c

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Unicode normalization SQL functions

From
"Daniel Verite"
Date:
 One nitpick:

Around this hunk:

- * unicode_normalize_kc - Normalize a Unicode string to NFKC form.
+ * unicode_normalize - Normalize a Unicode string to the specified form.
  *
  * The input is a 0-terminated array of codepoints.
  *
@@ -304,8 +306,10 @@ decompose_code(pg_wchar code, pg_wchar **result, int *current)
  * string is palloc'd instead, and OOM is reported with ereport().
  */



The comment in full says:

/*
 * unicode_normalize - Normalize a Unicode string to the specified form.
 *
 * The input is a 0-terminated array of codepoints.
 *
 * In frontend, returns a 0-terminated array of codepoints, allocated with
 * malloc. Or NULL if we run out of memory. In frontend, the returned
 * string is palloc'd instead, and OOM is reported with ereport().
 */

It looks like the 2nd occurrence of "frontend" was meant to be "backend".


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-02-17 20:14, Daniel Verite wrote:
> The comment in full says:
> 
> /*
>  * unicode_normalize - Normalize a Unicode string to the specified form.
>  *
>  * The input is a 0-terminated array of codepoints.
>  *
>  * In frontend, returns a 0-terminated array of codepoints, allocated with
>  * malloc. Or NULL if we run out of memory. In frontend, the returned
>  * string is palloc'd instead, and OOM is reported with ereport().
>  */
> 
> It looks like the 2nd occurrence of "frontend" was meant to be "backend".

This was a pre-existing problem, so I have fixed that separately.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-02-17 20:08, Daniel Verite wrote:
> Concerning execution speed, there's excessive CPU usage when
> normalizing to NFC or NFKC. Looking at the pre-existing code, it looks
> like recompose_code() in unicode_norm.c looping over the
> UnicodeDecompMain array might be very costly.

Yes, this is a known issue, and I think there is room for future optimization work.

> Another point is that the ICU-based implementation appears
> to be significantly faster in all cases, which makes me wonder
> why ICU builds should not just use ICU instead of the PG-core
> implementation.

That would require linking libpq to ICU (for SASLprep), and in general 
would either make ICU required or require maintaining multiple 
implementations.  I don't think we're there yet.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
What is the status of this patch set?  I think we have nailed down the 
behavior, but there were some concerns about certain performance 
characteristics.  Do people feel that those are required to be addressed 
in this cycle?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
Andreas Karlsson
Date:
On 3/19/20 3:41 PM, Peter Eisentraut wrote:
> What is the status of this patch set?  I think we have nailed down the 
> behavior, but there were some concerns about certain performance 
> characteristics.  Do people feel that those are required to be addressed 
> in this cycle?

Personally I would rather see it merged if the code is correct (which it 
seems to be, from what I can tell), as the performance seems to be good 
enough for it to be useful.

Unicode normalization is a feature I have wished for, and at least for 
my use cases the current implementation is more than fast enough.

Andreas



Re: Unicode normalization SQL functions

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> What is the status of this patch set?  I think we have nailed down the
> behavior, but there were some concerns about certain performance
> characteristics.  Do people feel that those are required to be addressed
> in this cycle?

Not finding any other issue with v3 or objections in the thread,
I've set the status to Ready For Committer in the CF.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-03-23 17:26, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>> What is the status of this patch set?  I think we have nailed down the
>> behavior, but there were some concerns about certain performance
>> characteristics.  Do people feel that those are required to be addressed
>> in this cycle?
> 
> Not finding any other issue with v3 or objections in the thread,
> I've set the status to Ready For Committer in the CF.

I have committed the 0001 patch.

Now I have some concerns about the size of the new table in 
unicode_normprops_table.h, and the resulting binary size.  At the very 
least, we should probably make that #ifndef FRONTEND or something like 
that so libpq isn't bloated by it unnecessarily.  Perhaps there is a 
better format for that table?  Any ideas?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-03-24 10:20, Peter Eisentraut wrote:
> Now I have some concerns about the size of the new table in
> unicode_normprops_table.h, and the resulting binary size.  At the very
> least, we should probably make that #ifndef FRONTEND or something like
> that so libpq isn't bloated by it unnecessarily.  Perhaps there is a
> better format for that table?  Any ideas?

I have figured this out.  New patch is attached.

First, I have added #ifndef FRONTEND, as mentioned above, so libpq isn't 
bloated.  Second, I have changed the lookup structure to a bitfield, so 
each entry is only 32 bits instead of 64.  Third, I have dropped the 
quickcheck tables for the NFD and NFKD forms.  Those are by far the 
biggest tables, and you still get okay performance if you do the 
normalization check the long way, since we don't need the recomposition 
step on those cases, which is by far the slowest part.  The main use 
case of all of this, I expect, is to check for NFC normalization, so 
it's okay if the other variants are not optimized to the same extent.
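
For the second point, the packed entry looks roughly like this 
(illustrative; the exact field names in the patch may differ):

/*
 * A Unicode code point fits in 21 bits, so the code point and the
 * quick-check property can be packed into a single 32-bit entry
 * instead of two fields.  The array stays sorted by code point so
 * lookups can use binary search.
 */
typedef struct
{
    unsigned int codepoint:21;   /* U+0000 .. U+10FFFF */
    signed int   quickcheck:4;   /* a UNICODE_NORM_QC_* value */
} pg_unicode_normprops;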

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: Unicode normalization SQL functions

From
John Naylor
Date:
On Thu, Mar 26, 2020 at 3:26 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I have figured this out.  New patch is attached.
>
> First, I have added #ifndef FRONTEND, as mentioned above, so libpq isn't
> bloated.  Second, I have changed the lookup structure to a bitfield, so
> each entry is only 32 bits instead of 64.  Third, I have dropped the
> quickcheck tables for the NFD and NFKD forms.  Those are by far the
> biggest tables, and you still get okay performance if you do the
> normalization check the long way, since we don't need the recomposition
> step on those cases, which is by far the slowest part.  The main use
> case of all of this, I expect, is to check for NFC normalization, so
> it's okay if the other variants are not optimized to the same extent.

Reading the link cited in the patch

http://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms

"The data for the implementation of the isAllowed() call can be
accessed in memory with a hash table or a trie (see Section 14,
Implementation Notes); the latter will be the fastest."

We don't have a trie implementation in Postgres, but we do have a
perfect hash implementation. Doing that would bring the tables back to
64 bits per entry, but would likely be noticeably faster than binary
search. Since v4 has left out the biggest tables entirely, I think
this might be worth a look for the smaller tables remaining.

In the attached v5, when building the hash tables, we sort the code
points by NO/MAYBE, and store the index of the beginning of the NO
block:

MMMNNNNNNNNN
~~~^

That way we can tell a NO from a MAYBE by testing the result of the hash lookup.
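
In rough C terms (the names here are made up for illustration, not the
actual v5 code), the lookup then becomes:

/*
 * codepoints[] holds the MAYBE block first, then the NO block starting
 * at no_start, and hash() is the generated perfect hash function over
 * those keys.
 */
typedef struct
{
    int         (*hash) (const void *key);
    const uint32 *codepoints;       /* MAYBEs first, then NOs */
    int          num_codepoints;
    int          no_start;          /* index of first NO entry */
} qc_hash_table;

static UnicodeNormalizationQC
qc_hash_lookup(uint32 ch, const qc_hash_table *tab)
{
    int         h = tab->hash(&ch);

    /* miss: the code point is in neither block, so it's allowed */
    if (h < 0 || h >= tab->num_codepoints || tab->codepoints[h] != ch)
        return UNICODE_NORM_QC_YES;

    return h >= tab->no_start ? UNICODE_NORM_QC_NO : UNICODE_NORM_QC_MAYBE;
}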

Regression tests pass, but I haven't measured performance yet. I had
to fiddle with the hash seeds a bit to get the larger table to build.

Also, if we go with v4, I noticed the following test is present twice:

+SELECT "normalize"('abc', 'def');  -- run-time error


--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: Unicode normalization SQL functions

From
John Naylor
Date:
I wrote:
>
> Regression tests pass, but I haven't measured performance yet.

Using a test similar to one upthread:

select count(*) from (select md5(i::text) as t from
generate_series(1,100000) as i) s where t is nfc normalized ;

I get (median of three)
v4  419ms
v5  310ms

with binary size
v4  HEAD + 33kB
v5  HEAD + 57kB

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-03-26 08:25, Peter Eisentraut wrote:
> On 2020-03-24 10:20, Peter Eisentraut wrote:
>> Now I have some concerns about the size of the new table in
>> unicode_normprops_table.h, and the resulting binary size.  At the very
>> least, we should probably make that #ifndef FRONTEND or something like
>> that so libpq isn't bloated by it unnecessarily.  Perhaps there is a
>> better format for that table?  Any ideas?
> 
> I have figured this out.  New patch is attached.
> 
> First, I have added #ifndef FRONTEND, as mentioned above, so libpq isn't
> bloated.  Second, I have changed the lookup structure to a bitfield, so
> each entry is only 32 bits instead of 64.  Third, I have dropped the
> quickcheck tables for the NFD and NFKD forms.  Those are by far the
> biggest tables, and you still get okay performance if you do the
> normalization check the long way, since we don't need the recomposition
> step on those cases, which is by far the slowest part.  The main use
> case of all of this, I expect, is to check for NFC normalization, so
> it's okay if the other variants are not optimized to the same extent.

I have committed this version.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
Peter Eisentraut
Date:
On 2020-03-26 18:41, John Naylor wrote:
> We don't have a trie implementation in Postgres, but we do have a
> perfect hash implementation. Doing that would bring the tables back to
> 64 bits per entry, but would likely be noticeably faster than binary
> search. Since v4 has left out the biggest tables entirely, I think
> this might be worth a look for the smaller tables remaining.
> 
> In the attached v5, when building the hash tables, we sort the code
> points by NO/MAYBE, and store the index of the beginning of the NO
> block:

This is a valuable idea, but I fear it's a bit late now in this cycle. 
I have questions about some details.  For example, you mention that you 
had to fiddle with the hash seed.  How does that affect other users of 
PerfectHash?  What happens when we update Unicode data and the hash 
doesn't work anymore?  These discussions might derail this patch at this 
hour, so I have committed the previous patch.  We can consider your 
patch as a follow-up patch, either now or in the future.

> Also, if we go with v4, I noticed the following test is present twice:
>
> +SELECT "normalize"('abc', 'def');  -- run-time error

I think this is correct.  The other test is for "is_normalized".

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unicode normalization SQL functions

From
John Naylor
Date:
On Thu, Apr 2, 2020 at 3:51 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2020-03-26 18:41, John Naylor wrote:
> > We don't have a trie implementation in Postgres, but we do have a
> > perfect hash implementation. Doing that would bring the tables back to
> > 64 bits per entry, but would likely be noticeably faster than binary
> > search. Since v4 has left out the biggest tables entirely, I think
> > this might be worth a look for the smaller tables remaining.
> >
> > In the attached v5, when building the hash tables, we sort the code
> > points by NO/MAYBE, and store the index of the beginning of the NO
> > block:
>
> This is a valuable idea, but I fear it's a bit late now in this cycle.
> I have questions about some details.  For example, you mention that you
> had to fiddle with the hash seed.  How does that affect other users of
> PerfectHash?

They would still try the same multipliers they use now, so no effect on them.

> What happens when we update Unicode data and the hash
> doesn't work anymore?

The script would choose different multipliers and/or seeds
automatically. Only if you're unlucky would you have to fiddle with
the hash parameters again. The last-resort multipliers in the v2 patch
in the other thread [1] seem very effective and easily build both the
quick check D tables, which I tried for amusement's sake.

That said, we could reduce the chances of that happening this way:
After trying all the shift-and-add multipliers, we could add another
loop to try a bunch of numbers in a range. We'd need a quick check to
weed out multiples of small primes so the number has a decent chance
of being prime. To save time, just try a few seeds and move on to the
next number. Maybe emit a warning that it exhausted the shift-and-add
multipliers in case the developer wanted to intervene.

If I resubmit this, I would split the build up into two steps: have
the current manual script build the quick check array for later commit
into the tree, and build the hash function separately from that as a
Makefile distprep target. There's no reason to have the hash functions
checked in as I did in v5, just as we don't check in the keyword hash
functions.

I would also consider splitting it into two patches:

1. Keep binary search but with a more abstract array representation
(think PG_RMGR). This patch would be large but mostly mechanical.
2. Switch to hash lookup. A smaller one for ease of review.

[1] https://www.postgresql.org/message-id/CACPNZCvMMj88Bsnk1k%3DRffW6gBw%2BFH7wcwCBfcKLDM%3DUEG2UWg%40mail.gmail.com

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services