Discussion: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: PG Bug reporting form
Date: 2 December 2025, 13:03:44

The following bug has been logged on the website:

Bug reference:      19341
Logged by:          Adam Warland
Email address:      adam.warland@infor.com
PostgreSQL version: 18.1
Operating system:   Windows 11 Enterprise
Description:

When using a nondeterministic ICU collation, the replace() function fails to
replace a substring when that substring appears at the end of the input
string.

Occurrences of the same substring earlier in the string are replaced
normally.

This appears to be unintended and inconsistent with the documented
limitations of nondeterministic collations. The failure seems specific to
situations where:
        • a nondeterministic ICU collation is applied to both source and
          match strings, and
        • the substring being replaced appears as the final character of the
          source string.

The behavior reproduces reliably.

Expected Behavior
replace() should replace all occurrences of the match substring, including
one at the final position, regardless of collation — or, if nondeterministic
collations cannot support this operation, documentation should explicitly
state the limitation (as is already done for LIKE and regular expressions).

Actual Behavior
Under a nondeterministic ICU collation, the final occurrence of the
substring is not replaced, even though earlier occurrences are.
Example output (actual), replacing x with y:
res1  | testx     -- unchanged, incorrect (final character not replaced)
res2  | yabcdx    -- first 'x' replaced, final 'x' not replaced
Under a deterministic or C collation, output is correct:
res_C | testy

Environment
SELECT version();
→ PostgreSQL 18.1 (Debian 18.1-1.pgdg13+2) on x86_64-pc-linux-gnu, compiled
by gcc (Debian 14.2.0-19) 14.2.0, 64-bit.
SHOW server_encoding;
→ UTF8
SHOW lc_collate;
→ en_US.utf8
SHOW lc_ctype;
→ en_US.utf8

Specific collation used:
create collation test_nondeterministic (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
)


Minimal Reproduction Case

DROP COLLATION IF EXISTS test_nondeterministic;
CREATE COLLATION test_nondeterministic (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);
-- Replace final character under nondeterministic collation
SELECT replace(
    'testx' COLLATE "test_nondeterministic",
    'x'     COLLATE "test_nondeterministic",
    'y') AS res1;
-- Replace substring appearing twice — final one fails
SELECT replace(
    'xabcdx' COLLATE "test_nondeterministic",
    'x'      COLLATE "test_nondeterministic",
    'y') AS res2;
-- Control test using deterministic collation
SELECT replace(
    'testx' COLLATE "C",
    'x'     COLLATE "C",
    'y') AS res_C;

Observed result:
res1 and the final x in res2 are not replaced.

Additional Notes
        • The issue does not occur with deterministic ICU collations or with
          C collation.
        • The failure seems tied specifically to the last character
          position, which may indicate an off-by-one issue or a limitation in
          substring matching under nondeterministic collation rules.
        • No documentation currently states that replace() is partially
          unsupported with nondeterministic collations, although other operations
          (LIKE, regex) have historically been restricted.
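
The off-by-one suspicion can be illustrated with a toy search. This is only a sketch under assumptions: the pre-patch PostgreSQL loop is not shown in this thread, so toy_find(), toy_equal() and the inclusive_end flag are hypothetical names, plain byte equality stands in for the collation-aware comparison, and every character is assumed to be one byte. It shows how a candidate-substring bound that stops one position short of the end of the haystack can never report a match that ends at the last character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a collation-aware comparison: plain byte equality. */
static int toy_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/*
 * Return the offset of the first match of needle in haystack, or -1.
 * With inclusive_end == 0, candidate substrings must end before
 * haystack_end (the suspected faulty bound).  With inclusive_end == 1,
 * a candidate may end exactly at haystack_end.
 */
long toy_find(const char *haystack, const char *needle, int inclusive_end)
{
    const char *haystack_end;
    size_t needle_len;

    assert(haystack != NULL && needle != NULL);
    haystack_end = haystack + strlen(haystack);
    needle_len = strlen(needle);

    for (const char *hptr = haystack; hptr < haystack_end; hptr++)
    {
        /* last permitted end position for a candidate substring */
        const char *limit = inclusive_end ? haystack_end : haystack_end - 1;

        for (const char *test_end = hptr; test_end <= limit; test_end++)
        {
            if (toy_equal(hptr, (size_t) (test_end - hptr), needle, needle_len))
                return (long) (hptr - haystack);
        }
    }
    return -1;
}
```

With the exclusive bound, toy_find("testx", "x", 0) finds nothing, while an 'x' earlier in the string is still found, which mirrors the res1 and res2 symptoms reported above.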

Conclusion
replace() appears to behave incorrectly when matching a substring at the end
of a string under nondeterministic ICU collations. This is either:
        • a defect in how nondeterministic collations interact with
          substring matching functions, or
        • an undocumented limitation that should be clarified.

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 2 December 2025, 19:24:54

On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
> PostgreSQL version: 18.1
>
> When using a nondeterministic ICU collation, the replace() function fails to
> replace a substring when that substring appears at the end of the input
> string.
>
> Occurrences of the same substring earlier in the string are replaced
> normally.
>
> Specific collation used:
> create collation test_nondeterministic (
>     provider = icu,
>     locale = 'und-u-ks-level2',
>     deterministic = false
> )
>
> -- Replace final character under nondeterministic collation
> SELECT replace(
>     'testx' COLLATE "test_nondeterministic",
>     'x'     COLLATE "test_nondeterministic",
>     'y') AS res1;

I can reproduce the problem, and the attached patch fixes it for me.

I am not certain if it is safe to apply pg_mblen() to "haystack_end", though.

Yours,
Laurenz Albe

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 2 December 2025, 19:31:54

On Tue, 2025-12-02 at 17:24 +0100, Laurenz Albe wrote:
> I can reproduce the problem, and the attached patch fixes it for me.

Erm, and here is the patch.

Laurenz Albe

Attachments

v1-0001-Fix-greedy-substring-search-for-non-deterministic.patch

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Heikki Linnakangas
Date: 2 December 2025, 19:36:06

On 02/12/2025 18:24, Laurenz Albe wrote:
> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
>> PostgreSQL version: 18.1
>>
>> When using a nondeterministic ICU collation, the replace() function fails to
>> replace a substring when that substring appears at the end of the input
>> string.
>>
>> Occurrences of the same substring earlier in the string are replaced
>> normally.
>>
>> Specific collation used:
>> create collation test_nondeterministic (
>>      provider = icu,
>>      locale = 'und-u-ks-level2',
>>      deterministic = false
>> )
>>
>> -- Replace final character under nondeterministic collation
>> SELECT replace(
>>      'testx' COLLATE "test_nondeterministic",
>>      'x'     COLLATE "test_nondeterministic",
>>      'y') AS res1;
> 
> I can reproduce the problem, and the attached patch fixes it for me.

+1, looks good to me. Let's also add a regression test for this.

> I am not certain if it is safe to apply pg_mblen() to "haystack_end", though.

It doesn't do that though, does it? There are two pg_mblen() calls in 
the vicinity:

>             for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end))
>             {
>                 if (pg_strncoll(hptr, (test_end - hptr), needle, needle_len, state->locale) == 0)
>                 {
>                     state->last_match_len_tmp = (test_end - hptr);
>                     result_hptr = hptr;
>                     if (!state->greedy)
>                         break;
>                 }
>             }
>             if (result_hptr)
>                 break;
> 
>             hptr += pg_mblen(hptr);

Neither of those will get called with 'haystack_end' as far as I can see.

- Heikki

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 2 December 2025, 20:18:11

On Tue, 2025-12-02 at 18:36 +0200, Heikki Linnakangas wrote:
> +1, looks good to me. Let's also add a regression test for this.

Right, done in the attached.

> > I am not certain if it is safe to apply pg_mblen() to "haystack_end", though.
>
> It doesn't do that though, does it? There are two pg_mblen() calls in
> the vicinity:
>
> >             for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end))
> >             {
> >                 if (pg_strncoll(hptr, (test_end - hptr), needle, needle_len, state->locale) == 0)
> >                 {
> >                     state->last_match_len_tmp = (test_end - hptr);
> >                     result_hptr = hptr;
> >                     if (!state->greedy)
> >                         break;
> >                 }
> >             }
> >             if (result_hptr)
> >                 break;
> >
> >             hptr += pg_mblen(hptr);
>
> Neither of those will get called with 'haystack_end' as far as I can see.

During the last iteration of the loop, "test_end" will be equal to "haystack_end",
and the loop increment will call "pg_mblen(test_end)".
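
A small instrumented sketch of that point follows. It is an illustration under assumptions, not PostgreSQL code: toy_mblen() is a hypothetical one-byte stand-in for pg_mblen() that only counts how often it is handed the end pointer, and count_calls_at_end() is a made-up driver. It reuses the for-loop shape quoted above to show that the increment expression is evaluated once more after the pass in which test_end equals haystack_end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *guard_end;  /* one past the last valid byte of the haystack */
static int calls_at_end;       /* how often toy_mblen() was handed guard_end */

/* Toy stand-in for pg_mblen(): every character is one byte in this sketch. */
static int toy_mblen(const char *p)
{
    if (p == guard_end)
        calls_at_end++;        /* a real implementation would inspect *p here */
    return 1;
}

/*
 * Walk test_end over [hptr, haystack_end] with the for-loop shape quoted
 * in the thread and report how many times the length function was called
 * on haystack_end itself.
 */
int count_calls_at_end(const char *haystack)
{
    const char *hptr;
    const char *haystack_end;

    assert(haystack != NULL);
    hptr = haystack;
    haystack_end = haystack + strlen(haystack);

    guard_end = haystack_end;
    calls_at_end = 0;
    for (const char *test_end = hptr; test_end <= haystack_end;
         test_end += toy_mblen(test_end))
    {
        /* the comparison of [hptr, test_end) with the needle would go here */
    }
    return calls_at_end;
}
```

The toy works on NUL-terminated C strings, so the extra call is harmless here. The concern in the thread is that the real function would look at a byte that need not belong to the string.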

Yours,
Laurenz Albe

Attachments

v2-0001-Fix-greedy-substring-search-for-non-deterministic.patch

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Tom Lane
Date: 2 December 2025, 20:25:52

Laurenz Albe <laurenz.albe@cybertec.at> writes:
>>> for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end))

> During the last iteration of the loop, "test_end" will be equal to "haystack_end",
> and the loop increment will call "pg_mblen(test_end)".

Right, clearly unsafe (and I bet valgrind would complain about it).
You need to rearrange the loop logic so that we won't attempt to
increment test_end that last time through.  Perhaps a for-loop
isn't the best way to write it.

            regards, tom lane

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Heikki Linnakangas
Date: 2 December 2025, 20:29:06

On 02/12/2025 18:36, Heikki Linnakangas wrote:
> On 02/12/2025 18:24, Laurenz Albe wrote:
>> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
>>> PostgreSQL version: 18.1
>>>
>>> When using a nondeterministic ICU collation, the replace() function 
>>> fails to
>>> replace a substring when that substring appears at the end of the input
>>> string.
>>>
>>> Occurrences of the same substring earlier in the string are replaced
>>> normally.
>>>
>>> Specific collation used:
>>> create collation test_nondeterministic (
>>>      provider = icu,
>>>      locale = 'und-u-ks-level2',
>>>      deterministic = false
>>> )
>>>
>>> -- Replace final character under nondeterministic collation
>>> SELECT replace(
>>>      'testx' COLLATE "test_nondeterministic",
>>>      'x'     COLLATE "test_nondeterministic",
>>>      'y') AS res1;
>>
>> I can reproduce the problem, and the attached patch fixes it for me.
> 
> +1, looks good to me. Let's also add a regression test for this.

I added a simple test for this, and I think this is still not quite 
right. I added the following to the collate.icu.utf8 test:

  CREATE TABLE test4nfd (a int, b text);
  INSERT INTO test4nfd VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
  UPDATE test4nfd SET b = normalize(b, nfd);
  -- This shows why replace should be greedy.  Otherwise, in the NFD
  -- case, the match would stop before the decomposed accents, which
  -- would leave the accents in the results.
  SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4;
   a |  b   | replace
  ---+------+---------
   1 | cote | mate
   2 | côte | mate
   3 | coté | maté
   4 | côté | maté
  (4 rows)

  SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4nfd;
   a |  b   | replace
  ---+------+---------
   1 | cote | mate
   2 | côte | mate
   3 | coté | maté
   4 | côté | maté
  (4 rows)

+-- Test for match at the end of the string.  (We had a bug on that
+-- once)
+SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd;
+ a |  b   | replace
+---+------+---------
+ 1 | cote | coma
+ 2 | côte | coma
+ 3 | coté | coma
+ 4 | côté | coma
+(4 rows)
+

In the added test query, the accents on the 'o' are stripped, which 
doesn't look correct.
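
One possible explanation, which is a guess and not something confirmed in this thread, is that under an accent-insensitive collation a substring that starts at a decomposed combining accent can still compare equal to 'te', so the match begins at the accent and the accent is removed along with it. The toy below sketches that idea. All names are hypothetical, '^' and '\'' are ASCII stand-ins for combining accents, and toy_equal() is a crude substitute for the real collation-aware comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* '^' and '\'' stand in for combining accents (an assumption of this toy). */
static int is_mark(char c)
{
    return c == '^' || c == '\'';
}

/* Toy accent-insensitive equality: compare counted strings, skipping marks. */
static int toy_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t i = 0, j = 0;

    for (;;)
    {
        while (i < alen && is_mark(a[i])) i++;
        while (j < blen && is_mark(b[j])) j++;
        if (i == alen || j == blen)
            return i == alen && j == blen;
        if (a[i++] != b[j++])
            return 0;
    }
}

/* Greedy search: earliest start, longest extent that compares equal. */
static int toy_find(const char *haystack, const char *needle,
                    size_t *start, size_t *len)
{
    const char *end = haystack + strlen(haystack);
    size_t needle_len = strlen(needle);

    assert(needle_len > 0);
    for (const char *hptr = haystack; hptr < end; hptr++)
    {
        int found = 0;

        for (const char *test_end = hptr + 1; test_end <= end; test_end++)
        {
            if (toy_equal(hptr, (size_t) (test_end - hptr), needle, needle_len))
            {
                *start = (size_t) (hptr - haystack);
                *len = (size_t) (test_end - hptr);
                found = 1;      /* greedy: keep extending the match */
            }
        }
        if (found)
            return 1;
    }
    return 0;
}

/* Convenience wrappers; (size_t) -1 means "no match". */
size_t match_start(const char *h, const char *n)
{
    size_t s = 0, l = 0;
    return toy_find(h, n, &s, &l) ? s : (size_t) -1;
}

size_t match_len(const char *h, const char *n)
{
    size_t s = 0, l = 0;
    return toy_find(h, n, &s, &l) ? l : (size_t) -1;
}
```

In this toy, searching "co^te" for "te" yields a match at offset 2 with length 3, so replacing it with "ma" leaves "coma", the same shape as the result in the test output above. Whether ICU behaves this way for the ignore_accents collation is an assumption here.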

- Heikki

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 2 December 2025, 20:45:47

On Tue, 2025-12-02 at 19:29 +0200, Heikki Linnakangas wrote:
> I added a simple test for this, and I think this is still not quite
> right. I added the following to the collate.icu.utf8 test:
>
> +-- Test for match at the end of the string.  (We had a bug on that
> +-- once)
> +SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd;
> + a |  b   | replace
> +---+------+---------
> + 1 | cote | coma
> + 2 | côte | coma
> + 3 | coté | coma
> + 4 | côté | coma
> +(4 rows)
> +
>
> In the added test query, the accents on the 'o' are stripped, which
> doesn't look correct.

I am not sure if that is OK or not (after all, it's an accent
insensitive collation, so "coma" and "côma" should be the same).

But it seems unrelated to the bug report at hand.

Yours,
Laurenz Albe

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 2 December 2025, 20:51:14

On Tue, 2025-12-02 at 12:25 -0500, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > > > for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end))
>
> > During the last iteration of the loop, "test_end" will be equal to "haystack_end",
> > and the loop increment will call "pg_mblen(test_end)".
>
> Right, clearly unsafe (and I bet valgrind would complain about it).
> You need to rearrange the loop logic so that we won't attempt to
> increment test_end that last time through.  Perhaps a for-loop
> isn't the best way to write it.

Right.  The attached patch v3 turns it into a while loop to avoid
the problem.

Yours,
Laurenz Albe

Attachments

v3-0001-Fix-greedy-substring-search-for-non-deterministic.patch

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Tom Lane
Date: 2 December 2025, 23:53:46

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Tue, 2025-12-02 at 12:25 -0500, Tom Lane wrote:
>> You need to rearrange the loop logic so that we won't attempt to
>> increment test_end that last time through.  Perhaps a for-loop
>> isn't the best way to write it.

> Right.  The attached patch v3 turns it into a while loop to avoid
> the problem.

Looking at the code overall, I wonder if the outer loop doesn't have
the same issue.  The comments claim that we should be able to handle
zero-length matches, but if the overall haystack is of length zero,
we will fail to check for such a match.

Also, since we have haystack <= haystack_end as a starting condition,
I think both loops could omit the initial test.  I'd be inclined
to code them like

    test_ptr = start point;
    for (;;)
    {
        ...
        if (test_ptr >= haystack_end)
            break;
        test_ptr += pg_mblen(test_ptr);
    }

On the other hand ... is that comment really right about zero-length
match being possible?  If it is, the API for this function is in
need of redesign, because callers that try to find "the next match"
would go into an infinite loop re-finding the same zero-length
match over and over.
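
For illustration, here is a filled-in toy version of that loop shape, applied to both the outer and the inner loop. It is a sketch under assumptions and not the patch itself: toy_mblen(), toy_equal(), toy_search() and toy_bad_calls() are hypothetical names, characters are one byte each, byte equality replaces the collation-aware comparison, and the search is non-greedy. The counter records whether the length function is ever called at or beyond the end of the haystack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *buf_end;    /* one past the last valid byte */
static int bad_calls;          /* toy_mblen() calls made at or after buf_end */

/* Toy stand-in for pg_mblen(): one byte per character, with a guard counter. */
static int toy_mblen(const char *p)
{
    if (p >= buf_end)
        bad_calls++;
    return 1;
}

/* Toy stand-in for the collation-aware comparison: plain byte equality. */
static int toy_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* Return the offset of the first match, or -1 if there is none. */
long toy_search(const char *haystack, const char *needle)
{
    const char *haystack_end = haystack + strlen(haystack);
    size_t needle_len = strlen(needle);
    const char *hptr = haystack;

    assert(needle_len > 0);
    buf_end = haystack_end;
    bad_calls = 0;

    for (;;)
    {
        const char *test_end = hptr;

        for (;;)
        {
            if (toy_equal(hptr, (size_t) (test_end - hptr), needle, needle_len))
                return (long) (hptr - haystack);
            if (test_end >= haystack_end)
                break;          /* bound is checked before advancing */
            test_end += toy_mblen(test_end);
        }
        if (hptr >= haystack_end)
            break;
        hptr += toy_mblen(hptr);
    }
    return -1;
}

int toy_bad_calls(void)
{
    return bad_calls;
}
```

Because the bound is tested before each advance, a candidate that ends exactly at haystack_end is still compared, but the length function is never asked about the end position.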

            regards, tom lane

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 3 December 2025, 00:44:53

On Tue, 2025-12-02 at 15:53 -0500, Tom Lane wrote:
> On the other hand ... is that comment really right about zero-length
> match being possible?  If it is, the API for this function is in
> need of redesign, because callers that try to find "the next match"
> would go into an infinite loop re-finding the same zero-length
> match over and over.

I know too little about exotic collations to answer that, but it sure
would be a problem.  All I find in the discussion is the claim by
Peter E. in [1] that it is so.  Perhaps he can enlighten us.

Yours,
Laurenz Albe

 [1]: https://postgr.es/m/6107daa2-5cf7-4cf2-a526-626be1d15b18%40eisentraut.org

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 3 December 2025, 10:51:22

On Tue, 2025-12-02 at 15:53 -0500, Tom Lane wrote:
> > The attached patch v3 turns it into a while loop to avoid
> > the problem.
>
> Looking at the code overall, I wonder if the outer loop doesn't have
> the same issue.  The comments claim that we should be able to handle
> zero-length matches, but if the overall haystack is of length zero,
> we will fail to check for such a match.

If you can find zero-length matches at all, you could find a
zero-length match in a non-empty haystack.  Perhaps the function is
never called with an empty haystack...

> Also, since we have haystack <= haystack_end as a starting condition,
> I think both loops could omit the initial test.  I'd be inclined
> to code them like
>
>     test_ptr = start point;
>     for (;;)
>     {
>         ...
>         if (test_ptr >= haystack_end)
>             break;
>         test_ptr += pg_mblen(test_ptr);
>     }

True.  The attached v4 patch does it like that.

> On the other hand ... is that comment really right about zero-length
> match being possible?  If it is, the API for this function is in
> need of redesign, because callers that try to find "the next match"
> would go into an infinite loop re-finding the same zero-length
> match over and over.

Right.  I'll see if I can trigger such a case.

Yours,
Laurenz Albe

Attachments

v4-0001-Fix-greedy-substring-search-for-non-deterministic.patch

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Tom Lane
Date: 3 December 2025, 18:12:07

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Tue, 2025-12-02 at 15:53 -0500, Tom Lane wrote:
>> Looking at the code overall, I wonder if the outer loop doesn't have
>> the same issue.  The comments claim that we should be able to handle
>> zero-length matches, but if the overall haystack is of length zero,
>> we will fail to check for such a match.

> If you can find zero-length matches at all, you could find a
> zero-length match in a non-empty haystack.  Perhaps the function is
> never called with an empty haystack...

After further thought, it seems to me that this comment is an
unjustified extrapolation from what Peter actually said, which was
that the match substring could be physically shorter than the needle.
Which is certainly true, for instance case-folding or accent-stripping
might shorten the string.  But it doesn't follow that a nonempty
needle could ever match an empty substring; and that does not seem
like it could be sane behavior to me.  We're considering string
comparison here, not regexes.

We do require callers to eliminate the empty-needle case, and given
that I think we could assume that match substrings must be at least
1 byte long.  That assumption is what justifies the current API for
these functions, and perhaps we can also simplify this loop by
using it.

            regards, tom lane

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 4 December 2025, 20:14:51

On Wed, 2025-12-03 at 10:12 -0500, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > On Tue, 2025-12-02 at 15:53 -0500, Tom Lane wrote:
> > > Looking at the code overall, I wonder if the outer loop doesn't have
> > > the same issue.  The comments claim that we should be able to handle
> > > zero-length matches, but if the overall haystack is of length zero,
> > > we will fail to check for such a match.
>
>
> After further thought, it seems to me that this comment is an
> unjustified extrapolation from what Peter actually said, which was
> that the match substring could be physically shorter than the needle.
> Which is certainly true, for instance case-folding or accent-stripping
> might shorten the string.  But it doesn't follow that a nonempty
> needle could ever match an empty substring; and that does not seem
> like it could be sane behavior to me.  We're considering string
> comparison here, not regexes.
>
> We do require callers to eliminate the empty-needle case, and given
> that I think we could assume that match substrings must be at least
> 1 byte long.  That assumption is what justifies the current API for
> these functions, and perhaps we can also simplify this loop by
> using it.

I think I get it.  I don't see an explicit requirement for a non-empty
needle, but all callers of text_position_next_internal() handle that
case separately.

The attached v5 patch simplifies the loop to a do-while loop, assuming
that we cannot find a zero-length match.

I have also updated the comments to no longer mention the possibility
of an empty match, and for good measure I have added an Assert() that
the needle cannot be empty.
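
The shape described above might look roughly like the following toy. This is a sketch under assumptions and not the contents of the v5 patch, which is only attached and not quoted in this thread: toy_mblen(), toy_equal_ci() and toy_search() are hypothetical names, characters are one byte each, ASCII case-insensitive equality stands in for a nondeterministic collation, and the search is non-greedy.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for pg_mblen(): one byte per character. */
static int toy_mblen(const char *p)
{
    (void) p;
    return 1;
}

/* Toy "nondeterministic" comparison: ASCII case-insensitive equality. */
static int toy_equal_ci(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
    {
        if (tolower((unsigned char) a[i]) != tolower((unsigned char) b[i]))
            return 0;
    }
    return 1;
}

/* Return the offset of the first match, or -1 if there is none. */
long toy_search(const char *haystack, const char *needle)
{
    const char *haystack_end = haystack + strlen(haystack);
    size_t needle_len = strlen(needle);

    /* Callers are expected to handle the empty needle themselves. */
    assert(needle_len > 0);

    for (const char *hptr = haystack; hptr < haystack_end;
         hptr += toy_mblen(hptr))
    {
        const char *test_end = hptr;

        do
        {
            /* Advance first: a candidate is at least one character long. */
            test_end += toy_mblen(test_end);
            if (toy_equal_ci(hptr, (size_t) (test_end - hptr),
                             needle, needle_len))
                return (long) (hptr - haystack);
        } while (test_end < haystack_end);
    }
    return -1;
}
```

Since hptr is strictly before haystack_end whenever the body runs, and the do-while only repeats while test_end is before haystack_end, the length function is only ever called on positions inside the string, and a candidate ending exactly at haystack_end is still compared.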

Yours,
Laurenz Albe

Attachments

v5-0001-Fix-greedy-substring-search-for-non-deterministic.patch

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Tom Lane
Date: 6 December 2025, 04:12:12

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Wed, 2025-12-03 at 10:12 -0500, Tom Lane wrote:
>> We do require callers to eliminate the empty-needle case, and given
>> that I think we could assume that match substrings must be at least
>> 1 byte long.  That assumption is what justifies the current API for
>> these functions, and perhaps we can also simplify this loop by
>> using it.

> The attached v5 patch simplifies the loop to a do-while loop, assuming
> that we cannot find a zero-length match.
> I have also updated the comments to no longer mention the possibility
> of an empty match, and for good measure I have added an Assert() that
> the needle cannot be empty.

LGTM.  Pushed with tiny cosmetic fixes (mostly, more work on the
comments).

            regards, tom lane

Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Laurenz Albe
Date: 10 December 2025, 13:03:27

On Fri, 2025-12-05 at 20:12 -0500, Tom Lane wrote:
> LGTM.  Pushed with tiny cosmetic fixes (mostly, more work on the
> comments).

Thanks for all the tuition!

Yours,
Laurenz Albe
