Обсуждение: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation
BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation
От
PG Bug reporting form
Дата:
The following bug has been logged on the website:
Bug reference: 19341
Logged by: Adam Warland
Email address: adam.warland@infor.com
PostgreSQL version: 18.1
Operating system: Windows 11 Enterprise
Description:
When using a nondeterministic ICU collation, the replace() function fails to
replace a substring when that substring appears at the end of the input
string.
Occurrences of the same substring earlier in the string are replaced
normally.
This appears to be unintended and inconsistent with the documented
limitations of nondeterministic collations. The failure seems specific to
situations where:
• a nondeterministic ICU collation is applied to both source and
match strings, and
• the substring being replaced appears as the final character of the
source string.
The behavior reproduces reliably.
Expected Behavior
replace() should replace all occurrences of the match substring, including
one at the final position, regardless of collation — or, if nondeterministic
collations cannot support this operation, documentation should explicitly
state the limitation (as is already done for LIKE and regular expressions).
Actual Behavior
Under a nondeterministic ICU collation, the final occurrence of the
substring is not replaced, even though earlier occurrences are.
Example output (actual) replace x with y :
res1 | testx -- unchanged, incorrect (final character not replaced)
res2 | yabcdx -- first 'x' replaced, final 'x' not replaced
Under a deterministic or C collation, output is correct:
res_C | testy
Environment
SELECT version();
→ PostgreSQL 18.1 (Debian 18.1-1.pgdg13+2) on x86_64-pc-linux-gnu, compiled
by gcc (Debian 14.2.0-19) 14.2.0, 64-bit.
SHOW server_encoding;
→ UTF8
SHOW lc_collate;
SHOW lc_ctype;
→ en_US.utf8
en_US.utf8
Specific collation used:
create collation test_nondeterministic (
provider = icu,
locale = 'und-u-ks-level2',
deterministic = false
)
Minimal Reproduction Case
drop COLLATION if EXISTS test_nondeterministic;
create collation test_nondeterministic (
provider = icu,
locale = 'und-u-ks-level2',
deterministic = false
);
-- Replace final character under nondeterministic collation
SELECT replace(
'testx' COLLATE "test_nondeterministic",
'x' COLLATE "test_nondeterministic",
'y') AS res1;
-- Replace substring appearing twice — final one fails
SELECT replace(
'xabcdx' COLLATE "test_nondeterministic",
'x' COLLATE "test_nondeterministic",
'y') AS res2;
-- Control test using deterministic collation
SELECT replace(
'testx' COLLATE "C",
'x' COLLATE "C",
'y') AS res_C;
Observed result:
res1 and the final x in res2 are not replaced.
Additional Notes
• The issue does not occur with deterministic ICU collations or with
C collation.
• The failure seems tied specifically to the last character
position, which may indicate an off-by-one issue or a limitation in
substring matching under nondeterministic collation rules.
• No documentation currently states that replace() is partially
unsupported with nondeterministic collations, although other operations
(LIKE, regex) have historically been restricted.
Conclusion
replace() appears to behave incorrectly when matching a substring at the end
of a string under nondeterministic ICU collations. This is either:
• a defect in how nondeterministic collations interact with
substring matching functions, or
an undocumented limitation that should be clarified.
On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote: > PostgreSQL version: 18.1 > > When using a nondeterministic ICU collation, the replace() function fails to > replace a substring when that substring appears at the end of the input > string. > > Occurrences of the same substring earlier in the string are replaced > normally. > > Specific collation used: > create collation test_nondeterministic ( > provider = icu, > locale = 'und-u-ks-level2', > deterministic = false > ) > > -- Replace final character under nondeterministic collation > SELECT replace( > 'testx' COLLATE "test_nondeterministic", > 'x' COLLATE "test_nondeterministic", > 'y') AS res1; I can reproduce the problem, and the attached patch fixes it for me. I am not certain if it is safe to apply pg_mblen() to "haystack_end", though. Yours, Laurenz Albe
On Tue, 2025-12-02 at 17:24 +0100, Laurenz Albe wrote: > I can reproduce the problem, and the attached patch fixes it for me. Erm, and here is the patch. Laurenz Albe
Вложения
Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation
От
Heikki Linnakangas
Дата:
On 02/12/2025 18:24, Laurenz Albe wrote:
> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
>> PostgreSQL version: 18.1
>>
>> When using a nondeterministic ICU collation, the replace() function fails to
>> replace a substring when that substring appears at the end of the input
>> string.
>>
>> Occurrences of the same substring earlier in the string are replaced
>> normally.
>>
>> Specific collation used:
>> create collation test_nondeterministic (
>> provider = icu,
>> locale = 'und-u-ks-level2',
>> deterministic = false
>> )
>>
>> -- Replace final character under nondeterministic collation
>> SELECT replace(
>> 'testx' COLLATE "test_nondeterministic",
>> 'x' COLLATE "test_nondeterministic",
>> 'y') AS res1;
>
> I can reproduce the problem, and the attached patch fixes it for me.
+1, looks good to me. Let's also add a regression test for this.
> I am not certain if it is safe to apply pg_mblen() to "haystack_end", though.
It doesn't do that though, does it? There are two pg_mblen() calls in
the vicinity:
> for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end))
> {
> if (pg_strncoll(hptr, (test_end - hptr), needle, needle_len, state->locale) == 0)
> {
> state->last_match_len_tmp = (test_end - hptr);
> result_hptr = hptr;
> if (!state->greedy)
> break;
> }
> }
> if (result_hptr)
> break;
>
> hptr += pg_mblen(hptr);
Neither of those will get called with 'haystack_end' as far as I can see.
- Heikki
On Tue, 2025-12-02 at 18:36 +0200, Heikki Linnakangas wrote:
> +1, looks good to me. Let's also add a regression test for this.
Right, done in the attached.
> > I am not certain if it is safe to apply pg_mblen() to "haystack_end", though.
>
> It doesn't do that though, does it? There are two pg_mblen() calls in
> the vicinity:
>
> > for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end))
> > {
> > if (pg_strncoll(hptr, (test_end - hptr), needle, needle_len, state->locale) == 0)
> > {
> > state->last_match_len_tmp = (test_end - hptr);
> > result_hptr = hptr;
> > if (!state->greedy)
> > break;
> > }
> > }
> > if (result_hptr)
> > break;
> >
> > hptr += pg_mblen(hptr);
>
> Neither of those will get called with 'haystack_end' as far as I can see.
During the last iteration of the loop, "test_end" will be equal to "haystack_end",
and the loop increment will call "pg_mblen(test_end)".
Yours,
Laurenz Albe
Вложения
Laurenz Albe <laurenz.albe@cybertec.at> writes:
>>> for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end))
> During the last iteration of the loop, "test_end" will be equal to "haystack_end",
> and the loop increment will call "pg_mblen(test_end)".
Right, clearly unsafe (and I bet valgrind would complain about it).
You need to rearrange the loop logic so that we won't attempt to
increment test_end that last time through. Perhaps a for-loop
isn't the best way to write it.
regards, tom lane
Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation
От
Heikki Linnakangas
Дата:
On 02/12/2025 18:36, Heikki Linnakangas wrote: > On 02/12/2025 18:24, Laurenz Albe wrote: >> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote: >>> PostgreSQL version: 18.1 >>> >>> When using a nondeterministic ICU collation, the replace() function >>> fails to >>> replace a substring when that substring appears at the end of the input >>> string. >>> >>> Occurrences of the same substring earlier in the string are replaced >>> normally. >>> >>> Specific collation used: >>> create collation test_nondeterministic ( >>> provider = icu, >>> locale = 'und-u-ks-level2', >>> deterministic = false >>> ) >>> >>> -- Replace final character under nondeterministic collation >>> SELECT replace( >>> 'testx' COLLATE "test_nondeterministic", >>> 'x' COLLATE "test_nondeterministic", >>> 'y') AS res1; >> >> I can reproduce the problem, and the attached patch fixes it for me. > > +1, looks good to me. Let's also add a regression test for this. I added a simple test for this, and I think this is still not quite right. I added the following to collate.icu.utf test: CREATE TABLE test4nfd (a int, b text); INSERT INTO test4nfd VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté'); UPDATE test4nfd SET b = normalize(b, nfd); -- This shows why replace should be greedy. Otherwise, in the NFD -- case, the match would stop before the decomposed accents, which -- would leave the accents in the results. SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4; a | b | replace ---+------+--------- 1 | cote | mate 2 | côte | mate 3 | coté | maté 4 | côté | maté (4 rows) SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4nfd; a | b | replace ---+------+--------- 1 | cote | mate 2 | côte | mate 3 | coté | maté 4 | côté | maté (4 rows) +-- Test for match at the end of the string. (We had a bug on that +-- once) +SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd; + a | b | replace +---+------+--------- + 1 | cote | coma + 2 | côte | coma + 3 | coté | coma + 4 | côté | coma +(4 rows) + In the added test query, the accents on the 'o' are stripped, which doesn't look correct. - Heikki
On Tue, 2025-12-02 at 19:29 +0200, Heikki Linnakangas wrote: > I added a simple test for this, and I think this is still not quite > right. I added the following to collate.icu.utf test: > > +-- Test for match at the end of the string. (We had a bug on that > +-- once) > +SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd; > + a | b | replace > +---+------+--------- > + 1 | cote | coma > + 2 | côte | coma > + 3 | coté | coma > + 4 | côté | coma > +(4 rows) > + > > In the added test query, the accents on the 'o' are stripped, which > doesn't look correct. I am not sure if that is OK or not (after all, it's an accent insensitive collation, so "coma" and "côma" should be the same). But it seems unrelated to the bug report at hand. Yours, Laurenz Albe
On Tue, 2025-12-02 at 12:25 -0500, Tom Lane wrote: > Laurenz Albe <laurenz.albe@cybertec.at> writes: > > > > for (const char *test_end = hptr; test_end <= haystack_end; test_end += pg_mblen(test_end)) > > > During the last iteration of the loop, "test_end" will be equal to "haystack_end", > > and the loop increment will call "pg_mblen(test_end)". > > Right, clearly unsafe (and I bet valgrind would complain about it). > You need to rearrange the loop logic so that we won't attempt to > increment test_end that last time through. Perhaps a for-loop > isn't the best way to write it. Right. The attached patch v3 turns it into a while loop to avoid the problem. Yours, Laurenz Albe
Вложения
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Tue, 2025-12-02 at 12:25 -0500, Tom Lane wrote:
>> You need to rearrange the loop logic so that we won't attempt to
>> increment test_end that last time through. Perhaps a for-loop
>> isn't the best way to write it.
> Right. The attached patch v3 turns it into a while loop to avoid
> the problem.
Looking at the code overall, I wonder if the outer loop doesn't have
the same issue. The comments claim that we should be able to handle
zero-length matches, but if the overall haystack is of length zero,
we will fail to check for such a match.
Also, since we have haystack <= haystack_end as a starting condition,
I think both loops could omit the initial test. I'd be inclined
to code them like
test_ptr = start point;
for (;;)
{
...
if (test_ptr >= haystack_end)
break;
test_ptr += pg_mblen(test_ptr);
}
On the other hand ... is that comment really right about zero-length
match being possible? If it is, the API for this function is in
need of redesign, because callers that try to find "the next match"
would go into an infinite loop re-finding the same zero-length
match over and over.
regards, tom lane
On Tue, 2025-12-02 at 15:53 -0500, Tom Lane wrote: > On the other hand ... is that comment really right about zero-length > match being possible? If it is, the API for this function is in > need of redesign, because callers that try to find "the next match" > would go into an infinite loop re-finding the same zero-length > match over and over. I know too little about exotic collations to answer that, but it sure would be a problem. All I find in the discussion is the claim by Peter E. in [1] that it is so. Perhaps he can enlighten us. Yours, Laurenz Albe [1]: https://postgr.es/m/6107daa2-5cf7-4cf2-a526-626be1d15b18%40eisentraut.org