Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
Дата
Msg-id 20260214193344.48@rfd.leadboat.com
обсуждение исходный текст
Ответ на Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
Список pgsql-bugs
On Sun, Feb 15, 2026 at 08:07:22AM +1300, Thomas Munro wrote:
> On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <noah@leadboat.com> wrote:
> > > [mblen-valgrind-after-report-v1.patch]
> >
> > LGTM.  The new valgrind check should clearly be after the new non-local exit.
> >
> > Studying the other patch...

Thank you.

>                         /*
> -                        * Total slice size in bytes can't be any
> longer than the start
> -                        * position plus substring length times the
> encoding max length.
> -                        * If that overflows, we can just use -1.
> +                        * Total slice size in bytes can't be any
> longer than the
> +                        * inclusive end position times the encoding
> max length.  If that
> +                        * overflows, we can just use -1.
>                          */
> -                       if (pg_mul_s32_overflow(E, eml, &slice_size))
> +                       if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
>                                 slice_size = -1;
> 
> Isn't it still conceptually "exclusive", but adjusted to be zero-indexed?

Since it's discrete (being an integer), "up to E, exclusive" and "up to E - 1,
inclusive" are the same thing.  My comment may not be the optimal way to
express that.

>                 /* Now we can get the actual length of the slice in MB
> characters */
> -               slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
> -
>                  slice_len);
> +               slice_strlen =
> +                       (slice_size == -1 ?
> +                        pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
> +                        pg_mbcharcliplen_chars(VARDATA_ANY(slice),
> slice_len, E - 1));
> 
> Comment presumably needs adjustment to say that we only count as far
> as we need to, and why.

Changed to:

        /*
         * Now we can get the actual length of the slice in MB characters,
         * stopping at the end of the substring.  Continuing beyond the
         * substring end could find an incomplete character attributable
         * solely to DatumGetTextPSlice() chopping in the middle of a
         * character, and it would be superfluous work at best.
         */

> There is something a bit strange about all this, though.
> pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters
> that really exist past 2^29 (~500 million), you must get an empty
> string, right?   That's hard to reach, pre-existing and out of scope
> for the immediate problem report, except ... now we're contorting the
> code even further to keep it.

- slice_size is the amount we *requested* from the toaster.  It can be -1,
  which retrieves the max available.
- slice_len is the amount *returned* from the toaster.  It's nonnegative.

Does a behavior specific to strings >2^29 still exist?

> The outline I had come up with before seeing your patch was: let's
> just delete it.  The position search can check bounds incrementally,
> following our general approach.  This avoids the reported problem by
> ditching the pre-flight scan through the slice (up to 4x more
> pg_mblen_XXX calls and memory access than we strictly need), and also
> the special cases for empty strings since they already fall out of the
> general behaviour (am I missing something?), not leaving much code
> behind.

Like you, I made a note that it's wasteful to make two mblen passes over the
string.  I'm only seeing a 50% reduction in mblen calls, not an 80% reduction,
but I didn't look too closely.  I guessed such a change would be less clearly
correct, so I figured it would be less suitable for back branches.  Hence, I
didn't draft it.

> As far as I can see so far, the only user-visible side-effect
> requires corruption: substring() moves from the
> internal-NUL-is-terminator category to internal-NUL-is-character
> category, but that's an implementation detail.

That does carry some risk, not necessarily too much to accept.

> When I saw your patch yesterday, I initially abandoned the thought,
> thinking that your idea looked more conservative, but after sleeping
> on it and reflecting again on these oddities, I have merged my draft
> implementation with your tests, ancient detoasting fence post
> observation and commit message, just to see if you think this approach
> might be worth considering further.

My first impression, hurried due to the commit ETA in 30 minutes, is that this
is less conservative and should hold for master-only.



В списке pgsql-bugs по дате отправления: