Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
| От | Noah Misch |
|---|---|
| Тема | Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 |
| Дата | |
| Msg-id | 20260214193344.48@rfd.leadboat.com обсуждение исходный текст |
| Ответ на | Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 (Thomas Munro <thomas.munro@gmail.com>) |
| Ответы |
Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
|
| Список | pgsql-bugs |
On Sun, Feb 15, 2026 at 08:07:22AM +1300, Thomas Munro wrote:
> On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <noah@leadboat.com> wrote:
> > > [mblen-valgrind-after-report-v1.patch]
> >
> > LGTM. The new valgrind check should clearly be after the new non-local exit.
> >
> > Studying the other patch...
Thank you.
> /*
> - * Total slice size in bytes can't be any
> longer than the start
> - * position plus substring length times the
> encoding max length.
> - * If that overflows, we can just use -1.
> + * Total slice size in bytes can't be any
> longer than the
> + * inclusive end position times the encoding
> max length. If that
> + * overflows, we can just use -1.
> */
> - if (pg_mul_s32_overflow(E, eml, &slice_size))
> + if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
> slice_size = -1;
>
> Isn't it still conceptually "exclusive", but adjusted to be zero-indexed?
Since it's discrete (being an integer), "up to E, exclusive" and "up to E - 1,
inclusive" are the same thing. My comment may not be the optimal way to
express that.
> /* Now we can get the actual length of the slice in MB
> characters */
> - slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
> -
> slice_len);
> + slice_strlen =
> + (slice_size == -1 ?
> + pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
> + pg_mbcharcliplen_chars(VARDATA_ANY(slice),
> slice_len, E - 1));
>
> Comment presumably needs adjustment to say that we only count as far
> as we need to, and why.
Changed to:
/*
* Now we can get the actual length of the slice in MB characters,
* stopping at the end of the substring. Continuing beyond the
* substring end could find an incomplete character attributable
* solely to DatumGetTextPSlice() chopping in the middle of a
* character, and it would be superfluous work at best.
*/
> There is something a bit strange about all this, though.
> pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters
> that really exist past 2^29 (~500 million), you must get an empty
> string, right? That's hard to reach, pre-existing and out of scope
> for the immediate problem report, except ... now we're contorting the
> code even further to keep it.
- slice_size is the amount we *requested* from the toaster. It can be -1,
which retrieves the max available.
- slice_len is the amount *returned* from the toaster. It's nonnegative.
Does a behavior specific to strings >2^29 still exist?
> The outline I had come up with before seeing your patch was: let's
> just delete it. The position search can check bounds incrementally,
> following our general approach. This avoids the reported problem by
> ditching the pre-flight scan through the slice (up to 4x more
> pg_mblen_XXX calls and memory access than we strictly need), and also
> the special cases for empty strings since they already fall out of the
> general behaviour (am I missing something?), not leaving much code
> behind.
Like you, I made a note that it's wasteful to make two mblen passes over the
string. I'm only seeing a 50% reduction in mblen calls, not an 80% reduction,
but I didn't look too closely. I guessed such a change would be less clearly
correct, so I figured it would be less suitable for back branches. Hence, I
didn't draft it.
> As far as I can see so far, the only user-visible side-effect
> requires corruption: substring() moves from the
> internal-NUL-is-terminator category to internal-NUL-is-character
> category, but that's an implementation detail.
That does carry some risk, not necessarily too much to accept.
> When I saw your patch yesterday, I initially abandoned the thought,
> thinking that your idea looked more conservative, but after sleeping
> on it and reflecting again on these oddities, I have merged my draft
> implementation with your tests, ancient detoasting fence post
> observation and commit message, just to see if you think this approach
> might be worth considering further.
My first impression, hurried due to the commit ETA in 30 minutes, is that this
is less conservative and should hold for master-only.
В списке pgsql-bugs по дате отправления: