> Now, in the multibyte case, again in textlen(), I see:
>
> /* optimization for single byte encoding */
> if (pg_database_encoding_max_length() <= 1)
>     PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
>
> PG_RETURN_INT32(
>     pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
>
> Three questions here.
> 1) In the case of encoding max length == 1, can we treat it the same as
> the non-multibyte case (I presume they are exactly the same)?
Yes.
> 2) Can encoding max length ever be < 1? Doesn't make sense to me.
No. It seems to be just defensive coding.
> 3) In the case of encoding max length > 1, if I understand correctly,
> each encoded character can be one *or more* bytes, up to and including
> encoding max length bytes.
Right.
> So the *only* way presently to get the length
> of the original character string is to loop through the entire string
> checking the length of each individual character (that's what
> pg_mbstrlen_with_len() does it seems)?
Yes.
> Finally, if 3) is true, then there is no way to avoid the retrieval and
> decompression of the datum just to find out its length. For large
> datums, detoasting plus the looping through each character would add a
> huge amount of overhead just to get at the length of the original
> string. I don't know if we need to be able to get *just* the length
> often enough to really care, but if we do, I had an idea for some future
> release (I wouldn't propose doing this for 7.3):
>
> - add a new EXTENDED state to va_external for MULTIBYTE
> - any string with max encoding length > 1 would be EXTENDED even if it
> is not EXTERNAL and not COMPRESSED.
> - to each of the structs in the union, add va_strlen
> - populate va_strlen on INSERT and maintain it on UPDATE.
>
> Now a new function similar to toast_raw_datum_size(), maybe
> toast_raw_datum_strlen() could be used to get the original string
> length, whether MB or not, without needing to retrieve and decompress
> the entire datum.
>
> I understand we would either: have to steal another bit from the VARHDR
> which would reduce the effective size of a varlena from 1GB down to .5GB;
> or we would need to add a byte or two to the VARHDR which is extra
> per-datum overhead. I'm not sure we would want to do either. But I
> wanted to toss out the idea while it was fresh on my mind.
Interesting idea. I was also thinking about adding some extra
information to text data types, such as character set, collation,
etc., for 7.4 or later.
--
Tatsuo Ishii