Tatsuo Ishii wrote:
>>Now a new function similar to toast_raw_datum_size(), maybe
>>toast_raw_datum_strlen() could be used to get the original string
>>length, whether MB or not, without needing to retrieve and decompress
>>the entire datum.
>>
>>I understand we would either: have to steal another bit from the VARHDR,
>>which would reduce the effective size of a varlena from 1GB down to .5GB;
>>or we would need to add a byte or two to the VARHDR, which is extra
>>per-datum overhead. I'm not sure we would want to do either. But I
>>wanted to toss out the idea while it was fresh in my mind.
>
>
> Interesting idea. I have also been thinking about adding some extra
> information to text data types, such as character set and collation,
> for 7.4 or later.
I ran some tests to confirm the theory above regarding overhead:
create table strtest(f1 text);

-- insert 100 rows of a 100000-character string
do 100 times
  insert into strtest values('12345....');
loop

-- timed portion
do 1000 times
  select length(f1) from strtest;
loop
Results:
SQL_ASCII database, new code:
=============================
PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
==> 2 seconds
SQL_ASCII database, old code:
=============================
text *t = PG_GETARG_TEXT_P(0);
PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
==> 66 seconds
EUC_JP database, new & old code:
================================
text *t = PG_GETARG_TEXT_P(0);
PG_RETURN_INT32(pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
==> 469 seconds
So it appears that, while detoasting is moderately expensive (adds 64
seconds to the test), the call to pg_mbstrlen_with_len() is very
expensive (adds 403 seconds to the test).
Joe