Discussion: Re: [GENERAL] workaround for lack of REPLACE() function
(crossposted to -hackers, should follow up on that list)
> Well, OVERLAY is defined as:
> overlay(string placing string from integer [for integer])
> and replace() is defined (by me at least) as:
> replace(inputstring, old-substr, new-substr)
OK.
> OVERLAY requires that I know the "from" position and possibly the "for"
> in advance. Other functions (such as strpos() and substr()) can be used
> to help...
Right. So you can do your example pretty easily:
thomas=# select overlay(f1 placing '' from position('/local' in f1)
thomas-# for length('/local')) from strtest;
overlay
--------------------
/usr/pgsql/data
/m1/usr/pgsql/data
And if you don't like that much typing you can do:
thomas=# create function replace(text, text, text) returns text as '
thomas'# select overlay($1 placing $3 from position($2 in $1) for
length($2));
thomas'# ' language 'sql';
CREATE FUNCTION
thomas=# select replace(f1, '/local', '') from strtest;
replace
--------------------
/usr/pgsql/data
/m1/usr/pgsql/data
> But now what happens if you wanted to replace all of the '/' characters
> with '\'?...
> You can't do this at all with overlay(), unless you want to write a
> PL/pgSQL function and loop through each string. I started out with
> exactly this, using strpos() and substr(), but I thought a C function
> was cleaner, and it is certainly faster.
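[The replace-all loop being described might look roughly like this in plain C; `replace_all` is a hypothetical helper name for illustration, and the real backend function would have to operate on text datums with palloc rather than C strings and malloc:]

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Replace every occurrence of "from" in "src" with "to".
 * Returns a freshly malloc'd string (caller frees).  Plain-C
 * sketch only -- the backend version must work on text datums. */
static char *
replace_all(const char *src, const char *from, const char *to)
{
    size_t  fromlen = strlen(from);
    size_t  tolen = strlen(to);
    size_t  count = 0;
    const char *p, *hit;
    char   *result, *out;

    if (fromlen == 0)
    {
        result = malloc(strlen(src) + 1);   /* nothing to search for */
        strcpy(result, src);
        return result;
    }

    /* first pass: count matches so the result can be sized exactly */
    for (p = src; (hit = strstr(p, from)) != NULL; p = hit + fromlen)
        count++;

    result = malloc(strlen(src) - count * fromlen + count * tolen + 1);
    out = result;

    /* second pass: copy, splicing in the replacement at each match */
    for (p = src; (hit = strstr(p, from)) != NULL; p = hit + fromlen)
    {
        memcpy(out, p, hit - p);    /* text before the match */
        out += hit - p;
        memcpy(out, to, tolen);     /* the replacement */
        out += tolen;
    }
    strcpy(out, p);                 /* tail after the last match */
    return result;
}
```

Replacing every '/' with '\' is just the single-character case of the same loop.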
OK, this is in the "can't do it with what we have" category. Should we
have it accept a regular expression rather than a simple string? In
either case it should probably go into the main distro. Except that I
see "REPLACE" is mentioned as a reserved word in SQL99, but it has no
other mention in my copy of the draft standard. Anyone else have an
idea what it might be used for in the standard?
The other functions look useful too, unless to_char() and varbit can be
evolved to support this functionality.
- Thomas
Thomas Lockhart wrote:
> (crossposted to -hackers, should follow up on that list)
<snip>
> OK, this is in the "can't do it with what we have" category. Should we have
> it accept a regular expression rather than a simple string? In either
> case it should probably go into the main distro. Except that I see
> "REPLACE" is mentioned as a reserved word in SQL99. But has no other
> mention in my copy of the draft standard. Anyone else have an idea what
> it might be used for in the standard?
Not sure, but I see what you mean. Perhaps because of Oracle pushing to
legitimize the "CREATE OR REPLACE" syntax? In any case, this works in 8i:
SQL> select replace('hello','l','x') from dual;
REPLACE('HELLO','L','X')
------------------------
hexxo
and here it is in MSSQL 7:
select replace('hello','l','x')
--------------------------------
hexxo
(1 row(s) affected)
and my proposed PostgreSQL function:
test=# select replace('hello','l','x');
 replace
---------
 hexxo
(1 row)
so at least we would be consistent/compatible with these two.
>
> The other functions look useful too, unless to_char() and varbit can be
> evolved to support this functionality.
I will take a look at merging these into existing functions, but I have
a few other things ahead of this in my queue.
One of the reasons I wasn't pushing too hard to get replace() into the
backend is because my current solution is a bit of a hack. It uses the
builtin length, strpos and substr text functions (which I think makes
sense since they already know how to deal with mb strings), but because
they accept and return text, I'm doing lots of conversions back and
forth from (text *) to (char *). To do this "right" probably means
reworking the text string manipulation functions to be wrappers around
some equivalent functions accepting and returning C strings. That was
more work than I had time for when I wrote the current replace(). But as
I said, if there is support for getting this into the backend, I'll add
it to my todo list:
- Create new backend function replace()
- Either create new backend functions, or merge into existing functions:
to_hex() and extract_tok()
Joe
Joe Conway wrote:
> more work than I had time for when I wrote the current replace(). But as
> I said, if there is support for getting this into the backend, I'll add
> it to my todo list:
>
> - Create new backend function replace()
> - Either create new backend functions, or merge into existing functions:
> to_hex() and extract_tok()
>
I'm just starting to take a look at this again. While studying the
current text_substr() function I found two behaviors which conflict with
specific SQL92/SQL99 requirements, and one bug. First the spec
compliance -- SQL92 section 6.7/SQL99 section 6.18 say:
If <character substring function> is specified, then:
a) Let C be the value of the <character value expression>, let LC be
   the length of C, and let S be the value of the <start position>.
b) If <string length> is specified, then let L be the value of <string
   length> and let E be S+L. Otherwise, let E be the larger of LC+1
   and S.
c) If either C, S, or L is the null value, then the result of the
   <character substring function> is the null value.
d) If E is less than S, then an exception condition is raised: data
   exception - substring error.
e) Case:
   i) If S is greater than LC or if E is less than 1, then the result
      of the <character substring function> is a zero-length string.
   ii) Otherwise,
      1) Let S1 be the larger of S and 1. Let E1 be the smaller of E
         and LC+1. Let L1 be E1-S1.
      2) The result of the <character substring function> is a
         character string containing the L1 characters of C starting
         at character number S1, in the same order that the characters
         appear in C.
The only way for d) to be true is when L < 0. Instead of an error, we do:
test=# select substr('hello',2,-1);
 substr
--------
 ello
(1 row)
The other spec issue is wrt para e)i). If E (=S+L) < 1, we should return
a zero-length string. Currently I get:
test=# select substr('hello',-4,3);
 substr
--------
 hello
(1 row)
Neither behavior is documented (unless it's somewhere other than:
http://developer.postgresql.org/docs/postgres/functions-string.html ).
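[For reference, steps a) through e) of the spec boil down to a few lines of index arithmetic. A minimal C sketch, illustrative only: single-byte characters assumed, and `sql92_substr` is a made-up name, not the backend function:]

```c
#include <assert.h>
#include <string.h>

/* SQL92 <character substring function> index arithmetic, sketched
 * for single-byte characters.  s is the 1-based start position;
 * has_len says whether FOR was given; len is the requested length.
 * Writes the result into out (sized by the caller) and returns 0,
 * or returns -1 for the "data exception - substring error" case. */
static int
sql92_substr(const char *c, int s, int has_len, int len, char *out)
{
    int     lc = (int) strlen(c);                   /* LC */
    int     e = has_len ? s + len                   /* E = S + L */
                        : (lc + 1 > s ? lc + 1 : s);/* larger of LC+1, S */

    if (e < s)                  /* step d): only possible when L < 0 */
        return -1;
    if (s > lc || e < 1)        /* step e)i): zero-length result */
    {
        out[0] = '\0';
        return 0;
    }
    {
        int     s1 = s > 1 ? s : 1;             /* larger of S and 1  */
        int     e1 = e < lc + 1 ? e : lc + 1;   /* smaller of E, LC+1 */

        memcpy(out, c + (s1 - 1), e1 - s1);     /* L1 = E1 - S1 chars */
        out[e1 - s1] = '\0';
    }
    return 0;
}
```

With these rules, substr('hello',2,-1) raises the substring error and substr('hello',-4,3) returns a zero-length string, matching paragraphs d) and e)i).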
The bug is this one:
test=# create DATABASE testmb with encoding = 'EUC_JP';
CREATE DATABASE
test=# \c testmb
You are now connected to database testmb.
testmb=# select substr('hello',6,2);
 substr
--------
 ~
(1 row)
testmb=# \c test
You are now connected to database test.
test=# select substr('hello',6,2);
 substr
--------

(1 row)
The multibyte database behavior is the bug. The SQL_ASCII behavior is
correct (zero-length string):
test=# select substr('hello',6,2) is null;
 ?column?
----------
 f
(1 row)
Any objection if I rework this function to meet SQL92 and fix the bug?
Or is the SQL92 part not desirable because it breaks backward
compatibility?
In any case, can the #ifdef MULTIBYTE's be removed now in favor of a
test for encoding max length?
Joe
> Any objection if I rework this function to meet SQL92 and fix the bug?

I don't object.

> Or is the SQL92 part not desirable because it breaks backward
> compatibility?

I don't think so.

> In any case, can the #ifdef MULTIBYTE's be removed now in favor of a
> test for encoding max length?

Sure.
--
Tatsuo Ishii
Tatsuo Ishii wrote:
>> Any objection if I rework this function to meet SQL92 and fix the bug?

I've started working on text_substr() as described in this thread
(which is hopefully prep work for the replace() function that started
the thread). I haven't really looked at toast or multibyte closely
before, so I'd like to ask a couple of questions to be sure I'm
understanding the relevant issues correctly.

First, in textlen() I see (ignoring multibyte for a moment):

    text *t = PG_GETARG_TEXT_P(0);
    PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);

Tom has pointed out to me before that PG_GETARG_TEXT_P(n) incurs the
overhead of retrieving and possibly decompressing a toasted datum. So
my first question is, can we simply do:

    PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);

and save the overhead of retrieving and decompressing the whole datum?

Now, in the multibyte case, again in textlen(), I see:

    /* optimization for single byte encoding */
    if (pg_database_encoding_max_length() <= 1)
        PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);

    PG_RETURN_INT32(
        pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));

Three questions here.
1) In the case of encoding max length == 1, can we treat it the same as
the non-multibyte case (I presume they are exactly the same)?
2) Can encoding max length ever be < 1? Doesn't make sense to me.
3) In the case of encoding max length > 1, if I understand correctly,
each encoded character can be one *or more* bytes, up to and including
encoding max length bytes. So the *only* way presently to get the
length of the original character string is to loop through the entire
string, checking the length of each individual character (that's what
pg_mbstrlen_with_len() does, it seems)?

Finally, if 3) is true, then there is no way to avoid the retrieval and
decompression of the datum just to find out its length.
For large datums, detoasting plus the looping through each character
would add a huge amount of overhead just to get at the length of the
original string. I don't know if we need to be able to get *just* the
length often enough to really care, but if we do, I had an idea for
some future release (I wouldn't propose doing this for 7.3):

- add a new EXTENDED state to va_external for MULTIBYTE
- any string with max encoding length > 1 would be EXTENDED even if it
  is not EXTERNAL and not COMPRESSED
- to each of the structs in the union, add va_strlen
- populate va_strlen on INSERT and maintain it on UPDATE

Now a new function similar to toast_raw_datum_size(), maybe
toast_raw_datum_strlen(), could be used to get the original string
length, whether MB or not, without needing to retrieve and decompress
the entire datum.

I understand we would either have to steal another bit from the VARHDR,
which would reduce the effective size of a varlena from 1GB down to
0.5GB, or we would need to add a byte or two to the VARHDR, which is
extra per-datum overhead. I'm not sure we would want to do either. But
I wanted to toss out the idea while it was fresh on my mind.

Thanks,

Joe
Tatsuo Ishii wrote:
>> Any objection if I rework this function to meet SQL92 and fix the bug?
> I don't object.

One more question on this: how can I generate some characters with > 1
encoding length? I need a way to test the work I'm doing, and I'm not
quite sure how to test it. Just making a database that uses a MB
encoding doesn't make 0-9/A-Z/a-z into characters of > 1 byte, does it?
Sorry, but never having used MB encoding has left me a bit clueless on
this ;-)

Joe
> Now, in the multibyte case, again in textlen(), I see:
>
>     /* optimization for single byte encoding */
>     if (pg_database_encoding_max_length() <= 1)
>         PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
>
>     PG_RETURN_INT32(
>         pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
>
> Three questions here.
> 1) In the case of encoding max length == 1, can we treat it the same as
> the non-multibyte case (I presume they are exactly the same)?

Yes.

> 2) Can encoding max length ever be < 1? Doesn't make sense to me.

No. It seems just defensive coding.

> 3) In the case of encoding max length > 1, if I understand correctly,
> each encoded character can be one *or more* bytes, up to and including
> encoding max length bytes.

Right.

> So the *only* way presently to get the length of the original
> character string is to loop through the entire string, checking the
> length of each individual character (that's what
> pg_mbstrlen_with_len() does, it seems)?

Yes.

> Finally, if 3) is true, then there is no way to avoid the retrieval
> and decompression of the datum just to find out its length. For large
> datums, detoasting plus the looping through each character would add a
> huge amount of overhead just to get at the length of the original
> string. I don't know if we need to be able to get *just* the length
> often enough to really care, but if we do, I had an idea for some
> future release (I wouldn't propose doing this for 7.3):
>
> - add a new EXTENDED state to va_external for MULTIBYTE
> - any string with max encoding length > 1 would be EXTENDED even if it
>   is not EXTERNAL and not COMPRESSED
> - to each of the structs in the union, add va_strlen
> - populate va_strlen on INSERT and maintain it on UPDATE
>
> Now a new function similar to toast_raw_datum_size(), maybe
> toast_raw_datum_strlen(), could be used to get the original string
> length, whether MB or not, without needing to retrieve and decompress
> the entire datum.
>
> I understand we would either have to steal another bit from the
> VARHDR, which would reduce the effective size of a varlena from 1GB
> down to 0.5GB, or we would need to add a byte or two to the VARHDR,
> which is extra per-datum overhead. I'm not sure we would want to do
> either. But I wanted to toss out the idea while it was fresh on my
> mind.

Interesting idea. I also was thinking about adding some extra
information to text data types, such as character set, collation,
etc., for 7.4 or later.
--
Tatsuo Ishii
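[The per-character walk that pg_mbstrlen_with_len() performs can be sketched like this; a hypothetical stand-alone version, using UTF-8 lead-byte patterns as a stand-in for the backend's per-encoding mblen routines:]

```c
#include <assert.h>

/* Byte length of the character starting at s, using UTF-8
 * lead-byte patterns in place of the backend's per-encoding
 * mblen functions. */
static int
mblen_utf8(const unsigned char *s)
{
    if (*s < 0x80)
        return 1;
    if ((*s & 0xE0) == 0xC0)
        return 2;
    if ((*s & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/* Count the characters in the first "limit" bytes of mbstr.
 * With a variable-width encoding there is no shortcut: every
 * character must be visited to find where the next one starts. */
static int
mbstrlen_with_len(const char *mbstr, int limit)
{
    int     chars = 0;

    while (limit > 0)
    {
        int     l = mblen_utf8((const unsigned char *) mbstr);

        mbstr += l;
        limit -= l;
        chars++;
    }
    return chars;
}
```

This is why the character length can cost O(bytes) even when the byte length is available for free from the varlena header.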
Tatsuo Ishii wrote:
>>Now a new function similar to toast_raw_datum_size(), maybe
>>toast_raw_datum_strlen() could be used to get the original string
>>length, whether MB or not, without needing to retrieve and decompress
>>the entire datum.
>>
>>I understand we would either: have to steal another bit from the VARHDR
>>which would reduce the effective size of a valena from 1GB down to .5GB;
>>or we would need to add a byte or two to the VARHDR which is extra
>>per-datum overhead. I'm not sure we would want to do either. But I
>>wanted to toss out the idea while it was fresh on my mind.
>
>
> Interesting idea. I also was thinking about adding some extra
> infomation to text data types such as character set, collation
> etc. for 7.4 or later.
I ran some tests to confirm the theory above regarding overhead;
create table strtest(f1 text);
do 100 times insert into strtest values('12345....'); -- 100000 characters
loop
do 1000 times select length(f1) from strtest;
loop
Results:
SQL_ASCII database, new code:
=============================
PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
==> 2 seconds
SQL_ASCII database, old code:
=============================
text *t = PG_GETARG_TEXT_P(0);
PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
==> 66 seconds
EUC_JP database, new & old code:
================================
text *t = PG_GETARG_TEXT_P(0);
PG_RETURN_INT32(pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
==> 469 seconds
So it appears that, while detoasting is moderately expensive (adds 64
seconds to the test), the call to pg_mbstrlen_with_len() is very
expensive (adds 403 seconds to the test).
Joe