Discussion: Re: [GENERAL] workaround for lack of REPLACE() function
(crossposted to -hackers, should follow up on that list)

> Well, OVERLAY is defined as:
>   overlay(string placing string from integer [for integer])
> and replace() is defined (by me at least) as:
>   replace(inputstring, old-substr, new-substr)

OK.

> OVERLAY requires that I know the "from" position and possibly the "for"
> in advance. Other functions (such as strpos() and substr()) can be used
> to help...

Right. So you can do your example pretty easily:

thomas=# select overlay(f1 placing '' from position('/local' in f1)
thomas-#                for length('/local')) from strtest;
      overlay
--------------------
 /usr/pgsql/data
 /m1/usr/pgsql/data

And if you don't like that much typing you can do:

thomas=# create function replace(text, text, text) returns text as '
thomas'#   select overlay($1 placing $3 from position($2 in $1) for length($2));
thomas'# ' language 'sql';
CREATE FUNCTION
thomas=# select replace(f1, '/local', '') from strtest;
      replace
--------------------
 /usr/pgsql/data
 /m1/usr/pgsql/data

> But now what happens if you wanted to replace all of the '/' characters
> with '\'?...
> You can't do this at all with overlay(), unless you want to write a
> PL/pgSQL function and loop through each string. I started out with
> exactly this, using strpos() and substr(), but I thought a C function
> was cleaner, and it is certainly faster.

OK, this is in the "can't do it with what we have" category. Should we
have it accept a regular expression rather than a simple string? In
either case it should probably go into the main distro. Except that I
see "REPLACE" is mentioned as a reserved word in SQL99, but it has no
other mention in my copy of the draft standard. Anyone else have an idea
what it might be used for in the standard?

The other functions look useful too, unless to_char() and varbit can be
evolved to support this functionality.

- Thomas
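The replace-every-occurrence behavior that overlay() cannot express on
its own can be sketched in plain C roughly as follows. This is an
illustrative standalone function (the name replace_all and the two-pass
strategy are mine, not the actual patch), working on NUL-terminated
byte strings; the real backend version must also handle the text
varlena format and multibyte encodings:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Replace every occurrence of "oldstr" in "src" with "newstr",
 * returning a freshly malloc'd string (or NULL on allocation failure).
 * A simplified, byte-oriented sketch of the idea behind replace().
 */
char *
replace_all(const char *src, const char *oldstr, const char *newstr)
{
    size_t      oldlen = strlen(oldstr);
    size_t      newlen = strlen(newstr);
    size_t      count = 0;
    const char *p;
    char       *result;
    char       *dst;

    if (oldlen == 0)
        return strdup(src);     /* nothing to search for */

    /* first pass: count occurrences so the result can be sized exactly */
    for (p = src; (p = strstr(p, oldstr)) != NULL; p += oldlen)
        count++;

    result = malloc(strlen(src) - count * oldlen + count * newlen + 1);
    if (result == NULL)
        return NULL;

    /* second pass: copy, substituting each occurrence as it is found */
    dst = result;
    while ((p = strstr(src, oldstr)) != NULL)
    {
        memcpy(dst, src, (size_t) (p - src));
        dst += p - src;
        memcpy(dst, newstr, newlen);
        dst += newlen;
        src = p + oldlen;
    }
    strcpy(dst, src);           /* copy the tail after the last match */
    return result;
}
```

Doing the counting pass first avoids repeated reallocation in the copy
loop, which is part of why a C implementation beats a strpos()/substr()
loop in PL/pgSQL.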
Thomas Lockhart wrote:
> (crossposted to -hackers, should follow up on that list)
<snip>
> OK, this is in the "can't do it with what we have" category. Should we
> have it accept a regular expression rather than a simple string? In
> either case it should probably go into the main distro. Except that I
> see "REPLACE" is mentioned as a reserved word in SQL99, but it has no
> other mention in my copy of the draft standard. Anyone else have an
> idea what it might be used for in the standard?

Not sure, but I see what you mean. Perhaps because of Oracle pushing to
legitimize the "CREATE OR REPLACE" syntax? In any case, this works in 8i:

SQL> select replace('hello','l','x') from dual;

REPLACE('HELLO','L','X')
------------------------
hexxo

and here it is in MSSQL 7:

select replace('hello','l','x')

-------
hexxo

(1 row(s) affected)

and my proposed PostgreSQL function:

test=# select replace('hello','l','x');
 replace
---------
 hexxo
(1 row)

so at least we would be consistent/compatible with these two.

> The other functions look useful too, unless to_char() and varbit can
> be evolved to support this functionality.

I will take a look at merging these into existing functions, but I have
a few other things ahead of this in my queue.

One of the reasons I wasn't pushing too hard to get replace() into the
backend is that my current solution is a bit of a hack. It uses the
builtin length, strpos, and substr text functions (which I think makes
sense since they already know how to deal with MB strings), but because
they accept and return text, I'm doing lots of conversions back and
forth from (text *) to (char *).
To do this "right" probably means reworking the text string manipulation
functions to be wrappers around some equivalent functions accepting and
returning C strings. That was more work than I had time for when I wrote
the current replace(). But as I said, if there is support for getting
this into the backend, I'll add it to my todo list:

- Create new backend function replace()
- Either create new backend functions, or merge into existing functions:
  to_hex() and extract_tok()

Joe
Joe Conway wrote:
> more work than I had time for when I wrote the current replace(). But
> as I said, if there is support for getting this into the backend, I'll
> add it to my todo list:
>
> - Create new backend function replace()
> - Either create new backend functions, or merge into existing
>   functions: to_hex() and extract_tok()

I'm just starting to take a look at this again. While studying the
current text_substr() function I found two behaviors which conflict with
specific SQL92/SQL99 requirements, and one bug.

First the spec compliance -- SQL92 section 6.7/SQL99 section 6.18 say:

  If <character substring function> is specified, then:
  a) Let C be the value of the <character value expression>, let LC be
     the length of C, and let S be the value of the <start position>.
  b) If <string length> is specified, then let L be the value of
     <string length> and let E be S+L. Otherwise, let E be the larger
     of LC + 1 and S.
  c) If either C, S, or L is the null value, then the result of the
     <character substring function> is the null value.
  d) If E is less than S, then an exception condition is raised: data
     exception - substring error.
  e) Case:
     i) If S is greater than LC or if E is less than 1, then the result
        of the <character substring function> is a zero-length string.
     ii) Otherwise,
        1) Let S1 be the larger of S and 1. Let E1 be the smaller of E
           and LC+1. Let L1 be E1-S1.
        2) The result of the <character substring function> is a
           character string containing the L1 characters of C starting
           at character number S1, in the same order that the
           characters appear in C.

The only way for d) to be true is when L < 0. Instead of an error, we do:

test=# select substr('hello',2,-1);
 substr
--------
 ello
(1 row)

The other spec issue is wrt para e)i). If E (=S+L) < 1, we should return
a zero-length string.
Currently I get:

test=# select substr('hello',-4,3);
 substr
--------
 hello
(1 row)

Neither behavior is documented (unless it's somewhere other than:
http://developer.postgresql.org/docs/postgres/functions-string.html).

The bug is this one:

test=# create DATABASE testmb with encoding = 'EUC_JP';
CREATE DATABASE
test=# \c testmb
You are now connected to database testmb.
testmb=# select substr('hello',6,2);
 substr
--------
 ~
(1 row)

testmb=# \c test
You are now connected to database test.
test=# select substr('hello',6,2);
 substr
--------

(1 row)

The multibyte database behavior is the bug. The SQL_ASCII behavior is
correct (zero-length string):

test=# select substr('hello',6,2) is null;
 ?column?
----------
 f
(1 row)

Any objection if I rework this function to meet SQL92 and fix the bug?
Or is the SQL92 part not desirable because it breaks backward
compatibility? In any case, can the #ifdef MULTIBYTE's be removed now in
favor of a test for encoding max length?

Joe
> Any objection if I rework this function to meet SQL92 and fix the bug?

I don't object.

> Or is the SQL92 part not desirable because it breaks backward
> compatibility?

I don't think so.

> In any case, can the #ifdef MULTIBYTE's be removed now in favor of a
> test for encoding max length?

Sure.
--
Tatsuo Ishii
Tatsuo Ishii wrote:
>> Any objection if I rework this function to meet SQL92 and fix the bug?

I've started working on text_substr() as described in this thread (which
is hopefully prep work for the replace() function that started the
thread). I haven't really looked at toast or multibyte closely before,
so I'd like to ask a couple of questions to be sure I'm understanding
the relevant issues correctly.

First, in textlen() I see (ignoring multibyte for a moment):

    text    *t = PG_GETARG_TEXT_P(0);

    PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);

Tom has pointed out to me before that PG_GETARG_TEXT_P(n) incurs the
overhead of retrieving and possibly decompressing a toasted datum. So my
first question is, can we simply do:

    PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);

and save the overhead of retrieving and decompressing the whole datum?

Now, in the multibyte case, again in textlen(), I see:

    /* optimization for single byte encoding */
    if (pg_database_encoding_max_length() <= 1)
        PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);

    PG_RETURN_INT32(
        pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));

Three questions here.
1) In the case of encoding max length == 1, can we treat it the same as
   the non-multibyte case (I presume they are exactly the same)?
2) Can encoding max length ever be < 1? Doesn't make sense to me.
3) In the case of encoding max length > 1, if I understand correctly,
   each encoded character can be one *or more* bytes, up to and
   including encoding max length bytes. So the *only* way presently to
   get the length of the original character string is to loop through
   the entire string, checking the length of each individual character
   (that's what pg_mbstrlen_with_len() does, it seems)?

Finally, if 3) is true, then there is no way to avoid the retrieval and
decompression of the datum just to find out its length.
For large datums, detoasting plus the looping through each character
would add a huge amount of overhead just to get at the length of the
original string. I don't know if we need to be able to get *just* the
length often enough to really care, but if we do, I had an idea for some
future release (I wouldn't propose doing this for 7.3):

- add a new EXTENDED state to va_external for MULTIBYTE
- any string with max encoding length > 1 would be EXTENDED even if it
  is not EXTERNAL and not COMPRESSED
- to each of the structs in the union, add va_strlen
- populate va_strlen on INSERT and maintain it on UPDATE

Now a new function similar to toast_raw_datum_size(), maybe
toast_raw_datum_strlen(), could be used to get the original string
length, whether MB or not, without needing to retrieve and decompress
the entire datum.

I understand we would either have to steal another bit from the VARHDR,
which would reduce the effective size of a varlena from 1GB down to
0.5GB, or we would need to add a byte or two to the VARHDR, which is
extra per-datum overhead. I'm not sure we would want to do either, but I
wanted to toss out the idea while it was fresh in my mind.

Thanks,

Joe
Tatsuo Ishii wrote:
>> Any objection if I rework this function to meet SQL92 and fix the bug?
>
> I don't object.

One more question on this: how can I generate some characters with > 1
encoding length? I need a way to test the work I'm doing, and I'm not
quite sure how to test it. Just making a database that uses an MB
encoding doesn't make 0-9/A-Z/a-z into characters of > 1 byte, does it?
Sorry, but never having used MB encoding has left me a bit clueless on
this ;-)

Joe
> Now, in the multibyte case, again in textlen(), I see:
>
>     /* optimization for single byte encoding */
>     if (pg_database_encoding_max_length() <= 1)
>         PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
>
>     PG_RETURN_INT32(
>         pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
>
> Three questions here.
> 1) In the case of encoding max length == 1, can we treat it the same
>    as the non-multibyte case (I presume they are exactly the same)?

Yes.

> 2) Can encoding max length ever be < 1? Doesn't make sense to me.

No. It seems just a defensive coding.

> 3) In the case of encoding max length > 1, if I understand correctly,
>    each encoded character can be one *or more* bytes, up to and
>    including encoding max length bytes.

Right.

> So the *only* way presently to get the length of the original
> character string is to loop through the entire string checking the
> length of each individual character (that's what
> pg_mbstrlen_with_len() does, it seems)?

Yes.

> Finally, if 3) is true, then there is no way to avoid the retrieval
> and decompression of the datum just to find out its length. For large
> datums, detoasting plus the looping through each character would add
> a huge amount of overhead just to get at the length of the original
> string. I don't know if we need to be able to get *just* the length
> often enough to really care, but if we do, I had an idea for some
> future release (I wouldn't propose doing this for 7.3):
>
> - add a new EXTENDED state to va_external for MULTIBYTE
> - any string with max encoding length > 1 would be EXTENDED even if
>   it is not EXTERNAL and not COMPRESSED
> - to each of the structs in the union, add va_strlen
> - populate va_strlen on INSERT and maintain it on UPDATE
>
> Now a new function similar to toast_raw_datum_size(), maybe
> toast_raw_datum_strlen(), could be used to get the original string
> length, whether MB or not, without needing to retrieve and decompress
> the entire datum.
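The per-character loop confirmed above can be made concrete for one
specific variable-width encoding. The sketch below uses UTF-8 purely as
an example (pg_mbstrlen_with_len() itself dispatches on the database
encoding and handles EUC_JP etc.); it assumes well-formed input and, in
the same spirit as the real function, steps forward one whole character
at a time:

```c
/*
 * Count characters in a UTF-8 byte string of known byte length, in the
 * spirit of pg_mbstrlen_with_len(): advance by the width of each lead
 * byte.  Assumes valid UTF-8; lead-byte patterns determine the width
 * (0xxxxxxx=1, 110xxxxx=2, 1110xxxx=3, 11110xxx=4 bytes).
 */
int
utf8_strlen_with_len(const char *s, int byte_len)
{
    int     chars = 0;
    int     i = 0;

    while (i < byte_len)
    {
        unsigned char b = (unsigned char) s[i];

        if (b < 0x80)
            i += 1;             /* ASCII: one byte */
        else if (b < 0xE0)
            i += 2;             /* two-byte character */
        else if (b < 0xF0)
            i += 3;             /* three-byte character */
        else
            i += 4;             /* four-byte character */
        chars++;
    }
    return chars;
}
```

Because the width is only known from each lead byte, there is no way to
jump to the answer; the whole (detoasted) string must be walked, which
is exactly the overhead discussed above.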
>> I understand we would either have to steal another bit from the
>> VARHDR, which would reduce the effective size of a varlena from 1GB
>> down to 0.5GB, or we would need to add a byte or two to the VARHDR,
>> which is extra per-datum overhead. I'm not sure we would want to do
>> either, but I wanted to toss out the idea while it was fresh in my
>> mind.

Interesting idea. I also was thinking about adding some extra
information to text data types, such as character set, collation, etc.,
for 7.4 or later.
--
Tatsuo Ishii
Tatsuo Ishii wrote:
>> Now a new function similar to toast_raw_datum_size(), maybe
>> toast_raw_datum_strlen(), could be used to get the original string
>> length, whether MB or not, without needing to retrieve and decompress
>> the entire datum.
>>
>> I understand we would either have to steal another bit from the
>> VARHDR, which would reduce the effective size of a varlena from 1GB
>> down to 0.5GB, or we would need to add a byte or two to the VARHDR,
>> which is extra per-datum overhead. I'm not sure we would want to do
>> either, but I wanted to toss out the idea while it was fresh in my
>> mind.
>
> Interesting idea. I also was thinking about adding some extra
> information to text data types, such as character set, collation,
> etc., for 7.4 or later.

I ran some tests to confirm the theory above regarding overhead:

    create table strtest(f1 text);
    do 100 times
        insert into strtest values('12345....'); -- 100000 characters
    loop
    do 1000 times
        select length(f1) from strtest;
    loop

Results:

SQL_ASCII database, new code:
=============================
    PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
==> 2 seconds

SQL_ASCII database, old code:
=============================
    text    *t = PG_GETARG_TEXT_P(0);
    PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
==> 66 seconds

EUC_JP database, new & old code:
================================
    text    *t = PG_GETARG_TEXT_P(0);
    PG_RETURN_INT32(pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
==> 469 seconds

So it appears that, while detoasting is moderately expensive (adds 64
seconds to the test), the call to pg_mbstrlen_with_len() is very
expensive (adds 403 seconds to the test).

Joe