Обсуждение: Re: [GENERAL] workaround for lack of REPLACE() function

Поиск
Список
Период
Сортировка

Re: [GENERAL] workaround for lack of REPLACE() function

От
Thomas Lockhart
Дата:
(crossposted to -hackers, should follow up on that list)

> Well, OVERLAY is defined as:
>      overlay(string placing string from integer [for integer])
> and replace() is defined (by me at least) as:
>      replace(inputstring, old-substr, new-substr)

OK.

> OVERLAY requires that I know the "from" position and possibly the "for"
> in advance. Other functions (such as strpos() and substr()) can be used
> to help...

Right. So you can do your example pretty easily:

thomas=# select overlay(f1 placing '' from position('/local' in f1)
thomas-# for length('/local')) from strtest;
      overlay
--------------------
 /usr/pgsql/data
 /m1/usr/pgsql/data

And if you don't like that much typing you can do:

thomas=# create function replace(text, text, text) returns text as '
thomas'# select overlay($1 placing $3 from position($2 in $1) for
length($2));
thomas'# ' language 'sql';
CREATE FUNCTION
thomas=# select replace(f1, '/local', '') from strtest;
      replace
--------------------
 /usr/pgsql/data
 /m1/usr/pgsql/data

> But now what happens if you wanted to replace all of the '/' characters
> with '\'?...
> You can't do this at all with overlay(), unless you want to write a
> PL/pgSQL function and loop through each string. I started out with
> exactly this, using strpos() and substr(), but I thought a C function
> was cleaner, and it is certainly faster.

OK, this is in the "can't do it what we have" category. Should we have
it accept a regular expression rather than a simple string? In either
case it should probably go into the main distro. Except that I see
"REPLACE" is mentioned as a reserved word in SQL99. But has no other
mention in my copy of the draft standard. Anyone else have an idea what
it might be used for in the standard?

The other functions look useful too, unless to_char() and varbit can be
evolved to support this functionality.

                     - Thomas

Re: [GENERAL] workaround for lack of REPLACE() function

От
Joe Conway
Дата:
Thomas Lockhart wrote:> (crossposted to -hackers, should follow up on that list)

<snip>

> OK, this is in the "can't do it what we have" category. Should we have
> it accept a regular expression rather than a simple string? In either
> case it should probably go into the main distro. Except that I see
> "REPLACE" is mentioned as a reserved word in SQL99. But has no other
> mention in my copy of the draft standard. Anyone else have an idea what
> it might be used for in the standard?

Not sure, but I see what you mean. Perhaps because of Oracle pushing to 
legitimize the "CREATE OR REPLACE" syntax? In any case, this works in 8i:

SQL> select replace('hello','l','x') from dual;

REPLACE('HELLO','L','X')
------------------------
hexxo

and here it is in MSSQL 7:

select replace('hello','l','x')

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


hexxo

(1 row(s) affected)

and my proposed PostgreSQL function:

test=# select replace('hello','l','x'); replace
--------- hexxo
(1 row)

so at least we would be consistant/compatable with these two.


> 
> The other functions look useful too, unless to_char() and varbit can be
> evolved to support this functionality.

I will take a look at merging these into existing functions, but I have 
a few other things ahead of this in my queue.

One of the reasons I wasn't pushing too hard to get replace() into the 
backend is because my current solution is a bit of a hack. It uses the 
builtin length, strpos and substr text functions (which I think makes 
sense since they already know how to deal with mb strings), but because 
they accept and return text, I'm doing lots of conversions back and 
forth from (* text) to (* char). To do this "right" probably means 
reworking the text string manipulation functions to be wrappers around 
some equivalent functions accepting and returning C strings. That was 
more work than I had time for when I wrote the current replace(). But as 
I said, if there is support for getting this into the backend, I'll add 
it to my todo list:

- Create new backend function replace()
- Either create new backend functions, or merge into existing functions: 
to_hex() and extract_tok()

Joe







Re: [GENERAL] workaround for lack of REPLACE() function

От
Joe Conway
Дата:
Joe Conway wrote:
> more work than I had time for when I wrote the current replace(). But as 
> I said, if there is support for getting this into the backend, I'll add 
> it to my todo list:
> 
> - Create new backend function replace()
> - Either create new backend functions, or merge into existing functions: 
> to_hex() and extract_tok()
> 

I'm just starting to take a look at this again. While studying the 
current text_substr() function I found two behaviors which conflict with 
specific SQL92/SQL99 requirements, and one bug. First the spec 
compliance -- SQL92 section 6.7/SQL99 section 6.18 say:

If <character substring function> is specified, then:
a) Let C be the value of the <character value expression>, let LC be the   length of C, and let S be the value of the
<startposition>.
 
b) If <string length> is specified, then let L be the value of <string   length> and let E be S+L. Otherwise, let E be
thelarger of LC + 1   and S.
 
c) If either C, S, or L is the null value, then the result of the   <character substring function> is the null value.
d) If E is less than S, then an exception condition is raised: data   exception-substring error.
e) Case:   i) If S is greater than LC or if E is less than 1, then the result of      the <character substring
function>is a zero-length string.  ii) Otherwise,      1) Let SI be the larger of S and 1. Let El be the smaller of E
and        LC+l. Let Ll be El-Sl.      2) The result of the <character substring function> is a character
stringcontaining the Ll characters of C starting at character         number Sl in the same order that the characters
appearin C.
 

The only way for d) to be true is when L < 0. Instead of an error, we do:
test=# select substr('hello',2,-1); substr
-------- ello
(1 row)

The other spec issue is wrt para e)i). If E (=S+L) < 1, we should return 
a zero-length string. Currently I get:
test=# select substr('hello',-4,3); substr
-------- hello
(1 row)

Neither behavior is documented (unless it's somewhere other than:
http://developer.postgresql.org/docs/postgres/functions-string.html ).

The bug is this one:
test=# create DATABASE testmb with encoding = 'EUC_JP';
CREATE DATABASE
test=# \c testmb
You are now connected to database testmb.
testmb=# select substr('hello',6,2); substr
-------- ~
(1 row)

testmb=# \c test
You are now connected to database test.
test=# select substr('hello',6,2); substr
--------

(1 row)

The multibyte database behavior is the bug. The SQL_ASCII behavior is 
correct (zero-length string):
test=# select substr('hello',6,2) is null; ?column?
---------- f
(1 row)


Any objection if I rework this function to meet SQL92 and fix the bug? 
Or is the SQL92 part not desirable because it breaks backward 
compatability?

In any case, can the #ifdef MULTIBYTE's be removed now in favor of a 
test for encoding max length?

Joe



Re: [GENERAL] workaround for lack of REPLACE() function

От
Tatsuo Ishii
Дата:
> Any objection if I rework this function to meet SQL92 and fix the bug? 

I don't object.

> Or is the SQL92 part not desirable because it breaks backward 
> compatability?

I don't think so.

> In any case, can the #ifdef MULTIBYTE's be removed now in favor of a 
> test for encoding max length?

Sure.
--
Tatsuo Ishii


Re: [GENERAL] workaround for lack of REPLACE() function

От
Joe Conway
Дата:
Tatsuo Ishii wrote:
>>Any objection if I rework this function to meet SQL92 and fix the bug? 
> 

I've started working on text_substr() as described in this thread (which 
is hopefully prep work for the replace() function that started the 
thread). I haven't really looked at toast or multibyte closely before, 
so I'd like to ask a couple of questions to be sure I'm understanding 
the relevant issues correctly.

First, in textlen() I see (ignoring multibyte for a moment):
  text       *t = PG_GETARG_TEXT_P(0);  PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);

Tom has pointed out to me before that PG_GETARG_TEXT_P(n) incurs the 
overhead of retrieving and possibly decompressing a toasted datum. So my 
first question is, can we simply do:  PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
and save the overhead of retrieving and decompressing the whole datum?

Now, in the multibyte case, again in textlen(), I see:
  /* optimization for single byte encoding */  if (pg_database_encoding_max_length() <= 1)
PG_RETURN_INT32(VARSIZE(t)- VARHDRSZ);
 
  PG_RETURN_INT32(      pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));

Three questions here.
1) In the case of encoding max length == 1, can we treat it the same as 
the non-multibyte case (I presume they are exactly the same)?

2) Can encoding max length ever be < 1? Doesn't make sense to me.

3) In the case of encoding max length > 1, if I understand correctly, 
each encoded character can be one *or more* bytes, up to and encluding 
encoding max length bytes. So the *only* way presently to get the length 
of the original character string is to loop through the entire string 
checking the length of each individual character (that's what 
pg_mbstrlen_with_len() does it seems)?

Finally, if 3) is true, then there is no way to avoid the retrieval and 
decompression of the datum just to find out its length. For large 
datums, detoasting plus the looping through each character would add a 
huge amount of overhead just to get at the length of the original 
string. I don't know if we need to be able to get *just* the length 
often enough to really care, but if we do, I had an idea for some future 
release (I wouldn't propose doing this for 7.3):

- add a new EXTENDED state to va_external for MULTIBYTE
- any string with max encoding length > 1 would be EXTENDED even if it  is not EXTERNAL and not COMPRESSED.
- to each of the structs in the union, add va_strlen
- populate va_strlen on INSERT and maintain it on UPDATE.

Now a new function similar to toast_raw_datum_size(), maybe 
toast_raw_datum_strlen() could be used to get the original string 
length, whether MB or not, without needing to retrieve and decompress 
the entire datum.

I understand we would either: have to steal another bit from the VARHDR 
which would reduce the effective size of a valena from 1GB down to .5GB; 
or we would need to add a byte or two to the VARHDR which is extra 
per-datum overhead. I'm not sure we would want to do either. But I 
wanted to toss out the idea while it was fresh on my mind.

Thanks,

Joe




Re: [GENERAL] workaround for lack of REPLACE() function

От
Joe Conway
Дата:
Tatsuo Ishii wrote:
>>Any objection if I rework this function to meet SQL92 and fix the bug? 
> I don't object.

One more question on this: how can I generate some characters with > 1 
encoding length? I need a way to test the work I'm doing, and I'm not 
quite sure how to test it.

Just making a database that uses a MB encoding doesn't make 0-9/A-Z/a-z 
into characters of > 1 byte, does it? Sorry, but never having used MB 
encoding has left me a bit clueless on this ;-)

Joe



Re: [GENERAL] workaround for lack of REPLACE() function

От
Tatsuo Ishii
Дата:
> Now, in the multibyte case, again in textlen(), I see:
> 
>    /* optimization for single byte encoding */
>    if (pg_database_encoding_max_length() <= 1)
>      PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
> 
>    PG_RETURN_INT32(
>        pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
> 
> Three questions here.
> 1) In the case of encoding max length == 1, can we treat it the same as 
> the non-multibyte case (I presume they are exactly the same)?

Yes.

> 2) Can encoding max length ever be < 1? Doesn't make sense to me.

No. It seems just a defensive coding.

> 3) In the case of encoding max length > 1, if I understand correctly, 
> each encoded character can be one *or more* bytes, up to and encluding 
> encoding max length bytes.

Right.

> So the *only* way presently to get the length 
> of the original character string is to loop through the entire string 
> checking the length of each individual character (that's what 
> pg_mbstrlen_with_len() does it seems)?

Yes.

> Finally, if 3) is true, then there is no way to avoid the retrieval and 
> decompression of the datum just to find out its length. For large 
> datums, detoasting plus the looping through each character would add a 
> huge amount of overhead just to get at the length of the original 
> string. I don't know if we need to be able to get *just* the length 
> often enough to really care, but if we do, I had an idea for some future 
> release (I wouldn't propose doing this for 7.3):
> 
> - add a new EXTENDED state to va_external for MULTIBYTE
> - any string with max encoding length > 1 would be EXTENDED even if it
>    is not EXTERNAL and not COMPRESSED.
> - to each of the structs in the union, add va_strlen
> - populate va_strlen on INSERT and maintain it on UPDATE.
> 
> Now a new function similar to toast_raw_datum_size(), maybe 
> toast_raw_datum_strlen() could be used to get the original string 
> length, whether MB or not, without needing to retrieve and decompress 
> the entire datum.
> 
> I understand we would either: have to steal another bit from the VARHDR 
> which would reduce the effective size of a valena from 1GB down to .5GB; 
> or we would need to add a byte or two to the VARHDR which is extra 
> per-datum overhead. I'm not sure we would want to do either. But I 
> wanted to toss out the idea while it was fresh on my mind.

Interesting idea. I also was thinking about adding some extra
infomation to text data types such as character set, collation
etc. for 7.4 or later.
--
Tatsuo Ishii


Re: [GENERAL] workaround for lack of REPLACE() function

От
Joe Conway
Дата:
Tatsuo Ishii wrote:
>>Now a new function similar to toast_raw_datum_size(), maybe 
>>toast_raw_datum_strlen() could be used to get the original string 
>>length, whether MB or not, without needing to retrieve and decompress 
>>the entire datum.
>>
>>I understand we would either: have to steal another bit from the VARHDR 
>>which would reduce the effective size of a valena from 1GB down to .5GB; 
>>or we would need to add a byte or two to the VARHDR which is extra 
>>per-datum overhead. I'm not sure we would want to do either. But I 
>>wanted to toss out the idea while it was fresh on my mind.
> 
> 
> Interesting idea. I also was thinking about adding some extra
> infomation to text data types such as character set, collation
> etc. for 7.4 or later.

I ran some tests to confirm the theory above regarding overhead;

create table strtest(f1 text);
do 100 times   insert into strtest values('12345....');  -- 100000 characters
loop
do 1000 times   select length(f1) from strtest;
loop

Results:

SQL_ASCII database, new code:
=============================
PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
==> 2 seconds

SQL_ASCII database, old code:
=============================
text    *t = PG_GETARG_TEXT_P(0);
PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
==> 66 seconds

EUC_JP database, new & old code:
================================
text    *t = PG_GETARG_TEXT_P(0);
PG_RETURN_INT32(pg_mbstrlen_with_len(VARDATA(t),                        VARSIZE(t) - VARHDRSZ));
==> 469 seconds

So it appears that, while detoasting is moderately expensive (adds 64 
seconds to the test), the call to pg_mbstrlen_with_len() is very 
expensive (adds 403 seconds to the test).

Joe