Обсуждение: split string by special characters

Поиск
Список
Период
Сортировка

split string by special characters

От
Jan-Erik
Дата:
I wonder if you could please help me out to extract a character string
to an array or better yet, a table.

I'd like to split strings of text up into words and delimiters (but
not delete the delimiters). The delimiters are defined as comma,
space, dot, singe/double quotation mark, question mark etc.¹ in a
separate table (delimiters) depending on what rules apply for the
input.

regexp_split_to_array/table seem quite suitable but I have
difficulties to form the right expression with it, apart from that it
remove the delimiters as well.

Example:
This is just a text that contain special characters such as , (comma),
"(", ")" (left and right parenthesis) as well as "?" question mark.
How do I split it up with PostgreSQL?

Expected result:
{This, " ", is, " ", just, " ", a, ..., PostgreSQL, "?" }
__________________
¹)  Also later on tags such as <html> and at other times something
else depending on the circumstances.

//Jan-Erik

Re: split string by special characters

От
Andreas Wenk
Дата:
Jan-Erik wrote:
> I wonder if you could please help me out to extract a character string
> to an array or better yet, a table.
>
> I'd like to split strings of text up into words and delimiters (but
> not delete the delimiters). The delimiters are defined as comma,
> space, dot, singe/double quotation mark, question mark etc.¹ in a
> separate table (delimiters) depending on what rules apply for the
> input.
>
> regexp_split_to_array/table seem quite suitable but I have
> difficulties to form the right expression with it, apart from that it
> remove the delimiters as well.
>
> Example:
> This is just a text that contain special characters such as , (comma),
> "(", ")" (left and right parenthesis) as well as "?" question mark.
> How do I split it up with PostgreSQL?
>
> Expected result:
> {This, " ", is, " ", just, " ", a, ..., PostgreSQL, "?" }
> __________________
> ¹)  Also later on tags such as <html> and at other times something
> else depending on the circumstances.
>
> //Jan-Erik
>
Hi,

I was thinking about that and in my opinion the approach to let the
database do that is the wrong direction. Sure you can do a lot with
regexp_split_to_table or regexp_split_to_array but they are kind of
limited compared to a programming language using regular expressions. If
I had to try to get your jobdone, I would try regexp_matches() like:

SELECT regexp_matches('This is just a text, that contain special
characters such as, (comma),"(", ")" (left and right parenthesis) as
well as "?" question, mark.How do I split it up with PostgreSQL?',
E'(\\w*.)\\s+','g');

regexp_matches
----------------
  {This}
  {is}
  {just}
  {a}
  {"text,"}
  {that}
  {contain}
  {special}
  {characters}
  {such}
  {"as,"}
  {","}
  {"\""}
  {left}
  {and}
  {right}
  {parenthesis)}
  {as}
  {well}
  {as}
  {"\""}
  {"question,"}
  {How}
  {do}
  {I}
  {split}
  {it}
  {up}
  {with}
(29 rows)

So, you have the ability to catch the seperators like ','. But for now,
teh example just catches the comma. But you want to catch a lot of other
seperators as well. I suggest you do that within the logic of your
coding language because I don't think this will be an easy way to walk
;-). This is no database job in my opinion.

Cheers

Andy




Re: split string by special characters

От
Jan-Erik
Дата:
On 24 Juli, 23:22, a.w...@netzmeister-st-pauli.de (Andreas Wenk)
wrote:
>
> Hi,
>
> I was thinking about that and in my opinion the approach to let the
> database do that is the wrong direction. Sure you can do a lot with
> regexp_split_to_table or regexp_split_to_array but they are kind of

Yes, I see. You're quite right, the split was intended to do give me
everything in processed chunks it in some easy way as the last part of
the interpretation of the text.

> limited compared to a programming language using regular expressions. If
> I had to try to get your jobdone, I would try regexp_matches() like:
>
> SELECT regexp_matches('This is just a text, that contain special
> characters such as, (comma),"(", ")" (left and right parenthesis) as
> well as "?" question, mark.How do I split it up with PostgreSQL?',
> E'(\\w*.)\\s+','g');
>
> regexp_matches
> ----------------
>   {This}
>   {is}
>   {just}
>   {a}
>   {"text,"}
>   {that}
>   {contain}
>   {special}
>   {characters}
>   {such}
>   {"as,"}
>   {","}
>   {"\""}
>   {left}
>   {and}
>   {right}
>   {parenthesis)}
>   {as}
>   {well}
>   {as}
>   {"\""}
>   {"question,"}
>   {How}
>   {do}
>   {I}
>   {split}
>   {it}
>   {up}
>   {with}
> (29 rows)
>
> So, you have the ability to catch the seperators like ','. But for now,
> teh example just catches the comma. But you want to catch a lot of other

Yes, but then I ran into the problems with separators that regexp
consider as part of the expression and how to dynamically build the
right expression in some unified way for each language.

> seperators as well. I suggest you do that within the logic of your
> coding language because I don't think this will be an easy way to walk

Guess you're right, because I didn't know how to handle it with the
regexp-approach.
I sat down yesterday and wrote a function that does the job for me in
PL/pgSQL, I'm not quite finished, but can see the light at the end of
the tunnel.
The basic approach I'm working with now is to let it find the position
of each delimiter combination within the text, then sort the resulting
array to get it ordered and extract each part.
It won't be fast as lightning, but sufficient for now and as it seem,
allow me to parse text from various files written in different
languages (e.g. programming) just by specifying the delimiters.

> ;-). This is no database job in my opinion.

I didn't intend to try it either before I spotted some of those
functions... :-)
Then figured it would be nice to do it within the db-engine as all the
data is present there.
I wrote code outside the db-engine some time ago, but then other
aspects made it less desirable to use.
>
> Cheers
>
> Andy
>

Thank you Andy for the code example and your advice.
I really appreciate that you took your time to show me how and explain
why.

//Jan-Erik

Re: split string by special characters

От
Andreas Wenk
Дата:
Jan-Erik schrieb:
> On 24 Juli, 23:22, a.w...@netzmeister-st-pauli.de (Andreas Wenk)
> wrote:
>> Hi,
>>
>> I was thinking about that and in my opinion the approach to let the
>> database do that is the wrong direction. Sure you can do a lot with
>> regexp_split_to_table or regexp_split_to_array but they are kind of
>
> Yes, I see. You're quite right, the split was intended to do give me
> everything in processed chunks it in some easy way as the last part of
> the interpretation of the text.
>
>> limited compared to a programming language using regular expressions. If
>> I had to try to get your jobdone, I would try regexp_matches() like:
>>
>> SELECT regexp_matches('This is just a text, that contain special
>> characters such as, (comma),"(", ")" (left and right parenthesis) as
>> well as "?" question, mark.How do I split it up with PostgreSQL?',
>> E'(\\w*.)\\s+','g');
>>
>> regexp_matches
>> ----------------
>>   {This}
>>   {is}
>>   {just}
>>   {a}
>>   {"text,"}
>>   {that}
>>   {contain}
>>   {special}
>>   {characters}
>>   {such}
>>   {"as,"}
>>   {","}
>>   {"\""}
>>   {left}
>>   {and}
>>   {right}
>>   {parenthesis)}
>>   {as}
>>   {well}
>>   {as}
>>   {"\""}
>>   {"question,"}
>>   {How}
>>   {do}
>>   {I}
>>   {split}
>>   {it}
>>   {up}
>>   {with}
>> (29 rows)
>>
>> So, you have the ability to catch the seperators like ','. But for now,
>> teh example just catches the comma. But you want to catch a lot of other
>
> Yes, but then I ran into the problems with separators that regexp
> consider as part of the expression and how to dynamically build the
> right expression in some unified way for each language.
>
>> seperators as well. I suggest you do that within the logic of your
>> coding language because I don't think this will be an easy way to walk
>
> Guess you're right, because I didn't know how to handle it with the
> regexp-approach.
> I sat down yesterday and wrote a function that does the job for me in
> PL/pgSQL, I'm not quite finished, but can see the light at the end of
> the tunnel.

just in case you are running into a black performance hole - you could
try to write it in C as a user defined function. Actually for me it
would be a real big challenge ;-)

> The basic approach I'm working with now is to let it find the position
> of each delimiter combination within the text, then sort the resulting
> array to get it ordered and extract each part.
> It won't be fast as lightning, but sufficient for now and as it seem,
> allow me to parse text from various files written in different
> languages (e.g. programming) just by specifying the delimiters.
>
>> ;-). This is no database job in my opinion.
>
> I didn't intend to try it either before I spotted some of those
> functions... :-)
> Then figured it would be nice to do it within the db-engine as all the
> data is present there.
> I wrote code outside the db-engine some time ago, but then other
> aspects made it less desirable to use.
>> Cheers
>>
>> Andy
>>
>
> Thank you Andy for the code example and your advice.
> I really appreciate that you took your time to show me how and explain
> why.
 > //Jan-Erik

hey you're welcome. When you're done it would be really great to see the
resulting function ;-)

Cheers

Andy



Re: split string by special characters

От
Jan-Erik
Дата:
Current design of function
---text_to_array( sen text, col collection )---
DECLARE
  delimiter_pos int[] := '{}';
  return_val text[] := '{}';
  delimiters_chr RECORD;
  remaining_text text;
  delimiter_curr text;
  i int := 0;
  j int := 0;
  prev_pos int := 1;
  curr_pos int := 0;
  delimiter_len int;
BEGIN
  FOR delimiters_chr IN SELECT upper( chars.chars ) AS chars FROM
chars, word WHERE chars.chars_id = word.chars_id AND
word.classification_id = clss.classification_id LOOP
    delimiter_len := character_length( delimiters_chr.chars ); --
Length of the current delimiter we're looking at
    IF prev_pos > 0 THEN -- Don't add text if it hasn't been altered
(previous delimiter not found)
        remaining_text := upper( sen ); -- Temporary copy of text to
search through
    END IF;
    prev_pos := 0;
    curr_pos := position( delimiters_chr.chars IN sen ); -- Delimiter
position in text (sentence)

    WHILE curr_pos > 0 LOOP -- Found delimiter in text
      delimiter_pos := array_append( delimiter_pos, curr_pos +
prev_pos ); -- Add position of delimiter
      delimiter_pos := array_append( delimiter_pos, curr_pos +
prev_pos + delimiter_len ); -- Add position where delimiter ends
      remaining_text := substring( remaining_text FROM curr_pos +
delimiter_len ); -- Look only at text still
      prev_pos := curr_pos + prev_pos; -- Need to keep track of how
much text has been stripped off
      curr_pos := position( delimiters_chr.chars IN remaining_text );
-- Calculate the next occurance
    END LOOP;
  END LOOP;
  remaining_text := sen; -- Temporary copy of text to extract text
from
  SELECT DISTINCT * INTO delimiter_pos FROM array_append( array_prepend
( 1, sort( delimiter_pos ) ), character_length( sen ) + 1 ); -- Add
first character pos and last to the sorted array.
  curr_pos := array_upper( delimiter_pos, 1 ); -- Calculate the size
of the array
  WHILE j < curr_pos LOOP -- Look through the array
      j := i + 1;
      return_val := array_append( return_val, substring
( remaining_text FROM delimiter_pos[i] FOR (delimiter_pos[j] -
delimiter_pos[i]) ) ); -- Add the parts to a new array with text
      i := i + 1;
  END LOOP;
  RETURN return_val; -- Return the result
END;
----
Need an additional function as well, to sort the array.
The delimiters has to be placed in another table.

Please note that it can't handle variations to a delimiter such as
<table style="..."> due to the present design as I couldn't use
regexp. <table> (case doesn't matter) without any other extras, work
fine though. I initially hoped it would be possible to allow matchings
such as <td(.*)> as well as pairs <td>(.*)</td> and variations
thereof, such as <td(.*)>(.*)</td>, but that's something I need a bit
of help with (hint, hint).

Written to run on eComStation v2.0 (successor of OS/2)
//Jan-Erik