Обсуждение: split string by special characters
I wonder if you could please help me out to extract a character string to an array or better yet, a table. I'd like to split strings of text up into words and delimiters (but not delete the delimiters). The delimiters are defined as comma, space, dot, singe/double quotation mark, question mark etc.¹ in a separate table (delimiters) depending on what rules apply for the input. regexp_split_to_array/table seem quite suitable but I have difficulties to form the right expression with it, apart from that it remove the delimiters as well. Example: This is just a text that contain special characters such as , (comma), "(", ")" (left and right parenthesis) as well as "?" question mark. How do I split it up with PostgreSQL? Expected result: {This, " ", is, " ", just, " ", a, ..., PostgreSQL, "?" } __________________ ¹) Also later on tags such as <html> and at other times something else depending on the circumstances. //Jan-Erik
Jan-Erik wrote: > I wonder if you could please help me out to extract a character string > to an array or better yet, a table. > > I'd like to split strings of text up into words and delimiters (but > not delete the delimiters). The delimiters are defined as comma, > space, dot, singe/double quotation mark, question mark etc.¹ in a > separate table (delimiters) depending on what rules apply for the > input. > > regexp_split_to_array/table seem quite suitable but I have > difficulties to form the right expression with it, apart from that it > remove the delimiters as well. > > Example: > This is just a text that contain special characters such as , (comma), > "(", ")" (left and right parenthesis) as well as "?" question mark. > How do I split it up with PostgreSQL? > > Expected result: > {This, " ", is, " ", just, " ", a, ..., PostgreSQL, "?" } > __________________ > ¹) Also later on tags such as <html> and at other times something > else depending on the circumstances. > > //Jan-Erik > Hi, I was thinking about that and in my opinion the approach to let the database do that is the wrong direction. Sure you can do a lot with regexp_split_to_table or regexp_split_to_array but they are kind of limited compared to a programming language using regular expressions. If I had to try to get your jobdone, I would try regexp_matches() like: SELECT regexp_matches('This is just a text, that contain special characters such as, (comma),"(", ")" (left and right parenthesis) as well as "?" question, mark.How do I split it up with PostgreSQL?', E'(\\w*.)\\s+','g'); regexp_matches ---------------- {This} {is} {just} {a} {"text,"} {that} {contain} {special} {characters} {such} {"as,"} {","} {"\""} {left} {and} {right} {parenthesis)} {as} {well} {as} {"\""} {"question,"} {How} {do} {I} {split} {it} {up} {with} (29 rows) So, you have the ability to catch the seperators like ','. But for now, teh example just catches the comma. But you want to catch a lot of other seperators as well. I suggest you do that within the logic of your coding language because I don't think this will be an easy way to walk ;-). This is no database job in my opinion. Cheers Andy
On 24 Juli, 23:22, a.w...@netzmeister-st-pauli.de (Andreas Wenk) wrote: > > Hi, > > I was thinking about that and in my opinion the approach to let the > database do that is the wrong direction. Sure you can do a lot with > regexp_split_to_table or regexp_split_to_array but they are kind of Yes, I see. You're quite right, the split was intended to do give me everything in processed chunks it in some easy way as the last part of the interpretation of the text. > limited compared to a programming language using regular expressions. If > I had to try to get your jobdone, I would try regexp_matches() like: > > SELECT regexp_matches('This is just a text, that contain special > characters such as, (comma),"(", ")" (left and right parenthesis) as > well as "?" question, mark.How do I split it up with PostgreSQL?', > E'(\\w*.)\\s+','g'); > > regexp_matches > ---------------- > {This} > {is} > {just} > {a} > {"text,"} > {that} > {contain} > {special} > {characters} > {such} > {"as,"} > {","} > {"\""} > {left} > {and} > {right} > {parenthesis)} > {as} > {well} > {as} > {"\""} > {"question,"} > {How} > {do} > {I} > {split} > {it} > {up} > {with} > (29 rows) > > So, you have the ability to catch the seperators like ','. But for now, > teh example just catches the comma. But you want to catch a lot of other Yes, but then I ran into the problems with separators that regexp consider as part of the expression and how to dynamically build the right expression in some unified way for each language. > seperators as well. I suggest you do that within the logic of your > coding language because I don't think this will be an easy way to walk Guess you're right, because I didn't know how to handle it with the regexp-approach. I sat down yesterday and wrote a function that does the job for me in PL/pgSQL, I'm not quite finished, but can see the light at the end of the tunnel. The basic approach I'm working with now is to let it find the position of each delimiter combination within the text, then sort the resulting array to get it ordered and extract each part. It won't be fast as lightning, but sufficient for now and as it seem, allow me to parse text from various files written in different languages (e.g. programming) just by specifying the delimiters. > ;-). This is no database job in my opinion. I didn't intend to try it either before I spotted some of those functions... :-) Then figured it would be nice to do it within the db-engine as all the data is present there. I wrote code outside the db-engine some time ago, but then other aspects made it less desirable to use. > > Cheers > > Andy > Thank you Andy for the code example and your advice. I really appreciate that you took your time to show me how and explain why. //Jan-Erik
Jan-Erik schrieb: > On 24 Juli, 23:22, a.w...@netzmeister-st-pauli.de (Andreas Wenk) > wrote: >> Hi, >> >> I was thinking about that and in my opinion the approach to let the >> database do that is the wrong direction. Sure you can do a lot with >> regexp_split_to_table or regexp_split_to_array but they are kind of > > Yes, I see. You're quite right, the split was intended to do give me > everything in processed chunks it in some easy way as the last part of > the interpretation of the text. > >> limited compared to a programming language using regular expressions. If >> I had to try to get your jobdone, I would try regexp_matches() like: >> >> SELECT regexp_matches('This is just a text, that contain special >> characters such as, (comma),"(", ")" (left and right parenthesis) as >> well as "?" question, mark.How do I split it up with PostgreSQL?', >> E'(\\w*.)\\s+','g'); >> >> regexp_matches >> ---------------- >> {This} >> {is} >> {just} >> {a} >> {"text,"} >> {that} >> {contain} >> {special} >> {characters} >> {such} >> {"as,"} >> {","} >> {"\""} >> {left} >> {and} >> {right} >> {parenthesis)} >> {as} >> {well} >> {as} >> {"\""} >> {"question,"} >> {How} >> {do} >> {I} >> {split} >> {it} >> {up} >> {with} >> (29 rows) >> >> So, you have the ability to catch the seperators like ','. But for now, >> teh example just catches the comma. But you want to catch a lot of other > > Yes, but then I ran into the problems with separators that regexp > consider as part of the expression and how to dynamically build the > right expression in some unified way for each language. > >> seperators as well. I suggest you do that within the logic of your >> coding language because I don't think this will be an easy way to walk > > Guess you're right, because I didn't know how to handle it with the > regexp-approach. > I sat down yesterday and wrote a function that does the job for me in > PL/pgSQL, I'm not quite finished, but can see the light at the end of > the tunnel. just in case you are running into a black performance hole - you could try to write it in C as a user defined function. Actually for me it would be a real big challenge ;-) > The basic approach I'm working with now is to let it find the position > of each delimiter combination within the text, then sort the resulting > array to get it ordered and extract each part. > It won't be fast as lightning, but sufficient for now and as it seem, > allow me to parse text from various files written in different > languages (e.g. programming) just by specifying the delimiters. > >> ;-). This is no database job in my opinion. > > I didn't intend to try it either before I spotted some of those > functions... :-) > Then figured it would be nice to do it within the db-engine as all the > data is present there. > I wrote code outside the db-engine some time ago, but then other > aspects made it less desirable to use. >> Cheers >> >> Andy >> > > Thank you Andy for the code example and your advice. > I really appreciate that you took your time to show me how and explain > why. > //Jan-Erik hey you're welcome. When you're done it would be really great to see the resulting function ;-) Cheers Andy
Current design of function ---text_to_array( sen text, col collection )--- DECLARE delimiter_pos int[] := '{}'; return_val text[] := '{}'; delimiters_chr RECORD; remaining_text text; delimiter_curr text; i int := 0; j int := 0; prev_pos int := 1; curr_pos int := 0; delimiter_len int; BEGIN FOR delimiters_chr IN SELECT upper( chars.chars ) AS chars FROM chars, word WHERE chars.chars_id = word.chars_id AND word.classification_id = clss.classification_id LOOP delimiter_len := character_length( delimiters_chr.chars ); -- Length of the current delimiter we're looking at IF prev_pos > 0 THEN -- Don't add text if it hasn't been altered (previous delimiter not found) remaining_text := upper( sen ); -- Temporary copy of text to search through END IF; prev_pos := 0; curr_pos := position( delimiters_chr.chars IN sen ); -- Delimiter position in text (sentence) WHILE curr_pos > 0 LOOP -- Found delimiter in text delimiter_pos := array_append( delimiter_pos, curr_pos + prev_pos ); -- Add position of delimiter delimiter_pos := array_append( delimiter_pos, curr_pos + prev_pos + delimiter_len ); -- Add position where delimiter ends remaining_text := substring( remaining_text FROM curr_pos + delimiter_len ); -- Look only at text still prev_pos := curr_pos + prev_pos; -- Need to keep track of how much text has been stripped off curr_pos := position( delimiters_chr.chars IN remaining_text ); -- Calculate the next occurance END LOOP; END LOOP; remaining_text := sen; -- Temporary copy of text to extract text from SELECT DISTINCT * INTO delimiter_pos FROM array_append( array_prepend ( 1, sort( delimiter_pos ) ), character_length( sen ) + 1 ); -- Add first character pos and last to the sorted array. curr_pos := array_upper( delimiter_pos, 1 ); -- Calculate the size of the array WHILE j < curr_pos LOOP -- Look through the array j := i + 1; return_val := array_append( return_val, substring ( remaining_text FROM delimiter_pos[i] FOR (delimiter_pos[j] - delimiter_pos[i]) ) ); -- Add the parts to a new array with text i := i + 1; END LOOP; RETURN return_val; -- Return the result END; ---- Need an additional function as well, to sort the array. The delimiters has to be placed in another table. Please note that it can't handle variations to a delimiter such as <table style="..."> due to the present design as I couldn't use regexp. <table> (case doesn't matter) without any other extras, work fine though. I initially hoped it would be possible to allow matchings such as <td(.*)> as well as pairs <td>(.*)</td> and variations thereof, such as <td(.*)>(.*)</td>, but that's something I need a bit of help with (hint, hint). Written to run on eComStation v2.0 (successor of OS/2) //Jan-Erik