Discussion: Patch: regexp_matches variant returning an array of matching positions

Patch: regexp_matches variant returning an array of matching positions

From
Björn Harrtell
Date:
I've written a variant of regexp_matches called regexp_matches_positions which instead of returning matching substrings will return matching positions. I found use of this when processing OCR scanned text and wanted to prioritize matches based on their position.

The patch is for discussion. I'd also appreciate general suggestions, as this is my first experience with the PostgreSQL code base.

The patch is against the master branch and includes a simple regression test.
Attachments

Re: Patch: regexp_matches variant returning an array of matching positions

From
Alvaro Herrera
Date:
Björn Harrtell wrote:
> I've written a variant of regexp_matches called regexp_matches_positions
> which instead of returning matching substrings will return matching
> positions. I found use of this when processing OCR scanned text and wanted
> to prioritize matches based on their position.

Interesting.  I didn't read the patch but I wonder if it would be of
more general applicability to return more info in one fell swoop: a
function returning a set of (position, length, text of match), rather than
an array.  So instead of first calling one function to get the matches and
then another to get their positions, do it all in one pass.

(See pg_event_trigger_dropped_objects for a simple example of a function
that returns in that fashion.  There are several others but AFAIR that's
the simplest one.)
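
To make the suggested shape concrete, here is a minimal sketch in Python rather than in PostgreSQL C. It is not the patch and not an existing PostgreSQL function: the names MatchInfo and regexp_match_info are hypothetical, Python's re module stands in for the PostgreSQL regex engine (the two dialects differ), and positions are assumed to be 1-based as in SQL string functions.

```python
import re
from typing import Iterator, NamedTuple


class MatchInfo(NamedTuple):
    """One output row; the field names are illustrative only."""
    position: int  # 1-based start of the match (assumed SQL-style counting)
    length: int    # number of characters matched
    text: str      # the matched substring


def regexp_match_info(source: str, pattern: str) -> Iterator[MatchInfo]:
    """Yield one (position, length, text) record per match in a single scan."""
    for m in re.finditer(pattern, source):
        yield MatchInfo(m.start() + 1, m.end() - m.start(), m.group(0))


rows = list(regexp_match_info("abcabc", "abc"))
for row in rows:
    print(row)
```

Because each record carries position, length and text together, a caller never has to evaluate the pattern a second time to line positions up with matches.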

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Patch: regexp_matches variant returning an array of matching positions

From
David Johnston
Date:
Alvaro Herrera-9 wrote
> Björn Harrtell wrote:
>> I've written a variant of regexp_matches called regexp_matches_positions
>> which instead of returning matching substrings will return matching
>> positions. I found use of this when processing OCR scanned text and
>> wanted
>> to prioritize matches based on their position.
>
> Interesting.  I didn't read the patch but I wonder if it would be of
> more general applicability to return more info in a fell swoop a
> function returning a set (position, length, text of match), rather than
> an array.  So instead of first calling one function to get the match and
> then their positions, do it all in one pass.
>
> (See pg_event_trigger_dropped_objects for a simple example of a function
> that returns in that fashion.  There are several others but AFAIR that's
> the simplest one.)

Confused as to your thinking. Like regexp_matches, this returns "SETOF
type[]": integer[] in this case, text[] for the matches.  I could see adding
a generic function that returns a SETOF named composite (match varchar[],
position int[], length int[]) and the corresponding type.  I can't imagine
a situation where you'd want the position but not the text, so having to
evaluate the regexp twice seems wasteful.  The length is probably a waste,
though, since it can readily be derived from the text and is less often
needed.  But if it's pre-calculated anyway...

My question is what position is returned in a multiple-match situation. The
supplied test only covers the simple, non-global case.  It needs to
exercise empty sub-matches and global searches.  One theory is that the
first array slot should hold the global position of match zero (i.e., the
entire pattern) within the larger document, while sub-matches would be
relative offsets within that single match.  This conflicts, though, with the
fact that _matches only returns array elements for () items and never for
the full match, the goal in this function being parallel un-nesting. But as
nesting is allowed, it is still possible for this to occur.

How does this resolve in the patch?

SELECT regexp_matches('abcabc','((a)(b)(c))','g');

David J.







--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Patch-regexp-matches-variant-returning-an-array-of-matching-positions-tp5789321p5789414.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



Re: Re: Patch: regexp_matches variant returning an array of matching positions

From
"Erik Rijkers"
Date:
On Wed, January 29, 2014 05:16, David Johnston wrote:
>
> How does this resolve in the patch?
>
> SELECT regexp_matches('abcabc','((a)(b)(c))','g');
>

With the patch:

testdb=# SELECT regexp_matches('abcabc','((a)(b)(c))','g'),
regexp_matches_positions('abcabc','((a)(b)(c))');
 regexp_matches | regexp_matches_positions
----------------+--------------------------
 {abc,a,b,c}    | {1,1,2,3}
 {abc,a,b,c}    | {1,1,2,3}
(2 rows)

testdb=# SELECT regexp_matches('abcabc','((a)(b)(c))','g'),
regexp_matches_positions('abcabc','((a)(b)(c))', 'g');
 regexp_matches | regexp_matches_positions
----------------+--------------------------
 {abc,a,b,c}    | {1,1,2,3}
 {abc,a,b,c}    | {4,4,5,6}
(2 rows)



( in HEAD:

testdb=# SELECT regexp_matches('abcabc','((a)(b)(c))','g');
 regexp_matches
----------------
 {abc,a,b,c}
 {abc,a,b,c}
(2 rows)
)
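
The positions reported with the patch are absolute, 1-based offsets into the source string, one per capture group. The following Python sketch reproduces those arrays; it is an illustration only, not the patch's code. The function name group_positions and its global_search flag are hypothetical, Python's re module is used in place of the PostgreSQL regex engine, and returning None for a group that did not participate is an assumption about how an unmatched group might be reported.

```python
import re
from typing import List, Optional


def group_positions(source: str, pattern: str,
                    global_search: bool = False) -> List[List[Optional[int]]]:
    """Return, for each match, the 1-based start of every capture group.

    With no capture groups, the whole match is reported instead, mirroring
    how regexp_matches returns a single-element array in that case.
    """
    rows: List[List[Optional[int]]] = []
    for m in re.finditer(pattern, source):
        n_groups = m.re.groups
        indices = range(1, n_groups + 1) if n_groups else [0]
        row: List[Optional[int]] = []
        for i in indices:
            start = m.start(i)  # -1 when group i did not participate
            row.append(start + 1 if start >= 0 else None)  # None: assumed NULL
        rows.append(row)
        if not global_search:
            break  # without the 'g' flag only the first match is reported
    return rows


first_only = group_positions("abcabc", "((a)(b)(c))")
all_matches = group_positions("abcabc", "((a)(b)(c))", global_search=True)
print(first_only)
print(all_matches)
```

The outer group and the first inner group both start at the same character, which is why the first two array elements are equal in each row.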







Re: Patch: regexp_matches variant returning an array of matching positions

From
David Johnston
Date:
Erik Rijkers wrote
> On Wed, January 29, 2014 05:16, David Johnston wrote:
>>
>> How does this resolve in the patch?
>>
>> SELECT regexp_matches('abcabc','((a)(b)(c))','g');
>>
> 
> With the patch:
> 
> testdb=# SELECT regexp_matches('abcabc','((a)(b)(c))','g'),
> regexp_matches_positions('abcabc','((a)(b)(c))');
>  regexp_matches | regexp_matches_positions
> ----------------+--------------------------
>  {abc,a,b,c}    | {1,1,2,3}
>  {abc,a,b,c}    | {1,1,2,3}
> (2 rows)

The {1,1,2,3} in the second row is an artifact/copy from
set-value-function-in-select-list repetition and has nothing to do with the
second match.
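
The repetition can be modelled in a few lines. In PostgreSQL releases before version 10, several set-returning functions in one select list were each cycled until all of them finished together, so the row count was the least common multiple of the individual row counts. The Python sketch below imitates that rule; srf_select_list is a hypothetical helper, and returning no rows when any input is empty is a simplifying assumption.

```python
from math import gcd
from typing import List, Sequence, Tuple


def srf_select_list(*columns: Sequence) -> List[Tuple]:
    """Combine per-function result sets by cycling each one until all of them
    end on the same row (least common multiple of the row counts)."""
    if any(len(col) == 0 for col in columns):
        return []  # simplifying assumption for empty result sets
    total = 1
    for col in columns:
        total = total * len(col) // gcd(total, len(col))
    return [tuple(col[i % len(col)] for col in columns) for i in range(total)]


# regexp_matches(..., 'g') yields two rows; the non-global positions call yields one.
matches = [["abc", "a", "b", "c"], ["abc", "a", "b", "c"]]
positions = [[1, 1, 2, 3]]
rows = srf_select_list(matches, positions)
for row in rows:
    print(row)
```

With row counts 2 and 1 the result has two rows, and the single positions row is simply repeated next to the second match.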


> testdb=# SELECT regexp_matches('abcabc','((a)(b)(c))','g'),
> regexp_matches_positions('abcabc','((a)(b)(c))', 'g');
>  regexp_matches | regexp_matches_positions
> ----------------+--------------------------
>  {abc,a,b,c}    | {1,1,2,3}
>  {abc,a,b,c}    | {4,4,5,6}
> (2 rows)

As expected.

David J.








Re: Patch: regexp_matches variant returning an array of matching positions

From
Björn Harrtell
Date:
I'll elaborate on the use case. I have OCR scanned text for a large number of images, with one row per image. I want to match against words in another table. I need two result sets, one with all matched words and one with only the first matched word within the first 50 chars of the OCR scanned text. Having the matched position in the first result set makes it easy to produce the second.

I cannot find the position using the substring because I use word boundaries in my regexp.
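
A small Python sketch of this use case follows. It is an illustration under stated assumptions, not the actual queries: word_hits and first_hit_within are hypothetical helpers, the sample text and word list are made up, Python's \b stands in for the word-boundary syntax of the PostgreSQL regex, and "within the first 50 chars" is read as "the match starts at position 50 or earlier".

```python
import re
from typing import Iterable, List, Optional, Tuple

Hit = Tuple[int, str]  # (1-based position, word)


def word_hits(text: str, words: Iterable[str]) -> List[Hit]:
    """Return every whole-word hit with its position, ordered by position."""
    hits: List[Hit] = []
    for word in words:
        pattern = r"\b" + re.escape(word) + r"\b"
        for m in re.finditer(pattern, text):
            hits.append((m.start() + 1, word))
    return sorted(hits)


def first_hit_within(hits: List[Hit], limit: int = 50) -> Optional[Hit]:
    """Return the earliest hit that starts within the first `limit` characters."""
    for position, word in hits:
        if position <= limit:
            return (position, word)
    return None


ocr_text = "concatenate the cat and the dog " + "x" * 40 + " cat"
hits = word_hits(ocr_text, ["cat", "dog"])
print(hits)                          # the full result set
print(first_hit_within(hits))        # the second result set, derived from the first
print(ocr_text.find("cat") + 1)      # naive substring search lands inside "concatenate"
```

The last line shows why a plain substring lookup of the matched word is not enough: it finds the letters inside a longer word, whereas the regex position refers to the whole-word match.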

Returning a SETOF named composite makes sense, so I could try to write such a function instead if there is interest. Perhaps a good name for such a function would be simply regexp_match or regexp_search (as in Python).

/Björn

