Discussion: large document multiple regex
Hello,

I am receiving a large (300k+) document from an external agent and need to extract a few interesting bits of data from it, on an insert trigger, into separate fields.

Regex seems one way to handle this, but is there any way to avoid rescanning the document for each regex? One solution I am kicking around is some C hackery, but then I lose the expressive power of regex. Ideally, I need to be able to scan some text and return a comma-delimited string of values extracted from it. Does anybody know if this is possible, or have any other suggestions?

merlin
On Jan 26, 2007, at 9:06 AM, Merlin Moncure wrote:
> I am receiving a large (300k+) document from an external agent and
> need to extract a few interesting bits of data from it, on an insert
> trigger, into separate fields.
>
> Regex seems one way to handle this, but is there any way to avoid
> rescanning the document for each regex? One solution I am kicking
> around is some C hackery, but then I lose the expressive power of
> regex. Ideally, I need to be able to scan some text and return a
> comma-delimited string of values extracted from it. Does anybody know
> if this is possible, or have any other suggestions?

Have you thought about something like ~ '(first_string|second_string|third_string)'? Obviously your example would be more complex, but I believe that with careful crafting, you can get regex to do a lot without resorting to multiple passes.

--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On 2/1/07, Jim Nasby <decibel@decibel.org> wrote:
> Have you thought about something like ~
> '(first_string|second_string|third_string)'? Obviously your example
> would be more complex, but I believe that with careful crafting, you
> can get regex to do a lot without resorting to multiple passes.

That doesn't work. I researched the problem further and found that the PostgreSQL regex implementation has a built-in limitation: it quits scanning after the first matched group (this is noted in the documentation). There is no way that I can see to extract two or more noncontiguous text chunks in a single regex.

To do it properly, you need the sophistication of Perl regex with its magic variables.

merlin
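For readers following the thread, here is a minimal sketch of the single-pass idea in an engine that does expose every match: one alternation with a named group per field, walked once over the document. It is written in Python rather than PostgreSQL SQL, so it only applies on the server if a procedural language such as PL/Python is available, which the thread does not confirm. The field names and patterns (`invoice`, `total`, `customer`) and both function names are hypothetical, made up for illustration.

```python
import re

# Hypothetical field patterns; the thread never says what the real fields are.
# Each alternative carries exactly one named group.
FIELD_PATTERNS = {
    "invoice": r"Invoice-No:\s*(?P<invoice>\d+)",
    "total": r"Total:\s*(?P<total>\d+\.\d{2})",
    "customer": r"Customer:\s*(?P<customer>\w+)",
}

# One alternation, compiled once, so the document is scanned a single time.
COMBINED = re.compile("|".join(FIELD_PATTERNS.values()))


def extract_fields(document):
    """Return the first value found for each field in one left-to-right scan."""
    found = {}
    for match in COMBINED.finditer(document):
        name = match.lastgroup  # the named group of the alternative that matched
        if name not in found:
            found[name] = match.group(name)
        if len(found) == len(FIELD_PATTERNS):
            break  # every field seen; no need to scan the rest
    return found


def extract_csv(document):
    """Join the extracted values in a fixed field order; missing fields are empty."""
    found = extract_fields(document)
    return ",".join(found.get(name, "") for name in FIELD_PATTERNS)
```

The fields may appear in any order in the document, and the scan stops early once all of them have been found. Values that themselves contain commas would need quoting before being joined.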
On Fri, Feb 02, 2007 at 12:00:27PM -0500, Merlin Moncure wrote:
> On 2/1/07, Jim Nasby <decibel@decibel.org> wrote:
> > Have you thought about something like ~
> > '(first_string|second_string|third_string)'? Obviously your
> > example would be more complex, but I believe that with careful
> > crafting, you can get regex to do a lot without resorting to
> > multiple passes.
>
> That doesn't work. I researched the problem further and found that
> the PostgreSQL regex implementation has a built-in limitation: it
> quits scanning after the first matched group (this is noted in the
> documentation). There is no way that I can see to extract two or
> more noncontiguous text chunks in a single regex.
>
> To do it properly, you need the sophistication of Perl regex with
> its magic variables.

It looks like that's coming in 8.3 :)

<http://archives.postgresql.org/pgsql-hackers/2007-02/msg00039.php>

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter

Remember to vote!