Re: [E] Re: Regexp_replace bug / does not terminate on long strings

Поиск
Список
Период
Сортировка
От Markhof, Ingolf
Тема Re: [E] Re: Regexp_replace bug / does not terminate on long strings
Дата
Msg-id CALZg0g5QeVLSjCQeTA8Csh6EHLjuekPuPCUJ3t-XcCqqOnfwyA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [E] Re: Regexp_replace bug / does not terminate on long strings  ("Markhof, Ingolf" <ingolf.markhof@de.verizon.com>)
Ответы Re: [E] Re: Regexp_replace bug / does not terminate on long strings  (Francisco Olarte <folarte@peoplecall.com>)
Список pgsql-general
Argh...

Yes, When I use (\1)? instead of (\1)+, the expression is evaluated quickly, but it doesn't return what I want. Once a word is written, it is not subject to matching again. i.e.

select regexp_replace( --> remove double entries
  'one,one,one,two,two,three,three',
      '([^,]+)(,\1)?($|,)',
        '\1\3',
        'g'
   ) as res;  

returns 'one,one,two,three'. The first 'one' is found, followed by a ',one, so the 'one,one,' is replaced by a 'one,'. It looks like the system is 'reading' from an input string and writing to output string. And the 'one,' send to the output is not matched again. So the system instead finds just another 'one,' in the input string and writes another 'one,' to the output string.

Honestly, this behaviour seems to be incorrect for me. Once the system replaces the first two 'one,one,' by a single 'one,', I'd expect to match this replaced one 'one,' with the next 'one,' following, replacing these two by another, single 'one,', again...

Regards,
Ingolf
 


Ingolf Markhof <ingolf.markhof@de.verizon.com>
International Network Product - International Access <INP-IntlAccess@verizon.com>

 

Sebrathweg 20, 44149 Dortmund, GermanyOffice: +49 231 972 1475 | Vnet: 317-1475



On Fri, Aug 20, 2021 at 6:11 PM Markhof, Ingolf <ingolf.markhof@de.verizon.com> wrote:
Hi Tom,

thank you very much for your reply. Actually, I was assuming all these regular expressions are based on the same core implementation. Interestingly, this doesn't seem to be true...

I am also surprised that you say the (\1)+ subpattern is computationally expensive. Regular expressions are greedy by default. I.e. in case of a* matching against a string of 1000 a's, the system will not try a, aa, aaa, ... and so on, right? Instead, it will consume all the a's in one go. Likewise, I was expecting that the system would eat all the repetitions of a word in one go. Your proposal to use (\1)? instead at the first glance seems to require more effort, because the same word and it's matching successor will need to be matched again and again and again. Roughly, 2N matches are to be done instead of just N. 

However, you are perfectly right: When I use (\1)? instead of (\1)+, the expression is evaluates quickly!

Thank you very much for looking into this and for proposing the alternative approach which is working fine.

Regards
Ingolf



On Fri, Aug 20, 2021 at 12:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:
> BRIEF:
> regexp_replace(source,pattern,replacement,flags) needs very (!) long to
> complete or does not complete at all (?!) for big input strings (a few k
> characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses.  I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern.  In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word.  Or it could be N-1 repetitions followed by one other
word, and so on.  The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest.  Usually the actual match (if any) will be pretty
short, so that you have O(N) wasted work per word, making the runtime
at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time.  Get rid of one dup at a time, say by
     str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.

I did come across a performance bug [1] while poking at this, but
alas fixing it doesn't move the needle very much for this example.

                        regards, tom lane

[1] https://urldefense.proofpoint.com/v2/url?u=https-3A__www.postgresql.org_message-2Did_1808998.1629412269-2540sss.pgh.pa.us&d=DwIBAg&c=udBTRvFvXC5Dhqg7UHpJlPps3mZ3LRxpb6__0PomBTQ&r=ivZWA-ECVj3XrXBe0obDwKY7Ui7K5Nj9oD2KKWLm0Bw&m=q11bVTHCxVx8BQu2pjn6-3nOY8aN8hORXofVK38HqF8&s=hJxrzmTT6G7AUomoeFgh0IGDO3NcUP4gB9kvYHnt3m0&e=

В списке pgsql-general по дате отправления:

Предыдущее
От: GARAPATI CHANDRA SYAM KUMAR
Дата:
Сообщение: Multiple Postgres process are running in background
Следующее
От: Oliver Kohll
Дата:
Сообщение: Re: Incremental Materialized Views