Обсуждение: Regular expression query

Поиск
Список
Период
Сортировка

Regular expression query

От
Rodger Donaldson
Дата:
I have a large group of records which were entered with trailing garbage, in
the form of superfluous \n.

The main problem this has caused, other than the obvious one, is that the
record in question is being used as a primary key, so some duplicates have
slipped through.  I assumed a simple statement like:

SELECT url 
FROM sites
WHERE url ~ url || '\\s+'

...would allow me to find all the duplicate-but-not-quite records.

While this concatenation works with the LIKE directive (ie LIKE url || '%'),
postgresql barfs on it in a regexp with the error:

ERROR:  Unable to identify an operator '||' for types 'bool' and 'unknown'You will have to retype this query using an
explicitcast
 

Encapsulating the concatenation in brackets leads the query parser to stop
bleating.  I presume this is simply a limitation in the parser's ability to
make inferences about regexps vs. LIKEs.

The other aspect of this is that it seems that postgresql's regexp engine
doesn't understand some expected regexps; I've tried both escaped and
unescaped versions of, eg \w, \s, \n and so on a pg seems to ignore them.
Am I exceeding the capabilities of the regexp parser?

-- 
Rodger Donaldson        rodgerd@diaspora.gen.nz
I just had this vision of a young boy cowering in terror, whispering:
"I see dumb people"    -- Steve VanDevender 


Re: Regular expression query

От
Tom Lane
Дата:
Rodger Donaldson <rodgerd@diaspora.gen.nz> writes:
> SELECT url 
> FROM sites
> WHERE url ~ url || '\\s+'

> While this concatenation works with the LIKE directive (ie LIKE url || '%'),
> postgresql barfs on it in a regexp with the error:

> ERROR:  Unable to identify an operator '||' for types 'bool' and 'unknown'
>     You will have to retype this query using an explicit cast

LIKE and ~ do not have the same precedence.  See
http://www.postgresql.org/docs/postgres/operators.htm.
~ and || actually fall in the same category ("all other") and therefore
are grouped left-to-right; so you're getting (url ~ url) || '...'.

> The other aspect of this is that it seems that postgresql's regexp engine
> doesn't understand some expected regexps; I've tried both escaped and
> unescaped versions of, eg \w, \s, \n and so on a pg seems to ignore them.

The regexp package we currently use implements POSIX 1003.2 regexps
(see src/backend/regex/re_format.7).  I believe there is an item on the
TODO list about upgrading the regexp parser to something more modern
... feel free to hop on that project if it's bugging you ...
        regards, tom lane


Re: Regular expression query

От
Oliver Seidel
Дата:
> The regexp package we currently use implements POSIX 1003.2 regexps
> (see src/backend/regex/re_format.7).  I believe there is an item on the
> TODO list about upgrading the regexp parser to something more modern
> ... feel free to hop on that project if it's bugging you ...

I would like to recommend

ftp://ftp.cus.cam.ac.uk/pub/software/programs/pcre/ 

which stands for "PERL compatible regular expressions" and has an
expression analyzer & compiler for more efficient repeated matching.
Since Philip Hazel has only written this package to support his Mail
Transfer Agent (exim), it is optimised for performance.

I hope I have been of help,

Oliver Seidel



Re: Regular expression query

От
Bruce Momjian
Дата:
> Rodger Donaldson <rodgerd@diaspora.gen.nz> writes:
> > SELECT url 
> > FROM sites
> > WHERE url ~ url || '\\s+'
> 
> > While this concatenation works with the LIKE directive (ie LIKE url || '%'),
> > postgresql barfs on it in a regexp with the error:
> 
> > ERROR:  Unable to identify an operator '||' for types 'bool' and 'unknown'
> >     You will have to retype this query using an explicit cast
> 
> LIKE and ~ do not have the same precedence.  See
> http://www.postgresql.org/docs/postgres/operators.htm.
> ~ and || actually fall in the same category ("all other") and therefore
> are grouped left-to-right; so you're getting (url ~ url) || '...'.
> 
> > The other aspect of this is that it seems that postgresql's regexp engine
> > doesn't understand some expected regexps; I've tried both escaped and
> > unescaped versions of, eg \w, \s, \n and so on a pg seems to ignore them.
> 
> The regexp package we currently use implements POSIX 1003.2 regexps
> (see src/backend/regex/re_format.7).  I believe there is an item on the
> TODO list about upgrading the regexp parser to something more modern
> ... feel free to hop on that project if it's bugging you ...

I bug Henry Spencer regularly.  His new code is in TCL/TK, but has not
been released into any other code.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026