Обсуждение: Regular expression query
I have a large group of records which were entered with trailing garbage, in the form of superfluous \n. The main problem this has caused, other than the obvious one, is that the record in question is being used as a primary key, so some duplicates have slipped through. I assumed a simple statement like: SELECT url FROM sites WHERE url ~ url || '\\s+' ...would allow me to find all the duplicate-but-not-quite records. While this concatenation works with the LIKE directive (ie LIKE url || '%'), postgresql barfs on it in a regexp with the error: ERROR: Unable to identify an operator '||' for types 'bool' and 'unknown'You will have to retype this query using an explicitcast Encapsulating the concatenation in brackets leads the query parser to stop bleating. I presume this is simply a limitation in the parser's ability to make inferences about regexps vs. LIKEs. The other aspect of this is that it seems that postgresql's regexp engine doesn't understand some expected regexps; I've tried both escaped and unescaped versions of, eg \w, \s, \n and so on a pg seems to ignore them. Am I exceeding the capabilities of the regexp parser? -- Rodger Donaldson rodgerd@diaspora.gen.nz I just had this vision of a young boy cowering in terror, whispering: "I see dumb people" -- Steve VanDevender
Rodger Donaldson <rodgerd@diaspora.gen.nz> writes: > SELECT url > FROM sites > WHERE url ~ url || '\\s+' > While this concatenation works with the LIKE directive (ie LIKE url || '%'), > postgresql barfs on it in a regexp with the error: > ERROR: Unable to identify an operator '||' for types 'bool' and 'unknown' > You will have to retype this query using an explicit cast LIKE and ~ do not have the same precedence. See http://www.postgresql.org/docs/postgres/operators.htm. ~ and || actually fall in the same category ("all other") and therefore are grouped left-to-right; so you're getting (url ~ url) || '...'. > The other aspect of this is that it seems that postgresql's regexp engine > doesn't understand some expected regexps; I've tried both escaped and > unescaped versions of, eg \w, \s, \n and so on a pg seems to ignore them. The regexp package we currently use implements POSIX 1003.2 regexps (see src/backend/regex/re_format.7). I believe there is an item on the TODO list about upgrading the regexp parser to something more modern ... feel free to hop on that project if it's bugging you ... regards, tom lane
> The regexp package we currently use implements POSIX 1003.2 regexps > (see src/backend/regex/re_format.7). I believe there is an item on the > TODO list about upgrading the regexp parser to something more modern > ... feel free to hop on that project if it's bugging you ... I would like to recommend ftp://ftp.cus.cam.ac.uk/pub/software/programs/pcre/ which stands for "PERL compatible regular expressions" and has an expression analyzer & compiler for more efficient repeated matching. Since Philip Hazel has only written this package to support his Mail Transfer Agent (exim), it is optimised for performance. I hope I have been of help, Oliver Seidel
> Rodger Donaldson <rodgerd@diaspora.gen.nz> writes: > > SELECT url > > FROM sites > > WHERE url ~ url || '\\s+' > > > While this concatenation works with the LIKE directive (ie LIKE url || '%'), > > postgresql barfs on it in a regexp with the error: > > > ERROR: Unable to identify an operator '||' for types 'bool' and 'unknown' > > You will have to retype this query using an explicit cast > > LIKE and ~ do not have the same precedence. See > http://www.postgresql.org/docs/postgres/operators.htm. > ~ and || actually fall in the same category ("all other") and therefore > are grouped left-to-right; so you're getting (url ~ url) || '...'. > > > The other aspect of this is that it seems that postgresql's regexp engine > > doesn't understand some expected regexps; I've tried both escaped and > > unescaped versions of, eg \w, \s, \n and so on a pg seems to ignore them. > > The regexp package we currently use implements POSIX 1003.2 regexps > (see src/backend/regex/re_format.7). I believe there is an item on the > TODO list about upgrading the regexp parser to something more modern > ... feel free to hop on that project if it's bugging you ... I bug Henry Spencer regularly. His new code is in TCL/TK, but has not been released into any other code. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026