Discussion: regular expressions from hell
I've noticed there are no less than 10^10 regex implementations. Is there a standard? Does ANSI have a regexp standard, or is there a regex standard in the ANSI SQL spec? What do we use?

Personally, I'm a perl guy, so every time I have to bend my brain to some other regex syntax, I get a headache. As part of my perl PL package, perl regexps will be included as a set of operators.

Is there interest in the release of perl-style regexp operators for postgres before the PL is completed? Note that this requires the entire perl library to be loaded when the operator is used (possibly expensive). But, if you have a shared perl library, this only has to happen once.
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard? Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec? What do we use?
afaik the only regex in ANSI SQL is that implemented for the LIKE
operator. Pretty pathetic: uses "%" for match-all and "_" for match-any
and that's it. Ingres had a bit more, with bracketed character ranges
also. None as rich as what we already have in the backend of Postgres.
Don't know about any other ANSI standards for regex, but I don't know
that there isn't one either...
- Tom
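For readers comparing the two, the LIKE wildcards map onto ordinary regex syntax quite directly. The following is a minimal Python sketch, not anything taken from the Postgres source: the function names `like_to_regex` and `like_match` are made up for illustration, and the ESCAPE clause is ignored.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an equivalent regex string.

    Only the two LIKE wildcards are handled: "%" (any run of characters,
    including none) and "_" (exactly one character). ESCAPE is ignored.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # Everything else is literal, so regex metacharacters are escaped.
            parts.append(re.escape(ch))
    return "".join(parts)


def like_match(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches the LIKE pattern."""
    # DOTALL lets the wildcards match newline characters too.
    return re.fullmatch(like_to_regex(pattern), text, re.DOTALL) is not None


print(like_to_regex("post%_sql"))
print(like_match("post%", "postgres"))
print(like_match("p_stgres", "postgres"))
print(like_match("post", "postgres"))
```

The translation only ever produces `.*`, `.` and escaped literals, which is one way to see how little of regex syntax LIKE actually covers.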
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard? Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec? What do we use?
Good question. I think one of the standard unix regex's should be ok. At least
everyone knows how to work it, and they are quite small.
> Personally, I'm a perl guy, so every time I have to bend my brain to
> some other regex syntax, I get a headache. As part of my perl PL
> package, perl regexps will be included as a set of operators.
>
> Is there interest in the release of perl-style regexp operators for
> postgres before the PL is completed? Note that this requires the
> entire perl library to be loaded when the operator is used (possibly
> expensive). But, if you have a shared perl library, this only has to
> happen once.
Hmmm, I really like the perl regex's, especially the extended syntax, but
I don't want to load a whole perl lib to get this.
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
and someone who knows less will correct me if I'm right."
--David Palmer (palmer@tybalt.caltech.edu)
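The "extended syntax" mentioned above is presumably Perl's /x modifier, which lets a pattern carry whitespace and comments. As a rough illustration, written in Python rather than Perl (Python's `re.VERBOSE` flag behaves much like /x; the pattern and the sample numbers are invented for this example):

```python
import re

# re.VERBOSE is Python's counterpart to Perl's /x: whitespace in the
# pattern is ignored and "#" starts a comment.
PHONE = re.compile(
    r"""
    (\d{3})   # area code
    \.        # literal dot separator
    (\d{3})   # exchange
    \.        # literal dot separator
    (\d{4})   # line number
    """,
    re.VERBOSE,
)

sample = "call 555.123.4567 or 555.765.4321"
match = PHONE.search(sample)
if match:
    print(match.groups())
print(PHONE.findall(sample))
```

Written on one line the same pattern would be `(\d{3})\.(\d{3})\.(\d{4})`; the extended form matches exactly the same strings and is only easier to read.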
Unfortunately, there's no other way. This is mentioned in the
perlcall manpage, I believe. One method which is ok in my book is to
load the shared perl lib once, in one backend, and then it can be
shared between all other backends when they need perl regex's.
There is no mechanism for auto-loading the type/func shared libraries
on postmaster startup, correct? It happens per backend session? So
to do the above you'd have to have one "Dummy" connection which just
did a simple regex and then while(1) { sleep(10^32) };
On Sun, 31 May 1998, at 16:46:30, David Gould wrote:
> Hmmm, I really like the perl regex's, especially the extended syntax, but
> I don't want to load a whole perl lib to get this.
>
> -dg
>
> David Gould dg@illustra.com 510.628.3783 or 510.305.9468
> Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
> "Of course, someone who knows more about this will correct me if I'm wrong,
> and someone who knows less will correct me if I'm right."
> --David Palmer (palmer@tybalt.caltech.edu)
>
Not to mention the fact that if perl (or mod_perl) is already running
(and you're using a shared libperl), the library is already loaded.
On Sun, 31 May 1998, at 17:23:16, Brett McCormick wrote:
> Unfortunately, there's no other way. This is mentioned in the
> perlcall manpage, I believe. One method which is ok in my book is to
> load the shared perl lib once, in one backend, and then it can be
> shared between all other backends when they need perl regex's.
>
> There is no mechanism for auto-loading the type/func shared libraries
> on postmaster startup, correct? It happens per backend session? So
> to do the above you'd have to have one "Dummy" connection which just
> did a simple regex and then while(1) { sleep(10^32) };
>
> On Sun, 31 May 1998, at 16:46:30, David Gould wrote:
>
> > Hmmm, I really like the perl regex's, especially the extended syntax, but
> > I don't want to load a whole perl lib to get this.
> >
> > -dg
> >
> > David Gould dg@illustra.com 510.628.3783 or 510.305.9468
> > Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
> > "Of course, someone who knows more about this will correct me if I'm wrong,
> > and someone who knows less will correct me if I'm right."
> > --David Palmer (palmer@tybalt.caltech.edu)
> >
>
> Date: Sun, 31 May 1998 18:56:29 -0700 (PDT)
> From: Brett McCormick <brett@work.chicken.org>
> Sender: owner-pgsql-hackers@hub.org
> Not to mention the fact that if perl (or mod_perl) is already running
> (and you're using a shared libperl), the library is already loaded.
If you're running Apache, mod_perl or not, isn't Posix regex loaded?
(HSREGEX or compatible?)
> On Sun, 31 May 1998, at 17:23:16, Brett McCormick wrote:
>
> > Unfortunately, there's no other way. This is mentioned in the
> > perlcall manpage, I believe. One method which is ok in my book is to
> > load the shared perl lib once, in one backend, and then it can be
> > shared between all other backends when they need perl regex's.
> >
> > There is no mechanism for auto-loading the type/func shared libraries
> > on postmaster startup, correct? It happens per backend session? So
> > to do the above you'd have to have one "Dummy" connection which just
> > did a simple regex and then while(1) { sleep(10^32) };
...
> Not to mention the fact that if perl (or mod_perl) is already running
> (and you're using a shared libperl), the library is already loaded.

Ok, my vote is to build regexes into the pgsql binary or into a .so that
we distribute. There should be no need to have perl installed on a system
to run postgresql. If we are going to extend the language to improve on
the very lame sql92 like clause, we need to have it be part of the system
that can be counted on, not something you might or might not have depending
on what else is installed.
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software 300 Lakeside Drive Oakland, CA 94612
- A child of five could understand this! Fetch me a child of five.
On Sun, 31 May 1998, Thomas G. Lockhart wrote:
> > I've noticed there are no less than 10^10 regex implementations.
> > Is there a standard? Does ANSI have a regexp standard, or is there
> > a regex standard in the ANSI SQL spec? What do we use?
>
> afaik the only regex in ANSI SQL is that implemented for the LIKE
> operator. Pretty pathetic: uses "%" for match-all and "_" for match-any
> and that's it. Ingres had a bit more, with bracketed character ranges
> also. None as rich as what we already have in the backend of Postgres.
>
> Don't know about any other ANSI standards for regex, but I don't know
> that there isn't one either...
>
- SQL3 SIMILAR condition.
SIMILAR is intended for character string pattern matching. The difference
between SIMILAR and LIKE is that SIMILAR supports a much more extensive
range of possibilities ("wild cards," etc.) than LIKE does.
Here is the syntax:
expression [ NOT ] SIMILAR TO pattern [ ESCAPE escape ]
Jose'
> > Not to mention the fact that if perl (or mod_perl) is already running
> > (and you're using a shared libperl), the library is already loaded.
>
> Ok, my vote is to build regexes into the pgsql binary or into a .so that
> we distribute. There should be no need to have perl installed on a system
> to run postgresql. If we are going to extend the language to improve on
> the very lame sql92 like clause, we need to have it be part of the system
> that can be counted on, not something you might or might not have depending
> on what else is installed.

We already have it as ~, just not with Perl extensions. Our
implementation is very slow, and the author has said he is working on a
rewrite, though no time frame was given.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Mon, 1 June 1998, at 10:16:35, Bruce Momjian wrote:
> > Ok, my vote is to build regexes into the pgsql binary or into a .so that
> > we distribute. There should be no need to have perl installed on a system
> > to run postgresql. If we are going to extend the language to improve on
> > the very lame sql92 like clause, we need to have it be part of the system
> > that can be counted on, not something you might or might not have depending
> > on what else is installed.

I'm not suggesting we require perl to be installed to run postgres, or
replace the current regexp implementation with perl. I was just
lamenting the fact that there are no less than 10 different regexp
implementations, with different metacharacters. Why should I have to
remember one syntax when I use perl, one for sed, one for emacs, and
another for postgresql? This isn't a problem with postgres per se,
just the fact that there seems to be no standard.

I love perl regex's. I'm merely suggesting (and planning on
implementing) a different set of regexp operators (not included with
postgres, but as a contrib module) that use perl regex's. There are
some pros and cons, which have been discussed. It should be there for
people who want it.
>
> We already have it as ~, just not with Perl extensions. Our
> implementation is very slow, and the author has said he is working on a
> rewrite, though no time frame was given.
On Sun, 31 May 1998, David Gould wrote:
> >
> > Not to mention the fact that if perl (or mod_perl) is already running
> > (and you're using a shared libperl), the library is already loaded.
>
> Ok, my vote is to build regexes into the pgsql binary or into a .so that
> we distribute. There should be no need to have perl installed on a system
> to run postgresql. If we are going to extend the language to improve on
> the very lame sql92 like clause, we need to have it be part of the system
> that can be counted on, not something you might or might not have depending
> on what else is installed.
Odd question here, but how many systems nowadays *don't* have Perl
installed that would be running PostgreSQL? IMHO, perl is an invaluable
enough tool that I can't imagine a site not running it *shrug*
Brett McCormick wrote:
>
> On Mon, 1 June 1998, at 10:16:35, Bruce Momjian wrote:
>
> > > Ok, my vote is to build regexes into the pgsql binary or into a .so that
> > > we distribute. There should be no need to have perl installed on a system
> > > to run postgresql. If we are going to extend the language to improve on
> > > the very lame sql92 like clause, we need to have it be part of the system
> > > that can be counted on, not something you might or might not have depending
> > > on what else is installed.
>
> I'm not suggesting we require perl to be installed to run postgres, or
> replace the current regexp implementation with perl. I was just
> lamenting the fact that there are no less than 10 different regexp
> implementations, with different metacharacters. Why should I have to
> remember one syntax when I use perl, one for sed, one for emacs, and
> another for postgresql? This isn't a problem with postgres per se,
> just the fact that there seems to be no standard.

I think most of this is due to different decisions on what
needs to be escaped or not. For instance, if memory serves,
GNU grep treats parens as metacharacters, which must be
escaped with a backslash to match parens, while in Emacs,
parens match parens and must be escaped to get their
meta-character meaning. Things have gone too far to have
one standard now I'm afraid.
Ocie
-----BEGIN PGP SIGNED MESSAGE-----
>>>>> "ocie" == ocie <ocie@paracel.com> writes:
ocie> I think most of this is due to different decisions on what
ocie> needs to be escaped or not. For instance, if memory serves,
ocie> GNU grep treats parens as metacharacters, which must be
ocie> escaped with a backslash to match parens, while in Emacs,
ocie> parens match parens and must be escaped to get their
ocie> meta-character meaning. Things have gone too far to have
ocie> one standard now I'm afraid.
Please try to remember that there are historical reasons for some of
this. grep and egrep behave differently with respect to parentheses;
again, this is historical.
Personally, I like Perl regexps. And there is a library for Tcl/Tk
(nre) that implements the same syntax for that language. But I do
like Emacs' syntax tables and character classes. I can live with
switching back and forth to some extent....
roland
-----BEGIN PGP SIGNATURE-----
Version: 2.6.2
Comment: Processed by Mailcrypt 3.4, an Emacs/PGP interface
iQCVAwUBNXSyLuoW38lmvDvNAQHatQQAsyp+akdXl0TiptXsSlrp7tM2/Jb/jLnW
SfpkYVkk53iER/JMYMU4trfQQssePkqGmaF8GMeU5i8eMW6Vi3Vus2pqovnLa1eV
w5rCgxKXqpZnIhGJZeHIYieMfWxfdmWOUjawrjKv85vBRdZDYdRkLBoAWvI4ZaJb
JxAEwqbZrQw=
=Zgvo
-----END PGP SIGNATURE-----
--
Roland B. Roberts, PhD Custom Software Solutions
roberts@panix.com 101 West 15th St #4NN
New York, NY 10011
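The grep/egrep difference comes down to which form of a parenthesis is the grouping operator: in POSIX basic syntax (grep, and Emacs as well) a bare `(` is literal and `\(` groups, while in extended syntax (egrep, Perl) it is the other way round. A small Python sketch of converting between the two, assuming only the parenthesis convention needs changing (the helper `swap_paren_escaping` is hypothetical and ignores other differences such as `\{`, `+`, `?` and `|`):

```python
import re


def swap_paren_escaping(pattern: str) -> str:
    r"""Swap escaped and unescaped parentheses in a regex.

    Converts grep/Emacs-style grouping, written \( ... \), to the
    egrep/Perl style ( ... ), and vice versa. Other escape sequences are
    copied through unchanged. Bracket expressions are not treated specially.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # An escaped paren becomes bare; any other escape is kept as is.
            out.append(nxt if nxt in "()" else ch + nxt)
            i += 2
        elif ch in "()":
            # A bare paren becomes escaped.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


basic_style = r"\(ab\)*(c)"  # group "ab", followed by a literal "(c)"
perl_style = swap_paren_escaping(basic_style)
print(perl_style)
print(re.fullmatch(perl_style, "abab(c)") is not None)
```

Python's `re` module follows the egrep/Perl convention, which is why the converted pattern can be tried out directly; applying the function a second time gives back the original pattern.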
Roland B. Roberts, PhD writes:
> >>>>> "ocie" == ocie <ocie@paracel.com> writes:
>
> ocie> I think most of this is due to different decisions on what
> ocie> needs to be escaped or not. For instance, if memory serves,
> ocie> GNU grep treats parens as metacharacters, which must be
> ocie> escaped with a backslash to match parens, while in Emacs,
> ocie> parens match parens and must be escaped to get their
> ocie> meta-character meaning. Things have gone too far to have
> ocie> one standard now I'm afraid.
>
> Please try to remember that there are historical reasons for some of
> this. grep and egrep behave differently with respect to parentheses;
> again, this is historical.
>
> Personally, I like Perl regexps. And there is a library for Tcl/Tk
> (nre) that implements the same syntax for that language. But I do
> like Emacs' syntax tables and character classes. I can live with
> switching back and forth to some extent....

Emacs! Huh! I like VI regexes...
Uh oh, sorry, wrong flamewar.

Isn't there a POSIX regex? Perhaps we could consider that, unless of
course it is well and truly broken. Secondly, I seem to remember a post
here in this same thread that said we already had regexes. Perhaps we
should move on.

Seriously, as part of a Perl extension to postgresql, perl regexes would
be the natural thing. But if we already have a regex package, I think
adding just perl regexes without perl, but requiring perl.so is uhmmm,
premature.
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
good, you'll have to ram them down people's throats." -- Howard Aiken
> I've noticed there are no less than 10^10 regex implementations.
> Is there a standard? Does ANSI have a regexp standard, or is there
> a regex standard in the ANSI SQL spec? What do we use?
>
> Personally, I'm a perl guy, so every time I have to bend my brain to
> some other regex syntax, I get a headache. As part of my perl PL
> package, perl regexps will be included as a set of operators.
>
> Is there interest in the release of perl-style regexp operators for
> postgres before the PL is completed? Note that this requires the
> entire perl library to be loaded when the operator is used (possibly
> expensive). But, if you have a shared perl library, this only has to
> happen once.
Well, not to bring this up for discussion again, but there is apparently
a Posix standard, and even better a free implementation:
Article 10705 of comp.os.linux.misc:
Newsgroups: gnu.announce,gnu.utils.bug,comp.os.linux.misc,alt.sources.d
Subject: Rx 1.9
Date: Wed, 10 Jun 1998 10:40:00 -0700 (PDT)
Approved: info-gnu@gnu.org
The latest version of Rx, 1.9, is available on the web at:
http://users.lanminds.com/~lord
ftp://emf.net/users/lord/src/rx-1.9.tar.gz
and at ftp://ftp.gnu.org/pub/gnu/rx-1.9.tar.gz and mirrors of that
site (see list below).
Rx is a regexp pattern matching library. The library exports these
functions which are standardized by Posix:
regcomp - compile a regexp
regexec - search for a match
regfree - release storage for a regexp
regerror - translate error codes to strings
The library exports many other functions as well, and does a lot
more than Posix requires.
RECENT CHANGES
1. Rx 1.9
Recent changes: More "dead code" was recently discarded,
and the remaining code simplified.
Benchmark comparisons to GNU regex and older
versions of Rx were added to the distribution.
0. Rx 1.8
Recent changes: Various bug-fixes and small performance improvements.
A great deal of "dead code" was recently discarded,
making the size of the Rx library smaller and the
source easier to maintain (in theory).
[ Most GNU software is compressed using the GNU `gzip' compression program.
Source code is available on most sites distributing GNU software.
Executables for various systems and information about using gzip can be
found at the URL http://www.gzip.org.
For information on how to order GNU software on CD-ROM and
printed GNU manuals, see http://www.gnu.org/order/order.html
or e-mail a request to: gnu@gnu.org
By ordering your GNU software from the FSF, you help us continue to
develop more free software. Media revenues are our primary source of
support. Donations to FSF are deductible on US tax returns.
The above software will soon be at these ftp sites as well.
Please try them before ftp.gnu.org as ftp.gnu.org is very busy!
A possibly more up-to-date list is at the URL
http://www.gnu.org/order/ftp.html
thanx -gnu@gnu.org
Here are the mirrored ftp sites for the GNU Project, listed by country:
United States:
California - labrea.stanford.edu/pub/gnu, gatekeeper.dec.com/pub/GNU
Hawaii - ftp.hawaii.edu/mirrors/gnu
Illinois - uiarchive.cso.uiuc.edu/pub/gnu (Internet address 128.174.5.14)
Kentucky - ftp.ms.uky.edu/pub/gnu
Maryland - ftp.digex.net/pub/gnu (Internet address 164.109.10.23)
Michigan - gnu.egr.msu.edu/pub/gnu
Missouri - wuarchive.wustl.edu/systems/gnu
New York - ftp.cs.columbia.edu/archives/gnu/prep
Ohio - ftp.cis.ohio-state.edu/mirror/gnu
Utah - jaguar.utah.edu/gnustuff
Virginia - ftp.uu.net/archive/systems/gnu
Africa:
South Africa - ftp.sun.ac.za/pub/gnu
The Americas:
Brazil - ftp.unicamp.br/pub/gnu
Canada - ftp.cs.ubc.ca/mirror2/gnu
Chile - ftp.inf.utfsm.cl/pub/gnu (Internet address 146.83.198.3)
Costa Rica - sunsite.ulatina.ac.cr/GNU
Mexico - ftp.uaem.mx/pub/gnu
Asia and Australia:
Australia - archie.au/gnu (archie.oz or archie.oz.au for ACSnet)
Australia - ftp.progsoc.uts.edu.au/pub/gnu
Japan - tron.um.u-tokyo.ac.jp/pub/GNU/prep
Japan - ftp.cs.titech.ac.jp/pub/gnu
Korea - cair-archive.kaist.ac.kr/pub/gnu (Internet address 143.248.186.3)
Thailand - ftp.nectec.or.th/pub/mirrors/gnu (Internet address - 192.150.251.32)
Europe:
Austria - ftp.univie.ac.at/packages/gnu
Czech Republic - ftp.fi.muni.cz/pub/gnu/
Denmark - ftp.denet.dk/mirror/ftp.gnu.org/pub/gnu
Finland - ftp.funet.fi/pub/gnu (Internet address 128.214.6.100)
France - ftp.univ-lyon1.fr/pub/gnu
France - ftp.irisa.fr/pub/gnu
Germany - ftp.informatik.tu-muenchen.de/pub/comp/os/unix/gnu/
Germany - ftp.informatik.rwth-aachen.de/pub/gnu
Germany - ftp.de.uu.net/pub/gnu
Greece - ftp.ntua.gr/pub/gnu
Greece - ftp.aua.gr/pub/mirrors/GNU (Internet address 143.233.187.61)
Ireland - ftp.ieunet.ie/pub/gnu (Internet address 192.111.39.1)
Netherlands - ftp.eu.net/gnu (Internet address 192.16.202.1)
Netherlands - ftp.nluug.nl/pub/gnu
Netherlands - ftp.win.tue.nl/pub/gnu (Internet address 131.155.70.100)
Norway - ugle.unit.no/pub/gnu (Internet address 129.241.1.97)
Spain - ftp.etsimo.uniovi.es/pub/gnu
Sweden - ftp.isy.liu.se/pub/gnu
Sweden - ftp.stacken.kth.se
Sweden - ftp.luth.se/pub/unix/gnu
Sweden - ftp.sunet.se/pub/gnu (Internet address 130.238.127.3)
Also mirrors the Mailing List Archives.
Switzerland - ftp.eunet.ch/mirrors4/gnu
Switzerland - sunsite.cnlab-switch.ch/mirror/gnu (Internet address 193.5.24.1)
United Kingdom - ftp.mcc.ac.uk/pub/gnu (Internet address 130.88.203.12)
United Kingdom - unix.hensa.ac.uk/mirrors/gnu
United Kingdom - ftp.warwick.ac.uk (Internet address 137.205.192.14)
United Kingdom - SunSITE.doc.ic.ac.uk/gnu (Internet address 193.63.255.4)
]
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
good, you'll have to ram them down people's throats." -- Howard Aiken
On Thu, 11 Jun 1998, David Gould wrote:
> Article 10705 of comp.os.linux.misc:
> Newsgroups: gnu.announce,gnu.utils.bug,comp.os.linux.misc,alt.sources.d
> Subject: Rx 1.9
> Date: Wed, 10 Jun 1998 10:40:00 -0700 (PDT)
> Approved: info-gnu@gnu.org
>
> The latest version of Rx, 1.9, is available on the web at:
>
> http://users.lanminds.com/~lord
> ftp://emf.net/users/lord/src/rx-1.9.tar.gz
> and at ftp://ftp.gnu.org/pub/gnu/rx-1.9.tar.gz and mirrors of that
> site (see list below).

The reason that we do not use this particular Regex package is that
*it* falls under the "Almighty GPL", which conflicts with our Berkeley
Copyright...

Now, if there is a standardized spec on this, though, what would it take
to change our Regex to follow it, *without* the risk of tainting our code
with GPLd code?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org