Обсуждение: SQL Where Like - Range it?!

Поиск
Список
Период
Сортировка

SQL Where Like - Range it?!

От
Sterling
Дата:
H-

I've found the docs on how to select a list of rows from a table were
all the records have a last name starting with 'W%'.
select * from table where last_name LIKE 'W%'

What I'd like to do is pull a list of records where there is a range of
last names; say from A - F.
select * from table where last_name LIKE 'A%' AND last_name LIKE 'F%' -
for example.

The above code I've tried for this doesn't seem to work as I'd expect it
too?
I've even done
select * from table where last_name LIKE 'A%' AND LIKE 'F%'

Can anyone provide some details or insights on how to accomplish this?

Thanks. Much appreciated.
-Sterling

Re: SQL Where Like - Range it?!

От
"Len Morgan"
Дата:
>What I'd like to do is pull a list of records where there is a range of
>last names; say from A - F.
>select * from table where last_name LIKE 'A%' AND last_name LIKE 'F%' -
>for example.
>
>The above code I've tried for this doesn't seem to work as I'd expect it
>too?

SELECT * FROM table WHERE last_name BETWEEN 'A' AND 'Fzzzzzzzzzzzzz' ;
worked for me.

You could also use BETWEEN 'A' AND 'G' to avoid all of the zzzzzzzzs at the
end.  Crude but effective.

len morgan


Re: SQL Where Like - Range it?!

От
will trillich
Дата:
On Fri, Apr 27, 2001 at 09:52:26AM -0500, Len Morgan wrote:
> >What I'd like to do is pull a list of records where there is a range of
> >last names; say from A - F.
> >select * from table where last_name LIKE 'A%' AND last_name LIKE 'F%' -
> >for example.
> >
> >The above code I've tried for this doesn't seem to work as I'd expect it
> >too?
>
> SELECT * FROM table WHERE last_name BETWEEN 'A' AND 'Fzzzzzzzzzzzzz' ;
> worked for me.
>
> You could also use BETWEEN 'A' AND 'G' to avoid all of the zzzzzzzzs at the
> end.  Crude but effective.

how about regex?

    select * from tbl where fld ~ '^[A-F]';

--
don't visit this page. it's bad for you. take my expert word for it.
http://www.salon.com/people/col/pagl/2001/03/21/spring/index1.html

will@serensoft.com
http://sourceforge.net/projects/newbiedoc -- we need your brain!
http://www.dontUthink.com/ -- your brain needs us!

Re: SQL Where Like - Range it?!

От
Lincoln Yeoh
Дата:
At 04:00 PM 27-04-2001 -0500, will trillich wrote:
>On Fri, Apr 27, 2001 at 09:52:26AM -0500, Len Morgan wrote:
>>
>> SELECT * FROM table WHERE last_name BETWEEN 'A' AND 'Fzzzzzzzzzzzzz' ;
>> worked for me.
>>
>> You could also use BETWEEN 'A' AND 'G' to avoid all of the zzzzzzzzs at the
>> end.  Crude but effective.
>
>how about regex?
>
>    select * from tbl where fld ~ '^[A-F]';

Would the regex query still use a basic text index?

If you're using US-ASCII encoding you could avoid the zzzzs by using a '~'
e.g.
SELECT * FROM table where last_name >='A' and last_name <='F~';

If not use a < 'G' instead of a <=.

Cheerio,
Link.



Re: Re: SQL Where Like - Range it?!

От
will trillich
Дата:
On Sat, Apr 28, 2001 at 10:53:08AM +0800, Lincoln Yeoh wrote:
> At 04:00 PM 27-04-2001 -0500, will trillich wrote:
> >On Fri, Apr 27, 2001 at 09:52:26AM -0500, Len Morgan wrote:
> >>
> >> SELECT * FROM table WHERE last_name BETWEEN 'A' AND 'Fzzzzzzzzzzzzz' ;
> >> worked for me.
> >>
> >> You could also use BETWEEN 'A' AND 'G' to avoid all of the zzzzzzzzs at the
> >> end.  Crude but effective.
> >
> >how about regex?
> >
> >    select * from tbl where fld ~ '^[A-F]';
>
> Would the regex query still use a basic text index?

apparently it does IF you use the 'anchor-at-beginning'
construct, namely the "^":

    fld ~ '^[A-F]' -- STARTS with A,B,C,D,E, or F
    fld ~ '[A-F]'  -- merely contains A,B,C,D,E, or F
    fld ~ '[A-F]$' -- ENDS with A-F

if starts-with (^) then it uses the index. so i hear.

> If you're using US-ASCII encoding you could avoid the zzzzs by using a '~'

that condition should let you know that you're hard-coding
something that you probably shouldn't, if you're interested in
portability...

> e.g.
> SELECT * FROM table where last_name >='A' and last_name <='F~';

eek. avoid. breaks on ebcdic, maybe unicode, possibly certain
latin implementations....

> If not use a < 'G' instead of a <=.

much wiser...

--
don't visit this page. it's bad for you. take my expert word for it.
http://www.salon.com/people/col/pagl/2001/03/21/spring/index1.html

will@serensoft.com
http://sourceforge.net/projects/newbiedoc -- we need your brain!
http://www.dontUthink.com/ -- your brain needs us!

Re: SQL Where Like - Range it?!

От
Sterling
Дата:
H-

Thank you to all that have replied.

Unfortunately none of the examples provided worked for me on my system.

I tried:
SELECT * from table where last_name BETWEEN 'A' AND 'G';
SELECT * from table where last_name >='A' AND last_name<'G';
SELECT * from table where last_name ~* '^[A-F]';

to no avail. 8^(

I didn't find a between function in the postgres documentation.

Is that some special function? I'm using PostgreSQL 6.5.3.

If anyone has any further thoughts please drop a line.
Thanks again and hopefully something will work out.
-Sterling



en Morgan wrote:
>
> >What I'd like to do is pull a list of records where there is a range of
> >last names; say from A - F.
> >select * from table where last_name LIKE 'A%' AND last_name LIKE 'F%' -
> >for example.
> >
> >The above code I've tried for this doesn't seem to work as I'd expect it
> >too?
>
> SELECT * FROM table WHERE last_name BETWEEN 'A' AND 'Fzzzzzzzzzzzzz' ;
> worked for me.
>
> You could also use BETWEEN 'A' AND 'G' to avoid all of the zzzzzzzzs at the
> end.  Crude but effective.
>
> len morgan
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly

Re: SQL Where Like - Range it?!

От
Sterling
Дата:
H-

Thanks again.

Being slow at times I know why ~* '^[A-F]' wasn't working.

Yeah, that's right. There weren't any records in that range. I moved it
up and pulled some records in a range of A-P and it worked as expected.

Thanks to all that wrote in and replied. I understand why the LIKE 'A%'
AND LIKE 'F%' wasn't working now also.
Or why it wouldn't work if there were records there.

Hope everyone has a good weekend.
Back to the code.
-Sterling


Sterling wrote:
>
> H-
>
> Thank you to all that have replied.
>
> Unfortunately none of the examples provided worked for me on my system.
>
> I tried:
> SELECT * from table where last_name BETWEEN 'A' AND 'G';
> SELECT * from table where last_name >='A' AND last_name<'G';
> SELECT * from table where last_name ~* '^[A-F]';
>
> to no avail. 8^(
>
> I didn't find a between function in the postgres documentation.
>
> Is that some special function? I'm using PostgreSQL 6.5.3.
>
> If anyone has any further thoughts please drop a line.
> Thanks again and hopefully something will work out.
> -Sterling
>
> en Morgan wrote:
> >
> > >What I'd like to do is pull a list of records where there is a range of
> > >last names; say from A - F.
> > >select * from table where last_name LIKE 'A%' AND last_name LIKE 'F%' -
> > >for example.
> > >
> > >The above code I've tried for this doesn't seem to work as I'd expect it
> > >too?
> >
> > SELECT * FROM table WHERE last_name BETWEEN 'A' AND 'Fzzzzzzzzzzzzz' ;
> > worked for me.
> >
> > You could also use BETWEEN 'A' AND 'G' to avoid all of the zzzzzzzzs at the
> > end.  Crude but effective.
> >
> > len morgan
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 3: if posting/reading through Usenet, please send an appropriate
> > subscribe-nomail command to majordomo@postgresql.org so that your
> > message can get through to the mailing list cleanly

Re: SQL Where Like - Range it?!

От
Ashley Clark
Дата:
* will trillich in "Re: Re: SQL Where Like - Range it?!" dated
* 2001/04/28 03:17 wrote:

> apparently it does IF you use the 'anchor-at-beginning'
> construct, namely the "^":
>
>     fld ~ '^[A-F]' -- STARTS with A,B,C,D,E, or F
>     fld ~ '[A-F]'  -- merely contains A,B,C,D,E, or F
>     fld ~ '[A-F]$' -- ENDS with A-F
>
> if starts-with (^) then it uses the index. so i hear.

Being the curious sort that I am I tried a few things and got some more
questions.

db=# explain SELECT name from builders where name ~ '^A' or name ~ '^B';
NOTICE:  QUERY PLAN:

Index Scan using builders_name_key, builders_name_key on builders
(cost=0.00..10.25 rows=16 width=12)

EXPLAIN
db=# explain SELECT name from builders where name ~ '^[AB]';
NOTICE:  QUERY PLAN:

Seq Scan on builders  (cost=0.00..9.44 rows=355 width=12)

EXPLAIN

These are the same query, why would the one using index scan have a
higher cost that the combined condition query? Shouldn't they be the
same? And which one is faster/scales better?

And one more question

db=# explain SELECT name from builders where name like 'A%' or name
     like 'B%';
NOTICE:  QUERY PLAN:

Index Scan using builders_name_key, builders_name_key on builders
(cost=0.00..10.25 rows=16 width=12)

EXPLAIN

Does the similarity of these numbers to the first ones above have any
significance or is it just coincidence?

--
ashley clark

Вложения

Re: Re: SQL Where Like - Range it?!

От
Tom Lane
Дата:
Ashley Clark <aclark@ghoti.org> writes:
> db=3D# explain SELECT name from builders where name ~ '^A' or name ~ '^B';
> NOTICE:  QUERY PLAN:

> Index Scan using builders_name_key, builders_name_key on builders
> (cost=3D0.00..10.25 rows=3D16 width=3D12)

> db=3D# explain SELECT name from builders where name ~ '^[AB]';
> NOTICE:  QUERY PLAN:

> Seq Scan on builders  (cost=3D0.00..9.44 rows=3D355 width=3D12)

> These are the same query, why would the one using index scan have a
> higher cost that the combined condition query?

Always remember that the cost estimates quoted by EXPLAIN are estimates,
not reality.

In this case the reason for the difference is that the planner doesn't
have any detailed understanding of the semantics of bracket-expressions
in regexps, so it doesn't realize that ^[AB] could usefully use an
index.  It wants to see ^ followed by at least one character of fixed
pattern before it will think about an indexscan ...

            regards, tom lane

Re: Re: SQL Where Like - Range it?!

От
will trillich
Дата:
On Mon, Apr 30, 2001 at 01:11:21AM -0400, Tom Lane wrote:
> Always remember that the cost estimates quoted by EXPLAIN are estimates,
> not reality.
>
> In this case the reason for the difference is that the planner doesn't
> have any detailed understanding of the semantics of bracket-expressions
> in regexps, so it doesn't realize that ^[AB] could usefully use an
> index.  It wants to see ^ followed by at least one character of fixed
> pattern before it will think about an indexscan ...

so the query planner doesn't use indexes for ~'^[A-F]' then?
(meaning, such regex queries are all sequential scans...)

or are you saying that just the explainer routines don't?
(meaning, explain doesn't grok as well as the actual fetch-
and-match routines do...)

hmm?

--
don't visit this page. it's bad for you. take my expert word for it.
http://www.salon.com/people/col/pagl/2001/03/21/spring/index1.html

will@serensoft.com
http://sourceforge.net/projects/newbiedoc -- we need your brain!
http://www.dontUthink.com/ -- your brain needs us!

Regular Expression INTRO [long] [OT]

От
will trillich
Дата:
wow -- this was nearly a year ago and i still haven't
done it!

Justin Clift wrote:
> Would you mind taking a few moments and writing a guide on using Regex
> functions in PostgreSQL?
>
> This is to put on the techdocs.postgresql.org website so everyone in the
> PostgreSQL community will have access to it.

i'm posting to pgsql-general so the community
can help find holes and plug them...

here goes --






Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):


LIKE
====

You're probably familiar with the WHERE clause
of the SELECT statement --

    SELECT somefield FROM table WHERE somefield = somevalue;

For example, to show all customers within the
post code "60201" you could try

    SELECT * FROM cust WHERE zip = '60201';

But if you also have nine-digit zips stored in the zip
field (in the U.S.A. we have standard five-digit postal
codes and also a four-digit extension for further
geographic resolution) then this search will OMIT
customers having more-specific zips such as '60201-9876'
or '60201-1234'.


The Percent "%"
---------------

Try using LIKE instead of = (EQUALS):

    SELECT * FROM cust WHERE zip LIKE '60201%';

With LIKE, your search values can use the percent (%)
to tell PostgreSQL that anything can occupy that spot
-- one character, twelve symbols or zero digits -- and
still satisfy the search.

On Unix or Linux, this is basically the same as the
ASTERISK (*) at the command line, when dealing with
file names:

    # list all files whose names begin with '.bash'
    ls .bash*

    # list all files containing 'out' anywhere in the name
    ls *out*

    # list all file names ending with '.pl'
    ls *.pl

    # list file starting with 'proj', ending with '.c'
    ls proj*.c

With PostgreSQL using the LIKE operator, use the
percent, instead:

    -- list all customers within the 47610 postal code
    SELECT * FROM cust WHERE zip LIKE '47610%';

    -- display customers who have 'Corp' in their names
    SELECT * FROM cust WHERE name LIKE '%Corp%';

    -- show customers whose names end in 'LLC'
    SELECT * FROM cust WHERE name LIKE '%LLC';

    -- documents beginning with 'We', ending with 'rica'
    SELECT * FROM doc WHERE contents LIKE 'We%rica';

Wherever the '%' appears (using the LIKE operator)
Postgres allows anything at all to match -- from a
lengthy string of text, to one single character, to
a zero-length string -- i.e. nothing at all.

    ...ILIKE 'A%Z'
    -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A@$*Z'


The Underscore "_"
------------------

For the LIKE operator, the UNDERSCORE (_) takes on the same
meaning as the question mark does to shell operations for
Unix and Linux file names:

    # list files starting with dot, followed by at least two chars
    ls .??*

The underscore matches just one character exactly:

    SELECT * FROM atable WHERE afield LIKE '_';
    -- shows records where afield is exactly one character
    -- omitting ones where it's blank or has two (or more)
    -- characters in it)

You can restrict your searches to finding fields
of certain lengths this way:

    SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________';
    -- display all five- or nine-character zip codes

Combine it with percent to find fields over a certain length:

    SELECT * FROM atable WHERE afield LIKE '___%';
    -- display records where afield has three or more characters

Or you can accomodate some language idiosyncracies, as
well:

    SELECT * FROM activity WHERE venue LIKE 'Theat__';
    -- find u.k. 'theatre' and u.s. 'theater'


ILIKE
=====

But what if a customer's name is all capitals, as in
'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't
find it, because uppercase 'O' is not identical to
lowercase 'o'!

Finding text independent of uppercase/lowercase
is something we often need to do. And so we have
the ILIKE operator.

Use ILIKE instead of LIKE when your search should
ignore case -- the "I" stands for case-[I]nsensitive:

    -- show all customers with 'corp' in the name
    SELECT * FROM cust WHERE name ILIKE '%Corp%';
    -- ignoring whether the field is upper- or lowercase

That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP'
and even 'Amalgamated Switch and Relay corporation'
because anything can ('%') precede or ('%') follow the
string 'Corp' which can be uppercase or lowercase.

But there's even more power in "regular expressions"!
See below...


Notes
-----

Always include some real data to search for!

    -- show everything, doing a lot of unnecessary work:
    SELECT * FROM atable WHERE afield LIKE '%';

It would be rather silly to make PostgreSQL compare
every record to see if it matched 'anything goes,
in "afield"'. Just say no.

Convert your spoken-language request to a LIKE
clause like this:

    "afield ENDS WITH avalue"
    reword => "afield starts with anything, ending with avalue"
    sql => "... WHERE afield LIKE '%avalue'"

    "afield STARTS WITH avalue"
    reword => "afield starts with avalue, ending with anything"
    sql => "... WHERE afield LIKE 'avalue%'"

See how that works?

    "afield CONTAINS avalue"
    sql => "... WHERE afield LIKE '%avalue%'"

If you're looking for something at the beginning of
a field, that's where your data goes, and you END with
the percent. If you're looking for something at the end
of a field, put your data there and allow anything at
the front by putting the percent there.

ALSO -- if your field is indexed, you'll defeat the index
unless you anchor your search to the beginning of the field.

    -- find fields starting with 'something'
    ... afield LIKE 'something%'; -- uses afield's index

The example below isn't anchored at the start of the field,
so it does no good to use the index:

    -- find fields containing 'pooh'
    ... afield LIKE '%pooh%'; -- can't use afield's index

Afield must contain 'pooh' BUT it could be anywhere in
the field. The index would be useless.


LIMITATION
==========

Okay. Let's say you have a full-name field that has
first, middle, and last name all in the one field.
Sometimes you'll have a middle name, sometimes
you won't. Sometimes the first name will be spelled
out, other times it'll be shortened or even abbreviated
to just the first initial.

Here's how you'd search that field for "Abraham Lincoln"
using the LIKE operator:

    SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';

It'll find 'Abraham Lincoln' and 'A. Lincoln' and
'Abe Lincoln' because anything can go between the
initial 'A' and the ending 'Lincoln'.

But -- it'll also match 'Andrew M. Lincoln' and 'Abner
Sasquatch Lincoln' as well... and this is why some bright
souls created regular expressions!


REGULAR EXPRESSIONS
===================

A regular expression is a pattern to search for.
The structure of the pattern has to match a
rigid set of rules so that PostgreSQL will know
what you're trying to search for.

This introduction will barely scratch the surface,
but hopefully it'll get you started--

If you've ever tinkered with perl, you've probably
encountered regular expressions:

    perl -ne 'print if /[A-Za-z]/;' somefile

Here, the "[A-Za-z]" is a regular expression.
It matches any line containing uppercase (A-Z)
or lowercase (a-z) letters. In this example, the
lines will come from the file "somefile" and
lines that match will be printed out.

Well, PostgreSQL can do that, too!


Example
-------

    SELECT * FROM person
    WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';

The REGULAR EXPRESSION operator is the TILDE (~).
The value you're searching for then must be a valid
regular expression. In the above example, we're asking
PostgreSQL to display all rows containing

    'A'

followed by EITHER

    'be'
OR
    'braham'
OR
    '\\.' -- a dot

which is then followed by

    ' ' -- a space
    '*' -- zero or more times

followed by

    'Lincoln'

and that's all. Whew!

As you can guess, these will all successfully match:

    'A. Lincoln'
    'AbeLincoln' -- without any space at all
    'Abraham    Lincoln' -- with lots of spaces

and these won't match:

    'A Lincoln' -- no dot, no 'braham', no 'be'
    'Abe Gump Lincoln' -- nothing will match 'Gump'


Quoting "\\"
------------

So what's with the BACKSLASH in front of the dot? Well,
just as LIKE has the UNDERSCORE (_) to denote "any single
character", REGULAR EXPRESSIONS use the DOT (.) for
that very same purpose. So we have to "escape" the
dot to alter its normal interpretation, using the.

Note that string literals in PostgreSQL already use the
backslash, so you'll have to double any backslash
you use:

    psql=# SELECT * FROM person
    psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'
    psql-# -- a second backslash added
    psql-# ;


Grouping "()"
-------------

How about the PARENTHESES () and the BAR |, hmm? Well,
that's how you can group permitted alternatives:

    ... ~ 'A(be|braham|\\.)'

That says that 'A' can be followed by any of the three
sub-expressions within the parentheses, namely 'be' or
'braham' or <dot>.

These all produce the same matches:

    ... ~ 'A(be|braham|\\.)'
    ... ~ 'Abe|Abraham|A\\.'
    ... ~ 'A(b(e|raham)|\\.)'

Let's dissect this last one -- it specifies that:

    'A'

can be followed by either

    'b' followed by some more stuff
or
    <dot>

nestling into the details following the 'b' -- if
'A' is followed by 'b', then the 'b' must be followed
by either

    'e' -- which matches 'Abe'
or
    'raham' -- which matches 'Abraham'

Do you see how powerful this can be?


Character Classes "[]"
----------------------

So you're trying to find "McAndrews" -- but maybe it's
spelled with a lower-case "a" as in "Mcandrews", instead!

    SELECT * FROM person
    WHERE fullname ~ 'Mc[Aa]ndrews';

Using the BRACKETS tells the regular expression
parser that you're allowing any one of a whole class
of characters in that spot.

    '[Aa]'

It specifies that you're looking for either an uppercase
'A' or a lowercase 'a' in that spot.

Here's how you can find fields containing vowels:

    ... ~ '[AEIOUaeiou]'

Search for fields containing lowercase letters:

    ... ~ '[abcdefghijklmnopqrstuvwxyz]'

Of course, there's a shortcut for specifying character
classes that cover a whole range:

    ... ~ '[a-z]' -- also matches any lowercase letter

Display fields that contain digits:

    ... ~ '[0-9]'

Here's an example using a standard U.S. phone pattern
(neglecting the area code for clarity and space):

    SELECT * FROM cust
    WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';

That will show customers whose "descr" field contains a
pattern of digits that looks like a U.S. phone number:

    '[0-9][0-9][0-9]' -- three digits

followed by

    '-' -- a hyphen

followed by

    '[0-9][0-9][0-9][0-9]' -- four digits

As you can see, the hyphen doesn't do anything special
for a regular expression unless it's inside the square
brackets of a character class -- in which case it
means "anything between".

What if you want to allow a hyphen within a character class?
Simply make it the first character inside the brackets:

    '[-.,_]'

That class will match any hyphen, dot, comma or underscore.


Bounds "{}"
-----------

Rather than having to specify each of those [0-9]
monstrosities every time, let's abbreviate. Instead
of:

    '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'

we can just say

    '[0-9]{3}-[0-9]{4}'.

The BOUNDS specification, inside braces, comes immediately
after whatever you want a certain number of. That is,
first you specify what you're looking for, then you specify
how many you need. (The default is {1,1} one, as you can tell:
one 'A' followed by one 'b' followed by...)

Inside the BRACES that specify your "bounds" you can
say you want exactly so many, as we did above (3 and
then 4) or you can give a low-to-high pair:

    ... ~ 'Z{3}' -- need three Z's
    ... ~ '@{5,}' -- five or more @'s
    ... ~ 'Q{2,5}' -- at least two, at most five, Q's
    ... ~ '(This|That){0,3}'

The last one allows for 'This' or 'That', anywhere from
zero to three times. So 'ThisThatThis' and '' will match!
Convenient, hmm?

And for very common cases, there are handy abbreviations:

    'x{1,1}'  'x'    one (the default)
    'r{0,1}'  'r?'   zero or one, i.e. 'x' is optional
    'B{0,}'   'B*'   zero or more
    'z{1,}'   'z+'   one or more

These three abbreviations will pop up often:

    -- allow 'Ms' and 'Mrs' (the 'r' is optional):
    ... title ~ '(Mr|Mr?s|Dr)'

    -- maybe there's a space, maybe there isn't:
    ... lname ~ 'Mac ?Affee'

    -- don't care how many trailing spaces:
    ... lname ~ 'Smith *'

    -- find records with two or more even digits:
    ... afield ~ '[02468]{2,}'
or
    ... afield ~ '[02468][02468]+'

That last one breaks down to

    '[02468]' -- an even digit

followed by

    '[02468]' -- an even digit
    '+' -- one or more times

Remember: the plus means '{1,}' -- i.e. that the preceeding
item must match one or more times.

Getting back to our phone number pattern... this is
the same request as above, but it's easier to
see what's going on, using the {bounds} feature:

    SELECT * FROM cust
    WHERE descr ~ '[0-9]{3}-[0-9]{4}';
    -- three digits, hyphen, four digits


Excluding Characters "[^]"
--------------------------

Note that the above pattern will also match 00000-0000
because the last three digits of the five in the left
chunk, with the four digits of the right chunk, fit
perfectly into the constraints of what we asked for.
And this ain't no phone number.

Here's an approach to fixing that snag:

    SELECT * FROM cust
    WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}';

Note that we've prefixed the previous pattern with

    '[^0-9]'

because, within the square brackets of a character class,
the CARAT (^) means "anything EXCEPT..."

So now, 00000-0000 will NOT match; this is what we're
after.

But there's a new problem: if the phone number is the
very first thing in the "descr" field it would never
match because we're demanding that there be SOMETHING
(besides a digit) before the phone number. What to do?


At The Very Beginning "^"
-------------------------

You can match "beginning-of-field" with the CARAT (^).
(When it's the first thing inside square brackets, it
negates the character class; outside brackets, a carat
means "beginning-of-text". Got that? It's important!)

    SELECT * FROM cust
    WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}';

Here's what this pattern looks for:

Either
    '^' -- at the very beginning of text
or
    '[^0-9]' -- find something that's a non-digit

followed by

    '[0-9]{3}' -- three digits

and then a

    '-' -- hyphen

and finally

    '[0-9]{4}' -- four digits

and we've solved the 'at-beginning-of-field' problem.


At The Very End "$"
-------------------

Still, we could inadvertently match something like
000-000000000 (probably an id number for a replacement
part) which we don't want to do.

Note how this is the same problem, for the end
of the field, that we had earlier, with the beginning
of the field? The solution is similar:

    SELECT * FROM cust
    WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])';

The DOLLAR SIGN ($) represents 'end-of-text' just as
carat, outside of square brackets, means 'beginning-of-
text'. The above pattern is the same as before, except
we appended

    '($|[^0-9])'

to the end, which means we're looking for

either

    '$' -- at the very end-of-text
or
    '[^0-9]' -- something that's not a digit

after our previous pattern.

And, finally, we have a comprehensive phone pattern --
it'll match ###-#### while excluding other numeric
arrangements -- in a text field.


Anchoring
---------

Did you notice that regular expressions aren't "anchored",
as the LIKE expressions are? To specify that you're looking
only at the beginning of a field, you need to use '^' and
to include the end of a field, you must use '$'.

It's a good thing regular expressions aren't anchored
automatically -- we wouldn't have been able to specify
our phone number pattern above! Things like '(^|[^0-9])'
are possible because YOU get to specify when and how
beginning-of-text is required. Cool, eh?

And, just as in the LIKE operator, unless regular expressions
ARE anchored at the beginning of a field, you'll defeat any
index you have for that field. Indexes help alphabetize by
comparing the beginning of your fields to each other; unless
you're looking at the beginning of the field, your index can't
be used.

Here are some comparisons between operators LIKE and ~ :

    -- list all customers within the 47610 postal code:
    SELECT * FROM cust WHERE zip LIKE '47610%';
        -- zip begins with '47610' then has anything
    SELECT * FROM cust WHERE zip ~ '^47610';
        -- zip begins with '47610'

    -- display customers who have 'Corp' in their names
    SELECT * FROM cust WHERE name LIKE '%Corp%';
        -- name contains 'Corp' with anything before and after
    SELECT * FROM cust WHERE name ~ 'Corp';
        -- name contains 'Corp'

    -- show customers whose names end in 'LLC'
    SELECT * FROM cust WHERE name LIKE '%LLC';
        -- name can have anything, with 'LLC' at the end
    SELECT * FROM cust WHERE name ~ 'LLC$';
        -- name must have 'LLC' at the end

    -- documents beginning with 'We', ending with 'rica'
    SELECT * FROM doc WHERE contents LIKE 'We%rica';
        -- starts with 'We', has anything, ending with 'rica'
    SELECT * FROM doc WHERE contents ~ '^We.*rica$';
        -- start with 'We', zero or more chars, end with 'rica'

Remember that in regular expressions, DOT means 'any character'.
Thus '.*' means 'any character, zero times or more' which, in
English, means "anything can go here, including nothing at all".

There are ways to anchor your searches to word boundaries,
as well -- not just beginning-of-field and end-of-field. See
your documentation for details.


Case-Insensitive "~*"
---------------------

If you're not worried about differentiating between
uppercase and lowercase in your regular expressions,
you could go full boar and try

    SELECT * FROM cust
    WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]';

Fortunately there's a quickie to make it easier for you --
where you've been using the tilde (~) as your operator, use
tilde-star (~*) instead:

    SELECT * FROM cust
    WHERE lname ~* 'Octavian';

ILIKE is case-insensitive (where LIKE is case-specific) just
as ~* is case-insensitive for regular expressions (where ~ is
case-specific). Very handy!


Example
-------

Email addresses can look like any of these:

    me@my.net
    someone-unimportant9237@this.little.org
    first.last.title@obscure.sub-net.biggie.com
    _weird_@somewhere.out_there.net

A reasonably-functional email address pattern might be
something like this:

    '[a-z0-9_]+([\\-\\.][a-z0-9_]+)*@[a-z0-9_]+([\\-\\.][a-z0-9_]+)+'

That breaks down to (using case-indifferent via ~*, of course):

    '[a-z0-9_]'

any alphanumeric (including underscore) character

    '+'

one or more times, followed by

    '([\\-\\.][a-z0-9_]+)'

email stuff [1],

    '*'

zero or more times, followed by

    '@'

which is then followed by

    '[a-z0-9_]'

alphanumerics (including underscore)

    '+'

one or more times, followed by

    '([\\-\\.][a-z0-9_]+)'

site stuff [2]

    '+'

one or more times.


As for [1] email stuff, it can be (zero or more of):

    '[\\-\\.]'

hyphen or dot, followed by

    '[a-z0-9_]'

alphanumerics (or underscore)

    '+'

one or more times.


And as for [2] site stuff, it can be (one or more of):

    '[\\-\\.]'

hyphen or dot, followed by

    '[a-z0-9_]'

alphanumerics (or underscore)

    '+'

one or more times.

So the personal part of the address can just be a
single word, with optional dot-or-hyphen joining other
words (ZERO or more) onto it; this is followed by at-sign,
which is followed by the site portion of the address,
namely: any word, followed by ONE or more dot-or-hyphen-
followed-by-another-word combo's.

It's not a perfect match for all legal email patterns,
(for example, bad@my-addr would match) but something like
it might be sufficient depending on your requirements.


CONCLUSION
==========

Regular expressions are complicated, because they're
powerful! (Or is it the other way around?)

Check out "Pattern Matching" in the PostgreSQL manual
for more details -- on my Debian "Potato" system it's in
html/postgres/functions-matching.html under the
/usr/share/doc/postgresql-doc/ directory.

--
mailto:will@serensoft.com
http://www.dontUthink.com/

Re: Regular Expression INTRO [long] [OT]

От
will trillich
Дата:
On Tue, Apr 09, 2002 at 06:05:43PM -0600, will trillich wrote:
[snip]
> With LIKE, your search values can use the percent (%)
> to tell PostgreSQL that anything can occupy that spot
> -- one character, twelve symbols or zero digits -- and
> still satisfy the search.
>
> On Unix or Linux, this is basically the same as the
> ASTERISK (*) at the command line, when dealing with
> file names:
>
>     # list all files whose names begin with '.bash'
>     ls .bash*
>
>     # list all files containing 'out' anywhere in the name
>     ls *out*
>
>     # list all file names ending with '.pl'
>     ls *.pl
>
>     # list file starting with 'proj', ending with '.c'
>     ls proj*.c

i'm torn between leaving in a full set of shell-glob samples to
pair up with postgres-like samples, and using only one (as
recommended by Holger Klawitter -- thanks!)...

> With PostgreSQL using the LIKE operator, use the
> percent, instead:
>
>     -- list all customers within the 47610 postal code
>     SELECT * FROM cust WHERE zip LIKE '47610%';
>
>     -- display customers who have 'Corp' in their names
>     SELECT * FROM cust WHERE name LIKE '%Corp%';
>
>     -- show customers whose names end in 'LLC'
>     SELECT * FROM cust WHERE name LIKE '%LLC';
>
>     -- documents beginning with 'We', ending with 'rica'
>     SELECT * FROM doc WHERE contents LIKE 'We%rica';
>
> Wherever the '%' appears (using the LIKE operator)
> Postgres allows anything at all to match -- from a
> lengthy string of text, to one single character, to
> a zero-length string -- i.e. nothing at all.
>
>     ...ILIKE 'A%Z'
>     -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A@$*Z'

whoops -- this shoulda been LIKE, not ILIKE (at least not yet)

[snip]

> So what's with the BACKSLASH in front of the dot? Well,
> just as LIKE has the UNDERSCORE (_) to denote "any single
> character", REGULAR EXPRESSIONS use the DOT (.) for
> that very same purpose. So we have to "escape" the
> dot to alter its normal interpretation, using the.

using the what? um, "...using the backslash." oops!

[snip]

> And for very common cases, there are handy abbreviations:
>
>     'x{1,1}'  'x'    one (the default)
>     'r{0,1}'  'r?'   zero or one, i.e. 'x' is optional

should be "'r' is optional". (thanks, Holger)

>     'B{0,}'   'B*'   zero or more
>     'z{1,}'   'z+'   one or more

may as well flesh these out, to be parallel with the others:

    'B{0,}'   'B*'   zero or more B's
    'z{1,}'   'z+'   one or more z's

[snip]

> Note that we've prefixed the previous pattern with
>
>     '[^0-9]'
>
> because, within the square brackets of a character class,
> the CARAT (^) means "anything EXCEPT..."
>
> So now, 00000-0000 will NOT match; this is what we're
> after.

maybe i could be more clear, by saying:

    So now, our pattern will NOT match something like
    00000-0000, and since we don't want it to, we're making
    progress.

[snip]

> And, just as in the LIKE operator, unless regular expressions
> ARE anchored at the beginning of a field, you'll defeat any
> index you have for that field. Indexes help alphabetize by
> comparing the beginning of your fields to each other; unless
> you're looking at the beginning of the field, your index can't
> be used.

some of you asked if regexes are actually able to use the index.
well, here's the answer: if they're anchored at the front, YES!

    EXPLAIN
    repo=# explain
    repo-# select * from _table_1015197075 where str ~ '^f';
    NOTICE:  QUERY PLAN:

    Index Scan using _table_ix on _table_1015197075  (cost=0.00..2.01 rows=1 width=24)

    EXPLAIN
    repo=# explain
    repo-# select * from _table_1015197075 where str ~ 'f';
    NOTICE:  QUERY PLAN:

    Seq Scan on _table_1015197075  (cost=100000000.00..100000001.03 rows=1 width=24)

to be honest, i "SET ENABLE_SEQSCAN = false" and created a
quickie table with about fifty rows, with an index on the "str"
column.

with '^f' it DID use the index; without the carat, 'f', it
absolutely COULD NOT use the index... imagine looking for words
in the dictionary that /contain/ the letter 'f', right? (no
mystery here -- but whether the '^f' COULD use the index was the
question, of course -- and it did).

cool. (i suspected, but wasn't certain.)

[snip]

> There are ways to anchor your searches to word boundaries,
> as well -- not just beginning-of-field and end-of-field. See
> your documentation for details.

rewording is in order here -- how about:

    There are ways to anchor your searches to word boundaries,
    as well -- you're not restricted to testing only for
    beginning-of-field or end-of-field. (In fact, that's a more
    likely solution to the problem than what I devised here.)
    See your documentation for information on [[:<:]] and
    [[:>:]].

[snip]

> As for [1] email stuff, it can be (zero or more of):
>
>     '[\\-\\.]'
>
> hyphen or dot, followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
>     '+'
>
> one or more times.

how about a little added explanation for clarity?

    This means an address (before the @domain.com) can be
    alphanumerics only, or if it contains a dot or hyphen that
    more alphanumerics must follow the dot or hyphen. So 'me'
    and 'my.self' and 'albert.einstein-newton-john.jr' are all
    acceptable to this part of the pattern.

> And as for [2] site stuff, it can be (one or more of):
>
>     '[\\-\\.]'
>
> hyphen or dot, followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
>     '+'
>
> one or more times.

more clarity -- something like:

    So the site stuff means that the domain (@some.where) MUST
    contain at least one dot or hyphen, each of which must be
    followed by alphanumerics. These would NOT match:

        'me..you' -- no alphanumerics after first dot
        'oops-' -- no alphanumerics after hyphen
        '-eesh' -- need alphanumerics before hyphen

    To be strict, that last one WILL match the 'site stuff'
    fragment of the pattern, but it'll fail in the whole pattern,
    because after the '@' we require SOME alphanumerics.


feel free to post comments/feedback to pgsql-user...

--
"We will fight them on the beaches, we will fight them on the
sons of beaches" -- Miguel Churchill, Winston's bastard Mexican
brother.
--lifted from http://www.astray.com/acmemail/stable/documentation.xml


will@serensoft.com
http://sourceforge.net/projects/newbiedoc -- we need your brain!
http://www.dontUthink.com/ -- your brain needs us!

Re: Regular Expression INTRO [long] [OT]

От
Bruce Momjian
Дата:
Justin, have added this to techdocs?  I have a followup email here with
some clarifications.

---------------------------------------------------------------------------

will trillich wrote:
> wow -- this was nearly a year ago and i still haven't
> done it!
>
> Justin Clift wrote:
> > Would you mind taking a few moments and writing a guide on using Regex
> > functions in PostgreSQL?
> >
> > This is to put on the techdocs.postgresql.org website so everyone in the
> > PostgreSQL community will have access to it.
>
> i'm posting to pgsql-general so the community
> can help find holes and plug them...
>
> here goes --
>
>
>
>
>
>
> Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):
>
>
> LIKE
> ====
>
> You're probably familiar with the WHERE clause
> of the SELECT statement --
>
>     SELECT somefield FROM table WHERE somefield = somevalue;
>
> For example, to show all customers within the
> post code "60201" you could try
>
>     SELECT * FROM cust WHERE zip = '60201';
>
> But if you also have nine-digit zips stored in the zip
> field (in the U.S.A. we have standard five-digit postal
> codes and also a four-digit extension for further
> geographic resolution) then this search will OMIT
> customers having more-specific zips such as '60201-9876'
> or '60201-1234'.
>
>
> The Percent "%"
> ---------------
>
> Try using LIKE instead of = (EQUALS):
>
>     SELECT * FROM cust WHERE zip LIKE '60201%';
>
> With LIKE, your search values can use the percent (%)
> to tell PostgreSQL that anything can occupy that spot
> -- one character, twelve symbols or zero digits -- and
> still satisfy the search.
>
> On Unix or Linux, this is basically the same as the
> ASTERISK (*) at the command line, when dealing with
> file names:
>
>     # list all files whose names begin with '.bash'
>     ls .bash*
>
>     # list all files containing 'out' anywhere in the name
>     ls *out*
>
>     # list all file names ending with '.pl'
>     ls *.pl
>
>     # list file starting with 'proj', ending with '.c'
>     ls proj*.c
>
> With PostgreSQL using the LIKE operator, use the
> percent, instead:
>
>     -- list all customers within the 47610 postal code
>     SELECT * FROM cust WHERE zip LIKE '47610%';
>
>     -- display customers who have 'Corp' in their names
>     SELECT * FROM cust WHERE name LIKE '%Corp%';
>
>     -- show customers whose names end in 'LLC'
>     SELECT * FROM cust WHERE name LIKE '%LLC';
>
>     -- documents beginning with 'We', ending with 'rica'
>     SELECT * FROM doc WHERE contents LIKE 'We%rica';
>
> Wherever the '%' appears (using the LIKE operator)
> Postgres allows anything at all to match -- from a
> lengthy string of text, to one single character, to
> a zero-length string -- i.e. nothing at all.
>
>     ...ILIKE 'A%Z'
>     -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A@$*Z'
>
>
> The Underscore "_"
> ------------------
>
> For the LIKE operator, the UNDERSCORE (_) takes on the same
> meaning as the question mark does to shell operations for
> Unix and Linux file names:
>
>     # list files starting with dot, followed by at least two chars
>     ls .??*
>
> The underscore matches just one character exactly:
>
>     SELECT * FROM atable WHERE afield LIKE '_';
>     -- shows records where afield is exactly one character
>     -- omitting ones where it's blank or has two (or more)
>     -- characters in it)
>
> You can restrict your searches to finding fields
> of certain lengths this way:
>
>     SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________';
>     -- display all five- or nine-character zip codes
>
> Combine it with percent to find fields over a certain length:
>
>     SELECT * FROM atable WHERE afield LIKE '___%';
>     -- display records where afield has three or more characters
>
> Or you can accomodate some language idiosyncracies, as
> well:
>
>     SELECT * FROM activity WHERE venue LIKE 'Theat__';
>     -- find u.k. 'theatre' and u.s. 'theater'
>
>
> ILIKE
> =====
>
> But what if a customer's name is all capitals, as in
> 'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't
> find it, because uppercase 'O' is not identical to
> lowercase 'o'!
>
> Finding text independent of uppercase/lowercase
> is something we often need to do. And so we have
> the ILIKE operator.
>
> Use ILIKE instead of LIKE when your search should
> ignore case -- the "I" stands for case-[I]nsensitive:
>
>     -- show all customers with 'corp' in the name
>     SELECT * FROM cust WHERE name ILIKE '%Corp%';
>     -- ignoring whether the field is upper- or lowercase
>
> That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP'
> and even 'Amalgamated Switch and Relay corporation'
> because anything can ('%') precede or ('%') follow the
> string 'Corp' which can be uppercase or lowercase.
>
> But there's even more power in "regular expressions"!
> See below...
>
>
> Notes
> -----
>
> Always include some real data to search for!
>
>     -- show everything, doing a lot of unnecessary work:
>     SELECT * FROM atable WHERE afield LIKE '%';
>
> It would be rather silly to make PostgreSQL compare
> every record to see if it matched 'anything goes,
> in "afield"'. Just say no.
>
> Convert your spoken-language request to a LIKE
> clause like this:
>
>     "afield ENDS WITH avalue"
>     reword => "afield starts with anything, ending with avalue"
>     sql => "... WHERE afield LIKE '%avalue'"
>
>     "afield STARTS WITH avalue"
>     reword => "afield starts with avalue, ending with anything"
>     sql => "... WHERE afield LIKE 'avalue%'"
>
> See how that works?
>
>     "afield CONTAINS avalue"
>     sql => "... WHERE afield LIKE '%avalue%'"
>
> If you're looking for something at the beginning of
> a field, that's where your data goes, and you END with
> the percent. If you're looking for something at the end
> of a field, put your data there and allow anything at
> the front by putting the percent there.
>
> ALSO -- if your field is indexed, you'll defeat the index
> unless you anchor your search to the beginning of the field.
>
>     -- find fields starting with 'something'
>     ... afield LIKE 'something%'; -- uses afield's index
>
> The example below isn't anchored at the start of the field,
> so it does no good to use the index:
>
>     -- find fields containing 'pooh'
>     ... afield LIKE '%pooh%'; -- can't use afield's index
>
> Afield must contain 'pooh' BUT it could be anywhere in
> the field. The index would be useless.
>
>
> LIMITATION
> ==========
>
> Okay. Let's say you have a full-name field that has
> first, middle, and last name all in the one field.
> Sometimes you'll have a middle name, sometimes
> you won't. Sometimes the first name will be spelled
> out, other times it'll be shortened or even abbreviated
> to just the first initial.
>
> Here's how you'd search that field for "Abraham Lincoln"
> using the LIKE operator:
>
>     SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';
>
> It'll find 'Abraham Lincoln' and 'A. Lincoln' and
> 'Abe Lincoln' because anything can go between the
> initial 'A' and the ending 'Lincoln'.
>
> But -- it'll also match 'Andrew M. Lincoln' and 'Abner
> Sasquatch Lincoln' as well... and this is why some bright
> souls created regular expressions!
>
>
> REGULAR EXPRESSIONS
> ===================
>
> A regular expression is a pattern to search for.
> The structure of the pattern has to match a
> rigid set of rules so that PostgreSQL will know
> what you're trying to search for.
>
> This introduction will barely scratch the surface,
> but hopefully it'll get you started--
>
> If you've ever tinkered with perl, you've probably
> encountered regular expressions:
>
>     perl -ne 'print if /[A-Za-z]/;' somefile
>
> Here, the "[A-Za-z]" is a regular expression.
> It matches any line containing uppercase (A-Z)
> or lowercase (a-z) letters. In this example, the
> lines will come from the file "somefile" and
> lines that match will be printed out.
>
> Well, PostgreSQL can do that, too!
>
>
> Example
> -------
>
>     SELECT * FROM person
>     WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';
>
> The REGULAR EXPRESSION operator is the TILDE (~).
> The value you're searching for then must be a valid
> regular expression. In the above example, we're asking
> PostgreSQL to display all rows containing
>
>     'A'
>
> followed by EITHER
>
>     'be'
> OR
>     'braham'
> OR
>     '\\.' -- a dot
>
> which is then followed by
>
>     ' ' -- a space
>     '*' -- zero or more times
>
> followed by
>
>     'Lincoln'
>
> and that's all. Whew!
>
> As you can guess, these will all successfully match:
>
>     'A. Lincoln'
>     'AbeLincoln' -- without any space at all
>     'Abraham    Lincoln' -- with lots of spaces
>
> and these won't match:
>
>     'A Lincoln' -- no dot, no 'braham', no 'be'
>     'Abe Gump Lincoln' -- nothing will match 'Gump'
>
>
> Quoting "\\"
> ------------
>
> So what's with the BACKSLASH in front of the dot? Well,
> just as LIKE has the UNDERSCORE (_) to denote "any single
> character", REGULAR EXPRESSIONS use the DOT (.) for
> that very same purpose. So we have to "escape" the
> dot to alter its normal interpretation, using the.
>
> Note that string literals in PostgreSQL already use the
> backslash, so you'll have to double any backslash
> you use:
>
>     psql=# SELECT * FROM person
>     psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'
>     psql-# -- a second backslash added
>     psql-# ;
>
>
> Grouping "()"
> -------------
>
> How about the PARENTHESES () and the BAR |, hmm? Well,
> that's how you can group permitted alternatives:
>
>     ... ~ 'A(be|braham|\\.)'
>
> That says that 'A' can be followed by any of the three
> sub-expressions within the parentheses, namely 'be' or
> 'braham' or <dot>.
>
> These all produce the same matches:
>
>     ... ~ 'A(be|braham|\\.)'
>     ... ~ 'Abe|Abraham|A\\.'
>     ... ~ 'A(b(e|raham)|\\.)'
>
> Let's dissect this last one -- it specifies that:
>
>     'A'
>
> can be followed by either
>
>     'b' followed by some more stuff
> or
>     <dot>
>
> nestling into the details following the 'b' -- if
> 'A' is followed by 'b', then the 'b' must be followed
> by either
>
>     'e' -- which matches 'Abe'
> or
>     'raham' -- which matches 'Abraham'
>
> Do you see how powerful this can be?
>
>
> Character Classes "[]"
> ----------------------
>
> So you're trying to find "McAndrews" -- but maybe it's
> spelled with a lower-case "a" as in "Mcandrews", instead!
>
>     SELECT * FROM person
>     WHERE fullname ~ 'Mc[Aa]ndrews';
>
> Using the BRACKETS tells the regular expression
> parser that you're allowing any one of a whole class
> of characters in that spot.
>
>     '[Aa]'
>
> It specifies that you're looking for either an uppercase
> 'A' or a lowercase 'a' in that spot.
>
> Here's how you can find fields containing vowels:
>
>     ... ~ '[AEIOUaeiou]'
>
> Search for fields containing lowercase letters:
>
>     ... ~ '[abcdefghijklmnopqrstuvwxyz]'
>
> Of course, there's a shortcut for specifying character
> classes that cover a whole range:
>
>     ... ~ '[a-z]' -- also matches any lowercase letter
>
> Display fields that contain digits:
>
>     ... ~ '[0-9]'
>
> Here's an example using a standard U.S. phone pattern
> (neglecting the area code for clarity and space):
>
>     SELECT * FROM cust
>     WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';
>
> That will show customers whose "descr" field contains a
> pattern of digits that looks like a U.S. phone number:
>
>     '[0-9][0-9][0-9]' -- three digits
>
> followed by
>
>     '-' -- a hyphen
>
> followed by
>
>     '[0-9][0-9][0-9][0-9]' -- four digits
>
> As you can see, the hyphen doesn't do anything special
> for a regular expression unless it's inside the square
> brackets of a character class -- in which case it
> means "anything between".
>
> What if you want to allow a hyphen within a character class?
> Simply make it the first character inside the brackets:
>
>     '[-.,_]'
>
> That class will match any hyphen, dot, comma or underscore.
>
>
> Bounds "{}"
> -----------
>
> Rather than having to specify each of those [0-9]
> monstrosities every time, let's abbreviate. Instead
> of:
>
>     '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
>
> we can just say
>
>     '[0-9]{3}-[0-9]{4}'.
>
> The BOUNDS specification, inside braces, comes immediately
> after whatever you want a certain number of. That is,
> first you specify what you're looking for, then you specify
> how many you need. (The default is {1,1} one, as you can tell:
> one 'A' followed by one 'b' followed by...)
>
> Inside the BRACES that specify your "bounds" you can
> say you want exactly so many, as we did above (3 and
> then 4) or you can give a low-to-high pair:
>
>     ... ~ 'Z{3}' -- need three Z's
>     ... ~ '@{5,}' -- five or more @'s
>     ... ~ 'Q{2,5}' -- at least two, at most five, Q's
>     ... ~ '(This|That){0,3}'
>
> The last one allows for 'This' or 'That', anywhere from
> zero to three times. So 'ThisThatThis' and '' will match!
> Convenient, hmm?
>
> And for very common cases, there are handy abbreviations:
>
>     'x{1,1}'  'x'    one (the default)
>     'r{0,1}'  'r?'   zero or one, i.e. 'x' is optional
>     'B{0,}'   'B*'   zero or more
>     'z{1,}'   'z+'   one or more
>
> These three abbreviations will pop up often:
>
>     -- allow 'Ms' and 'Mrs' (the 'r' is optional):
>     ... title ~ '(Mr|Mr?s|Dr)'
>
>     -- maybe there's a space, maybe there isn't:
>     ... lname ~ 'Mac ?Affee'
>
>     -- don't care how many trailing spaces:
>     ... lname ~ 'Smith *'
>
>     -- find records with two or more even digits:
>     ... afield ~ '[02468]{2,}'
> or
>     ... afield ~ '[02468][02468]+'
>
> That last one breaks down to
>
>     '[02468]' -- an even digit
>
> followed by
>
>     '[02468]' -- an even digit
>     '+' -- one or more times
>
> Remember: the plus means '{1,}' -- i.e. that the preceeding
> item must match one or more times.
>
> Getting back to our phone number pattern... this is
> the same request as above, but it's easier to
> see what's going on, using the {bounds} feature:
>
>     SELECT * FROM cust
>     WHERE descr ~ '[0-9]{3}-[0-9]{4}';
>     -- three digits, hyphen, four digits
>
>
> Excluding Characters "[^]"
> --------------------------
>
> Note that the above pattern will also match 00000-0000
> because the last three digits of the five in the left
> chunk, with the four digits of the right chunk, fit
> perfectly into the constraints of what we asked for.
> And this ain't no phone number.
>
> Here's an approach to fixing that snag:
>
>     SELECT * FROM cust
>     WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}';
>
> Note that we've prefixed the previous pattern with
>
>     '[^0-9]'
>
> because, within the square brackets of a character class,
> the CARAT (^) means "anything EXCEPT..."
>
> So now, 00000-0000 will NOT match; this is what we're
> after.
>
> But there's a new problem: if the phone number is the
> very first thing in the "descr" field it would never
> match because we're demanding that there be SOMETHING
> (besides a digit) before the phone number. What to do?
>
>
> At The Very Beginning "^"
> -------------------------
>
> You can match "beginning-of-field" with the CARAT (^).
> (When it's the first thing inside square brackets, it
> negates the character class; outside brackets, a carat
> means "beginning-of-text". Got that? It's important!)
>
>     SELECT * FROM cust
>     WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}';
>
> Here's what this pattern looks for:
>
> Either
>     '^' -- at the very beginning of text
> or
>     '[^0-9]' -- find something that's a non-digit
>
> followed by
>
>     '[0-9]{3}' -- three digits
>
> and then a
>
>     '-' -- hyphen
>
> and finally
>
>     '[0-9]{4}' -- four digits
>
> and we've solved the 'at-beginning-of-field' problem.
>
>
> At The Very End "$"
> -------------------
>
> Still, we could inadvertently match something like
> 000-000000000 (probably an id number for a replacement
> part) which we don't want to do.
>
> Note how this is the same problem, for the end
> of the field, that we had earlier, with the beginning
> of the field? The solution is similar:
>
>     SELECT * FROM cust
>     WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])';
>
> The DOLLAR SIGN ($) represents 'end-of-text' just as
> carat, outside of square brackets, means 'beginning-of-
> text'. The above pattern is the same as before, except
> we appended
>
>     '($|[^0-9])'
>
> to the end, which means we're looking for
>
> either
>
>     '$' -- at the very end-of-text
> or
>     '[^0-9]' -- something that's not a digit
>
> after our previous pattern.
>
> And, finally, we have a comprehensive phone pattern --
> it'll match ###-#### while excluding other numeric
> arrangements -- in a text field.
>
>
> Anchoring
> ---------
>
> Did you notice that regular expressions aren't "anchored",
> as the LIKE expressions are? To specify that you're looking
> only at the beginning of a field, you need to use '^' and
> to include the end of a field, you must use '$'.
>
> It's a good thing regular expressions aren't anchored
> automatically -- we wouldn't have been able to specify
> our phone number pattern above! Things like '(^|[^0-9])'
> are possible because YOU get to specify when and how
> beginning-of-text is required. Cool, eh?
>
> And, just as in the LIKE operator, unless regular expressions
> ARE anchored at the beginning of a field, you'll defeat any
> index you have for that field. Indexes help alphabetize by
> comparing the beginning of your fields to each other; unless
> you're looking at the beginning of the field, your index can't
> be used.
>
> Here are some comparisons between operators LIKE and ~ :
>
>     -- list all customers within the 47610 postal code:
>     SELECT * FROM cust WHERE zip LIKE '47610%';
>         -- zip begins with '47610' then has anything
>     SELECT * FROM cust WHERE zip ~ '^47610';
>         -- zip begins with '47610'
>
>     -- display customers who have 'Corp' in their names
>     SELECT * FROM cust WHERE name LIKE '%Corp%';
>         -- name contains 'Corp' with anything before and after
>     SELECT * FROM cust WHERE name ~ 'Corp';
>         -- name contains 'Corp'
>
>     -- show customers whose names end in 'LLC'
>     SELECT * FROM cust WHERE name LIKE '%LLC';
>         -- name can have anything, with 'LLC' at the end
>     SELECT * FROM cust WHERE name ~ 'LLC$';
>         -- name must have 'LLC' at the end
>
>     -- documents beginning with 'We', ending with 'rica'
>     SELECT * FROM doc WHERE contents LIKE 'We%rica';
>         -- starts with 'We', has anything, ending with 'rica'
>     SELECT * FROM doc WHERE contents ~ '^We.*rica$';
>         -- start with 'We', zero or more chars, end with 'rica'
>
> Remember that in regular expressions, DOT means 'any character'.
> Thus '.*' means 'any character, zero times or more' which, in
> English, means "anything can go here, including nothing at all".
>
> There are ways to anchor your searches to word boundaries,
> as well -- not just beginning-of-field and end-of-field. See
> your documentation for details.
>
>
> Case-Insensitive "~*"
> ---------------------
>
> If you're not worried about differentiating between
> uppercase and lowercase in your regular expressions,
> you could go full boar and try
>
>     SELECT * FROM cust
>     WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]';
>
> Fortunately there's a quickie to make it easier for you --
> where you've been using the tilde (~) as your operator, use
> tilde-star (~*) instead:
>
>     SELECT * FROM cust
>     WHERE lname ~* 'Octavian';
>
> ILIKE is case-insensitive (where LIKE is case-specific) just
> as ~* is case-insensitive for regular expressions (where ~ is
> case-specific). Very handy!
>
>
> Example
> -------
>
> Email addresses can look like any of these:
>
>     me@my.net
>     someone-unimportant9237@this.little.org
>     first.last.title@obscure.sub-net.biggie.com
>     _weird_@somewhere.out_there.net
>
> A reasonably-functional email address pattern might be
> something like this:
>
>     '[a-z0-9_]+([\\-\\.][a-z0-9_]+)*@[a-z0-9_]+([\\-\\.][a-z0-9_]+)+'
>
> That breaks down to (using case-indifferent via ~*, of course):
>
>     '[a-z0-9_]'
>
> any alphanumeric (including underscore) character
>
>     '+'
>
> one or more times, followed by
>
>     '([\\-\\.][a-z0-9_]+)'
>
> email stuff [1],
>
>     '*'
>
> zero or more times, followed by
>
>     '@'
>
> which is then followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (including underscore)
>
>     '+'
>
> one or more times, followed by
>
>     '([\\-\\.][a-z0-9_]+)'
>
> site stuff [2]
>
>     '+'
>
> one or more times.
>
>
> As for [1] email stuff, it can be (zero or more of):
>
>     '[\\-\\.]'
>
> hyphen or dot, followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
>     '+'
>
> one or more times.
>
>
> And as for [2] site stuff, it can be (one or more of):
>
>     '[\\-\\.]'
>
> hyphen or dot, followed by
>
>     '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
>     '+'
>
> one or more times.
>
> So the personal part of the address can just be a
> single word, with optional dot-or-hyphen joining other
> words (ZERO or more) onto it; this is followed by at-sign,
> which is followed by the site portion of the address,
> namely: any word, followed by ONE or more dot-or-hyphen-
> followed-by-another-word combo's.
>
> It's not a perfect match for all legal email patterns,
> (for example, bad@my-addr would match) but something like
> it might be sufficient depending on your requirements.
>
>
> CONCLUSION
> ==========
>
> Regular expressions are complicated, because they're
> powerful! (Or is it the other way around?)
>
> Check out "Pattern Matching" in the PostgreSQL manual
> for more details -- on my Debian "Potato" system it's in
> html/postgres/functions-matching.html under the
> /usr/share/doc/postgresql-doc/ directory.
>
> --
> mailto:will@serensoft.com
> http://www.dontUthink.com/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073