Regular Expression INTRO [long] [OT]

Поиск
Список
Период
Сортировка
От	will trillich
Тема	Regular Expression INTRO [long] [OT]
Дата	9 апреля 2002 г. 22:57:34
Msg-id	3CB38FDE.2DEA1F3F@serensoft.com обсуждение исходный текст
Ответ на	Re: SQL Where Like - Range it?! ("Len Morgan" <len-morgan@crcom.net>)
Ответы	Re: Regular Expression INTRO [long] [OT] (will trillich <will@serensoft.com>) Re: Regular Expression INTRO [long] [OT] (Bruce Momjian <pgman@candle.pha.pa.us>)
Список	pgsql-general
Дерево обсуждения
wow -- this was nearly a year ago and i still haven't
done it!

Justin Clift wrote:
> Would you mind taking a few moments and writing a guide on using Regex
> functions in PostgreSQL?
>
> This is to put on the techdocs.postgresql.org website so everyone in the
> PostgreSQL community will have access to it.

i'm posting to pgsql-general so the community
can help find holes and plug them...

here goes --






Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):


LIKE
====

You're probably familiar with the WHERE clause
of the SELECT statement --

    SELECT somefield FROM table WHERE somefield = somevalue;

For example, to show all customers within the
post code "60201" you could try

    SELECT * FROM cust WHERE zip = '60201';

But if you also have nine-digit zips stored in the zip
field (in the U.S.A. we have standard five-digit postal
codes and also a four-digit extension for further
geographic resolution) then this search will OMIT
customers having more-specific zips such as '60201-9876'
or '60201-1234'.


The Percent "%"
---------------

Try using LIKE instead of = (EQUALS):

    SELECT * FROM cust WHERE zip LIKE '60201%';

With LIKE, your search values can use the percent (%)
to tell PostgreSQL that anything can occupy that spot
-- one character, twelve symbols or zero digits -- and
still satisfy the search.

On Unix or Linux, this is basically the same as the
ASTERISK (*) at the command line, when dealing with
file names:

    # list all files whose names begin with '.bash'
    ls .bash*

    # list all files containing 'out' anywhere in the name
    ls *out*

    # list all file names ending with '.pl'
    ls *.pl

    # list file starting with 'proj', ending with '.c'
    ls proj*.c

With PostgreSQL using the LIKE operator, use the
percent, instead:

    -- list all customers within the 47610 postal code
    SELECT * FROM cust WHERE zip LIKE '47610%';

    -- display customers who have 'Corp' in their names
    SELECT * FROM cust WHERE name LIKE '%Corp%';

    -- show customers whose names end in 'LLC'
    SELECT * FROM cust WHERE name LIKE '%LLC';

    -- documents beginning with 'We', ending with 'rica'
    SELECT * FROM doc WHERE contents LIKE 'We%rica';

Wherever the '%' appears (using the LIKE operator)
Postgres allows anything at all to match -- from a
lengthy string of text, to one single character, to
a zero-length string -- i.e. nothing at all.

    ...ILIKE 'A%Z'
    -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A@$*Z'


The Underscore "_"
------------------

For the LIKE operator, the UNDERSCORE (_) takes on the same
meaning as the question mark does to shell operations for
Unix and Linux file names:

    # list files starting with dot, followed by at least two chars
    ls .??*

The underscore matches just one character exactly:

    SELECT * FROM atable WHERE afield LIKE '_';
    -- shows records where afield is exactly one character
    -- omitting ones where it's blank or has two (or more)
    -- characters in it)

You can restrict your searches to finding fields
of certain lengths this way:

    SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________';
    -- display all five- or nine-character zip codes

Combine it with percent to find fields over a certain length:

    SELECT * FROM atable WHERE afield LIKE '___%';
    -- display records where afield has three or more characters

Or you can accomodate some language idiosyncracies, as
well:

    SELECT * FROM activity WHERE venue LIKE 'Theat__';
    -- find u.k. 'theatre' and u.s. 'theater'


ILIKE
=====

But what if a customer's name is all capitals, as in
'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't
find it, because uppercase 'O' is not identical to
lowercase 'o'!

Finding text independent of uppercase/lowercase
is something we often need to do. And so we have
the ILIKE operator.

Use ILIKE instead of LIKE when your search should
ignore case -- the "I" stands for case-[I]nsensitive:

    -- show all customers with 'corp' in the name
    SELECT * FROM cust WHERE name ILIKE '%Corp%';
    -- ignoring whether the field is upper- or lowercase

That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP'
and even 'Amalgamated Switch and Relay corporation'
because anything can ('%') precede or ('%') follow the
string 'Corp' which can be uppercase or lowercase.

But there's even more power in "regular expressions"!
See below...


Notes
-----

Always include some real data to search for!

    -- show everything, doing a lot of unnecessary work:
    SELECT * FROM atable WHERE afield LIKE '%';

It would be rather silly to make PostgreSQL compare
every record to see if it matched 'anything goes,
in "afield"'. Just say no.

Convert your spoken-language request to a LIKE
clause like this:

    "afield ENDS WITH avalue"
    reword => "afield starts with anything, ending with avalue"
    sql => "... WHERE afield LIKE '%avalue'"

    "afield STARTS WITH avalue"
    reword => "afield starts with avalue, ending with anything"
    sql => "... WHERE afield LIKE 'avalue%'"

See how that works?

    "afield CONTAINS avalue"
    sql => "... WHERE afield LIKE '%avalue%'"

If you're looking for something at the beginning of
a field, that's where your data goes, and you END with
the percent. If you're looking for something at the end
of a field, put your data there and allow anything at
the front by putting the percent there.

ALSO -- if your field is indexed, you'll defeat the index
unless you anchor your search to the beginning of the field.

    -- find fields starting with 'something'
    ... afield LIKE 'something%'; -- uses afield's index

The example below isn't anchored at the start of the field,
so it does no good to use the index:

    -- find fields containing 'pooh'
    ... afield LIKE '%pooh%'; -- can't use afield's index

Afield must contain 'pooh' BUT it could be anywhere in
the field. The index would be useless.


LIMITATION
==========

Okay. Let's say you have a full-name field that has
first, middle, and last name all in the one field.
Sometimes you'll have a middle name, sometimes
you won't. Sometimes the first name will be spelled
out, other times it'll be shortened or even abbreviated
to just the first initial.

Here's how you'd search that field for "Abraham Lincoln"
using the LIKE operator:

    SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';

It'll find 'Abraham Lincoln' and 'A. Lincoln' and
'Abe Lincoln' because anything can go between the
initial 'A' and the ending 'Lincoln'.

But -- it'll also match 'Andrew M. Lincoln' and 'Abner
Sasquatch Lincoln' as well... and this is why some bright
souls created regular expressions!


REGULAR EXPRESSIONS
===================

A regular expression is a pattern to search for.
The structure of the pattern has to match a
rigid set of rules so that PostgreSQL will know
what you're trying to search for.

This introduction will barely scratch the surface,
but hopefully it'll get you started--

If you've ever tinkered with perl, you've probably
encountered regular expressions:

    perl -ne 'print if /[A-Za-z]/;' somefile

Here, the "[A-Za-z]" is a regular expression.
It matches any line containing uppercase (A-Z)
or lowercase (a-z) letters. In this example, the
lines will come from the file "somefile" and
lines that match will be printed out.

Well, PostgreSQL can do that, too!


Example
-------

    SELECT * FROM person
    WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';

The REGULAR EXPRESSION operator is the TILDE (~).
The value you're searching for then must be a valid
regular expression. In the above example, we're asking
PostgreSQL to display all rows containing

    'A'

followed by EITHER

    'be'
OR
    'braham'
OR
    '\\.' -- a dot

which is then followed by

    ' ' -- a space
    '*' -- zero or more times

followed by

    'Lincoln'

and that's all. Whew!

As you can guess, these will all successfully match:

    'A. Lincoln'
    'AbeLincoln' -- without any space at all
    'Abraham    Lincoln' -- with lots of spaces

and these won't match:

    'A Lincoln' -- no dot, no 'braham', no 'be'
    'Abe Gump Lincoln' -- nothing will match 'Gump'


Quoting "\\"
------------

So what's with the BACKSLASH in front of the dot? Well,
just as LIKE has the UNDERSCORE (_) to denote "any single
character", REGULAR EXPRESSIONS use the DOT (.) for
that very same purpose. So we have to "escape" the
dot to alter its normal interpretation, using the.

Note that string literals in PostgreSQL already use the
backslash, so you'll have to double any backslash
you use:

    psql=# SELECT * FROM person
    psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'
    psql-# -- a second backslash added
    psql-# ;


Grouping "()"
-------------

How about the PARENTHESES () and the BAR |, hmm? Well,
that's how you can group permitted alternatives:

    ... ~ 'A(be|braham|\\.)'

That says that 'A' can be followed by any of the three
sub-expressions within the parentheses, namely 'be' or
'braham' or <dot>.

These all produce the same matches:

    ... ~ 'A(be|braham|\\.)'
    ... ~ 'Abe|Abraham|A\\.'
    ... ~ 'A(b(e|raham)|\\.)'

Let's dissect this last one -- it specifies that:

    'A'

can be followed by either

    'b' followed by some more stuff
or
    <dot>

nestling into the details following the 'b' -- if
'A' is followed by 'b', then the 'b' must be followed
by either

    'e' -- which matches 'Abe'
or
    'raham' -- which matches 'Abraham'

Do you see how powerful this can be?


Character Classes "[]"
----------------------

So you're trying to find "McAndrews" -- but maybe it's
spelled with a lower-case "a" as in "Mcandrews", instead!

    SELECT * FROM person
    WHERE fullname ~ 'Mc[Aa]ndrews';

Using the BRACKETS tells the regular expression
parser that you're allowing any one of a whole class
of characters in that spot.

    '[Aa]'

It specifies that you're looking for either an uppercase
'A' or a lowercase 'a' in that spot.

Here's how you can find fields containing vowels:

    ... ~ '[AEIOUaeiou]'

Search for fields containing lowercase letters:

    ... ~ '[abcdefghijklmnopqrstuvwxyz]'

Of course, there's a shortcut for specifying character
classes that cover a whole range:

    ... ~ '[a-z]' -- also matches any lowercase letter

Display fields that contain digits:

    ... ~ '[0-9]'

Here's an example using a standard U.S. phone pattern
(neglecting the area code for clarity and space):

    SELECT * FROM cust
    WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';

That will show customers whose "descr" field contains a
pattern of digits that looks like a U.S. phone number:

    '[0-9][0-9][0-9]' -- three digits

followed by

    '-' -- a hyphen

followed by

    '[0-9][0-9][0-9][0-9]' -- four digits

As you can see, the hyphen doesn't do anything special
for a regular expression unless it's inside the square
brackets of a character class -- in which case it
means "anything between".

What if you want to allow a hyphen within a character class?
Simply make it the first character inside the brackets:

    '[-.,_]'

That class will match any hyphen, dot, comma or underscore.


Bounds "{}"
-----------

Rather than having to specify each of those [0-9]
monstrosities every time, let's abbreviate. Instead
of:

    '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'

we can just say

    '[0-9]{3}-[0-9]{4}'.

The BOUNDS specification, inside braces, comes immediately
after whatever you want a certain number of. That is,
first you specify what you're looking for, then you specify
how many you need. (The default is {1,1} one, as you can tell:
one 'A' followed by one 'b' followed by...)

Inside the BRACES that specify your "bounds" you can
say you want exactly so many, as we did above (3 and
then 4) or you can give a low-to-high pair:

    ... ~ 'Z{3}' -- need three Z's
    ... ~ '@{5,}' -- five or more @'s
    ... ~ 'Q{2,5}' -- at least two, at most five, Q's
    ... ~ '(This|That){0,3}'

The last one allows for 'This' or 'That', anywhere from
zero to three times. So 'ThisThatThis' and '' will match!
Convenient, hmm?

And for very common cases, there are handy abbreviations:

    'x{1,1}'  'x'    one (the default)
    'r{0,1}'  'r?'   zero or one, i.e. 'x' is optional
    'B{0,}'   'B*'   zero or more
    'z{1,}'   'z+'   one or more

These three abbreviations will pop up often:

    -- allow 'Ms' and 'Mrs' (the 'r' is optional):
    ... title ~ '(Mr|Mr?s|Dr)'

    -- maybe there's a space, maybe there isn't:
    ... lname ~ 'Mac ?Affee'

    -- don't care how many trailing spaces:
    ... lname ~ 'Smith *'

    -- find records with two or more even digits:
    ... afield ~ '[02468]{2,}'
or
    ... afield ~ '[02468][02468]+'

That last one breaks down to

    '[02468]' -- an even digit

followed by

    '[02468]' -- an even digit
    '+' -- one or more times

Remember: the plus means '{1,}' -- i.e. that the preceeding
item must match one or more times.

Getting back to our phone number pattern... this is
the same request as above, but it's easier to
see what's going on, using the {bounds} feature:

    SELECT * FROM cust
    WHERE descr ~ '[0-9]{3}-[0-9]{4}';
    -- three digits, hyphen, four digits


Excluding Characters "[^]"
--------------------------

Note that the above pattern will also match 00000-0000
because the last three digits of the five in the left
chunk, with the four digits of the right chunk, fit
perfectly into the constraints of what we asked for.
And this ain't no phone number.

Here's an approach to fixing that snag:

    SELECT * FROM cust
    WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}';

Note that we've prefixed the previous pattern with

    '[^0-9]'

because, within the square brackets of a character class,
the CARAT (^) means "anything EXCEPT..."

So now, 00000-0000 will NOT match; this is what we're
after.

But there's a new problem: if the phone number is the
very first thing in the "descr" field it would never
match because we're demanding that there be SOMETHING
(besides a digit) before the phone number. What to do?


At The Very Beginning "^"
-------------------------

You can match "beginning-of-field" with the CARAT (^).
(When it's the first thing inside square brackets, it
negates the character class; outside brackets, a carat
means "beginning-of-text". Got that? It's important!)

    SELECT * FROM cust
    WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}';

Here's what this pattern looks for:

Either
    '^' -- at the very beginning of text
or
    '[^0-9]' -- find something that's a non-digit

followed by

    '[0-9]{3}' -- three digits

and then a

    '-' -- hyphen

and finally

    '[0-9]{4}' -- four digits

and we've solved the 'at-beginning-of-field' problem.


At The Very End "$"
-------------------

Still, we could inadvertently match something like
000-000000000 (probably an id number for a replacement
part) which we don't want to do.

Note how this is the same problem, for the end
of the field, that we had earlier, with the beginning
of the field? The solution is similar:

    SELECT * FROM cust
    WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])';

The DOLLAR SIGN ($) represents 'end-of-text' just as
carat, outside of square brackets, means 'beginning-of-
text'. The above pattern is the same as before, except
we appended

    '($|[^0-9])'

to the end, which means we're looking for

either

    '$' -- at the very end-of-text
or
    '[^0-9]' -- something that's not a digit

after our previous pattern.

And, finally, we have a comprehensive phone pattern --
it'll match ###-#### while excluding other numeric
arrangements -- in a text field.


Anchoring
---------

Did you notice that regular expressions aren't "anchored",
as the LIKE expressions are? To specify that you're looking
only at the beginning of a field, you need to use '^' and
to include the end of a field, you must use '$'.

It's a good thing regular expressions aren't anchored
automatically -- we wouldn't have been able to specify
our phone number pattern above! Things like '(^|[^0-9])'
are possible because YOU get to specify when and how
beginning-of-text is required. Cool, eh?

And, just as in the LIKE operator, unless regular expressions
ARE anchored at the beginning of a field, you'll defeat any
index you have for that field. Indexes help alphabetize by
comparing the beginning of your fields to each other; unless
you're looking at the beginning of the field, your index can't
be used.

Here are some comparisons between operators LIKE and ~ :

    -- list all customers within the 47610 postal code:
    SELECT * FROM cust WHERE zip LIKE '47610%';
        -- zip begins with '47610' then has anything
    SELECT * FROM cust WHERE zip ~ '^47610';
        -- zip begins with '47610'

    -- display customers who have 'Corp' in their names
    SELECT * FROM cust WHERE name LIKE '%Corp%';
        -- name contains 'Corp' with anything before and after
    SELECT * FROM cust WHERE name ~ 'Corp';
        -- name contains 'Corp'

    -- show customers whose names end in 'LLC'
    SELECT * FROM cust WHERE name LIKE '%LLC';
        -- name can have anything, with 'LLC' at the end
    SELECT * FROM cust WHERE name ~ 'LLC$';
        -- name must have 'LLC' at the end

    -- documents beginning with 'We', ending with 'rica'
    SELECT * FROM doc WHERE contents LIKE 'We%rica';
        -- starts with 'We', has anything, ending with 'rica'
    SELECT * FROM doc WHERE contents ~ '^We.*rica$';
        -- start with 'We', zero or more chars, end with 'rica'

Remember that in regular expressions, DOT means 'any character'.
Thus '.*' means 'any character, zero times or more' which, in
English, means "anything can go here, including nothing at all".

There are ways to anchor your searches to word boundaries,
as well -- not just beginning-of-field and end-of-field. See
your documentation for details.


Case-Insensitive "~*"
---------------------

If you're not worried about differentiating between
uppercase and lowercase in your regular expressions,
you could go full boar and try

    SELECT * FROM cust
    WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]';

Fortunately there's a quickie to make it easier for you --
where you've been using the tilde (~) as your operator, use
tilde-star (~*) instead:

    SELECT * FROM cust
    WHERE lname ~* 'Octavian';

ILIKE is case-insensitive (where LIKE is case-specific) just
as ~* is case-insensitive for regular expressions (where ~ is
case-specific). Very handy!


Example
-------

Email addresses can look like any of these:

    me@my.net
    someone-unimportant9237@this.little.org
    first.last.title@obscure.sub-net.biggie.com
    _weird_@somewhere.out_there.net

A reasonably-functional email address pattern might be
something like this:

    '[a-z0-9_]+([\\-\\.][a-z0-9_]+)*@[a-z0-9_]+([\\-\\.][a-z0-9_]+)+'

That breaks down to (using case-indifferent via ~*, of course):

    '[a-z0-9_]'

any alphanumeric (including underscore) character

    '+'

one or more times, followed by

    '([\\-\\.][a-z0-9_]+)'

email stuff [1],

    '*'

zero or more times, followed by

    '@'

which is then followed by

    '[a-z0-9_]'

alphanumerics (including underscore)

    '+'

one or more times, followed by

    '([\\-\\.][a-z0-9_]+)'

site stuff [2]

    '+'

one or more times.


As for [1] email stuff, it can be (zero or more of):

    '[\\-\\.]'

hyphen or dot, followed by

    '[a-z0-9_]'

alphanumerics (or underscore)

    '+'

one or more times.


And as for [2] site stuff, it can be (one or more of):

    '[\\-\\.]'

hyphen or dot, followed by

    '[a-z0-9_]'

alphanumerics (or underscore)

    '+'

one or more times.

So the personal part of the address can just be a
single word, with optional dot-or-hyphen joining other
words (ZERO or more) onto it; this is followed by at-sign,
which is followed by the site portion of the address,
namely: any word, followed by ONE or more dot-or-hyphen-
followed-by-another-word combo's.

It's not a perfect match for all legal email patterns,
(for example, bad@my-addr would match) but something like
it might be sufficient depending on your requirements.


CONCLUSION
==========

Regular expressions are complicated, because they're
powerful! (Or is it the other way around?)

Check out "Pattern Matching" in the PostgreSQL manual
for more details -- on my Debian "Potato" system it's in
html/postgres/functions-matching.html under the
/usr/share/doc/postgresql-doc/ directory.

--
mailto:will@serensoft.com
http://www.dontUthink.com/
В списке pgsql-general по дате отправления:
Предыдущее

От: Martijn van Oosterhout
Дата: 09 апреля 2002 г., 22:44:59
Сообщение: Re: more about pg_toast growth
Следующее

От: eric.jones@hua.army.mil
Дата: 09 апреля 2002 г., 22:59:12
Сообщение: PostgreSQL said: ERROR: Attribute 'id' not found
Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Regular Expression INTRO [long] [OT]

Предыдущее

Следующее