Обсуждение: Regular expression character escape
Hello. I'd like to perform a query using user-submitted input in a regular expression. Something along the lines of: select some_col from some_table where some_col ~ ('^' || user_submitted_input || '\d*$') This query is looking for every value matching the user submitted input with optional trailing decimal characters. My problem is: the user can break this using special regular expression characters (ex: 'test('). I'm looking for a way to escape regular expression characters from the input (maybe a function already exists for that purpose?). I could do it on the client side, but I'm not really sure how python and postgresql flavors of regular expressions differ. I've seen that the (?q) modifier is supported (treat the rest of the regexp as normal characters), but unless I missed something it I can't use it anywhere else than at the beginning of the regexp. From the docs at http://www.postgresql.org/docs/current/static/functions-matching.html: "Embedded options take effect at the ) terminating the sequence. They can appear only at the start of an ARE." Any idea on how to achieve this ? Thank you. -- Ronan Dunklau
-----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Ronan Dunklau Sent: Friday, February 24, 2012 6:34 AM To: pgsql-general@postgresql.org Subject: [GENERAL] Regular expression character escape Hello. I'd like to perform a query using user-submitted input in a regular expression. Something along the lines of: select some_col from some_table where some_col ~ ('^' || user_submitted_input || '\d*$') This query is looking for every value matching the user submitted input with optional trailing decimal characters. My problem is: the user can break this using special regular expression characters (ex: 'test('). I'm looking for a way to escape regular expression characters from the input (maybe a function already exists for that purpose?). I could do it on the client side, but I'm not really sure how python and postgresql flavors of regular expressions differ. I've seen that the (?q) modifier is supported (treat the rest of the regexp as normal characters), but unless I missed somethingit I can't use it anywhere else than at the beginning of the regexp. From the docs at http://www.postgresql.org/docs/current/static/functions-matching.html: "Embedded options take effect at the ) terminating the sequence. They can appear only at the start of an ARE." Any idea on how to achieve this ? Thank you. -- Ronan Dunklau ------------------------------------------------------------------------------------------------ How about: WHERE some_col LIKE (user_submitted_input || '%') AND some_col ~ ('^.{' || length_of_user_submitted_input || '}\d*$') I'd have some reservations regarding multi-byte characters however - but this avoids any escaping of the input string. You could (should?) write the escaping routine on the server side in a user-defined function: WHERE some_col ~ ('^' || make_regexp_literal(user_submitted_stringliteral) || '\d*$') David J.
On 24/02/2012 16:38, David Johnston wrote: > How about: > > WHERE some_col LIKE (user_submitted_input || '%') AND some_col ~ ('^.{' || length_of_user_submitted_input || '}\d*$') > > I'd have some reservations regarding multi-byte characters however - but this avoids any escaping of the input string. That's a clever trick, I might end using it. > You could (should?) write the escaping routine on the server side in a user-defined function: > > WHERE some_col ~ ('^' || make_regexp_literal(user_submitted_stringliteral) || '\d*$') I totally agree, but I hoped I could use an already existing function without having to read the whole spec to figure what should and should not be escaped. > David J. > > -- Ronan Dunklau
Am 24.02.2012 17:04, schrieb Ronan Dunklau: > On 24/02/2012 16:38, David Johnston wrote: >> You could (should?) write the escaping routine on the server side in a user-defined function: >> >> WHERE some_col ~ ('^' || make_regexp_literal(user_submitted_stringliteral) || '\d*$') > > I totally agree, but I hoped I could use an already existing function > without having to read the whole spec to figure what should and should > not be escaped. Use the corresponding function of your programming language/framework of choice. E.g. Python delivers this as re.escape(). -- --- Heiko.
On 24/02/2012 17:09, Heiko Wundram wrote: > Am 24.02.2012 17:04, schrieb Ronan Dunklau: >> On 24/02/2012 16:38, David Johnston wrote: >>> You could (should?) write the escaping routine on the server side in >>> a user-defined function: >>> >>> WHERE some_col ~ ('^' || >>> make_regexp_literal(user_submitted_stringliteral) || '\d*$') >> >> I totally agree, but I hoped I could use an already existing function >> without having to read the whole spec to figure what should and should >> not be escaped. > > Use the corresponding function of your programming language/framework of > choice. E.g. Python delivers this as re.escape(). Thank you, but as I wrote in the original post, I don't know how postgresql and python differ in their regexp syntax. Specifically, I know that re.escape escapes any non-alphanumeric character, including accented letters. If you have any evidence proving that everything will work fine with re.escape, I'll be more than happy to use it. -- Ronan Dunklau
Am 24.02.2012 17:40, schrieb Ronan Dunklau: > On 24/02/2012 17:09, Heiko Wundram wrote: >> Use the corresponding function of your programming language/framework of >> choice. E.g. Python delivers this as re.escape(). > > Thank you, but as I wrote in the original post, I don't know how > postgresql and python differ in their regexp syntax. Specifically, I > know that re.escape escapes any non-alphanumeric character, including > accented letters. > > If you have any evidence proving that everything will work fine with > re.escape, I'll be more than happy to use it. This will work (AFAICT, _and_ as far as I've used it): the regex-syntax of Python is a superset of the regex-syntax of PostgreSQL (both are based on PCRE, not on the library, but the syntax), and as such you'll possibly do "too much" quoting, but never too little when simply using the Python-builtin. -- --- Heiko.
On 24/02/2012 17:43, Heiko Wundram wrote: > Am 24.02.2012 17:40, schrieb Ronan Dunklau: >> On 24/02/2012 17:09, Heiko Wundram wrote: >>> Use the corresponding function of your programming language/framework of >>> choice. E.g. Python delivers this as re.escape(). >> >> Thank you, but as I wrote in the original post, I don't know how >> postgresql and python differ in their regexp syntax. Specifically, I >> know that re.escape escapes any non-alphanumeric character, including >> accented letters. >> >> If you have any evidence proving that everything will work fine with >> re.escape, I'll be more than happy to use it. > > This will work (AFAICT, _and_ as far as I've used it): the regex-syntax > of Python is a superset of the regex-syntax of PostgreSQL (both are > based on PCRE, not on the library, but the syntax), and as such you'll > possibly do "too much" quoting, but never too little when simply using > the Python-builtin. Unfortunately for my use case, "too much" quoting can lead to errors in postgresql. I just tested it using the following code: postgres=# create or replace function escape_re(atext varchar) returns varchar as $$ import re return re.escape(atext.decode('utf8')) $$ language plpythonu; postgres=# select escape_re('testé'); escape_re ----------- test\é postgres=# select 'testé' ~ escape_re('testé'); ERROR: invalid regular expression: invalid escape \ sequence -- Ronan Dunklau
Ronan Dunklau <rdunklau@gmail.com> writes: > Unfortunately for my use case, "too much" quoting can lead to errors in > postgresql. AFAIR, the only stuff that's unsafe to insert a backslash before is ASCII letters. This is documented in the fine print discussing regular expressions, btw. regards, tom lane