Re: SQL-spec incompatibilities in similar_escape() and related stuff

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: SQL-spec incompatibilities in similar_escape() and related stuff
Дата
Msg-id 32617.1558649424@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: SQL-spec incompatibilities in similar_escape() and related stuff  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: SQL-spec incompatibilities in similar_escape() and related stuff  (Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org>)
Re: SQL-spec incompatibilities in similar_escape() and related stuff  (Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org>)
Список pgsql-hackers
I wrote:
> I propose therefore that we leave similar_escape in place with its
> current behavior, as a compatibility measure for cases like this.
> Intead, invent two new strict functions, say
>     similar_to_escape(pattern)
>     similar_to_escape(pattern, escape)
> and change the parser and the implementation of SUBSTRING() to
> rely on these going forward.

> The net effect will be to make explicit "ESCAPE NULL" spec-compliant,
> and to get rid of the performance problem from inlining failure for
> substring().  All else is just doc clarifications.

Here's a proposed patch for that.  I think it's a bit too late to be
messing with this kind of thing for v12, so I'll add this to the
upcoming CF.

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a79e7c0..7d4472a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4146,6 +4146,14 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
    </para>

    <para>
+    According to the SQL standard, omitting <literal>ESCAPE</literal>
+    means there is no escape character (rather than defaulting to a
+    backslash), and a zero-length <literal>ESCAPE</literal> value is
+    disallowed.  <productname>PostgreSQL</productname>'s behavior in
+    this regard is therefore slightly nonstandard.
+   </para>
+
+   <para>
     The key word <token>ILIKE</token> can be used instead of
     <token>LIKE</token> to make the match case-insensitive according
     to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
@@ -4280,9 +4288,27 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
    </para>

    <para>
-    As with <function>LIKE</function>, a backslash disables the special meaning
-    of any of these metacharacters; or a different escape character can
-    be specified with <literal>ESCAPE</literal>.
+    As with <function>LIKE</function>, a backslash disables the special
+    meaning of any of these metacharacters.  A different escape character
+    can be specified with <literal>ESCAPE</literal>, or the escape
+    capability can be disabled by writing <literal>ESCAPE ''</literal>.
+   </para>
+
+   <para>
+    According to the SQL standard, omitting <literal>ESCAPE</literal>
+    means there is no escape character (rather than defaulting to a
+    backslash), and a zero-length <literal>ESCAPE</literal> value is
+    disallowed.  <productname>PostgreSQL</productname>'s behavior in
+    this regard is therefore slightly nonstandard.
+   </para>
+
+   <para>
+    Another nonstandard extension is that following the escape character
+    with a letter or digit provides access to the same escape sequences
+    defined for POSIX regular expressions, below (see
+    <xref linkend="posix-character-entry-escapes-table"/>,
+    <xref linkend="posix-class-shorthand-escapes-table"/>, and
+    <xref linkend="posix-constraint-escapes-table"/>).
    </para>

    <para>
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 8311b1d..6462e13 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -13073,15 +13073,15 @@ a_expr:        c_expr                                    { $$ = $1; }

             | a_expr SIMILAR TO a_expr                            %prec SIMILAR
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
-                                               list_make2($4, makeNullAConst(-1)),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
+                                               list_make1($4),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~",
                                                    $1, (Node *) n, @2);
                 }
             | a_expr SIMILAR TO a_expr ESCAPE a_expr            %prec SIMILAR
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
                                                list_make2($4, $6),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~",
@@ -13089,15 +13089,15 @@ a_expr:        c_expr                                    { $$ = $1; }
                 }
             | a_expr NOT_LA SIMILAR TO a_expr                    %prec NOT_LA
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
-                                               list_make2($5, makeNullAConst(-1)),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
+                                               list_make1($5),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~",
                                                    $1, (Node *) n, @2);
                 }
             | a_expr NOT_LA SIMILAR TO a_expr ESCAPE a_expr        %prec NOT_LA
                 {
-                    FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
+                    FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
                                                list_make2($5, $7),
                                                @2);
                     $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~",
@@ -14323,9 +14323,9 @@ subquery_Op:
             | NOT_LA ILIKE
                     { $$ = list_make1(makeString("!~~*")); }
 /* cannot put SIMILAR TO here, because SIMILAR TO is a hack.
- * the regular expression is preprocessed by a function (similar_escape),
+ * the regular expression is preprocessed by a function (similar_to_escape),
  * and the ~ operator for posix regular expressions is used.
- *        x SIMILAR TO y     ->    x ~ similar_escape(y)
+ *        x SIMILAR TO y     ->    x ~ similar_to_escape(y)
  * this transformation is made on the fly by the parser upwards.
  * however the SubLink structure which handles any/some/all stuff
  * is not ready for such a thing.
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 90a9197..3d38aef 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -654,15 +654,18 @@ textregexreplace(PG_FUNCTION_ARGS)
 }

 /*
- * similar_escape()
- * Convert a SQL:2008 regexp pattern to POSIX style, so it can be used by
- * our regexp engine.
+ * similar_to_escape(), similar_escape()
+ *
+ * Convert a SQL "SIMILAR TO" regexp pattern to POSIX style, so it can be
+ * used by our regexp engine.
+ *
+ * similar_escape_internal() is the common workhorse for three SQL-exposed
+ * functions.  esc_text can be passed as NULL to select the default escape
+ * (which is '\'), or as an empty string to select no escape character.
  */
-Datum
-similar_escape(PG_FUNCTION_ARGS)
+static text *
+similar_escape_internal(text *pat_text, text *esc_text)
 {
-    text       *pat_text;
-    text       *esc_text;
     text       *result;
     char       *p,
                *e,
@@ -673,13 +676,9 @@ similar_escape(PG_FUNCTION_ARGS)
     bool        incharclass = false;
     int            nquotes = 0;

-    /* This function is not strict, so must test explicitly */
-    if (PG_ARGISNULL(0))
-        PG_RETURN_NULL();
-    pat_text = PG_GETARG_TEXT_PP(0);
     p = VARDATA_ANY(pat_text);
     plen = VARSIZE_ANY_EXHDR(pat_text);
-    if (PG_ARGISNULL(1))
+    if (esc_text == NULL)
     {
         /* No ESCAPE clause provided; default to backslash as escape */
         e = "\\";
@@ -687,12 +686,11 @@ similar_escape(PG_FUNCTION_ARGS)
     }
     else
     {
-        esc_text = PG_GETARG_TEXT_PP(1);
         e = VARDATA_ANY(esc_text);
         elen = VARSIZE_ANY_EXHDR(esc_text);
         if (elen == 0)
             e = NULL;            /* no escape character */
-        else
+        else if (elen > 1)
         {
             int            escape_mblen = pg_mbstrlen_with_len(e, elen);

@@ -898,6 +896,65 @@ similar_escape(PG_FUNCTION_ARGS)

     SET_VARSIZE(result, r - ((char *) result));

+    return result;
+}
+
+/*
+ * similar_to_escape(pattern, escape)
+ */
+Datum
+similar_to_escape_2(PG_FUNCTION_ARGS)
+{
+    text       *pat_text = PG_GETARG_TEXT_PP(0);
+    text       *esc_text = PG_GETARG_TEXT_PP(1);
+    text       *result;
+
+    result = similar_escape_internal(pat_text, esc_text);
+
+    PG_RETURN_TEXT_P(result);
+}
+
+/*
+ * similar_to_escape(pattern)
+ * Inserts a default escape character.
+ */
+Datum
+similar_to_escape_1(PG_FUNCTION_ARGS)
+{
+    text       *pat_text = PG_GETARG_TEXT_PP(0);
+    text       *result;
+
+    result = similar_escape_internal(pat_text, NULL);
+
+    PG_RETURN_TEXT_P(result);
+}
+
+/*
+ * similar_escape(pattern, escape)
+ *
+ * Legacy function for compatibility with views stored using the
+ * pre-v13 expansion of SIMILAR TO.  Unlike the above functions, this
+ * is non-strict, which leads to not-per-spec handling of "ESCAPE NULL".
+ */
+Datum
+similar_escape(PG_FUNCTION_ARGS)
+{
+    text       *pat_text;
+    text       *esc_text;
+    text       *result;
+
+    /* This function is not strict, so must test explicitly */
+    if (PG_ARGISNULL(0))
+        PG_RETURN_NULL();
+    pat_text = PG_GETARG_TEXT_PP(0);
+
+    if (PG_ARGISNULL(1))
+        esc_text = NULL;        /* use default escape character */
+    else
+        esc_text = PG_GETARG_TEXT_PP(1);
+
+    result = similar_escape_internal(pat_text, esc_text);
+
     PG_RETURN_TEXT_P(result);
 }

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8733524..15045d0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3321,9 +3321,15 @@
   proname => 'repeat', prorettype => 'text', proargtypes => 'text int4',
   prosrc => 'repeat' },

-{ oid => '1623', descr => 'convert SQL99 regexp pattern to POSIX style',
+{ oid => '1623', descr => 'convert SQL regexp pattern to POSIX style',
   proname => 'similar_escape', proisstrict => 'f', prorettype => 'text',
   proargtypes => 'text text', prosrc => 'similar_escape' },
+{ oid => '1986', descr => 'convert SQL regexp pattern to POSIX style',
+  proname => 'similar_to_escape', prorettype => 'text',
+  proargtypes => 'text text', prosrc => 'similar_to_escape_2' },
+{ oid => '1987', descr => 'convert SQL regexp pattern to POSIX style',
+  proname => 'similar_to_escape', prorettype => 'text', proargtypes => 'text',
+  prosrc => 'similar_to_escape_1' },

 { oid => '1624',
   proname => 'mul_d_interval', prorettype => 'interval',
@@ -5743,10 +5749,10 @@
 { oid => '2073', descr => 'extract text matching regular expression',
   proname => 'substring', prorettype => 'text', proargtypes => 'text text',
   prosrc => 'textregexsubstr' },
-{ oid => '2074', descr => 'extract text matching SQL99 regular expression',
+{ oid => '2074', descr => 'extract text matching SQL regular expression',
   proname => 'substring', prolang => 'sql', prorettype => 'text',
   proargtypes => 'text text text',
-  prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3))' },
+  prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_to_escape($2, $3))' },

 { oid => '2075', descr => 'convert int8 to bitstring',
   proname => 'bit', prorettype => 'bit', proargtypes => 'int8 int4',
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 486c00b..2483966 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -410,7 +410,56 @@ SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
  cde
 (1 row)

--- PostgreSQL extension to allow using back reference in replace string;
+-- Check behavior of SIMILAR TO, which uses largely the same regexp variant
+SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
+ true
+------
+ t
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true;
+ true
+------
+ t
+(1 row)
+
+-- Postgres uses '\' as the default escape character, which is not per spec
+SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false;
+ false
+-------
+ f
+(1 row)
+
+-- and an empty string to mean "no escape", which is also not per spec
+SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true;
+ true
+------
+ t
+(1 row)
+
+-- these behaviors are per spec, though:
+SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null;
+ null
+------
+
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error;
+ERROR:  invalid escape string
+HINT:  Escape string must be empty or one character.
+-- Test back reference in regexp_replace
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
  regexp_replace
 ----------------
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 5744c9f..b5e75c3 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -144,7 +144,20 @@ SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
 -- With a parenthesized subexpression, return only what matches the subexpr
 SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";

--- PostgreSQL extension to allow using back reference in replace string;
+-- Check behavior of SIMILAR TO, which uses largely the same regexp variant
+SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
+SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false;
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false;
+SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true;
+-- Postgres uses '\' as the default escape character, which is not per spec
+SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false;
+-- and an empty string to mean "no escape", which is also not per spec
+SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true;
+-- these behaviors are per spec, though:
+SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null;
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error;
+
+-- Test back reference in regexp_replace
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
 SELECT regexp_replace('AAA   BBB   CCC   ', E'\\s+', ' ', 'g');
 SELECT regexp_replace('AAA', '^|$', 'Z', 'g');

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Donald Dong
Дата:
Сообщение: Re: Why could GEQO produce plans with lower costs than thestandard_join_search?
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Remove useless associativity/precedence from parsers