Thread: [BUGS] BUG #14628: regex description in online documentation misleadingly/wrong


[BUGS] BUG #14628: regex description in online documentation misleadingly/wrong

From
t.glaser@tarent.de
Date:
The following bug has been logged on the website:

Bug reference:      14628
Logged by:          Thorsten Glaser
Email address:      t.glaser@tarent.de
PostgreSQL version: 9.6.1
Operating system:   GNU/Linux
Description:

https://www.postgresql.org/docs/9.6/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP
clearly says that ~ matches a POSIX regular expression.

This is only somewhat true: this does match:

tarent=> SELECT 'a\bc' ~ '^[a\\b]*$';
 ?column?
----------
 f
(1 row)

tarent=> SELECT 'a\b' ~ '^[a\\b]*$';
 ?column?
----------
 t
(1 row)


But this does not match:

tarent=> SELECT 'a\b' ~ '^[a\b]*$';
 ?column?
----------
 f
(1 row)


The cause is likely this statement, burrowed way down in another chapter:
“Note: PostgreSQL always initially presumes that a regular expression
follows the ARE rules.”

And indeed, it’s an ARE!

tarent=> SELECT 'a\b' ~ '(?e)^[a\b]*$';
 ?column?
----------
 t
(1 row)


I find this extremely misleading (it also does not state whether it matches
BRE or ERE by default, just “POSIX re”), especially as it’s extremely
important to know precisely what RE syntax you’re targeting when escaping a
user-provided string into part of a RE (you have to precisely know where to
escape and where to not escape, for example), which is why I personally
always use POSIX standard RE (normally BRE).
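
To illustrate why the flavor matters when escaping, here is a minimal sketch, assuming standard_conforming_strings is on (the default in 9.6) and using made-up sample strings. It relies on two documented behaviors: the ***= director, which makes the rest of the pattern a literal string, and the ARE rule that a backslash followed by a non-alphanumeric character matches that character literally. The backslash-escaping step is written for ARE rules; under BRE rules, for instance, \( would start a group instead.

```sql
-- Assumes standard_conforming_strings = on; sample strings are illustrative.

-- 1. Whole pattern is user input: the ***= director treats it as a literal string.
SELECT 'price: 1.50' ~ ('***=' || '1.50');   -- true
SELECT 'price: 1x50' ~ ('***=' || '1.50');   -- false: '.' is not a wildcard here

-- 2. User input embedded in a larger ARE: backslash-escape every character
--    that is neither alphanumeric nor whitespace.
SELECT regexp_replace('1.50 (net)', '([^[:alnum:][:space:]])', '\\\1', 'g');
-- yields: 1\.50 \(net\)

SELECT 'total 1.50 (net)'
       ~ ('^total ' || regexp_replace('1.50 (net)', '([^[:alnum:][:space:]])', '\\\1', 'g') || '$');
-- true
```

The escaped text is only safe outside bracket expressions; inside brackets the rules differ again, which is the kind of detail the report is about.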

Please indicate in *all* places in the documentation dealing with regular
expressions that it’s about ARE and link ARE to the section in the manual
explaining it -
https://www.postgresql.org/docs/9.6/static/functions-matching.html#POSIX-SYNTAX-DETAILS
- in all of those places. Also, make clear at the beginning of that section
how to force standard POSIX RE (i.e. BRE and ERE).
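
For readers following along, a minimal sketch of selecting the flavor per pattern, assuming standard_conforming_strings is on as in the session above. The (?b) and (?e) embedded options and the ***: director are the ones described in the "Regular Expression Details" subsection; the first two results are the ones reported earlier in this message, and the last two queries are left without an expected result.

```sql
-- Default: the pattern is parsed as an ARE, where \b inside brackets
-- is the backspace escape rather than a backslash followed by 'b'.
SELECT 'a\b' ~ '^[a\b]*$';         -- false (result reported above)

-- (?e): the rest of the pattern follows POSIX ERE rules,
-- so the backslash inside the brackets is an ordinary character.
SELECT 'a\b' ~ '(?e)^[a\b]*$';     -- true (result reported above)

-- (?b): the rest of the pattern follows POSIX BRE rules.
SELECT 'a\b' ~ '(?b)^[a\b]*$';

-- ***: is the explicit spelling of the default (ARE).
SELECT 'a\b' ~ '***:^[a\b]*$';
```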


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14628: regex description in online documentation misleadingly/wrong

From
"David G. Johnston"
Date:
On Thu, Apr 20, 2017 at 8:25 AM, <t.glaser@tarent.de> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14628
> Logged by:          Thorsten Glaser
> Email address:      t.glaser@tarent.de
> PostgreSQL version: 9.6.1
> Operating system:   GNU/Linux
> Description:
>
> https://www.postgresql.org/docs/9.6/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP
> clearly says that ~ matches a POSIX regular expression.
>
> This is only somewhat true: this does match:


Based on what you wrote below I'd maybe (though leaning toward not) modify the chapter title to "POSIX (ARE) Regular Expressions".

I would then likely add two more sentences before Table 9-14 (before the existing intro sentence).

POSIX regular expressions come in multiple flavors, of which PostgreSQL uses ARE by default.  Further information on these flavors is presented in the first subsection, "Regular Expression Details", below.  What follows is an overview of the general mechanics involved with any regular expression.


> The cause is likely this statement, burrowed way down in another chapter:
> “Note: PostgreSQL always initially presumes that a regular expression
> follows the ARE rules.”

While this maybe could be improved, the above characterization seems overblown.  9.7.3.1 is a sub-section of 9.7.3, so "[buried] way down" isn't accurate.  That we choose to provide the high-level conceptual overview of regular expressions first, and then delve into ARE/BRE/ERE has caused few or no complaints from the typical reader for whom the defaults are adequate and they just want to know how to get things to work in the simple case.

 
> And indeed, it’s an ARE!
>
> tarent=> SELECT 'a\b' ~ '(?e)^[a\b]*$';
>  ?column?
> ----------
>  t
> (1 row)
>
> I find this extremely misleading (it also does not state whether it matches
> BRE or ERE by default, just “POSIX re”),

You missed the big bubble note in 9.7.3.1: "PostgreSQL always initially presumes that a regular expression follows the ARE rules".
 
> especially as it’s extremely
> important to know precisely what RE syntax you’re targeting when escaping a
> user-provided string into part of a RE (you have to precisely know where to
> escape and where to not escape, for example),

I'd say that is advanced usage, and as you were able to find the needed documentation in 9.7.3.1 I'm not sure there is anything to fix based upon this.

> which is why I personally
> always use POSIX standard RE (normally BRE).

So basically you feel it's necessary for us to redundantly emphasize the fact that we default to ARE because it's different from your default choice and, you imply but do not support, the choice of the majority of other regular expression implementations.  If one wants to understand the regular expression implementation they read 9.7.3 - in all other places we can just call them regular expressions.  Now, as I note below, if you have specific areas that you think need to be fixed please point them out.


> Please indicate in *all* places in the documentation dealing with regular
> expressions that it’s about ARE and link ARE to the section in the manual
> explaining it -
> https://www.postgresql.org/docs/9.6/static/functions-matching.html#POSIX-SYNTAX-DETAILS
> - in all of those places. Also, make clear at the beginning of that section
> how to force standard POSIX RE (i.e. BRE and ERE).

You seem to have a very firm grasp of the topic and so might consider some actual firm suggestions and/or a patch.  I've not seen an actual factual omission or error in all of this and while I firmly believe that documentation can always be improved, and that the TCL implementation that we use has its quirks, I don't foresee the requested surgery happening from scratch based upon this report.  I've suggested a fairly easy clarification at the top of the chapter (9.7.3) to at least bring immediate awareness of the flavor issue.  Does that work for you?

David J.

Re: [BUGS] BUG #14628: regex description in online documentation misleadingly/wrong

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> So basically you feel it's necessary for us to redundantly emphasize the
> fact that we default to ARE because it's different from your default choice
> and, you imply but do not support, the choice of the majority of other
> regular expression implementations.

I can't get excited about this.  It seems fairly difficult to me to make
the case that more people would take "POSIX regular expression" to mean
the basic form than the extended form.  Most of our users probably don't
know the difference in the first place, and would consider section 9.7.3.1
to be *the* definition of what an RE is to Postgres.  Those who do know
the difference would probably also turn to 9.7.3.1 to find out which form
we're talking about.  Once you get there, the requested information is
pretty much the first thing you find.

It's possible that we should restructure the section nesting to bring
9.7.3.1 up a level and thus make it more visible.  I'm thinking something
like

9.7.1. LIKE
9.7.2. SIMILAR TO Regular Expressions
9.7.3. POSIX Regular Expression Operators
9.7.4. POSIX Regular Expression Definition

But that would be doing some violence to the basic structure of the
chapter.  We don't have separate sections for the definition of LIKE
patterns or SIMILAR TO patterns --- admittedly, they hardly need it.

One really simple change that might be worth doing is to turn the
sentence "The POSIX pattern language is described in much greater detail
below" into an actual link to 9.7.3.1.  Once upon a time there wasn't
much material in between, but now it seems like a clickable link would
be good.
        regards, tom lane



Re: [BUGS] BUG #14628: regex description in online documentation misleadingly/wrong

From
Thorsten Glaser
Date:
On Thu, 20 Apr 2017, David G. Johnston wrote:

> Based on what you wrote below I'd maybe (though leaning toward not) modify
> the chapter title to "POSIX (ARE) Regular Expressions"

Your ARE are not POSIX regular expressions, period.

They seem, according to the source, to come from Tcl.

> POSIX regular expressions come in multiple flavors, of which PostgreSQL
> uses ARE by default.

Strike the POSIX here:

| Regular expressions come in multiple flavors, of which PostgreSQL uses
| ARE by default, supporting POSIX basic and extended regular expressions
| (BRE and ERE, respectively) as an option.

>  Further information on these flavors is presented in
> the first subsection, "Regular Expression Details", below.  What follows is
> an overview of the general mechanics involved with any regular expression.

With my suggested change, I’d like this.

> or no complaints from the typical reader for whom the defaults are adequate
> and they just want to know how to get things to work in the simple case.

Right, I’m just not your typical reader. As an example, when I learnt
Python in a week-long crash course I delved into the C source code of
its standard library to learn how it behaves in some corner cases the
tutor didn’t know and which are undocumented, as I felt that important
to my handling of the language.

Something similar is going on here.

> So basically you feel it's necessary for us to redundantly emphasize the

OK, maybe not redundantly then. But please do not ever write “POSIX regular
expression” when a function or operator defaults to ARE.

> You seem to have a very firm grasp of the topic and so might consider some
> actual firm suggestions and/or a patch.  I've not seen an actual factual
> omission or error in all of this and while I firmly believe that
> documentation can always be improved, and that the TCL implementation that
> we use has its quirks, I don't foresee the requested surgery happening from
> scratch based upon this report.  I've suggested a fairly easy clarification
> at the top of the chapter (9.7.3) to at least bring immediate awareness of
> the flavor issue.  Does that work for you?

See above. I’m definitely willing to help out with this and open for
further discussion.


On Thu, 20 Apr 2017, Tom Lane wrote:

> I can't get excited about this.  It seems fairly difficult to me to make
> the case that more people would take "POSIX regular expression" to mean
> the basic form than the extended form.

My problem here is not BRE vs. ERE/ARE but POSIX RE (BRE/ERE) vs. ARE,
as ARE are *not* POSIX RE.
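
To make the distinction concrete, a small sketch of syntax that exists in ARE but not in POSIX ERE, assuming standard_conforming_strings is on and using made-up sample strings. According to the "Regular Expression Details" documentation, class-shorthand escapes such as \d and non-greedy quantifiers such as +? are available in AREs only; in an ERE, a backslash followed by an alphanumeric character outside brackets merely stands for that character.

```sql
-- ARE (default): \d is the digit class shorthand.
SELECT '123' ~ '^\d+$';                  -- true

-- ARE (default): +? is a non-greedy quantifier, so the shortest match wins.
SELECT substring('aaa' from 'a+?');      -- a

-- ERE via (?e): \d is not a shorthand here; it stands for a plain 'd'.
SELECT '123' ~ '(?e)^\d+$';
```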

>  Most of our users probably don't

See above…

> know the difference in the first place, and would consider section 9.7.3.1
> to be *the* definition of what an RE is to Postgres.  Those who do know

OK. But if some text like the one suggested at the beginning of this
mail is added _and_ everywhere in the documentation, functions and
operators taking ARE by default are NOT documented as “POSIX regular
expression anything”, I’d be happier.

> It's possible that we should restructure the section nesting to bring
> 9.7.3.1 up a level and thus make it more visible.  I'm thinking something
> like
>
> 9.7.1. LIKE
> 9.7.2. SIMILAR TO Regular Expressions
> 9.7.3. POSIX Regular Expression Operators
> 9.7.4. POSIX Regular Expression Definition

They’re not POSIX, is the problem.

> But that would be doing some violence to the basic structure of the
> chapter.  We don't have separate sections for the definition of LIKE
> patterns or SIMILAR TO patterns --- admittedly, they hardly need it.

Hmh. But if they had chapters, they can be linked to as well…

> One really simple change that might be worth doing is to turn the
> sentence "The POSIX pattern language is described in much greater detail
> below" into an actual link to 9.7.3.1.  Once upon a time there wasn't

That could also help, yes.

> much material in between, but now it seems like a clickable link would
> be good.

Thanks,
//mirabilos
--
tarent solutions GmbH
Rochusstraße 2-4, D-53123 Bonn • http://www.tarent.de/
Tel: +49 228 54881-393 • Fax: +49 228 54881-235
HRB 5168 (AG Bonn) • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg



Re: [BUGS] BUG #14628: regex description in online documentation misleadingly/wrong

From
"David G. Johnston"
Date:
On Thu, Apr 20, 2017 at 10:15 AM, Thorsten Glaser <t.glaser@tarent.de> wrote:
> On Thu, 20 Apr 2017, David G. Johnston wrote:
>
> > Based on what you wrote below I'd maybe (though leaning toward not) modify
> > the chapter title to "POSIX (ARE) Regular Expressions"
>
> Your ARE are not POSIX regular expressions, period.

Skimming these:


I suppose doing s/POSIX/TCL/g (limited to regular expression usage obviously) would be an acceptable change.

Adding "Advanced" in appropriate places would be good too.

I suppose I took it at face value that we were POSIX compliant rather than just POSIX ERE compatible.  It sounds like that was a false premise, though, and that our use of POSIX, for those people for whom it's not just a noise word, implies a set of acceptable behaviors that we far exceed.  Calling it TCL is more accurate and allows people to easily tie back our implementation to external resources like the first one above.

David J.