Discussion: sgml cleanup: unescaped '>' characters

I found myself rewriting the ./src/tools/find_gt_lt script in Perl this evening, since the existing script was quite broken (the main problem is it's not capable of understanding CDATA or sgml comment sections, and hence produces a bunch of noise).

The rewritten version picked up a few stylistic inconsistencies in the SGML, such as:
 * breaking the trailing '>' of an SGML marker across lines. AFAIK this is legal, but is a bit inconsistent and just confuses simplistic tools like find_gt_lt
 * using single quotes instead of double quotes to surround a node attribute, as in <orderedlist numeration='loweralpha'>

as well as seemingly-invalid SGML, such as using '>' unescaped inside normal SGML entries.

I've attached a patch to fix these problems. I can send in the new version of find_gt_lt if these changes prove useful.

Josh
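
For illustration, the kind of check described in the message above (skip comment and CDATA sections first, then look for angle brackets that are not part of a tag) could be sketched roughly as follows. This is a minimal Python sketch, not the Perl rewrite mentioned in the message, which is not shown in this thread. The function name `find_stray_angle_brackets` and the simplified tag pattern are assumptions made for the example, and attribute values that themselves contain '>' are not handled.

```python
import re

# Regions to skip entirely: SGML comments and CDATA marked sections.
SKIP = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>", re.DOTALL)
# Simplistic tag pattern (an assumption): '<', optional '/', a letter,
# then anything except angle brackets up to the closing '>'.
# The negated class also matches newlines, so tags broken across lines work.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def _blank(match):
    # Overwrite the matched region with spaces but keep newlines,
    # so that reported line numbers still refer to the original text.
    return re.sub(r"[^\n]", " ", match.group(0))


def find_stray_angle_brackets(text):
    """Return (line_number, character) pairs for every '<' or '>' that is
    outside tags, comments, and CDATA sections."""
    cleaned = SKIP.sub(_blank, text)
    cleaned = TAG.sub(_blank, cleaned)
    hits = []
    for lineno, line in enumerate(cleaned.splitlines(), start=1):
        for ch in line:
            if ch in "<>":
                hits.append((lineno, ch))
    return hits
```

Fed a file in which one line reads `<para>a > b</para>` and another holds `<!-- x > y -->`, this sketch would report only the first; the comment is blanked out before the scan, which is the noise reduction the message describes.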

On Wed, 2011-08-24 at 23:28 -0400, Josh Kupershmidt wrote:
> I found myself rewriting the ./src/tools/find_gt_lt script in Perl
> this evening, since the existing script was quite broken (the main
> problem is it's not capable of understanding CDATA or sgml comment
> sections, and hence produces a bunch of noise).
>
> The rewritten version picked up a few stylistic inconsistencies in the
> SGML, such as:
>  * breaking the trailing '>' of an SGML marker across lines. AFAIK
> this is legal, but is a bit inconsistent and just confuses simplistic
> tools like find_gt_lt

The cases you show don't appear to be terribly useful, but I think on occasion this can be necessary to work around some arcane whitespace rules in SGML or XML. (Just look at the generated HTML; it uses this technique throughout.)

>  * using single quotes instead of double quotes to surround a node
> attribute, as in <orderedlist numeration='loweralpha'>

It would be better if the tool could handle that, because sometimes you want to use single quotes if the value contains double quotes.

> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> normal SGML entries.

Unescaped > is valid, AFAIK.

On Sat, Aug 27, 2011 at 3:48 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Wed, 2011-08-24 at 23:28 -0400, Josh Kupershmidt wrote:
>> I found myself rewriting the ./src/tools/find_gt_lt script in Perl
>> this evening, since the existing script was quite broken (the main
>> problem is it's not capable of understanding CDATA or sgml comment
>> sections, and hence produces a bunch of noise).
>>
>> The rewritten version picked up a few stylistic inconsistencies in the
>> SGML, such as:
>>  * breaking the trailing '>' of an SGML marker across lines. AFAIK
>> this is legal, but is a bit inconsistent and just confuses simplistic
>> tools like find_gt_lt
>
> The cases you show don't appear to be terribly useful, but I think on
> occasion this can be necessary to work around some arcane whitespace
> rules in SGML or XML. (Just look at the generated HTML; it uses this
> technique throughout.)

Hrm, well if the spurious whitespace isn't serving any purpose in these cases, why not just fix it to match the rest of SGML style?

>>  * using single quotes instead of double quotes to surround a node
>> attribute, as in <orderedlist numeration='loweralpha'>
>
> It would be better if the tool could handle that, because sometimes you
> want to use single quotes if the value contains double quotes.

It's trivial to adjust the regex I was using to ignore such cases. I'm just on about stylistic consistency here. If there's a reason to use single quotes, such as when the value contains double quotes, then that's fine -- but I don't think any of the cases I pointed out fall under that category.

>> as well as seemingly-invalid SGML, such as using '>' unescaped inside
>> normal SGML entries.
>
> Unescaped > is valid, AFAIK.

Oh, that's interesting. I took a quick look at "The SGML FAQ book", page 73 [1], which supports this claim.

But I notice we've been fixing such issues in the recent past (e.g. commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to continue doing so? Not to mention the fact that we have ./src/tools/find_gt_lt, which while somewhat broken, has the ostensible goal of finding such problems in the SGML. Or do we want to stop worrying about '>' entirely, and rename find_gt_lt to find_lt, instead?

Josh

[1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false
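
The single-quote point in the message above can be made concrete with a small sketch: accept both quoting styles, and flag a single-quoted attribute only when its value contains no double quote, i.e. when the choice is purely stylistic. This is a hedged Python illustration, not the regex from the rewritten find_gt_lt (which does not appear in this thread); the pattern and the name `single_quoted_attrs` are assumptions made for the example.

```python
import re

# name="value" or name='value'; group 2 holds a double-quoted value,
# group 3 a single-quoted one. Simplified attribute syntax (an assumption).
ATTR = re.compile(r"""([A-Za-z][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def single_quoted_attrs(tag):
    """Return the names of attributes in `tag` that use single quotes even
    though the value contains no double quote."""
    flagged = []
    for m in ATTR.finditer(tag):
        value = m.group(3)  # None unless the single-quote branch matched
        if value is not None and '"' not in value:
            flagged.append(m.group(1))
    return flagged
```

With this, `<orderedlist numeration='loweralpha'>` is flagged for its `numeration` attribute, while a value that legitimately needs single quotes because it embeds double quotes is left alone.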

On Mon, 2011-08-29 at 18:22 -0500, Josh Kupershmidt wrote:
> >> The rewritten version picked up a few stylistic inconsistencies in the
> >> SGML, such as:
> >>  * breaking the trailing '>' of an SGML marker across lines. AFAIK
> >> this is legal, but is a bit inconsistent and just confuses simplistic
> >> tools like find_gt_lt
> >
> > The cases you show don't appear to be terribly useful, but I think on
> > occasion this can be necessary to work around some arcane whitespace
> > rules in SGML or XML. (Just look at the generated HTML; it uses this
> > technique throughout.)
>
> Hrm, well if the spurious whitespace isn't serving any purpose in
> these cases, why not just fix it to match the rest of SGML style?
>
> >>  * using single quotes instead of double quotes to surround a node
> >> attribute, as in <orderedlist numeration='loweralpha'>
> >
> > It would be better if the tool could handle that, because sometimes you
> > want to use single quotes if the value contains double quotes.
>
> It's trivial to adjust the regex I was using to ignore such cases. I'm
> just on about stylistic consistency here. If there's a reason to use
> single quotes, such as when the value contains double quotes, then
> that's fine -- but I don't think any of the cases I pointed out fall
> under that category.

I have committed your fixes relevant to these two points.

> >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> >> normal SGML entries.
> >
> > Unescaped > is valid, AFAIK.
>
> Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> page 73 [1], which supports this claim.
>
> But I notice we've been fixing such issues in the recent past (e.g.
> commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> continue doing so? Not to mention the fact that we have
> ./src/tools/find_gt_lt, which while somewhat broken, has the
> ostensible goal of finding such problems in the SGML. Or do we want to
> stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> instead?

> [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false

I don't know what the rationale for this tool is. I have never used it. Clearly, the reference shows, and the tools we use confirm, that it is not necessary to use it.

Peter Eisentraut wrote:
> > >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> > >> normal SGML entries.
> > >
> > > Unescaped > is valid, AFAIK.
> >
> > Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> > page 73 [1], which supports this claim.
> >
> > But I notice we've been fixing such issues in the recent past (e.g.
> > commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> > continue doing so? Not to mention the fact that we have
> > ./src/tools/find_gt_lt, which while somewhat broken, has the
> > ostensible goal of finding such problems in the SGML. Or do we want to
> > stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> > instead?
>
> > [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false
>
> I don't know what the rationale for this tool is. I have never used it.
> Clearly, the reference shows, and the tools we use confirm, that it is
> not necessary to use it.

I have updated the scripts and instructions accordingly.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

On Thu, 2011-09-01 at 10:17 -0400, Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > > >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> > > >> normal SGML entries.
> > > >
> > > > Unescaped > is valid, AFAIK.
> > >
> > > Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> > > page 73 [1], which supports this claim.
> > >
> > > But I notice we've been fixing such issues in the recent past (e.g.
> > > commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> > > continue doing so? Not to mention the fact that we have
> > > ./src/tools/find_gt_lt, which while somewhat broken, has the
> > > ostensible goal of finding such problems in the SGML. Or do we want to
> > > stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> > > instead?
> >
> > > [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false
> >
> > I don't know what the rationale for this tool is. I have never used it.
> > Clearly, the reference shows, and the tools we use confirm, that it is
> > not necessary to use it.
>
> I have updated the scripts and instructions accordingly.

That still leaves open why we bother about escaping <.

Peter Eisentraut wrote:
> On Thu, 2011-09-01 at 10:17 -0400, Bruce Momjian wrote:
> > Peter Eisentraut wrote:
> > > > >> as well as seemingly-invalid SGML, such as using '>' unescaped inside
> > > > >> normal SGML entries.
> > > > >
> > > > > Unescaped > is valid, AFAIK.
> > > >
> > > > Oh, that's interesting. I took a quick look at "The SGML FAQ book",
> > > > page 73 [1], which supports this claim.
> > > >
> > > > But I notice we've been fixing such issues in the recent past (e.g.
> > > > commit d420ba2a2d4ea4831f89a3fd7ce86b05eff932ff). Don't we want to
> > > > continue doing so? Not to mention the fact that we have
> > > > ./src/tools/find_gt_lt, which while somewhat broken, has the
> > > > ostensible goal of finding such problems in the SGML. Or do we want to
> > > > stop worrying about '>' entirely, and rename find_gt_lt to find_lt,
> > > > instead?
> > >
> > > > [1] http://books.google.com/books?id=OyJHFJsnh10C&lpg=PA229&ots=DGkYDdvbhE&pg=PA73#v=onepage&q&f=false
> > >
> > > I don't know what the rationale for this tool is. I have never used it.
> > > Clearly, the reference shows, and the tools we use confirm, that it is
> > > not necessary to use it.
> >
> > I have updated the scripts and instructions accordingly.
>
> That still leaves open why we bother about escaping <.

The problem is that I often add SGML that has:

    if (1 < 0) ...

I need something to warn me about those, especially in the release notes.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

On Thu, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > That still leaves open why we bother about escaping <.
>
> The problem is that I often add SGML that has:
>
>     if (1 < 0) ...
>
> I need something to warn me about those, especially in the release
> notes.

Why do you need to be warned about that?

Peter Eisentraut wrote:
> On Thu, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > That still leaves open why we bother about escaping <.
> >
> > The problem is that I often add SGML that has:
> >
> >     if (1 < 0) ...
> >
> > I need something to warn me about those, especially in the release
> > notes.
>
> Why do you need to be warned about that?

If I have:

    if (1 < fred)

it will think "fred" is a SGML tag, no?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

On Thu, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > On Thu, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > That still leaves open why we bother about escaping <.
> > >
> > > The problem is that I often add SGML that has:
> > >
> > >     if (1 < 0) ...
> > >
> > > I need something to warn me about those, especially in the release
> > > notes.
> >
> > Why do you need to be warned about that?
>
> If I have:
>
>     if (1 < fred)
>
> it will think "fred" is a SGML tag, no?

No, a < followed by a space is not a tag, it's character data. If it thought it were a tag, it would complain.
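
The rule stated in the reply above can be illustrated with a rough sketch that classifies each '<' by the character that follows it. This is a simplified Python approximation of how an SGML parser recognizes markup in content, written for this example only: the function name `classify_lt` and the exact set of follow-up characters are assumptions, and a real parser applies more detailed rules (for instance about what may follow '<!').

```python
import re

# Assumed simplification: '<' opens markup only when followed by a letter
# (start tag), '/' plus a letter (end tag), '!' (declaration or comment),
# or '?' (processing instruction). Anything else leaves it as character data.
MARKUP_START = re.compile(r"<(?:[A-Za-z]|/[A-Za-z]|!|\?)")


def classify_lt(text):
    """Return (offset, kind) for every '<' in text, where kind is
    'markup' or 'data'."""
    result = []
    for m in re.finditer(r"<", text):
        kind = "markup" if MARKUP_START.match(text, m.start()) else "data"
        result.append((m.start(), kind))
    return result
```

Under these assumptions the '<' in `if (1 < fred)` is classified as data because a space follows it, whereas `if (1 <fred)` would be treated as the start of a tag named fred, which is the case a parser would complain about.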

Peter Eisentraut wrote:
> On Thu, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> > Peter Eisentraut wrote:
> > > On Thu, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > > That still leaves open why we bother about escaping <.
> > > >
> > > > The problem is that I often add SGML that has:
> > > >
> > > >     if (1 < 0) ...
> > > >
> > > > I need something to warn me about those, especially in the release
> > > > notes.
> > >
> > > Why do you need to be warned about that?
> >
> > If I have:
> >
> >     if (1 < fred)
> >
> > it will think "fred" is a SGML tag, no?
>
> No, a < followed by a space is not a tag, it's character data. If it
> thought it were a tag, it would complain.

Sometimes it is '<' (in single quotes), which I thought would be a problem.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

On Sat, 2011-09-03 at 16:47 -0400, Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > On Thu, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> > > Peter Eisentraut wrote:
> > > > On Thu, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > > > That still leaves open why we bother about escaping <.
> > > > >
> > > > > The problem is that I often add SGML that has:
> > > > >
> > > > >     if (1 < 0) ...
> > > > >
> > > > > I need something to warn me about those, especially in the release
> > > > > notes.
> > > >
> > > > Why do you need to be warned about that?
> > >
> > > If I have:
> > >
> > >     if (1 < fred)
> > >
> > > it will think "fred" is a SGML tag, no?
> >
> > No, a < followed by a space is not a tag, it's character data. If it
> > thought it were a tag, it would complain.
>
> Sometimes it is '<' (in single quotes), which I thought would be a
> problem.

The bottom line is, the SGML parser can figure that out itself, and if it has a problem, it will complain. We don't need to second guess it with regular expressions that are handcrafted out of thin air.

I was hoping you would remember whether you initially put this in because of some tool problem. But if we are not finding any supporting evidence, I would suggest that we just scrap this thing entirely.

Peter Eisentraut wrote:
> On Sat, 2011-09-03 at 16:47 -0400, Bruce Momjian wrote:
> > Peter Eisentraut wrote:
> > > On Thu, 2011-09-01 at 17:31 -0400, Bruce Momjian wrote:
> > > > Peter Eisentraut wrote:
> > > > > On Thu, 2011-09-01 at 14:17 -0400, Bruce Momjian wrote:
> > > > > > > That still leaves open why we bother about escaping <.
> > > > > >
> > > > > > The problem is that I often add SGML that has:
> > > > > >
> > > > > >     if (1 < 0) ...
> > > > > >
> > > > > > I need something to warn me about those, especially in the release
> > > > > > notes.
> > > > >
> > > > > Why do you need to be warned about that?
> > > >
> > > > If I have:
> > > >
> > > >     if (1 < fred)
> > > >
> > > > it will think "fred" is a SGML tag, no?
> > >
> > > No, a < followed by a space is not a tag, it's character data. If it
> > > thought it were a tag, it would complain.
> >
> > Sometimes it is '<' (in single quotes), which I thought would be a
> > problem.
>
> The bottom line is, the SGML parser can figure that out itself, and if
> it has a problem, it will complain. We don't need to second guess it
> with regular expressions that are handcrafted out of thin air.
>
> I was hoping you would remember whether you initially put this in
> because of some tool problem. But if we are not finding any supporting
> evidence, I would suggest that we just scrap this thing entirely.

I put it in to warn about release.sgml markup problems, so I properly escaped all non-tag '>' and '<' characters.

I have removed the tool. We can always re-add it if we find it is needed.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +