Discussion: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore
BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore
From
"Marek Lewczuk"
Date:
The following bug has been logged online:

Bug reference:      5075
Logged by:          Marek Lewczuk
Email address:      marek@lewczuk.com
PostgreSQL version: 8.4.0
Operating system:   All
Description:        Text Search parser does not identify xml tag when attribute name's contains underscore
Details:

Please execute the following example:

select * from ts_debug('english', '<img width="182" height="120" align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')

As the result shows, <img/> is not identified as an XML tag, but is
instead split into words, blank spaces, etc. The reason is that the last
attribute, "test_aa", contains an underscore in its name; when the
underscore is removed, the img tag is properly identified as an XML tag.

The XML definition allows underscores in tag and attribute names.
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore
From
Euler Taveira de Oliveira
Date:
Marek Lewczuk wrote:
> Please execute the following example:
> select * from ts_debug('english', '<img width="182" height="120"
> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
>
> As the result shows, <img/> is not identified as an XML tag, but is
> instead split into words, blank spaces, etc. The reason is that the last
> attribute, "test_aa", contains an underscore in its name; when the
> underscore is removed, the img tag is properly identified as an XML tag.
>
> The XML definition allows underscores in tag and attribute names.
>
The problem is that we already allow the underscore in tag names but not in
attribute names. So the proper fix is to allow the underscore when the state
is TPS_InTag; according to the XML spec [1], the underscore is a valid
character in attribute names.

A possible downside is that we don't have underscores in HTML attribute names.
In this case, should it fail? I don't think so but...

The problem exists in 8.3, 8.4 and HEAD. The fix is trivial, so I don't see a
problem with back-patching it.

[1] http://www.w3.org/TR/REC-xml/#sec-common-syn

--
  Euler Taveira de Oliveira
  http://www.timbira.com/

Index: wparser_def.c
===================================================================
RCS file: /a/pgsql/dev/anoncvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.24
diff -c -r1.24 wparser_def.c
*** wparser_def.c	16 Jul 2009 06:33:44 -0000	1.24
--- wparser_def.c	23 Sep 2009 23:19:28 -0000
***************
*** 1225,1230 ****
--- 1225,1231 ----
  	{p_isdigit, 0, A_NEXT, TPS_Null, 0, NULL},
  	{p_iseqC, '=', A_NEXT, TPS_Null, 0, NULL},
  	{p_iseqC, '-', A_NEXT, TPS_Null, 0, NULL},
+ 	{p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
  	{p_iseqC, '#', A_NEXT, TPS_Null, 0, NULL},
  	{p_iseqC, '/', A_NEXT, TPS_Null, 0, NULL},
  	{p_iseqC, ':', A_NEXT, TPS_Null, 0, NULL},
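
To illustrate why a single missing character class breaks the whole tag, here is a minimal standalone sketch. It is not the code from wparser_def.c: the functions accepted_in_tag() and scans_as_tag() are hypothetical, and the set of accepted characters is a simplified assumption. It only mimics the idea that, while inside a tag, every unquoted character must match some accepted class, otherwise the input is no longer treated as a tag.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical, simplified stand-in for the characters accepted while
 * scanning inside a tag. The real parser's table is larger; this set is
 * an assumption made for illustration only.
 */
static bool
accepted_in_tag(char c, bool allow_underscore)
{
	if (isalnum((unsigned char) c) || c == ' ')
		return true;
	if (c == '_')
		return allow_underscore;	/* the character class the patch adds */
	return c != '\0' && strchr("=-#/:", c) != NULL;
}

/*
 * Returns true if s is exactly one tag "<...>" in which every character
 * outside quoted attribute values is accepted. Quoted values may contain
 * anything up to the matching closing quote.
 */
static bool
scans_as_tag(const char *s, bool allow_underscore)
{
	char		quote = '\0';

	if (*s++ != '<')
		return false;

	for (; *s != '\0'; s++)
	{
		if (quote != '\0')
		{
			if (*s == quote)
				quote = '\0';	/* closing quote: back to tag scanning */
		}
		else if (*s == '"' || *s == '\'')
			quote = *s;			/* opening quote of an attribute value */
		else if (*s == '>')
			return s[1] == '\0';	/* tag must end the input */
		else if (!accepted_in_tag(*s, allow_underscore))
			return false;		/* unknown character: not a tag */
	}
	return false;				/* ran out of input before '>' */
}
```

Calling scans_as_tag() on the img string from the report with allow_underscore set to false rejects it at the underscore in test_aa, while the same call with allow_underscore set to true accepts it. That matches the behaviour described above, under the stated simplifications.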
On Wed, Sep 23, 2009 at 7:31 PM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:
> Marek Lewczuk wrote:
>> Please execute the following example:
>> select * from ts_debug('english', '<img width="182" height="120"
>> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
>>
>> As the result shows, <img/> is not identified as an XML tag, but is
>> instead split into words, blank spaces, etc. The reason is that the last
>> attribute, "test_aa", contains an underscore in its name; when the
>> underscore is removed, the img tag is properly identified as an XML tag.
>>
>> The XML definition allows underscores in tag and attribute names.
>>
> The problem is that we already allow the underscore in tag names but not in
> attribute names. So the proper fix is to allow the underscore when the state
> is TPS_InTag; according to the XML spec [1], the underscore is a valid
> character in attribute names.
>
> A possible downside is that we don't have underscores in HTML attribute names.
> In this case, should it fail? I don't think so but...
>
> The problem exists in 8.3, 8.4 and HEAD. The fix is trivial, so I don't see a
> problem with back-patching it.

This patch should probably be added to
https://commitfest.postgresql.org/action/commitfest_view/open so that
we don't lose track of it.

...Robert
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore
From
Selena Deckelmann
Date:
On Sun, Sep 27, 2009 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 23, 2009 at 7:31 PM, Euler Taveira de Oliveira
> <euler@timbira.com> wrote:
>
>> The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix so I think there
>> isn't a problem to back-patch it.
>
> This patch should probably be added to
> https://commitfest.postgresql.org/action/commitfest_view/open so that
> we don't lose track of it.

Done.

--
http://chesnok.com/daily - me
http://endpoint.com - work
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore
From
Peter Eisentraut
Date:
On Wed, 2009-09-23 at 20:31 -0300, Euler Taveira de Oliveira wrote:
> Marek Lewczuk wrote:
> > Please execute the following example:
> > select * from ts_debug('english', '<img width="182" height="120"
> > align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
> >
> > As the result shows, <img/> is not identified as an XML tag, but is
> > instead split into words, blank spaces, etc. The reason is that the last
> > attribute, "test_aa", contains an underscore in its name; when the
> > underscore is removed, the img tag is properly identified as an XML tag.
> >
> > The XML definition allows underscores in tag and attribute names.
> >
> The problem is that we already allow the underscore in tag names but not in
> attribute names. So the proper fix is to allow the underscore when the state
> is TPS_InTag; according to the XML spec [1], the underscore is a valid
> character in attribute names.
>
> A possible downside is that we don't have underscores in HTML attribute names.
> In this case, should it fail? I don't think so but...
>
> The problem exists in 8.3, 8.4 and HEAD. The fix is trivial, so I don't see a
> problem with back-patching it.

Fix committed to 8.3, 8.4, 8.5.
On 2009-11-15 14:56, Peter Eisentraut wrote:
> On Wed, 2009-09-23 at 20:31 -0300, Euler Taveira de Oliveira wrote:
>> Marek Lewczuk wrote:
>>> Please execute the following example:
>>> select * from ts_debug('english', '<img width="182" height="120"
>>> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
> Fix committed to 8.3, 8.4, 8.5.

Great. Thanks.

ML