Discussion: tsearch Parser Hacking
Hackers,

Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token? That is, instead of this:

try=# select * from ts_debug('simple'::regconfig, 'w/d');
 alias │    description    │ token │ dictionaries │ dictionary │ lexemes
───────┼───────────────────┼───────┼──────────────┼────────────┼─────────
 file  │ File or path name │ w/d   │ {simple}     │ simple     │ {w/d}

Ideally it'd think that / was the same as -:

try=# select * from ts_debug('simple'::regconfig, 'w-d');
      alias      │           description           │ token │ dictionaries │ dictionary │ lexemes
─────────────────┼─────────────────────────────────┼───────┼──────────────┼────────────┼─────────
 asciihword      │ Hyphenated word, all ASCII      │ w-d   │ {simple}     │ simple     │ {w-d}
 hword_asciipart │ Hyphenated word part, all ASCII │ w     │ {simple}     │ simple     │ {w}
 blank           │ Space symbols                   │ -     │ {}           │ [null]     │ [null]
 hword_asciipart │ Hyphenated word part, all ASCII │ d     │ {simple}     │ simple     │ {d}
(4 rows)

Possible? Or would I have to write a completely new parser just to change this bit?

Thanks,

David
"David E. Wheeler" <david@kineticode.com> writes: > Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token? There is zero, none, nada, provision for modifying the behavior of the default parser, other than by changing its compiled-in state transition tables. It doesn't help any that said tables are baroquely designed and utterly undocumented. IMO, sooner or later we need to trash that code and replace it with something a bit more modification-friendly. regards, tom lane
On 14 February 2011 23:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David E. Wheeler" <david@kineticode.com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.

This is very true. I intended to look into adding new tokens, but gave
up when I couldn't see how those transition tables worked.

> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

+1 for annihilating the existing code at some point.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
On Mon, Feb 14, 2011 at 6:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David E. Wheeler" <david@kineticode.com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.
>
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

I added this to the TODO as something that can be tackled in the
future. I've been wishing it would be possible to add other tokens as
well (Python dotted path 'foo.bar.baz', Perl namespace path
'Foo::Bar', more flexible version number parsing, etc).

David Blewett
On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:

> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.
>
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

I was afraid you'd say that. Thanks.

David
I agree that it will be a good idea to rewrite the entire thing. However, in the meantime, I sent a proposal earlier:
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php
And a patch later:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php
Tom asked me to look into Compound Word support but I found it not usable. Here was my response:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00419.php
I have not received any response since then.
-Sushant.
On Tue, Feb 15, 2011 at 9:33 AM, David E. Wheeler <david@kineticode.com> wrote:
> On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:
>
>> There is zero, none, nada, provision for modifying the behavior of the
>> default parser, other than by changing its compiled-in state transition
>> tables.
>>
>> It doesn't help any that said tables are baroquely designed and utterly
>> undocumented.
>>
>> IMO, sooner or later we need to trash that code and replace it with
>> something a bit more modification-friendly.
>
> I was afraid you'd say that. Thanks.
>
> David
David,

it's not easy to hack the tsearch parser, sorry. You can preparse your input
before passing it to to_tsquery or to_tsvector.

Oleg

On Mon, 14 Feb 2011, David E. Wheeler wrote:

> Hackers,
>
> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token? That is, instead of this:
>
> try=# select * from ts_debug('simple'::regconfig, 'w/d');
>  alias │    description    │ token │ dictionaries │ dictionary │ lexemes
> ───────┼───────────────────┼───────┼──────────────┼────────────┼─────────
>  file  │ File or path name │ w/d   │ {simple}     │ simple     │ {w/d}
>
> Ideally it'd think that / was the same as -:
>
> try=# select * from ts_debug('simple'::regconfig, 'w-d');
>       alias      │           description           │ token │ dictionaries │ dictionary │ lexemes
> ─────────────────┼─────────────────────────────────┼───────┼──────────────┼────────────┼─────────
>  asciihword      │ Hyphenated word, all ASCII      │ w-d   │ {simple}     │ simple     │ {w-d}
>  hword_asciipart │ Hyphenated word part, all ASCII │ w     │ {simple}     │ simple     │ {w}
>  blank           │ Space symbols                   │ -     │ {}           │ [null]     │ [null]
>  hword_asciipart │ Hyphenated word part, all ASCII │ d     │ {simple}     │ simple     │ {d}
> (4 rows)
>
> Possible? Or would I have to write a completely new parser just to change this bit?
>
> Thanks,
>
> David

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Feb 14, 2011, at 11:37 PM, Oleg Bartunov wrote:

> it's not easy to hack tsearch parser, sorry. You can preparse your input
> before to_tsquery,to_tsvector.

Yeah, I was thinking about s{/}{-}g before passing the values in. Might be the only way to do it for now…

Thanks,

David
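[Editor's illustration] The pre-parsing workaround discussed above can be sketched on the client side. This is only a minimal sketch, not anything posted in the thread: the helper name `preparse` is hypothetical, and unlike the plain `s{/}{-}g` substitution it only rewrites a slash that sits between two word characters, which is an assumption about what is wanted here.

```python
import re

# Hypothetical helper, not part of PostgreSQL or of any client library.
# Matches a "/" that has a word character directly before and after it.
_SLASH_BETWEEN_WORDS = re.compile(r"(?<=\w)/(?=\w)")


def preparse(text: str) -> str:
    """Rewrite 'w/d' as 'w-d' so the default parser sees a hyphenated word.

    Slashes that are not surrounded by word characters are left alone.
    """
    return _SLASH_BETWEEN_WORDS.sub("-", text)


if __name__ == "__main__":
    print(preparse("w/d"))  # w-d
```

The cleaned string would then be passed as an ordinary bound parameter to to_tsvector or to_tsquery. A plain `text.replace("/", "-")` is the closer equivalent of `s{/}{-}g`, at the cost of also rewriting slashes in real paths.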
On Mon, 14 Feb 2011, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.

What do you mean by 'baroquely'? Do you know 'gothic' design? :)

> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

We thought about a configurable parser but, AFAIR, we didn't get any support for it at the time.

Regards,
Oleg
On Mon, 14 Feb 2011, David E. Wheeler wrote:

> On Feb 14, 2011, at 11:37 PM, Oleg Bartunov wrote:
>
>> it's not easy to hack tsearch parser, sorry. You can preparse your input
>> before to_tsquery,to_tsvector.
>
> Yeah, I was thinking about s{/}{-}g before passing the values in. Might be the only way to do it for now…

Actually, it's not so difficult to *hack* the parser to treat '/' as '-'. I thought about overriding some of the default parser's behaviour, but didn't come up with any useful solution. BTW, some users have already written their own parsers, and I even have a little tutorial:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html

I wonder if it's worth adding it to
http://www.postgresql.org/docs/8.4/static/test-parser.html

Probably a good paper/presentation, along with improved code docs, would be enough for now, until someone has a very bright idea about the parser and the time to implement it.

Regards,
Oleg
On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:

>> IMO, sooner or later we need to trash that code and replace it with
>> something a bit more modification-friendly.
>
> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.

What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions? Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.

Best,

David
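[Editor's illustration] To make the "nested array of tokens" idea concrete, here is a rough sketch of what such a parser function might return. Everything in it is an assumption for illustration: the function name `parse`, the two token types `word` and `blank`, and the pair layout are hypothetical, and nothing in the thread says PostgreSQL accepts parsers in this form.

```python
import re

# Hypothetical token types, loosely modelled on the aliases that ts_debug
# prints. "word" is a run of word characters; "blank" is anything else,
# so "/" is treated as a separator just like "-" or a space.
_TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<blank>\W+)")


def parse(text: str) -> list[list[str]]:
    """Split text into a nested list of [token_type, token] pairs."""
    return [[m.lastgroup, m.group()] for m in _TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for token_type, token in parse("w/d"):
        print(token_type, repr(token))
```

With this layout, 'w/d' comes back as a word, a blank, and a word, which is close to the lexing asked for at the top of the thread. A table-driven version could map each regular expression to a token type instead of hard-coding two groups.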
On 16 Feb 2011, at 23:22, "David E. Wheeler" <david@kineticode.com> wrote:

> On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:
>
>>> IMO, sooner or later we need to trash that code and replace it with
>>> something a bit more modification-friendly.
>>
>> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.
>
> What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions? Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.

I had just the same thought in mind. So far I systematically substitute _ and a few other characters with ł, which doesn't get interpreted as a blank. But more direct control would be appreciated.

Jesper
David,

as a cool Perl guy you can easily take OpenFTS (openfts.sourceforge.net), which provides a Perl interface to the tsearch datatypes, and develop a PL/Perl version. That would be interesting for many people who like the flexibility of Perl. We personally use OpenFTS in our web projects, i.e., we use tsearch as storage and prepare the tsvector externally.

The OpenFTS distribution contains tests and examples of dictionaries and a parser. The current configuration interface is ugly, but it should not be difficult to write a table-driven configuration.

What do you think?

Oleg

On Wed, 16 Feb 2011, David E. Wheeler wrote:

> On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:
>
>>> IMO, sooner or later we need to trash that code and replace it with
>>> something a bit more modification-friendly.
>>
>> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.
>
> What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions? Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.
>
> Best,
>
> David

Regards,
Oleg