Thread: tsearch Parser Hacking

tsearch Parser Hacking

From
"David E. Wheeler"
Date:
Hackers,

Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token? That is, instead of
this:

try=# select * from ts_debug('simple'::regconfig, 'w/d');
 alias │    description    │ token │ dictionaries │ dictionary │ lexemes
───────┼───────────────────┼───────┼──────────────┼────────────┼─────────
 file  │ File or path name │ w/d   │ {simple}     │ simple     │ {w/d}

Ideally it'd think that / was the same as -:

try=# select * from ts_debug('simple'::regconfig, 'w-d');
      alias      │           description           │ token │ dictionaries │ dictionary │ lexemes
─────────────────┼─────────────────────────────────┼───────┼──────────────┼────────────┼─────────
 asciihword      │ Hyphenated word, all ASCII      │ w-d   │ {simple}     │ simple     │ {w-d}
 hword_asciipart │ Hyphenated word part, all ASCII │ w     │ {simple}     │ simple     │ {w}
 blank           │ Space symbols                   │ -     │ {}           │ [null]     │ [null]
 hword_asciipart │ Hyphenated word part, all ASCII │ d     │ {simple}     │ simple     │ {d}
(4 rows)

Possible? Or would I have to write a completely new parser just to change this bit?

Thanks,

David



Re: tsearch Parser Hacking

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?

There is zero, none, nada, provision for modifying the behavior of the
default parser, other than by changing its compiled-in state transition
tables.

It doesn't help any that said tables are baroquely designed and utterly
undocumented.

IMO, sooner or later we need to trash that code and replace it with
something a bit more modification-friendly.
        regards, tom lane


Re: tsearch Parser Hacking

From
Thom Brown
Date:
On 14 February 2011 23:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David E. Wheeler" <david@kineticode.com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.

This is very true. I intended to look into adding new tokens, but gave
up when I couldn't see how those transition tables worked.

> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

+1 for annihilating the existing code at some point.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935


Re: tsearch Parser Hacking

From
David Blewett
Date:
On Mon, Feb 14, 2011 at 6:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David E. Wheeler" <david@kineticode.com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.
>
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

I added this to the TODO as something that can be tackled in the
future. I've been wishing it would be possible to add other tokens as
well (Python dotted path 'foo.bar.baz', Perl namespace path
'Foo::Bar', more flexible version number parsing, etc).

David Blewett


Re: tsearch Parser Hacking

From
"David E. Wheeler"
Date:
On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:

> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
> 
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.
> 
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

I was afraid you'd say that. Thanks.

David


Re: tsearch Parser Hacking

From
Sushant Sinha
Date:
I agree that it will be a good idea to rewrite the entire thing. However, in the meantime, I sent a proposal earlier:

http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php

And a patch later:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php

Tom asked me to look into Compound Word support but I found it not usable. Here was my response:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00419.php

I have not got any response since then,

-Sushant.


On Tue, Feb 15, 2011 at 9:33 AM, David E. Wheeler <david@kineticode.com> wrote:
> On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:
>
>> There is zero, none, nada, provision for modifying the behavior of the
>> default parser, other than by changing its compiled-in state transition
>> tables.
>>
>> It doesn't help any that said tables are baroquely designed and utterly
>> undocumented.
>>
>> IMO, sooner or later we need to trash that code and replace it with
>> something a bit more modification-friendly.
>
> I was afraid you'd say that. Thanks.
>
> David


Re: tsearch Parser Hacking

From
Oleg Bartunov
Date:
David,

it's not easy to hack tsearch parser, sorry. You can preparse your input
before to_tsquery,to_tsvector.

Oleg
On Mon, 14 Feb 2011, David E. Wheeler wrote:

> Hackers,
>
> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token? That is, instead of this:
>
> try=# select * from ts_debug('simple'::regconfig, 'w/d');
> alias │    description    │ token │ dictionaries │ dictionary │ lexemes
> ───────┼───────────────────┼───────┼──────────────┼────────────┼─────────
> file  │ File or path name │ w/d   │ {simple}     │ simple     │ {w/d}
>
> Ideally it'd think that / was the same as -:
>
> try=# select * from ts_debug('simple'::regconfig, 'w-d');
>      alias      │           description           │ token │ dictionaries │ dictionary │ lexemes
> ─────────────────┼─────────────────────────────────┼───────┼──────────────┼────────────┼─────────
> asciihword      │ Hyphenated word, all ASCII      │ w-d   │ {simple}     │ simple     │ {w-d}
> hword_asciipart │ Hyphenated word part, all ASCII │ w     │ {simple}     │ simple     │ {w}
> blank           │ Space symbols                   │ -     │ {}           │ [null]     │ [null]
> hword_asciipart │ Hyphenated word part, all ASCII │ d     │ {simple}     │ simple     │ {d}
> (4 rows)
>
> Possible? Or would I have to write a completely new parser just to change this bit?
>
> Thanks,
>
> David
>
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch Parser Hacking

From
"David E. Wheeler"
Date:
On Feb 14, 2011, at 11:37 PM, Oleg Bartunov wrote:

> it's not easy to hack tsearch parser, sorry. You can preparse your input
> before to_tsquery,to_tsvector.

Yeah, I was thinking about s{/}{-}g before passing the values in. Might be the only way to do it for now…

Thanks,

David
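
A minimal sketch of the preprocessing workaround discussed above, done on the application side before the text reaches to_tsvector or to_tsquery. The helper name `preparse`, the mapping table, and the choice of Python are illustrative assumptions, not anything from the thread; it does the same job as the Perl s{/}{-}g mentioned above.

```python
# Hypothetical helper: rewrite characters that the default parser treats
# specially before the text is handed to to_tsvector()/to_tsquery().
# The mapping (only "/" -> "-") is an assumption; extend it as needed.
SEPARATOR_MAP = str.maketrans({"/": "-"})


def preparse(text):
    """Return text with every '/' replaced by '-'."""
    return text.translate(SEPARATOR_MAP)


if __name__ == "__main__":
    print(preparse("w/d"))  # prints "w-d"
```

The rewritten string would then be passed to to_tsvector('simple', ...) as an ordinary query parameter. Note that genuine path names containing "/" are rewritten as well, so this only suits input where file tokens are not wanted at all.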



Re: tsearch Parser Hacking

From
Oleg Bartunov
Date:
On Mon, 14 Feb 2011, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.

what do you mean 'baroquely' ? Do you know 'gothic' design :?

>
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

We thought about configurable parser, but AFAIR, we didn't get any support 
for this at that time.
    Regards,        Oleg


Re: tsearch Parser Hacking

From
Oleg Bartunov
Date:
On Mon, 14 Feb 2011, David E. Wheeler wrote:

> On Feb 14, 2011, at 11:37 PM, Oleg Bartunov wrote:
>
>> it's not easy to hack tsearch parser, sorry. You can preparse your input
>> before to_tsquery,to_tsvector.
>
> Yeah, I was thinking about s{/}{-}g before passing the values in. Might be the only way to do it for now…

actually, it's not so difficult to *hack* parser to treat '/' as '-'.
I thought about overriding some default parser behaviour, but didn't come
to any useful solution. 
btw, some users already wrote their own parsers and even I have little
tutorial:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
I wonder if it's worth to add it to 
http://www.postgresql.org/docs/8.4/static/test-parser.html

Probably, good paper/presentation along with improving code docs would be 
enough for now, until someone got very bright idea about parser and 
time to implement it.
    Regards,        Oleg


Re: tsearch Parser Hacking

From
"David E. Wheeler"
Date:
On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:

>> IMO, sooner or later we need to trash that code and replace it with
>> something a bit more modification-friendly.
>
> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.

What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions?
Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.

Best,

David
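
To make the idea concrete, here is a rough sketch of what such a user-written parser might return. Nothing in this thread says PostgreSQL offers such an interface; the function name `parse_tokens`, the `[alias, token]` pair shape, the alias names "word" and "blank", and the choice of Python are all assumptions made for illustration.

```python
import re

# Hypothetical token shape: a list of [alias, token] pairs.
# Runs of ASCII letters/digits are tagged "word"; every other run
# (including "/" and "-") is tagged "blank", so "w/d" and "w-d"
# are split the same way.
TOKEN_RE = re.compile(r"(?P<word>[A-Za-z0-9]+)|(?P<blank>[^A-Za-z0-9]+)")


def parse_tokens(text):
    """Split text into [alias, token] pairs, in input order."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        # lastgroup is the name of the alternative that matched.
        tokens.append([match.lastgroup, match.group(0)])
    return tokens
```

With this rule, parse_tokens("w/d") yields a "word" for w, a "blank" for /, and a "word" for d. That is the separator treatment asked for at the top of the thread, although it does not emit a combined token like the asciihword the default parser produces for w-d.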



Re: tsearch Parser Hacking

From
Jesper Krogh
Date:
On 16 Feb 2011, at 23:22, "David E. Wheeler" <david@kineticode.com> wrote:

> On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:
>
>>> IMO, sooner or later we need to trash that code and replace it with
>>> something a bit more modification-friendly.
>>
>> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.
>
> What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions? Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.

I had just the same thought in mind. But so far I systematically substitute _ and a few other characters with ł, which
doesn't get interpreted as a blank. But more direct control would be appreciated.

Jesper

Re: tsearch Parser Hacking

From
Oleg Bartunov
Date:
David,

as a cool perl guy you can easily take OpenFTS (openfts.sourceforge.net),
which provides perl interface to tsearch datatypes, and develop a
plperl version. That would be interesting for many people, who like flexibility
of perl. We personally use openfts in our web projects, i.e., we use tsearch as
a storage and we prepare tsvector externally. Openfts distribution contains
tests, examples of dictionaries, parser. Current interface of configuration
is ugly, but it should be not difficult to write table driven configuration.

What do you think ?

Oleg
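
A sketch of what "preparing tsvector externally" can look like: the application tokenizes the text itself and emits a tsvector literal for PostgreSQL to store. This is not OpenFTS code; the names `quote_lexeme` and `to_tsvector_literal` and the tokenization rule (keep runs of ASCII letters and digits, lowercased) are assumptions. It relies on the tsvector input syntax, in which lexemes are single-quoted, embedded quotes and backslashes are doubled, and positions follow a colon.

```python
import re
from collections import defaultdict


def quote_lexeme(lexeme):
    """Quote one lexeme for tsvector input (double quotes and backslashes)."""
    escaped = lexeme.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def to_tsvector_literal(text):
    """Build a tsvector literal such as 'd':2 'w':1,3 from raw text."""
    positions = defaultdict(list)
    # Assumed tokenization: every run of ASCII letters/digits is a word,
    # so "/" and "-" both act as plain separators.
    words = re.findall(r"[A-Za-z0-9]+", text)
    for pos, word in enumerate(words, start=1):
        positions[word.lower()].append(pos)
    parts = []
    for lexeme in sorted(positions):
        pos_list = ",".join(str(p) for p in positions[lexeme])
        parts.append(quote_lexeme(lexeme) + ":" + pos_list)
    return " ".join(parts)
```

The resulting string would be sent as a query parameter and cast to tsvector on the server. Because tokenization happens outside the database, queries have to be tokenized by the same rule for matches to line up.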

On Wed, 16 Feb 2011, David E. Wheeler wrote:

> On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:
>
>>> IMO, sooner or later we need to trash that code and replace it with
>>> something a bit more modification-friendly.
>>
>> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.
>
> What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions? Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.
>
> Best,
>
> David
>
    Regards,        Oleg