Thread: Question regarding custom parser

Question regarding custom parser

From
Arthur van der Wal
Date:
Hi,

I want to change the way PostgreSQL splits text into tokens, for example:

plainto_tsquery('v-74') should split it up as 'v' & '74' instead of 'v' & '-74'.

Another example:

select to_tsvector('NL83-V-74-001-001')
'-001':5,6 '74':4 'nl83':2 'nl83-v':1 'v':3

Searching for 'v-71' does not find the database entry, because the '-' in 'v-71' is not indexed. It's hard to determine when PostgreSQL splits on '-' and when it does not.
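
One way to see how the default parser treats each hyphen is the built-in ts_debug function, which lists every token together with the token type assigned to it. The query below is only a sketch: the 'simple' configuration is an assumption chosen so that no stemming or stop-word removal obscures the result.

```sql
-- Show how the default parser splits the sample string.
-- 'simple' is an assumed configuration; substitute the one in use.
SELECT alias, token, lexemes
FROM ts_debug('simple', 'NL83-V-74-001-001');

-- The same check for the search string.
SELECT alias, token, lexemes
FROM ts_debug('simple', 'v-74');
```

The alias column shows whether a piece was classified as a signed integer, an unsigned integer, a word, or part of a hyphenated word.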


I tried writing my own parser (based on the test_parser example) that does nothing more than split at '-'. However, it seems that the logic for finding 'base' words and derivatives, which PostgreSQL does so nicely, no longer works with it.

Another way would be to disable the (signed) int tokeniser and have the unsigned int tokeniser accept leading zeros.

Can anybody point me in the right direction on how to tackle this problem?

Thanks very much in advance,

Arthur van der Wal

Re: Question regarding custom parser

From
Arjen Nienhuis
Date:
You can create an index on to_tsvector(replace(foo, '-', ' ')) and then search using ...match..(replace(foo, ...), ...)
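
A minimal sketch of this suggestion follows. The table name items, the column name code, and the 'simple' configuration are all hypothetical, chosen for illustration; the @@ operator is assumed to be the match meant above.

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE items (id serial PRIMARY KEY, code text);

-- Expression index over the hyphen-free text. The two-argument form of
-- to_tsvector is used because an index expression must be immutable.
CREATE INDEX items_code_fts_idx
    ON items USING gin (to_tsvector('simple', replace(code, '-', ' ')));

-- Apply the same replace() to the search string so that both sides
-- are tokenized the same way.
SELECT id, code
FROM items
WHERE to_tsvector('simple', replace(code, '-', ' '))
      @@ plainto_tsquery('simple', replace('v-74', '-', ' '));
```

The WHERE clause has to repeat the indexed expression exactly, otherwise the planner cannot use the index. A configuration with stemming can be swapped in for 'simple' if base-word matching is wanted.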

On Mon, Oct 4, 2010 at 11:41 AM, Arthur van der Wal <arthurvanderwal@gmail.com> wrote:
[original message quoted in full; identical to the first post above]