Configuring Text Search parser?

From jesper@krogh.cc
Subject Configuring Text Search parser?
Date
Msg-id 1a26550c0b55c0a0af0dcbd8e080bc82.squirrel@shrek.krogh.cc
Replies Re: Configuring Text Search parser?  (Sushant Sinha <sushant354@gmail.com>)
List pgsql-hackers
Hi.

I'm trying to migrate an application off an existing Full Text Search engine
and onto PostgreSQL. One of my main (remaining) headaches is the
fact that PostgreSQL treats _ as a separator character, whereas the existing
behaviour is to "not split". That means:

testdb=# select ts_debug('database_tag_number_999');
                                   ts_debug
------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
 (blank,"Space symbols",_,{},,)
 (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
 (blank,"Space symbols",_,{},,)
 (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
 (blank,"Space symbols",_,{},,)
 (uint,"Unsigned integer",999,{simple},simple,{999})
(7 rows)

The incoming data, by design, contains a set of tags that include _,
and each tag is expected to be one "lexeme".

I've tried patching my way out of this with the following patch:

$ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
*** src/backend/tsearch/wparser_def.c.orig    2010-09-20 15:58:37.033336460 +0200
--- src/backend/tsearch/wparser_def.c    2010-09-20 15:58:41.193335577 +0200
***************
*** 967,986 ****
--- 967,988 ----
  
  static const TParserStateActionItem actionTPS_InNumWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
      {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
      {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
      {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
      {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
  };
  
  static const TParserStateActionItem actionTPS_InAsciiWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
      {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
      {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
***************
*** 995,1004 ****
--- 997,1007 ----
  
  static const TParserStateActionItem actionTPS_InWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
      {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
      {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
      {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
  };
  
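
To show the idea behind the patch in isolation, here is a minimal sketch in C.
It is not the PostgreSQL parser: WordRule, default_rules, underscore_rules,
continues_word and count_words are hypothetical names made up for this
illustration, and the real state machine distinguishes many more token types
(asciiword, numword, uint, email, and so on). The sketch only shows that adding
one table row for '_' keeps the scanner inside the current word instead of
ending the token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for one row of a parser
 * action table: a character test plus an optional character argument. */
typedef int (*char_test)(int c, int arg);

typedef struct {
    char_test test;   /* predicate applied to the current character */
    int       arg;    /* character to compare against, 0 if unused  */
} WordRule;

int is_alnum_rule(int c, int arg) { (void) arg; return isalnum(c) != 0; }
int is_char_rule(int c, int arg)  { return c == arg; }

/* Unpatched behaviour in this sketch: only letters and digits
 * continue a word, so '_' ends the current token. */
const WordRule default_rules[] = {
    {is_alnum_rule, 0},
    {NULL, 0}
};

/* Patched behaviour in this sketch: one extra row makes '_'
 * continue the word as well. */
const WordRule underscore_rules[] = {
    {is_alnum_rule, 0},
    {is_char_rule, '_'},
    {NULL, 0}
};

/* Return 1 if any rule in the NULL-terminated table accepts c. */
int continues_word(const WordRule *rules, int c)
{
    for (; rules->test != NULL; rules++)
        if (rules->test(c, rules->arg))
            return 1;
    return 0;
}

/* Count tokens, where a token is a maximal run of accepted characters. */
size_t count_words(const WordRule *rules, const char *text)
{
    size_t n = 0;
    int in_word = 0;

    for (; *text != '\0'; text++) {
        int w = continues_word(rules, (unsigned char) *text);
        if (w && !in_word)
            n++;            /* a new token starts here */
        in_word = w;
    }
    return n;
}
```

Calling count_words(default_rules, "database_tag_number_999") splits the tag
into four pieces, while count_words(underscore_rules, ...) on the same input
sees a single token, which is the behaviour the tags need.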
 



This will obviously break other people's applications, so my question is:
if this should be made configurable, how should it be done?

As a side note: Xapian doesn't split on _, Lucene does.

Thanks.

-- 
Jesper


