Re: HTML tags and tsearch2

From: Oleg Bartunov
Subject: Re: HTML tags and tsearch2
Msg-id Pine.LNX.4.64.0806261602120.11363@sn.sai.msu.ru
In reply to: HTML tags and tsearch2 (Joanna Sharman <Joanna.Sharman@ed.ac.uk>)
List: pgsql-general
On Thu, 26 Jun 2008, Joanna Sharman wrote:

> Hi,
>
> I have recently started experimenting with tsearch2 and it seems that the
> default behaviour is to ignore HTML tags and treat them as word-separators.
> What I would like it to do is to ignore HTML tags within words, but instead
> of creating separate words, combine the characters separated by the tag into
> one word.
>
> For example: in the database I have words like 'K<sub>ir</sub>' that need to
> be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML
> tags are ignored and two words are stored in the vector, 'k' and 'ir'. I
> would like only one word, 'kir', to be stored in the vector, so that searches
> using the word 'kir' will match the row.

Two options: write your own parser that handles HTML tags, or preprocess
the text (strip the tags) before passing it to to_tsvector.
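
A minimal sketch of the preprocessing option, done on the client side in
Python before the text is sent to to_tsvector. The function name and the
list of inline tags are assumptions for illustration, not part of tsearch2;
the regular expressions are a rough approximation, not a full HTML parser.

```python
import re

# Assumption: these inline tags may occur inside a word; extend as needed.
INLINE_TAGS = ("sub", "sup", "b", "i", "em", "strong")

# Matches opening/closing inline tags such as <sub> or </sub>.
_inline_re = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS), re.IGNORECASE)
# Matches any remaining tag.
_any_tag_re = re.compile(r"<[^>]+>")


def strip_html_for_search(text):
    """Hypothetical helper: drop inline tags so the characters around them
    join into one word, and turn every other tag into a space."""
    text = _inline_re.sub("", text)
    text = _any_tag_re.sub(" ", text)
    # Collapse runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", text).strip()
```

With this, 'K<sub>ir</sub>' becomes 'Kir', so to_tsvector would see a single
word, while block-level tags such as <p> still separate words.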

>
> A second, related question is whether it is possible to cause tsearch2 to
> split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.

You can write your own dictionary, or use dict_regex from
http://vo.astronet.ru/arxiv/dict_regex.html
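
If a dictionary is more than you need, the same effect can be approximated
by preprocessing. This is a different technique from dict_regex or a custom
dictionary: a sketch in Python that inserts a space at every letter/digit
boundary before the text reaches to_tsvector. The function name is
hypothetical, and only ASCII letters are assumed.

```python
import re


def split_letters_and_digits(text):
    """Hypothetical helper: insert a space wherever an ASCII letter is
    directly followed by a digit, or a digit by an ASCII letter."""
    text = re.sub(r"([A-Za-z])(?=\d)", r"\1 ", text)
    text = re.sub(r"(\d)(?=[A-Za-z])", r"\1 ", text)
    return text
```

For example, 'TM8' becomes 'TM 8', which is then indexed as two tokens.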

>
> I am not sure if this functionality is possible to implement using tsearch2
> or if there might be a better way, so I would be grateful for any advice or
> pointers to further reading on how I might do this. (I am using PostgreSQL
> version 8.1.10)

Consider upgrading to 8.3, where full text search is built into the core.

>
> Many thanks in advance,
> Joanna
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
