Help with tokenization of age-ranges in full text search

From: Mason Hale
Subject: Help with tokenization of age-ranges in full text search
Date:
Msg-id: CAAFdBmQzXMOUm2b_JgmhanRiJPOG7oTBvrfRYowWhQ2XqGTHFA@mail.gmail.com
Responses: Re: Help with tokenization of age-ranges in full text search (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-general
Hello, I've got a PostgreSQL 9.3 database hosted at Heroku.

I'm using full text search to search for "group names" in part of my application, and some of my group names are the names of youth sports age groups like "Boys 9-10" or "Girls 11-12".

I would like for a search for the terms "Boys", "Boys 9-10",  "9", "10" or "9-10" to match "Boys 9-10".
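Concretely (illustrative only, using the configuration defined below, and plainto_tsquery since the raw search term contains a hyphen), the behavior I'm after is:

```sql
-- I would like each of these to return true:
SELECT to_tsvector('public.user_search', 'Boys 9-10')
       @@ plainto_tsquery('public.user_search', 'Boys');
SELECT to_tsvector('public.user_search', 'Boys 9-10')
       @@ plainto_tsquery('public.user_search', '9-10');
SELECT to_tsvector('public.user_search', 'Boys 9-10')
       @@ plainto_tsquery('public.user_search', '10');
```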

I have created a custom dictionary and search configuration as follows:

        CREATE TEXT SEARCH DICTIONARY public.simple_nostem_no_stop (
            TEMPLATE = pg_catalog.simple
        );

        CREATE TEXT SEARCH CONFIGURATION public.user_search ( COPY = pg_catalog.simple );

        ALTER TEXT SEARCH CONFIGURATION public.user_search
            ALTER MAPPING FOR email, asciiword, asciihword, hword_asciipart, hword, hword_part, word WITH simple_nostem_no_stop;

Which results in this configuration:

development=# \dF+ public.user_search
Text search configuration "public.user_search"
Parser: "pg_catalog.default"
      Token      |     Dictionaries
-----------------+-----------------------
 asciihword      | simple_nostem_no_stop
 asciiword       | simple_nostem_no_stop
 blank           | simple
 email           | simple_nostem_no_stop
 file            | simple
 float           | simple
 host            | simple
 hword           | simple_nostem_no_stop
 hword_asciipart | simple_nostem_no_stop
 hword_numpart   | simple_nostem_no_stop
 hword_part      | simple_nostem_no_stop
 int             | simple
 numhword        | simple_nostem_no_stop
 numword         | simple_nostem_no_stop
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | simple_nostem_no_stop

Testing my query, I get the following tokenization:

development=# select alias, token from ts_debug('public.user_search', 'Boys 9-10');
   alias   | token
-----------+-------
 asciiword | Boys
 blank     |
 uint      | 9
 int       | -10
(4 rows)

I was expecting / hoping to see tokens for "9-10" and "10".

With the above, a search for "9-10" or "10" would not match the term "Boys 9-10".
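To confirm, the match does fail (using plainto_tsquery, since the raw term contains a hyphen):

```sql
-- The vector contains lexemes {boys, 9, -10}, so a query for '10' finds nothing:
SELECT to_tsvector('public.user_search', 'Boys 9-10')
       @@ plainto_tsquery('public.user_search', '10') AS matches;
-- matches: false
```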

I was expecting the numhword or hword_numpart token types to match, but they appear to require a leading alpha character.

If I add a leading alpha character, it tokenizes the way I would like:

development=# select alias, token from ts_debug('public.user_search', 'Boys x9-y10');
     alias     | token
---------------+--------
 asciiword     | Boys
 blank         |
 numhword      | x9-y10
 hword_numpart | x9
 blank         | -
 hword_numpart | y10
(6 rows)

So my question is: can I get the tokenization I want using only the stock token types and configuration options?

Since I'm hosting my db on Heroku, I assume a custom parser extension is not a possibility.

I've tried removing the uint and int mappings, hoping that the other token types would pick up the slack, but that just results in the terms being dropped altogether.

For example, if I run:

     ALTER TEXT SEARCH CONFIGURATION public.user_search
        DROP MAPPING IF EXISTS FOR file, float, host, int, uint, sfloat, url_path, version, blank, url;

My configuration then looks like this:

development=# \dF+ public.user_search
Text search configuration "public.user_search"
Parser: "pg_catalog.default"
      Token      |     Dictionaries
-----------------+-----------------------
 asciihword      | simple_nostem_no_stop
 asciiword       | simple_nostem_no_stop
 email           | simple_nostem_no_stop
 hword           | simple_nostem_no_stop
 hword_asciipart | simple_nostem_no_stop
 hword_numpart   | simple_nostem_no_stop
 hword_part      | simple_nostem_no_stop
 numhword        | simple_nostem_no_stop
 numword         | simple_nostem_no_stop
 word            | simple_nostem_no_stop

But now the tokens are just dropped altogether:

development=# select alias, token, lexemes from ts_debug('public.user_search', 'Boys 9-10');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | Boys  | {boys}
 blank     |       |
 uint      | 9     |
 int       | -10   |
(4 rows)


Thanks in advance for any advice.

Would love to find a simple solution.
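One idea I've been considering (untested, just a sketch): keeping int and uint mapped, but normalizing hyphens to whitespace on both the indexing and query sides, so "9-10" tokenizes as two separate uint tokens:

```sql
-- Hypothetical workaround: translate '-' to a space before tokenizing,
-- so 'Boys 9-10' yields lexemes {boys, 9, 10} and a query for '9-10'
-- becomes the query '9' & '10'.
SELECT to_tsvector('public.user_search', translate('Boys 9-10', '-', ' '))
       @@ plainto_tsquery('public.user_search', translate('9-10', '-', ' '));
```

But this loses the combined "9-10" lexeme itself, so I'm not sure it's the right approach.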

Mason

