Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query
От | Arthur Zakirov |
---|---|
Тема | Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query |
Дата | |
Msg-id | dc4c5f92-0eeb-2d5e-32af-653b80816c31@postgrespro.ru обсуждение исходный текст |
Ответ на | BUG #15488: Unexpected behaviour of to_tsverctor and ts_query (PG Bug reporting form <noreply@postgresql.org>) |
Список | pgsql-bugs |
Hello, On 06.11.2018 15:51, PG Bug reporting form wrote: > I am trying to do prefix search of emails using full text search. Emails > sometimes contain dot symbols. > Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as > expected but > select to_tsquery('simple', 'the.thing.a:*') suddenly tokenizes my search > expression into 'the.thing':* & 'a':*. At the same time select > to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single > term. > Also select to_tsquery('simple', 'the.thing@gmail.com'); returns a single > term as expected, but select to_tsquery('simple', 'the.thing@gma:*'); splits > on @ char and returns two tokens. > This behavior is not covered in the official documentation. Yeah, it seems strange, but full text search parser thinks that each label of a host should have at least two characters. In first example it thinks that it is a file name: =# select * from ts_debug('simple', 'thing.a:*'); alias | description | token | dictionaries | dictionary | lexemes -------+-------------------+---------+--------------+------------+----------- file | File or path name | thing.a | {simple} | simple | {thing.a} blank | Space symbols | :* | {} | (null) | (null) Next, it thinks that only first two label are only part of host name: =# select * from ts_debug('simple', 'the.thing.a:*'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-----------+--------------+------------+------------- host | Host | the.thing | {simple} | simple | {the.thing} blank | Space symbols | . | {} | (null) | (null) asciiword | Word, all ASCII | a | {simple} | simple | {a} blank | Space symbols | :* | {} | (null) | (null) And next is host name: =# select * from ts_debug('simple', 'the.thing.aa:*'); alias | description | token | dictionaries | dictionary | lexemes -------+---------------+--------------+--------------+------------+---------------- host | Host | the.thing.aa | {simple} | simple | {the.thing.aa} blank | Space symbols | :* | {} | (null) | (null) It seems that the RFC doesn't have such restriction, if I'm not mistaken. See: https://tools.ietf.org/html/rfc1034#section-3.1 And same for emails: =# select * from ts_debug('simple', 'the.thing@gmail.com'); alias | description | token | dictionaries | dictionary | lexemes -------+---------------+---------------------+--------------+------------+----------------------- email | Email address | the.thing@gmail.com | {simple} | simple | {the.thing@gmail.com} =# select * from ts_debug('simple', 'the.thing@gmail.c:*'); alias | description | token | dictionaries | dictionary | lexemes -------+-------------------+-----------+--------------+------------+------------- host | Host | the.thing | {simple} | simple | {the.thing} blank | Space symbols | @ | {} | (null) | (null) file | File or path name | gmail.c | {simple} | simple | {gmail.c} blank | Space symbols | :* | {} | (null) | (null) =# select * from ts_debug('simple', 'the.thing@gma:*'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-----------+--------------+------------+------------- host | Host | the.thing | {simple} | simple | {the.thing} blank | Space symbols | @ | {} | (null) | (null) asciiword | Word, all ASCII | gma | {simple} | simple | {gma} blank | Space symbols | :* | {} | (null) | (null) It seems that it can be easily fixed, but I'm not sure that it won't break something. If you have such host names only in prefix queries (not in documents itself) I may suggest the following workaround: =# select to_tsvector('simple', 'the.thing@gmail.com') @@ 'the.thing@gma:*'::tsquery; ?column? ---------- t -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
В списке pgsql-bugs по дате отправления:
Предыдущее
От: Etsuro FujitaДата:
Сообщение: Re: BUG #15449: file_fdw using program cause exit code error whenusing LIMIT
Следующее
От: Arthur ZakirovДата:
Сообщение: Re: BUG #15491: index on function not being used for full text searchwhen querying through a view