Обсуждение: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

Поиск
Список
Период
Сортировка

Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Amit Langote
Дата:
Hello,

I have been trying to understand how pg_trgm works. As part of that, I
was looking at gin_extract_query_trgm(), which I think, extracts
trigrams from a search query string. So, I debugged for 3 cases:

1) column_name LIKE '%緊急%'

in this case, inside gin_extract_query_trgm(), after a call to
generate_wildcard_trgm(), returned trglen is 0, hence
GIN_SEARCH_MODE_ALL search mode is used.

2) column_name LIKE '%os%'

same as in case (1)

3) column_name LIKE '%ost%'

returned trglen is > 0, things proceed differently. May be, trigrams
have been generated and cane be used for index search.

I later commented out #define KEEPONLYALNUM from
contrib/pg_trgm/trgm.h (following from a related discussion on
-hackers viz.
http://www.postgresql.org/message-id/flat/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com#CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com),
but things didn't change.

So, it appears, for search strings consisting of 2 (or < 3)
characters, trigrams can not be utilized. No?

NOTE: Using the master branch. The indexed column is a text field and
data consists of mix of Japanese, alphanumeric characters.

--
Amit Langote



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Robert Haas
Дата:
On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> So, it appears, for search strings consisting of 2 (or < 3)
> characters, trigrams can not be utilized. No?

I think that's right.  "trigram" means a sequence of three characters,
and what's stored in the indexes are three-character sequences from
the original text.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Amit Langote
Дата:
On Thu, May 30, 2013 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> So, it appears, for search strings consisting of 2 (or < 3)
>> characters, trigrams can not be utilized. No?
>
> I think that's right.  "trigram" means a sequence of three characters,
> and what's stored in the indexes are three-character sequences from
> the original text.
>

Was there any improvement to pg_trgm in recent past that could make it
better for partial matching (the case in question I suppose) or is
partial-matching a different thing altogether?


--
Amit Langote



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Robert Haas
Дата:
On Thu, May 30, 2013 at 11:00 AM, Amit Langote <amitlangote09@gmail.com> wrote:
> Was there any improvement to pg_trgm in recent past that could make it
> better for partial matching (the case in question I suppose) or is
> partial-matching a different thing altogether?

Sorry, I am not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Sawada Masahiko
Дата:
2013/5/31 Amit Langote <amitlangote09@gmail.com>
>
> On Thu, May 30, 2013 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> >> So, it appears, for search strings consisting of 2 (or < 3)
> >> characters, trigrams can not be utilized. No?
> >
> > I think that's right.  "trigram" means a sequence of three characters,
> > and what's stored in the indexes are three-character sequences from
> > the original text.
> >
>
> Was there any improvement to pg_trgm in recent past that could make it
> better for partial matching (the case in question I suppose) or is
> partial-matching a different thing altogether?
>

Hi Amit,

following emails are discussed about partial match of pg_trgm.  I hope
will this help.
<http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com>
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Regards,

--------
Masahiko Sawada



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Alexander Korotkov
Дата:
On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
following emails are discussed about partial match of pg_trgm.  I hope
will this help.
<http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com>
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Please, read the further discussion on that thread. We can't do partial matching because of CRC independently of KEEPONLYALNUM.

------
With best regards,
Alexander Korotkov.

Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Amit Langote
Дата:
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> following emails are discussed about partial match of pg_trgm.  I hope
>> will this help.
>>
>> <http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com>
>> as you may know, if search string contains multibyte characters
>> trigram key is converted to CRC of 4 byte and it is used as key.
>> (but only use upper 3 byte from CRC)
>> so we can do partial matching if KEEPONLYALNUM is enabled.
>
>
> Please, read the further discussion on that thread. We can't do partial
> matching because of CRC independently of KEEPONLYALNUM.
>

Thank you Sawada-san and Alexander.

I think the idea of using trigram "text" itself rather than its CRC
(due to its problems in partial matching) as GIN key (?) has not been
implemented into pg_trgm yet, right? And even though, such a facility
would  be added, we would still need to handle multibyte characters
case differently (even for partial matching), is that right?



--
Amit Langote



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Amit Langote
Дата:
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> following emails are discussed about partial match of pg_trgm.  I hope
>> will this help.
>>
>> <http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com>
>> as you may know, if search string contains multibyte characters
>> trigram key is converted to CRC of 4 byte and it is used as key.
>> (but only use upper 3 byte from CRC)
>> so we can do partial matching if KEEPONLYALNUM is enabled.
>
>
> Please, read the further discussion on that thread. We can't do partial
> matching because of CRC independently of KEEPONLYALNUM.
>

Also, a few more questions:

1) When building a trgm index, are there any differences for
multi-byte character strings. For example, would a 2 character
Japanese string (multi-byte offcourse) produce exactly 3 trigrams to
be stored in the index which would later be used while look-up?

2) And if that is so, is there problem in gin_extract_query_trgm(),
that is while generating trigrams from a query search term that causes
trigrams (stored in the index if answer to (1) is yes) NOT to be used
in such a partial matching case?

--
Amit Langote



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Sawada Masahiko
Дата:
On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>> On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com>
>> wrote:
>>>
>>> following emails are discussed about partial match of pg_trgm.  I hope
>>> will this help.
>>>
>>> <http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com>
>>> as you may know, if search string contains multibyte characters
>>> trigram key is converted to CRC of 4 byte and it is used as key.
>>> (but only use upper 3 byte from CRC)
>>> so we can do partial matching if KEEPONLYALNUM is enabled.
>>
>>
>> Please, read the further discussion on that thread. We can't do partial
>> matching because of CRC independently of KEEPONLYALNUM.
>>
>
> Also, a few more questions:
>
> 1) When building a trgm index, are there any differences for
> multi-byte character strings. For example, would a 2 character
> Japanese string (multi-byte offcourse) produce exactly 3 trigrams to
> be stored in the index which would later be used while look-up?

in above case a 2 character multibyte string produce 3 trigrams of
CRC. (because these was larger than 3 byte)
and these are used while look-up.

> 2) And if that is so, is there problem in gin_extract_query_trgm(),
> that is while generating trigrams from a query search term that causes
> trigrams (stored in the index if answer to (1) is yes) NOT to be used
> in such a partial matching case?

it means that we can't use trigrams in case of partial matching
because trigrams (stored in index) are converted to different
value(CRC).
right?


Regards,
--
-------
Sawada Masahiko



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Amit Langote
Дата:
On Sat, Jun 1, 2013 at 1:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote:
>> 2) And if that is so, is there problem in gin_extract_query_trgm(),
>> that is while generating trigrams from a query search term that causes
>> trigrams (stored in the index if answer to (1) is yes) NOT to be used
>> in such a partial matching case?
>
> it means that we can't use trigrams in case of partial matching
> because trigrams (stored in index) are converted to different
> value(CRC).
> right?
>

When I debugged a partial match case such as  " column like '%st%'
", it appears that get_wildcard_trigrams return no trigrams for
wildcard part 'st' since charlen < 3. Hence, GIN_SEARCH_MODE_ALL mode
is used and results in  full index scan instead of trigrams being
used. This happens for multibyte case too. The problem is that for
wildcard part consisting of less than 3 characters,
get_wildcard_trigrams return nothing.


--
Amit Langote



Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

От
Alexander Korotkov
Дата:
On Fri, May 31, 2013 at 9:41 PM, Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Jun 1, 2013 at 1:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote:
>> 2) And if that is so, is there problem in gin_extract_query_trgm(),
>> that is while generating trigrams from a query search term that causes
>> trigrams (stored in the index if answer to (1) is yes) NOT to be used
>> in such a partial matching case?
>
> it means that we can't use trigrams in case of partial matching
> because trigrams (stored in index) are converted to different
> value(CRC).
> right?
>

When I debugged a partial match case such as  " column like '%st%'
", it appears that get_wildcard_trigrams return no trigrams for
wildcard part 'st' since charlen < 3. Hence, GIN_SEARCH_MODE_ALL mode
is used and results in  full index scan instead of trigrams being
used. This happens for multibyte case too. The problem is that for
wildcard part consisting of less than 3 characters,
get_wildcard_trigrams return nothing.

Partial match for LIKE queries is not implemented at all. With current index structure it could be possible to implement it with guarantee that all indexed strings don't contain multibyte characters (i.e. single-byte encoding). 
Also, at it was mentioned, it's possible to implement operator class with text storage type which would support partial match in all the cases. However, I doubt it's reasonable to implement it based on pg_trgm, because of many hard-wired assumptions that trigram is fixed length and compatibility.

------
With best regards,
Alexander Korotkov.