Обсуждение: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

Поиск
Список
Период
Сортировка

Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Denis Papathanasiou
Дата:
As a follow-up to my question from this past Saturday
(http://archives.postgresql.org/pgsql-general/2010-08/msg00624.php), I
experimented with adding two types of indices to the same text column:

CREATE INDEX item_eng_searchable_text_idx ON item USING
gin(to_tsvector('english', searchable_text));

and

CREATE INDEX item_searchable_text_idx ON item (searchable_text);

Running my queries under explain, I noticed that queries of the form:

select pk from item where searchable_text @@ plainto_tsquery('search
phrase');

Actually run *slower* with the item_eng_searchable_text_idx index applied.

But when I switch the query to use to_tsquery() instead, e.g. something
like this:

select pk from item where searchable_text @@ to_tsquery('phrase');

The performance is better.

Is this because the gin/to_tsvector() index works differently for
to_tsquery() compared to plainto_ts_query() ?

If so, how can I create an effective index for queries that will use
plainto_tsquery() ?

Note that I need the plainto_tsquery() form b/c my search phrases will
correspond to exact blocks of text, and therefore they will contain
spaces and punctuation, etc.


Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Tom Lane
Дата:
Denis Papathanasiou <denis.papathanasiou@gmail.com> writes:
> Is this because the gin/to_tsvector() index works differently for
> to_tsquery() compared to plainto_ts_query() ?

Not per se, but maybe looking at the actual outputs of the two function
calls would be enlightening.  I suspect you'll find that the search
conditions you are getting are not equivalent.

> Note that I need the plainto_tsquery() form b/c my search phrases will
> correspond to exact blocks of text, and therefore they will contain
> spaces and punctuation, etc.

In that case you haven't understood how text search works at all.
It simply doesn't do that.  You could possibly follow up a text search
for component words with a LIKE or similar comparison to verify that
the matches actually contain the desired string.

            regards, tom lane

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Denis Papathanasiou
Дата:
> Not per se, but maybe looking at the actual outputs of the two function
> calls would be enlightening.  I suspect you'll find that the search
> conditions you are getting are not equivalent.

Strictly speaking they're not, b/c the plainto_tsquery() is chaining
several tokens together.

However, at the heart of the question is this: if I define the index on
that column like this:

CREATE INDEX item_eng_searchable_text_idx ON item USING
gin(to_tsvector('english', searchable_text));

since, unlike the example in the
http://www.postgresql.org/docs/8.4/interactive/textsearch-tables.html
doc, I cannot define the searchable
text column in the table as being being of type ts_vector(), because the
text is not exclusively in English;

then, will the index be effective?

Based on some empirical experiments, it seems not.

If that's indeed that case, then how should I create the index?

> In that case you haven't understood how text search works at all.
> It simply doesn't do that.  You could possibly follow up a text search
> for component words with a LIKE or similar comparison to verify that
> the matches actually contain the desired string.

I cannot use LIKE on each token of the phrase because the text in the
column is unordered, and I would have to do an ILIKE '%'+token+'%' on
each to be logically correct.

IIRC, wildcards of the form token+'%' can use an index efficiently, but
wildcards on both ends such as '%'+token+'%' do not.

I did think about splitting the phrase tokens and doing a tsquery() join
on each token, but it seemed that's why plainto_tsquery() was created,
unless I misunderstood the docs on that specific point.

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Oleg Bartunov
Дата:
Denis,

we need examples of your explain analyze. I don't want to waste my time
reading theoretical reasoning :)

btw, Be sure you use the same search configuration as in create index or
index will not be used at all.

Oleg

On Wed, 25 Aug 2010, Denis Papathanasiou wrote:

> As a follow-up to my question from this past Saturday
> (http://archives.postgresql.org/pgsql-general/2010-08/msg00624.php), I
> experimented with adding two types of indices to the same text column:
>
> CREATE INDEX item_eng_searchable_text_idx ON item USING
> gin(to_tsvector('english', searchable_text));
>
> and
>
> CREATE INDEX item_searchable_text_idx ON item (searchable_text);
>
> Running my queries under explain, I noticed that queries of the form:
>
> select pk from item where searchable_text @@ plainto_tsquery('search
> phrase');
>
> Actually run *slower* with the item_eng_searchable_text_idx index applied.
>
> But when I switch the query to use to_tsquery() instead, e.g. something
> like this:
>
> select pk from item where searchable_text @@ to_tsquery('phrase');
>
> The performance is better.
>
> Is this because the gin/to_tsvector() index works differently for
> to_tsquery() compared to plainto_ts_query() ?
>
> If so, how can I create an effective index for queries that will use
> plainto_tsquery() ?
>
> Note that I need the plainto_tsquery() form b/c my search phrases will
> correspond to exact blocks of text, and therefore they will contain
> spaces and punctuation, etc.
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Denis Papathanasiou
Дата:
> we need examples of your explain analyze. I don't want to waste my time
> reading theoretical reasoning :)

Here's an actual 'explain analyze' example:

alerts=> CREATE INDEX node_val_tsv_idx ON node USING
gin(to_tsvector('english', val));
CREATE INDEX
alerts=> explain analyze select item_pk from node where
tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited
Partnership');
                                                          QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16)
(actual time=2.952..131.868 rows=953 loops=1)
    Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
    Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
    ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
width=0) (actual time=1.628..1.628 rows=3631 loops=1)
          Index Cond: (tag = 'primaryIssuer.entityType'::text)
  Total runtime: 133.345 ms
(6 rows)

alerts=> DROP INDEX node_val_tsv_idx;
DROP INDEX
alerts=> explain analyze select item_pk from node where
tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited
Partnership');
                                                          QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16)
(actual time=2.938..93.239 rows=953 loops=1)
    Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
    Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
    ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
width=0) (actual time=1.614..1.614 rows=3631 loops=1)
          Index Cond: (tag = 'primaryIssuer.entityType'::text)
  Total runtime: 94.696 ms
(6 rows)

The table this is run against is defined like this:

CREATE TABLE node (
     pk uuid primary key,
     item_pk uuid not null references item (pk),
     tag text not null,
     val text
);

In addition to the gin/ts_vector index on node.val shown above, there
are two other explicit indices on this table:

CREATE INDEX node_tag_idx ON node (tag);
CREATE INDEX node_val_idx ON node (val);

The reason for the node_val_idx index is that there will be cases where
the query phrase is known exactly, so the where clause in the select
statement will be just "val = 'Limited Partnership'".

> btw, Be sure you use the same search configuration as in create index or
> index will not be used at all.

Is this indeed the problem here?

The explain output references "val @@ plainto_tsquery()" but as a
filter, whereas the tag portion of the statement mentions node_tag_idx
as the index it used.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Oleg Bartunov
Дата:
Try this select

alerts=> explain analyze select item_pk from node where
  tag='primaryIssuer.entityType' and val @@ plainto_tsquery('english','Limited
  Partnership');

Read 12.2.2. Creating Indexes at http://www.postgresql.org/docs/8.4/static/textsearch-tables.html

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));

Notice that the 2-argument version of to_tsvector is used. Only text search functions that specify a configuration name
canbe used in expression indexes (Section 11.7). This is because the index contents must be unaffected by
default_text_search_config.If they were affected, the index contents might be inconsistent because different entries
couldcontain tsvectors that were created with different text search configurations, and there would be no way to guess
whichwas which. It would be impossible to dump and restore such an index correctly. 



Oleg

On Wed, 25 Aug 2010, Denis Papathanasiou wrote:

>
>> we need examples of your explain analyze. I don't want to waste my time
>> reading theoretical reasoning :)
>
> Here's an actual 'explain analyze' example:
>
> alerts=> CREATE INDEX node_val_tsv_idx ON node USING
> gin(to_tsvector('english', val));
> CREATE INDEX
> alerts=> explain analyze select item_pk from node where
> tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited
> Partnership');
>                                                         QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16) (actual
> time=2.952..131.868 rows=953 loops=1)
>   Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
>   Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
>   ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
> width=0) (actual time=1.628..1.628 rows=3631 loops=1)
>         Index Cond: (tag = 'primaryIssuer.entityType'::text)
> Total runtime: 133.345 ms
> (6 rows)
>
> alerts=> DROP INDEX node_val_tsv_idx;
> DROP INDEX
> alerts=> explain analyze select item_pk from node where
> tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited
> Partnership');
>                                                         QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16) (actual
> time=2.938..93.239 rows=953 loops=1)
>   Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
>   Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
>   ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
> width=0) (actual time=1.614..1.614 rows=3631 loops=1)
>         Index Cond: (tag = 'primaryIssuer.entityType'::text)
> Total runtime: 94.696 ms
> (6 rows)
>
> The table this is run against is defined like this:
>
> CREATE TABLE node (
>    pk uuid primary key,
>    item_pk uuid not null references item (pk),
>    tag text not null,
>    val text
> );
>
> In addition to the gin/ts_vector index on node.val shown above, there are two
> other explicit indices on this table:
>
> CREATE INDEX node_tag_idx ON node (tag);
> CREATE INDEX node_val_idx ON node (val);
>
> The reason for the node_val_idx index is that there will be cases where the
> query phrase is known exactly, so the where clause in the select statement
> will be just "val = 'Limited Partnership'".
>
>> btw, Be sure you use the same search configuration as in create index or
>> index will not be used at all.
>
> Is this indeed the problem here?
>
> The explain output references "val @@ plainto_tsquery()" but as a filter,
> whereas the tag portion of the statement mentions node_tag_idx as the index
> it used.
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Tom Lane
Дата:
Denis Papathanasiou <denis.papathanasiou@gmail.com> writes:
> alerts=> CREATE INDEX node_val_tsv_idx ON node USING
> gin(to_tsvector('english', val));
> CREATE INDEX
> alerts=> explain analyze select item_pk from node where
> tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited
> Partnership');
> [ doesn't use the index ]

That index doesn't match the query, so it can't be used.  Try

select item_pk from node where
tag='primaryIssuer.entityType' and
to_tsvector('english', val) @@ plainto_tsquery('Limited Partnership');

Note that seeing that the scan on the other index is pretty cheap,
it's not obvious that indexing the @@ clause is better anyway.
But it *can't* index this query using this index.

            regards, tom lane

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Denis Papathanasiou
Дата:
> we need examples of your explain analyze. I don't want to waste my time
> reading theoretical reasoning :)

Here's an actual 'explain analyze' example:

alerts=> CREATE INDEX node_val_tsv_idx ON node USING
gin(to_tsvector('english', val));
CREATE INDEX
alerts=> explain analyze select item_pk from node where
tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited
Partnership');
                                                          QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16)
(actual time=2.952..131.868 rows=953 loops=1)
    Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
    Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
    ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
width=0) (actual time=1.628..1.628 rows=3631 loops=1)
          Index Cond: (tag = 'primaryIssuer.entityType'::text)
  Total runtime: 133.345 ms
(6 rows)

alerts=> DROP INDEX node_val_tsv_idx;
DROP INDEX
alerts=> explain analyze select item_pk from node where
tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited
Partnership');
                                                          QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16)
(actual time=2.938..93.239 rows=953 loops=1)
    Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
    Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
    ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
width=0) (actual time=1.614..1.614 rows=3631 loops=1)
          Index Cond: (tag = 'primaryIssuer.entityType'::text)
  Total runtime: 94.696 ms
(6 rows)

The table this is run against is defined like this:

CREATE TABLE node (
     pk uuid primary key,
     item_pk uuid not null references item (pk),
     tag text not null,
     val text
);

In addition to the gin/ts_vector index on node.val shown above, there
are two other explicit indices on this table:

CREATE INDEX node_tag_idx ON node (tag);
CREATE INDEX node_val_idx ON node (val);

The reason for the node_val_idx index is that there will be cases where
the query phrase is known exactly, so the where clause in the select
statement will be just "val = 'Limited Partnership'".

> btw, Be sure you use the same search configuration as in create index or
> index will not be used at all.

Is this indeed the problem here?

The explain output references "val @@ plainto_tsquery()" but as a
filter, whereas the tag portion of the statement mentions node_tag_idx
as the index it used.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Denis Papathanasiou
Дата:
> That index doesn't match the query, so it can't be used.  Try
>
> select item_pk from node where
> tag='primaryIssuer.entityType' and
> to_tsvector('english', val) @@ plainto_tsquery('Limited Partnership');

Tom and Oleg: thank you for clarifying this.

I see where I made the mistake in applying the example from the
documentation.

> Note that seeing that the scan on the other index is pretty cheap,
> it's not obvious that indexing the @@ clause is better anyway.

So is it the case that, as in the documented example, I need to add a
column of type ts_vector to the table for the index to be most effective?

Re: Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

От
Tom Lane
Дата:
Denis Papathanasiou <denis.papathanasiou@gmail.com> writes:
> So is it the case that, as in the documented example, I need to add a
> column of type ts_vector to the table for the index to be most effective?

You can do it either way, depending on what you want the queries to look
like.  One advantage of materializing the tsvector entries in the table
is that the to_tsvector() work doesn't have to be done over again if
the planner decides not to use that index for some reason (for instance,
it thinks the other index is more effective for the particular query).

            regards, tom lane