Обсуждение: GiST index not used for ORDER BY?

Поиск
Список
Период
Сортировка

GiST index not used for ORDER BY?

От
Max
Дата:
Hi,

I'm setting up a simple search engine using Tsearch2.
The basic idea is: a user enters a search query and a maximum of 1000
results is returned, sorted by date, newest first.

At the moment the table holding the searchable data has 1.1 million entries.
It works great when the search only produces a few hundred results. However
when people search on a common word with 10.000+ results, there's a
performance problem.

The database design looks like this:

CREATE TABLE posts_index
(
....
         startdate INT NOT NULL,
         idxFTI   tsvector,
....
);

Where startdate is a unix timestamp, and idxFTI is a tsvector with the data
to be searched.

Since only 1000 results need to be returned sorted by date (newest first),
I hoped to solve the problem by installing the btree_gist extension and
adding the following index:

CREATE INDEX idxFTI_idx2 ON posts_index USING gist(idxFTI,(-startdate));

However the -startdate portion of the index doesn't seem to be used:

------
EXPLAIN SELECT startdate, headline(subject,q) AS subject FROM posts_index
i, to_tsquery('default', '_SEARCH_TERM_') AS q WHERE idxfti @@ q ORDER BY
(-i.startdate) LIMIT 1000;

QUERY PLAN

Limit  (cost=5152014.10..5152016.60 rows=1000 width=126)
   ->  Sort  (cost=5152014.10..5155079.61 rows=1226201 width=126)
         Sort Key: (- i.startdate)
         ->  Nested Loop  (cost=0.00..4912754.84 rows=1226201 width=126)
               ->  Function Scan on q  (cost=0.00..12.50 rows=1000 width=32)
               ->  Index Scan using idxfti_idx2 on posts_index
i  (cost=0.00..4891.27 rows=1227 width=253)
                     Index Cond: (i.idxfti @@ "outer".q)
----

Any suggestions?

Regards,

Max


Re: GiST index not used for ORDER BY?

От
Oleg Bartunov
Дата:
On Thu, 27 Jan 2005, Max wrote:

> Hi,
>
> I'm setting up a simple search engine using Tsearch2.
> The basic idea is: a user enters a search query and a maximum of 1000 results
> is returned, sorted by date, newest first.
>
> At the moment the table holding the searchable data has 1.1 million entries.
> It works great when the search only produces a few hundred results. However
> when people search on a common word with 10.000+ results, there's a
> performance problem.
>
> The database design looks like this:
>
> CREATE TABLE posts_index
> (
> ....
>        startdate INT NOT NULL,
>        idxFTI   tsvector,
> ....
> );
>
> Where startdate is a unix timestamp, and idxFTI is a tsvector with the data
> to be searched.
>
> Since only 1000 results need to be returned sorted by date (newest first), I
> hoped to solve the problem by installing the btree_gist extension and adding
> the following index:
>
> CREATE INDEX idxFTI_idx2 ON posts_index USING gist(idxFTI,(-startdate));
>
> However the -startdate portion of the index doesn't seem to be used:
>
> ------
> EXPLAIN SELECT startdate, headline(subject,q) AS subject FROM posts_index i,
> to_tsquery('default', '_SEARCH_TERM_') AS q WHERE idxfti @@ q ORDER BY
> (-i.startdate) LIMIT 1000;

I assume you already vacuum your db. Hmm, seems you need to rewrite your query.


EXPLAIN SELECT startdate, headline(subject,q) AS subject FROM
( SELECT startdate, subject from posts_index i,
  to_tsquery('default', '_SEARCH_TERM_') AS q WHERE idxfti @@ q ORDER BY
  (-i.startdate) LIMIT 1000) as foo;

I bet your query will be much faster. In your query all founded tuples should
be read from disk to calculate headline(), while in my query maximum 1000
tuples will be read. So, performance gain could be noticeable, for example,
if search returns 10,000 tuples, my query will be 10x faster than yours :)
I think this is what you observed.



>
> QUERY PLAN
>
> Limit  (cost=5152014.10..5152016.60 rows=1000 width=126)
>  ->  Sort  (cost=5152014.10..5155079.61 rows=1226201 width=126)
>        Sort Key: (- i.startdate)
>        ->  Nested Loop  (cost=0.00..4912754.84 rows=1226201 width=126)
>              ->  Function Scan on q  (cost=0.00..12.50 rows=1000 width=32)
>              ->  Index Scan using idxfti_idx2 on posts_index i
> (cost=0.00..4891.27 rows=1227 width=253)
>                    Index Cond: (i.idxfti @@ "outer".q)
> ----
>
> Any suggestions?
>
> Regards,
>
> Max
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>   (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: GiST index not used for ORDER BY?

От
Max
Дата:
Hi,

At 09:54 PM 1/27/2005, you wrote:
>On Thu, 27 Jan 2005, Max wrote:
>>I'm setting up a simple search engine using Tsearch2.
>>The basic idea is: a user enters a search query and a maximum of 1000
>>results is returned, sorted by date, newest first.
>>
>>At the moment the table holding the searchable data has 1.1 million entries.
>>It works great when the search only produces a few hundred results.
>>However when people search on a common word with 10.000+ results, there's
>>a performance problem.
 >>
>>CREATE TABLE posts_index
>>(
>>....
>>        startdate INT NOT NULL,
>>        idxFTI   tsvector,
>>....
>>);
>>CREATE INDEX idxFTI_idx2 ON posts_index USING gist(idxFTI,(-startdate));


>I assume you already vacuum your db.

Yes, I did vacuum analyze it. And he does use the first part of the index
(idxFTI), just not the second part (-startdate).

>  Hmm, seems you need to rewrite your query.
>EXPLAIN SELECT startdate, headline(subject,q) AS subject FROM ( SELECT
>startdate, subject from posts_index i,
>  to_tsquery('default', '_SEARCH_TERM_') AS q WHERE idxfti @@ q ORDER BY
>  (-i.startdate) LIMIT 1000) as foo;
>
>I bet your query will be much faster. In your query all founded tuples should
>be read from disk to calculate headline(), while in my query maximum 1000
>tuples will be read. So, performance gain could be noticeable, for example,
>if search returns 10,000 tuples, my query will be 10x faster than yours :)
>I think this is what you observed.

Thanks for your help, however headline() doesn't seem the problem.
Here's an EXPLAIN ANALYZE using your query and a common word as SEARCH_TERM:

------
QUERY PLAN

Subquery Scan foo  (cost=5368809.49..5368824.49 rows=1000 width=181)
(actual time=363455.642..363510.277 rows=1000 loops=1)
   ->  Limit  (cost=5368809.49..5368811.99 rows=1000 width=126) (actual
time=363454.387..363455.983 rows=1000 loops=1)
         ->  Sort  (cost=5368809.49..5372006.34 rows=1278741 width=126)
(actual time=363454.380..363455.471 rows=1000 loops=1)
               Sort Key: (- i.startdate)
               ->  Nested Loop  (cost=0.00..5118844.92 rows=1278741
width=126) (actual time=0.140..354003.773 rows=343974 loops=1)
                     ->  Function Scan on q  (cost=0.00..12.50 rows=1000
width=32) (actual time=0.015..0.018 rows=1 loops=1)
                     ->  Index Scan using idxfti_idx2 on posts_index
i  (cost=0.00..5099.65 rows=1279 width=253) (actual time=0.111..353068.267
rows=343974 loops=1)
                           Index Cond: (i.idxfti @@ "outer".q)
Total runtime: 363571.960 ms
----

It still seems to rather sort 343.974 rows and take over 5 minutes to
complete, than use the index for the date. While searching on less common
words takes less than a second.Omitting headline() completely doesn't
changes anything either.
So it must be something else.


Regards,
Max


Re: GiST index not used for ORDER BY?

От
Oleg Bartunov
Дата:
On Thu, 27 Jan 2005, Max wrote:

> Hi,
>
> At 09:54 PM 1/27/2005, you wrote:
>> On Thu, 27 Jan 2005, Max wrote:
>>> I'm setting up a simple search engine using Tsearch2.
>>> The basic idea is: a user enters a search query and a maximum of 1000
>>> results is returned, sorted by date, newest first.
>>>
>>> At the moment the table holding the searchable data has 1.1 million
>>> entries.
>>> It works great when the search only produces a few hundred results.
>>> However when people search on a common word with 10.000+ results, there's
>>> a performance problem.
>>>
>>> CREATE TABLE posts_index
>>> (
>>> ....
>>>        startdate INT NOT NULL,
>>>        idxFTI   tsvector,
>>> ....
>>> );
>>> CREATE INDEX idxFTI_idx2 ON posts_index USING gist(idxFTI,(-startdate));
>
>
>> I assume you already vacuum your db.
>
> Yes, I did vacuum analyze it. And he does use the first part of the index
> (idxFTI), just not the second part (-startdate).
>
>>  Hmm, seems you need to rewrite your query.
>> EXPLAIN SELECT startdate, headline(subject,q) AS subject FROM ( SELECT
>> startdate, subject from posts_index i,
>>  to_tsquery('default', '_SEARCH_TERM_') AS q WHERE idxfti @@ q ORDER BY
>>  (-i.startdate) LIMIT 1000) as foo;
>>
>> I bet your query will be much faster. In your query all founded tuples
>> should
>> be read from disk to calculate headline(), while in my query maximum 1000
>> tuples will be read. So, performance gain could be noticeable, for example,
>> if search returns 10,000 tuples, my query will be 10x faster than yours :)
>> I think this is what you observed.
>
> Thanks for your help, however headline() doesn't seem the problem.
> Here's an EXPLAIN ANALYZE using your query and a common word as SEARCH_TERM:
>
> ------
> QUERY PLAN
>
> Subquery Scan foo  (cost=5368809.49..5368824.49 rows=1000 width=181) (actual
> time=363455.642..363510.277 rows=1000 loops=1)
>  ->  Limit  (cost=5368809.49..5368811.99 rows=1000 width=126) (actual
> time=363454.387..363455.983 rows=1000 loops=1)
>        ->  Sort  (cost=5368809.49..5372006.34 rows=1278741 width=126)
> (actual time=363454.380..363455.471 rows=1000 loops=1)
>              Sort Key: (- i.startdate)
>              ->  Nested Loop  (cost=0.00..5118844.92 rows=1278741 width=126)
> (actual time=0.140..354003.773 rows=343974 loops=1)
>                    ->  Function Scan on q  (cost=0.00..12.50 rows=1000
> width=32) (actual time=0.015..0.018 rows=1 loops=1)
>                    ->  Index Scan using idxfti_idx2 on posts_index i
> (cost=0.00..5099.65 rows=1279 width=253) (actual time=0.111..353068.267
> rows=343974 loops=1)
>                          Index Cond: (i.idxfti @@ "outer".q)
> Total runtime: 363571.960 ms
> ----
>
> It still seems to rather sort 343.974 rows and take over 5 minutes to
> complete, than use the index for the date. While searching on less common
> words takes less than a second.Omitting headline() completely doesn't changes
> anything either.
> So it must be something else.

strange. Why did you omit select ?
So, search returns 343.974 rows. Am I right ?
try select * from YOUR_TABLE limit 343974;
then you'll see how much time requires just for reading results.



>
>
> Regards,
> Max

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: GiST index not used for ORDER BY?

От
Tom Lane
Дата:
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Thu, 27 Jan 2005, Max wrote:
>> The basic idea is: a user enters a search query and a maximum of 1000 results
>> is returned, sorted by date, newest first.

> I assume you already vacuum your db. Hmm, seems you need to rewrite your query.

That's not going to help.  GiST doesn't support ordered retrieval does it?
The planner certainly doesn't think so, because pg_am says not.

            regards, tom lane