Обсуждение: Planner should use index on a LIKE 'foo%' query

Поиск
Список
Период
Сортировка

Planner should use index on a LIKE 'foo%' query

От
Moritz Onken
Дата:
Hi,

I have a query

select count(*)
  from result
  where exists
    (select * from item where item.url LIKE result.url || '%' limit 1);

which basically returns the number of items which exist in table
result and match a URL in table item by its prefix.
I read all about idexes (http://www.postgresql.org/docs/8.3/static/indexes-types.html
) and especially this part:
"The optimizer can also use a B-tree index for queries involving the
pattern matching operators LIKE and ~ if the pattern is a constant and
is anchored to the beginning of the string — for example, col LIKE 'foo
%' or col ~ '^foo', but not col LIKE '%bar'."

Since my server does not use the C locale I created the index with

CREATE INDEX test_index
   ON item
   USING btree
   (url varchar_pattern_ops);

which works fine for queries like

  SELECT distinct url from item where url like 'http://www.micro%'
limit 10;

explain analyze shows:
"Limit  (cost=9.53..9.54 rows=1 width=34) (actual time=80.809..80.856
rows=10 loops=1)"
"  ->  Unique  (cost=9.53..9.54 rows=1 width=34) (actual
time=80.806..80.835 rows=10 loops=1)"
"        ->  Sort  (cost=9.53..9.53 rows=1 width=34) (actual
time=80.802..80.812 rows=11 loops=1)"
"              Sort Key: url"
"              Sort Method:  quicksort  Memory: 306kB"
"              ->  Index Scan using test_index on item
(cost=0.00..9.52 rows=1 width=34) (actual time=0.030..6.165 rows=2254
loops=1)"
"                    Index Cond: (((url)::text ~>=~ 'http://
www.micro'::text) AND ((url)::text ~<~ 'http://www.micrp'::text))"
"                    Filter: ((url)::text ~~ 'http://www.micro%'::text)"
"Total runtime: 80.908 ms"

which is great but if I run the query with the subselect it uses a
sequence scan:

select *
  from result
  where exists
    (select * from item where item.url LIKE result.url || '%' limit 1)
limit 10;

"Limit  (cost=0.00..96.58 rows=10 width=36) (actual
time=12.660..35295.928 rows=10 loops=1)"
"  ->  Seq Scan on result  (cost=0.00..93886121.77 rows=9721314
width=36) (actual time=12.657..35295.906 rows=10 loops=1)"
"        Filter: (subplan)"
"        SubPlan"
"          ->  Limit  (cost=0.00..4.81 rows=1 width=42) (actual
time=2715.061..2715.061 rows=1 loops=13)"
"                ->  Seq Scan on item  (cost=0.00..109589.49
rows=22781 width=42) (actual time=2715.055..2715.055 rows=1 loops=13)"
"                      Filter: ((url)::text ~~ (($0)::text ||
'%'::text))"
"Total runtime: 35295.994 ms"


The only explaination is that I don't use a constant when comparing
the values. But actually it is a constant...


any help?

using postgres 8.3.3 on ubuntu.



Cheers,

moritz


Re: Planner should use index on a LIKE 'foo%' query

От
"Steinar H. Gunderson"
Дата:
On Sat, Jun 28, 2008 at 06:24:42PM +0200, Moritz Onken wrote:
>  SELECT distinct url from item where url like 'http://www.micro%' limit
> 10;

Here, the planner knows the pattern beforehand, and can see that it's a
simple prefix.
> select *
>  from result
>  where exists
>    (select * from item where item.url LIKE result.url || '%' limit 1)
> limit 10;

Here it cannot (what if result.url was '%foo%'?).

Try using something like (item.url >= result.url && item.url <= result.url ||
'z'), substituting an appropriately high character for 'z'.

> The only explaination is that I don't use a constant when comparing the
> values. But actually it is a constant...

It's not a constant at planning time.

Also note that you'd usually want to use IN instead of a WHERE EXISTS.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Planner should use index on a LIKE 'foo%' query

От
Moritz Onken
Дата:

Anfang der weitergeleiteten E-Mail:

> Von: Moritz Onken <onken@houseofdesign.de>
> Datum: 30. Juni 2008 09:16:06 MESZ
> An: Steinar H. Gunderson <sgunderson@bigfoot.com>
> Betreff: Re: [PERFORM] Planner should use index on a LIKE 'foo%' query
>
>
> Am 28.06.2008 um 21:19 schrieb Steinar H. Gunderson:
>
>> On Sat, Jun 28, 2008 at 06:24:42PM +0200, Moritz Onken wrote:
>>> SELECT distinct url from item where url like 'http://www.micro%'
>>> limit
>>> 10;
>>
>> Here, the planner knows the pattern beforehand, and can see that
>> it's a
>> simple prefix.
>>> select *
>>> from result
>>> where exists
>>>  (select * from item where item.url LIKE result.url || '%' limit 1)
>>> limit 10;
>>
>> Here it cannot (what if result.url was '%foo%'?).
>
> That's right. Thanks for that hint. Is there a Postgres function
> which returns a constant (possibly an escape function)?
>>
>>
>> Try using something like (item.url >= result.url && item.url <=
>> result.url ||
>> 'z'), substituting an appropriately high character for 'z'.
>>
>>> The only explaination is that I don't use a constant when
>>> comparing the
>>> values. But actually it is a constant...
>
> I created a new column in "item" where I store the shortened url
> which makes "=" comparisons possible.
>
> the result table has 20.000.000 records and the item table 5.000.000.
> The query
>
> select count(1) from result where url in (select shorturl from item
> where shorturl = result.url);
>
> will take about 8 hours (still running, just guessing). Is this
> reasonable on a system with 1 GB of RAM and a AMD Athlon 64 3200+
> processor? (1 SATA HDD)
>
> regards,
>
> moritz
>


Re: Planner should use index on a LIKE 'foo%' query

От
Moritz Onken
Дата:
Am 28.06.2008 um 21:19 schrieb Steinar H. Gunderson:

> On Sat, Jun 28, 2008 at 06:24:42PM +0200, Moritz Onken wrote:
>> SELECT distinct url from item where url like 'http://www.micro%'
>> limit
>> 10;
>
> Here, the planner knows the pattern beforehand, and can see that
> it's a
> simple prefix.
>> select *
>> from result
>> where exists
>>  (select * from item where item.url LIKE result.url || '%' limit 1)
>> limit 10;
>
> Here it cannot (what if result.url was '%foo%'?).

That's right. Thanks for that hint. Is there a Postgres function which
returns a constant (possibly an escape function)?
>
>
> Try using something like (item.url >= result.url && item.url <=
> result.url ||
> 'z'), substituting an appropriately high character for 'z'.
>
>> The only explaination is that I don't use a constant when comparing
>> the
>> values. But actually it is a constant...

I created a new column in "item" where I store the shortened url which
makes "=" comparisons possible.

the result table has 20.000.000 records and the item table 5.000.000.
The query

select count(1) from result where url in (select shorturl from item
where shorturl = result.url);

will take about 8 hours (still running, just guessing). Is this
reasonable on a system with 1 GB of RAM and a AMD Athlon 64 3200+
processor? (1 SATA HDD)

regards,

moritz



Re: Planner should use index on a LIKE 'foo%' query

От
Matthew Wakeling
Дата:
On Mon, 30 Jun 2008, Moritz Onken wrote:
> I created a new column in "item" where I store the shortened url which makes
> "=" comparisons possible.

Good idea. Now create an index on that column.

> select count(1) from result where url in (select shorturl from item where
> shorturl = result.url);

What on earth is wrong with writing it like this?

SELECT COUNT(*) FROM (SELECT DISTINCT result.url FROM result, item WHERE
    item.shorturl = result.url) AS a

That should do a fairly sensible join plan. There's no point in using
fancy IN or EXISTS syntax when a normal join will do.

Matthew

--
I have an inferiority complex. But it's not a very good one.

Re: Planner should use index on a LIKE 'foo%' query

От
Dimitri Fontaine
Дата:
Hi,

Le samedi 28 juin 2008, Moritz Onken a écrit :
> select count(*)
>   from result
>   where exists
>     (select * from item where item.url LIKE result.url || '%' limit 1);
>
> which basically returns the number of items which exist in table
> result and match a URL in table item by its prefix.

It seems you could benefit from the prefix project, which support indexing
your case of prefix searches. Your query would then be:
  SELECT count(*) FROM result r JOIN item i ON r.url @> i.url;

The result.url column would have to made of type prefix_range, which casts
automatically to text when needed.

Find out more about the prefix projects at those urls:
  http://pgfoundry.org/projects/prefix
  http://prefix.projects.postgresql.org/README.html

Regards,
--
dim

Вложения

Re: Planner should use index on a LIKE 'foo%' query

От
Moritz Onken
Дата:
Am 30.06.2008 um 12:19 schrieb Matthew Wakeling:
>
>> select count(1) from result where url in (select shorturl from item
>> where shorturl = result.url);
>
> What on earth is wrong with writing it like this?
>
> SELECT COUNT(*) FROM (SELECT DISTINCT result.url FROM result, item
> WHERE
>   item.shorturl = result.url) AS a

I tried the this approach but it's slower than WHERE IN in my case.

>
> It seems you could benefit from the prefix project, which support
> indexing
> your case of prefix searches. Your query would then be:
>  SELECT count(*) FROM result r JOIN item i ON r.url @> i.url;
>
> The result.url column would have to made of type prefix_range, which
> casts
> automatically to text when needed.
>
> Find out more about the prefix projects at those urls:
>  http://pgfoundry.org/projects/prefix
>  http://prefix.projects.postgresql.org/README.html
>
> Regards,
> --
> dim

Thanks for that! looks interesting.

regards

Re: Planner should use index on a LIKE 'foo%' query

От
Matthew Wakeling
Дата:
On Mon, 30 Jun 2008, Moritz Onken wrote:
>> SELECT COUNT(*) FROM (SELECT DISTINCT result.url FROM result, item WHERE
>>  item.shorturl = result.url) AS a
>
> I tried the this approach but it's slower than WHERE IN in my case.

However there's a lot more scope for improving a query along these lines,
like adding indexes, or CLUSTERing on an index. It depends what other
queries you are wanting to run.

I don't know how much update/insert activity there will be on your
database. However, if you were to add an index on the URL on both tables,
then CLUSTER both tables on those indexes, and ANALYSE, then this query
should run as a merge join, and be pretty quick.

However, this is always going to be a long-running query, because it
accesses at least one whole table scan of a large table.

Matthew

--
"Finger to spiritual emptiness underlying everything."
        -- How a foreign C manual referred to a "pointer to void."

Re: Planner should use index on a LIKE 'foo%' query

От
Moritz Onken
Дата:
>
>
> However there's a lot more scope for improving a query along these
> lines, like adding indexes, or CLUSTERing on an index. It depends
> what other queries you are wanting to run.
>
> I don't know how much update/insert activity there will be on your
> database. However, if you were to add an index on the URL on both
> tables, then CLUSTER both tables on those indexes, and ANALYSE, then
> this query should run as a merge join, and be pretty quick.
>
> However, this is always going to be a long-running query, because it
> accesses at least one whole table scan of a large table.
>
> Matthew

There are already indexes on the url columns. I didn't cluster yet but
this is a pretty good idea, thanks. There will be no updates or
inserts. It's static data for research purposes.

moritz

Re: Planner should use index on a LIKE 'foo%' query

От
Moritz Onken
Дата:
Am 30.06.2008 um 16:59 schrieb Steinar H. Gunderson:

> On Mon, Jun 30, 2008 at 09:16:06AM +0200, Moritz Onken wrote:
>> the result table has 20.000.000 records and the item table 5.000.000.
>> The query
>>
>> select count(1) from result where url in (select shorturl from item
>> where shorturl = result.url);
>>
>> will take about 8 hours (still running, just guessing). Is this
>> reasonable on a system with 1 GB of RAM and a AMD Athlon 64 3200+
>> processor? (1 SATA HDD)
>
> I really don't see what your query tries to accomplish. Why would
> you want
> "url IN (... where .. = url)"? Wouldn't you want a different qualifier
> somehow?

well, it counts the number of rows with urls which already exist in
another
table.
How would you describe the query?
If the "(select shorturl from item where shorturl = result.url)"
clause is empty the row is not counted, that's what I want...

greetings,

moritz

Re: Planner should use index on a LIKE 'foo%' query

От
Matthew Wakeling
Дата:
On Mon, 30 Jun 2008, Moritz Onken wrote:
>>> select count(1) from result where url in (select shorturl from item
>>> where shorturl = result.url);
>>
>> I really don't see what your query tries to accomplish. Why would you want
>> "url IN (... where .. = url)"? Wouldn't you want a different qualifier
>> somehow?
>
> well, it counts the number of rows with urls which already exist in another
> table.
> How would you describe the query?
> If the "(select shorturl from item where shorturl = result.url)"
> clause is empty the row is not counted, that's what I want...

The thing here is that you are effectively causing Postgres to run a
sub-select for each row of the "result" table, each time generating either
an empty list or a list with one or more identical URLs. This is
effectively forcing a nested loop. In a way, you have two constraints
where you only need one.

You can safely take out the constraint in the subquery, so it is like
this:

SELECT COUNT(*) FROM result WHERE url IN (SELECT shorturl FROM item);

This will generate equivalent results, because those rows that didn't
match the constraint wouldn't have affected the IN anyway. However, it
will alter the performance, because the subquery will contain more
results, but it will only be run once, rather than multiple times. This is
effectively forcing a hash join (kind of).

Whereas if you rewrite the query as I demonstrated earlier, then you allow
Postgres to make its own choice about which join algorithm will work best.

Matthew

--
Anyone who goes to a psychiatrist ought to have his head examined.

Re: Planner should use index on a LIKE 'foo%' query

От
Moritz Onken
Дата:
>
> The thing here is that you are effectively causing Postgres to run a
> sub-select for each row of the "result" table, each time generating
> either an empty list or a list with one or more identical URLs. This
> is effectively forcing a nested loop. In a way, you have two
> constraints where you only need one.
>
> You can safely take out the constraint in the subquery, so it is
> like this:
>
> SELECT COUNT(*) FROM result WHERE url IN (SELECT shorturl FROM item);
>
> This will generate equivalent results, because those rows that
> didn't match the constraint wouldn't have affected the IN anyway.
> However, it will alter the performance, because the subquery will
> contain more results, but it will only be run once, rather than
> multiple times. This is effectively forcing a hash join (kind of).
>
> Whereas if you rewrite the query as I demonstrated earlier, then you
> allow Postgres to make its own choice about which join algorithm
> will work best.
>
> Matthew

Thank you! I learned a lot today :-)
I thought the subquery will be run on every row thus I tried to make
it as fast as possible by using a where clause. I didn't try your
first query on the hole table so it could be faster than mine approach.

greetings,

moritz