How to optimize query that concatenates strings?

From:
"badlydrawnbhoy"
Date:
Hi all,

I've got a database of URLs, and when inserting new data into it I want
to make sure that there are no functionally equivalent URLs already
present. For example, 'umist.ac.uk' is functionally the same as
'umist.ac.uk/'.

I find that searching for the latter form, using string concatenation
to append the trailing slash, is much slower than searching for a
simple string - the index on the URL name isn't used to speed up the
search.

Here's an illustration:

url=# explain select exists(select * from url where url = 'umist.ac.uk'
or url || '/' = 'umist.ac.uk') as present;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Result  (cost=47664.01..47664.02 rows=1 width=0)
   InitPlan
     ->  Seq Scan on url  (cost=0.00..47664.01 rows=6532 width=38)
           Filter: ((url = 'umist.ac.uk'::text) OR ((url || '/'::text) = 'umist.ac.uk'::text))
(4 rows)

url=# explain select exists(select * from url where url =
'umist.ac.uk') as present;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Result  (cost=5.97..5.98 rows=1 width=0)
   InitPlan
     ->  Index Scan using url_idx on url  (cost=0.00..5.97 rows=1 width=38)
           Index Cond: (url = 'umist.ac.uk'::text)
(4 rows)


Is there any way I can force postgres to use the index when using the
string concatenation in the query?

Thanks in advance,

BBB


Re: How to optimize query that concatenates strings?

From:
Jacob Coby
Date:
badlydrawnbhoy wrote:
> Hi all,
>
> I've got a database of URLs, and when inserting new data into it I want
> to make sure that there are no functionally equivalent URLs already
> present. For example, 'umist.ac.uk' is functionally the same as
> 'umist.ac.uk/'.
>
> I find that searching for the latter form, using string concatenation
> to append the trailing slash, is much slower than searching for a
> simple string - the index on the URL name isn't used to speed up the
> search.
>
> Here's an illustration
>
> url=# explain select exists(select * from url where url = 'umist.ac.uk'
> or url || '/' = 'umist.ac.uk') as present;

Well, in that example you could just remove the OR condition - url || '/'
always ends in '/', so it can never equal 'umist.ac.uk'; that branch
evaluates to false anyway.
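
What was probably intended is to append the '/' to the constant rather
than to the column; that keeps both sides of the OR as plain equalities
on url, which url_idx can serve. A sketch, reusing the query shape from
above:

-- Hypothetical rewrite: append the '/' to the literal, not to the column.
select exists(select * from url
              where url = 'umist.ac.uk'
                 or url = 'umist.ac.uk' || '/') as present;
-- With both comparisons against constants, the planner can use url_idx
-- (e.g. as a bitmap OR of two index scans) instead of a sequential scan.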

>
> Is there any way I can force postgres to use the index when using the
> string concatenation in the query?

If you are always going to concatenate a '/', you could create a
functional (expression) index, or add a new column holding a normalized
URL (which is what I'd lean towards).
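
A sketch of the normalized-column variant (the canon_url column,
url_canon_idx index, and rtrim()-based normalization are just
illustrative assumptions):

-- Hypothetical extra column holding the canonical form of each URL.
alter table url add column canon_url text;

-- Backfill: strip any trailing '/' so both spellings collapse to one form.
update url set canon_url = rtrim(url, '/');

create index url_canon_idx on url (canon_url);

-- Lookups then stay a plain, indexable equality:
select exists(select * from url
              where canon_url = rtrim('umist.ac.uk/', '/')) as present;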



Re: How to optimize query that concatenates strings?

From:
Chander Ganesan
Date:
You could build a function-based index that contains the "simplified" version of each URL (in your case, the URL with the trailing '/' stripped).  Then apply the same function to the incoming URL when you query.  In that case PostgreSQL will use the index that you already created.

Take a look at the PostgreSQL documentation on indexes on expressions (function-based indexes).

select from ... where simplify(url_col) = simplify('incoming url');

In the example above, 'url_col' would have a function-based index built on 'simplify(url_col)', so the left-hand side of the comparison can be answered from that index.
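
For example (a sketch only: simplify() is assumed to be an IMMUTABLE
function you define yourself, e.g. one that strips the trailing '/',
and the names just follow the example above):

-- Hypothetical expression ("function-based") index:
create index url_simplified_idx on url (simplify(url_col));

-- Using the same expression on the indexed column lets the planner use it:
select exists(select * from url
              where simplify(url_col) = simplify('umist.ac.uk/')) as present;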
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC  27560
Phone: 877-258-8987/919-463-0999


badlydrawnbhoy wrote:
> Hi all,
>
> I've got a database of URLs, and when inserting new data into it I want
> to make sure that there are no functionally equivalent URLs already
> present. For example, 'umist.ac.uk' is functionally the same as
> 'umist.ac.uk/'.
>
> I find that searching for the latter form, using string concatenation
> to append the trailing slash, is much slower than searching for a
> simple string - the index on the URL name isn't used to speed up the
> search.
>
> Here's an illustration:
>
> url=# explain select exists(select * from url where url = 'umist.ac.uk'
> or url || '/' = 'umist.ac.uk') as present;
>                                            QUERY PLAN
> -----------------------------------------------------------------------------------------------
>  Result  (cost=47664.01..47664.02 rows=1 width=0)
>    InitPlan
>      ->  Seq Scan on url  (cost=0.00..47664.01 rows=6532 width=38)
>            Filter: ((url = 'umist.ac.uk'::text) OR ((url || '/'::text) = 'umist.ac.uk'::text))
> (4 rows)
>
> url=# explain select exists(select * from url where url =
> 'umist.ac.uk') as present;
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Result  (cost=5.97..5.98 rows=1 width=0)
>    InitPlan
>      ->  Index Scan using url_idx on url  (cost=0.00..5.97 rows=1 width=38)
>            Index Cond: (url = 'umist.ac.uk'::text)
> (4 rows)
>
> Is there any way I can force postgres to use the index when using the
> string concatenation in the query?
>
> Thanks in advance,
>
> BBB



Re: How to optimize query that concatenates strings?

From:
Stephane Bortzmeyer
Date:
On Fri, Jul 07, 2006 at 04:29:51AM -0700,
 badlydrawnbhoy <badlydrawnbhoy@gmail.com> wrote
 a message of 48 lines which said:

> I've got a database of URLs, and when inserting new data into it I
> want to make sure that there are no functionally equivalent URLs
> already present. For example, 'umist.ac.uk' is functionally the same
> as 'umist.ac.uk/'.

IMHO, your problem seems to be an instance of a very general class:
data that needs canonicalization. For instance, I work for a domain
name registry and domain names are case-insensitive. Do we store the
domain names blindly and then always use ILIKE or other
case-insensitive operators? No, we canonicalize domain names by
changing them to lower-case. That way:

* we do not have to think of using case-insensitive operators,

* indexes do work.

This is what I recommend here: decide on a canonical form and
canonicalize everything when it comes into the database (it is easy to
do it from a trigger).

If the original form is important, you can still store it in a column
intended for display, not for searching.

Here is a way to canonicalize, with a trigger. The function
"canonicalize" is left as an exercise (you can write it in PL/pgSQL,
C, Python, etc.):

CREATE OR REPLACE FUNCTION force_canonicalization() RETURNS TRIGGER
   AS 'BEGIN
          NEW.url = canonicalize(NEW.url);
          RETURN NEW;
       END;'
     LANGUAGE PLPGSQL;

CREATE TRIGGER force_canonicalization
   BEFORE INSERT ON url
   FOR EACH ROW
   EXECUTE PROCEDURE force_canonicalization();
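
For what it's worth, a minimal sketch of canonicalize() in PL/pgSQL,
assuming the canonical form is simply lower-case with any trailing '/'
stripped (adjust to whatever canonical form you settle on):

CREATE OR REPLACE FUNCTION canonicalize(in_url TEXT) RETURNS TEXT
   AS 'BEGIN
          -- Assumed canonical form: lower-case, no trailing slash.
          RETURN rtrim(lower(in_url), ''/'');
       END;'
     LANGUAGE PLPGSQL IMMUTABLE;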