Обсуждение: How to optimize query that concatenates strings?
Hi all,
I've got a database of URLs, and when inserting new data into it I want
to make sure that there are no functionally equivalent URLs already
present. For example, 'umist.ac.uk' is functionally the same as
'umist.ac.uk/'.
I find that searching for the latter form, using string concatentation
to append the trailing slash, is much slower than searching for a
simple string - the index on URL name isn't used to speed up the
search.
Here's an illustration
url=# explain select exists(select * from url where url = 'umist.ac.uk'
or url || '/' = 'umist.ac.uk') as present;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Result (cost=47664.01..47664.02 rows=1 width=0)
InitPlan
-> Seq Scan on url (cost=0.00..47664.01 rows=6532 width=38)
Filter: ((url = 'umist.ac.uk'::text) OR ((url || '/'::text)
= 'umist.ac.uk'::text))
(4 rows)
url=# explain select exists(select * from url where url =
'umist.ac.uk') as present;
QUERY PLAN
----------------------------------------------------------------------------
Result (cost=5.97..5.98 rows=1 width=0)
InitPlan
-> Index Scan using url_idx on url (cost=0.00..5.97 rows=1
width=38)
Index Cond: (url = 'umist.ac.uk'::text)
(4 rows)
Is there any way I can force postgres to use the index when using the
string concatenation in the query?
Thanks in advance,
BBB
badlydrawnbhoy wrote: > Hi all, > > I've got a database of URLs, and when inserting new data into it I want > to make sure that there are no functionally equivalent URLs already > present. For example, 'umist.ac.uk' is functionally the same as > 'umist.ac.uk/'. > > I find that searching for the latter form, using string concatentation > to append the trailing slash, is much slower than searching for a > simple string - the index on URL name isn't used to speed up the > search. > > Here's an illustration > > url=# explain select exists(select * from url where url = 'umist.ac.uk' > or url || '/' = 'umist.ac.uk') as present; Well, in that example, you should just remove the OR conditional - it just evaluates to false anyways. > > Is there any way I can force postgres to use the index when using the > string concatenation in the query? If you are always going to strcat with a '/', you could probably create a functional index or add a new column for a normalized url (which is what I'd lean towards).
You could build a function-based index that contains the "simplified" version of each URL (in your case, the field with the '/' stripped). Then use the same function on the URL going in. In that case PostgreSQL will use the index that you created already.
Take a look at the PostgreSQL documentation for function-based indexes.
select from ... where simplify(url) <> url_col;
In the example above 'url_col' would have a function-based index that was based on 'simplify(url_col)'
badlydrawnbhoy wrote:
Take a look at the PostgreSQL documentation for function-based indexes.
select from ... where simplify(url) <> url_col;
In the example above 'url_col' would have a function-based index that was based on 'simplify(url_col)'
Chander Ganesan Open Technology Group, Inc. One Copley Parkway, Suite 210 Morrisville, NC 27560 Phone: 877-258-8987/919-463-0999
badlydrawnbhoy wrote:
Hi all, I've got a database of URLs, and when inserting new data into it I want to make sure that there are no functionally equivalent URLs already present. For example, 'umist.ac.uk' is functionally the same as 'umist.ac.uk/'. I find that searching for the latter form, using string concatentation to append the trailing slash, is much slower than searching for a simple string - the index on URL name isn't used to speed up the search. Here's an illustration url=# explain select exists(select * from url where url = 'umist.ac.uk' or url || '/' = 'umist.ac.uk') as present; QUERY PLAN -----------------------------------------------------------------------------------------------Result (cost=47664.01..47664.02 rows=1 width=0) InitPlan -> Seq Scan on url (cost=0.00..47664.01 rows=6532 width=38) Filter: ((url = 'umist.ac.uk'::text) OR ((url || '/'::text) = 'umist.ac.uk'::text)) (4 rows) url=# explain select exists(select * from url where url = 'umist.ac.uk') as present; QUERY PLAN ----------------------------------------------------------------------------Result (cost=5.97..5.98 rows=1 width=0) InitPlan -> Index Scan using url_idx on url (cost=0.00..5.97 rows=1 width=38) Index Cond: (url = 'umist.ac.uk'::text) (4 rows) Is there any way I can force postgres to use the index when using the string concatenation in the query? Thanks in advance, BBB ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster
On Fri, Jul 07, 2006 at 04:29:51AM -0700,
badlydrawnbhoy <badlydrawnbhoy@gmail.com> wrote
a message of 48 lines which said:
> I've got a database of URLs, and when inserting new data into it I
> want to make sure that there are no functionally equivalent URLs
> already present. For example, 'umist.ac.uk' is functionally the same
> as 'umist.ac.uk/'.
IMHO, your problem seems to be an instance of a very general class:
data that needs canonicalization. For instance, I work for a domain
name registry and domain names are case-insensitive. Do we store the
domain names blindly and then always use ILIKE or other
case-insensitive operators? No, we canonicalize domain names by
changing them to lower-case. That way:
* we do not have to think of using case-insensitive operators,
* indexes do work.
This is what I recommend here: decide on a canonical form and
canonicalize everything when it comes into the database (it is easy to
do it from a trigger).
If the original form is important, you can still store it in a column
intended for display, not for searching.
Here is a way to canonicalize, with a trigger. The function
"canonicalize" is left as an exercice (you can write it in PL/pgSQL,
C, Python, etc):
CREATE OR REPLACE FUNCTION force_canonicalization() RETURNS TRIGGER
AS 'BEGIN
NEW.url = canonicalize(NEW.url);
RETURN NEW;
END;'
LANGUAGE PLPGSQL;
CREATE TRIGGER force_canonicalization
BEFORE INSERT ON Things
FOR EACH ROW
EXECUTE PROCEDURE force_canonicalization();