Обсуждение: How to optimize query that concatenates strings?
Hi all, I've got a database of URLs, and when inserting new data into it I want to make sure that there are no functionally equivalent URLs already present. For example, 'umist.ac.uk' is functionally the same as 'umist.ac.uk/'. I find that searching for the latter form, using string concatentation to append the trailing slash, is much slower than searching for a simple string - the index on URL name isn't used to speed up the search. Here's an illustration url=# explain select exists(select * from url where url = 'umist.ac.uk' or url || '/' = 'umist.ac.uk') as present; QUERY PLAN ----------------------------------------------------------------------------------------------- Result (cost=47664.01..47664.02 rows=1 width=0) InitPlan -> Seq Scan on url (cost=0.00..47664.01 rows=6532 width=38) Filter: ((url = 'umist.ac.uk'::text) OR ((url || '/'::text) = 'umist.ac.uk'::text)) (4 rows) url=# explain select exists(select * from url where url = 'umist.ac.uk') as present; QUERY PLAN ---------------------------------------------------------------------------- Result (cost=5.97..5.98 rows=1 width=0) InitPlan -> Index Scan using url_idx on url (cost=0.00..5.97 rows=1 width=38) Index Cond: (url = 'umist.ac.uk'::text) (4 rows) Is there any way I can force postgres to use the index when using the string concatenation in the query? Thanks in advance, BBB
badlydrawnbhoy wrote: > Hi all, > > I've got a database of URLs, and when inserting new data into it I want > to make sure that there are no functionally equivalent URLs already > present. For example, 'umist.ac.uk' is functionally the same as > 'umist.ac.uk/'. > > I find that searching for the latter form, using string concatentation > to append the trailing slash, is much slower than searching for a > simple string - the index on URL name isn't used to speed up the > search. > > Here's an illustration > > url=# explain select exists(select * from url where url = 'umist.ac.uk' > or url || '/' = 'umist.ac.uk') as present; Well, in that example, you should just remove the OR conditional - it just evaluates to false anyways. > > Is there any way I can force postgres to use the index when using the > string concatenation in the query? If you are always going to strcat with a '/', you could probably create a functional index or add a new column for a normalized url (which is what I'd lean towards).
You could build a function-based index that contains the "simplified" version of each URL (in your case, the field with the '/' stripped). Then use the same function on the URL going in. In that case PostgreSQL will use the index that you created already.
Take a look at the PostgreSQL documentation for function-based indexes.
select from ... where simplify(url) <> url_col;
In the example above 'url_col' would have a function-based index that was based on 'simplify(url_col)'
badlydrawnbhoy wrote:
Take a look at the PostgreSQL documentation for function-based indexes.
select from ... where simplify(url) <> url_col;
In the example above 'url_col' would have a function-based index that was based on 'simplify(url_col)'
Chander Ganesan Open Technology Group, Inc. One Copley Parkway, Suite 210 Morrisville, NC 27560 Phone: 877-258-8987/919-463-0999
badlydrawnbhoy wrote:
Hi all, I've got a database of URLs, and when inserting new data into it I want to make sure that there are no functionally equivalent URLs already present. For example, 'umist.ac.uk' is functionally the same as 'umist.ac.uk/'. I find that searching for the latter form, using string concatentation to append the trailing slash, is much slower than searching for a simple string - the index on URL name isn't used to speed up the search. Here's an illustration url=# explain select exists(select * from url where url = 'umist.ac.uk' or url || '/' = 'umist.ac.uk') as present; QUERY PLAN -----------------------------------------------------------------------------------------------Result (cost=47664.01..47664.02 rows=1 width=0) InitPlan -> Seq Scan on url (cost=0.00..47664.01 rows=6532 width=38) Filter: ((url = 'umist.ac.uk'::text) OR ((url || '/'::text) = 'umist.ac.uk'::text)) (4 rows) url=# explain select exists(select * from url where url = 'umist.ac.uk') as present; QUERY PLAN ----------------------------------------------------------------------------Result (cost=5.97..5.98 rows=1 width=0) InitPlan -> Index Scan using url_idx on url (cost=0.00..5.97 rows=1 width=38) Index Cond: (url = 'umist.ac.uk'::text) (4 rows) Is there any way I can force postgres to use the index when using the string concatenation in the query? Thanks in advance, BBB ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster
On Fri, Jul 07, 2006 at 04:29:51AM -0700, badlydrawnbhoy <badlydrawnbhoy@gmail.com> wrote a message of 48 lines which said: > I've got a database of URLs, and when inserting new data into it I > want to make sure that there are no functionally equivalent URLs > already present. For example, 'umist.ac.uk' is functionally the same > as 'umist.ac.uk/'. IMHO, your problem seems to be an instance of a very general class: data that needs canonicalization. For instance, I work for a domain name registry and domain names are case-insensitive. Do we store the domain names blindly and then always use ILIKE or other case-insensitive operators? No, we canonicalize domain names by changing them to lower-case. That way: * we do not have to think of using case-insensitive operators, * indexes do work. This is what I recommend here: decide on a canonical form and canonicalize everything when it comes into the database (it is easy to do it from a trigger). If the original form is important, you can still store it in a column intended for display, not for searching. Here is a way to canonicalize, with a trigger. The function "canonicalize" is left as an exercice (you can write it in PL/pgSQL, C, Python, etc): CREATE OR REPLACE FUNCTION force_canonicalization() RETURNS TRIGGER AS 'BEGIN NEW.url = canonicalize(NEW.url); RETURN NEW; END;' LANGUAGE PLPGSQL; CREATE TRIGGER force_canonicalization BEFORE INSERT ON Things FOR EACH ROW EXECUTE PROCEDURE force_canonicalization();