Обсуждение: type-casting and LIKE queries
I have an indexed column called home_phone, which is of type bigint. How can I search this column efficiently, using LIKE queries? For example: test=> SELECT id FROM user WHERE home_phone LIKE '407%'::bigint; ERROR: Bad int8 external representation "407%" Obviously, that's not correct, but I'm confused as to how the % and ? characters can be used along with type-casting. Or can they?
On Fri, 14 Mar 2003, valerian wrote: > I have an indexed column called home_phone, which is of type bigint. > How can I search this column efficiently, using LIKE queries? For I don't think you do. LIKE is defined on string types in any case, so even if you could convert it to a bigint, it's not going to do what you seem to want (using an index on a bigint). You could make a function that got say the top 3 digits of a number and use it in a functional index. However, for what you're doing, that might not be safe anyway (nor would like) unless you're certain that your input format is fixed or at least requires a consistent number of digits (what happens if someone doesn't put an area code and you expect one, or does and you don't). > example: > > test=> SELECT id FROM user WHERE home_phone LIKE '407%'::bigint; > ERROR: Bad int8 external representation "407%" > > Obviously, that's not correct, but I'm confused as to how the % and ? > characters can be used along with type-casting. Or can they?
On Fri, 2003-03-14 at 06:26, valerian wrote: > I have an indexed column called home_phone, which is of type bigint. > How can I search this column efficiently, using LIKE queries? For > example: > > test=> SELECT id FROM user WHERE home_phone LIKE '407%'::bigint; > ERROR: Bad int8 external representation "407%" try something like where home_phone::text like '407%' or alternatively harness the immense power of algebra to achieve your devious ends :) HTH, James.
On Fri, Mar 14, 2003 at 06:25:44PM +0000, James Gregory wrote: > try something like > where home_phone::text like '407%' Thanks, this works great. I also tried using regexes on home_phone::text, but they seem a bit slower than LIKE queries, for some reason (I'm guessing this is true in general, as regexes have more possible cases to deal with?) The only downside seems to be that queries that start with the % character don't make use of the text(home_phone) index. Is there a way around this? I noticed that the ? character doesn't have this behavior, so I tried a few things like: SELECT * FROM test WHERE home_phone::text LIKE '??????4820' and this does use the index, but it's a bit of a hack, especially if you're searching on a varchar(255) column... > or alternatively harness the immense power of algebra to achieve your > devious ends :) Not sure what you're implying here, unless maybe you were thinking along the lines of: SELECT * FROM test WHERE home_phone >= 4070000000 AND home_phone <= 4079999999 But unfortunately that won't work, as I have to be able to do searches on any substring of the home_phone column...
At 02:15 PM 3/14/03 -0500, valerian wrote: >this? I noticed that the ? character doesn't have this behavior, so I >tried a few things like: > > SELECT * FROM test WHERE home_phone::text LIKE '??????4820' > >and this does use the index, but it's a bit of a hack, especially if >you're searching on a varchar(255) column... Are you using the ? character as a wildcard? AFAIK _ and % are the wildcard characters for LIKE. > > or alternatively harness the immense power of algebra to achieve your > > devious ends :) > >Not sure what you're implying here, unless maybe you were thinking along >the lines of: > > SELECT * FROM test WHERE home_phone >= 4070000000 AND home_phone <= > 4079999999 > >But unfortunately that won't work, as I have to be able to do searches on >any substring of the home_phone column... Would having two indexes cover enough cases for you? One that allows indexed LIKE '407%'. And one that allows '%4820'. For the latter just reverse the string and index it, and do a search on LIKE '0284%'. Maybe you could create a function that reverses strings, not sure how that would work tho - could look messy since you probably don't want to reverse the % too. Not sure if Postgresql would be able to productively use both indexes to find a substring in the middle of text, given a suitable query. If the substring is in a fixed position in the middle I think it can. Regards, Link.
On Sat, Mar 15, 2003 at 02:00:35PM +0800, Lincoln Yeoh wrote: > Are you using the ? character as a wildcard? AFAIK _ and % are the wildcard > characters for LIKE. Sorry, that was a typo. My application allows searching with * and ? characters, but it translates them to % and _ when creating the sql statement. > Would having two indexes cover enough cases for you? One that allows > indexed LIKE '407%'. And one that allows '%4820'. > > For the latter just reverse the string and index it, and do a search on > LIKE '0284%'. Hey that's an interesting idea! > Maybe you could create a function that reverses strings, not sure how that > would work tho - could look messy since you probably don't want to reverse > the % too. So I guess that trick wouldn't work if the search string has a % both at the beginning and the end... > Not sure if Postgresql would be able to productively use both indexes to > find a substring in the middle of text, given a suitable query. If the > substring is in a fixed position in the middle I think it can. I can't guarantee fixed positions though, because I also have to be able to run arbitrary searches on varchar columns too... For instance, I have a column 'email' of type varchar(255), which has a unique index on lower(email). Some typical searches might be: *@hotmail.com johndoe@* *unix* The first two are no problem if I use the 'reverse index' trick you described. But I don't see how the indexes can get used in the third case? However, I went ahead and created the a 'reverse_lc' function to see what kind of performace I would get. The planner isn't using my index though. :( test=> CREATE UNIQUE INDEX test_email_lc_idx ON test (lower(email)); CREATE INDEX test=> CREATE UNIQUE INDEX test_email_revlc_idx ON test (reverse_lc(email)); CREATE INDEX test=> VACUUM ANALYZE test ; VACUUM test=> EXPLAIN ANALYZE SELECT * FROM test WHERE lower(email) LIKE 'asdf%'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using test_email_lc_idx on test (cost=0.00..125.62 rows=38 width=45) (actual time=0.50..0.50 rows=0 loops=1) Index Cond: ((lower((email)::text) >= 'asdf'::text) AND (lower((email)::text) < 'asdg'::text)) Filter: (lower((email)::text) ~~ 'asdf%'::text) Total runtime: 0.64 msec (4 rows) test=> EXPLAIN ANALYZE SELECT * FROM test WHERE reverse_lc(email) LIKE '%asdf'; QUERY PLAN ------------------------------------------------------------------------------------------------------ Seq Scan on test (cost=0.00..193.56 rows=38 width=45) (actual time=5852.42..5852.42 rows=0 loops=1) Filter: (reverse_lc((email)::text) ~~ '%asdf'::text) Total runtime: 5852.54 msec (3 rows) test=> SELECT reverse_lc('%asdf'); reverse_lc ------------ fdsa% (1 row) Here's the code for it: CREATE FUNCTION reverse_lc(text) RETURNS text IMMUTABLE AS ' DECLARE old_str ALIAS FOR $1; low text; new_str text; len integer; i integer; BEGIN len := length(old_str); low := lower(old_str); new_str := ''''; i := 0; WHILE i < len LOOP new_str := rpad(new_str, i+1, substr(old_str, len-i, 1)); i := i+1; END LOOP; RETURN new_str; END; ' LANGUAGE 'plpgsql'; Did I forget to do something?
On Sun, Mar 16, 2003 at 07:34:37PM -0500, valerian wrote: > test=> EXPLAIN ANALYZE SELECT * FROM test WHERE reverse_lc(email) LIKE '%asdf'; > QUERY PLAN > ------------------------------------------------------------------------------------------------------ > Seq Scan on test (cost=0.00..193.56 rows=38 width=45) (actual time=5852.42..5852.42 rows=0 loops=1) > Filter: (reverse_lc((email)::text) ~~ '%asdf'::text) > Total runtime: 5852.54 msec > (3 rows) Never mind, I just realized that I forgot to reverse the search key also... test=> EXPLAIN ANALYZE SELECT * FROM test WHERE reverse_lc(email) LIKE 'fdsa%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Index Scan using test_email_revlc_idx on test (cost=0.00..125.62 rows=38 width=45) (actual time=0.39..0.39 rows=0 loops=1) Index Cond: ((reverse_lc((email)::text) >= 'fdsa'::text) AND (reverse_lc((email)::text) < 'fdsb'::text)) Filter: (reverse_lc((email)::text) ~~ 'fdsa%'::text) Total runtime: 0.53 msec (4 rows) So that takes care of the first two types of queries, but not the one that has a % both at the beginning and end of the search key. Any ideas on how to handle those?
On Sun, Mar 16, 2003 at 09:05:52PM -0500, valerian wrote: > On Sun, Mar 16, 2003 at 07:34:37PM -0500, valerian wrote: > So that takes care of the first two types of queries, but not the one > that has a % both at the beginning and end of the search key. > > Any ideas on how to handle those? I don't think there's a way to handle expressions with that level of generality. Actually there are ways using suffix tries or some such structure, but there's no implementation of that in Postgres. I don't know if that kind of thing is doable with GiST indexes; should be, at least in theory. If you want to do it, it's going to cost a nontrivial amount of work. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "El miedo atento y previsor es la madre de la seguridad" (E. Burke)
What you're asking for comes under full text indexing. There's a fair bit of research in this field. Supposedly a way to do this is to create an index of substrings. e.g. this is the text Index: this is the text his is the text is is the text s is the text And so on. But without compression and other tricks it might not perform well. For the index can become really huge so using it could be slower than or be about the same speed as a seq scan of the main table. A similar method is to just index keywords. If that is sufficient you could look at the full text index thing for Postgresql. If you're using it for phone numbers, I'd think most people are ok with searching for the starting digits, or the ending digits. For email you could try keywords. In my experience if the keyword table isn't huge then a substring search on the keyword table can be pretty fast. Hope that helps, Link. At 09:05 PM 3/16/03 -0500, valerian wrote: >test=> EXPLAIN ANALYZE SELECT * FROM test WHERE reverse_lc(email) LIKE >'fdsa%'; > QUERY PLAN >----------------------------------------------------------------------------------------------------------------------------- > Index Scan using test_email_revlc_idx on test (cost=0.00..125.62 > rows=38 width=45) (actual time=0.39..0.39 rows=0 loops=1) > Index Cond: ((reverse_lc((email)::text) >= 'fdsa'::text) AND > (reverse_lc((email)::text) < 'fdsb'::text)) > Filter: (reverse_lc((email)::text) ~~ 'fdsa%'::text) > Total runtime: 0.53 msec >(4 rows) > >So that takes care of the first two types of queries, but not the one >that has a % both at the beginning and end of the search key. > >Any ideas on how to handle those?