Thread: Query tuning help
Sorry to bother everyone with yet another "my query isn't using an index" problem, but I am over my head on this one. I am open to ways of restructuring this query to perform better.

I have a table, 'ea', with 22 million rows in it. VACUUM ANALYZE has just been run on the table.

This is the result of:

explain analyze
select distinct
   em.incidentid,
   ea.recordtext as retdata,
   eg.long,
   eg.lat
from
   ea, em, eg
where
   em.incidentid = ea.incidentid and
   em.incidentid = eg.incidentid and
   em.entrydate >= '2005-1-1 00:00' and
   em.entrydate <= '2005-5-9 00:00'
   and ea.incidentid in (
      select incidentid from ea where recordtext like '%RED%'
   )
   and ea.incidentid in (
      select incidentid from ea where recordtext like '%CORVETTE%'
   )
   and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
order by em.entrydate

---------------------
ANALYZE RESULTS
---------------------

 Unique  (cost=774693.72..774693.76 rows=1 width=159) (actual time=446787.056..446787.342 rows=72 loops=1)
   ->  Sort  (cost=774693.72..774693.72 rows=1 width=159) (actual time=446787.053..446787.075 rows=72 loops=1)
         Sort Key: em.incidentid, public.ea.recordtext, eg.long, eg.lat
         ->  Nested Loop  (cost=771835.10..774693.71 rows=1 width=159) (actual time=444378.655..446786.746 rows=72 loops=1)
               ->  Nested Loop  (cost=771835.10..774688.81 rows=1 width=148) (actual time=444378.532..446768.381 rows=72 loops=1)
                     ->  Nested Loop IN Join  (cost=771835.10..774678.88 rows=2 width=81) (actual time=444367.080..446191.864 rows=701 loops=1)
                           ->  Nested Loop  (cost=771835.10..774572.05 rows=42 width=64) (actual time=444366.859..445463.232 rows=1011 loops=1)
                                 ->  HashAggregate  (cost=771835.10..771835.10 rows=1 width=17) (actual time=444366.702..444368.583 rows=473 loops=1)
                                       ->  Seq Scan on ea  (cost=0.00..771834.26 rows=335 width=17) (actual time=259.746..444358.837 rows=592 loops=1)
                                             Filter: ((recordtext)::text ~~ '%CORVETTE%'::text)
                                 ->  Index Scan using ea1 on ea  (cost=0.00..2736.43 rows=42 width=47) (actual time=2.085..2.309 rows=2 loops=473)
                                       Index Cond: ((ea.incidentid)::text = ("outer".incidentid)::text)
                                       Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))
                           ->  Index Scan using ea1 on ea  (cost=0.00..2733.81 rows=42 width=17) (actual time=0.703..0.703 rows=1 loops=1011)
                                 Index Cond: (("outer".incidentid)::text = (ea.incidentid)::text)
                                 Filter: ((recordtext)::text ~~ '%RED%'::text)
                     ->  Index Scan using em_incidentid_idx on em  (cost=0.00..4.95 rows=1 width=67) (actual time=0.820..0.821 rows=0 loops=701)
                           Index Cond: (("outer".incidentid)::text = (em.incidentid)::text)
                           Filter: ((entrydate >= '2005-01-01 00:00:00'::timestamp without time zone) AND (entrydate <= '2005-05-09 00:00:00'::timestamp without time zone))
               ->  Index Scan using eg_incidentid_idx on eg  (cost=0.00..4.89 rows=1 width=79) (actual time=0.245..0.246 rows=1 loops=72)
                     Index Cond: (("outer".incidentid)::text = (eg.incidentid)::text)
 Total runtime: 446871.880 ms
(22 rows)

-------------------------
EXPLANATION
-------------------------
The reason for the redundant LIKE clause is that first, I only want those "incidentid"s that contain the words 'RED' and 'CORVETTE'. BUT, those two words may exist across multiple records with the same incidentid. Then, I only want to actually work with the rows that contain one of the words. This query will repeat the same logic for however many keywords are entered by the user. I have investigated text searching options and have not found them to be congruous with my application.

Why is it choosing a sequential scan for one part of the query when searching for the words, yet using an index scan for another part of it? Is there a better way to structure the query to give it better hints?

I'm using 8.0.1 on a 4-way Opteron with beefy RAID-10 and 12GB of RAM.

Thank you for any advice.

-Dan
Dan,

> and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )

It is simply not possible to use B-tree indexes on these kinds of text queries. B-trees require you to start at the "left" side of the field, because B-trees locate records via less-than/greater-than comparisons. "Anywhere in the field" text search requires a Full Text Index.

> The reason for the redundant LIKE clause is that first, I only want
> those "incidentid"s that contain the words 'RED' and 'CORVETTE'. BUT,
> those two words may exist across multiple records with the same
> incidentid. Then, I only want to actually work with the rows that
> contain one of the words. This query will repeat the same logic for
> however many keywords are entered by the user. I have investigated
> text searching options and have not found them to be congruous with my
> application.

Sounds like you either need to restructure your application, restructure your database (so that you're not doing "anywhere in field" searches), or buy 32GB of RAM so that you can cache the whole table.

--
Josh Berkus
Aglio Database Solutions
San Francisco
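The left-anchored limitation can be sketched outside the database: a sorted structure (here a plain Python list standing in for a B-tree; this is an illustration, not PostgreSQL internals) can narrow a prefix match to one contiguous range, but a substring match gives the sort order no starting point, so every entry must be examined.

```python
import bisect

# Hypothetical miniature "index": a sorted list stands in for a B-tree.
values = sorted(["BLUE TRUCK", "CORVETTE RED", "RED CAR", "RED CORVETTE", "YELLOW CAB"])

def prefix_search(sorted_vals, prefix):
    """Left-anchored match (LIKE 'RED%'): binary search finds a contiguous range."""
    lo = bisect.bisect_left(sorted_vals, prefix)
    hi = bisect.bisect_left(sorted_vals, prefix + "\uffff")
    return sorted_vals[lo:hi]          # O(log n) to locate the range

def substring_search(sorted_vals, needle):
    """'%RED%'-style match: sort order is no help; every entry is checked."""
    return [v for v in sorted_vals if needle in v]   # O(n) scan

print(prefix_search(values, "RED"))       # only the 'RED...' range
print(substring_search(values, "RED"))    # must touch every row
```

This is why the planner falls back to a sequential scan for the '%CORVETTE%' filter: the B-tree on recordtext simply cannot answer that question any faster.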
On Mon, 9 May 2005 09:20 am, Dan Harris wrote:
> Sorry to bother everyone with yet another "my query isn't using an
> index" problem but I am over my head on this one.. I am open to ways
> of restructuring this query to perform better.
>
> I have a table, 'ea', with 22 million rows in it. VACUUM ANALYZE has
> just been run on the table.
>
> This is the result of:
>
> explain analyze
> select distinct
>    em.incidentid,
>    ea.recordtext as retdata,
>    eg.long,
>    eg.lat
> from
>    ea, em, eg
> where
>    em.incidentid = ea.incidentid and
>    em.incidentid = eg.incidentid and
>    em.entrydate >= '2005-1-1 00:00' and
>    em.entrydate <= '2005-5-9 00:00'
>    and ea.incidentid in (
>       select incidentid from ea where recordtext like '%RED%'
>    )
>    and ea.incidentid in (
>       select incidentid from ea where recordtext like '%CORVETTE%'
>    )
>    and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
> order by em.entrydate
>
You cannot use an index for %CORVETTE% or %RED%. There is no way for the index to know if a row had that in the middle without scanning the whole index, so it's much cheaper to do a sequential scan.

One possible way to make the query faster is to limit based on date, as you will only get about 700 rows. And then don't use subselects, as they are doing full sequential scans. I think this query does what you do above, and I think it will be faster, but I don't know.
select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em
JOIN ea ON (em.incidentid = ea.incidentid
    AND em.entrydate >= '2005-1-1 00:00'
    AND em.entrydate <= '2005-5-9 00:00'
    AND ea.recordtext like '%RED%'
    AND ea.recordtext like '%CORVETTE%')
JOIN eg ON em.incidentid = eg.incidentid
WHERE (recordtext like '%RED%' or recordtext like '%CORVETTE%');

> ---------------------
> ANALYZE RESULTS
> ---------------------
>
> Unique  (cost=774693.72..774693.76 rows=1 width=159) (actual time=446787.056..446787.342 rows=72 loops=1)
>   ->  Sort  (cost=774693.72..774693.72 rows=1 width=159) (actual time=446787.053..446787.075 rows=72 loops=1)
>         Sort Key: em.incidentid, public.ea.recordtext, eg.long, eg.lat
>         ->  Nested Loop  (cost=771835.10..774693.71 rows=1 width=159) (actual time=444378.655..446786.746 rows=72 loops=1)
>               ->  Nested Loop  (cost=771835.10..774688.81 rows=1 width=148) (actual time=444378.532..446768.381 rows=72 loops=1)
>                     ->  Nested Loop IN Join  (cost=771835.10..774678.88 rows=2 width=81) (actual time=444367.080..446191.864 rows=701 loops=1)
>                           ->  Nested Loop  (cost=771835.10..774572.05 rows=42 width=64) (actual time=444366.859..445463.232 rows=1011 loops=1)
>                                 ->  HashAggregate  (cost=771835.10..771835.10 rows=1 width=17) (actual time=444366.702..444368.583 rows=473 loops=1)
>                                       ->  Seq Scan on ea  (cost=0.00..771834.26 rows=335 width=17) (actual time=259.746..444358.837 rows=592 loops=1)
>                                             Filter: ((recordtext)::text ~~ '%CORVETTE%'::text)
>                                 ->  Index Scan using ea1 on ea  (cost=0.00..2736.43 rows=42 width=47) (actual time=2.085..2.309 rows=2 loops=473)
>                                       Index Cond: ((ea.incidentid)::text = ("outer".incidentid)::text)
>                                       Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))
>                           ->  Index Scan using ea1 on ea  (cost=0.00..2733.81 rows=42 width=17) (actual time=0.703..0.703 rows=1 loops=1011)
>                                 Index Cond: (("outer".incidentid)::text = (ea.incidentid)::text)
>                                 Filter: ((recordtext)::text ~~ '%RED%'::text)
>                     ->  Index Scan using em_incidentid_idx on em  (cost=0.00..4.95 rows=1 width=67) (actual time=0.820..0.821 rows=0 loops=701)
>                           Index Cond: (("outer".incidentid)::text = (em.incidentid)::text)
>                           Filter: ((entrydate >= '2005-01-01 00:00:00'::timestamp without time zone) AND (entrydate <= '2005-05-09 00:00:00'::timestamp without time zone))
>               ->  Index Scan using eg_incidentid_idx on eg  (cost=0.00..4.89 rows=1 width=79) (actual time=0.245..0.246 rows=1 loops=72)
>                     Index Cond: (("outer".incidentid)::text = (eg.incidentid)::text)
> Total runtime: 446871.880 ms
> (22 rows)
>
> -------------------------
> EXPLANATION
> -------------------------
> The reason for the redundant LIKE clause is that first, I only want
> those "incidentid"s that contain the words 'RED' and 'CORVETTE'. BUT,
> those two words may exist across multiple records with the same
> incidentid. Then, I only want to actually work with the rows that
> contain one of the words. This query will repeat the same logic for
> however many keywords are entered by the user. I have investigated
> text searching options and have not found them to be congruous with my
> application.
>
> Why is it choosing a sequential scan one part of the query when
> searching for the words, yet using an index scan for another part of
> it? Is there a better way to structure the query to give it better
> hints?
>
> I'm using 8.0.1 on a 4-way Opteron with beefy RAID-10 and 12GB of RAM.
>
> Thank you for any advice.
>
> -Dan
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>        joining column's datatypes do not match
Russell Smith <mr-russ@pws.com.au> writes:
> On Mon, 9 May 2005 09:20 am, Dan Harris wrote:
>> and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
> You cannot use an index for %CORVETTE%, or %RED%.

Not a btree index, anyway. Dan might have some success here with a full-text-indexing package (eg, contrib/tsearch2).

			regards, tom lane
On May 8, 2005, at 6:51 PM, Russell Smith wrote:
> On Mon, 9 May 2005 09:20 am, Dan Harris wrote:
> You cannot use an index for %CORVETTE%, or %RED%. There is no way
> for the index to know if a row had that in the middle without scanning
> the whole index. So it's much cheaper to do a sequence scan.

While I believe you, I'm confused by this line in my original EXPLAIN ANALYZE:

>> ->  Index Scan using ea1 on ea  (cost=0.00..2736.43 rows=42 width=47)
>>     (actual time=2.085..2.309 rows=2 loops=473)
>>       Index Cond: ((ea.incidentid)::text = ("outer".incidentid)::text)
>>       Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))

Doesn't that mean it was using an index to filter? Along those lines, before I created index 'ea1', the query was much, much slower. So, it seemed like creating this index made a difference.

> One possible way to make the query faster is to limit based on date,
> as you will only get about 700 rows. And then don't use subselects,
> as they are doing full sequence scans. I think this query does what
> you do above, and I think it will be faster, but I don't know.

I REALLY like this idea! If I could just filter by date first and then sequentially scan through those, it should be very manageable. Hopefully I can keep this goal while still accommodating the requirement listed in my next paragraph.

> select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
> FROM em JOIN ea ON (em.incidentid = ea.incidentid
>     AND em.entrydate >= '2005-1-1 00:00'
>     AND em.entrydate <= '2005-5-9 00:00'
>     AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%')
> JOIN eg ON em.incidentid = eg.incidentid
> WHERE (recordtext like '%RED%' or recordtext like '%CORVETTE%');

I have run this, and while it is very fast, I'm concerned it's not doing what I need.
Here's the situation:

Due to the format of the systems with which I integrate ( I have no control over these formats ), we will get these 'recordtext' values one line at a time, accumulating over time. The only way I can find to make this work is to insert a new record for each line. The problem is that when someone wants to search multiple keywords, they expect these words to be matched across multiple records with a given incident number.

For a very simple example:

IncidentID   Date               Recordtext
----------   ----------------   -------------------------------
11111        2005-05-01 14:21   blah blah blah RED blah blah
2222         2005-05-01 14:23   not what we are looking for
11111        2005-05-02 02:05   blah CORVETTE blah blah

So, doing a search with an 'and' condition, e.g. WHERE recordtext LIKE '%RED%' AND recordtext LIKE '%CORVETTE%', will not match, because the condition will only be applied to a single row of recordtext at a time, not to the whole group with the same incident number.

If I were to use tsearch2 for full-text indexing, would I need to create another table that merges all of my recordtext rows into a single 'text' field type? If so, this is where I run into problems, as my logic also needs to match multiple words in their original order. I may also receive additional updates to the previous data. In that case, I need to replace the original record with the latest version of it. If I have already concatenated these rows into a single field, the logic to in-line replace only the old text that has changed is very, very difficult at best. So, that's the reason I had to do two subqueries in my example. Please tell me if I misunderstood your logic and it really will match given my condition above, but it didn't seem like it would.

Thanks again for the quick responses! This list has been a great resource for me.

-Dan
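Dan's example can be reproduced in miniature. The sketch below uses Python's sqlite3 in place of PostgreSQL (table and column names follow the thread; the SUM-over-LIKE aggregate is one illustrative way to AND keywords per incident rather than per row, not the poster's query):

```python
import sqlite3

# Toy reproduction of the example data from the thread.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ea (incidentid TEXT, entrydate TEXT, recordtext TEXT)")
con.executemany("INSERT INTO ea VALUES (?,?,?)", [
    ("11111", "2005-05-01 14:21", "blah blah blah RED blah blah"),
    ("2222",  "2005-05-01 14:23", "not what we are looking for"),
    ("11111", "2005-05-02 02:05", "blah CORVETTE blah blah"),
])

# Per-row AND: no single row contains both words, so nothing matches.
per_row = con.execute(
    "SELECT incidentid FROM ea "
    "WHERE recordtext LIKE '%RED%' AND recordtext LIKE '%CORVETTE%'").fetchall()
print(per_row)  # []

# Per-incident AND: aggregate over the group, then require both words somewhere.
per_incident = con.execute(
    "SELECT incidentid FROM ea GROUP BY incidentid "
    "HAVING SUM(recordtext LIKE '%RED%') > 0 "
    "   AND SUM(recordtext LIKE '%CORVETTE%') > 0").fetchall()
print(per_incident)  # [('11111',)]
```

The per-row query returns nothing while the grouped query finds incident 11111, which is exactly the mismatch Dan is describing.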
Dan,

> While I believe you, I'm confused by this line in my original EXPLAIN ANALYZE:
>> ->  Index Scan using ea1 on ea  (cost=0.00..2736.43 rows=42 width=47)
>>     (actual time=2.085..2.309 rows=2 loops=473)
>>       Index Cond: ((ea.incidentid)::text = ("outer".incidentid)::text)
>>       Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))

The index named is matching based on incidentid -- the join condition. The "Filter" is applied against the table rows, i.e. a scan.

> If I were to use tsearch2 for full-text indexing, would I need to
> create another table that merges all of my recordtext rows into a
> single 'text' field type?

No. Read the OpenFTS docs, they are fairly clear on how to set up a simple FTS index. (TSearch2 ~~ OpenFTS)

> If so, this is where I run into problems, as
> my logic also needs to match multiple words in their original order.

You do that by doubling up ... that is, use the FTS index to pick all rows that contain "RED" and "CORVETTE", and then check the order. I'll also note that your current query is not checking word order. Example:

WHERE recordtext_fti @@ to_tsquery('default', 'RED && CORVETTE')
  AND recordtext LIKE '%RED%CORVETTE%'

I'm doing something fairly similar on one of my projects and it works very well.

The limitations on TSearch2 indexes are:
1) They are expensive to update, so your data loads would be noticeably slower.
2) They are only fast when cached in RAM (and when cached, are *very* fast). So if you have a variety of other processes that tend to fill up RAM between searches, you may find them less useful.
3) You have to create a materialized index column next to recordtext, which will increase the size of the table.

--
Josh Berkus
Aglio Database Solutions
San Francisco
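The order check Josh appends with LIKE '%RED%CORVETTE%' is plain substring ordering. As a sketch (the helper name is hypothetical, not part of tsearch2 or the thread's schema):

```python
import re

def contains_in_order(text, words):
    """True if all words appear in the given order,
    like SQL's LIKE '%w1%w2%...%' pattern."""
    pattern = ".*".join(re.escape(w) for w in words)
    return re.search(pattern, text) is not None

print(contains_in_order("blah RED blah CORVETTE blah", ["RED", "CORVETTE"]))  # True
print(contains_in_order("CORVETTE then RED", ["RED", "CORVETTE"]))            # False
```

The FTS index narrows the candidate rows cheaply; the order check then runs only on those few survivors, which is why the combination is fast.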
Dan Harris <fbsd@drivefaster.net> writes:
>> ->  Index Scan using ea1 on ea  (cost=0.00..2736.43 rows=42 width=47) (actual time=2.085..2.309 rows=2 loops=473)
>>       Index Cond: ((ea.incidentid)::text = ("outer".incidentid)::text)
>>       Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))

> Doesn't that mean it was using an index to filter?

No. The "Index Cond" shows it is using the index only for the join condition. A "Filter" is an additional filter condition that happens to get applied at this plan node --- but it'll be applied to every row the index finds for the index condition.

			regards, tom lane
On May 8, 2005, at 8:06 PM, Josh Berkus wrote:
>> If I were to use tsearch2 for full-text indexing, would I need to
>> create another table that merges all of my recordtext rows into a
>> single 'text' field type?
>
> No. Read the OpenFTS docs, they are fairly clear on how to set up a
> simple FTS index. (TSearch2 ~~ OpenFTS)
>
>> If so, this is where I run into problems, as
>> my logic also needs to match multiple words in their original order.

I have been reading the Tsearch2 docs and either I don't understand something or I'm not communicating my situation clearly enough. It seems that Tsearch2 has a concept of "document", and in everything I am reading, they expect your "document" to be all contained in a single row. Since my words can be spread across multiple rows, I don't see that Tsearch2 will combine all 'recordtext' row values with the same "incidentid" into a single vector. Am I overlooking something in the docs?

> I'm doing something fairly similar on one of my projects and it works
> very well.

I'd be curious what similarities there are: is it the searching across multiple rows, or the order of words?

> The limitations on TSearch2 indexes are:
> 1) they are expensive to update, so your data loads would be noticeably
> slower.
> 2) they are only fast when cached in RAM (and when cached, are *very*
> fast). So if you have a variety of other processes that tend to fill up
> RAM between searches, you may find them less useful.
> 3) You have to create a materialized index column next to recordtext,
> which will increase the size of the table.

Duly noted. If this method can search across rows, I'm willing to accept this overhead for the speed it would add.

In the meantime, is there any way I can reach my goal without Tsearch2 by just restructuring my query to narrow down the results by date first, then seq scan for the 'likes'?

-Dan
On Mon, 9 May 2005 11:49 am, Dan Harris wrote:
> On May 8, 2005, at 6:51 PM, Russell Smith wrote:
[snip]
> > select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
> > FROM em JOIN ea ON (em.incidentid = ea.incidentid
> >     AND em.entrydate >= '2005-1-1 00:00'
> >     AND em.entrydate <= '2005-5-9 00:00'
> >     AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%')
> > JOIN eg ON em.incidentid = eg.incidentid
> > WHERE (recordtext like '%RED%' or recordtext like '%CORVETTE%');
>
> I have run this, and while it is very fast, I'm concerned it's not
> doing what I need.

How fast is very fast?

> Here's the situation:
>
> Due to the format of the systems with which I integrate ( I have no
> control over these formats ), we will get these 'recordtext' values one
> line at a time, accumulating over time. The only way I can find to
> make this work is to insert a new record for each line. The problem
> is, that when someone wants to search multiple keywords, they expect
> these words to be matched across multiple records with a given incident
> number.
>
> For a very simple example:
>
> IncidentID   Date               Recordtext
> ----------   ----------------   -------------------------------
> 11111        2005-05-01 14:21   blah blah blah RED blah blah
> 2222         2005-05-01 14:23   not what we are looking for
> 11111        2005-05-02 02:05   blah CORVETTE blah blah
>
> So, doing a search with an 'and' condition, e.g. WHERE RECORDTEXT LIKE
> '%RED%' AND RECORDTEXT LIKE '%CORVETTE%', will not match because the
> condition will only be applied to a single row of recordtext at a time,
> not a whole group with the same incident number.
>
> If I were to use tsearch2 for full-text indexing, would I need to
> create another table that merges all of my recordtext rows into a
> single 'text' field type? If so, this is where I run into problems, as
> my logic also needs to match multiple words in their original order. I
> may also receive additional updates to the previous data. In that
> case, I need to replace the original record with the latest version of
> it. If I have already concatenated these rows into a single field, the
> logic to in-line replace only the old text that has changed is very
> very difficult at best. So, that's the reason I had to do two
> subqueries in my example. Please tell me if I misunderstood your logic
> and it really will match given my condition above, but it didn't seem
> like it would.
>
> Thanks again for the quick responses! This list has been a great
> resource for me.

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em
JOIN ea ON (em.incidentid = ea.incidentid
    AND em.entrydate >= '2005-1-1 00:00'
    AND em.entrydate <= '2005-5-9 00:00'
    AND (ea.recordtext like '%RED%' OR ea.recordtext like '%CORVETTE%'))
JOIN eg ON em.incidentid = eg.incidentid
WHERE em.incidentid IN
    (select em.incidentid
     FROM em JOIN ea ON (em.incidentid = ea.incidentid
         AND em.entrydate >= '2005-1-1 00:00'
         AND em.entrydate <= '2005-5-9 00:00'
         AND ea.recordtext like '%CORVETTE%'))
AND em.incidentid IN
    (select em.incidentid
     FROM em JOIN ea ON (em.incidentid = ea.incidentid
         AND em.entrydate >= '2005-1-1 00:00'
         AND em.entrydate <= '2005-5-9 00:00'
         AND ea.recordtext like '%RED%'));

This may be more accurate. However, I would call it VERY NASTY. Josh's solutions may be better. Much of the data should be in memory once the subplans are done, though, so it may be quite fast.

> -Dan
On May 8, 2005, at 8:32 PM, Russell Smith wrote:
>> I have run this, and while it is very fast, I'm concerned it's not
>> doing what I need.
> How fast is very fast?

It took 35 seconds to complete versus ~450 my old way.

> select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
> FROM em
> JOIN ea ON (em.incidentid = ea.incidentid
>     AND em.entrydate >= '2005-1-1 00:00'
>     AND em.entrydate <= '2005-5-9 00:00'
>     AND (ea.recordtext like '%RED%' OR ea.recordtext like '%CORVETTE%'))
> JOIN eg ON em.incidentid = eg.incidentid
> WHERE em.incidentid IN
>     (select em.incidentid
>      FROM em JOIN ea ON (em.incidentid = ea.incidentid
>          AND em.entrydate >= '2005-1-1 00:00'
>          AND em.entrydate <= '2005-5-9 00:00'
>          AND ea.recordtext like '%CORVETTE%'))
> AND em.incidentid IN
>     (select em.incidentid
>      FROM em JOIN ea ON (em.incidentid = ea.incidentid
>          AND em.entrydate >= '2005-1-1 00:00'
>          AND em.entrydate <= '2005-5-9 00:00'
>          AND ea.recordtext like '%RED%'));

Yes, it is nasty, but so was my previous query :) So long as this is faster, I'm ok with that. I'll see if I can make this work. Thank you very much.

-Dan
On Sun, 8 May 2005 20:31:38 -0600, Dan Harris <fbsd@drivefaster.net> wrote:
> Duly noted. If this method can search across rows, I'm willing to
> accept this overhead for the speed it would add.

You could use INTERSECT to search across rows. Using tsearch2 will look up the RED and CORVETTE using the index, and INTERSECT will pull out the common rows.

> In the meantime, is there any way I can reach my goal without Tsearch2
> by just restructuring my query to narrow down the results by date
> first, then seq scan for the 'likes'?

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
from ea, em, eg,
  ( select ea.incidentid
    from ea, em
    where em.incidentid = ea.incidentid
      and em.entrydate >= '2005-1-1 00:00'
      and em.entrydate <= '2005-5-9 00:00'
      and recordtext like '%RED%'
    intersect
    select ea.incidentid
    from ea, em
    where em.incidentid = ea.incidentid
      and em.entrydate >= '2005-1-1 00:00'
      and em.entrydate <= '2005-5-9 00:00'
      and recordtext like '%CORVETTE%'
  ) as iid
where em.incidentid = ea.incidentid
  and em.incidentid = eg.incidentid
  and em.entrydate >= '2005-1-1 00:00'
  and em.entrydate <= '2005-5-9 00:00'
  and ea.incidentid = iid.incidentid
  and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
order by em.entrydate

klint.

+---------------------------------------+-----------------+
: Klint Gore                            : "Non rhyming    :
: EMail   : kg@kgb.une.edu.au           : slang - the     :
: Snail   : A.B.R.I.                    : possibilities   :
: Mail      University of New England   : are useless"    :
:           Armidale NSW 2351 Australia :     L.J.J.      :
: Fax     : +61 2 6772 5376             :                 :
+---------------------------------------+-----------------+
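Klint's INTERSECT idea in miniature (sketched with Python's sqlite3 standing in for PostgreSQL; the table contents are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ea (incidentid TEXT, recordtext TEXT)")
con.executemany("INSERT INTO ea VALUES (?,?)", [
    ("11111", "blah RED blah"),
    ("11111", "blah CORVETTE blah"),
    ("2222",  "only RED here"),
    ("3333",  "only CORVETTE here"),
])

# Each branch finds incidents mentioning one keyword;
# INTERSECT keeps only the ids common to both result sets.
rows = con.execute("""
    SELECT incidentid FROM ea WHERE recordtext LIKE '%RED%'
    INTERSECT
    SELECT incidentid FROM ea WHERE recordtext LIKE '%CORVETTE%'
""").fetchall()
print(rows)  # [('11111',)]
```

Each branch can match a keyword in a different row, so the AND-across-rows requirement is satisfied without merging the rows first.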
In article <7fc67646a961f5ebef90def7aeb95fd0@drivefaster.net>, Dan Harris <fbsd@drivefaster.net> writes:

> On May 8, 2005, at 8:06 PM, Josh Berkus wrote:
>>> If I were to use tsearch2 for full-text indexing, would I need to
>>> create another table that merges all of my recordtext rows into a
>>> single 'text' field type?
>>
>> No. Read the OpenFTS docs, they are fairly clear on how to set up
>> a simple FTS index. (TSearch2 ~~ OpenFTS)
>>
>>> If so, this is where I run into problems, as
>>> my logic also needs to match multiple words in their original order.

> I have been reading the Tsearch2 docs and either I don't understand
> something or I'm not communicating my situation clearly enough. It
> seems that Tsearch2 has a concept of "document". And, in everything I
> am reading, they expect your "document" to be all contained in a
> single row. Since my words can be spread across multiple rows, I
> don't see that Tsearch2 will combine all 'recordtext' row values with
> the same "incidentid" into a single vector. Am I overlooking
> something in the docs?

AFAICS no, but you could create a separate table containing just the distinct incidentids and the tsearch2 vectors of all recordtexts matching that incidentid. This table would get updated solely by triggers on the original table and would provide a fast way to get all incidentids for RED and CORVETTE. The question is: would this reduce the number of rows to check more than filtering on date?
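A sketch of the trigger-maintained side table, using sqlite3 so the toy is runnable end to end. A real PostgreSQL version would use a PL/pgSQL trigger and, per the suggestion, a tsearch2 vector rather than plain concatenated text; all names here are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ea (incidentid TEXT, recordtext TEXT);
-- Side table: one row per incident, all recordtext lines merged.
CREATE TABLE ea_merged (incidentid TEXT PRIMARY KEY, alltext TEXT);

-- Keep the merged row current as lines arrive, one insert at a time.
CREATE TRIGGER ea_ins AFTER INSERT ON ea BEGIN
    INSERT OR IGNORE INTO ea_merged VALUES (NEW.incidentid, '');
    UPDATE ea_merged
       SET alltext = TRIM(alltext || ' ' || NEW.recordtext)
     WHERE incidentid = NEW.incidentid;
END;
""")
con.executemany("INSERT INTO ea VALUES (?,?)", [
    ("11111", "blah RED blah"),
    ("11111", "blah CORVETTE blah"),
    ("2222",  "nothing relevant"),
])

# Now a single-row AND works against the merged text.
rows = con.execute(
    "SELECT incidentid FROM ea_merged "
    "WHERE alltext LIKE '%RED%' AND alltext LIKE '%CORVETTE%'").fetchall()
print(rows)  # [('11111',)]
```

The original table keeps its one-line-per-update structure (so in-place corrections stay simple), while the side table exists only to answer the multi-keyword question quickly.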
Quoting Russell Smith <mr-russ@pws.com.au>:

> On Mon, 9 May 2005 11:49 am, Dan Harris wrote:
> > On May 8, 2005, at 6:51 PM, Russell Smith wrote:
> [snip]
> > select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
> > FROM em
> > JOIN ea ON em.incidentid = ea.incidentid  --- slight paraphrase /Mischa.
> > AND em.entrydate between '2005-1-1' and '2005-5-9'
> > AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%'
>
> > Here's the situation:
> > Due to the format of the systems with which I integrate ( I have no
> > control over these formats ), we will get these 'recordtext' values one
> > line at a time, accumulating over time. The only way I can find to
> > make this work is to insert a new record for each line. The problem
> > is, that when someone wants to search multiple keywords, they expect
> > these words to be matched across multiple records with a given incident
> > number.
> >
> > For a very simple example:
> >
> > IncidentID   Date               Recordtext
> > ----------   ----------------   ------------------------------
> > 11111        2005-05-01 14:21   blah blah blah RED blah blah
> > 2222         2005-05-01 14:23   not what we are looking for
> > 11111        2005-05-02 02:05   blah CORVETTE blah blah

select em.incidentid, ea.recordtext as retdata
from em
join ea using (incidentid)
join ( -- equivalent to "where incidentid in (...)", sometimes faster.
    select incidentid
    from em join ea using (incidentid)
    where em.entrydate between '2005-1-1' and '2005-5-9'
    group by incidentid
    having 1 = min(case when recordtext like '%RED%' then 1 end)
       and 1 = min(case when recordtext like '%CORVETTE%' then 1 end)
) as X using (incidentid);
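Mischa's MIN(CASE ...) trick in a runnable toy (sqlite3 standing in for PostgreSQL, invented sample data): the CASE yields 1 for a matching row and NULL otherwise, and since MIN ignores NULLs, MIN(...) = 1 holds exactly when at least one row in the group matched that keyword.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ea (incidentid TEXT, recordtext TEXT)")
con.executemany("INSERT INTO ea VALUES (?,?)", [
    ("11111", "blah RED blah"),
    ("11111", "blah CORVETTE blah"),
    ("2222",  "RED only"),
])

# One HAVING term per keyword; an incident passes only if every keyword
# matched in at least one of its rows.
rows = con.execute("""
    SELECT incidentid FROM ea
    GROUP BY incidentid
    HAVING 1 = MIN(CASE WHEN recordtext LIKE '%RED%' THEN 1 END)
       AND 1 = MIN(CASE WHEN recordtext LIKE '%CORVETTE%' THEN 1 END)
""").fetchall()
print(rows)  # [('11111',)]
```

Incident 2222 matches RED but never CORVETTE, so its CORVETTE aggregate is NULL and the HAVING clause rejects it; adding a keyword is just one more HAVING term.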
Hi Dan,

I tried to understand your query, but I couldn't get my understanding of the query and your description in sync.

Why do you use subselects? Wouldn't a simple "recordtext like '%RED%'" do the trick too? You combine all your where conditions with AND; to me this looks like you get only rows with both RED and CORVETTE. From your description I would rewrite the query as

explain analyze
select distinct
   em.incidentid,
   ea.recordtext as retdata,
   eg.long,
   eg.lat
from ea
join em using (incidentid)
join eg using (incidentid)
where
   em.entrydate >= '2005-1-1 00:00'::date
   and em.entrydate <= '2005-5-9 00:00'::date
   and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
order by em.entrydate

That should give you all rows containing one of the words. Does it work? Is it faster? Is it fast enough?

Ulrich