Обсуждение: best practise/pattern for large OR / LIKE searches
Hey All,
I am wondering if there is a common pattern for these sort of queries :
SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR LIKE '%8766%' OR LIKE '%009%', ..
The number of OR/LIKES are in the order of 50-100 items...
the table tbl is a couple of million rows.
The datanumber is a string that are maximum 10 characters long, no spaces and can contain numbers and letters.
Apart from creating a couple of index table to make the LIKE left anchored something like this :
tbl <----> tbl_4letters
tbl <----> tbl_5letters
tbl <----> tbl_3letters
or creating a functional index 'of some sort' are there any other brilliant ideas out there to solve such a problem (GIN/GIS???) ?
Searches are currently taking to long and we would like to optimize them, but before we dive into our own solution we
where wondering if there already common solutions for this...
Kind Regards,
Ries van Twisk
Hello one year ago there was proposal for index support for LIKE %some%. The problem was extreme size of index size. I thing so you can write own C function, that can check string faster than repeated LIKE some like SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) regards Pavel Stehule 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: > Hey All, > I am wondering if there is a common pattern for these sort of queries : > SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR LIKE > '%8766%' OR LIKE '%009%', .. > The number of OR/LIKES are in the order of 50-100 items... > the table tbl is a couple of million rows. > The datanumber is a string that are maximum 10 characters long, no spaces > and can contain numbers and letters. > Apart from creating a couple of index table to make the LIKE left anchored > something like this : > tbl <----> tbl_4letters > tbl <----> tbl_5letters > tbl <----> tbl_3letters > or creating a functional index 'of some sort' are there any other brilliant > ideas out there to solve such a problem (GIN/GIS???) ? > Searches are currently taking to long and we would like to optimize them, > but before we dive into our own solution we > where wondering if there already common solutions for this... > Kind Regards, > Ries van Twisk > > > > >
Hi Pavel, can you provide some link or other directions to the proposal? I guess it was posted to this list or somewhere else? Tomas > Hello > > one year ago there was proposal for index support for LIKE %some%. The > problem was extreme size of index size. > > I thing so you can write own C function, that can check string faster > than repeated LIKE > > some like > > SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) > > regards > Pavel Stehule > > 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: >> Hey All, >> I am wondering if there is a common pattern for these sort of queries : >> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR >> LIKE >> '%8766%' OR LIKE '%009%', .. >> The number of OR/LIKES are in the order of 50-100 items... >> the table tbl is a couple of million rows. >> The datanumber is a string that are maximum 10 characters long, no >> spaces >> and can contain numbers and letters. >> Apart from creating a couple of index table to make the LIKE left >> anchored >> something like this : >> tbl <----> tbl_4letters >> tbl <----> tbl_5letters >> tbl <----> tbl_3letters >> or creating a functional index 'of some sort' are there any other >> brilliant >> ideas out there to solve such a problem (GIN/GIS???) ? >> Searches are currently taking to long and we would like to optimize >> them, >> but before we dive into our own solution we >> where wondering if there already common solutions for this... >> Kind Regards, >> Ries van Twisk >> >> >> >> >> > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
2009/8/26 <tv@fuzzy.cz>: > Hi Pavel, > > can you provide some link or other directions to the proposal? I guess it > was posted to this list or somewhere else? Please, ask to Oleg Bartunov http://www.sai.msu.su/~megera/wiki/wildspeed regards Pavel Stehule > > Tomas > >> Hello >> >> one year ago there was proposal for index support for LIKE %some%. The >> problem was extreme size of index size. >> >> I thing so you can write own C function, that can check string faster >> than repeated LIKE >> >> some like >> >> SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) >> >> regards >> Pavel Stehule >> >> 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: >>> Hey All, >>> I am wondering if there is a common pattern for these sort of queries : >>> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR >>> LIKE >>> '%8766%' OR LIKE '%009%', .. >>> The number of OR/LIKES are in the order of 50-100 items... >>> the table tbl is a couple of million rows. >>> The datanumber is a string that are maximum 10 characters long, no >>> spaces >>> and can contain numbers and letters. >>> Apart from creating a couple of index table to make the LIKE left >>> anchored >>> something like this : >>> tbl <----> tbl_4letters >>> tbl <----> tbl_5letters >>> tbl <----> tbl_3letters >>> or creating a functional index 'of some sort' are there any other >>> brilliant >>> ideas out there to solve such a problem (GIN/GIS???) ? >>> Searches are currently taking to long and we would like to optimize >>> them, >>> but before we dive into our own solution we >>> where wondering if there already common solutions for this... >>> Kind Regards, >>> Ries van Twisk >>> >>> >>> >>> >>> >> >> -- >> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-general >> > > >
The wildspeed function seems to be what I was looking for. an dI remember that I have seen it before on the list... just I couldn't remember names or anything... Ries On Aug 26, 2009, at 7:28 AM, Pavel Stehule wrote: > 2009/8/26 <tv@fuzzy.cz>: >> Hi Pavel, >> >> can you provide some link or other directions to the proposal? I >> guess it >> was posted to this list or somewhere else? > > Please, ask to Oleg Bartunov > > http://www.sai.msu.su/~megera/wiki/wildspeed > > regards > Pavel Stehule > >> >> Tomas >> >>> Hello >>> >>> one year ago there was proposal for index support for LIKE %some%. >>> The >>> problem was extreme size of index size. >>> >>> I thing so you can write own C function, that can check string >>> faster >>> than repeated LIKE >>> >>> some like >>> >>> SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....) >>> >>> regards >>> Pavel Stehule >>> >>> 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>: >>>> Hey All, >>>> I am wondering if there is a common pattern for these sort of >>>> queries : >>>> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE >>>> '%54321%' OR >>>> LIKE >>>> '%8766%' OR LIKE '%009%', .. >>>> The number of OR/LIKES are in the order of 50-100 items... >>>> the table tbl is a couple of million rows. >>>> The datanumber is a string that are maximum 10 characters long, no >>>> spaces >>>> and can contain numbers and letters. >>>> Apart from creating a couple of index table to make the LIKE left >>>> anchored >>>> something like this : >>>> tbl <----> tbl_4letters >>>> tbl <----> tbl_5letters >>>> tbl <----> tbl_3letters >>>> or creating a functional index 'of some sort' are there any other >>>> brilliant >>>> ideas out there to solve such a problem (GIN/GIS???) ? >>>> Searches are currently taking to long and we would like to optimize >>>> them, >>>> but before we dive into our own solution we >>>> where wondering if there already common solutions for this... >>>> Kind Regards, >>>> Ries van Twisk >>>> >>>> >>>> >>>> >>>> >>> >>> -- >>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >>> To make changes to your subscription: >>> http://www.postgresql.org/mailpref/pgsql-general >>> >> >> >> regards, Ries van Twisk ------------------------------------------------------------------------------------------------- tags: Freelance TYPO3 Glassfish JasperReports JasperETL Flex Blaze-DS WebORB PostgreSQL DB-Architect email: ries@vantwisk.nl web: http://www.rvantwisk.nl/ skype: callto://r.vantwisk Phone: +1-810-476-4196 Cell: +593 9901 7694 SIP: +1-747-690-5133
> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%'
> OR LIKE '%8766%' OR LIKE '%009%', ..
>
> The number of OR/LIKES are in the order of 50-100 items...
> the table tbl is a couple of million rows.
Are the fixed strings in the wildcards "words" (i.e., are they
completely arbitrarily embedded in the text, or are they delimited in
some regular way)? If they are "words," you might consider using the
full text functionality to create an index of them, and searching
using that.
--
-- Christophe Pettus
xof@thebuild.com
On 2009-08-26, Ries van Twisk <pg@rvt.dds.nl> wrote:
>
> --Apple-Mail-1173-222712773
> Content-Type: text/plain;
> charset=US-ASCII;
> format=flowed;
> delsp=yes
> Content-Transfer-Encoding: 7bit
>
> Hey All,
>
> I am wondering if there is a common pattern for these sort of queries :
>
> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR
> LIKE '%8766%' OR LIKE '%009%', ..
SELECT * FROM tbl WHERE datanumber LIKE ANY ARRAY('%12345%','%54321%','%8766%'...)
> The number of OR/LIKES are in the order of 50-100 items...
> the table tbl is a couple of million rows.
regex might perfrom better than LIKE ANY
SELECT * FROM tbl WHERE '12345|54321|8766|009' ~ datanumber;
regex is compiled to a finite state machine and then the datanumber
column is scanned in a single pass (for each row)
> Searches are currently taking to long and we would like to optimize
> them, but before we dive into our own solution we
> where wondering if there already common solutions for this...
try regex first if that's too slow you may need to write a
dictionary function that splits datanuimber into it's components
and use full text index/search. (this will slow down updates as they will do
upto 20 inserts into the index)
searches should then be optimally fast
Hello
> regex is compiled to a finite state machine and then the datanumber
> column is scanned in a single pass (for each row)
>
>> Searches are currently taking to long and we would like to optimize
>> them, but before we dive into our own solution we
>> where wondering if there already common solutions for this...
>
> try regex first if that's too slow you may need to write a
> dictionary function that splits datanuimber into it's components
> and use full text index/search. (this will slow down updates as they will do
> upto 20 inserts into the index)
>
> searches should then be optimally fast
>
I did some tests:
1) I fill test table
insert into test SELECT
array_to_string(array_agg(array_to_string(ARRAY(select
substring('01234567890' from (random()*10)::int + 1 for 1) from
generate_series(1,(random()*10+5)::int + i - i)),'')),',') as b from
generate_series(1,100000) g(i) group by (random()*1000)::int;
2. I tested searching of 5 or 13 values. I did tests on 8.4 and 8.1
8.1
using like 190ms(440ms*)
using regexp 115ms(259ms*)
* for 13 values - so there regexp is faster than like
on 8.4
using like 80ms(151ms)
using regexp 131ms(267ms)
so like is faster then regexp on 8.4.
fulltext test (8.4)
420ms(470ms) -- without index
14ms(26ms) -- with GiST index
1ms(2ms) -- with Gin index
some samples of test queries:
select * from test where to_tsvector('simple',a) @@
to_tsquery('simple','296426496|7707431116555|98173598191|302598|53174827|02292064629|188631468777|4756243248|920473|16602317|76613513|78640|9176323');
select * from test where a ~
'296426496|7707431116555|98173598191|302598|53174827|02292064629|188631468777|4756243248|920473|16602317|76613513|78640|9176323';
select * from test where a like '%296426496%' or a like '%
7707431116555%' or a like '%98173598191%' or a like '%302598%' or a
like '%53174827%' or a like '%02292064629%' or a like '%188631468777%'
or a like '%4756243248%' or a like '%920473%' or a like '%16602317%'
or a like '%76613513%' or a like '%78640%' or a like '%9176323%';
regards
Pavel Stehule