Обсуждение: best practise/pattern for large OR / LIKE searches

Поиск
Список
Период
Сортировка

best practise/pattern for large OR / LIKE searches

От
Ries van Twisk
Дата:
Hey All,

I am wondering if there is a common pattern for these sort of queries :

SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR LIKE '%8766%' OR LIKE '%009%', ..

The number of OR/LIKES are in the order of 50-100 items...
the table tbl is a couple of million rows.

The datanumber is a string that are maximum 10 characters long, no spaces and can contain numbers and letters.

Apart from creating a couple of index table to make the LIKE left anchored something like this :

tbl <----> tbl_4letters
tbl <----> tbl_5letters
tbl <----> tbl_3letters

or creating a functional index 'of some sort' are there any other brilliant ideas out there to solve such a problem (GIN/GIS???) ?

Searches are currently taking to long and we would like to optimize them, but before we dive into our own solution we
where wondering if there already common solutions for this...

Kind Regards,
Ries van Twisk





Re: best practise/pattern for large OR / LIKE searches

От
Pavel Stehule
Дата:
Hello

one year ago there was proposal for index support for LIKE %some%. The
problem was extreme size of index size.

I thing so you can write own C function, that can check string faster
than repeated LIKE

some like

SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....)

regards
Pavel Stehule

2009/8/26 Ries van Twisk <pg@rvt.dds.nl>:
> Hey All,
> I am wondering if there is a common pattern for these sort of queries :
> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR LIKE
> '%8766%' OR LIKE '%009%', ..
> The number of OR/LIKES are in the order of 50-100 items...
> the table tbl is a couple of million rows.
> The datanumber is a string that are maximum 10 characters long, no spaces
> and can contain numbers and letters.
> Apart from creating a couple of index table to make the LIKE left anchored
> something like this :
> tbl <----> tbl_4letters
> tbl <----> tbl_5letters
> tbl <----> tbl_3letters
> or creating a functional index 'of some sort' are there any other brilliant
> ideas out there to solve such a problem (GIN/GIS???) ?
> Searches are currently taking to long and we would like to optimize them,
> but before we dive into our own solution we
> where wondering if there already common solutions for this...
> Kind Regards,
> Ries van Twisk
>
>
>
>
>

Re: best practise/pattern for large OR / LIKE searches

От
tv@fuzzy.cz
Дата:
Hi Pavel,

can you provide some link or other directions to the proposal? I guess it
was posted to this list or somewhere else?

Tomas

> Hello
>
> one year ago there was proposal for index support for LIKE %some%. The
> problem was extreme size of index size.
>
> I thing so you can write own C function, that can check string faster
> than repeated LIKE
>
> some like
>
> SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....)
>
> regards
> Pavel Stehule
>
> 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>:
>> Hey All,
>> I am wondering if there is a common pattern for these sort of queries :
>> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR
>> LIKE
>> '%8766%' OR LIKE '%009%', ..
>> The number of OR/LIKES are in the order of 50-100 items...
>> the table tbl is a couple of million rows.
>> The datanumber is a string that are maximum 10 characters long, no
>> spaces
>> and can contain numbers and letters.
>> Apart from creating a couple of index table to make the LIKE left
>> anchored
>> something like this :
>> tbl <----> tbl_4letters
>> tbl <----> tbl_5letters
>> tbl <----> tbl_3letters
>> or creating a functional index 'of some sort' are there any other
>> brilliant
>> ideas out there to solve such a problem (GIN/GIS???) ?
>> Searches are currently taking to long and we would like to optimize
>> them,
>> but before we dive into our own solution we
>> where wondering if there already common solutions for this...
>> Kind Regards,
>> Ries van Twisk
>>
>>
>>
>>
>>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>



Re: best practise/pattern for large OR / LIKE searches

От
Pavel Stehule
Дата:
2009/8/26  <tv@fuzzy.cz>:
> Hi Pavel,
>
> can you provide some link or other directions to the proposal? I guess it
> was posted to this list or somewhere else?

Please, ask to Oleg Bartunov

http://www.sai.msu.su/~megera/wiki/wildspeed

regards
Pavel Stehule

>
> Tomas
>
>> Hello
>>
>> one year ago there was proposal for index support for LIKE %some%. The
>> problem was extreme size of index size.
>>
>> I thing so you can write own C function, that can check string faster
>> than repeated LIKE
>>
>> some like
>>
>> SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....)
>>
>> regards
>> Pavel Stehule
>>
>> 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>:
>>> Hey All,
>>> I am wondering if there is a common pattern for these sort of queries :
>>> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR
>>> LIKE
>>> '%8766%' OR LIKE '%009%', ..
>>> The number of OR/LIKES are in the order of 50-100 items...
>>> the table tbl is a couple of million rows.
>>> The datanumber is a string that are maximum 10 characters long, no
>>> spaces
>>> and can contain numbers and letters.
>>> Apart from creating a couple of index table to make the LIKE left
>>> anchored
>>> something like this :
>>> tbl <----> tbl_4letters
>>> tbl <----> tbl_5letters
>>> tbl <----> tbl_3letters
>>> or creating a functional index 'of some sort' are there any other
>>> brilliant
>>> ideas out there to solve such a problem (GIN/GIS???) ?
>>> Searches are currently taking to long and we would like to optimize
>>> them,
>>> but before we dive into our own solution we
>>> where wondering if there already common solutions for this...
>>> Kind Regards,
>>> Ries van Twisk
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>
>

Re: best practise/pattern for large OR / LIKE searches

От
Ries van Twisk
Дата:
The wildspeed function seems to be what I was looking for.

an dI remember that I have seen it before on the list... just I
couldn't remember names or anything...

Ries


On Aug 26, 2009, at 7:28 AM, Pavel Stehule wrote:

> 2009/8/26  <tv@fuzzy.cz>:
>> Hi Pavel,
>>
>> can you provide some link or other directions to the proposal? I
>> guess it
>> was posted to this list or somewhere else?
>
> Please, ask to Oleg Bartunov
>
> http://www.sai.msu.su/~megera/wiki/wildspeed
>
> regards
> Pavel Stehule
>
>>
>> Tomas
>>
>>> Hello
>>>
>>> one year ago there was proposal for index support for LIKE %some%.
>>> The
>>> problem was extreme size of index size.
>>>
>>> I thing so you can write own C function, that can check string
>>> faster
>>> than repeated LIKE
>>>
>>> some like
>>>
>>> SELECT * FROM tbl WHERE contains(datanumber, '12345','54321',....)
>>>
>>> regards
>>> Pavel Stehule
>>>
>>> 2009/8/26 Ries van Twisk <pg@rvt.dds.nl>:
>>>> Hey All,
>>>> I am wondering if there is a common pattern for these sort of
>>>> queries :
>>>> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE
>>>> '%54321%' OR
>>>> LIKE
>>>> '%8766%' OR LIKE '%009%', ..
>>>> The number of OR/LIKES are in the order of 50-100 items...
>>>> the table tbl is a couple of million rows.
>>>> The datanumber is a string that are maximum 10 characters long, no
>>>> spaces
>>>> and can contain numbers and letters.
>>>> Apart from creating a couple of index table to make the LIKE left
>>>> anchored
>>>> something like this :
>>>> tbl <----> tbl_4letters
>>>> tbl <----> tbl_5letters
>>>> tbl <----> tbl_3letters
>>>> or creating a functional index 'of some sort' are there any other
>>>> brilliant
>>>> ideas out there to solve such a problem (GIN/GIS???) ?
>>>> Searches are currently taking to long and we would like to optimize
>>>> them,
>>>> but before we dive into our own solution we
>>>> where wondering if there already common solutions for this...
>>>> Kind Regards,
>>>> Ries van Twisk
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-general
>>>
>>
>>
>>


            regards, Ries van Twisk

-------------------------------------------------------------------------------------------------
tags: Freelance TYPO3 Glassfish JasperReports JasperETL Flex Blaze-DS
WebORB PostgreSQL DB-Architect
email: ries@vantwisk.nl        web:   http://www.rvantwisk.nl/
skype: callto://r.vantwisk
Phone: +1-810-476-4196    Cell: +593 9901 7694                   SIP:
+1-747-690-5133








Re: best practise/pattern for large OR / LIKE searches

От
Christophe Pettus
Дата:
> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%'
> OR LIKE '%8766%' OR LIKE '%009%', ..
>
> The number of OR/LIKES are in the order of 50-100 items...
> the table tbl is a couple of million rows.

Are the fixed strings in the wildcards "words" (i.e., are they
completely arbitrarily embedded in the text, or are they delimited in
some regular way)? If they are "words," you might consider using the
full text functionality to create an index of them, and searching
using that.

--
-- Christophe Pettus
    xof@thebuild.com


Re: best practise/pattern for large OR / LIKE searches

От
Jasen Betts
Дата:
On 2009-08-26, Ries van Twisk <pg@rvt.dds.nl> wrote:
>
> --Apple-Mail-1173-222712773
> Content-Type: text/plain;
>     charset=US-ASCII;
>     format=flowed;
>     delsp=yes
> Content-Transfer-Encoding: 7bit
>
> Hey All,
>
> I am wondering if there is a common pattern for these sort of queries :
>
> SELECT * FROM tbl WHERE datanumber LIKE '%12345%' OR LIKE '%54321%' OR
> LIKE '%8766%' OR LIKE '%009%', ..

SELECT * FROM tbl WHERE datanumber LIKE ANY ARRAY('%12345%','%54321%','%8766%'...)

> The number of OR/LIKES are in the order of 50-100 items...
> the table tbl is a couple of million rows.

regex might perfrom better than LIKE ANY

SELECT * FROM tbl WHERE '12345|54321|8766|009' ~ datanumber;

regex is compiled to a finite state machine and then the datanumber
column is scanned in a single pass (for each row)

> Searches are currently taking to long and we would like to optimize
> them, but before we dive into our own solution we
> where wondering if there already common solutions for this...

try regex first if that's too slow you may need to write a
dictionary function that splits datanuimber into it's components
and use full text index/search. (this will slow down updates as they will do
upto 20 inserts into the index)

searches should then be optimally fast

Re: best practise/pattern for large OR / LIKE searches

От
Pavel Stehule
Дата:
Hello

> regex is compiled to a finite state machine and then the datanumber
> column is scanned in a single pass (for each row)
>
>> Searches are currently taking to long and we would like to optimize
>> them, but before we dive into our own solution we
>> where wondering if there already common solutions for this...
>
> try regex first if that's too slow you may need to write a
> dictionary function that splits datanuimber into it's components
> and use full text index/search. (this will slow down updates as they will do
> upto 20 inserts into the index)
>
> searches should then be optimally fast
>

I did some tests:

1) I fill test table
insert into test SELECT
array_to_string(array_agg(array_to_string(ARRAY(select
substring('01234567890' from (random()*10)::int + 1 for 1) from
generate_series(1,(random()*10+5)::int + i - i)),'')),',') as b from
generate_series(1,100000) g(i) group by (random()*1000)::int;

2. I tested searching of 5 or 13 values. I did tests on 8.4 and 8.1

8.1
using like 190ms(440ms*)
using regexp 115ms(259ms*)

* for 13 values - so there regexp is faster than like

on 8.4
using like 80ms(151ms)
using regexp 131ms(267ms)

so like is faster then regexp on 8.4.

fulltext test (8.4)
420ms(470ms) -- without index
14ms(26ms) -- with GiST index
1ms(2ms) -- with Gin index

some samples of test queries:
select * from test where to_tsvector('simple',a) @@

to_tsquery('simple','296426496|7707431116555|98173598191|302598|53174827|02292064629|188631468777|4756243248|920473|16602317|76613513|78640|9176323');

select * from test where a ~

'296426496|7707431116555|98173598191|302598|53174827|02292064629|188631468777|4756243248|920473|16602317|76613513|78640|9176323';

select * from test where a like '%296426496%' or a like '%
7707431116555%' or a like '%98173598191%' or a like '%302598%' or a
like '%53174827%' or a like '%02292064629%' or a like '%188631468777%'
or a like '%4756243248%' or a like '%920473%' or a like '%16602317%'
or a like '%76613513%' or a like '%78640%' or a like '%9176323%';

regards
Pavel Stehule