Re: Search for fixed set of keywords
От | Oleg Bartunov |
---|---|
Тема | Re: Search for fixed set of keywords |
Дата | |
Msg-id | Pine.LNX.4.64.0801091351440.26876@sn.sai.msu.ru обсуждение исходный текст |
Ответ на | Search for fixed set of keywords (Jörg Kiegeland <kiegeland@ikv.de>) |
Ответы |
Re: Search for fixed set of keywords
|
Список | pgsql-performance |
Did you try integer arrays with GIN (inverted index) ? Oleg On Wed, 9 Jan 2008, J?rg Kiegeland wrote: > Hello, > > I have an interesting generic search task, for which I have done different > performance tests and I would like to share and discuss my results on this > newsgroup. > > So I begin to describe the search task: > > ========= > You have a set of N unique IDs. Every ID is associated with an integer > scoring value. Every ID is also associated with up to K different keywords > (there are totally K different keywords K1 ... Kn). Now find the first Z > best-scored IDs which are associated with a given set of keywords in one of > two ways: > > (C1) The ID must be associated with all keywords of the given set of > keywords. > (C2) The ID must be associated with at least one keyword of the given set of > keywords. > ========= > > > My tests showed that only a Multiple-Column-approach resulted in a acceptable > query response time. I also tried out an int-array approach using gist, a > sub-string approach, a bit-column approach, and even a sub-string approach > using Solr. > Actually, the int-array approach was 20% faster for Z=infinity, but it became > linear for the test case [Z=1000 and *all* IDs matches the search condition]. > (To be not misunderstood, "acceptable time" means: having a fixed Z, a fixed > set of keywords K, a fixed query, and an increasing N, results in constant up > to logarithmic response time; linear or worser-than-linear time is not > accepted) > > In the Multiple-Column-approach, there is one table. The table has a boolean > column for each keyword. It has also a column for the ID and for the scoring. > Now, for each keyword column and for the scoring column a separate index is > created. > C1 is implemented by an AND-query on the keyword columns, C2 by and OR query, > and the result is sorted for the scoring column, cutting of after the first Z > results. > > However our requirements for the search task have changed and I not yet > managed to find a search approach with acceptable response time for following > variation: > Namely that one uses C2 and do not sort for a scoring column but use as > scoring value the number of matched keywords for a given ID. > The difficulty in this query type is that the scoring is dependent on the > query itself.. > > So has anyone an idea how to solve this query type with acceptable response > time, or can anybody tell/prove, that this is theoretically not possible? > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
В списке pgsql-performance по дате отправления: