Re: [GENERAL] Creation of tsearch2 index is very
От | Ron |
---|---|
Тема | Re: [GENERAL] Creation of tsearch2 index is very |
Дата | |
Msg-id | 7.0.1.0.2.20060121105907.03955090@earthlink.net обсуждение исходный текст |
Ответ на | Re: [GENERAL] Creation of tsearch2 index is very (Oleg Bartunov <oleg@sai.msu.su>) |
Список | pgsql-performance |
Perhaps a different approach to this problem is called for: _Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed Witten, Moffat, Bell ISBN 1-55860-570-3 This is a VERY good book on the subject. I'd also suggest looking at the publicly available work on indexing and searching for search engines like Inktomi (sp?) and Google. Ron At 08:34 AM 1/21/2006, Oleg Bartunov wrote: >On Sat, 21 Jan 2006, Ron wrote: > >>At 07:23 PM 1/20/2006, Tom Lane wrote: >>>"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes: >>> > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote: >>> >> It's also worth considering that the entire approach is a heuristic, >>> >> really --- getting the furthest-apart pair of seeds doesn't guarantee >>> >> an optimal split as far as I can see. Maybe there's some totally >>> >> different way to do it. >>> > For those of us who don't know what tsearch2/gist is trying to accomplish >>> > here, could you provide some pointers? :-) >>>Well, we're trying to split an index page that's gotten full into >>>two index pages, preferably with approximately equal numbers of items in >>>each new page (this isn't a hard requirement though). >> >>Maybe we are over thinking this. What happens if we do the obvious >>and just make a new page and move the "last" n/2 items on the full >>page to the new page? >> >>Various forms of "move the last n/2 items" can be tested here: >>0= just split the table in half. Sometimes KISS works. O(1). >>1= the one's with the highest (or lowest) "x" value. >>2= the one's with the highest sum of coordinates (x+y+...= values >>in the top/bottom n/2 of entries). >>3= split the table so that each table has entries whose size_waste >>values add up to approximately the same value. >>4= I'm sure there are others. >>1-5 can be done in O(n) time w/o auxiliary data. They can be done >>in O(1) if we've kept track of the appropriate metric as we've >>built the current page. >> >> >>>I think the true figure of merit for a split is how often will >>>subsequent searches have to descend into *both* of the resulting >>>pages as opposed to just one >>>--- the less often that is true, the better. I'm not very clear >>>on what tsearch2 is doing with these bitmaps, but it looks like an >>>upper page's downlink has the union (bitwise OR) of the one-bits >>>in the values on the lower page, and you have to visit the lower >>>page if this union has a nonempty intersection with the set you >>>are looking for. If that's correct, what you really want is to >>>divide the values so that the unions of the two sets have minimal >>>overlap ... which seems to me to have little to do with what the >>>code does at present. >>I'm not sure what "upper page" and "lower page" mean here? >> >> >>>Teodor, Oleg, can you clarify what's needed here? >>Ditto. Guys what is the real motivation and purpose for this code? > >we want not just split the page by two very distinct parts, but to keep >resulted signatures which is ORed signature of all signatures in the page >as much 'sparse' as can. some information available here >http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals > >Unfortunately, we're rather busy right now and couldn't be very useful.
В списке pgsql-performance по дате отправления: