Discussion: Re: [GENERAL] Creation of tsearch2 index is very slow
[ thread moved to pgsql-performance ]

I've obtained a gprof profile on Stephan's sample case (many thanks for providing the data, Stephan). The command is

    CREATE INDEX foo ON publications_test USING gist (fti_title);

where fti_title is a tsvector column. There are 236984 rows in the table, most with between 4 and 10 words in fti_title. sum(length(fti_title)) yields 1636202 ... not sure if this is a relevant measure, however.

Using CVS tip with a fairly vanilla configuration (including --enable-cassert), here are all the hotspots down to the 1% level:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
20.19      1.90     1.90    588976     0.00     0.00  gistchoose
19.02      3.69     1.79    683471     0.00     0.00  XLogInsert
 5.95      4.25     0.56   3575135     0.00     0.00  LWLockAcquire
 4.46      4.67     0.42   3579005     0.00     0.00  LWLockRelease
 4.14      5.06     0.39   3146848     0.00     0.00  AllocSetAlloc
 3.72      5.41     0.35    236984     0.00     0.00  gistdoinsert
 3.40      5.73     0.32    876047     0.00     0.00  hash_search
 2.76      5.99     0.26   3998576     0.00     0.00  LockBuffer
 2.28      6.21     0.22  11514275     0.00     0.00  gistdentryinit
 1.86      6.38     0.18    841757     0.00     0.00  UnpinBuffer
 1.81      6.55     0.17  12201023     0.00     0.00  FunctionCall1
 1.81      6.72     0.17    237044     0.00     0.00  AllocSetCheck
 1.49      6.86     0.14    236984     0.00     0.00  gistmakedeal
 1.49      7.00     0.14  10206985     0.00     0.00  FunctionCall3
 1.49      7.14     0.14   1287874     0.00     0.00  MemoryContextAllocZero
 1.28      7.26     0.12    826179     0.00     0.00  PinBuffer
 1.17      7.37     0.11    875785     0.00     0.00  hash_any
 1.17      7.48     0.11   1857292     0.00     0.00  MemoryContextAlloc
 1.17      7.59     0.11    221466     0.00     0.00  PageIndexTupleDelete
 1.06      7.69     0.10   9762101     0.00     0.00  gistpenalty

Clearly, one thing that would be worth doing is suppressing the WAL traffic when possible, as we already do for btree builds. It seems that gistchoose may have some internal inefficiency too --- I haven't looked at the code yet. The other thing that jumps out is the very large numbers of calls to gistdentryinit, FunctionCall1, and FunctionCall3.
Some interesting parts of the calls/calledby graph are:

-----------------------------------------------
                0.35    8.07   236984/236984     gistbuildCallback [14]
[15]    89.5    0.35    8.07   236984           gistdoinsert [15]
                0.14    3.55   236984/236984     gistmakedeal [16]
                1.90    0.89   588976/588976     gistchoose [17]
                0.07    0.83   825960/841757     ReadBuffer [19]
                0.09    0.10   825960/1287874    MemoryContextAllocZero [30]
                0.12    0.05  1888904/3998576    LockBuffer [29]
                0.13    0.00   825960/3575135    LWLockAcquire [21]
                0.10    0.00   825960/3579005    LWLockRelease [26]
                0.06    0.00   473968/3146848    AllocSetAlloc [27]
                0.03    0.00   473968/1857292    MemoryContextAlloc [43]
                0.02    0.00   825960/1272423    gistcheckpage [68]
-----------------------------------------------
                0.14    3.55   236984/236984     gistdoinsert [15]
[16]    39.2    0.14    3.55   236984           gistmakedeal [16]
                1.20    0.15   458450/683471     XLogInsert [18]
                0.01    0.66   224997/224997     gistxlogInsertCompletion [20]
                0.09    0.35   444817/444817     gistgetadjusted [23]
                0.08    0.17   456801/456804     formUpdateRdata [32]
                0.17    0.01   827612/841757     UnpinBuffer [35]
                0.11    0.00   221466/221466     PageIndexTupleDelete [42]
                0.02    0.08   456801/460102     gistfillbuffer [45]
                0.06    0.04     1649/1649       gistSplit [46]
                0.08    0.00   685099/3579005    LWLockRelease [26]
                0.03    0.05   446463/446463     gistFindCorrectParent [50]
                0.04    0.02   685099/3998576    LockBuffer [29]
                0.04    0.00     1649/1649       gistextractbuffer [58]
                0.03    0.00   460102/460121     write_buffer [66]
                0.02    0.00   825960/826092     ReleaseBuffer [69]
                0.02    0.00   221402/221402     gistadjscans [82]
                0.00    0.00     1582/1582       gistunion [131]
                0.00    0.00     1649/1649       formSplitRdata [147]
                0.00    0.00     1649/1649       gistjoinvector [178]
                0.00    0.00        3/3          gistnewroot [199]
                0.00    0.00   458450/461748     gistnospace [418]
                0.00    0.00   458450/458450     WriteNoReleaseBuffer [419]
                0.00    0.00     1652/1671       WriteBuffer [433]
-----------------------------------------------
                1.90    0.89   588976/588976     gistdoinsert [15]
[17]    29.7    1.90    0.89   588976           gistchoose [17]
                0.25    0.17  9762101/10892174   FunctionCall3 <cycle 1> [38]
                0.18    0.14  9762101/11514275   gistdentryinit [28]
                0.10    0.00  9762101/9762101    gistpenalty [47]
                0.04    0.02   588976/1478610    gistDeCompressAtt [39]
-----------------------------------------------
                0.00    0.00        1/683471     gistbuild [12]
                0.00    0.00        1/683471     log_heap_update [273]
                0.00    0.00        1/683471     RecordTransactionCommit [108]
                0.00    0.00        1/683471     smgrcreate [262]
                0.00    0.00        3/683471     gistnewroot [199]
                0.00    0.00        5/683471     heap_insert [116]
                0.00    0.00       12/683471     _bt_insertonpg [195]
                0.59    0.07   224997/683471     gistxlogInsertCompletion [20]
                1.20    0.15   458450/683471     gistmakedeal [16]
[18]    21.4    1.79    0.22   683471           XLogInsert [18]
                0.11    0.00   683471/3575135    LWLockAcquire [21]
                0.08    0.00   687340/3579005    LWLockRelease [26]
                0.03    0.00   683471/683471     GetCurrentTransactionIdIfAny [65]
                0.01    0.00    15604/15604      AdvanceXLInsertBuffer [111]
                0.00    0.00        3/10094      BufferGetBlockNumber [95]
                0.00    0.00     3869/3870       XLogWrite [281]
                0.00    0.00     3870/3871       LWLockConditionalAcquire [428]
                0.00    0.00        3/3          BufferGetFileNode [611]
-----------------------------------------------
                0.00    0.00     3164/11514275   gistunion [131]
                0.01    0.00   270400/11514275   gistSplit [46]
                0.03    0.02  1478610/11514275   gistDeCompressAtt [39]
                0.18    0.14  9762101/11514275   gistchoose [17]
[28]     4.0    0.22    0.16 11514275           gistdentryinit [28]
                0.16    0.00 11514275/12201023   FunctionCall1 [36]
-----------------------------------------------
                0.00    0.00       67/12201023   index_endscan <cycle 1> [167]
                0.01    0.00   686681/12201023   gistcentryinit [62]
                0.16    0.00 11514275/12201023   gistdentryinit [28]
[36]     1.8    0.17    0.00 12201023           FunctionCall1 [36]
                0.00    0.00       67/67         btendscan [231]
                0.00    0.00 12200956/22855929   data_start [414]
-----------------------------------------------
                                   67            index_beginscan_internal <cycle 1> [169]
                0.01    0.01   444817/10892174   gistgetadjusted [23]
                0.25    0.17  9762101/10892174   gistchoose [17]
[38]     1.5    0.14    0.00 10206985           FunctionCall3 <cycle 1> [38]
                0.00    0.00 10206918/22855929   data_start [414]
                0.00    0.00       67/67         btbeginscan [486]
                                   67            RelationGetIndexScan <cycle 1> [212]

Now that we have some data, we can start to think about how to improve matters ...

regards, tom lane
On Fri, Jan 20, 2006 at 02:14:29PM -0500, Tom Lane wrote:
> [ thread moved to pgsql-performance ]
>
> I've obtained a gprof profile on Stephan's sample case (many thanks for
> providing the data, Stephan). The command is

<snip>

Something I'm missing is the calls to tsearch functions. I'm not 100% familiar with gprof, but is it possible those costs have been added somewhere else because it's in a shared library? Perhaps the costs went into FunctionCall1/3?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes:
> Something I'm missing is the calls to tsearch functions. I'm not 100%
> familiar with gprof, but is it possible those costs have been added
> somewhere else because it's in a shared library? Perhaps the costs went
> into FunctionCall1/3?

I think that the tsearch functions must be the stuff charged as "data_start" (which is not actually any symbol in our source code). That is showing as being called by FunctionCallN, which is what you'd expect. If the totals given by gprof are correct, then it's down in the noise. I don't think I trust that too much ... but I don't see anything in the gprof manual about how to include a dynamically loaded library in the profile. (I did compile tsearch2 with -pg, but that's evidently not enough.)

I'll see if I can link tsearch2 statically to resolve this question.

regards, tom lane
On Fri, Jan 20, 2006 at 03:21:45PM -0500, Tom Lane wrote:
> If the totals given by gprof are correct, then it's down in the noise.
> I don't think I trust that too much ... but I don't see anything in the
> gprof manual about how to include a dynamically loaded library in the
> profile. (I did compile tsearch2 with -pg, but that's evidently not
> enough.)

There is some mention on the web of an environment variable you can set: LD_PROFILE=<libname>

These pages seem relevant:
http://sourceware.org/ml/binutils/2002-04/msg00047.html
http://www.scit.wlv.ac.uk/cgi-bin/mansec?1+gprof

It's weird how some man pages for gprof describe this feature, but the one on my local system doesn't mention it.

> I'll see if I can link tsearch2 statically to resolve this question.

That'll work too...

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Well, I feel like a fool, because I failed to notice that the total runtime shown in that profile wasn't anywhere close to the actual wall clock time. gprof is indeed simply not counting the time spent in dynamically-linked code. With tsearch2 statically linked into the backend, a more believable picture emerges:

  %   cumulative   self               self     total
 time   seconds   seconds     calls  Ks/call  Ks/call  name
98.96   1495.93   1495.93  33035195     0.00     0.00  hemdistsign
 0.27   1500.01      4.08  10030581     0.00     0.00  makesign
 0.11   1501.74      1.73    588976     0.00     0.00  gistchoose
 0.10   1503.32      1.58    683471     0.00     0.00  XLogInsert
 0.05   1504.15      0.83    246579     0.00     0.00  sizebitvec
 0.05   1504.93      0.78    446399     0.00     0.00  gtsvector_union
 0.03   1505.45      0.52   3576475     0.00     0.00  LWLockRelease
 0.03   1505.92      0.47      1649     0.00     0.00  gtsvector_picksplit
 0.03   1506.38      0.47   3572572     0.00     0.00  LWLockAcquire
 0.02   1506.74      0.36    444817     0.00     0.00  gtsvector_same
 0.02   1507.09      0.35   4077089     0.00     0.00  AllocSetAlloc
 0.02   1507.37      0.28    236984     0.00     0.00  gistdoinsert
 0.02   1507.63      0.26    874195     0.00     0.00  hash_search
 0.02   1507.89      0.26   9762101     0.00     0.00  gtsvector_penalty
 0.01   1508.08      0.19    236984     0.00     0.00  gistmakedeal
 0.01   1508.27      0.19    841754     0.00     0.00  UnpinBuffer
 0.01   1508.45      0.18  22985469     0.00     0.00  hemdistcache
 0.01   1508.63      0.18   3998572     0.00     0.00  LockBuffer
 0.01   1508.81      0.18    686681     0.00     0.00  gtsvector_compress
 0.01   1508.98      0.17  11514275     0.00     0.00  gistdentryinit

So we gotta fix hemdistsign ...

regards, tom lane
On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote:
>   %   cumulative   self               self     total
>  time   seconds   seconds     calls  Ks/call  Ks/call  name
> 98.96   1495.93   1495.93  33035195     0.00     0.00  hemdistsign

<snip>

> So we gotta fix hemdistsign ...

lol! Yeah, I guess so. Pretty nasty loop. LOOPBIT will iterate 8*63=504 times and it's going to do silly bit handling on each and every iteration.

Given that all it's doing is counting bits, a simple fix would be to loop over bytes, use XOR and count ones. For extreme speedup create a lookup table with 256 entries to give you the answer straight away...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Fri, Jan 20, 2006 at 10:37:54PM +0100, Martijn van Oosterhout wrote:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

For extra obfuscation:

    unsigned v = (unsigned)c;
    int num_bits = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

(More more-or-less intelligent options at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive :-) )

/* Steinar */
--
Homepage: http://www.sesse.net/
Martijn van Oosterhout <kleptog@svana.org> writes:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

Yeah, I just finished doing that and got about a 20x overall speedup (30-some seconds to build the index instead of 10 minutes). However, hemdistsign is *still* 70% of the runtime after doing that. The problem seems to be that gtsvector_picksplit is calling it an inordinate number of times:

                0.53   30.02     1649/1649        FunctionCall2 <cycle 2> [19]
[20]    52.4    0.53   30.02     1649            gtsvector_picksplit [20]
               29.74    0.00 23519673/33035195    hemdistsign [18]
                0.14    0.00 22985469/22985469    hemdistcache [50]
                0.12    0.00   268480/10030581    makesign [25]
                0.02    0.00   270400/270400      fillcache [85]
                0.00    0.00     9894/4077032     AllocSetAlloc [34]
                0.00    0.00     9894/2787477     MemoryContextAlloc [69]

(hemdistcache calls hemdistsign --- I think gprof is doing something funny with tail-calls here, and showing hemdistsign as directly called from gtsvector_picksplit when control really arrives through hemdistcache.)

The bulk of the problem is clearly in this loop, which performs O(N^2) comparisons to find the two entries that are furthest apart in hemdist terms:

    for (k = FirstOffsetNumber; k < maxoff; k = OffsetNumberNext(k))
    {
        for (j = OffsetNumberNext(k); j <= maxoff; j = OffsetNumberNext(j))
        {
            if (k == FirstOffsetNumber)
                fillcache(&cache[j], GETENTRY(entryvec, j));

            size_waste = hemdistcache(&(cache[j]), &(cache[k]));
            if (size_waste > waste)
            {
                waste = size_waste;
                seed_1 = k;
                seed_2 = j;
            }
        }
    }

I wonder if there is a way to improve on that.

regards, tom lane
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> I wonder if there is a way to improve on that.

Ooh, the farthest pair problem (in an N-dimensional vector space, though). I'm pretty sure problems like this have been studied quite extensively in the literature, although perhaps not with the same norm. It's known under both "farthest pair" and "diameter", and probably others. I'm fairly sure it should be solvable in at least O(n log n).

/* Steinar */
--
Homepage: http://www.sesse.net/
At 05:16 PM 1/20/2006, Steinar H. Gunderson wrote:
> On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
>> I wonder if there is a way to improve on that.
>
> Ooh, the farthest pair problem (in an N-dimensional vector space, though).
> I'm pretty sure problems like this have been studied quite extensively in
> the literature, although perhaps not with the same norm. It's known under
> both "farthest pair" and "diameter", and probably others. I'm fairly sure
> it should be solvable in at least O(n log n).

If the N-dimensional space is Euclidean (any <x, x+1> is the same distance apart in dimension x), then finding the farthest pair can be done in at least O(n). If you do not want the actual distance and can create the proper data structures, particularly if you can update them incrementally as you generate pairs, it is often possible to solve this problem in O(lg n) or O(1).

I'll do some grinding.

Ron
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> (hemdistcache calls hemdistsign --- I think gprof is doing something
> funny with tail-calls here, and showing hemdistsign as directly called
> from gtsvector_picksplit when control really arrives through hemdistcache.)

It may be the compiler. All these functions are declared static, which gives the compiler quite a bit of leeway to rearrange code.

> The bulk of the problem is clearly in this loop, which performs O(N^2)
> comparisons to find the two entries that are furthest apart in hemdist
> terms:

Ah. A while ago someone came onto the list asking about bit string indexing[1]. If I'd known tsearch worked like this I would have pointed him to it. Anyway, before he went off to implement it he mentioned "Jarvis-Patrick clustering", whatever that means. Probably more relevant was this thread[2] on -hackers a while back, with pseudo-code[3]. How well it works, I don't know, but it worked for him evidently; he went away happy...

[1] http://archives.postgresql.org/pgsql-general/2005-11/msg00473.php
[2] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01067.php
[3] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01069.php

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote:
> On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote:
>>   %   cumulative   self               self     total
>>  time   seconds   seconds     calls  Ks/call  Ks/call  name
>> 98.96   1495.93   1495.93  33035195     0.00     0.00  hemdistsign
>
> <snip>
>
>> So we gotta fix hemdistsign ...
>
> lol! Yeah, I guess so. Pretty nasty loop. LOOPBIT will iterate 8*63=504
> times and it's going to do silly bit handling on each and every
> iteration.
>
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

For an even more extreme speedup, don't most modern CPUs have an asm instruction that counts the bits (un)set (AKA "population counting") in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b CPUs with SWAR instructions)?

Ron
On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote:
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?

None in the x86 series that I'm aware of, at least. You have instructions for finding the highest set bit, though.

/* Steinar */
--
Homepage: http://www.sesse.net/
Ron <rjpeace@earthlink.net> writes:
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?

Yeah, but fetching from a small constant table is pretty quick too; I doubt it's worth getting involved in machine-specific assembly code for this. I'm much more interested in the idea of improving the furthest-distance algorithm in gtsvector_picksplit --- if we can do that, it'll probably drop the distance calculation down to the point where it's not really worth the trouble to assembly-code it.

regards, tom lane
On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote:
> At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote:
>> Given that all it's doing is counting bits, a simple fix would be to
>> loop over bytes, use XOR and count ones. For extreme speedup create a
>> lookup table with 256 entries to give you the answer straight away...
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?

Quite possibly, though I wouldn't have the foggiest idea how to get the C compiler to generate it. Given that even a lookup table will get you pretty close to that with plain C coding, I think that's quite enough for a function that really is just a small part of a much larger system...

Better solution (as Tom points out): work out how to avoid calling it so much in the first place... At the moment each call to gtsvector_picksplit seems to call the distance function around 14262 times. Getting that down by an order of magnitude will help much much more.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Fri, Jan 20, 2006 at 05:29:46PM -0500, Ron wrote:
> If the N-dimensional space is Euclidean (any <x, x+1> is the same
> distance apart in dimension x), then finding the farthest pair can be
> done in at least O(n).

That sounds a bit optimistic. http://portal.acm.org/ft_gateway.cfm?id=167217&type=pdf&coll=GUIDE&dl=GUIDE&CFID=66230761&CFTOKEN=72453878 is from 1993, but still it struggles with getting it down to O(n log n) deterministically, for Euclidean 3-space, and our problem is not Euclidean (although it still satisfies the triangle inequality, which sounds important to me) and in a significantly higher dimension...

/* Steinar */
--
Homepage: http://www.sesse.net/
On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote:
> Yeah, but fetching from a small constant table is pretty quick too;
> I doubt it's worth getting involved in machine-specific assembly code
> for this. I'm much more interested in the idea of improving the
> furthest-distance algorithm in gtsvector_picksplit --- if we can do
> that, it'll probably drop the distance calculation down to the point
> where it's not really worth the trouble to assembly-code it.

For the record: Could we do with a less-than-optimal split here? In that case, an extremely simple heuristic is:

    best = distance(0, 1)
    best_i = 0
    best_j = 1
    for i = 2..last:
        if distance(best_i, i) > best:
            best = distance(best_i, i)
            best_j = i
        else if distance(best_j, i) > best:
            best = distance(best_j, i)
            best_i = i

I've tested it on various data, and although it's definitely not _correct_, it generally gets within 10%.

/* Steinar */
--
Homepage: http://www.sesse.net/
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> I wonder if there is a way to improve on that.

http://www.cs.uwaterloo.ca/~tmchan/slide_isaac.ps:

    The diameter problem has been studied extensively in the traditional
    model. Although O(n log n) algorithms have been given for d = 2 and
    d = 3, only slightly subquadratic algorithms are known for higher
    dimensions.

It doesn't mention a date, but has references to at least 2004-papers, so I'm fairly sure nothing big has happened since that. It sounds like we either want to go for an approximation, or just accept that it's a lot of work to get it better than O(n^2). Or, of course, find some special property of our problem that makes it easier than the general problem :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
> For the record: Could we do with a less-than-optimal split here?

Yeah, I was wondering the same. The code is basically choosing two "seed" values to drive the index-page split. Intuitively it seems that "pretty far apart" would be nearly as good as "absolute furthest apart" for this purpose. The cost of a less-than-optimal split would be paid on all subsequent index accesses, though, so it's not clear how far we can afford to go in this direction.

It's also worth considering that the entire approach is a heuristic, really --- getting the furthest-apart pair of seeds doesn't guarantee an optimal split as far as I can see. Maybe there's some totally different way to do it.

regards, tom lane
On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
> It's also worth considering that the entire approach is a heuristic,
> really --- getting the furthest-apart pair of seeds doesn't guarantee
> an optimal split as far as I can see. Maybe there's some totally
> different way to do it.

For those of us who don't know what tsearch2/gist is trying to accomplish here, could you provide some pointers? :-) During my mini-literature-search on Google, I've found various algorithms for locating clusters in high-dimensional metric spaces[1]; some of it might be useful, but I might just be misunderstanding what the real problem is.

[1] http://ieeexplore.ieee.org/iel5/69/30435/01401892.pdf?arnumber=1401892 , for instance

/* Steinar */
--
Homepage: http://www.sesse.net/
"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>> It's also worth considering that the entire approach is a heuristic,
>> really --- getting the furthest-apart pair of seeds doesn't guarantee
>> an optimal split as far as I can see. Maybe there's some totally
>> different way to do it.

> For those of us who don't know what tsearch2/gist is trying to accomplish
> here, could you provide some pointers? :-)

Well, we're trying to split an index page that's gotten full into two index pages, preferably with approximately equal numbers of items in each new page (this isn't a hard requirement though). I think the true figure of merit for a split is how often will subsequent searches have to descend into *both* of the resulting pages as opposed to just one --- the less often that is true, the better.

I'm not very clear on what tsearch2 is doing with these bitmaps, but it looks like an upper page's downlink has the union (bitwise OR) of the one-bits in the values on the lower page, and you have to visit the lower page if this union has a nonempty intersection with the set you are looking for. If that's correct, what you really want is to divide the values so that the unions of the two sets have minimal overlap ... which seems to me to have little to do with what the code does at present.

Teodor, Oleg, can you clarify what's needed here?

regards, tom lane
On Fri, Jan 20, 2006 at 07:23:10PM -0500, Tom Lane wrote:
> I'm not very clear on what tsearch2 is doing with these bitmaps, but it
> looks like an upper page's downlink has the union (bitwise OR) of the
> one-bits in the values on the lower page, and you have to visit the lower
> page if this union has a nonempty intersection with the set you are
> looking for. If that's correct, what you really want is to divide the
> values so that the unions of the two sets have minimal overlap ... which
> seems to me to have little to do with what the code does at present.

Sort of like the vertex-cover problem? That's probably a lot harder than finding the two farthest points...

/* Steinar */
--
Homepage: http://www.sesse.net/
Tom Lane wrote:
> Well, we're trying to split an index page that's gotten full into two
> index pages, preferably with approximately equal numbers of items in
> each new page (this isn't a hard requirement though). ... If that's
> correct, what you really want is to divide the values so that the unions
> of the two sets have minimal overlap ... which seems to me to have
> little to do with what the code does at present.

This problem has been studied extensively by chemists, and they haven't found any easy solutions. The Jarvis-Patrick clustering algorithm might give you hints about a fast approach. In theory it's K*O(N^2), but J-P is preferred for large datasets (millions of molecules) because the coefficient K can be made quite low. It starts with a "similarity metric" for two bit strings, the Tanimoto or Tversky coefficients:

http://www.daylight.com/dayhtml/doc/theory/theory.finger.html#RTFToC83

J-P Clustering is described here:

http://www.daylight.com/dayhtml/doc/cluster/cluster.a.html#cl33

J-P Clustering is probably not the best for this problem (see the illustrations in the link above to see why), but the general idea of computing N-nearest-neighbors, followed by a partitioning step, could be what's needed.

Craig
At 07:23 PM 1/20/2006, Tom Lane wrote:
> "Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
>> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>>> It's also worth considering that the entire approach is a heuristic,
>>> really --- getting the furthest-apart pair of seeds doesn't guarantee
>>> an optimal split as far as I can see. Maybe there's some totally
>>> different way to do it.
>
>> For those of us who don't know what tsearch2/gist is trying to accomplish
>> here, could you provide some pointers? :-)
>
> Well, we're trying to split an index page that's gotten full into
> two index pages, preferably with approximately equal numbers of items in
> each new page (this isn't a hard requirement though).

Maybe we are over thinking this. What happens if we do the obvious and just make a new page and move the "last" n/2 items on the full page to the new page?

Various forms of "move the last n/2 items" can be tested here:
0= just split the table in half. Sometimes KISS works. O(1).
1= the one's with the highest (or lowest) "x" value.
2= the one's with the highest sum of coordinates (x+y+...= values in the top/bottom n/2 of entries).
3= split the table so that each table has entries whose size_waste values add up to approximately the same value.
4= I'm sure there are others.

0-3 can be done in O(n) time w/o auxiliary data. They can be done in O(1) if we've kept track of the appropriate metric as we've built the current page.

> I think the true figure of merit for a split is how often will
> subsequent searches have to descend into *both* of the resulting
> pages as opposed to just one --- the less often that is true, the
> better. I'm not very clear on what tsearch2 is doing with these
> bitmaps, but it looks like an upper page's downlink has the union
> (bitwise OR) of the one-bits in the values on the lower page, and
> you have to visit the lower page if this union has a nonempty
> intersection with the set you are looking for. If that's correct,
> what you really want is to divide the values so that the unions of
> the two sets have minimal overlap ... which seems to me to have
> little to do with what the code does at present.

I'm not sure what "upper page" and "lower page" mean here?

> Teodor, Oleg, can you clarify what's needed here?

Ditto. Guys, what is the real motivation and purpose for this code?

Ron
On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote:
> Yeah, but fetching from a small constant table is pretty quick too;
> I doubt it's worth getting involved in machine-specific assembly code
> for this. I'm much more interested in the idea of improving the
> furthest-distance algorithm in gtsvector_picksplit --- if we can do
> that, it'll probably drop the distance calculation down to the point
> where it's not really worth the trouble to assembly-code it.

I've played with another algorithm. Very simple but it's only O(N). It doesn't get the longest distance but it does get close. Basically you take the first two elements as your starting length. Then you loop over each remaining string, each time finding the longest pair out of each set of three.

I've only tried it on random strings. The maximum distance for 128 random strings tends to be around 291-295. This algorithm tends to find lengths around 280. Pseudo code below (in Perl).

However, IMHO, this algorithm is optimising the wrong thing. It shouldn't be trying to split into sets that are far apart, it should be trying to split into sets that minimize the number of set bits (ie distance from zero), since that's what will speed up searching. That's harder though (this algorithm does approximate it, sort of) and I haven't come up with an algorithm yet.

    sub MaxDistFast {
        my $strings = shift;

        my $m1 = 0;
        my $m2 = 1;
        my $dist = -1;

        for my $i (2..$#$strings) {
            my $d1 = HammDist( $strings->[$i], $strings->[$m1] );
            my $d2 = HammDist( $strings->[$i], $strings->[$m2] );
            my $m = ($d1 > $d2) ? $m1 : $m2;
            my $d = ($d1 > $d2) ? $d1 : $d2;
            if( $d > $dist ) {
                $dist = $d;
                $m1 = $i;
                $m2 = $m;
            }
        }
        return($m1,$m2,$dist);
    }

Full program available at: http://svana.org/kleptog/temp/picksplit.pl

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:
> However, IMHO, this algorithm is optimising the wrong thing. It
> shouldn't be trying to split into sets that are far apart, it should be
> trying to split into sets that minimize the number of set bits (ie
> distance from zero), since that's what will speed up searching.

Martijn, you're right! We want not only to split the page into very different parts, but also not to increase the number of set bits in the resulting signatures, which are the union (OR'ed) of all signatures in each part. We need not only fast index creation (thanks, Tom!), but a better index.

Some information is available here:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals

There should be a more detailed document, but I don't remember where :)

> That's harder though (this algorithm does approximate it, sort of)
> and I haven't come up with an algorithm yet

Don't ask how hard we thought :)

Regards, Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Sat, 21 Jan 2006, Ron wrote:
> At 07:23 PM 1/20/2006, Tom Lane wrote:
>> "Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
>>> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>>>> It's also worth considering that the entire approach is a heuristic,
>>>> really --- getting the furthest-apart pair of seeds doesn't guarantee
>>>> an optimal split as far as I can see. Maybe there's some totally
>>>> different way to do it.
>>> For those of us who don't know what tsearch2/gist is trying to
>>> accomplish here, could you provide some pointers? :-)
>> Well, we're trying to split an index page that's gotten full into two
>> index pages, preferably with approximately equal numbers of items in
>> each new page (this isn't a hard requirement though).
>
> Maybe we are over thinking this. What happens if we do the obvious and
> just make a new page and move the "last" n/2 items on the full page to
> the new page?
>
> Various forms of "move the last n/2 items" can be tested here:
> 0= just split the table in half. Sometimes KISS works. O(1).
> 1= the ones with the highest (or lowest) "x" value.
> 2= the ones with the highest sum of coordinates (x+y+... values in the
> top/bottom n/2 of entries).
> 3= split the table so that each table has entries whose size_waste
> values add up to approximately the same value.
> 4= I'm sure there are others.
> 0-4 can be done in O(n) time w/o auxiliary data. They can be done in
> O(1) if we've kept track of the appropriate metric as we've built the
> current page.
>
>> I think the true figure of merit for a split is how often will
>> subsequent searches have to descend into *both* of the resulting pages
>> as opposed to just one --- the less often that is true, the better.
>> I'm not very clear on what tsearch2 is doing with these bitmaps, but
>> it looks like an upper page's downlink has the union (bitwise OR) of
>> the one-bits in the values on the lower page, and you have to visit
>> the lower page if this union has a nonempty intersection with the set
>> you are looking for. If that's correct, what you really want is to
>> divide the values so that the unions of the two sets have minimal
>> overlap ... which seems to me to have little to do with what the code
>> does at present.
>
> I'm not sure what "upper page" and "lower page" mean here?
>
>> Teodor, Oleg, can you clarify what's needed here?
>
> Ditto. Guys, what is the real motivation and purpose for this code?

We want not just to split the page into two very distinct parts, but to keep the resulting signature, which is the OR of all signatures on the page, as sparse as possible. Some information is available here:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals

Unfortunately, we're rather busy right now and couldn't be very useful.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
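Tom's descend-both-pages criterion and Oleg's sparsity goal can be seen in a toy example. The 8-bit "signatures" below are hypothetical stand-ins for gtsvector bitmaps; the point is that a split grouping similar signatures keeps each downlink (the OR of its page) sparse, while a mixed split makes both downlinks dense, so more searches intersect both and must visit both children:

```python
def union_bits(signatures):
    """Set-bit count of the OR of a group -- the density of the
    downlink key the parent page would store for that group."""
    u = 0
    for s in signatures:
        u |= s
    return bin(u).count("1")

# Hypothetical 8-bit signatures on a page that must be split.
# Split A groups similar items; Split B mixes the two clusters.
split_a = ([0b11000001, 0b11000010], [0b00110100, 0b00111000])
split_b = ([0b11000001, 0b00110100], [0b11000010, 0b00111000])

print([union_bits(g) for g in split_a])  # sparse downlinks
print([union_bits(g) for g in split_b])  # dense downlinks
```

With split A each downlink has 4 set bits; with split B both have 6, so a query signature is far more likely to overlap both.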
On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
> Martijn, you're right! We want not only to split the page into very
> different parts, but also not to increase the number of set bits in the
> resulting signatures, which are the union (OR) of all signatures in a
> part. We need not only fast index creation (thanks, Tom!), but a better
> index. Some information is available here
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
> There should be a more detailed document, but I don't remember where :)

I see how it works; what I don't quite get is whether the "inverted index" you refer to is what we're working with here, or just what's in tsearchd?

>> That's harder though (this algorithm does approximate it, sort of)
>> and I haven't come up with an algorithm yet.
>
> Don't ask how hard we thought :)

Well, looking at how other people are struggling with it, it's definitely a Hard Problem. One thing though: I don't think the picksplit algorithm as is really requires you to strictly have the longest distance, just something reasonably long. So I think the alternate algorithm I posted should produce equivalent results. No idea how to test it though...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Sat, 21 Jan 2006, Ron wrote:
> Perhaps a different approach to this problem is called for:
>
> _Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed
> Witten, Moffat, Bell
> ISBN 1-55860-570-3
>
> This is a VERY good book on the subject.
>
> I'd also suggest looking at the publicly available work on indexing and
> searching for search engines like Inktomi (sp?) and Google.
> Ron

Ron, you completely miss the problem! We do know MG and other search engines. Actually, we've implemented several search engines based on inverted index technology (see, for example, pgsql.ru/db/pgsearch). tsearch2 was designed for online indexing, while keeping an inverted index online is a rather difficult problem. We do have a plan to implement an inverted index as an option for large read-only archives, but right now we are discussing how to organize an online index and, if possible, how to optimize the current storage for signatures without breaking search performance.

> At 08:34 AM 1/21/2006, Oleg Bartunov wrote:
>> On Sat, 21 Jan 2006, Ron wrote:
>>> At 07:23 PM 1/20/2006, Tom Lane wrote:
>>>> "Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
>>>>> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>>>>>> It's also worth considering that the entire approach is a heuristic,
>>>>>> really --- getting the furthest-apart pair of seeds doesn't guarantee
>>>>>> an optimal split as far as I can see. Maybe there's some totally
>>>>>> different way to do it.
>>>>> For those of us who don't know what tsearch2/gist is trying to
>>>>> accomplish here, could you provide some pointers? :-)
>>>> Well, we're trying to split an index page that's gotten full into two
>>>> index pages, preferably with approximately equal numbers of items in
>>>> each new page (this isn't a hard requirement though).
>>>
>>> Maybe we are over thinking this. What happens if we do the obvious and
>>> just make a new page and move the "last" n/2 items on the full page to
>>> the new page?
>>>
>>> Various forms of "move the last n/2 items" can be tested here:
>>> 0= just split the table in half. Sometimes KISS works. O(1).
>>> 1= the ones with the highest (or lowest) "x" value.
>>> 2= the ones with the highest sum of coordinates (x+y+... values in the
>>> top/bottom n/2 of entries).
>>> 3= split the table so that each table has entries whose size_waste
>>> values add up to approximately the same value.
>>> 4= I'm sure there are others.
>>> 0-4 can be done in O(n) time w/o auxiliary data. They can be done in
>>> O(1) if we've kept track of the appropriate metric as we've built the
>>> current page.
>>>
>>>> I think the true figure of merit for a split is how often will
>>>> subsequent searches have to descend into *both* of the resulting
>>>> pages as opposed to just one --- the less often that is true, the
>>>> better. I'm not very clear on what tsearch2 is doing with these
>>>> bitmaps, but it looks like an upper page's downlink has the union
>>>> (bitwise OR) of the one-bits in the values on the lower page, and you
>>>> have to visit the lower page if this union has a nonempty
>>>> intersection with the set you are looking for. If that's correct,
>>>> what you really want is to divide the values so that the unions of
>>>> the two sets have minimal overlap ... which seems to me to have
>>>> little to do with what the code does at present.
>>> I'm not sure what "upper page" and "lower page" mean here?
>>>
>>>> Teodor, Oleg, can you clarify what's needed here?
>>> Ditto. Guys, what is the real motivation and purpose for this code?
>>
>> We want not just to split the page into two very distinct parts, but to
>> keep the resulting signature, which is the OR of all signatures on the
>> page, as sparse as possible. Some information is available here:
>> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
>>
>> Unfortunately, we're rather busy right now and couldn't be very useful.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Perhaps a different approach to this problem is called for:

_Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed
Witten, Moffat, Bell
ISBN 1-55860-570-3

This is a VERY good book on the subject.

I'd also suggest looking at the publicly available work on indexing and searching for search engines like Inktomi (sp?) and Google.

Ron

At 08:34 AM 1/21/2006, Oleg Bartunov wrote:
>On Sat, 21 Jan 2006, Ron wrote:
>>At 07:23 PM 1/20/2006, Tom Lane wrote:
>>>"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
>>> > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>>> >> It's also worth considering that the entire approach is a heuristic,
>>> >> really --- getting the furthest-apart pair of seeds doesn't guarantee
>>> >> an optimal split as far as I can see. Maybe there's some totally
>>> >> different way to do it.
>>> > For those of us who don't know what tsearch2/gist is trying to
>>> > accomplish here, could you provide some pointers? :-)
>>>Well, we're trying to split an index page that's gotten full into
>>>two index pages, preferably with approximately equal numbers of items in
>>>each new page (this isn't a hard requirement though).
>>
>>Maybe we are over thinking this. What happens if we do the obvious
>>and just make a new page and move the "last" n/2 items on the full
>>page to the new page?
>>
>>Various forms of "move the last n/2 items" can be tested here:
>>0= just split the table in half. Sometimes KISS works. O(1).
>>1= the ones with the highest (or lowest) "x" value.
>>2= the ones with the highest sum of coordinates (x+y+... values
>>in the top/bottom n/2 of entries).
>>3= split the table so that each table has entries whose size_waste
>>values add up to approximately the same value.
>>4= I'm sure there are others.
>>0-4 can be done in O(n) time w/o auxiliary data. They can be done
>>in O(1) if we've kept track of the appropriate metric as we've
>>built the current page.
>>
>>>I think the true figure of merit for a split is how often will
>>>subsequent searches have to descend into *both* of the resulting
>>>pages as opposed to just one --- the less often that is true, the
>>>better. I'm not very clear on what tsearch2 is doing with these
>>>bitmaps, but it looks like an upper page's downlink has the union
>>>(bitwise OR) of the one-bits in the values on the lower page, and
>>>you have to visit the lower page if this union has a nonempty
>>>intersection with the set you are looking for. If that's correct,
>>>what you really want is to divide the values so that the unions of
>>>the two sets have minimal overlap ... which seems to me to have
>>>little to do with what the code does at present.
>>I'm not sure what "upper page" and "lower page" mean here?
>>
>>>Teodor, Oleg, can you clarify what's needed here?
>>Ditto. Guys, what is the real motivation and purpose for this code?
>
>we want not just to split the page into two very distinct parts, but to
>keep the resulting signature, which is the OR of all signatures on the
>page, as sparse as possible. some information available here
>http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
>
>Unfortunately, we're rather busy right now and couldn't be very useful.
On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:
> On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
>> Martijn, you're right! We want not only to split the page into very
>> different parts, but also not to increase the number of set bits in the
>> resulting signatures, which are the union (OR) of all signatures in a
>> part. We need not only fast index creation (thanks, Tom!), but a better
>> index. Some information is available here
>> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
>> There should be a more detailed document, but I don't remember where :)
>
> I see how it works; what I don't quite get is whether the "inverted
> index" you refer to is what we're working with here, or just what's in
> tsearchd?

Just tsearchd. We plan to implement an inverted index in the PostgreSQL core and then adapt tsearch2 to use it as an option for read-only archives.

>>> That's harder though (this algorithm does approximate it, sort of)
>>> and I haven't come up with an algorithm yet.
>>
>> Don't ask how hard we thought :)
>
> Well, looking at how other people are struggling with it, it's
> definitely a Hard Problem. One thing though: I don't think the picksplit
> algorithm as is really requires you to strictly have the longest
> distance, just something reasonably long. So I think the alternate
> algorithm I posted should produce equivalent results. No idea how to
> test it though...

You may try our development module 'gevel' to see how dense a signature is.
www=# \d v_pages
            Table "public.v_pages"
  Column   |       Type        | Modifiers
-----------+-------------------+-----------
 tid       | integer           | not null
 path      | character varying | not null
 body      | character varying |
 title     | character varying |
 di        | integer           |
 dlm       | integer           |
 de        | integer           |
 md5       | character(22)     |
 fts_index | tsvector          |
Indexes:
    "v_pages_pkey" PRIMARY KEY, btree (tid)
    "v_pages_path_key" UNIQUE, btree (path)
    "v_gist_key" gist (fts_index)

# select * from gist_print('v_gist_key') as t(level int, valid bool, a gtsvector) where level = 1;
 level | valid |               a
-------+-------+--------------------------------
     1 | t     | 1698 true bits, 318 false bits
     1 | t     | 1699 true bits, 317 false bits
     1 | t     | 1701 true bits, 315 false bits
     1 | t     | 1500 true bits, 516 false bits
     1 | t     | 1517 true bits, 499 false bits
(5 rows)

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
gevel is available from http://www.sai.msu.su/~megera/postgres/gist/

Oleg

On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:
> On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
>> Martijn, you're right! We want not only to split the page into very
>> different parts, but also not to increase the number of set bits in the
>> resulting signatures, which are the union (OR) of all signatures in a
>> part. We need not only fast index creation (thanks, Tom!), but a better
>> index. Some information is available here
>> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
>> There should be a more detailed document, but I don't remember where :)
>
> I see how it works; what I don't quite get is whether the "inverted
> index" you refer to is what we're working with here, or just what's in
> tsearchd?
>
>>> That's harder though (this algorithm does approximate it, sort of)
>>> and I haven't come up with an algorithm yet.
>>
>> Don't ask how hard we thought :)
>
> Well, looking at how other people are struggling with it, it's
> definitely a Hard Problem. One thing though: I don't think the picksplit
> algorithm as is really requires you to strictly have the longest
> distance, just something reasonably long. So I think the alternate
> algorithm I posted should produce equivalent results. No idea how to
> test it though...
>
> Have a nice day,

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Ron <rjpeace@earthlink.net> writes:
> At 07:23 PM 1/20/2006, Tom Lane wrote:
>> Well, we're trying to split an index page that's gotten full into
>> two index pages, preferably with approximately equal numbers of items in
>> each new page (this isn't a hard requirement though).

> Maybe we are over thinking this. What happens if we do the obvious
> and just make a new page and move the "last" n/2 items on the full
> page to the new page?

Search performance will go to hell in a handbasket :-(. We have to make at least some effort to split the page in a way that will allow searches to visit only one of the two child pages rather than both.

It's certainly true though that finding the furthest pair is not a necessary component of that. It's reasonable if you try to visualize the problem in 2D or 3D, but I'm not sure that that geometric intuition holds up in such a high-dimensional space as we have here.

regards, tom lane
On Sat, Jan 21, 2006 at 06:22:52PM +0300, Oleg Bartunov wrote:
>> I see how it works; what I don't quite get is whether the "inverted
>> index" you refer to is what we're working with here, or just what's in
>> tsearchd?
>
> just tsearchd. We plan to implement an inverted index in the PostgreSQL
> core and then adapt tsearch2 to use it as an option for read-only
> archives.

Hmm, had a look and a think about it, and I believe I see what you mean by an inverted index. I also think you're going to have a real exercise implementing it in Postgres, because Postgres indexes work on the basis of one tuple, one index entry, which I think your inverted index doesn't do. That said, I think GiST could be extended to support your case without too much difficulty. Interesting project though :)

BTW, given you appear to have a tsearch2 index with some real-world data, would you be willing to try some alternate picksplit algorithms to see if your gevel module shows any difference?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Sat, 21 Jan 2006, Tom Lane wrote:
> Ron <rjpeace@earthlink.net> writes:
>> At 07:23 PM 1/20/2006, Tom Lane wrote:
>>> Well, we're trying to split an index page that's gotten full into
>>> two index pages, preferably with approximately equal numbers of items in
>>> each new page (this isn't a hard requirement though).
>
>> Maybe we are over thinking this. What happens if we do the obvious
>> and just make a new page and move the "last" n/2 items on the full
>> page to the new page?
>
> Search performance will go to hell in a handbasket :-(. We have to make
> at least some effort to split the page in a way that will allow searches
> to visit only one of the two child pages rather than both.

Does the order of the items within a given page matter? If not, this sounds like a partial quicksort algorithm would work. You don't need to fully sort things, but you do want to make sure that everything on the first page is 'less' than everything on the second page, so you can skip passes that don't cross a page boundary.

> It's certainly true though that finding the furthest pair is not a
> necessary component of that. It's reasonable if you try to visualize
> the problem in 2D or 3D, but I'm not sure that that geometric intuition
> holds up in such a high-dimensional space as we have here.

I will say that I'm not understanding the problem well enough to understand the multi-dimensional nature of this problem.

David Lang
David Lang <dlang@invendra.net> writes:
> On Sat, 21 Jan 2006, Tom Lane wrote:
>> Ron <rjpeace@earthlink.net> writes:
>>> Maybe we are over thinking this. What happens if we do the obvious
>>> and just make a new page and move the "last" n/2 items on the full
>>> page to the new page?
>>
>> Search performance will go to hell in a handbasket :-(. We have to make
>> at least some effort to split the page in a way that will allow searches
>> to visit only one of the two child pages rather than both.

> does the order of the items within a given page matter?

AFAIK the items within a GIST index page are just stored in insertion order (which is exactly why Ron's suggestion above doesn't work well). There's no semantic significance to it. It's only when we have to split the page that we need to classify the items more finely.

regards, tom lane
At 01:27 PM 1/21/2006, Tom Lane wrote:
> Ron <rjpeace@earthlink.net> writes:
>> At 07:23 PM 1/20/2006, Tom Lane wrote:
>>> Well, we're trying to split an index page that's gotten full into
>>> two index pages, preferably with approximately equal numbers of items in
>>> each new page (this isn't a hard requirement though).
>
>> Maybe we are over thinking this. What happens if we do the obvious
>> and just make a new page and move the "last" n/2 items on the full
>> page to the new page?
>
> Search performance will go to hell in a handbasket :-(. We have to make
> at least some effort to split the page in a way that will allow searches
> to visit only one of the two child pages rather than both.
>
> It's certainly true though that finding the furthest pair is not a
> necessary component of that. It's reasonable if you try to visualize
> the problem in 2D or 3D, but I'm not sure that that geometric intuition
> holds up in such a high-dimensional space as we have here.

After reading the various papers available on GiST and RD trees, I think I have a decent suggestion.

Since RD tree keys contain the keys of their descendents/components, they are basically a representation of a depth first search. This can be useful for intra-document searches. OTOH, inter-document searches are going to be more akin to breadth first searches using RD trees.

Thus my suggestion is that we maintain =two= index structures for text data.

The first contains as many keys and all their descendents as possible. When we can no longer fit a specific complete "path" into a page, we start a new one, trying to keep complete top-level-to-leaf key sets within a page. This will minimize paging during intra-document searches.

The second index keeps siblings within a page, and avoids putting parents or children within a page unless the entire depth first search can be kept within the page in addition to the siblings present. This will minimize paging during inter-document searches.

Traditional B-tree ordering methods can be used to define the ordering/placement of pages within each index, which will minimize head seeks to find the correct page to scan.

Since the criteria for putting a key within a page or starting a new page are simple, performance for those tasks should be O(1).

The price is the extra space used for two indexes instead of one, but at first glance that seems well worth it.

Comments?
Ron
Ron wrote:
> At 01:27 PM 1/21/2006, Tom Lane wrote:
>> Ron <rjpeace@earthlink.net> writes:
>>> At 07:23 PM 1/20/2006, Tom Lane wrote:
>>>> Well, we're trying to split an index page that's gotten full into
>>>> two index pages, preferably with approximately equal numbers of items in
>>>> each new page (this isn't a hard requirement though).
>
> After reading the various papers available on GiST and RD trees, I
> think I have a decent suggestion.

I for one don't understand what your suggestion has to do with the problem at hand ... not that I have a better one myself.

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
At 08:00 PM 1/26/2006, Alvaro Herrera wrote:
> Ron wrote:
>> At 01:27 PM 1/21/2006, Tom Lane wrote:
>>> Ron <rjpeace@earthlink.net> writes:
>>>> At 07:23 PM 1/20/2006, Tom Lane wrote:
>>>>> Well, we're trying to split an index page that's gotten full into
>>>>> two index pages, preferably with approximately equal numbers of
>>>>> items in each new page (this isn't a hard requirement though).
>
>> After reading the various papers available on GiST and RD trees, I
>> think I have a decent suggestion.
>
> I for one don't understand what your suggestion has to do with the
> problem at hand ... not that I have a better one myself.

We have two problems here.

The first is that the page splitting code for these indexes currently has O(N^2) performance.

The second is that whatever solution we use for this functionality, we still need good performance during searches that use the index. It's not clear that the solutions we've discussed so far for splitting index pages will result in good search performance.

My suggestion is intended to address both issues. If I'm right, it helps obtain high performance during searches while allowing the index page splitting code to be O(1).

Ron
Ron <rjpeace@earthlink.net> writes:
> We have two problems here.
> The first is that the page splitting code for these indexes currently
> has O(N^2) performance.
> The second is that whatever solution we do use for this functionality,
> we still need good performance during searches that use the index.

No, unfortunately that's not the problem that needs to be solved. The problem is figuring out WHICH records to put in the "left" and "right" trees once you split them. If you can figure that out, then your suggestion (and perhaps other techniques) could be useful.

The problem boils down to this: You have a whole bunch of essentially random bitmaps. You have two buckets. You want to put half of the bitmaps in one bucket, and half in the other bucket, and when you get through, you want all of the bitmaps in each bucket to be maximally similar to each other, and maximally dissimilar to the ones in the other bucket.

That way, when you OR all the bitmaps in each bucket together to build the bitmap for the left and right child nodes of the tree, you'll get maximum separation -- the chances that you'll have to descend BOTH the left and right nodes of the tree are minimized.

Unfortunately, this problem is very likely in the set of NP-complete problems, i.e. like the famous "Traveling Salesman Problem," there's no known algorithm that will give the answer in a reasonable time. In this case, "reasonable" would be measured in milliseconds to seconds, but in fact an actual "perfect" split of a set of bitmaps probably can't be computed in the lifetime of the universe for more than a few hundred bitmaps.

That's the problem that's being discussed: How do you decide which bitmaps go in each of the two buckets? Any solution will necessarily be imperfect, a pragmatic algorithm that gives an imperfect, but acceptable, answer.

As I mentioned earlier, chemists make extensive use of bitmaps to categorize and group molecules. They use Tanimoto or Tversky similarity metrics (Tanimoto is a special case of Tversky), because it's extremely fast to compare two bitmaps, and the score is highly correlated with the number of bits the two bitmaps have in common.

But even with a fast "distance" metric like Tanimoto, there's still no easy way to decide which bucket to put each bitmap into.

Craig
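As an aside for readers unfamiliar with the metric Craig names: Tanimoto similarity of two bitmaps is |A AND B| / |A OR B|. A minimal sketch on plain integers (illustrative only, not tsearch2 code):

```python
def popcount(x):
    """Number of set bits in a non-negative integer bitmap."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Tanimoto similarity: shared bits divided by total distinct bits.
    1.0 means identical bitmaps; near 0.0 means almost disjoint."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0

print(tanimoto(0b1100, 0b1010))  # one shared bit out of three distinct
```

The cost is two bitwise operations and two popcounts, which is why it is so cheap to compare fingerprints pairwise; the hard part, as Craig says, is the bucket assignment, not the metric.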
You seem to have missed my point. I just gave a very clear description of how to "decide which bitmaps go in each of the two buckets" by reformulating the question into "decide which bitmaps go in each of =four= buckets".

The intent is to have two indexes, one optimized for one common class of searches, and the other optimized for another common class of searches. By decomposing the optimization problem into two simpler problems, the hope is that we address all the issues reasonably simply while still getting decent performance.

Nothing is free. The price we pay, and it is significant, is that we now have two indexes where before we had only one.

Ron

At 09:29 PM 1/26/2006, Craig A. James wrote:
>Ron <rjpeace@earthlink.net> writes:
>>We have two problems here.
>>The first is that the page splitting code for these indexes
>>currently has O(N^2) performance.
>>The second is that whatever solution we do use for this
>>functionality, we still need good performance during searches that
>>use the index.
>
>No, unfortunately that's not the problem that needs to be solved.
>
>The problem is figuring out WHICH records to put in the "left" and
>"right" trees once you split them. If you can figure that out, then
>your suggestion (and perhaps other techniques) could be useful.
>
>The problem boils down to this: You have a whole bunch of
>essentially random bitmaps. You have two buckets. You want to put
>half of the bitmaps in one bucket, and half in the other bucket, and
>when you get through, you want all of the bitmaps in each bucket to
>be maximally similar to each other, and maximally dissimilar to the
>ones in the other bucket.
>
>That way, when you OR all the bitmaps in each bucket together to
>build the bitmap for the left and right child nodes of the tree,
>you'll get maximum separation -- the chances that you'll have to
>descend BOTH the left and right nodes of the tree are minimized.
>
>Unfortunately, this problem is very likely in the set of NP-complete
>problems, i.e. like the famous "Traveling Salesman Problem," you can
>prove there's no algorithm that will give the answer in a reasonable
>time. In this case, "reasonable" would be measured in milliseconds
>to seconds, but in fact an actual "perfect" split of a set of
>bitmaps probably can't be computed in the lifetime of the universe
>for more than a few hundred bitmaps.
>
>That's the problem that's being discussed: How do you decide which
>bitmaps go in each of the two buckets? Any solution will
>necessarily be imperfect, a pragmatic algorithm that gives an
>imperfect, but acceptable, answer.
>
>As I mentioned earlier, chemists make extensive use of bitmaps to
>categorize and group molecules. They use Tanimoto or Tversky
>similarity metrics (Tanimoto is a special case of Tversky), because
>it's extremely fast to compare two bitmaps, and the score is highly
>correlated with the number of bits the two bitmaps have in common.
>
>But even with a fast "distance" metric like Tanimoto, there's still
>no easy way to decide which bucket to put each bitmap into.
>
>Craig
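One pragmatic (and, per Craig's argument, necessarily imperfect) answer to "which bucket?" is a greedy assignment in the spirit of Guttman's classic R-tree split: pick two seed bitmaps, then give each remaining bitmap to the bucket whose OR'ed union would gain the fewest new set bits. This is a hypothetical sketch on toy integers, not the actual gtsvector_picksplit code:

```python
def popcount(x):
    """Number of set bits in a non-negative integer bitmap."""
    return bin(x).count("1")

def greedy_split(bitmaps, seed_a, seed_b):
    """Assign each bitmap to the bucket whose OR'ed union grows least,
    keeping both resulting downlink signatures as sparse as possible."""
    ua, ub = bitmaps[seed_a], bitmaps[seed_b]
    left, right = [seed_a], [seed_b]
    for i, s in enumerate(bitmaps):
        if i in (seed_a, seed_b):
            continue
        grow_a = popcount(ua | s) - popcount(ua)  # new bits if put left
        grow_b = popcount(ub | s) - popcount(ub)  # new bits if put right
        if grow_a <= grow_b:
            left.append(i)
            ua |= s
        else:
            right.append(i)
            ub |= s
    return left, right, ua, ub

# Toy 8-bit signatures; the seeds come from the two obvious clusters.
sigs = [0b11000001, 0b00110100, 0b11000010, 0b00111000]
print(greedy_split(sigs, 0, 1))
```

This directly optimizes union sparsity rather than seed distance; its weakness is that the result depends on seed choice and insertion order, which is exactly the kind of imperfection Craig predicts for any tractable heuristic.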
At 11:13 PM 1/26/2006, Craig A. James wrote:
> Ron,
>
> I'll write to you privately, because these discussions can get messy
> "in public".

I'm responding to this missive publicly in an attempt to help the discussion along. It is not my usual practice to respond to private messages publicly, but this seems a good case for an exception.

>> You seem to have missed my point. I just gave a very clear
>> description of how to "decide which bitmaps go in each of the two
>> buckets" by reformulating the question into "decide which bitmaps
>> go in each of =four= buckets".
>
> Sorry to disagree, but here's the problem. It's not whether you put
> them into two, or four, or N buckets. The problem is, how do you
> categorize them to begin with, so that you have some reasonable
> criterion for which item goes in which bucket? THAT is the hard
> problem, not whether you have two or four buckets.

Agreed ... and I've given the answer to "how do you categorize them" using a general property of RD trees, which should result in "a reasonable criterion for which item goes in which bucket" when used for text searching.

The definition of RD tree keys as either "atomic" (non-decomposable) or "molecular" (containing the keys of their descendents) is the one source of our current problems figuring out how to split them and, if I'm correct, a hint as to how to solve the splitting problem in O(1) time while helping to foster high performance during searches.

> Earlier, you wrote:
>> Traditional B-tree ordering methods can be used to define the
>> ordering/placement of pages within each index, which will minimize
>> head seeks to find the correct page to scan.
>> Since the criteria for putting a key within a page or starting a
>> new page are simple, performance for those tasks should be O(1).
>
> What are the "traditional B-tree ordering methods"? That's the
> issue, right there. The problem is that bitmaps HAVE NO ORDERING
> METHOD. You can't sort them numerically or alphabetically.
The =bitmaps= have no ordering method.  =Pages= of bitmaps MUST have an ordering method, or we have no idea which page to look at when searching for a key.

Treating the "root" bitmaps (those that may have descendants but have no parents) as numbers, and ordering the pages using B-tree creation methods with those numbers as keys, is a simple way to create a balanced data structure with high fan-out.  IOW, a recipe for finding the proper page(s) to scan using minimal seeks.

>Take a crowd of people.  It's easy to divide them in half by names:
>just alphabetize and split the list in half.  That's what your
>solution would improve on.
>
>But imagine I gave you a group of people and told you to put them
>into two rooms, such that when you are through, the people in each
>room are maximally similar to each other and maximally dissimilar to
>the people in the other room.  How would you do it?
>
>First of all, you'd have to define what "different" and "same"
>are.  Is your measure based on skin color, hair color, age, height,
>weight, girth, intelligence, speed, endurance, body odor, ...?
>Suppose I tell you, "All of those".  You have to sort the people
>so that your two groups are separated such that the group
>differences are maximized in this N-dimensional
>space.  Computationally, it's a nearly impossible problem.

I'm =changing= the problem using the semantics of RD trees.  Using an RD tree representation, we'd create and assign a key for each person that ranked them against everyone else for each of the metrics we decided to differentiate on.  Then we start to form trees of keys to these people by creating "parent" keys as roots that contain the union of everyone with the same, or possibly similar, value for some quality.  By iterating this process, we end up with a bottom-up construction method for an RD tree whose root key will be the union of all the keys representing these people.
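[To make the "order pages B-tree style by their root keys" idea concrete, here is a toy sketch.  The page size, helper names, and packing scheme are illustrative assumptions of mine, not anything from the GiST code:]

```python
from bisect import bisect_right

PAGE_SIZE = 4  # entries per page; a toy value, real pages hold far more


def build_pages(root_keys):
    """Order 'root' bitmaps by their integer value and pack them into
    fixed-size pages, recording each page's lowest key as a separator
    (the role separator keys play in a B-tree's internal levels)."""
    ordered = sorted(root_keys)
    pages = [ordered[i:i + PAGE_SIZE] for i in range(0, len(ordered), PAGE_SIZE)]
    separators = [page[0] for page in pages]
    return pages, separators


def find_page(separators, key):
    """Binary-search the separators to pick the single page to scan."""
    return max(bisect_right(separators, key) - 1, 0)
```

[With the pages ordered this way, locating the page for any key is one binary search over the separators, which is the "minimal seeks" property being claimed.]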
If this is one data structure, we end up with an efficient and effective way of answering not only Boolean but also ranking- and similarity-type queries.

The problem comes when we have to split this monolithic DS into pieces for best performance.  As many have noted, there is no simple way to decide how to do such a thing.  OTOH, we =do= have the knowledge of how RD trees are built and what their keys represent, and we know that queries are going to tend strongly to either
a) traverse the path from parent to child (depth first), or
b) find all siblings with (dis)similar characteristics (breadth first), or
c) contain a mixture of a) and b).

So I'm suggesting that conceptually we clone the original RD tree and split each clone according to a different method.  Method A results in pages that contain as many complete depth-first paths from root to leaf as possible on each page.  Method B results in pages that contain as many siblings as possible per page.  ...and we use the appropriate index during each type of query or query part.

In return for using 2x as much space, we should have a general method that is O(1) for decomposing RD trees in such a way as to support high performance during searches.

>Here's a much simpler version of the problem.  Suppose I give you
>1000 numbers, and tell you to divide them in half so that the two
>groups have the smallest standard deviation within each group
>possible, and the two average values of the groups of numbers are
>the farthest apart possible.  Pretty easy, right?
>
>Now do it where "distance" is evaluated modulo(10), that is, 1 and 9
>are closer together than 1 and 3.  Now try it -- you'll find that
>this problem is almost impossible to solve.

This is orthogonal to the discussion at hand, since the above is neither akin to text searching nor best done with RD trees, and text searching is exactly what this discussion is about.  We don't have to solve a general problem for all domains.
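[A rough sketch of the two clone-splitting methods described above, on a toy tree held as a dict of children.  The function names, page sizes, and the idea of paginating node labels rather than real RD-tree keys are all my illustrative assumptions:]

```python
def pages_by_path(tree, root, page_size):
    """Method A: collect root-to-leaf paths depth first, then pack whole
    paths onto pages so a parent-to-child traversal stays on one page.
    (A path longer than page_size still gets its own page; a sketch.)"""
    paths, stack = [], [(root, [root])]
    while stack:
        node, path = stack.pop()
        children = tree.get(node, [])
        if not children:
            paths.append(path)
        for child in reversed(children):
            stack.append((child, path + [child]))
    pages, current = [], []
    for path in paths:
        if current and len(current) + len(path) > page_size:
            pages.append(current)
            current = []
        current.extend(path)
    if current:
        pages.append(current)
    return pages


def pages_by_level(tree, root, page_size):
    """Method B: breadth-first order, so siblings land on the same page."""
    order, queue = [], [root]
    while queue:
        node = queue.pop(0)
        order.append(node)
        queue.extend(tree.get(node, []))
    return [order[i:i + page_size] for i in range(0, len(order), page_size)]
```

[Running both over the same tree gives the two clones: one whose pages serve depth-first query parts, one whose pages serve sibling scans, at the cost of storing everything twice.]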
We only have to solve it for the specific domain of text search using the specific DS of RD trees.

>The problem is that, as database designers, we're used to text and
>numbers, which have an inherent order.  Bitmaps have no ordering --
>you can't say one is "greater than" or "less than" the other.  All
>you can do is determine that A and B are "more similar" or "less
>similar" than A and C.

Text and numbers only have an order because we deem them to.  There is even a very common default order we tend to use.  But even in the trivial case of ranking, we've all said that "1" is better than "2" in one situation and "2" is better than "1" in another.

In the case of text searching, the bitmaps represent where to find legomena, AKA specific tokens -- particularly hapax legomena, AKA unique tokens.  Hapax legomena are particularly important because they are maximally efficient at driving the search needed to answer a query.  While absolute hapax legomena are great for quickly pruning things within a document or document region, relative hapax legomena can do the same thing when searching among multiple documents or document regions.  The two indexes I'm suggesting are designed to take advantage of this general property of text searching.

Hopefully this clarifies things and motivates a better discussion?

Ron
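[As a concrete illustration of why hapax legomena prune a search so well, a few lines of Python extracting the tokens that occur exactly once in a token stream -- each such token pins the search down to a single location:]

```python
from collections import Counter


def hapax_legomena(tokens):
    """Return the set of tokens occurring exactly once in the stream."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n == 1}


text = "to be or not to be that is the question"
print(sorted(hapax_legomena(text.split())))
# 'to' and 'be' occur twice; every other token exactly once
```

[A token that is hapax within one document is an "absolute" discriminator there; a token that is merely rare across a collection plays the same role "relatively" when choosing which documents to scan at all.]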