Thread: Re: [GENERAL] Creation of tsearch2 index is very slow

From:
Tom Lane
Date:

[ thread moved to pgsql-performance ]

I've obtained a gprof profile on Stephan's sample case (many thanks for
providing the data, Stephan).  The command is
    CREATE INDEX foo ON publications_test USING gist (fti_title);
where fti_title is a tsvector column.  There are 236984 rows in the
table, most with between 4 and 10 words in fti_title.
sum(length(fti_title)) yields 1636202 ... not sure if this is a
relevant measure, however.

Using CVS tip with a fairly vanilla configuration (including
--enable-cassert), here are all the hotspots down to the 1% level:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 20.19      1.90     1.90   588976     0.00     0.00  gistchoose
 19.02      3.69     1.79   683471     0.00     0.00  XLogInsert
  5.95      4.25     0.56  3575135     0.00     0.00  LWLockAcquire
  4.46      4.67     0.42  3579005     0.00     0.00  LWLockRelease
  4.14      5.06     0.39  3146848     0.00     0.00  AllocSetAlloc
  3.72      5.41     0.35   236984     0.00     0.00  gistdoinsert
  3.40      5.73     0.32   876047     0.00     0.00  hash_search
  2.76      5.99     0.26  3998576     0.00     0.00  LockBuffer
  2.28      6.21     0.22 11514275     0.00     0.00  gistdentryinit
  1.86      6.38     0.18   841757     0.00     0.00  UnpinBuffer
  1.81      6.55     0.17 12201023     0.00     0.00  FunctionCall1
  1.81      6.72     0.17   237044     0.00     0.00  AllocSetCheck
  1.49      6.86     0.14   236984     0.00     0.00  gistmakedeal
  1.49      7.00     0.14 10206985     0.00     0.00  FunctionCall3
  1.49      7.14     0.14  1287874     0.00     0.00  MemoryContextAllocZero
  1.28      7.26     0.12   826179     0.00     0.00  PinBuffer
  1.17      7.37     0.11   875785     0.00     0.00  hash_any
  1.17      7.48     0.11  1857292     0.00     0.00  MemoryContextAlloc
  1.17      7.59     0.11   221466     0.00     0.00  PageIndexTupleDelete
  1.06      7.69     0.10  9762101     0.00     0.00  gistpenalty

Clearly, one thing that would be worth doing is suppressing the WAL
traffic when possible, as we already do for btree builds.  It seems
that gistchoose may have some internal inefficiency too --- I haven't
looked at the code yet.  The other thing that jumps out is the very
large numbers of calls to gistdentryinit, FunctionCall1, FunctionCall3.
Some interesting parts of the calls/calledby graph are:

-----------------------------------------------
                0.35    8.07  236984/236984      gistbuildCallback [14]
[15]    89.5    0.35    8.07  236984         gistdoinsert [15]
                0.14    3.55  236984/236984      gistmakedeal [16]
                1.90    0.89  588976/588976      gistchoose [17]
                0.07    0.83  825960/841757      ReadBuffer [19]
                0.09    0.10  825960/1287874     MemoryContextAllocZero [30]
                0.12    0.05 1888904/3998576     LockBuffer [29]
                0.13    0.00  825960/3575135     LWLockAcquire [21]
                0.10    0.00  825960/3579005     LWLockRelease [26]
                0.06    0.00  473968/3146848     AllocSetAlloc [27]
                0.03    0.00  473968/1857292     MemoryContextAlloc [43]
                0.02    0.00  825960/1272423     gistcheckpage [68]
-----------------------------------------------
                0.14    3.55  236984/236984      gistdoinsert [15]
[16]    39.2    0.14    3.55  236984         gistmakedeal [16]
                1.20    0.15  458450/683471      XLogInsert [18]
                0.01    0.66  224997/224997      gistxlogInsertCompletion [20]
                0.09    0.35  444817/444817      gistgetadjusted [23]
                0.08    0.17  456801/456804      formUpdateRdata [32]
                0.17    0.01  827612/841757      UnpinBuffer [35]
                0.11    0.00  221466/221466      PageIndexTupleDelete [42]
                0.02    0.08  456801/460102      gistfillbuffer [45]
                0.06    0.04    1649/1649        gistSplit [46]
                0.08    0.00  685099/3579005     LWLockRelease [26]
                0.03    0.05  446463/446463      gistFindCorrectParent [50]
                0.04    0.02  685099/3998576     LockBuffer [29]
                0.04    0.00    1649/1649        gistextractbuffer [58]
                0.03    0.00  460102/460121      write_buffer [66]
                0.02    0.00  825960/826092      ReleaseBuffer [69]
                0.02    0.00  221402/221402      gistadjscans [82]
                0.00    0.00    1582/1582        gistunion [131]
                0.00    0.00    1649/1649        formSplitRdata [147]
                0.00    0.00    1649/1649        gistjoinvector [178]
                0.00    0.00       3/3           gistnewroot [199]
                0.00    0.00  458450/461748      gistnospace [418]
                0.00    0.00  458450/458450      WriteNoReleaseBuffer [419]
                0.00    0.00    1652/1671        WriteBuffer [433]
-----------------------------------------------
                1.90    0.89  588976/588976      gistdoinsert [15]
[17]    29.7    1.90    0.89  588976         gistchoose [17]
                0.25    0.17 9762101/10892174     FunctionCall3 <cycle 1> [38]
                0.18    0.14 9762101/11514275     gistdentryinit [28]
                0.10    0.00 9762101/9762101     gistpenalty [47]
                0.04    0.02  588976/1478610     gistDeCompressAtt [39]
-----------------------------------------------
                0.00    0.00       1/683471      gistbuild [12]
                0.00    0.00       1/683471      log_heap_update [273]
                0.00    0.00       1/683471      RecordTransactionCommit [108]
                0.00    0.00       1/683471      smgrcreate [262]
                0.00    0.00       3/683471      gistnewroot [199]
                0.00    0.00       5/683471      heap_insert [116]
                0.00    0.00      12/683471      _bt_insertonpg [195]
                0.59    0.07  224997/683471      gistxlogInsertCompletion [20]
                1.20    0.15  458450/683471      gistmakedeal [16]
[18]    21.4    1.79    0.22  683471         XLogInsert [18]
                0.11    0.00  683471/3575135     LWLockAcquire [21]
                0.08    0.00  687340/3579005     LWLockRelease [26]
                0.03    0.00  683471/683471      GetCurrentTransactionIdIfAny [65]
                0.01    0.00   15604/15604       AdvanceXLInsertBuffer [111]
                0.00    0.00       3/10094       BufferGetBlockNumber [95]
                0.00    0.00    3869/3870        XLogWrite [281]
                0.00    0.00    3870/3871        LWLockConditionalAcquire [428]
                0.00    0.00       3/3           BufferGetFileNode [611]
-----------------------------------------------
                0.00    0.00    3164/11514275     gistunion [131]
                0.01    0.00  270400/11514275     gistSplit [46]
                0.03    0.02 1478610/11514275     gistDeCompressAtt [39]
                0.18    0.14 9762101/11514275     gistchoose [17]
[28]     4.0    0.22    0.16 11514275         gistdentryinit [28]
                0.16    0.00 11514275/12201023     FunctionCall1 [36]
-----------------------------------------------
                0.00    0.00      67/12201023     index_endscan <cycle 1> [167]
                0.01    0.00  686681/12201023     gistcentryinit [62]
                0.16    0.00 11514275/12201023     gistdentryinit [28]
[36]     1.8    0.17    0.00 12201023         FunctionCall1 [36]
                0.00    0.00      67/67          btendscan [231]
                0.00    0.00 12200956/22855929     data_start [414]
-----------------------------------------------
                                  67             index_beginscan_internal <cycle 1> [169]
                0.01    0.01  444817/10892174     gistgetadjusted [23]
                0.25    0.17 9762101/10892174     gistchoose [17]
[38]     1.5    0.14    0.00 10206985         FunctionCall3 <cycle 1> [38]
                0.00    0.00 10206918/22855929     data_start [414]
                0.00    0.00      67/67          btbeginscan [486]
                                  67             RelationGetIndexScan <cycle 1> [212]

Now that we have some data, we can start to think about how to improve
matters ...

            regards, tom lane

From:
Martijn van Oosterhout
Date:

On Fri, Jan 20, 2006 at 02:14:29PM -0500, Tom Lane wrote:
> [ thread moved to pgsql-performance ]
>
> I've obtained a gprof profile on Stephan's sample case (many thanks for
> providing the data, Stephan).  The command is

<snip>

Something I'm missing is the calls to tsearch functions. I'm not 100%
familiar with gprof, but is it possible those costs have been added
somewhere else because it's in a shared library? Perhaps the costs went
into FunctionCall1/3?

Have a nice day,
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

From:
Tom Lane
Date:

Martijn van Oosterhout <> writes:
> Something I'm missing is the calls to tsearch functions. I'm not 100%
> familiar with gprof, but is it possible those costs have been added
> somewhere else because it's in a shared library? Perhaps the costs went
> into FunctionCall1/3?

I think that the tsearch functions must be the stuff charged as
"data_start" (which is not actually any symbol in our source code).
That is showing as being called by FunctionCallN which is what you'd
expect.

If the totals given by gprof are correct, then it's down in the noise.
I don't think I trust that too much ... but I don't see anything in the
gprof manual about how to include a dynamically loaded library in the
profile.  (I did compile tsearch2 with -pg, but that's evidently not
enough.)

I'll see if I can link tsearch2 statically to resolve this question.

            regards, tom lane

From:
Martijn van Oosterhout
Date:

On Fri, Jan 20, 2006 at 03:21:45PM -0500, Tom Lane wrote:
> If the totals given by gprof are correct, then it's down in the noise.
> I don't think I trust that too much ... but I don't see anything in the
> gprof manual about how to include a dynamically loaded library in the
> profile.  (I did compile tsearch2 with -pg, but that's evidently not
> enough.)

There is some mention on the web of an environment variable you can
set: LD_PROFILE=<libname>

These pages seem relevant:
http://sourceware.org/ml/binutils/2002-04/msg00047.html
http://www.scit.wlv.ac.uk/cgi-bin/mansec?1+gprof

It's weird how some man pages for gprof describe this feature, but the
one on my local system doesn't mention it.

> I'll see if I can link tsearch2 statically to resolve this question.

That'll work too...
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

From:
Tom Lane
Date:

Well, I feel like a fool, because I failed to notice that the total
runtime shown in that profile wasn't anywhere close to the actual wall
clock time.  gprof is indeed simply not counting the time spent in
dynamically-linked code.  With tsearch2 statically linked into the
backend, a more believable picture emerges:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ks/call  Ks/call  name
 98.96   1495.93  1495.93 33035195     0.00     0.00  hemdistsign
  0.27   1500.01     4.08 10030581     0.00     0.00  makesign
  0.11   1501.74     1.73   588976     0.00     0.00  gistchoose
  0.10   1503.32     1.58   683471     0.00     0.00  XLogInsert
  0.05   1504.15     0.83   246579     0.00     0.00  sizebitvec
  0.05   1504.93     0.78   446399     0.00     0.00  gtsvector_union
  0.03   1505.45     0.52  3576475     0.00     0.00  LWLockRelease
  0.03   1505.92     0.47     1649     0.00     0.00  gtsvector_picksplit
  0.03   1506.38     0.47  3572572     0.00     0.00  LWLockAcquire
  0.02   1506.74     0.36   444817     0.00     0.00  gtsvector_same
  0.02   1507.09     0.35  4077089     0.00     0.00  AllocSetAlloc
  0.02   1507.37     0.28   236984     0.00     0.00  gistdoinsert
  0.02   1507.63     0.26   874195     0.00     0.00  hash_search
  0.02   1507.89     0.26  9762101     0.00     0.00  gtsvector_penalty
  0.01   1508.08     0.19   236984     0.00     0.00  gistmakedeal
  0.01   1508.27     0.19   841754     0.00     0.00  UnpinBuffer
  0.01   1508.45     0.18 22985469     0.00     0.00  hemdistcache
  0.01   1508.63     0.18  3998572     0.00     0.00  LockBuffer
  0.01   1508.81     0.18   686681     0.00     0.00  gtsvector_compress
  0.01   1508.98     0.17 11514275     0.00     0.00  gistdentryinit

So we gotta fix hemdistsign ...

            regards, tom lane

From:
Martijn van Oosterhout
Date:

On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote:
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  Ks/call  Ks/call  name
>  98.96   1495.93  1495.93 33035195     0.00     0.00  hemdistsign

<snip>

> So we gotta fix hemdistsign ...

lol! Yeah, I guess so. Pretty nasty loop. LOOPBIT will iterate 8*63=504
times and it's going to do silly bit handling on each and every
iteration.

Given that all it's doing is counting bits, a simple fix would be to
loop over bytes, use XOR and count ones. For extreme speedup create a
lookup table with 256 entries to give you the answer straight away...
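[A minimal sketch of that byte-wise XOR plus 256-entry lookup-table idea. The names `hemdist_bytes`, `popcount_table`, and the `SIGLEN` of 63 bytes are illustrative assumptions, not the actual tsearch2 code:]

```c
#include <stdint.h>

#define SIGLEN 63               /* assumed signature size in bytes */

/* popcount_table[b] = number of one-bits in byte value b */
static uint8_t popcount_table[256];

static void
init_popcount_table(void)
{
    int     i;

    for (i = 0; i < 256; i++)
    {
        int     b = i,
                n = 0;

        while (b)
        {
            n += b & 1;
            b >>= 1;
        }
        popcount_table[i] = (uint8_t) n;
    }
}

/*
 * Hamming distance between two signatures: XOR corresponding bytes,
 * then count the differing bits with one table lookup per byte,
 * instead of one loop iteration per bit.
 */
static int
hemdist_bytes(const uint8_t *a, const uint8_t *b)
{
    int     i,
            dist = 0;

    for (i = 0; i < SIGLEN; i++)
        dist += popcount_table[a[i] ^ b[i]];
    return dist;
}
```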

Have a nice day,
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 10:37:54PM +0100, Martijn van Oosterhout wrote:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

For extra obfuscation:

  unsigned v = (unsigned)c;
  int num_bits = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

(More more-or-less intelligent options at
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive :-) )

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
Tom Lane
Date:

Martijn van Oosterhout <> writes:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

Yeah, I just finished doing that and got about a 20x overall speedup
(30-some seconds to build the index instead of 10 minutes).  However,
hemdistsign is *still* 70% of the runtime after doing that.  The problem
seems to be that gtsvector_picksplit is calling it an inordinate number
of times:

                0.53   30.02    1649/1649        FunctionCall2 <cycle 2> [19]
[20]    52.4    0.53   30.02    1649         gtsvector_picksplit [20]
               29.74    0.00 23519673/33035195     hemdistsign [18]
                0.14    0.00 22985469/22985469     hemdistcache [50]
                0.12    0.00  268480/10030581     makesign [25]
                0.02    0.00  270400/270400      fillcache [85]
                0.00    0.00    9894/4077032     AllocSetAlloc [34]
                0.00    0.00    9894/2787477     MemoryContextAlloc [69]

(hemdistcache calls hemdistsign --- I think gprof is doing something
funny with tail-calls here, and showing hemdistsign as directly called
from gtsvector_picksplit when control really arrives through hemdistcache.)

The bulk of the problem is clearly in this loop, which performs O(N^2)
comparisons to find the two entries that are furthest apart in hemdist
terms:

    for (k = FirstOffsetNumber; k < maxoff; k = OffsetNumberNext(k))
    {
        for (j = OffsetNumberNext(k); j <= maxoff; j = OffsetNumberNext(j))
        {
            if (k == FirstOffsetNumber)
                fillcache(&cache[j], GETENTRY(entryvec, j));

            size_waste = hemdistcache(&(cache[j]), &(cache[k]));
            if (size_waste > waste)
            {
                waste = size_waste;
                seed_1 = k;
                seed_2 = j;
            }
        }
    }

I wonder if there is a way to improve on that.

            regards, tom lane

From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> I wonder if there is a way to improve on that.

Ooh, the farthest pair problem (in an N-dimensional vector space, though).
I'm pretty sure problems like this have been studied quite extensively in the
literature, although perhaps not with the same norm. It's known under both
"farthest pair" and "diameter", and probably others. I'm fairly sure it
should be solvable in at least O(n log n).

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
Ron
Date:

At 05:16 PM 1/20/2006, Steinar H. Gunderson wrote:
>On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> > I wonder if there is a way to improve on that.
>
>Ooh, the farthest pair problem (in an N-dimensional vector space, though).
>I'm pretty sure problems like this have been studied quite extensively in the
>literature, although perhaps not with the same norm. It's known under both
>"farthest pair" and "diameter", and probably others. I'm fairly sure it
>should be solvable in at least O(n log n).

If the N-dimensional space is Euclidean (any <x, x+1> is the same
distance apart in dimension x), then finding the farthest pair can be
done in at least O(n).

If you do not want the actual distance and can create the proper data
structures, particularly if you can update them incrementally as you
generate pairs, it is often possible to solve this problem in O(lg n) or O(1).

I'll do some grinding.
Ron



From:
Martijn van Oosterhout
Date:

On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> (hemdistcache calls hemdistsign --- I think gprof is doing something
> funny with tail-calls here, and showing hemdistsign as directly called
> from gtsvector_picksplit when control really arrives through hemdistcache.)

It may be the compiler. All these functions are declared static, which
gives the compiler quite a bit of leeway to rearrange code.

> The bulk of the problem is clearly in this loop, which performs O(N^2)
> comparisons to find the two entries that are furthest apart in hemdist
> terms:

Ah. A while ago someone came onto the list asking about bit strings
indexing[1]. If I'd known tsearch worked like this I would have pointed
him to it. Anyway, before he went off to implement it he mentioned
"Jarvis-Patrick clustering", whatever that means.

Probably more relevant was this thread[2] on -hackers a while back with
pseudo-code[3].  How well it works I don't know, but it evidently worked
for him; he went away happy...

[1] http://archives.postgresql.org/pgsql-general/2005-11/msg00473.php
[2] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01067.php
[3] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01069.php

Hope this helps,
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

From:
Ron
Date:

At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote:
>On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote:
> >   %   cumulative   self              self     total
> >  time   seconds   seconds    calls  Ks/call  Ks/call  name
> >  98.96   1495.93  1495.93 33035195     0.00     0.00  hemdistsign
>
><snip>
>
> > So we gotta fix hemdistsign ...
>
>lol! Yeah, I guess so. Pretty nasty loop. LOOPBIT will iterate 8*63=504
>times and it's going to do silly bit handling on each and every
>iteration.
>
>Given that all it's doing is counting bits, a simple fix would be to
>loop over bytes, use XOR and count ones. For extreme speedup create a
>lookup table with 256 entries to give you the answer straight away...
For an even more extreme speedup, don't most modern CPUs have an asm
instruction that counts the bits (un)set (AKA "population counting")
in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
CPUs with SWAR instructions)?

Ron



From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote:
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?

None in the x86 series that I'm aware of, at least.

You have instructions for finding the highest set bit, though.

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
Tom Lane
Date:

Ron <> writes:
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?

Yeah, but fetching from a small constant table is pretty quick too;
I doubt it's worth getting involved in machine-specific assembly code
for this.  I'm much more interested in the idea of improving the
furthest-distance algorithm in gtsvector_picksplit --- if we can do
that, it'll probably drop the distance calculation down to the point
where it's not really worth the trouble to assembly-code it.

            regards, tom lane

From:
Martijn van Oosterhout
Date:

On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote:
> At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote:
> >Given that all it's doing is counting bits, a simple fix would be to
> >loop over bytes, use XOR and count ones. For extreme speedup create a
> >lookup table with 256 entries to give you the answer straight away...
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?

Quite possibly, though I wouldn't have the foggiest idea how to get the
C compiler to generate it.
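[For what it's worth, one portable way that would have been available with GCC is the `__builtin_popcount` builtin, which compiles to a popcount instruction where the target has one and to table/shift code otherwise. A hedged sketch, not something settled in this thread:]

```c
/*
 * Per-byte XOR plus GCC's __builtin_popcount.  Whether this beats a
 * hand-rolled 256-entry lookup table depends on the compiler and the
 * target CPU; it is shown only as the "let the compiler do it" option.
 */
static int
hemdist_popcount(const unsigned char *a, const unsigned char *b, int len)
{
    int     i,
            dist = 0;

    for (i = 0; i < len; i++)
        dist += __builtin_popcount((unsigned int) (a[i] ^ b[i]));
    return dist;
}
```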

Given that even a lookup table will get you pretty close to that with
plain C coding, I think that's quite enough for a function that really
is just a small part of a much larger system...

Better solution (as Tom points out): work out how to avoid calling it
so much in the first place... At the moment each call to
gtsvector_picksplit seems to call the distance function around 14262
times. Getting that down by an order of magnitude will help much much
more.

Have a nice day,
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 05:29:46PM -0500, Ron wrote:
> If the N-dimensional space is Euclidean (any <x, x+1> is the same
> distance apart in dimension x), then finding the farthest pair can be
> done in at least O(n).

That sounds a bit optimistic.

  http://portal.acm.org/ft_gateway.cfm?id=167217&type=pdf&coll=GUIDE&dl=GUIDE&CFID=66230761&CFTOKEN=72453878

is from 1993, but still it struggles with getting it down to O(n log n)
deterministically, for Euclidean 3-space, and our problem is not Euclidean
(although it still satisfies the triangle inequality, which sounds important
to me) and in a significantly higher dimension...

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote:
> Yeah, but fetching from a small constant table is pretty quick too;
> I doubt it's worth getting involved in machine-specific assembly code
> for this.  I'm much more interested in the idea of improving the
> furthest-distance algorithm in gtsvector_picksplit --- if we can do
> that, it'll probably drop the distance calculation down to the point
> where it's not really worth the trouble to assembly-code it.

For the record: Could we do with a less-than-optimal split here? In that
case, an extremely simple heuristic is:

  best = distance(0, 1)
  best_i = 0
  best_j = 1

  for i = 2..last:
      if distance(best_i, i) > best:
          best = distance(best_i, i)
          best_j = i
      else if distance(best_j, i) > best:
          best = distance(best_j, i)
          best_i = i

I've tested it on various data, and although it's definitely not _correct_,
it generally gets within 10%.
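[A literal C transcription of the heuristic above, for concreteness. It uses a 1-D integer "distance" purely for illustration; in the real problem `distance` would be the Hamming distance between cached signatures:]

```c
#include <stdlib.h>

/*
 * Linear-time far-pair heuristic: keep a candidate pair and, for each
 * new point, try extending the pair from either current endpoint.
 * Not guaranteed optimal, but typically close.
 */
static void
far_pair(const int *pts, int n, int *out_i, int *out_j)
{
    int     best_i = 0,
            best_j = 1,
            i;
    int     best = abs(pts[0] - pts[1]);    /* 1-D distance, for illustration */

    for (i = 2; i < n; i++)
    {
        int     di = abs(pts[best_i] - pts[i]);
        int     dj = abs(pts[best_j] - pts[i]);

        if (di > best)
        {
            best = di;          /* new pair is (best_i, i) */
            best_j = i;
        }
        else if (dj > best)
        {
            best = dj;          /* new pair is (i, best_j) */
            best_i = i;
        }
    }
    *out_i = best_i;
    *out_j = best_j;
}
```

One pass, two distance evaluations per point, versus N(N-1)/2 evaluations for the exhaustive loop.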

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> I wonder if there is a way to improve on that.

http://www.cs.uwaterloo.ca/~tmchan/slide_isaac.ps:

  The diameter problem has been studied extensively in the traditional model.
  Although O(n log n) algorithms have been given for d = 2 and d = 3, only
  slightly subquadratic algorithms are known for higher dimensions.

It doesn't mention a date, but it has references to papers as recent as 2004,
so I'm fairly sure nothing big has happened since then.

It sounds like we either want to go for an approximation, or just accept that
it's a lot of work to get it better than O(n^2). Or, of course, find some
special property of our problem that makes it easier than the general problem
:-)

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
Tom Lane
Date:

"Steinar H. Gunderson" <> writes:
> For the record: Could we do with a less-than-optimal split here?

Yeah, I was wondering the same.  The code is basically choosing two
"seed" values to drive the index-page split.  Intuitively it seems that
"pretty far apart" would be nearly as good as "absolute furthest apart"
for this purpose.

The cost of a less-than-optimal split would be paid on all subsequent
index accesses, though, so it's not clear how far we can afford to go in
this direction.

It's also worth considering that the entire approach is a heuristic,
really --- getting the furthest-apart pair of seeds doesn't guarantee
an optimal split as far as I can see.  Maybe there's some totally
different way to do it.

            regards, tom lane

From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
> It's also worth considering that the entire approach is a heuristic,
> really --- getting the furthest-apart pair of seeds doesn't guarantee
> an optimal split as far as I can see.  Maybe there's some totally
> different way to do it.

For those of us who don't know what tsearch2/gist is trying to accomplish
here, could you provide some pointers? :-) During my mini-literature-search
on Google, I've found various algorithms for locating clusters in
high-dimensional metric spaces[1]; some of it might be useful, but I might
just be misunderstanding what the real problem is.

[1] http://ieeexplore.ieee.org/iel5/69/30435/01401892.pdf?arnumber=1401892 ,
    for instance

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
Tom Lane
Date:

"Steinar H. Gunderson" <> writes:
> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>> It's also worth considering that the entire approach is a heuristic,
>> really --- getting the furthest-apart pair of seeds doesn't guarantee
>> an optimal split as far as I can see.  Maybe there's some totally
>> different way to do it.

> For those of us who don't know what tsearch2/gist is trying to accomplish
> here, could you provide some pointers? :-)

Well, we're trying to split an index page that's gotten full into two
index pages, preferably with approximately equal numbers of items in
each new page (this isn't a hard requirement though).  I think the true
figure of merit for a split is how often will subsequent searches have
to descend into *both* of the resulting pages as opposed to just one
--- the less often that is true, the better.  I'm not very clear on
what tsearch2 is doing with these bitmaps, but it looks like an upper
page's downlink has the union (bitwise OR) of the one-bits in the values
on the lower page, and you have to visit the lower page if this union
has a nonempty intersection with the set you are looking for.  If that's
correct, what you really want is to divide the values so that the unions
of the two sets have minimal overlap ... which seems to me to have
little to do with what the code does at present.
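[Under that reading, the figure of merit for a candidate split can be made concrete: OR together each side's signatures and count the bits present in both unions. A toy sketch with an assumed 8-byte signature; names are illustrative:]

```c
#include <stdint.h>

#define TOY_SIGLEN 8            /* toy signature length in bytes */

/*
 * If each page's downlink is the bitwise OR of its entries, a search
 * must descend into both pages whenever its query set intersects both
 * unions.  So a rough "badness" score for a split is the number of
 * bits set in BOTH unions: the smaller, the better.
 */
static int
split_overlap(const uint8_t left_union[TOY_SIGLEN],
              const uint8_t right_union[TOY_SIGLEN])
{
    int     i,
            overlap = 0;

    for (i = 0; i < TOY_SIGLEN; i++)
    {
        uint8_t both = left_union[i] & right_union[i];

        while (both)
        {
            overlap += both & 1;
            both >>= 1;
        }
    }
    return overlap;
}
```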

Teodor, Oleg, can you clarify what's needed here?

            regards, tom lane

From:
"Steinar H. Gunderson"
Date:

On Fri, Jan 20, 2006 at 07:23:10PM -0500, Tom Lane wrote:
> I'm not very clear on what tsearch2 is doing with these bitmaps, but it
> looks like an upper page's downlink has the union (bitwise OR) of the
> one-bits in the values on the lower page, and you have to visit the lower
> page if this union has a nonempty intersection with the set you are looking
> for.  If that's correct, what you really want is to divide the values so
> that the unions of the two sets have minimal overlap ... which seems to me
> to have little to do with what the code does at present.

Sort of like the vertex-cover problem? That's probably a lot harder than
finding the two farthest points...

/* Steinar */
--
Homepage: http://www.sesse.net/

From:
"Craig A. James"
Date:

Tom Lane wrote:
> Well, we're trying to split an index page that's gotten full into two
> index pages, preferably with approximately equal numbers of items in
> each new page (this isn't a hard requirement though).  ...  If that's
> correct, what you really want is to divide the values so that the unions
> of the two sets have minimal overlap ... which seems to me to have
> little to do with what the code does at present.

This problem has been studied extensively by chemists, and they haven't found any easy solutions.

The Jarvis Patrick clustering algorithm might give you hints about a fast
approach.  In theory it's K*O(N^2), but J-P is preferred for large datasets
(millions of molecules) because the coefficient K can be made quite low.  It
starts with a "similarity metric" for two bit strings, the Tanimoto or
Tversky coefficients:

  http://www.daylight.com/dayhtml/doc/theory/theory.finger.html#RTFToC83

J-P Clustering is described here:

  http://www.daylight.com/dayhtml/doc/cluster/cluster.a.html#cl33

J-P Clustering is probably not the best for this problem (see the
illustrations in the link above to see why), but the general idea of
computing N-nearest-neighbors, followed by a partitioning step, could be
what's needed.

Craig

From:
Ron
Date:

At 07:23 PM 1/20/2006, Tom Lane wrote:
>"Steinar H. Gunderson" <> writes:
> > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
> >> It's also worth considering that the entire approach is a heuristic,
> >> really --- getting the furthest-apart pair of seeds doesn't guarantee
> >> an optimal split as far as I can see.  Maybe there's some totally
> >> different way to do it.
>
> > For those of us who don't know what tsearch2/gist is trying to accomplish
> > here, could you provide some pointers? :-)
>
>Well, we're trying to split an index page that's gotten full into
>two index pages, preferably with approximately equal numbers of items in
>each new page (this isn't a hard requirement though).

Maybe we are over thinking this.  What happens if we do the obvious
and just make a new page and move the "last" n/2 items on the full
page to the new page?

Various forms of "move the last n/2 items" can be tested here:
0= just split the table in half.  Sometimes KISS works. O(1).
1= the ones with the highest (or lowest) "x" value.
2= the ones with the highest sum of coordinates (x+y+... values in
the top/bottom n/2 of entries).
3= split the table so that each table has entries whose size_waste
values add up to approximately the same value.
4= I'm sure there are others.
1-4 can be done in O(n) time w/o auxiliary data.  They can be done in
O(1) if we've kept track of the appropriate metric as we've built the
current page.


>I think the true figure of merit for a split is how often will
>subsequent searches have to descend into *both* of the resulting
>pages as opposed to just one
>--- the less often that is true, the better.  I'm not very clear on
>what tsearch2 is doing with these bitmaps, but it looks like an
>upper page's downlink has the union (bitwise OR) of the one-bits in
>the values on the lower page, and you have to visit the lower page
>if this union has a nonempty intersection with the set you are
>looking for.  If that's correct, what you really want is to divide
>the values so that the unions of the two sets have minimal overlap
>... which seems to me to have little to do with what the code does at present.
I'm not sure what "upper page" and "lower page" mean here?


>Teodor, Oleg, can you clarify what's needed here?
Ditto.  Guys what is the real motivation and purpose for this code?

Ron



From:
Martijn van Oosterhout
Date:

On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote:
> Yeah, but fetching from a small constant table is pretty quick too;
> I doubt it's worth getting involved in machine-specific assembly code
> for this.  I'm much more interested in the idea of improving the
> furthest-distance algorithm in gtsvector_picksplit --- if we can do
> that, it'll probably drop the distance calculation down to the point
> where it's not really worth the trouble to assembly-code it.

I've played with another algorithm. Very simple, but it's only O(N). It
doesn't get the longest distance, but it does get close. Basically you
take the first two elements as your starting pair. Then you loop over
each remaining string, each time keeping the longest pair out of each
set of three.

I've only tried it on random strings. The maximum distance for 128
random strings tends to be around 291-295. This algorithm tends to find
distances around 280. Pseudo-code below (in Perl).

However, IMHO, this algorithm is optimising the wrong thing. It
shouldn't be trying to split into sets that are far apart, it should be
trying to split into sets that minimize the number of set bits (i.e.
distance from zero), since that's what will speed up searching.
That's harder though (this algorithm does approximate it, sort of),
and I haven't come up with an algorithm yet.

sub MaxDistFast
{
  my $strings = shift;

  my $m1 = 0;
  my $m2 = 1;
  my $dist = -1;

  for my $i (2..$#$strings)
  {
    my $d1 = HammDist( $strings->[$i], $strings->[$m1] );
    my $d2 = HammDist( $strings->[$i], $strings->[$m2] );

    my $m = ($d1 > $d2) ? $m1 : $m2;
    my $d = ($d1 > $d2) ? $d1 : $d2;

    if( $d > $dist )
    {
      $dist = $d;
      $m1 = $i;
      $m2 = $m;
    }
  }
  return($m1,$m2,$dist);
}

Full program available at:
http://svana.org/kleptog/temp/picksplit.pl
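For readers who don't read Perl, the same logic transcribes to Python almost line for line. HammDist is assumed here to be plain Hamming distance, computed on integer bitmaps rather than on strings; that representation is an assumption of this sketch:

```python
def hamm_dist(a: int, b: int) -> int:
    # Hamming distance between two integer bitmaps
    # (assumed equivalent of the Perl code's HammDist).
    return bin(a ^ b).count("1")

def max_dist_fast(strings):
    """O(N) approximation of the farthest pair, per the pseudo-code above."""
    m1, m2 = 0, 1
    dist = -1
    for i in range(2, len(strings)):
        d1 = hamm_dist(strings[i], strings[m1])
        d2 = hamm_dist(strings[i], strings[m2])
        # Keep the longest pair out of {i, m1, m2}.
        m = m1 if d1 > d2 else m2
        d = d1 if d1 > d2 else d2
        if d > dist:
            dist = d
            m1, m2 = i, m
    return m1, m2, dist
```

Note that, like the Perl original, it never measures the distance between the two initial elements themselves, so it can end up reporting a pair shorter than the true maximum; that is part of why it only approximates the farthest pair.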

Have a nice day,
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

From:
Oleg Bartunov
Date:

On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:

> However, IMHO, this algorithm is optimising the wrong thing. It
> shouldn't be trying to split into sets that are far apart, it should be
> trying to split into sets that minimize the number of set bits (ie
> distance from zero), since that's what's will speed up searching.

Martijn, you're right! We want not only to split the page into very
different parts, but also not to increase the number of set bits in the
resulting signatures, which are the union (OR) of all signatures
in each part. We need not only fast index creation (thanks, Tom!),
but a better index. Some information is available here:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
There should be a more detailed document, but I don't remember where :)

> That's harder though (this algorithm does approximate it sort of)
> and I havn't come up with an algorithm yet

Don't ask how hard we thought :)

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: , http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

From:
Oleg Bartunov
Date:

On Sat, 21 Jan 2006, Ron wrote:

>> Teodor, Oleg, can you clarify what's needed here?
> Ditto.  Guys what is the real motivation and purpose for this code?

We want not just to split the page into two very distinct parts, but to keep
the resulting signature, which is the ORed signature of all signatures in the page,
as 'sparse' as we can.
Some information is available here:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals

Unfortunately, we're rather busy right now and can't be very useful.
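Oleg's point can be made concrete: a search has to descend into a child page exactly when the query signature overlaps the page's ORed signature, so the sparser each union, the fewer pages a typical search touches. A small Python illustration with toy 4-bit signatures (not tsearch2's real layout):

```python
def union_sig(signatures):
    # Downlink signature for a page: OR of all item signatures on it.
    s = 0
    for sig in signatures:
        s |= sig
    return s

def must_visit(page_union: int, query: int) -> bool:
    # A page must be visited iff its union shares at least
    # one bit with the query signature.
    return (page_union & query) != 0

# A split that keeps similar items together...
good_left, good_right = [0b0011, 0b0001], [0b1100, 0b0100]
# ...versus one that mixes them:
bad_left, bad_right = [0b0011, 0b1100], [0b0001, 0b0100]

query = 0b0100
good = sum(must_visit(union_sig(p), query) for p in (good_left, good_right))
bad = sum(must_visit(union_sig(p), query) for p in (bad_left, bad_right))
print(good, bad)  # the good split visits 1 page, the bad split visits 2
```

Both splits hold the same four items; only the assignment differs, and with it the density of each page's union.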


     Regards,
         Oleg

From:
Martijn van Oosterhout
Date:

On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
> Martijn, you're right! We want not only to split page to very
> different parts, but not to increase the number of sets bits in
> resulted signatures, which are union (OR'ed) of all signatures
> in part. We need not only fast index creation (thanks, Tom !),
> but a better index. Some information is available here
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
> There are should be more detailed document, but I don't remember where:)

I see how it works, what I don't quite get is whether the "inverted
index" you refer to is what we're working with here, or just what's in
tsearchd?

> >That's harder though (this algorithm does approximate it sort of)
> >and I havn't come up with an algorithm yet
>
> Don't ask how hard we thought :)

Well, looking at how other people are struggling with it, it's
definitely a Hard Problem. One thing though: I don't think the picksplit
algorithm as is really requires you to strictly have the longest
distance, just something reasonably long. So I think the alternate
algorithm I posted should produce equivalent results. No idea how to
test it though...

Have a nice day,
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/

From:
Oleg Bartunov
Date:

On Sat, 21 Jan 2006, Ron wrote:

> Perhaps a different approach to this problem is called for:
>
> _Managing Gigabytes: Compressing and Indexing Documents and Images_  2ed
> Witten, Moffat, Bell
> ISBN 1-55860-570-3
>
> This is a VERY good book on the subject.
>
> I'd also suggest looking at the publicly available work on indexing and
> searching for search engines like Inktomi (sp?) and Google.
> Ron

Ron,

you completely miss the problem!  We do know MG and other SE.  Actually,
we've implemented several search engines based on inverted-index technology
(see, for example, pgsql.ru/db/pgsearch). tsearch2 was designed for
online indexing, while keeping an inverted index online is a rather difficult
problem. We do have a plan to implement an inverted index as an option for
large read-only archives, but right now we're discussing how to organize the
online index, and whether it's possible to optimize the current storage for
signatures without breaking search performance.


     Regards,
         Oleg

From:
Ron
Date:

Perhaps a different approach to this problem is called for:

_Managing Gigabytes: Compressing and Indexing Documents and Images_  2ed
Witten, Moffat, Bell
ISBN 1-55860-570-3

This is a VERY good book on the subject.

I'd also suggest looking at the publicly available work on indexing
and searching for search engines like Inktomi (sp?) and Google.
Ron





From:
Oleg Bartunov
Date:

On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:

> On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
>> Martijn, you're right! We want not only to split page to very
>> different parts, but not to increase the number of sets bits in
>> resulted signatures, which are union (OR'ed) of all signatures
>> in part. We need not only fast index creation (thanks, Tom !),
>> but a better index. Some information is available here
>> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
>> There are should be more detailed document, but I don't remember where:)
>
> I see how it works, what I don't quite get is whether the "inverted
> index" you refer to is what we're working with here, or just what's in
> tsearchd?

just tsearchd. We plan to implement an inverted index in the PostgreSQL core
and then adapt tsearch2 to use it as an option for read-only archives.

>
>>> That's harder though (this algorithm does approximate it sort of)
>>> and I havn't come up with an algorithm yet
>>
>> Don't ask how hard we thought :)
>
> Well, looking at how other people are struggling with it, it's
> definitly a Hard Problem. One thing though, I don't think the picksplit
> algorithm as is really requires you to strictly have the longest
> distance, just something reasonably long. So I think the alternate
> algorithm I posted should produce equivalent results. No idea how to
> test it though...

you may try our development module 'gevel' to see how dense a signature is.

www=# \d v_pages
           Table "public.v_pages"
   Column   |       Type        | Modifiers
-----------+-------------------+-----------
  tid       | integer           | not null
  path      | character varying | not null
  body      | character varying |
  title     | character varying |
  di        | integer           |
  dlm       | integer           |
  de        | integer           |
  md5       | character(22)     |
  fts_index | tsvector          |
Indexes:
     "v_pages_pkey" PRIMARY KEY, btree (tid)
     "v_pages_path_key" UNIQUE, btree (path)
     "v_gist_key" gist (fts_index)

# select * from gist_print('v_gist_key') as t(level int, valid bool, a gtsvector) where level =1;
  level | valid |               a
-------+-------+--------------------------------
      1 | t     | 1698 true bits, 318 false bits
      1 | t     | 1699 true bits, 317 false bits
      1 | t     | 1701 true bits, 315 false bits
      1 | t     | 1500 true bits, 516 false bits
      1 | t     | 1517 true bits, 499 false bits
(5 rows)



     Regards,
         Oleg

From:
Oleg Bartunov
Date:

gevel is available from
http://www.sai.msu.su/~megera/postgres/gist/

     Oleg

From:
Tom Lane
Date:

Ron <> writes:
> At 07:23 PM 1/20/2006, Tom Lane wrote:
>> Well, we're trying to split an index page that's gotten full into
>> two index pages, preferably with approximately equal numbers of items in
>> each new page (this isn't a hard requirement though).

> Maybe we are over thinking this.  What happens if we do the obvious
> and just make a new page and move the "last" n/2 items on the full
> page to the new page?

Search performance will go to hell in a handbasket :-(.  We have to make
at least some effort to split the page in a way that will allow searches
to visit only one of the two child pages rather than both.

It's certainly true though that finding the furthest pair is not a
necessary component of that.  It's reasonable if you try to visualize
the problem in 2D or 3D, but I'm not sure that that geometric intuition
holds up in such a high-dimensional space as we have here.

            regards, tom lane

From:
Martijn van Oosterhout
Date:

On Sat, Jan 21, 2006 at 06:22:52PM +0300, Oleg Bartunov wrote:
> >I see how it works, what I don't quite get is whether the "inverted
> >index" you refer to is what we're working with here, or just what's in
> >tsearchd?
>
> just tsearchd. We plan to implement inverted index into PostgreSQL core
> and then adapt tsearch2 to use it as option for read-only archives.

Hmm, I had a look and a think about it, and I think I see what you mean by
an inverted index. I also think you're going to have a real exercise
implementing it in Postgres, because Postgres indexes work on the basis
of one tuple, one index entry, which I think your inverted index
doesn't do.

That said, I think GiST could be extended to support your case without
too much difficulty. Interesting project though :)

BTW, given you appear to have a tsearch2 index with some real-world
data, would you be willing to try some alternate picksplit algorithms
to see if your gevel module shows any difference?

Have a nice day,
--
Martijn van Oosterhout   <>   http://svana.org/kleptog/

From:
David Lang
Date:

On Sat, 21 Jan 2006, Tom Lane wrote:

> Ron <> writes:
>> At 07:23 PM 1/20/2006, Tom Lane wrote:
>>> Well, we're trying to split an index page that's gotten full into
>>> two index pages, preferably with approximately equal numbers of items in
>>> each new page (this isn't a hard requirement though).
>
>> Maybe we are over thinking this.  What happens if we do the obvious
>> and just make a new page and move the "last" n/2 items on the full
>> page to the new page?
>
> Search performance will go to hell in a handbasket :-(.  We have to make
> at least some effort to split the page in a way that will allow searches
> to visit only one of the two child pages rather than both.

Does the order of the items within a given page matter? If not, this sounds
like a partial quicksort algorithm would work: you don't need to fully
sort things, but you do want to make sure that everything on the first
page is 'less' than everything on the second page, so you can skip passes
that don't cross a page boundary.

> It's certainly true though that finding the furthest pair is not a
> necessary component of that.  It's reasonable if you try to visualize
> the problem in 2D or 3D, but I'm not sure that that geometric intuition
> holds up in such a high-dimensional space as we have here.

I will say that I'm not understanding the problem well enough to
understand the multi-dimensional nature of this problem.

David Lang

From:
Tom Lane
Date:

David Lang <> writes:
> On Sat, 21 Jan 2006, Tom Lane wrote:
>> Ron <> writes:
>>> Maybe we are over thinking this.  What happens if we do the obvious
>>> and just make a new page and move the "last" n/2 items on the full
>>> page to the new page?
>>
>> Search performance will go to hell in a handbasket :-(.  We have to make
>> at least some effort to split the page in a way that will allow searches
>> to visit only one of the two child pages rather than both.

> does the order of the items within a given page matter?

AFAIK the items within a GIST index page are just stored in insertion
order (which is exactly why Ron's suggestion above doesn't work well).
There's no semantic significance to it.  It's only when we have to split
the page that we need to classify the items more finely.

            regards, tom lane

From:
Ron
Date:

After reading the various papers available on GiST and RD trees, I
think I have a decent suggestion.

Since RD-tree keys contain the keys of their descendants/components,
they are basically a representation of a depth-first
search.  This can be useful for intra-document searches.

OTOH, inter-document searches are going to be more akin to breadth-first
searches using RD trees.

Thus my suggestion is that we maintain =two= index structures for text data.

The first contains as many keys and all their descendants as
possible.  When we can no longer fit a specific complete "path" into
a page, we start a new one, trying to keep complete top-level-to-leaf
key sets within a page.  This will minimize paging during
intra-document searches.

The second index keeps siblings within a page and avoids putting
parents or children within a page unless the entire depth first
search can be kept within the page in addition to the siblings
present.  This will minimize paging during inter-document searches.

Traditional B-tree ordering methods can be used to define the
ordering/placement of pages within each index, which will minimize
head seeks to find the correct page to scan.

Since the criteria for putting a key within a page or starting a new
page is simple, performance for those tasks should be O(1).

The price is the extra space used for two indexes instead of one, but
at first glance that seems well worth it.

Comments?
Ron





From:
Alvaro Herrera
Date:

Ron wrote:
> At 01:27 PM 1/21/2006, Tom Lane wrote:
> >Ron <> writes:
> >> At 07:23 PM 1/20/2006, Tom Lane wrote:
> >>> Well, we're trying to split an index page that's gotten full into
> >>> two index pages, preferably with approximately equal numbers of items in
> >>> each new page (this isn't a hard requirement though).

> After reading the various papers available on GiST and RD trees, I
> think I have a decent suggestion.

I for one don't understand what your suggestion has to do with the
problem at hand ... not that I have a better one myself.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"You always have to feed the gods, even when the land is dry" (Orual)

From:
Ron
Date:

At 08:00 PM 1/26/2006, Alvaro Herrera wrote:
>Ron wrote:
> > At 01:27 PM 1/21/2006, Tom Lane wrote:
> > >Ron <> writes:
> > >> At 07:23 PM 1/20/2006, Tom Lane wrote:
> > >>> Well, we're trying to split an index page that's gotten full into
> > >>> two index pages, preferably with approximately equal numbers
> of items in
> > >>> each new page (this isn't a hard requirement though).
>
> > After reading the various papers available on GiST and RD trees, I
> > think I have a decent suggestion.
>
>I for one don't understand what does your suggestion have to do with the
>problem at hand ... not that I have a better one myself.

We have two problems here.
The first is that the page splitting code for these indexes currently
has O(N^2) performance.
The second is that whatever solution we do use for this
functionality, we still need good performance during searches that
use the index.  It's not clear that the solutions we've discussed to
splitting index pages thus far will result in good performance during searches.

My suggestion is intended to address both issues.

If I'm right, it helps obtain high performance during searches while
allowing the index page-splitting code to be O(1).

Ron.



From:
"Craig A. James"
Date:

Ron <> writes:
> We have two problems here.
> The first is that the page splitting code for these indexes currently
> has O(N^2) performance.
> The second is that whatever solution we do use for this functionality,
> we still need good performance during searches that use the index.

No, unfortunately that's not the problem that needs to be solved.

The problem is figuring out WHICH records to put in the "left" and "right" trees once you split them.  If you can
figurethat out, then your suggestion (and perhaps other techniques) could be useful. 

The problem boils down to this:  You have a whole bunch of essentially random bitmaps.  You have two buckets.  You want
toput half of the bitmaps in one bucket, and half in the other bucket, and when you get through, you want all of the
bitmapsin each bucket to be maximally similar to each other, and maximally dissimilar to the ones in the other bucket. 

That way, when you OR all the bitmaps in each bucket together to build the bitmap for the left and right child nodes of
thetree, you'll get maximum separation -- the chances that you'll have to descend BOTH the left and right nodes of the
treeare minimized. 

Unfortunately, this problem is very likely in the set of NP-complete
problems, i.e. like the famous "Traveling Salesman Problem," there is
no known algorithm that will give the answer in a reasonable time.  In
this case, "reasonable" would be measured in milliseconds to seconds,
but in fact an actual "perfect" split of a set of bitmaps probably
can't be computed in the lifetime of the universe for more than a few
hundred bitmaps.

That's the problem that's being discussed: How do you decide which
bitmaps go in each of the two buckets?  Any solution will necessarily
be imperfect, a pragmatic algorithm that gives an imperfect, but
acceptable, answer.

As I mentioned earlier, chemists make extensive use of bitmaps to
categorize and group molecules.  They use Tanimoto or Tversky
similarity metrics (Tanimoto is a special case of Tversky), because
it's extremely fast to compare two bitmaps, and the score is highly
correlated with the number of bits the two bitmaps have in common.

But even with a fast "distance" metric like Tanimoto, there's still no
easy way to decide which bucket to put each bitmap into.
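To make the similarity metric concrete, here is a minimal sketch (not PostgreSQL code) of Tanimoto similarity over bitmaps stored as Python integers, together with a naive greedy two-bucket split driven by it.  The seed selection and greedy assignment are illustrative assumptions, not a real GiST picksplit implementation.

```python
def tanimoto(a: int, b: int) -> float:
    """|A & B| / |A | B| for two bit-vectors stored as ints."""
    union = a | b
    if union == 0:
        return 1.0  # two empty bitmaps are considered identical
    return bin(a & b).count("1") / bin(union).count("1")

def greedy_split(bitmaps):
    """Pick the two most dissimilar bitmaps as seeds, then assign each
    remaining bitmap to the seed it resembles most (a crude heuristic)."""
    n = len(bitmaps)
    seeds = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: tanimoto(bitmaps[ij[0]], bitmaps[ij[1]]),
    )
    left, right = [bitmaps[seeds[0]]], [bitmaps[seeds[1]]]
    for k, bm in enumerate(bitmaps):
        if k in seeds:
            continue
        if tanimoto(bm, left[0]) >= tanimoto(bm, right[0]):
            left.append(bm)
        else:
            right.append(bm)
    return left, right

left, right = greedy_split([0b1100, 0b1101, 0b0011, 0b0010])
```

This is exactly the kind of pragmatic, imperfect split discussed above: cheap to compute, but with no optimality guarantee.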

Craig

From:
Ron
Date:

You seem to have missed my point.  I just gave a very clear
description of how to "decide which bitmaps go in each of the two
buckets" by reformulating the question into "decide which bitmaps go
in each of =four= buckets".

The intent is to have two indexes, one optimized for one common class
of searches, and the other optimized for another common class of searches.

By decomposing the optimization problem into two simpler problems,
the hope is that we address all the issues reasonably simply while
still getting decent performance.

Nothing is free.  The price we pay, and it is significant, is that we
now have two indexes where before we had only one.

Ron


At 09:29 PM 1/26/2006, Craig A. James wrote:
>Ron <> writes:
>>We have two problems here.
>>The first is that the page splitting code for these indexes
>>currently has O(N^2) performance.
>>The second is that whatever solution we do use for this
>>functionality, we still need good performance during searches that
>>use the index.
>
>No, unfortunately that's not the problem that needs to be solved.
>
>The problem is figuring out WHICH records to put in the "left" and
>"right" trees once you split them.  If you can figure that out, then
>your suggestion (and perhaps other techniques) could be useful.
>
>The problem boils down to this:  You have a whole bunch of
>essentially random bitmaps.  You have two buckets.  You want to put
>half of the bitmaps in one bucket, and half in the other bucket, and
>when you get through, you want all of the bitmaps in each bucket to
>be maximally similar to each other, and maximally dissimilar to the
>ones in the other bucket.
>
>That way, when you OR all the bitmaps in each bucket together to
>build the bitmap for the left and right child nodes of the tree,
>you'll get maximum separation -- the chances that you'll have to
>descend BOTH the left and right nodes of the tree are minimized.
>
>Unfortunately, this problem is very likely in the set of NP-complete
>problems, i.e. like the famous "Traveling Salesman Problem," there is
>no known algorithm that will give the answer in a reasonable
>time.  In this case, "reasonable" would be measured in milliseconds
>to seconds, but in fact an actual "perfect" split of a set of
>bitmaps probably can't be computed in the lifetime of the universe
>for more than a few hundred bitmaps.
>
>That's the problem that's being discussed: How do you decide which
>bitmaps go in each of the two buckets?  Any solution will
>necessarily be imperfect, a pragmatic algorithm that gives an
>imperfect, but acceptable, answer.
>
>As I mentioned earlier, chemists make extensive use of bitmaps to
>categorize and group molecules.  They use Tanimoto or Tversky
>similarity metrics (Tanimoto is a special case of Tversky), because
>it's extremely fast to compare two bitmaps, and the score is highly
>correlated with the number of bits the two bitmaps have in common.
>
>But even with a fast "distance" metric like Tanimoto, there's still
>no easy way to decide which bucket to put each bitmap into.
>
>Craig



From:
Ron
Date:

At 11:13 PM 1/26/2006, Craig A. James wrote:
>Ron,
>
>I'll write to you privately, because these discussions can get messy
>"in public".

I'm responding to this missive publicly in an attempt to help the
discussion along.  It is not my usual practice to respond to private
messages publicly, but this seems a good case for an exception.


>>You seem to have missed my point.  I just gave a very clear
>>description of how to "decide which bitmaps go in each of the two
>>buckets" by reformulating the question into "decide which bitmaps
>>go in each of =four= buckets".
>
>Sorry to disagree, but here's the problem. It's not whether you put
>them into two, or four, or N buckets.  The problem is, how do you
>categorize them to begin with, so that you have some reasonable
>criterion for which item goes in which bucket?  THAT is the hard
>problem, not whether you have two or four buckets.

Agreed.  ...and I've given the answer to "how do you categorize them"
using a general property of RD Trees which should result in "a
reasonable criterion for which item goes in which bucket" when used
for text searching.

The definition of RD tree keys as either "atomic" (non-decomposable)
or "molecular" (containing the keys of their descendants) is the one
source of our current problems figuring out how to split them and, if
I'm correct, a hint as to how to solve the splitting problem in O(1)
time while helping to foster high performance during searches.


>Earlier, you wrote:
>>Traditional B-tree ordering methods can be used to define the
>>ordering/placement of pages within each index, which will minimize
>>head seeks to find the correct page to scan.
>>Since the criteria for putting a key within a page or starting a
>>new page is simple, performance for those tasks should be O(1).
>
>What are the "traditional B-tree ordering methods"?  That's the
>issue, right there.  The problem is that bitmaps HAVE NO ORDERING
>METHOD.  You can't sort them numerically or alphabetically.

The =bitmaps= have no ordering method.  =Pages= of bitmaps MUST have
an ordering method or we have no idea which page to look at when
searching for a key.

Treating the "root" bitmaps (those that may have descendants but have
no parents) as numbers and ordering the pages using B-tree creation
methods that use those numbers as keys is a simple way to create a
balanced data structure with high fan-out.  IOW, a recipe for finding
the proper page(s) to scan using minimal seeks.
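A toy illustration of that idea (my reading of Ron's proposal, not actual GiST code): treat each page's "root" bitmap as an integer key, keep pages sorted by that key, and locate a candidate page with a binary search, as a B-tree would.

```python
import bisect

# One made-up integer key per index page, kept sorted (B-tree style).
page_keys = sorted([0b0001_0000, 0b0110_0000, 0b1010_0001])

def page_for(bitmap: int) -> int:
    """Index of the first page whose key is >= the search bitmap,
    clamped to the last page (toy lookup semantics)."""
    i = bisect.bisect_left(page_keys, bitmap)
    return min(i, len(page_keys) - 1)
```

The point is only that once pages have numeric keys, locating a page costs O(log n) comparisons and minimal seeks, regardless of how the bitmaps inside a page are organized.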


>Take a crowd of people.  It's easy to divide them in half by names:
>just alphabetize and split the list in half.  That's what your
>solution would improve on.
>
>But imagine I gave you a group of people and told you to put them
>into two rooms, such that when you are through, the people in each
>room are maximally similar to each other and maximally dissimilar to
>the people in the other room.  How would you do it?
>
>First of all, you'd have to define what "different" and "same"
>are.  Is your measure based on skin color, hair color, age, height,
>weight, girth, intelligence, speed, endurance, body odor, ...
>?  Suppose I tell you, "All of those".  You have to sort the people
>so that your two groups are separated such that the group
>differences are maximized in this N-dimensional
>space.  Computationally, it's a nearly impossible problem.

I'm =changing= the problem using the semantics of RD trees.  Using an
RD tree representation, we'd create and assign a key for each person
that ranked them compared to everyone else for each of the metrics we
decided to differentiate on.

Then we start to form trees of keys to these people by creating
"parent" keys as roots that contain the union of everyone with the
same or possibly similar value for some quality.  By iterating this
process, we end up with a bottom up construction method for an RD
tree whose root key will be the union of all the keys representing these
people.  If this is one data structure, we end up with an efficient
and effective way of answering not only Boolean but also ranking and
similarity type queries.

The problem comes when we have to split this monolithic DS into
pieces for best performance.  As many have noted, there is no simple
way to decide how to do such a thing.

OTOH, we =do= have the knowledge of how RD trees are built and what
their keys represent, and we know that queries are going to tend
strongly to either a) traverse the path from parent to child (depth
first) or b) find all siblings with (dis)similar characteristics
(breadth first), or c) contain a mixture of a) and b).

So I'm suggesting that conceptually we clone the original RD tree and
we split each clone according to two different methods.
Method A results in pages that contain as many complete depth first
paths from root to leaf as possible on each page.
Method B results in pages that contain as many siblings as possible per page.
...and we use the appropriate index during each type of query or query part.

In return for using 2x as much space, we should have a general method
that is O(1) for decomposing RD trees in such a way as to support
high performance during searches.
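A conceptual sketch of the two-clone decomposition; the tree shape and the page-content definitions are made-up assumptions for illustration.  Method A packs complete root-to-leaf paths onto pages (favoring depth-first descent); Method B packs siblings together (favoring breadth-first scans).

```python
# Hypothetical RD tree: parent -> list of children.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}

def paths(node="root", prefix=()):
    """All root-to-leaf paths (Method A page contents)."""
    kids = tree[node]
    if not kids:
        return [prefix + (node,)]
    out = []
    for k in kids:
        out.extend(paths(k, prefix + (node,)))
    return out

def sibling_groups():
    """Each node's children kept together (Method B page contents)."""
    return [kids for kids in tree.values() if kids]

method_a_pages = paths()
method_b_pages = sibling_groups()
```

The 2x space cost shows up directly: every node appears in both decompositions, once per clone.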


>Here's a much simpler version of the problem.  Suppose I give you
>1000 numbers, and tell you to divide them in half so that the two
>groups have the smallest standard deviation within each group
>possible, and the two average values of the groups of numbers are
>the farthest apart possible.  Pretty easy, right?
>
>Now do it where "distance" is evaluated modulo(10), that is, 1 and 9
>are closer together than 1 and 3.  Now try it -- you'll find that
>this problem is almost impossible to solve.

This is orthogonal to the discussion at hand since the above is not
akin to text searching nor best done with RD trees and that is
exactly what this discussion is about.  We don't have to solve a
general problem for all domains.  We only have to solve it for the
specific domain of text search using the specific DS of RD trees.


>The problem is that, as database designers, we're used to text and
>numbers, which have an inherent order.  Bitmaps have no ordering --
>you can't say one is "greater than" or "less than" the other.  All
>you can do is determine that A and B are "more similar" or "less
>similar" than A and C.

Text and numbers only have an order because we deem them to.  There
is even a very common default order we tend to use.  But even in the
trivial case of ranking we've all said that "1" is better than "2" in
one situation and "2" is better than "1" in another.

In the case of text searching, the bitmaps represent where to find
legomena, AKA specific tokens, and particularly hapax legomena, AKA
tokens that occur only once.  Hapax legomena are especially important
because they are maximally efficient at driving the search needed to
answer a query.

While absolute hapax legomena are great for quickly pruning things
within a document or document region, relative hapax legomena can do
the same thing when searching among multiple documents or document regions.
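A small illustration of that distinction, reading "absolute" hapax legomena as tokens occurring exactly once within one document and "relative" as tokens appearing in exactly one document of a collection (this mapping of Ron's terms is my assumption).

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = [["the", "cat", "sat", "the", "mat"],
        ["the", "dog", "sat"]]

def hapax_within(doc):
    """Tokens occurring exactly once in a single document."""
    counts = Counter(doc)
    return {t for t, n in counts.items() if n == 1}

def hapax_across(docs):
    """Tokens appearing in exactly one document of the collection."""
    doc_freq = Counter(t for d in docs for t in set(d))
    return {t for t, n in doc_freq.items() if n == 1}
```

Either kind narrows a search sharply: a token known to occur in only one place identifies its document (or region) immediately.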

The two indexes I'm suggesting are designed to take advantage of this
general property of text searching.


Hopefully this clarifies things and motivates a better discussion?
Ron