Martijn van Oosterhout <kleptog@svana.org> writes:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...
Yeah, I just finished doing that and got about a 20x overall speedup
(30-some seconds to build the index instead of 10 minutes). However,
hemdistsign is *still* 70% of the runtime after doing that. The problem
seems to be that gtsvector_picksplit is calling it an inordinate number
of times:
0.53 30.02 1649/1649 FunctionCall2 <cycle 2> [19]
[20] 52.4 0.53 30.02 1649 gtsvector_picksplit [20]
29.74 0.00 23519673/33035195 hemdistsign [18]
0.14 0.00 22985469/22985469 hemdistcache [50]
0.12 0.00 268480/10030581 makesign [25]
0.02 0.00 270400/270400 fillcache [85]
0.00 0.00 9894/4077032 AllocSetAlloc [34]
0.00 0.00 9894/2787477 MemoryContextAlloc [69]
(hemdistcache calls hemdistsign --- I think gprof is doing something
funny with tail-calls here, and showing hemdistsign as directly called
from gtsvector_picksplit when control really arrives through hemdistcache.)
The bulk of the problem is clearly in this loop, which performs O(N^2)
comparisons to find the two entries that are furthest apart in hemdist
terms:
for (k = FirstOffsetNumber; k < maxoff; k = OffsetNumberNext(k))
{
for (j = OffsetNumberNext(k); j <= maxoff; j = OffsetNumberNext(j))
{
if (k == FirstOffsetNumber)
fillcache(&cache[j], GETENTRY(entryvec, j));
size_waste = hemdistcache(&(cache[j]), &(cache[k]));
if (size_waste > waste)
{
waste = size_waste;
seed_1 = k;
seed_2 = j;
}
}
}
I wonder if there is a way to improve on that.
regards, tom lane