Re: scoring differences between bitmasks

Поиск
Список
Период
Сортировка
От Ben
Тема Re: scoring differences between bitmasks
Дата
Msg-id D54F13C3-7377-4B1A-B0E7-3F53EB5610DA@silentmedia.com
обсуждение исходный текст
Ответ на Re: scoring differences between bitmasks  (Martijn van Oosterhout <kleptog@svana.org>)
Список pgsql-general
Yeah, I thought about this, but damn my 3-dimensional mind, I just
can't convince myself this will work for 256 dimensions. :) Or
rather, I can see how it could, *given* good placements of the grid
points to snap each vector to. Unfortunately, I don't know enough
theory to know what good placements would be.

On Oct 2, 2005, at 9:47 AM, Martijn van Oosterhout wrote:

> On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:
>
>> Yes, that's the straightforward way to do it. But given that my
>> vectors are 256 bits in length, and that I'm going to eventually have
>> about 4 million of them to search through, I was hoping greater minds
>> than mine had figured out how to do it faster, or how compute some
>> kind of indexing....... somehow.
>>
>
> Ah right, that's another kettle of fish. Well, your problem space is a
> form of eulidean space in the sense that the triangle inequality
> holds,
> eg dist(A,C) <= dist(A,B)+dist(B,C). Maybe one of the geometric index
> classes could help.
>
> Here comes something I thought up myself, but you may be able to come
> up with something better.
>
> 1. For each string, store its distance from the strings:
> 0000000000 etc
> 0101010101 etc
> 0011001100 etc
> maybe more or less, you'll need to investigate
>
> 2. Index each of these columns.
> 3. For your target string calculate the distances from each of these
> 4. Do a query like:
>
> SELECT key, * from table
> where dist_zero_string > $dist_zero_search-2 and dist_zero_string <
> $dist_zero_search+2
> where dist_0101_string > $dist_0101_search-2 and dist_0101_string <
> $dist_0101_search+2
> where dist_0011_string > $dist_0101_search-2 and dist_0011_string <
> $dist_0011_search+2
> order by dist(key,searchvalue);
>
> PostgreSQL 8.1 has bitmap index scans which could make the above quite
> fast. Some theory may help determine the best key strings. For example
> all-zeros and all-ones are redundant since you can calculate the
> distance for one from the other. You also might need to vary your
> distances to match with (the 2 above) depending on the average
> distance
> to the strings and where you search.
>
> You could use functional indexes to avoid storing the actual values in
> the tables. Or you could take two or three of these values and make
> points and use an rtree index on those to find nearby points (which
> will represent nearby strings).
>
> Anyway, this is just some random thoughts, hope it helps,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/
> kleptog/
>
>> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent
>> is a
>> tool for doing 5% of the work and then sitting around waiting for
>> someone
>> else to do the other 95% so you can sue them.
>


В списке pgsql-general по дате отправления:

Предыдущее
От: Samik Raychaudhuri
Дата:
Сообщение: Re: Portable PostgreSQL
Следующее
От: "Todd A. Cook"
Дата:
Сообщение: Re: scoring differences between bitmasks