Re: scoring differences between bitmasks

Поиск
Список
Период
Сортировка
От Martijn van Oosterhout
Тема Re: scoring differences between bitmasks
Дата
Msg-id 20051002164700.GI30492@svana.org
обсуждение исходный текст
Ответ на Re: scoring differences between bitmasks  (Ben <bench@silentmedia.com>)
Ответы Re: scoring differences between bitmasks  (Ben <bench@silentmedia.com>)
Список pgsql-general
On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:
> Yes, that's the straightforward way to do it. But given that my
> vectors are 256 bits in length, and that I'm going to eventually have
> about 4 million of them to search through, I was hoping greater minds
> than mine had figured out how to do it faster, or how compute some
> kind of indexing....... somehow.

Ah right, that's another kettle of fish. Well, your problem space is a
form of eulidean space in the sense that the triangle inequality holds,
eg dist(A,C) <= dist(A,B)+dist(B,C). Maybe one of the geometric index
classes could help.

Here comes something I thought up myself, but you may be able to come
up with something better.

1. For each string, store its distance from the strings:
0000000000 etc
0101010101 etc
0011001100 etc
maybe more or less, you'll need to investigate

2. Index each of these columns.
3. For your target string calculate the distances from each of these
4. Do a query like:

SELECT key, * from table
where dist_zero_string > $dist_zero_search-2 and dist_zero_string < $dist_zero_search+2
where dist_0101_string > $dist_0101_search-2 and dist_0101_string < $dist_0101_search+2
where dist_0011_string > $dist_0101_search-2 and dist_0011_string < $dist_0011_search+2
order by dist(key,searchvalue);

PostgreSQL 8.1 has bitmap index scans which could make the above quite
fast. Some theory may help determine the best key strings. For example
all-zeros and all-ones are redundant since you can calculate the
distance for one from the other. You also might need to vary your
distances to match with (the 2 above) depending on the average distance
to the strings and where you search.

You could use functional indexes to avoid storing the actual values in
the tables. Or you could take two or three of these values and make
points and use an rtree index on those to find nearby points (which
will represent nearby strings).

Anyway, this is just some random thoughts, hope it helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Вложения

В списке pgsql-general по дате отправления:

Предыдущее
От: Samik Raychaudhuri
Дата:
Сообщение: Portable PostgreSQL
Следующее
От: Nirmalya Lahiri
Дата:
Сообщение: Re: Broken pipe