Re: scoring differences between bitmasks
От | Ben |
---|---|
Тема | Re: scoring differences between bitmasks |
Дата | |
Msg-id | D54F13C3-7377-4B1A-B0E7-3F53EB5610DA@silentmedia.com обсуждение исходный текст |
Ответ на | Re: scoring differences between bitmasks (Martijn van Oosterhout <kleptog@svana.org>) |
Список | pgsql-general |
Yeah, I thought about this, but damn my 3-dimensional mind, I just can't convince myself this will work for 256 dimensions. :) Or rather, I can see how it could, *given* good placements of the grid points to snap each vector to. Unfortunately, I don't know enough theory to know what good placements would be. On Oct 2, 2005, at 9:47 AM, Martijn van Oosterhout wrote: > On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote: > >> Yes, that's the straightforward way to do it. But given that my >> vectors are 256 bits in length, and that I'm going to eventually have >> about 4 million of them to search through, I was hoping greater minds >> than mine had figured out how to do it faster, or how compute some >> kind of indexing....... somehow. >> > > Ah right, that's another kettle of fish. Well, your problem space is a > form of eulidean space in the sense that the triangle inequality > holds, > eg dist(A,C) <= dist(A,B)+dist(B,C). Maybe one of the geometric index > classes could help. > > Here comes something I thought up myself, but you may be able to come > up with something better. > > 1. For each string, store its distance from the strings: > 0000000000 etc > 0101010101 etc > 0011001100 etc > maybe more or less, you'll need to investigate > > 2. Index each of these columns. > 3. For your target string calculate the distances from each of these > 4. Do a query like: > > SELECT key, * from table > where dist_zero_string > $dist_zero_search-2 and dist_zero_string < > $dist_zero_search+2 > where dist_0101_string > $dist_0101_search-2 and dist_0101_string < > $dist_0101_search+2 > where dist_0011_string > $dist_0101_search-2 and dist_0011_string < > $dist_0011_search+2 > order by dist(key,searchvalue); > > PostgreSQL 8.1 has bitmap index scans which could make the above quite > fast. Some theory may help determine the best key strings. For example > all-zeros and all-ones are redundant since you can calculate the > distance for one from the other. You also might need to vary your > distances to match with (the 2 above) depending on the average > distance > to the strings and where you search. > > You could use functional indexes to avoid storing the actual values in > the tables. Or you could take two or three of these values and make > points and use an rtree index on those to find nearby points (which > will represent nearby strings). > > Anyway, this is just some random thoughts, hope it helps, > -- > Martijn van Oosterhout <kleptog@svana.org> http://svana.org/ > kleptog/ > >> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent >> is a >> tool for doing 5% of the work and then sitting around waiting for >> someone >> else to do the other 95% so you can sue them. >
В списке pgsql-general по дате отправления: