Re: [GENERAL] Understanding and implementing a GiST Index

Поиск
Список
Период
Сортировка
От Connor Wolf
Тема Re: [GENERAL] Understanding and implementing a GiST Index
Дата
Msg-id 543784DC.9020107@imaginaryindustries.com
обсуждение исходный текст
Список pgsql-hackers
I'd be ok with either SP-GiST or GiST, though there are tree structures (BK tree) that are particularly suited to this application that I think map to GiST better then SP-GiST.

Re: hamming in the RD-tree implementation: Where? I cloned the smlar git repo, and poked around, and it looks like it only can operate on set intersections, which is a non-starter for what I need to do.  I spent a while looking through the source, and I didn't see anything that looked like hamming distance calculation (through I probably missed some stuff. The indirection through `FCall2()` makes things very hard to follow).

Anyways, right now, smlar is not useful to me, because it completely ignores the structure of incoming arrays:
postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
 smlar
-------
     1
(1 row)

For two radically different hashes (shortened for example purposes) which compare as identical.

I spent some time trying to see if I could easily disable the array uniquefication, and by disabling the calls to `qsort_arg`, and modifying `numOfIntersect` so it just counts the number of times the array elements do not match, and I'm getting the behaviour I want out of `smlar()` at this point:

postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# ); smlar
-------- 0.4375
(1 row)

postgres=# SELECT smlar(
'{0,1,1,0}'::int[],
'{1,0,1,0}'
); smlar
-------   0.5
(1 row)


But the index does not seem to work after the changes I made, and the debugging print statements (well, `elog()` statements) I inserted into `cmpArrayElem()` and `numOfIntersect()` are not being hit, so I don't understand how the index is even being built.

Anyways, I'm rambling a bit. Any input would be great.
Thanks,
Connor

On 10/9/2014 8:11 AM, Oleg Bartunov wrote:
Just quick recommendation.
You need to ask -hackers mailing list for that. Also, do you want GiST or SP-GiST ?
We already use hamming distance in rd-tree, which implemented in GiST, see our presentation http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf, for example.

Oleg

On Thu, Oct 9, 2014 at 11:09 AM, Connor Wolf <wolf@imaginaryindustries.com> wrote:
Hi there!

I'm trying to implement a custom indexing system using the GiST index framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across a set of 64 bit integers by hamming distance. This is intended to be used in searching for similar images, where the 64 bit integer is actually a phash of an image, and similarity between two images is reflected in the hamming distance between two integers.

Anyways, The appropriate approach here is to use something called a BK tree, for which I've written some test code and I think I have a decent grip of (my test code seems to work, in any event).

That said:
  • Is there a simple piece of example-code for implementing a GiST index for a single index? I've been looking through the /contrib/ directory, particularly the /contrib/btree_gist/ files, but it looks like 90% of the complexity there is related to supporting all the column types Postgres has, rather then actually tied to producing a functional index.
  • Once I have something compiling, how can I check to be sure that I'm actually using the indexing module I created? I can enable my compiled extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX test_index ON testing USING gist (val);` that lets me specify *which* GiST index is actually employed? Is this even a valid question?
    This is probably something that's available in one of the system tables somewhere, but I haven't had much luck with google finding out where.
  • Testing: What's the appropriate way to examine the generated tree structure of the index? I certainly went through a few bugs with my test tree system (and that was in python!). Is there any way to examine the index structure for debugging purposes?
  • Also, it looks like `ereport()` is the proper way to emit debugging information. Is this correct?
  • In that vein, is there any way to have information that is on the operations of an entire query? Looking at the number of tree nodes touched for a scan would be nice (and I would not be surprised if there is already a facility for it).

Project code is here if anyone is interested, any help would be great. I have very little idea what I'm doing.

Thanks,
Connor



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Fabrízio de Royes Mello
Дата:
Сообщение: Re: schema-only -n option in pg_restore fails
Следующее
От: Robert Haas
Дата:
Сообщение: Re: UPSERT wiki page, and SQL MERGE syntax