Обсуждение: Array indexes, GIN?
I need to cross reference 2 tables. There are O(10M) A's, each has an ordered set of 10 of the O(100K) B's associated with it. The dominant query will be finding the A's and their count associated with a given list of ~1k B's i.e. if 2 of the listed B's are in A's set of 10, it's (A,2), and we should get back ~100K rows. The good news is we only need to run this brutal query every couple minutes, but the row updates will flow fast. Luckily this is PostgreSQL, so the simple solution seems to be CREATE TABLE xref( A bigint, B bigint[10] ); -- A is primary key which cuts down the table overhead. O(10M) rows w/array. On the surface, looks like a job for GIN, but GIN seems undocumented, specifically mentions it doesn't support the deletes we'll have many of since it's designed for word searching apparently, the performance implications are undocumented. I searched, I read, and even IRC'd, and it seems like GIN is just not used much. Is GIN right? Will this work at all? Will it run fast enough to function?
Adam, > On the surface, looks like a job for GIN, but GIN seems undocumented, > specifically mentions it doesn't support the deletes we'll have many of > since it's designed for word searching apparently, the performance > implications are undocumented. I searched, I read, and even IRC'd, and > it seems like GIN is just not used much. It's new (as of 8.2). And the authors, Oleg and Teodor, are notorious for skimpy documetentation. I'd start with the code in INTARRAY contrib module (also by Teodor) and bug them on pgsql-hackers about helping you implement a GIN index for arrays. -- Josh Berkus PostgreSQL @ Sun San Francisco
On Thu, 1 Mar 2007, Josh Berkus wrote: > Adam, > >> On the surface, looks like a job for GIN, but GIN seems undocumented, >> specifically mentions it doesn't support the deletes we'll have many of >> since it's designed for word searching apparently, the performance >> implications are undocumented. I searched, I read, and even IRC'd, and >> it seems like GIN is just not used much. > > It's new (as of 8.2). And the authors, Oleg and Teodor, are notorious for > skimpy documetentation. We're getting better, we have 72 pages written about new FTS :) > > I'd start with the code in INTARRAY contrib module (also by Teodor) and bug > them on pgsql-hackers about helping you implement a GIN index for arrays. GIN already has support for one dimensional arrays and intarray, particularly, too has support of GiN. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote on 3/1/2007 10:45 PM: > On Thu, 1 Mar 2007, Josh Berkus wrote: > >> Adam, >> >>> On the surface, looks like a job for GIN, but GIN seems undocumented, >>> specifically mentions it doesn't support the deletes we'll have many of >>> since it's designed for word searching apparently, the performance >>> implications are undocumented. I searched, I read, and even IRC'd, and >>> it seems like GIN is just not used much. >> >> It's new (as of 8.2). And the authors, Oleg and Teodor, are notorious >> for >> skimpy documetentation. > > We're getting better, we have 72 pages written about new FTS :) I'm guessing FTS is not quite done since you matched 'FTS' to 'GIN' ;) > GIN already has support for one dimensional arrays and intarray, > particularly, too has support of GiN. Great, so can GIN handle my situation? I'm a little unsure what to make of "Note: There is no delete operation for ET." in particular since I'm dealing with large numbers. -- Adam L. Beberg http://www.mithral.com/~beberg/
On Thu, 2007-03-01 at 19:59 -0800, Adam L Beberg wrote: > On the surface, looks like a job for GIN, but GIN seems undocumented, > specifically mentions it doesn't support the deletes we'll have many of > since it's designed for word searching apparently, the performance It can delete an entry for one of the keys of an index, it just can't delete the key itself when the number of entries goes down to zero. Because you only have O(100K) possible keys, that shouldn't be a problem. The GIN indexes can reclaim space. If they couldn't, they wouldn't be nearly as useful. The time when you run into problems is when you have a huge, sparsely populated keyspace, with a huge number of keys contained by no tuples in the table. However, for your application, GIN still might not be the right answer. GIN can only return tuples which do contain some matching keys, it won't return the number of matching keys in that tuple (that's not the job of an index). Let's run some numbers: * percentage of tuples returned = 100K rows out of the 10M = 1% * tuples per page = 8192 bytes / 32 (tuple header) + 8 (bigint) + 80 (10 bigints) = ~70. Let's say it's 50 due to some free space. Based on those numbers, the GIN index is basically going to say "get every other page". PostgreSQL will optimize that into a sequential scan because it makes no sense to do a random fetch for every other page. So, the fastest way you can do this (that I can see) is just fetch every tuple and count the number of matches in each array. You know your data better than I do, so replace those numbers with real ones and see if it still makes sense. The basic rule is that an index scan is useful only if it reduces the number of disk pages you need to fetch enough to make up for the extra cost of random I/O. Regards, Jeff Davis