Обсуждение: Array indexes, GIN?

Поиск
Список
Период
Сортировка

Array indexes, GIN?

От
Adam L Beberg
Дата:
I need to cross reference 2 tables. There are O(10M) A's, each has an
ordered set of 10 of the O(100K) B's associated with it. The dominant
query will be finding the A's and their count associated with a given
list of ~1k B's i.e. if 2 of the listed B's are in A's set of 10, it's
(A,2), and we should get back ~100K rows. The good news is we only need
to run this brutal query every couple minutes, but the row updates will
flow fast.

Luckily this is PostgreSQL, so the simple solution seems to be

   CREATE TABLE xref( A bigint, B bigint[10] ); -- A is primary key

which cuts down the table overhead. O(10M) rows w/array.

On the surface, looks like a job for GIN, but GIN seems undocumented,
specifically mentions it doesn't support the deletes we'll have many of
since it's designed for word searching apparently, the performance
implications are undocumented. I searched, I read, and even IRC'd, and
it seems like GIN is just not used much.

Is GIN right? Will this work at all? Will it run fast enough to function?

Re: Array indexes, GIN?

От
Josh Berkus
Дата:
Adam,

> On the surface, looks like a job for GIN, but GIN seems undocumented,
> specifically mentions it doesn't support the deletes we'll have many of
> since it's designed for word searching apparently, the performance
> implications are undocumented. I searched, I read, and even IRC'd, and
> it seems like GIN is just not used much.

It's new (as of 8.2).  And the authors, Oleg and Teodor, are notorious for
skimpy documetentation.

I'd start with the code in INTARRAY contrib module (also by Teodor) and bug
them on pgsql-hackers about helping you implement a GIN index for arrays.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Array indexes, GIN?

От
Oleg Bartunov
Дата:
On Thu, 1 Mar 2007, Josh Berkus wrote:

> Adam,
>
>> On the surface, looks like a job for GIN, but GIN seems undocumented,
>> specifically mentions it doesn't support the deletes we'll have many of
>> since it's designed for word searching apparently, the performance
>> implications are undocumented. I searched, I read, and even IRC'd, and
>> it seems like GIN is just not used much.
>
> It's new (as of 8.2).  And the authors, Oleg and Teodor, are notorious for
> skimpy documetentation.

We're getting better, we have 72 pages written about new FTS :)

>
> I'd start with the code in INTARRAY contrib module (also by Teodor) and bug
> them on pgsql-hackers about helping you implement a GIN index for arrays.

GIN already has support for one dimensional arrays and intarray, particularly,
too has support of GiN.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Array indexes, GIN?

От
Adam L Beberg
Дата:
Oleg Bartunov wrote on 3/1/2007 10:45 PM:
> On Thu, 1 Mar 2007, Josh Berkus wrote:
>
>> Adam,
>>
>>> On the surface, looks like a job for GIN, but GIN seems undocumented,
>>> specifically mentions it doesn't support the deletes we'll have many of
>>> since it's designed for word searching apparently, the performance
>>> implications are undocumented. I searched, I read, and even IRC'd, and
>>> it seems like GIN is just not used much.
>>
>> It's new (as of 8.2).  And the authors, Oleg and Teodor, are notorious
>> for
>> skimpy documetentation.
>
> We're getting better, we have 72 pages written about new FTS :)

I'm guessing FTS is not quite done since you matched 'FTS' to 'GIN' ;)

> GIN already has support for one dimensional arrays and intarray,
> particularly, too has support of GiN.

Great, so can GIN handle my situation? I'm a little unsure what to make
of "Note: There is no delete operation for ET." in particular since I'm
dealing with large numbers.

--
Adam L. Beberg
http://www.mithral.com/~beberg/

Re: Array indexes, GIN?

От
Jeff Davis
Дата:
On Thu, 2007-03-01 at 19:59 -0800, Adam L Beberg wrote:
> On the surface, looks like a job for GIN, but GIN seems undocumented,
> specifically mentions it doesn't support the deletes we'll have many of
> since it's designed for word searching apparently, the performance

It can delete an entry for one of the keys of an index, it just can't
delete the key itself when the number of entries goes down to zero.
Because you only have O(100K) possible keys, that shouldn't be a
problem. The GIN indexes can reclaim space. If they couldn't, they
wouldn't be nearly as useful.

The time when you run into problems is when you have a huge, sparsely
populated keyspace, with a huge number of keys contained by no tuples in
the table.

However, for your application, GIN still might not be the right answer.
GIN can only return tuples which do contain some matching keys, it won't
return the number of matching keys in that tuple (that's not the job of
an index).

Let's run some numbers:

 * percentage of tuples returned = 100K rows out of the 10M = 1%
 * tuples per page = 8192 bytes / 32 (tuple header) + 8 (bigint) + 80
(10 bigints) = ~70. Let's say it's 50 due to some free space.

Based on those numbers, the GIN index is basically going to say "get
every other page". PostgreSQL will optimize that into a sequential scan
because it makes no sense to do a random fetch for every other page.

So, the fastest way you can do this (that I can see) is just fetch every
tuple and count the number of matches in each array. You know your data
better than I do, so replace those numbers with real ones and see if it
still makes sense.

The basic rule is that an index scan is useful only if it reduces the
number of disk pages you need to fetch enough to make up for the extra
cost of random I/O.

Regards,
    Jeff Davis