Re: Using a GIN index on an integer array to model sets of tags

Поиск
Список
Период
Сортировка
От Ryan Kelly
Тема Re: Using a GIN index on an integer array to model sets of tags
Дата
Msg-id 20121117211503.GA5149@llserver.lakeliving.com
обсуждение исходный текст
Ответ на Using a GIN index on an integer array to model sets of tags  (Mike Jarmy <mjarmy@gmail.com>)
Список pgsql-general
On Sat, Nov 17, 2012 at 11:01:41AM -0500, Mike Jarmy wrote:
> I am researching how to set up a schema for querying a set of tags
> associated with an object.  There are approximately 100 distinct tags
> in my application (these will be pre-populated), and I am expecting a
> fairly low number of distinct sets of these tags  -- in other words, a
> given set of tags will most likely be reused by many objects.  There
> will be on the order of thousands to hundreds-of-thousands of records
> in the 'objects' table.
>
> I could use the standard many-to-many approach for this, that I've
> used many times in the past:
>
> CREATE TABLE TAGS (
>     ID SMALLINT PRIMARY KEY,
>     NAME VARCHAR);
>
> CREATE TABLE OBJECTS (
>     ID SERIAL PRIMARY KEY,
>     FOO VARCHAR);
>
> CREATE TABLE OBJECT_TAGS (
>     OBJECT_ID INT,
>     TAG_ID SMALLINT,
>     PRIMARY KEY (OBJECT_ID, TAG_ID));
>
> However, I've run across a couple of articles suggesting that I could
> use an array field to store the tags on the object, and use a GIN
> index to do set comparisons, e.g:
> http://sjsnyder.com/using-postgresql-arrays-with-gin-indexes-for.
> Though note that in that article they are storing each tag in the
> 'tags' field as text, whereas I think it would make more sense to
> store the id of each tag in the 'tags' field, and cache the mapping
> from tag names to tag ids in my application:
We actually do exactly what the link above suggests and store the text
directly, but we also have thousands of distinct tags. On the other
hand, with a large number of objects, you (we?) could save a tremendous
amount of space by "decoding" the tags either application-side or using
a stored procedure, which is something I've been investigating recently.

> CREATE TABLE TAGS (
>     ID SMALLINT PRIMARY KEY,
>     NAME VARCHAR);
>
> CREATE TABLE OBJECTS (
>     ID SERIAL PRIMARY KEY,
>     TAGS SMALLINT[],
>     FOO VARCHAR);
>
> CREATE INDEX OBJECTS_TAGS_INDEX ON OBJECTS USING GIN (TAGS);
>
> I like this array-plus-gin-index approach because the object id is not
> repeated over and over again as it is in the many-to-many link table.
> It seems like it would be much more efficient.  Another advantage is
> that the queries are so much more legible -- it just looks a lot
> cleaner to me.
>
> However, the PostgreSQL manual's chapter on arrays
> (http://www.postgresql.org/docs/9.2/static/arrays.html) does not
> mention GIN indexes.  Instead, it has a tip saying "Arrays are not
> sets; searching for specific array elements can be a sign of database
> misdesign." Of course the manual is correct if the array field is
> unindexed, but I find the fact that the GIN-indexing approach is not
> mentioned to be puzzling.
Arrays are not sets, but we use them just like sets. It makes me wish
SQL standard sets were implemented in postgres.

> So, my question is: Is it in fact a good idea to use the
> integer-array-plus-GIN-index approach as a way to store sets in-line?
> Does anyone actually do this in production? Or is the old-school
> many-to-many relationship still the way to go?  Of course I'm going to
> try both approaches out myself and see how it works, but I'm
> interested in the communities opinions on this subject (particularly
> since I'm new to PostgreSQL, having mostly used commercial databases
> in the past).
One-to-many and many-to-many our still our go-to options, but if the
number of items participating in the join is going to be large, arrays
are far superior.  Especially when then join does no filtering. A common
operation for us is to download a very large number - millions - of
items filtered by a certain set of tags. The original design used
one-to-many to store the association between items and tags. When it was
replaced with an approach using arrays with gin indexes, the speed up
for finding large numbers of items was 20x, and finding smaller subsets
was 60x. This was about 18 months ago on postgres 8.4, so I may not be
100% correct here. But that was the ballpark.

> Thanks, Mike Jarmy

-Ryan Kelly


В списке pgsql-general по дате отправления:

Предыдущее
От: Adrian Klaver
Дата:
Сообщение: Re: 9.2 streaming replication issue and solution strategy
Следующее
От: Marko Kreen
Дата:
Сообщение: Re: Plproxy with returns table() make PG segfault