Advice for using integer arrays?

Поиск

Список

Период

Сортировка

От	Michael Heaney
Тема	Advice for using integer arrays?
Дата	6 января 2015 г. 20:10:12
Msg-id	54AC16E4.9060709@jcvi.org обсуждение исходный текст
Ответы	Re: Advice for using integer arrays? (Arthur Silva <arthurprs@gmail.com>) Re: Advice for using integer arrays? (Rob Sargent <robjsargent@gmail.com>) Re: Advice for using integer arrays? (Jeff Janes <jeff.janes@gmail.com>) Re: Advice for using integer arrays? (Martijn van Oosterhout <kleptog@svana.org>)
Список	pgsql-general

Дерево обсуждения

I'm fairly new to Postgres, and have a design issue for which an array
of integers might be a good solution.  But I'd like to hear from the
experts before proceeding down this path.

Essentially, I'm trying to model the relationship between a group of
biological samples and their genes. Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes. Conversely, a
particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many relationship:

  create table sample_gene (id serial, sample_id int, gene_id int);

which looks like this when populated:

sample_id    |    gene_id
---------------------------------------
1                 |      1
1                 |      2
...
1                 |    30475
2                 |     1
2                 |     2
...
2                 |    29973
3                 |      1
etc.

The table now contains hundreds of millions of rows (with many, many
more to come).  Join performance between samples and genes is quite
slow, even with indexes on sample_id and gene_id.

So it occurred to me: why not eliminate all the duplicate sample_id
values by storing the gene_id's in an array, like so:

  create table sample_gene_array (id serial, sample_id int, gene_id int
[] );

So now the table data looks like this:

sample_id    |    gene_id []
---------------------------------------
1                 |      [1:30475]
2                 |      [1:29973]
etc.

The new table is significantly smaller, and performance (using ANY[] )
is quite good.  Nevertheless, I'm uneasy.  I come from a Sybase ASE
background, and so have no experience with arrays as datatypes. Is it
okay to store 30K+ gene values in an array in the linking table (or
maybe even in the sample table itself, thus eliminating the linking
table)?  Should I unnest the gene_id's first, before using them to join
to the gene table?

TIA for any guidance you can provide.  Again, I'm a Postgres neophyte -
but I'm in awe of the power and flexibility of this database, and wish
that I'd started using it sooner.

------
Michael Heaney
JCVI

В списке pgsql-general по дате отправления:

Предыдущее

От: Jonathan Vanasco
Дата: 06 января 2015 г., 20:07:16
Сообщение: Re: postgresql versus riak for a global exchange

Следующее

От: Arthur Silva
Дата: 06 января 2015 г., 20:48:05
Сообщение: Re: Advice for using integer arrays?

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Advice for using integer arrays?

Предыдущее

Следующее