Обсуждение: Storing number '001' ?

Поиск
Список
Период
Сортировка

Storing number '001' ?

От
Charles Hauser
Дата:
Hi,

I need to store numbers which contain '0' as the first digit like '001', '01' .

A little history.  A DNA clone_id is denoted by '894001A01.x1'.  I
need to sort  clone_id, and have broken it down into its meaningful
components:

project:    894
plate:        001
plate row:    A
plate column:     01
read:        x
ver:        1

CREATE TABLE clone_fasta (
project integer NOT NULL,
plate integer NOT NULL,
p_row char(1) NOT NULL,
p_column integer NOT NULL,
read char(1) NOT NULL,
ver integer NOT NULL,
length integer NOT NULL,
seq text NOT NULL,
PRIMARY KEY (project,plate,p_row,p_column,read,ver)
);

Unfortunately, storing these numbers as integers converts 001 ->1,
which I can't use.

How does one store a number like '001'?

Thanks,
--
Chuck

Re: Storing number '001' ?

От
Torbjörn Andersson
Дата:
> Unfortunately, storing these numbers as integers converts 001 ->1,
> which I can't use.
>
> How does one store a number like '001'?


Store it as a varchar and use CAST when you select it.


Vänliga hälsningar

Torbjörn Andersson
---------------------------------------------------
Embryo Communication      phone: +46 (0) 31-774 39 11(00)
Kungsgatan 7a             fax: +46 (0)31 774 07 80
S-411 19 Göteborg         mobile: 0708-30 70 04
Sweden                    home: http://www.embryo.se/
        mail: torbjorn.andersson@embryo.se
---------------------------------------------------
"Att idag tänka annorlunda än igår skiljer den vise från den envise." — John
Steinbeck



Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

First off, I'd like to congratulate you on having the foresight to break
down the value into seperate, atomic, components.  Far too many novice
DBA's would have stored the whole ID as one VARCHAR, and then written a
bunch of custom functions to parse it.  The latter approach has
disastrous consequences for data integrity, so I'm thrilled that you
have made the right choice.

> project: 894
> plate:  001
> plate row: A
> plate column:  01
> read:  x
> ver:  1

> Unfortunately, storing these numbers as integers converts 001 ->1,
> which I can't use.
>
> How does one store a number like '001'?

That depends on the answer to this question:
Is the Plate number always a zero-filled 3-digit integer?

If the answer is Yes, then you can simply use the to_char function to
format the integer plate number as you wish to see it:
SELECT to_char(plate, '000');
  (warning: to_char has a bug in ver. 7.1.x that causes it to sometimes
insert a leading space, e.g. " 001" instead of "001".  To fix this, use
the Trim function)
The database will still store the integer, making for smaller disk
space, faster searching, and automatic matching if you forget to
zero-fill.

If the answer is No (i.e. the plate number is sometimes more or less
than three digits) then you'll have to store the plate number as a
VARCHAR.  In this case, you will want to create a Constraint that
prevents entry of non-numerical characters into the Plate field, and
adapt your user interface so that it zero-fills user input automatically
before saving.  E-mail me back if you need to do this.

-Josh



______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
Jason Earl
Дата:
Actually you probably don't want to *store* those extra zeros, you
simply want to display them.

So simply make sure that you use lpad when you select these values.
For example if I had a table test like this:

CREATE TABLE test (
        foo     integer
);

INSERT INTO test (foo) VALUES (1);
INSERT INTO test (foo) VALUES (2);
INSERT INTO test (foo) VALUES (3);
INSERT INTO test (foo) VALUES (300);

I could then select from it with a query like:

SELECT lpad(foo::text, 3, '0') AS padded_num FROM test;

and get:

 padded_num
------------
 001
 002
 003
 300
(4 rows)

Neat huh!  Of course, you also might want to put a constraint on that
column because otherwise you could have problems.  For example if you
add another value to the table:

INSERT INTO test (foo) VALUES (3000);

and then select use the lpad query you get:

 padded_num
------------
 001
 002
 003
 300
 300
(5 rows)

Which could be bad.

Another tactic would be to simply store the value as a char(3) or
varchar(3) value and do your error checking when you insert the value.
If you are going to be doing math on these values, however, it is
probably a win to store the value as an integer.

Jason

Charles Hauser <chauser@acpub.duke.edu> writes:

> Hi,
>
> I need to store numbers which contain '0' as the first digit like
> '001', '01' .
>
> A little history.  A DNA clone_id is denoted by '894001A01.x1'.  I
> need to sort  clone_id, and have broken it down into its meaningful
> components:
>
> project:    894
> plate:        001
> plate row:    A
> plate column:     01
> read:        x
> ver:        1
>
> CREATE TABLE clone_fasta (
> project integer NOT NULL,
> plate integer NOT NULL,
> p_row char(1) NOT NULL,
> p_column integer NOT NULL,
> read char(1) NOT NULL,
> ver integer NOT NULL,
> length integer NOT NULL,
> seq text NOT NULL,
> PRIMARY KEY (project,plate,p_row,p_column,read,ver)
> );
>
> Unfortunately, storing these numbers as integers converts 001 ->1,
> which I can't use.
>
> How does one store a number like '001'?
>
> Thanks,
> --
> Chuck
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: Storing number '001' ?

От
Andrew McMillan
Дата:
On Thu, 2001-12-06 at 12:33, Charles Hauser wrote:
> Hi,
>
> I need to store numbers which contain '0' as the first digit like '001', '01' .
>
> A little history.  A DNA clone_id is denoted by '894001A01.x1'.  I
> need to sort  clone_id, and have broken it down into its meaningful
> components:
>
> project:    894
> plate:        001
> plate row:    A
> plate column:     01
> read:        x
> ver:        1
>
> CREATE TABLE clone_fasta (
> project integer NOT NULL,
> plate integer NOT NULL,
> p_row char(1) NOT NULL,
> p_column integer NOT NULL,
> read char(1) NOT NULL,
> ver integer NOT NULL,
> length integer NOT NULL,
> seq text NOT NULL,
> PRIMARY KEY (project,plate,p_row,p_column,read,ver)
> );
>
> Unfortunately, storing these numbers as integers converts 001 ->1,
> which I can't use.
>
> How does one store a number like '001'?

If you have to deal with numbers such that '001' is different to '01'
then they are not numbers: they are _text_.

Therefore: store them in a 'TEXT' field instead.

Also, if you need to sort them into numeric order nonetheless, you could
left-pad with spaces so that '001' becomes '   001' and '01' became
'    01' and so forth.

Or you could just:
... ORDER BY int4(textfield) ...

You can even:
CREATE INDEX myindex ON myfile( int4(textfield));

to have that sort ordering supported by an index.

Regards,
                    Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267


Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

> I have set up TABLES: clone_fasta, clone_qual and gb_accessions.  All
> three tables have the same PRIMARY KEY
> (project,plate,p_row,p_column,read,ver) but contain other unique data
> (seq and length in the case of TABLE clone_fasta shown below).

If I were setting up this database, I would create a "surrogate key"
using a SERIAL column and use that as my key for joins.  While your
"real" primary key is the 6-column combination above, I think you will
find 6-column joins a royal pain in practice.  In other words, change
the schema per my comments below.

> CREATE TABLE clone_fasta(
clone_id SERIAL NOT NULL,
> project INTEGER NOT NULL,
> plate INTEGER NOT NULL,          <-- to_char(plate, '000')
> p_row CHAR(1) NOT NULL,
> p_column INTEGER NOT NULL,  <-- to_char(p_column, '00')
> read CHAR(1) NOT NULL,
> ver INTEGER NOT NULL,
> length INTEGER NOT NULL,
> seq TEXT NOT NULL,
> PRIMARY KEY (project,plate,p_row,p_column,read,ver),
CONSTRAINT clone_id_cs UNIQUE (clone_id)
> );

If there are other unique/constrained columns, don't forget to add
constraints for them as well.  Finally, should any of the columns above
link to reference tables of possible values?  For example, I'm willing
to bet that the "read" column is constrained to 2 or 3 values.

And finally, you'll want the primary key and all columns with
constraints, or columns which regularly JOIN, indexed.  Pardon me if you
already know all this.

>      $result = $conn->exec(
>      "SELECT
> clone_fasta.seq,clone_fasta.length,clone_qual.qual,library.details,gb_accessions.accn_no
>      FROM clone_fasta,clone_qual,gb_accessions,library
>      WHERE clone_fasta.project = library.project AND
>      $la1 = $la2  AND  # set gb_accessions PK = clone_fasta PK
>      $la3 = $la2  AND  # set clone_qual PK = clone_fasta PK
>       clone_fasta.project = '$estIDs[0]' AND
>       clone_fasta.plate = '$estIDs[1]'   AND
>       clone_fasta.p_row = '$estIDs[2]'   AND
>       clone_fasta.p_column = '$estIDs[3]' AND
>       clone_fasta.read = '$estIDs[4]' AND
>       clone_fasta.ver = '$estIDs[5]'
>     ");
>
>
> I tried to use:  clone_fasta.to_char(plate,'000') = '$estIDs[1]' ,
> but this errors on syntax.

First, is that an interface language above that will swap out the
$estIDs[] values that you have listed?  PHP, maybe?

Second, yes, you did make a slight mistake in your syntax:
to_char(clone_fasta.plate,'000') = '$estIDs[1]'

Clearer, now?

BTW, I think that you're doing OK in the database design, but would
benefit significantly from a good database design book. See
http://techdocs.postgresql.org/bookreviews.php

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

> Thanks.  I will look into the books, any specific recomendation?  I
> am learning on the fly as they say.  I am a molecular biologist who
> now needs to learn Perl and DBA related....

Bleah!  Next they'll be asking me to learn molecular biology.
As for books, I'm responsible for the book page at
http://techdocs.postgresql.org/bokreviews.php .   Anything I recommend
is already there.  You might want to start (and end) with Database
Design for Mere Mortals, which is like a "for Dummies(tm)" book, but
with accurate information.

Since you're at a university, in theory you could get a CS student to
help you as a work-study project.  However, I have yet to meet the CS
major who was a good DBA; a real understanding of database integrity and
design seems to only come from real-world experience.  So you may be
better off doing the DBA stuff yourself.    What about getting a CS
student to help with the Perl and HTML, though?

> If interested, my working version of the database is at:
<address snipped in case you didn't want to post it to the world>
>
> If you enter a term such as 'kinase', you will find all examples of
> such in the DB and click through the various links for more detail.

Cool!  I have a friend at Berkeley MCB.  Can I show this to him?

> Close, Perl. Most of the genomics work is written in Perl, because it
> is quite powerful when it come to handling text (GCACTAGCAGGCGA, DNA
> sequence).

Yeah, I'd agree there.  It's hard to beat Perl for text parsing.  Also,
when you get more advanced, Perl::DBI supports full middleware
functionality.  Perl's got a steep learning curve, though.

> I wholeheartedly agree that doing joins w/6 columns is unmanageable.
> I had thought about using a SERIAL column in each TABLE, but did not
> see how to use them in joins.  My understanding is that the number
> generated is specific for each table row.  So, if I want to join 2
> tables(clone_fasta and clone_qual) and find the 'seq' and 'qual'
> values for clone '894001A01.x1', how can one use the SERIAL clone_id
> in the join?
>
> >
> >>  CREATE TABLE clone_fasta(
> >clone_id SERIAL NOT NULL, 'xxx'
> >>  project INTEGER NOT NULL, 894
> >  > plate INTEGER NOT NULL,    1
> >  > p_row CHAR(1) NOT NULL, A
> >  > p_column INTEGER NOT NULL, 1
> >  > read CHAR(1) NOT NULL, x
> >>  ver INTEGER NOT NULL, 1
> >>  length INTEGER NOT NULL, 373
> >>  seq TEXT NOT NULL,  GAGCTAGXCAGGATC...
> >>  PRIMARY KEY (project,plate,p_row,p_column,read,ver),
> >CONSTRAINT clone_id_cs UNIQUE (clone_id)
> >  > );
>
> >
> >  > CREATE TABLE clone_qual(
> >clone_id SERIAL NOT NULL, 'yyy'
> >>  project INTEGER NOT NULL, 894
> >  > plate INTEGER NOT NULL,   1
> >  > p_row CHAR(1) NOT NULL, A
> >  > p_column INTEGER NOT NULL, 1
> >  > read CHAR(1) NOT NULL, x
> >  > ver INTEGER NOT NULL, 1
> >  > qual INTEGER[] NOT NULL, {10,1,12,...}
> >  > PRIMARY KEY (project,plate,p_row,p_column,read,ver),
> >CONSTRAINT clone_id_cs UNIQUE (clone_id)
> >  > );
>
>
> SELECT clone_fasta.seq,clone_qual.qual
> WHERE clone_fasta.clone_id = clone_qual.clone_id   <-- These are
> never the same, correct?
> AND .....?

Ah!  I understand the confusion.  Before I try to sort this one out, can
you explain to me the relationship between the four tables?  What kind
of information is stored in clone-fasta?  What kind of data is stored in
clone_qual, and what is the relationship between the two tables?  What
about the other tables?

And I'll want to post our whole interaction to the pgsql-novice list as
education for the masses.  With your permission, I might compile our
e-mails as a lesson for techdocs.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

(I'm quoting most of your text -- my comments are embedded in your table
design)

> After writing this am, I think I may have begun to understand how to
> use the SERIAL id;
>
> A 'MASTER TABLE' containing ONLY clone description/component parts
> (894001A01.x1)
>
> CREATE TABLE clone(
> clone_id SERIAL PRIMARY KEY,
> project INTEGER NOT NULL,
> plate INTEGER NOT NULL,
> p_row CHAR(1) NOT NULL,
> p_column INTEGER NOT NULL,
> read CHAR(1) NOT NULL,
> ver INTEGER  NOT NULL,
> UNIQUE(project,plate,p_row,p_column,read,ver)
> );
Josh: As I said before, it is up to you whether you make clone_id the
primary key and the 6-column combo a unique index, or vice-versa.  Both
values are what is known as Candidate Keys, and will function to select
a unique record.  That being said, the approach you have taken above is
probably marginally better as you will be Joining the clone_id more
often than the 6 columns.
>
> DATA TABLES which refer back to MASTER
>
> CREATE TABLE clone_fasta(
> clone_id SERIAL PRIMARY KEY,
J: No, you want to give this table its own ID column:
    clone_fasta_id SERIAL PRIMARY KEY,
> fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE
> ??
J: I suggest instead:
    clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELECT
CASCADE;
(the last modifier means that if you delete the clone, the fasta record
is deleted too.   On Update Cascade is only relevant if you are going to
be manually changing the clone_id value, which you won't)
> seq TEXT NOT NULL,
> length INTEGER NOT NULL
> );
>
> CREATE TABLE clone_qual(
> clone_id SERIAL PRIMARY KEY,
> fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE
> ??
J: Per my comments above:
    clone_qual_id SERIAL PRIMARY KEY,
    clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELETE
CASCADE,
> qual INTEGER[] NOT NULL,
> );

You should NOT use an  array to store production data.  Array columns
are not relational, cannot be properly indexed, and are difficult to
query.  I use array columns only in temporary tables and buffer tables.
Can you describe the data you are storing in Qual so that I can
reccommend a suitable subtable structure?

>
> CREATE TABLE gb_accessions(
> clone_id SERIAL PRIMARY KEY,
> fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE
> ??
J: Same as above.  I think you can figure this one out.
> accn_no varchar(12) NOT NULL,
> gi_no varchar(12) NOT NULL
> );
>
> Close ?

Close.

The reasons you want to change the Foriegn Key column to match the name
of the key column in the parent table are: 1) clarity, and 2) A useful
shortcut in PostgreSQL:  "SELECT * FROM clone JOIN clone_fasta USING
(clone_id)" that will save you some typing.

More comments below:

> Two sets of numbers are used in tracking.  clone_id identifies an
> individual bacterial colony/clone.  contig_id identifies a sequence
> assembled from multiple clones.
> e.g.
> clone A:           GATTCTCTCTCTCGACGAGC
> clone B:                        GACGAGCATTATCTACGCATACTACTCATA
> contig (A+B):      GATTCTCTCTCTCGACGAGCATTATCTACGCATACTACTCATA
>
>
> 1. clone id: 894001A01.x1:
>  894: sequencing project (3 or 4 digit number)
>  001: plate clone resides in (96 well microtiter plates)
>    A: plate row (A-H)
>  01: plate column (1-12)
>    x: read, which primer was used to sequence the clone (x or y)
>    1: ver, some clones are sequenced more than once (1,2...n)
>
> 2. contig id: 20010822.123.1
>  20010822: date contig was assembled
>  123: contig_no
>  1: ver, several possibilities may result
>
> The assembled sequence of a CONTIG represents that of a gene.
>
> The TABLES I have at present are:
>
> clone_fasta:
>  clone_id (894001A01.x1) in the 6 columns
>  seq TEXT: DNA sequence
>  length INTEGER: # of nucleotides/bases in the sequence
>
> clone_qual:
>  clone_id (894001A01.x1) in the 6 columns
>  qual INTEGER[]: array of quality values for each base, (1-2
> digit integer  {1 9 9 45 38 44 ...}
>  several columns of descriptive info
>
> gb_accessions:
>  clone_id (894001A01.x1) in the 6 columns
>  accn_no (BG842967): each clone sequence is submitted to
> GenBank, a repository of public sequences
>  gi_no (14224151): GenBank assigns each clone an accession
> number (accn_no) and a gi_no.
>
> contig:  contig_id (20010822.123.1) stored in 3 columns
>  assembly_date: 20010822, date

Keep in mind that this can be stored as a real DATE, then formatted to
the above on query!

>  contig_no: 123
>  ver: 1
>  seq: assembled sequence from several clones
>  length: length of assembled sequence
>  ests: array of clone_ids which were used to generate the
> contig {894001A01.x1, 963125H01.y2}

A lot of the numbering above could be automated through Postgres
functions.

>
> homolog: Contains putative annotation data, i.e. what gene is it
>  assembly_date: 20010822, date
>  contig_no: 123
>  ver: 1
>  several columns of descriptive info
>  homolog: tentative description of 'gene' TEXT
>
> library: Info on the different DNA libraries being sequenced
>  project: sequencing project designation, ie '894', same #
> seen in clones, 894001A01.x1.
>  library: text describing library
>  details: more detailed description
>
> TABLES clone_fasta, clone_qual and gb_accessions all contain info
> regarding individual bacterial clones derived from a 'library' of
> clones.  They are interconnected by the clone id (894001A01.x1)
>
> TABLE library contains info explaining under what conditions clone
> 894001A01, and ALL clone prefixed by '894' were made. Also all other
> project descriptions (832,833,925,963, 1024,1031...)  The
> descriptions detail growth conditions used to produce the DNA
> library.
>
> TABLES homolog and contig contain 'meta' info about the sequence and
> identity of DNA sequences obtained from assembling related clones.  A
> contig '20010822.123.1' might be assembled from two clones
> (894001A01.x1, 963125H01.y2).

You'll want to also construct relational tables.  For example, if a
contig is assembled from two clones, you'll want a table called
contig_clones that relates in this fashion:

clones        contig_clones         contig
clone_id    -----------clone_id|contig_id----------contig_id
clone_id                /--clone_id|contig_id
clone_id           /
clone_id------/

> There, now you know all I do, and  more so.
> There are more tables to be added in the future, but the data you saw
> on the web site resides in the above tables, or predicessors to them.

That's a great start on a spec document.  In the future, if you have to
do this again, you'll want to do the above kind of assessment before any
actual table design.

To finish this off, I'd like some more info:
1. What is the information you want to store in arrays?  You should be
sotring this information in tables for the reasons I outlined above.
2. For the 3 "clone" tables, is the relationship between them one-to-one
or one-to-many?

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Charles,

> I get an error:
>
> chlamy_est=> CREATE TABLE clone_fasta(
> chlamy_est(> clone_fasta_id SERIAL PRIMARY KEY,
> chlamy_est(> clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON
> DELETE CASCADE,
> chlamy_est(> seq TEXT NOT NULL,
> chlamy_est(> length INTEGER NOT NULL
> chlamy_est(> );
> NOTICE:  CREATE TABLE will create implicit sequence
> 'clone_fasta_clone_fasta_id_seq' for SERIAL column
> 'clone_fasta.clone_fasta_id'
> NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index
> 'clone_fasta_pkey' for table 'clone_fasta'
> NOTICE:  CREATE TABLE will create implicit trigger(s) for FOREIGN KEY
> check(s)
> ERROR:  Relation 'clone_fasta_clone_fasta_id_seq' already exists
>
>
> Something funny going on w/column names?

Nope.  You just dropped a previous instance of the clone_fasta table and
Postgres does not automatically drop the sequence. You need to manually
drop the sequence:
DROP SEQUENCE clone_fasta_clone_fasta_id_seq;
... before re-creating the table.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Charles,

> Attached is my draft of a schema.

Hey, I like this!  What software did you use to generate it?  I
currently lack a diagramming tool that produces JPGs.

> I believe I was in error yesterday, when I stated that the
> relationship between the clone tables was one to one.  It is true
> that a clone has 1 and only 1 sequence, but a sequence exists for 1
> or more clones(TABLES clone_fasta and clone_qual).

See below.  This changes our schema.

> I am working my way thru Database Design for Mere Mortals, and
> another book, Relational Database Design Clearly Explained by J.L.
> Harrington.

I would appreciate a 1-paragraph review of the latter book for the
bookreviews page when you're done with it.

> In practical terms I don't understand how to deal with the FKs.  For
> example:
>
> CREATE TABLE clone_fasta(
> clone_fasta_id SERIAL PRIMARY KEY,
> clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELETE
> CASCADE,
> seq TEXT NOT NULL,
> length INTEGER NOT NULL
> );

The problem is that, according to your new diagram, our parent-child
relationship is backwards.  The clones table should contain FKs to
clone-fasta and clone-qual instead of what we have now.  If you can give
me a better description of where the fasta and qual information comes
from, and what order it's entered in, we can tweak the schema to match.

-Josh

P.S. List readers, I apologize for omitting the diagram but it's 150K.

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

> >1. Are "Fasta" and "Sequence" the same thing? (further questions
> assume
> >this to be the case).
>
> For our purposes, yes.  Strictly speaking however, fasta denotes a
> particular format the sequence is in:
>
> >894001A01.x1
> GATCGATCGCTACGTCAGAC
>
> is fasta formatted sequence, whereas:
>
> GATCGATCGCTACGTCAGAC
>
> is sequence.
>
> In TABLE clone_fasta.seq I store the latter, ie
> 'GATCGATCGCTACGTCAGAC'.  If it helps, I can name TABLE clone_fasta
> TABLE clone_sequence?

No, don't rename your tables for my convenience.  I was just confused
because you were using the word "fasta" in some places and "sequence" in
others.

> >2. We established in the last e-mail that there is potentially more
> than
> >one clone related to each clone_fasta and to each clone_qual record,
> and
> >that no clone has more than one fasta or qual record.  Is that still
> >true?
>
> I believe not. Let me answere this with an example, and go from
> there.
> I will simplify the clone id and not break it down into 6 fields.
>
>
>   TABLE  clone   TABLE clone_fasta
>   TABLE clone_qual
>   clone     seq
>    qual
> record 1: 894001A01.x1 <------> GATCGATATATA.....
> <------>    {9 9 9 23 34 45 ...}
>
> record 2: 894001A01.y1 <------> TTTTTTGATGAT.....
> <------>    {3 4 6 9 14 34 21 ...}
>
> record 3: 894001A02.x1 <------> GTTTCACTAGCT.....
> <------>    {8 5 15 31 24 7 ...}
>
>
>  From the above example, which is universally true, I would state
> that:
>
> 1. one and only one clone relates to each clone_fasta and to each
> clone_qual.
> 2. no clone has more than one fasta or qual record.
>
> Stated another way, each clone has one and only one fasta(sequence),
> and one and only one qual.

So, two more questions:
1. Does more than one clone potentially relate to each fasta?  Do we
care? (i.e. will we ever query the database for "Which clones relate to
sequence x?"  Or do we receive data like "CLones 1999, 2001, and 2173
have sequence x)?")
2. Does clone_qual properly relate to clones, or to clone_fasta?  From
your description, I can see things going either way.
3. The contigs:  is a contig assembled out of clones, sequences (fasta)
or quals?  I'm not clear on this.  From your description, it seems like
a contig might actually represent 1-2 sequences(fastas or qual?), as
opposed to 1-2 clones.

> >3. In what order does the data arrive for your tables?  I.e., is
> this an
> >accurate order of events:
> >(1) Clone data
> >(2) Sequence (Fasta?) data
> >(3) Qual data
> >(4) Contig data
> >(5) Library and Genebank data.
> >Is this accurate?
>
>
> More accurate to order them as:
>
> (5a) Library : is updated with each new project
> (1),(2),(3) : arrive simultaneously, but would be entered in the
> order you listed.
> (5b) Genbank : data submitted to Genbank after (1,2,3) are in hand
> (4)
> (6) blast

We'll hash this out eventually!  I think we're still struggling with you
speaking biologist and me speaking DBA ...

-Josh


______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

Answer below.

> Sorry 'bout that, tried to shoot off the email prior to a meeting.
>
> The aim is to enter data into TABLE clone_contig.
>
> My plan was to first load TABLES clone and contig:
>
> CREATE TABLE clone (
> clone_id SERIAL PRIMARY KEY,
> project INTEGER NOT NULL,
> plate CHAR(3) NOT NULL,
> p_row CHAR(1) NOT NULL,
> p_column CHAR(2) NOT NULL,
> read CHAR(1) NOT NULL,
> ver INTEGER NOT NULL,
> seq TEXT NOT NULL,
> qual TEXT NOT NULL,
> UNIQUE (project,plate,p_row,p_column,read,ver)
> );
>
>
> CREATE TABLE contig (
> contig_id SERIAL PRIMARY KEY,
> assembly_date date NOT NULL,
> contig_no integer NOT NULL,
> ver integer NOT NULL,
> length INTEGER NOT NULL,
> seq TEXT NOT NULL,
> UNIQUE (assembly_date,contig_no,ver)
> );
>
>
>
> Next, to load TABLE clone_contig I was going to do the following:
>
>
> CREATE TABLE clone_contig(
> clone_id INTEGER REFERENCES clone(clone_id) ON DELETE CASCADE,
> contig_id INTEGER REFERENCES contig(contig_id) ON DELETE CASCADE,
> UNIQUE(clone_id,contig_id)
> );
>
> Using Perl, I would build a hash of array containing an @clones for
> each contig:
>
>  @{$HoC{$contig}{clones}}
>
> For each key(contig) I was going to query the DB for 'contig
> contig_contig_id_seq' from TABLE contig, AND  'clone_clone_id_seq'
> for each clone in the array @{$HoC{$contig}{clones}} from TABLE
> clone.
>
>
> So, if contig '20010822.123.1' was assembled from 2 clones
> (894001A01.x1, 963012H10.x1)
> (note: clone and contigs are represented in full form, not as
> parceled out in tables above for simplicity sake)
>
>
>    TABLE clone    TABLE contig
>  clone_clone_id_seq     contig_contig_id_seq
>   167756   894001A01.x1 <----->
>  37238238 20010822.123.1
>   21389   963012H10.x1 <----->
>  37238238 20010822.123.1
>
>   SELECT clone_clone_id_seq FROM clone where clone='894001A01.x1';
>   SELECT clone_clone_id_seq FROM clone where clone='963012H10.x1';
>
>   SELECT contig_contig_id_seq FROM contig where
> contige='20010822.123.1';

Here's your only problem.  You don't select "clone_clone_id_seq".
That's the name of a Sequence.  You want to select your key column,
which is Clone_ID or Contig_ID.

If you actually wanted to get the ID of the last Contig you inserted,
for example, you could use the CURRVAL('contig_contig_id_seq') function.
See the following page for more info on how sequences work:

http://www.postgresql.org/idocs/index.php?sql-createsequence.html

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

> One last question before I leave you alone.
>
> I have test data in all 3 tables:
>
> clone <--->  clone_contig <----> contig
>
>
>
> In the case where contig '20010822.123.1' was assembled from 2 clones
> (894001A01.x1, 963012H10.x1)
>
>  TABLE clone  TABLE clone_contig  TABLE contig
>  clone_id clone   clone_id contig_id
>  contig_id contig
>  167756 894001A01.x1 167756 37238238
>  37238238 20010822.123.1
>  21389 963012H10.x1 21389 37238238
>
>
> Now, if I want to do a query to recover the sequence of all clones
> which were used to assemble contig 20010822.123.1 is the relational
> table clone_contig invoked automatically?
>
> ie can I do a query:
>
>  SELECT clone.seq FROM contig,clone WHERE contig='20010822.123.1';
>
> More generally, how does one make use of the relations set up in a
> relational table?

Well, those relations are really just complex constraints.  They only
shorten your queries in systems that support the NATURAL JOIN properly;
however, this JOIN implementation is so unevenly supported in various
databases that it's a better idea not to use it.  Thus, in your query:

SELECT clone.seq
FROM clone JOIN contig_clones USING (clone_id)
    JOIN contig USING (contig_id)
WHERE contig.contig = '20010822.123.1'*

*= remember that you are actually storing the contig value in 3 fields,
so you really need to compare against the 3 fields.

The REFERENCES constraints you put in the table definition are
*constraints*, meaning that they restrict what can go into the table.
They do not help you join the two tables, except for maybe making query
execution a little faster.  What they do do is make sure that every
single record in clone_contigs has corresponding records in clones and
contigs.  That way, you don't have to worry about testing for "orphan
records" because there can't be any.

Finally, you need to index all joined fields for performance reasons.
Assuming that you already have Primary Keys on clones.clone_id,
contigs.contig_id, and (contig_clones.clone_id,
contig_clones.contig_id), then the only index you are missing is a
seperate index on contig_clones.contig_id (dual-column indexes are
seldom of much use on JOINing the second column).

CREATE INDEX contig_clones_contig_idx ON contig_clones (contig_id);

-Josh Berkus


______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

> I am considering one last relational table to relate contig::blast.
>
> These two tables are related as:
>   a contig has 0,1 or more blast hits (0 only if contig not yet run
> thru blast)
>   a blast hit exists for 0, 1 or more contigs (0 only when no data
> loaded)
>
> As I have it currently, TABLE blast contains the FK contig_id to
> track which contig the blast record refers to.

Yes, but it doesn't fit your statements above, which clearly imply a
many-to-many relationship.  Drop the FK.
>
> Is it appropriate in this case to make a TABLE contig_blast, to me it
> appears so....?

This is correct.  However, your SQL is wrong.  See my corrections below.

> Now that I think about it, it may also be useful to make a table
> clone_blast. If one wanted to do a query for which blast results
> pertained to a given clone.
>
>  clone --- (assembled into) ---> contig ---> blast result
>  |      |
>  -------------------------------------------------

This, on the other hand, is unnecessary and even problematic.  We
already have the table clone_contig.  If we add the relationship table
contig_blast, then we can select all clones in a blast by selecting:

blast JOIN contig_blast JOIN contigs JOIN clone_contigs JOIN clones

Adding a seperate relationship table to maintain this information would
be redundant, and also troublesome because eventually clone_blast would
be bound to get out of synch with the relationship above.

> CREATE TABLE contig (
> contig_id SERIAL PRIMARY KEY,
> assembly_date date NOT NULL,
> contig_no integer NOT NULL,
> ver integer NOT NULL,
> length INTEGER NOT NULL,
> seq TEXT NOT NULL,
> UNIQUE (assembly_date,contig_no,ver)
> );
>
>
> CREATE TABLE blast (
> blast_id SERIAL PRIMARY KEY,
###> contig_id INTEGER REFERENCES contig(contig_id) ON DELETE
CASCADE,###
Delete this column.  It does not fit the data model.
> score integer,
> homolog text
> );
>
> CREATE TABLE contig_blast (
> blast_id INTEGER REFERENCES blast(blast_id) ON DELETE CASCADE,
> contig_id INTEGER REFERENCES contig(contig_id) ON DELETE CASCADE,
> UNIQUE(clone_id,contig_id)
uh-uh:  UNIQUE (blast_id, contig_id)
> );

> CREATE TABLE clone_blast (
> blast_id INTEGER REFERENCES blast(blast_id) ON DELETE CASCADE,
> clone_id INTEGER REFERENCES clone(clone_id) ON DELETE CASCADE,
> UNIQUE(clone_id,contig_id)
> );
Drop this table.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Storing number '001' ?

От
"Josh Berkus"
Дата:
Chuck,

>   With these two tables, entering data gets a tad more complicated.

Yup.  Time to write a GUI.

> CREATE TABLE blast (
> blast_id SERIAL PRIMARY KEY,
> others..,
> score integer,
> homolog text
> );
>
>
> CREATE TABLE contig_blast (
> blast_id INTEGER REFERENCES blast(blast_id) ON DELETE CASCADE,
> contig_id INTEGER REFERENCES contig(contig_id) ON DELETE CASCADE,
> UNIQUE(blast_id,contig_id)
> );
>
> w/ all contig data loaded into TABLE contig
>
> 1st: lookup contig_id for blast data to be entered.
> 2nd: INSERT INTO blast(col1..coln) VALUES(A...N)
> 3rd: after INSERT -> get blast_id for data just INSERTED
>     $blast_id = currval('blast_id')
>
> 4th: insert contig_id & blast_id into TABLE contig_blast

100% correct.

>
>
> I can't seem to recover the value of blast_id form a Perl script:
>
> for a test TABLE test:
>
> CREATE TABLE test (
> blast_id SERIAL PRIMARY KEY,
> homolog text
> );
>
>
>         *****************
>
> #!/usr/bin/perl -w
> use strict;
> use Pg;
> my ($conn,$result,$ID);
>
> my $CONNECT = $DBNAME . $DBHOST . $LOGIN . $PWD;
>
> $conn = Pg::connectdb($CONNECT);
> die $conn->errorMessage unless PGRES_CONNECTION_OK eq $conn->status;
>
>
> $conn->exec("INSERT INTO test(homolog) VALUES ('row5');");
>
> $ID = $conn->exec("currval('blastx_id');");

Correction:  $ID = $conn->exec("SELECT currval('blastx_id');");
And are you sure the sequence is named "Blastx_id"?  Normal naming would
be: 'blast_blast_id_seq'

Fnally, if this system ever goes mulit-user, you will want to wrap the
above in a transaction to prevent simultaneous updates from yielding a
deceptive CURRVAL result.

>
> print "ID: $ID\n";
>

-Josh Berkus

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco