Обсуждение: Storing number '001' ?
Hi, I need to store numbers which contain '0' as the first digit like '001', '01' . A little history. A DNA clone_id is denoted by '894001A01.x1'. I need to sort clone_id, and have broken it down into its meaningful components: project: 894 plate: 001 plate row: A plate column: 01 read: x ver: 1 CREATE TABLE clone_fasta ( project integer NOT NULL, plate integer NOT NULL, p_row char(1) NOT NULL, p_column integer NOT NULL, read char(1) NOT NULL, ver integer NOT NULL, length integer NOT NULL, seq text NOT NULL, PRIMARY KEY (project,plate,p_row,p_column,read,ver) ); Unfortunately, storing these numbers as integers converts 001 ->1, which I can't use. How does one store a number like '001'? Thanks, -- Chuck
> Unfortunately, storing these numbers as integers converts 001 ->1, > which I can't use. > > How does one store a number like '001'? Store it as a varchar and use CAST when you select it. Vänliga hälsningar Torbjörn Andersson --------------------------------------------------- Embryo Communication phone: +46 (0) 31-774 39 11(00) Kungsgatan 7a fax: +46 (0)31 774 07 80 S-411 19 Göteborg mobile: 0708-30 70 04 Sweden home: http://www.embryo.se/ mail: torbjorn.andersson@embryo.se --------------------------------------------------- "Att idag tänka annorlunda än igår skiljer den vise från den envise." John Steinbeck
Chuck, First off, I'd like to congratulate you on having the foresight to break down the value into seperate, atomic, components. Far too many novice DBA's would have stored the whole ID as one VARCHAR, and then written a bunch of custom functions to parse it. The latter approach has disastrous consequences for data integrity, so I'm thrilled that you have made the right choice. > project: 894 > plate: 001 > plate row: A > plate column: 01 > read: x > ver: 1 > Unfortunately, storing these numbers as integers converts 001 ->1, > which I can't use. > > How does one store a number like '001'? That depends on the answer to this question: Is the Plate number always a zero-filled 3-digit integer? If the answer is Yes, then you can simply use the to_char function to format the integer plate number as you wish to see it: SELECT to_char(plate, '000'); (warning: to_char has a bug in ver. 7.1.x that causes it to sometimes insert a leading space, e.g. " 001" instead of "001". To fix this, use the Trim function) The database will still store the integer, making for smaller disk space, faster searching, and automatic matching if you forget to zero-fill. If the answer is No (i.e. the plate number is sometimes more or less than three digits) then you'll have to store the plate number as a VARCHAR. In this case, you will want to create a Constraint that prevents entry of non-numerical characters into the Plate field, and adapt your user interface so that it zero-fills user input automatically before saving. E-mail me back if you need to do this. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Actually you probably don't want to *store* those extra zeros, you simply want to display them. So simply make sure that you use lpad when you select these values. For example if I had a table test like this: CREATE TABLE test ( foo integer ); INSERT INTO test (foo) VALUES (1); INSERT INTO test (foo) VALUES (2); INSERT INTO test (foo) VALUES (3); INSERT INTO test (foo) VALUES (300); I could then select from it with a query like: SELECT lpad(foo::text, 3, '0') AS padded_num FROM test; and get: padded_num ------------ 001 002 003 300 (4 rows) Neat huh! Of course, you also might want to put a constraint on that column because otherwise you could have problems. For example if you add another value to the table: INSERT INTO test (foo) VALUES (3000); and then select use the lpad query you get: padded_num ------------ 001 002 003 300 300 (5 rows) Which could be bad. Another tactic would be to simply store the value as a char(3) or varchar(3) value and do your error checking when you insert the value. If you are going to be doing math on these values, however, it is probably a win to store the value as an integer. Jason Charles Hauser <chauser@acpub.duke.edu> writes: > Hi, > > I need to store numbers which contain '0' as the first digit like > '001', '01' . > > A little history. A DNA clone_id is denoted by '894001A01.x1'. I > need to sort clone_id, and have broken it down into its meaningful > components: > > project: 894 > plate: 001 > plate row: A > plate column: 01 > read: x > ver: 1 > > CREATE TABLE clone_fasta ( > project integer NOT NULL, > plate integer NOT NULL, > p_row char(1) NOT NULL, > p_column integer NOT NULL, > read char(1) NOT NULL, > ver integer NOT NULL, > length integer NOT NULL, > seq text NOT NULL, > PRIMARY KEY (project,plate,p_row,p_column,read,ver) > ); > > Unfortunately, storing these numbers as integers converts 001 ->1, > which I can't use. > > How does one store a number like '001'? > > Thanks, > -- > Chuck > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
On Thu, 2001-12-06 at 12:33, Charles Hauser wrote: > Hi, > > I need to store numbers which contain '0' as the first digit like '001', '01' . > > A little history. A DNA clone_id is denoted by '894001A01.x1'. I > need to sort clone_id, and have broken it down into its meaningful > components: > > project: 894 > plate: 001 > plate row: A > plate column: 01 > read: x > ver: 1 > > CREATE TABLE clone_fasta ( > project integer NOT NULL, > plate integer NOT NULL, > p_row char(1) NOT NULL, > p_column integer NOT NULL, > read char(1) NOT NULL, > ver integer NOT NULL, > length integer NOT NULL, > seq text NOT NULL, > PRIMARY KEY (project,plate,p_row,p_column,read,ver) > ); > > Unfortunately, storing these numbers as integers converts 001 ->1, > which I can't use. > > How does one store a number like '001'? If you have to deal with numbers such that '001' is different to '01' then they are not numbers: they are _text_. Therefore: store them in a 'TEXT' field instead. Also, if you need to sort them into numeric order nonetheless, you could left-pad with spaces so that '001' becomes ' 001' and '01' became ' 01' and so forth. Or you could just: ... ORDER BY int4(textfield) ... You can even: CREATE INDEX myindex ON myfile( int4(textfield)); to have that sort ordering supported by an index. Regards, Andrew. -- -------------------------------------------------------------------- Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267
Chuck, > I have set up TABLES: clone_fasta, clone_qual and gb_accessions. All > three tables have the same PRIMARY KEY > (project,plate,p_row,p_column,read,ver) but contain other unique data > (seq and length in the case of TABLE clone_fasta shown below). If I were setting up this database, I would create a "surrogate key" using a SERIAL column and use that as my key for joins. While your "real" primary key is the 6-column combination above, I think you will find 6-column joins a royal pain in practice. In other words, change the schema per my comments below. > CREATE TABLE clone_fasta( clone_id SERIAL NOT NULL, > project INTEGER NOT NULL, > plate INTEGER NOT NULL, <-- to_char(plate, '000') > p_row CHAR(1) NOT NULL, > p_column INTEGER NOT NULL, <-- to_char(p_column, '00') > read CHAR(1) NOT NULL, > ver INTEGER NOT NULL, > length INTEGER NOT NULL, > seq TEXT NOT NULL, > PRIMARY KEY (project,plate,p_row,p_column,read,ver), CONSTRAINT clone_id_cs UNIQUE (clone_id) > ); If there are other unique/constrained columns, don't forget to add constraints for them as well. Finally, should any of the columns above link to reference tables of possible values? For example, I'm willing to bet that the "read" column is constrained to 2 or 3 values. And finally, you'll want the primary key and all columns with constraints, or columns which regularly JOIN, indexed. Pardon me if you already know all this. > $result = $conn->exec( > "SELECT > clone_fasta.seq,clone_fasta.length,clone_qual.qual,library.details,gb_accessions.accn_no > FROM clone_fasta,clone_qual,gb_accessions,library > WHERE clone_fasta.project = library.project AND > $la1 = $la2 AND # set gb_accessions PK = clone_fasta PK > $la3 = $la2 AND # set clone_qual PK = clone_fasta PK > clone_fasta.project = '$estIDs[0]' AND > clone_fasta.plate = '$estIDs[1]' AND > clone_fasta.p_row = '$estIDs[2]' AND > clone_fasta.p_column = '$estIDs[3]' AND > clone_fasta.read = '$estIDs[4]' AND > clone_fasta.ver = '$estIDs[5]' > "); > > > I tried to use: clone_fasta.to_char(plate,'000') = '$estIDs[1]' , > but this errors on syntax. First, is that an interface language above that will swap out the $estIDs[] values that you have listed? PHP, maybe? Second, yes, you did make a slight mistake in your syntax: to_char(clone_fasta.plate,'000') = '$estIDs[1]' Clearer, now? BTW, I think that you're doing OK in the database design, but would benefit significantly from a good database design book. See http://techdocs.postgresql.org/bookreviews.php -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Chuck, > Thanks. I will look into the books, any specific recomendation? I > am learning on the fly as they say. I am a molecular biologist who > now needs to learn Perl and DBA related.... Bleah! Next they'll be asking me to learn molecular biology. As for books, I'm responsible for the book page at http://techdocs.postgresql.org/bokreviews.php . Anything I recommend is already there. You might want to start (and end) with Database Design for Mere Mortals, which is like a "for Dummies(tm)" book, but with accurate information. Since you're at a university, in theory you could get a CS student to help you as a work-study project. However, I have yet to meet the CS major who was a good DBA; a real understanding of database integrity and design seems to only come from real-world experience. So you may be better off doing the DBA stuff yourself. What about getting a CS student to help with the Perl and HTML, though? > If interested, my working version of the database is at: <address snipped in case you didn't want to post it to the world> > > If you enter a term such as 'kinase', you will find all examples of > such in the DB and click through the various links for more detail. Cool! I have a friend at Berkeley MCB. Can I show this to him? > Close, Perl. Most of the genomics work is written in Perl, because it > is quite powerful when it come to handling text (GCACTAGCAGGCGA, DNA > sequence). Yeah, I'd agree there. It's hard to beat Perl for text parsing. Also, when you get more advanced, Perl::DBI supports full middleware functionality. Perl's got a steep learning curve, though. > I wholeheartedly agree that doing joins w/6 columns is unmanageable. > I had thought about using a SERIAL column in each TABLE, but did not > see how to use them in joins. My understanding is that the number > generated is specific for each table row. So, if I want to join 2 > tables(clone_fasta and clone_qual) and find the 'seq' and 'qual' > values for clone '894001A01.x1', how can one use the SERIAL clone_id > in the join? > > > > >> CREATE TABLE clone_fasta( > >clone_id SERIAL NOT NULL, 'xxx' > >> project INTEGER NOT NULL, 894 > > > plate INTEGER NOT NULL, 1 > > > p_row CHAR(1) NOT NULL, A > > > p_column INTEGER NOT NULL, 1 > > > read CHAR(1) NOT NULL, x > >> ver INTEGER NOT NULL, 1 > >> length INTEGER NOT NULL, 373 > >> seq TEXT NOT NULL, GAGCTAGXCAGGATC... > >> PRIMARY KEY (project,plate,p_row,p_column,read,ver), > >CONSTRAINT clone_id_cs UNIQUE (clone_id) > > > ); > > > > > > CREATE TABLE clone_qual( > >clone_id SERIAL NOT NULL, 'yyy' > >> project INTEGER NOT NULL, 894 > > > plate INTEGER NOT NULL, 1 > > > p_row CHAR(1) NOT NULL, A > > > p_column INTEGER NOT NULL, 1 > > > read CHAR(1) NOT NULL, x > > > ver INTEGER NOT NULL, 1 > > > qual INTEGER[] NOT NULL, {10,1,12,...} > > > PRIMARY KEY (project,plate,p_row,p_column,read,ver), > >CONSTRAINT clone_id_cs UNIQUE (clone_id) > > > ); > > > SELECT clone_fasta.seq,clone_qual.qual > WHERE clone_fasta.clone_id = clone_qual.clone_id <-- These are > never the same, correct? > AND .....? Ah! I understand the confusion. Before I try to sort this one out, can you explain to me the relationship between the four tables? What kind of information is stored in clone-fasta? What kind of data is stored in clone_qual, and what is the relationship between the two tables? What about the other tables? And I'll want to post our whole interaction to the pgsql-novice list as education for the masses. With your permission, I might compile our e-mails as a lesson for techdocs. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Chuck, (I'm quoting most of your text -- my comments are embedded in your table design) > After writing this am, I think I may have begun to understand how to > use the SERIAL id; > > A 'MASTER TABLE' containing ONLY clone description/component parts > (894001A01.x1) > > CREATE TABLE clone( > clone_id SERIAL PRIMARY KEY, > project INTEGER NOT NULL, > plate INTEGER NOT NULL, > p_row CHAR(1) NOT NULL, > p_column INTEGER NOT NULL, > read CHAR(1) NOT NULL, > ver INTEGER NOT NULL, > UNIQUE(project,plate,p_row,p_column,read,ver) > ); Josh: As I said before, it is up to you whether you make clone_id the primary key and the 6-column combo a unique index, or vice-versa. Both values are what is known as Candidate Keys, and will function to select a unique record. That being said, the approach you have taken above is probably marginally better as you will be Joining the clone_id more often than the 6 columns. > > DATA TABLES which refer back to MASTER > > CREATE TABLE clone_fasta( > clone_id SERIAL PRIMARY KEY, J: No, you want to give this table its own ID column: clone_fasta_id SERIAL PRIMARY KEY, > fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE > ?? J: I suggest instead: clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELECT CASCADE; (the last modifier means that if you delete the clone, the fasta record is deleted too. On Update Cascade is only relevant if you are going to be manually changing the clone_id value, which you won't) > seq TEXT NOT NULL, > length INTEGER NOT NULL > ); > > CREATE TABLE clone_qual( > clone_id SERIAL PRIMARY KEY, > fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE > ?? J: Per my comments above: clone_qual_id SERIAL PRIMARY KEY, clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELETE CASCADE, > qual INTEGER[] NOT NULL, > ); You should NOT use an array to store production data. Array columns are not relational, cannot be properly indexed, and are difficult to query. I use array columns only in temporary tables and buffer tables. Can you describe the data you are storing in Qual so that I can reccommend a suitable subtable structure? > > CREATE TABLE gb_accessions( > clone_id SERIAL PRIMARY KEY, > fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE > ?? J: Same as above. I think you can figure this one out. > accn_no varchar(12) NOT NULL, > gi_no varchar(12) NOT NULL > ); > > Close ? Close. The reasons you want to change the Foriegn Key column to match the name of the key column in the parent table are: 1) clarity, and 2) A useful shortcut in PostgreSQL: "SELECT * FROM clone JOIN clone_fasta USING (clone_id)" that will save you some typing. More comments below: > Two sets of numbers are used in tracking. clone_id identifies an > individual bacterial colony/clone. contig_id identifies a sequence > assembled from multiple clones. > e.g. > clone A: GATTCTCTCTCTCGACGAGC > clone B: GACGAGCATTATCTACGCATACTACTCATA > contig (A+B): GATTCTCTCTCTCGACGAGCATTATCTACGCATACTACTCATA > > > 1. clone id: 894001A01.x1: > 894: sequencing project (3 or 4 digit number) > 001: plate clone resides in (96 well microtiter plates) > A: plate row (A-H) > 01: plate column (1-12) > x: read, which primer was used to sequence the clone (x or y) > 1: ver, some clones are sequenced more than once (1,2...n) > > 2. contig id: 20010822.123.1 > 20010822: date contig was assembled > 123: contig_no > 1: ver, several possibilities may result > > The assembled sequence of a CONTIG represents that of a gene. > > The TABLES I have at present are: > > clone_fasta: > clone_id (894001A01.x1) in the 6 columns > seq TEXT: DNA sequence > length INTEGER: # of nucleotides/bases in the sequence > > clone_qual: > clone_id (894001A01.x1) in the 6 columns > qual INTEGER[]: array of quality values for each base, (1-2 > digit integer {1 9 9 45 38 44 ...} > several columns of descriptive info > > gb_accessions: > clone_id (894001A01.x1) in the 6 columns > accn_no (BG842967): each clone sequence is submitted to > GenBank, a repository of public sequences > gi_no (14224151): GenBank assigns each clone an accession > number (accn_no) and a gi_no. > > contig: contig_id (20010822.123.1) stored in 3 columns > assembly_date: 20010822, date Keep in mind that this can be stored as a real DATE, then formatted to the above on query! > contig_no: 123 > ver: 1 > seq: assembled sequence from several clones > length: length of assembled sequence > ests: array of clone_ids which were used to generate the > contig {894001A01.x1, 963125H01.y2} A lot of the numbering above could be automated through Postgres functions. > > homolog: Contains putative annotation data, i.e. what gene is it > assembly_date: 20010822, date > contig_no: 123 > ver: 1 > several columns of descriptive info > homolog: tentative description of 'gene' TEXT > > library: Info on the different DNA libraries being sequenced > project: sequencing project designation, ie '894', same # > seen in clones, 894001A01.x1. > library: text describing library > details: more detailed description > > TABLES clone_fasta, clone_qual and gb_accessions all contain info > regarding individual bacterial clones derived from a 'library' of > clones. They are interconnected by the clone id (894001A01.x1) > > TABLE library contains info explaining under what conditions clone > 894001A01, and ALL clone prefixed by '894' were made. Also all other > project descriptions (832,833,925,963, 1024,1031...) The > descriptions detail growth conditions used to produce the DNA > library. > > TABLES homolog and contig contain 'meta' info about the sequence and > identity of DNA sequences obtained from assembling related clones. A > contig '20010822.123.1' might be assembled from two clones > (894001A01.x1, 963125H01.y2). You'll want to also construct relational tables. For example, if a contig is assembled from two clones, you'll want a table called contig_clones that relates in this fashion: clones contig_clones contig clone_id -----------clone_id|contig_id----------contig_id clone_id /--clone_id|contig_id clone_id / clone_id------/ > There, now you know all I do, and more so. > There are more tables to be added in the future, but the data you saw > on the web site resides in the above tables, or predicessors to them. That's a great start on a spec document. In the future, if you have to do this again, you'll want to do the above kind of assessment before any actual table design. To finish this off, I'd like some more info: 1. What is the information you want to store in arrays? You should be sotring this information in tables for the reasons I outlined above. 2. For the 3 "clone" tables, is the relationship between them one-to-one or one-to-many? -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Charles, > I get an error: > > chlamy_est=> CREATE TABLE clone_fasta( > chlamy_est(> clone_fasta_id SERIAL PRIMARY KEY, > chlamy_est(> clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON > DELETE CASCADE, > chlamy_est(> seq TEXT NOT NULL, > chlamy_est(> length INTEGER NOT NULL > chlamy_est(> ); > NOTICE: CREATE TABLE will create implicit sequence > 'clone_fasta_clone_fasta_id_seq' for SERIAL column > 'clone_fasta.clone_fasta_id' > NOTICE: CREATE TABLE/PRIMARY KEY will create implicit index > 'clone_fasta_pkey' for table 'clone_fasta' > NOTICE: CREATE TABLE will create implicit trigger(s) for FOREIGN KEY > check(s) > ERROR: Relation 'clone_fasta_clone_fasta_id_seq' already exists > > > Something funny going on w/column names? Nope. You just dropped a previous instance of the clone_fasta table and Postgres does not automatically drop the sequence. You need to manually drop the sequence: DROP SEQUENCE clone_fasta_clone_fasta_id_seq; ... before re-creating the table. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Charles, > Attached is my draft of a schema. Hey, I like this! What software did you use to generate it? I currently lack a diagramming tool that produces JPGs. > I believe I was in error yesterday, when I stated that the > relationship between the clone tables was one to one. It is true > that a clone has 1 and only 1 sequence, but a sequence exists for 1 > or more clones(TABLES clone_fasta and clone_qual). See below. This changes our schema. > I am working my way thru Database Design for Mere Mortals, and > another book, Relational Database Design Clearly Explained by J.L. > Harrington. I would appreciate a 1-paragraph review of the latter book for the bookreviews page when you're done with it. > In practical terms I don't understand how to deal with the FKs. For > example: > > CREATE TABLE clone_fasta( > clone_fasta_id SERIAL PRIMARY KEY, > clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELETE > CASCADE, > seq TEXT NOT NULL, > length INTEGER NOT NULL > ); The problem is that, according to your new diagram, our parent-child relationship is backwards. The clones table should contain FKs to clone-fasta and clone-qual instead of what we have now. If you can give me a better description of where the fasta and qual information comes from, and what order it's entered in, we can tweak the schema to match. -Josh P.S. List readers, I apologize for omitting the diagram but it's 150K. ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Chuck, > >1. Are "Fasta" and "Sequence" the same thing? (further questions > assume > >this to be the case). > > For our purposes, yes. Strictly speaking however, fasta denotes a > particular format the sequence is in: > > >894001A01.x1 > GATCGATCGCTACGTCAGAC > > is fasta formatted sequence, whereas: > > GATCGATCGCTACGTCAGAC > > is sequence. > > In TABLE clone_fasta.seq I store the latter, ie > 'GATCGATCGCTACGTCAGAC'. If it helps, I can name TABLE clone_fasta > TABLE clone_sequence? No, don't rename your tables for my convenience. I was just confused because you were using the word "fasta" in some places and "sequence" in others. > >2. We established in the last e-mail that there is potentially more > than > >one clone related to each clone_fasta and to each clone_qual record, > and > >that no clone has more than one fasta or qual record. Is that still > >true? > > I believe not. Let me answere this with an example, and go from > there. > I will simplify the clone id and not break it down into 6 fields. > > > TABLE clone TABLE clone_fasta > TABLE clone_qual > clone seq > qual > record 1: 894001A01.x1 <------> GATCGATATATA..... > <------> {9 9 9 23 34 45 ...} > > record 2: 894001A01.y1 <------> TTTTTTGATGAT..... > <------> {3 4 6 9 14 34 21 ...} > > record 3: 894001A02.x1 <------> GTTTCACTAGCT..... > <------> {8 5 15 31 24 7 ...} > > > From the above example, which is universally true, I would state > that: > > 1. one and only one clone relates to each clone_fasta and to each > clone_qual. > 2. no clone has more than one fasta or qual record. > > Stated another way, each clone has one and only one fasta(sequence), > and one and only one qual. So, two more questions: 1. Does more than one clone potentially relate to each fasta? Do we care? (i.e. will we ever query the database for "Which clones relate to sequence x?" Or do we receive data like "CLones 1999, 2001, and 2173 have sequence x)?") 2. Does clone_qual properly relate to clones, or to clone_fasta? From your description, I can see things going either way. 3. The contigs: is a contig assembled out of clones, sequences (fasta) or quals? I'm not clear on this. From your description, it seems like a contig might actually represent 1-2 sequences(fastas or qual?), as opposed to 1-2 clones. > >3. In what order does the data arrive for your tables? I.e., is > this an > >accurate order of events: > >(1) Clone data > >(2) Sequence (Fasta?) data > >(3) Qual data > >(4) Contig data > >(5) Library and Genebank data. > >Is this accurate? > > > More accurate to order them as: > > (5a) Library : is updated with each new project > (1),(2),(3) : arrive simultaneously, but would be entered in the > order you listed. > (5b) Genbank : data submitted to Genbank after (1,2,3) are in hand > (4) > (6) blast We'll hash this out eventually! I think we're still struggling with you speaking biologist and me speaking DBA ... -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Chuck, Answer below. > Sorry 'bout that, tried to shoot off the email prior to a meeting. > > The aim is to enter data into TABLE clone_contig. > > My plan was to first load TABLES clone and contig: > > CREATE TABLE clone ( > clone_id SERIAL PRIMARY KEY, > project INTEGER NOT NULL, > plate CHAR(3) NOT NULL, > p_row CHAR(1) NOT NULL, > p_column CHAR(2) NOT NULL, > read CHAR(1) NOT NULL, > ver INTEGER NOT NULL, > seq TEXT NOT NULL, > qual TEXT NOT NULL, > UNIQUE (project,plate,p_row,p_column,read,ver) > ); > > > CREATE TABLE contig ( > contig_id SERIAL PRIMARY KEY, > assembly_date date NOT NULL, > contig_no integer NOT NULL, > ver integer NOT NULL, > length INTEGER NOT NULL, > seq TEXT NOT NULL, > UNIQUE (assembly_date,contig_no,ver) > ); > > > > Next, to load TABLE clone_contig I was going to do the following: > > > CREATE TABLE clone_contig( > clone_id INTEGER REFERENCES clone(clone_id) ON DELETE CASCADE, > contig_id INTEGER REFERENCES contig(contig_id) ON DELETE CASCADE, > UNIQUE(clone_id,contig_id) > ); > > Using Perl, I would build a hash of array containing an @clones for > each contig: > > @{$HoC{$contig}{clones}} > > For each key(contig) I was going to query the DB for 'contig > contig_contig_id_seq' from TABLE contig, AND 'clone_clone_id_seq' > for each clone in the array @{$HoC{$contig}{clones}} from TABLE > clone. > > > So, if contig '20010822.123.1' was assembled from 2 clones > (894001A01.x1, 963012H10.x1) > (note: clone and contigs are represented in full form, not as > parceled out in tables above for simplicity sake) > > > TABLE clone TABLE contig > clone_clone_id_seq contig_contig_id_seq > 167756 894001A01.x1 <-----> > 37238238 20010822.123.1 > 21389 963012H10.x1 <-----> > 37238238 20010822.123.1 > > SELECT clone_clone_id_seq FROM clone where clone='894001A01.x1'; > SELECT clone_clone_id_seq FROM clone where clone='963012H10.x1'; > > SELECT contig_contig_id_seq FROM contig where > contige='20010822.123.1'; Here's your only problem. You don't select "clone_clone_id_seq". That's the name of a Sequence. You want to select your key column, which is Clone_ID or Contig_ID. If you actually wanted to get the ID of the last Contig you inserted, for example, you could use the CURRVAL('contig_contig_id_seq') function. See the following page for more info on how sequences work: http://www.postgresql.org/idocs/index.php?sql-createsequence.html -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Chuck, > One last question before I leave you alone. > > I have test data in all 3 tables: > > clone <---> clone_contig <----> contig > > > > In the case where contig '20010822.123.1' was assembled from 2 clones > (894001A01.x1, 963012H10.x1) > > TABLE clone TABLE clone_contig TABLE contig > clone_id clone clone_id contig_id > contig_id contig > 167756 894001A01.x1 167756 37238238 > 37238238 20010822.123.1 > 21389 963012H10.x1 21389 37238238 > > > Now, if I want to do a query to recover the sequence of all clones > which were used to assemble contig 20010822.123.1 is the relational > table clone_contig invoked automatically? > > ie can I do a query: > > SELECT clone.seq FROM contig,clone WHERE contig='20010822.123.1'; > > More generally, how does one make use of the relations set up in a > relational table? Well, those relations are really just complex constraints. They only shorten your queries in systems that support the NATURAL JOIN properly; however, this JOIN implementation is so unevenly supported in various databases that it's a better idea not to use it. Thus, in your query: SELECT clone.seq FROM clone JOIN contig_clones USING (clone_id) JOIN contig USING (contig_id) WHERE contig.contig = '20010822.123.1'* *= remember that you are actually storing the contig value in 3 fields, so you really need to compare against the 3 fields. The REFERENCES constraints you put in the table definition are *constraints*, meaning that they restrict what can go into the table. They do not help you join the two tables, except for maybe making query execution a little faster. What they do do is make sure that every single record in clone_contigs has corresponding records in clones and contigs. That way, you don't have to worry about testing for "orphan records" because there can't be any. Finally, you need to index all joined fields for performance reasons. Assuming that you already have Primary Keys on clones.clone_id, contigs.contig_id, and (contig_clones.clone_id, contig_clones.contig_id), then the only index you are missing is a seperate index on contig_clones.contig_id (dual-column indexes are seldom of much use on JOINing the second column). CREATE INDEX contig_clones_contig_idx ON contig_clones (contig_id); -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Chuck, > I am considering one last relational table to relate contig::blast. > > These two tables are related as: > a contig has 0,1 or more blast hits (0 only if contig not yet run > thru blast) > a blast hit exists for 0, 1 or more contigs (0 only when no data > loaded) > > As I have it currently, TABLE blast contains the FK contig_id to > track which contig the blast record refers to. Yes, but it doesn't fit your statements above, which clearly imply a many-to-many relationship. Drop the FK. > > Is it appropriate in this case to make a TABLE contig_blast, to me it > appears so....? This is correct. However, your SQL is wrong. See my corrections below. > Now that I think about it, it may also be useful to make a table > clone_blast. If one wanted to do a query for which blast results > pertained to a given clone. > > clone --- (assembled into) ---> contig ---> blast result > | | > ------------------------------------------------- This, on the other hand, is unnecessary and even problematic. We already have the table clone_contig. If we add the relationship table contig_blast, then we can select all clones in a blast by selecting: blast JOIN contig_blast JOIN contigs JOIN clone_contigs JOIN clones Adding a seperate relationship table to maintain this information would be redundant, and also troublesome because eventually clone_blast would be bound to get out of synch with the relationship above. > CREATE TABLE contig ( > contig_id SERIAL PRIMARY KEY, > assembly_date date NOT NULL, > contig_no integer NOT NULL, > ver integer NOT NULL, > length INTEGER NOT NULL, > seq TEXT NOT NULL, > UNIQUE (assembly_date,contig_no,ver) > ); > > > CREATE TABLE blast ( > blast_id SERIAL PRIMARY KEY, ###> contig_id INTEGER REFERENCES contig(contig_id) ON DELETE CASCADE,### Delete this column. It does not fit the data model. > score integer, > homolog text > ); > > CREATE TABLE contig_blast ( > blast_id INTEGER REFERENCES blast(blast_id) ON DELETE CASCADE, > contig_id INTEGER REFERENCES contig(contig_id) ON DELETE CASCADE, > UNIQUE(clone_id,contig_id) uh-uh: UNIQUE (blast_id, contig_id) > ); > CREATE TABLE clone_blast ( > blast_id INTEGER REFERENCES blast(blast_id) ON DELETE CASCADE, > clone_id INTEGER REFERENCES clone(clone_id) ON DELETE CASCADE, > UNIQUE(clone_id,contig_id) > ); Drop this table. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Chuck, > With these two tables, entering data gets a tad more complicated. Yup. Time to write a GUI. > CREATE TABLE blast ( > blast_id SERIAL PRIMARY KEY, > others.., > score integer, > homolog text > ); > > > CREATE TABLE contig_blast ( > blast_id INTEGER REFERENCES blast(blast_id) ON DELETE CASCADE, > contig_id INTEGER REFERENCES contig(contig_id) ON DELETE CASCADE, > UNIQUE(blast_id,contig_id) > ); > > w/ all contig data loaded into TABLE contig > > 1st: lookup contig_id for blast data to be entered. > 2nd: INSERT INTO blast(col1..coln) VALUES(A...N) > 3rd: after INSERT -> get blast_id for data just INSERTED > $blast_id = currval('blast_id') > > 4th: insert contig_id & blast_id into TABLE contig_blast 100% correct. > > > I can't seem to recover the value of blast_id form a Perl script: > > for a test TABLE test: > > CREATE TABLE test ( > blast_id SERIAL PRIMARY KEY, > homolog text > ); > > > ***************** > > #!/usr/bin/perl -w > use strict; > use Pg; > my ($conn,$result,$ID); > > my $CONNECT = $DBNAME . $DBHOST . $LOGIN . $PWD; > > $conn = Pg::connectdb($CONNECT); > die $conn->errorMessage unless PGRES_CONNECTION_OK eq $conn->status; > > > $conn->exec("INSERT INTO test(homolog) VALUES ('row5');"); > > $ID = $conn->exec("currval('blastx_id');"); Correction: $ID = $conn->exec("SELECT currval('blastx_id');"); And are you sure the sequence is named "Blastx_id"? Normal naming would be: 'blast_blast_id_seq' Fnally, if this system ever goes mulit-user, you will want to wrap the above in a transaction to prevent simultaneous updates from yielding a deceptive CURRVAL result. > > print "ID: $ID\n"; > -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco