Обсуждение: blanking out repeated columns in rows

Поиск
Список
Период
Сортировка

blanking out repeated columns in rows

От
Sean McCorkle
Дата:
Here's one for all you SQL jockeys out there (hope this is a good group
to post this query)

I'm currently (happily) using PostgreSQL for doing DNA expression analysis,
and I often end up with joins where columns on the left side get repeated
to matching differing 

For example, a select output ends up looking like this:
(from psql.  i've removed the - and | lines)
   tag             gb_id      pos     descrip

ACTATTTTTAGAGACCC NM_032685.1  307 hypothetical protein MGC13005 (MGC13005),
AGAAAAAAAAAAAAAAA NM_019110.2   27 hypothetical protein P1 p373c6 (P1P373C6)
AGAAAAAAAAAAAAAAA NM_019110.2   27 hypothetical protein P1 p373c6 (P1P373C6)
AGCCACCACGCCTGGTC NM_003693.1  260 acetyl LDL receptor; SREC=scavenger 
AGCCACCGCGCCCGGCC NM_007081.1  486 RAB, member of RAS oncogene family-like 2B
AGCCACCGCGCCCGGCC NM_013412.1  486 RAB, member of RAS oncogene family-like 2A
AGCCACCGCGCCTGGCC NM_000651.2  229 complement component (3b/4b) receptor 1,
AGCCACCGCGCCTGGCC NM_000573.2  229 complement component (3b/4b) receptor 1,
ATCAAAAAAAAAAAAAA NM_079421.1   25 cyclin-dependent kinase inhibitor 2D


but I (and my colleagues) would much rather see this, which draws 
attention to the duplicates (or multiples) in the left column.
   tag             gb_id      pos     descrip

ACTATTTTTAGAGACCC NM_032685.1  307 hypothetical protein MGC13005 (MGC13005),
AGAAAAAAAAAAAAAAA NM_019110.2   27 hypothetical protein P1 p373c6 (P1P373C6)                 NM_019110.2   27
hypotheticalprotein P1 p373c6 (P1P373C6)
 
AGCCACCACGCCTGGTC NM_003693.1  260 acetyl LDL receptor; SREC=scavenger 
AGCCACCGCGCCCGGCC NM_007081.1  486 RAB, member of RAS oncogene family-like 2B                 NM_013412.1  486 RAB,
memberof RAS oncogene family-like 2A
 
AGCCACCGCGCCTGGCC NM_000651.2  229 complement component (3b/4b) receptor 1,                 NM_000573.2  229 complement
component(3b/4b) receptor 1,
 
ATCAAAAAAAAAAAAAA NM_079421.1   25 cyclin-dependent kinase inhibitor 2D


Is there any slick way to accomplish this in SQL?  Or is there a switch
or something I've missed in postgresql?  Currently I'm postprocessing
the output with a perl script, but I'd prefer the SQL to generate
it directly, if possible.

Thanks, folks!
=sean

-------------------------------------------------------------------------------
Sean R. McCorkle           mccorkle@bnl.gov   | Haven't you heard, that aside
Genome Group               (631) 344-4270 ph  | from the details, there's
Biology Department         (631) 344-3407 fax | nothing further to be found?
Brookhaven National Laboratory                |    --Master Dogen's 300 Koan
Upton, New York  11973                        |      Shobogenzo (commentary)


Re: blanking out repeated columns in rows

От
Masaru Sugawara
Дата:
On Tue, 7 May 2002 11:03:50 -0400
Sean McCorkle <mccorkle@avenger.bio.bnl.gov> wrote:

> but I (and my colleagues) would much rather see this, which draws 
> attention to the duplicates (or multiples) in the left column.
> 
>     tag             gb_id      pos     descrip
> 
> ACTATTTTTAGAGACCC NM_032685.1  307 hypothetical protein MGC13005 (MGC13005),
> AGAAAAAAAAAAAAAAA NM_019110.2   27 hypothetical protein P1 p373c6 (P1P373C6)
>                   NM_019110.2   27 hypothetical protein P1 p373c6 (P1P373C6)
> AGCCACCACGCCTGGTC NM_003693.1  260 acetyl LDL receptor; SREC=scavenger 
> AGCCACCGCGCCCGGCC NM_007081.1  486 RAB, member of RAS oncogene family-like 2B
>                   NM_013412.1  486 RAB, member of RAS oncogene family-like 2A
> AGCCACCGCGCCTGGCC NM_000651.2  229 complement component (3b/4b) receptor 1,
>                   NM_000573.2  229 complement component (3b/4b) receptor 1,
> ATCAAAAAAAAAAAAAA NM_079421.1   25 cyclin-dependent kinase inhibitor 2D


How about this method of appending sequences as unique indices?


First:
create temp sequence dna_rownum1;
create temp sequence dna_rownum2;

Secand:
SELECT setval('dna_rownum1', 1, false);   -- (1)
SELECT setval('dna_rownum2', 1, false);   -- (2)
SELECT (CASE WHEN t1.idx = t3.idx THEN t1.tag ELSE NULL END) AS tag,              t1.gb_id,              t1.pos,
     t1.descrip FROM  (SELECT *, nextval('dna_rownum1') AS idx                 FROM dna                ORDER BY idx
        ) AS t1,             (SELECT t2.tag, MIN(t2.idx) AS idx                 FROM (SELECT tag,
nextval('dna_rownum2')AS idx                                     FROM dna ORDER BY idx) AS t2               GROUP BY
t2.tag            ) AS t3WHERE t1.tag = t3.tagORDER BY t1.tag, t1.idx;  -- (3)
 

Note: (1) and (2) need to be executed at the same time, but (3) doesn't.        And CREATE TEMP SEQUENCE is practicable
in7.2 or later.
 


Regards,
Masaru Sugawara




Re: blanking out repeated columns in rows

От
Masaru Sugawara
Дата:
On Thu, 09 May 2002 01:51:59 +0900
Masaru Sugawara <rk73@sea.plala.or.jp> wrote:

> 
> How about this method of appending sequences as unique indices?
> 
> 
> First:
> create temp sequence dna_rownum1;
> create temp sequence dna_rownum2;
> 
> Secand:
> SELECT setval('dna_rownum1', 1, false);   -- (1)
> SELECT setval('dna_rownum2', 1, false);   -- (2)
> SELECT (CASE WHEN t1.idx = t3.idx THEN t1.tag ELSE NULL END) AS tag,


I'm sorry. This one is simpler.

SELECT (CASE WHEN t1.oid = t2.idx THEN t1.tag ELSE NULL END) AS tag,              t1.gb_id,              t1.pos,
     t1.descrip FROM  dna AS t1,             (SELECT tag, MIN(oid) AS idx                 FROM dna               GROUP
BYtag             ) AS t2WHERE t1.tag = t2.tagORDER BY t1.tag, t1.oid;  
 



Regards,
Masaru Sugawara