Discussion: Transposing rows and columns
I'm working with some people who live and breathe Excel. I need to be able to move data back and forth between formats which make sense for Excel and for PostgreSQL. In some cases, this is just to accommodate what people are used to. In other cases, like statistical clustering, it's something that really has to be done.

Here is a simplified example:

I'm given data in Excel with one sheet each for a bunch of experiments. In each sheet, there are rows for different drugs at different doses and columns for each subject. The cells contain the response data. I wrote a Perl script which automates the process of extracting that data into a CSV file which can be imported into a table like the following:

  create table results(
      expt_no int references experiments(id),
      subject int references subjects(id),
      drug text references drugs(name),
      dose numeric,
      response numeric
  )

Now, suppose I do some computation on the results in the database and want to export it back out in the same kind of format that I received it in (drugs and doses in rows and subjects in columns). One method would be to use Perl. I could use DBD::Pg, loop through a bunch of queries to build a two-dimensional array, and then spit that back out, but is there a good way to do this just in SQL? Is there a better way than creating a temporary table for each subject and then joining all the temp tables?

-Aram
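[For context, the loop-and-2D-array approach described above can be sketched outside Perl. This is a minimal illustration in Python of pivoting long-format (drug, dose, subject, response) rows back into a wide, Excel-style layout; the sample rows and the `subj_N` column names are invented for the example, not taken from the real data.]

```python
# Sketch: pivot long-format (drug, dose, subject, response) rows into a
# wide table with one column per subject, as the Excel sheets expect.
# The sample rows stand in for the output of "SELECT ... FROM results".
rows = [
    ("aspirin", 1.0, 1, 0.5),
    ("aspirin", 1.0, 2, 0.7),
    ("aspirin", 2.0, 1, 0.9),
    ("aspirin", 2.0, 2, 1.1),
]

# Distinct subjects become the output columns.
subjects = sorted({subject for _, _, subject, _ in rows})

# Collect responses keyed by (drug, dose), one slot per subject.
wide = {}
for drug, dose, subject, response in rows:
    wide.setdefault((drug, dose), {})[subject] = response

# Emit CSV: a header row, then one row per drug/dose combination,
# leaving a cell blank when a subject has no measurement.
print("drug,dose," + ",".join(f"subj_{s}" for s in subjects))
for (drug, dose), responses in sorted(wide.items()):
    cells = [str(responses.get(s, "")) for s in subjects]
    print(f"{drug},{dose}," + ",".join(cells))
```

The same shape of loop works against a live cursor (DBD::Pg in Perl, or psycopg2 in Python) instead of the in-memory sample list.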
> I'm working with some people who live and breathe Excel. I need to be able
> to move data back and forth between formats which make sense for Excel and
> for PostgreSQL. In some cases, this is just to accommodate what people are
> used to. In other cases, like statistical clustering, it's something that
> really has to be done.
>
> Here is a simplified example:
>
> I'm given data in Excel with one sheet each for a bunch of experiments. In
> each sheet, there are rows for different drugs at different doses and
> columns for each subject. The cells contain the response data. I wrote a
> Perl script which automates the process of extracting that data into a CSV
> file which can be imported into a table like the following:
>
>   create table results(
>       expt_no int references experiments(id),
>       subject int references subjects(id),
>       drug text references drugs(name),
>       dose numeric,
>       response numeric
>   )
>
> Now, suppose I do some computation on the results in the database and want
> to export it back out in the same kind of format that I received it in
> (drugs and doses in rows and subjects in columns). One method would be to
> use Perl. I could use DBD::Pg, loop through a bunch of queries to build a
> two-dimensional array, and then spit that back out, but is there a good way
> to do this just in SQL? Is there a better way than creating a temporary
> table for each subject and then joining all the temp tables?

You may want to look into the tablefunc contrib module. It contains a crosstab function which will transpose rows and columns in the result. This may be slow, though.

HTH
On Thu, Sep 16, 2010 at 11:42:21AM -0400, Aram Fingal wrote:
> create table results(
>     expt_no int references experiments(id),
>     subject int references subjects(id),
>     drug text references drugs(name),
>     dose numeric,
>     response numeric
> )

What's the primary key? I presume it's (expt_no, subject, drug, dose).

> Now, suppose I do some computation on the results in the database and
> want to export it back out to the same kind of format that I received
> it (drugs and doses in rows and subjects in columns.)

Have you tried setting up an ODBC data source to the database and using the PivotTable functionality in Excel to do the transformation?

If you want to do the transformation in SQL, you'd be writing something like:

  SELECT drug, dose,
    MIN(CASE subject WHEN 1 THEN response END) AS resp_1,
    MIN(CASE subject WHEN 2 THEN response END) AS resp_2,
    MIN(CASE subject WHEN 3 THEN response END) AS resp_3
  FROM results
  WHERE expt_no = 1
    AND subject IN (1,2,3)
  GROUP BY drug, dose
  ORDER BY drug, dose;

Or you can use the tablefunc contrib module as suggested by Uwe. I prefer doing it by hand as you get more options, but it can be quite tedious if you've got lots of columns you're trying to deal with.

If I've got my assumption about the primary key wrong, then my code, as well as tablefunc, will probably both fail to do the "right thing".

--
Sam  http://samason.me.uk/
On Sep 16, 2010, at 12:28 PM, Sam Mason wrote:

> On Thu, Sep 16, 2010 at 11:42:21AM -0400, Aram Fingal wrote:
>> create table results(
>>     expt_no int references experiments(id),
>>     subject int references subjects(id),
>>     drug text references drugs(name),
>>     dose numeric,
>>     response numeric
>> )
>
> What's the primary key? I presume it's (expt_no, subject, drug, dose).

Yes, that's correct. I copied and simplified from the actual table, which has a lot more in the table definition.

>> Now, suppose I do some computation on the results in the database and
>> want to export it back out to the same kind of format that I received
>> it (drugs and doses in rows and subjects in columns.)
>
> Have you tried setting up an ODBC data source to the database and using
> the PivotTable functionality in Excel to do the transformation?

I have thought about that, but later on, when we do the full-sized experiments, there will be too many rows for Excel to handle.

> If you want to do the transformation in SQL, you'd be writing something
> like:
>
>   SELECT drug, dose,
>     MIN(CASE subject WHEN 1 THEN response END) AS resp_1,
>     MIN(CASE subject WHEN 2 THEN response END) AS resp_2,
>     MIN(CASE subject WHEN 3 THEN response END) AS resp_3
>   FROM results
>   WHERE expt_no = 1
>     AND subject IN (1,2,3)
>   GROUP BY drug, dose
>   ORDER BY drug, dose;

That's a good trick to know, but I just tried it and found that it begins to get complicated with the actual data. It also returns a separate row for each drug/dose/subject combination, which isn't exactly what I want. Each row has one column with a value and the rest of the columns in that row are all <null>.

> Or you can use the tablefunc contrib module as suggested by Uwe. I
> prefer doing it by hand as you get more options, but it can be quite
> tedious if you've got lots of columns you're trying to deal with.

I guess I need to read up on pivot tables. At first glance, this looks like the best solution.

-Aram
On 09/16/10 10:44 AM, Aram Fingal wrote:
> I have thought about that but later on, when we do the full sized
> experiments, there will be too many rows for Excel to handle.

If you insist on this transposing, won't that mean you'll end up with more columns than SQL can/should handle?
On Sep 16, 2010, at 4:37 PM, John R Pierce wrote:

> On 09/16/10 10:44 AM, Aram Fingal wrote:
>> I have thought about that but later on, when we do the full sized
>> experiments, there will be too many rows for Excel to handle.
>
> if you insist on this transposing, won't that mean you'll end up with
> more columns than SQL can/should handle?
No. The organization in Excel makes much more efficient use of the total number of cells, but it's not much good for querying. When I transpose it for use in the database (or pivot it in Excel), it actually multiplies the number of rows. So, if the version with separate columns for each subject has X rows and Y subject columns, you get X * Y rows in the database version. For example, if there are 100 subjects and 1000 drug/dose combinations, then the Excel version has 102 columns (drug, dose, and a column for each subject) and 1000 rows. The database (or pivoted) version would have 4 columns (subject, drug, dose, and response) and 100,000 rows. Excel maxes out at 65,535 rows; PostgreSQL has no such limit.
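[The cell arithmetic above can be checked directly; the numbers below come from the example in the message, not from real data:]

```python
subjects = 100
drug_dose_combos = 1000

# Wide (Excel) layout: drug + dose + one column per subject.
excel_cols = 2 + subjects              # 102 columns
excel_rows = drug_dose_combos          # 1000 rows

# Long (database) layout: one row per (subject, drug, dose) measurement.
db_cols = 4                            # subject, drug, dose, response
db_rows = subjects * drug_dose_combos  # 100,000 rows

print(excel_cols, excel_rows, db_cols, db_rows)
```

So transposing multiplies rows (X * Y), not columns, which is why the database side doesn't run into a column limit.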
The subjects, by the way, are not people, they are cancer cell tissue cultures in 384-well plates, handled by robots. That's how we can do so many drug/dose combinations. We'll do even more in the future.
-Aram
On Thu, Sep 16, 2010 at 01:44:30PM -0400, Aram Fingal wrote:
> On Sep 16, 2010, at 12:28 PM, Sam Mason wrote:
>> If you want to do the transformation in SQL, you'd be writing something
>> like:
>>
>>   SELECT drug, dose,
>>     MIN(CASE subject WHEN 1 THEN response END) AS resp_1,
>>     MIN(CASE subject WHEN 2 THEN response END) AS resp_2,
>>     MIN(CASE subject WHEN 3 THEN response END) AS resp_3
>>   FROM results
>>   WHERE expt_no = 1
>>     AND subject IN (1,2,3)
>>   GROUP BY drug, dose
>>   ORDER BY drug, dose;
>
> That's a good trick to know but I just tried it and found that it
> begins to get complicated with the actual data. It also returns
> a separate row for each drug/dose/subject combination which isn't
> exactly what I want. Each row has one column with a value and the
> rest of the columns in that row are all <null>.

It shouldn't give a row per subject, as the subject isn't in the GROUP BY list. Either that, or rounding problems mean that the drug/dose combinations are effectively unique per subject. You need to find the set of columns that gives you the "right" number of rows; the same issue applies to pivot tables in Excel.

>> Or you can use the tablefunc contrib module as suggested by Uwe. I
>> prefer doing it by hand as you get more options, but it can be quite
>> tedious if you've got lots of columns you're trying to deal with.
>
> I guess I need to read up on pivot tables. At first glance, this
> looks like the best solution.

Maybe pull them out by experiment?

--
Sam  http://samason.me.uk/
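[Sam's diagnosis can be sketched outside SQL: the pivot collapses to one row per (drug, dose) only when that key actually repeats across subjects; any per-subject noise in the dose value makes each group a singleton, producing exactly the mostly-NULL rows Aram describes. A toy Python illustration of the grouping semantics — the drug names, doses, and responses are invented:]

```python
def pivot(rows, subject_ids):
    """Emulate:  SELECT drug, dose,
                   MIN(CASE subject WHEN s THEN response END) ...
                 GROUP BY drug, dose
    -- one output row per distinct (drug, dose) key."""
    groups = {}
    for drug, dose, subject, response in rows:
        groups.setdefault((drug, dose), {})[subject] = response
    return {
        key: tuple(resp.get(s) for s in subject_ids)  # None stands in for NULL
        for key, resp in groups.items()
    }

# Clean data: dose is identical across subjects -> one fully filled row.
clean = [("aspirin", 1.0, 1, 0.5),
         ("aspirin", 1.0, 2, 0.7),
         ("aspirin", 1.0, 3, 0.6)]
assert pivot(clean, (1, 2, 3)) == {("aspirin", 1.0): (0.5, 0.7, 0.6)}

# Noisy data: a rounding difference makes (drug, dose) unique per subject,
# so each output row carries one value and NULLs elsewhere.
noisy = [("aspirin", 1.0, 1, 0.5),
         ("aspirin", 1.00001, 2, 0.7)]
assert len(pivot(noisy, (1, 2, 3))) == 2
```

Rounding or casting the dose to a fixed precision before grouping restores the one-row-per-combination result.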
On 09/16/2010 05:26 PM, Aram Fingal wrote:
> On Sep 16, 2010, at 4:37 PM, John R Pierce wrote:
>> On 09/16/10 10:44 AM, Aram Fingal wrote:
>>> I have thought about that but later on, when we do the full sized
>>> experiments, there will be too many rows for Excel to handle.
>>
>> if you insist on this transposing, won't that mean you'll end up with
>> more columns than SQL can/should handle?
>
> No. The organization in Excel makes much more efficient use of the total
> number of cells, but it's not much good for querying. When I transpose it
> for use in the database (or pivot it in Excel), it actually multiplies the
> number of rows. So, if the version with separate columns for each
> subject has X rows and Y subject columns, you get X * Y rows in the
> database version. For example, if there are 100 subjects and 1000
> drug/dose combinations, then the Excel version has 102 columns (drug,
> dose, and a column for each subject) and 1000 rows. The database (or
> pivoted) version would have 4 columns (subject, drug, dose, and response)
> and 100,000 rows. Excel maxes out at 65,535 rows; PostgreSQL has no such
> limit.

I think Excel 2007 can handle more than 65,535 rows.

> The subjects, by the way, are not people, they are cancer cell tissue
> cultures in 384-well plates, handled by robots. That's how we can do so
> many drug/dose combinations. We'll do even more in the future.
>
> -Aram

--
Stephen Clark
NetWolves
Sr. Software Engineer III
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark@netwolves.com
www.netwolves.com
On Sep 17, 2010, at 9:00 AM, Steve Clark wrote:
> I think Excel 2007 can handle more than 65,535 rows.

You may be right. I'm actually using NeoOffice (a Mac-enhanced version of OpenOffice), and that can handle something like 1,048,000 rows. I wouldn't be surprised if newer versions of Excel can do the same. The real issue is querying.

-Aram