Discussion: transposing data for a view
I have a generalized table:

scanid | region | volume
-------------------------
1        A        34.4
1        B        32.1
1        C        29.1
2        A        32.4
2        B        33.2
2        C        35.6
.
.
.

I want to create a flattened out view that looks like the following:

scanid | A_volume | B_volume | C_volume
----------------------------------------
1        34.4       32.1       29.1
2        32.4       33.2       35.6
.
.
.

How do I correctly/efficiently construct a psql query to pivot/transpose the data? I am using PostgreSQL version 7.0.x

thanks,
Jeremy
----- Original Message -----
From: "H Jeremy Bockholt" <jeremy@wundt.psychiatry.uiowa.edu>
To: <pgsql-sql@postgresql.org>
Sent: Tuesday, October 30, 2001 7:36 PM
Subject: [SQL] transposing data for a view

> I have a generalized table:
>
> scanid | region | volume
> -------------------------
> 1        A        34.4
> 1        B        32.1
> 1        C        29.1
> 2        A        32.4
> 2        B        33.2
> 2        C        35.6
>
> I want to create a flattened out view that looks like the following:
>
> scanid | A_volume | B_volume | C_volume
> ----------------------------------------
> 1        34.4       32.1       29.1
> 2        32.4       33.2       35.6
>
> How do I correctly/efficiently construct a psql query to
> pivot/transpose the data? I am using postgreSQL version 7.0.x

SELECT A.scanid, A.volume AS A_volume, B.volume AS B_volume, C.volume AS C_volume
FROM table A
JOIN table B ON (A.scanid = B.scanid)
JOIN table C ON (B.scanid = C.scanid)
WHERE A.region = 'A' AND B.region = 'B' AND C.region = 'C'

- James
Jeremy,

First, to do a pivot table, you have to be using Postgres 7.1.x. 7.0.x will not do it. So upgrade now.

> I want to create a flattened out view that looks like the following:
>
> scanid | A_volume | B_volume | C_volume
> ----------------------------------------
> 1        34.4       32.1       29.1
> 2        32.4       33.2       35.6

There are two approaches, the simple approach and the complex. The simple approach requires you to know in advance of building the view all of the possible values for your category column. The complex approach, which is dynamic, requires a rather sophisticated function (which I will write eventually, really!) so we won't go into it here.

The simple approach is to create each column as a sub-select in the FROM clause of your statement. So, per the example above:

SELECT scanid, A_volume, B_volume, C_volume
FROM (SELECT scanid FROM volumes GROUP BY scanid) scan
LEFT OUTER JOIN
  (SELECT scanid, volume as A_volume FROM volumes WHERE region = 'A') av
  ON scan.scanid = av.scanid LEFT OUTER JOIN
  (SELECT scanid, volume as B_volume FROM volumes WHERE region = 'B') bv
  ON scan.scanid = bv.scanid LEFT OUTER JOIN
  (SELECT scanid, volume as C_volume FROM volumes WHERE region = 'C') cv
  ON scan.scanid = cv.scanid
ORDER BY scanid;

This approach can be adapted to include aggregates and the like.

-Josh Berkus

______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology      josh@agliodbs.com
and data management solutions        (415) 565-7293
for law firms, small businesses      fax 621-2533
and non-profit organizations.        San Francisco
I think this might do the trick...
CREATE FUNCTION hori_view() RETURNS BOOLEAN AS '
DECLARE
  view_select TEXT;
  view_from TEXT;
  view_where TEXT;
  column_name TEXT;
  last_column_name TEXT;
  g_row generalized_table%ROWTYPE;
BEGIN
  SELECT region INTO column_name
    FROM generalized_table ORDER BY region LIMIT 1;
  view_select := ''SELECT '' || column_name || ''.scanid, '' ||
    column_name || ''.volume AS "'' || column_name || ''_volume"'';
  view_from := '' FROM generalized_table '' || column_name;
  view_where := '' WHERE '' || column_name ||
    ''.region = '''''' || column_name || '''''''';
  last_column_name := column_name;
  FOR g_row IN SELECT DISTINCT ON (region) *
    FROM generalized_table ORDER BY region OFFSET 1
  LOOP
    view_select := view_select || '', '' || g_row.region ||
      ''.volume AS "'' || g_row.region || ''_volume"'';
    view_from := view_from || '' JOIN generalized_table '' ||
      g_row.region || '' ON ('' || last_column_name ||
      ''.scanid = '' || g_row.region || ''.scanid)'';
    view_where := view_where || '' AND '' || g_row.region ||
      ''.region = '''''' || g_row.region || '''''''';
    last_column_name := g_row.region;
  END LOOP;
  EXECUTE ''CREATE VIEW generalized_view AS '' || view_select ||
    view_from || view_where;
  RETURN TRUE;
END;
' LANGUAGE 'plpgsql';
SELECT hori_view();
SELECT * FROM generalized_view;
Ok, it may not be pretty but it works,
Regards,
Aasmund.
On Wed, 31 Oct 2001 12:42:10 -0800, "Josh Berkus" <josh@agliodbs.com> wrote:
> Jeremy,
>
> First, to do a pivot table, you have to be using Postgres 7.1.x. 7.0.x
> will not do it. So upgrade now.
>
>
> There are two approaches, the simple approach and the complex. The
> simple approach requires you to know in advance of building the view all
> of the possible values for your category column. The complex approach,
> which is dynamic, requires a rather sophisticated function (which I will
> write eventually, really!) so we won't go into it here.
>
> The simple approach is to create each column as a sub-select in the FROM
> clause of your statement. So, per the example above:
>
> SELECT scanid, A_volume, B_volume, C_volume
> FROM (SELECT scanid FROM volumes GROUP BY scanid) scan
> LEFT OUTER JOIN
> (SELECT scanid, volume as A_volume FROM volumes WHERE region = 'A') av
> ON scan.scanid = av.scanid LEFT OUTER JOIN
> (SELECT scanid, volume as B_volume FROM volumes WHERE region = 'B') bv
> ON scan.scanid = bv.scanid LEFT OUTER JOIN
> (SELECT scanid, volume as C_volume FROM volumes WHERE region = 'C') cv
> ON scan.scanid = cv.scanid
> ORDER BY scanid;
>
> This approach can be adapted to include aggregates and the like.
>
> -Josh Berkus
>
> ______AGLIO DATABASE SOLUTIONS___________________________
> Josh Berkus
> Complete information technology josh@agliodbs.com
> and data management solutions (415) 565-7293
> for law firms, small businesses fax 621-2533
> and non-profit organizations. San Francisco
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
Aasmund Midttun Godal
aasmund@godal.com - http://www.godal.com/
+47 40 45 20 46
Hi,

I see how your idea could work; the only thing I would change would be that region needs to be scanid on the first line. However, I'm using PostgreSQL v7.03, so I can not do subqueries within the FROM clause.

Other than creating a temporary table, is there a way to restructure this solution to get around this limitation?

thanks,
jeremy

Arian Prins <prinsarian@zonnet.nl> wrote in message news:<3BDFAD88.6BB058DA@zonnet.nl>...
> H Jeremy Bockholt schreef:
>
> > I have a generalized table:
> >
> > scanid | region | volume
> > -------------------------
> > 1        A        34.4
> > 1        B        32.1
> > 1        C        29.1
> > 2        A        32.4
> > 2        B        33.2
> > 2        C        35.6
> >
> > I want to create a flattened out view that looks like the following:
> >
> > scanid | A_volume | B_volume | C_volume
> > ----------------------------------------
> > 1        34.4       32.1       29.1
> > 2        32.4       33.2       35.6
> >
> > How do I correctly/efficiently construct a psql query to
> > pivot/transpose the data? I am using postgreSQL version 7.0.x
> >
> > thanks,
> > Jeremy
>
> Try This:
>
> select region, sum(a_volume) AS a_volume,
>        sum(b_volume) AS b_volume,
>        sum(c_volume) AS c_volume
> from (
>   select scanid, volume AS a_volume, 0 AS b_volume, 0 AS c_volume
>   from mytable where region = A
>   UNION
>   select scanid, 0 AS a_volume, volume AS b_volume, 0 AS c_volume
>   from mytable where region = B
>   UNION
>   select scanid, 0 AS a_volume, 0 AS b_volume, volume AS c_volume
>   from mytable where region = C
> ) tmp
>
> (you might have to specifically typecast the zero's)
>
> It would probably also be possible using CASE-statements. This is just
> _one_ idea.
>
> Arian.
H Jeremy Bockholt schreef:

> I have a generalized table:
>
> scanid | region | volume
> -------------------------
> 1        A        34.4
> 1        B        32.1
> 1        C        29.1
> 2        A        32.4
> 2        B        33.2
> 2        C        35.6
>
> I want to create a flattened out view that looks like the following:
>
> scanid | A_volume | B_volume | C_volume
> ----------------------------------------
> 1        34.4       32.1       29.1
> 2        32.4       33.2       35.6
>
> How do I correctly/efficiently construct a psql query to
> pivot/transpose the data? I am using postgreSQL version 7.0.x
>
> thanks,
> Jeremy

Try This:

select region, sum(a_volume) AS a_volume,
       sum(b_volume) AS b_volume,
       sum(c_volume) AS c_volume
from (
  select scanid, volume AS a_volume, 0 AS b_volume, 0 AS c_volume
  from mytable where region = A
  UNION
  select scanid, 0 AS a_volume, volume AS b_volume, 0 AS c_volume
  from mytable where region = B
  UNION
  select scanid, 0 AS a_volume, 0 AS b_volume, volume AS c_volume
  from mytable where region = C
) tmp

(you might have to specifically typecast the zero's)

It would probably also be possible using CASE-statements. This is just _one_ idea.

Arian.
Jeff,
> SELECT
> scanid,
> sum ( CASE WHEN region = 'A' THEN volume ELSE NULL
> )
> AS A_volume,
> sum ( CASE WHEN region = 'B' THEN volume ELSE NULL
> )
> AS B_volume,
> sum ( CASE WHEN region = 'C' THEN volume ELSE NULL
> )
> AS C_volume
> FROM table
> GROUP BY scanid;
Hey, that's an elegant solution to doing it in 7.0.3. I hadn't thought
of it. Jeremy, never mind what I said about being forced to upgrade.
Upgrading *would* still be a good idea, of course.
Of course, it's only *half* a solution. Your query will result in:
scanid A_volume B_volume C_volume
1 34.5
1 55.1
1 12.3
2 11.1
etc.
For the second half of the solution, Jeremy needs to create the above as
a view ('volume_rollup_1') and apply this second view:
SELECT scanid, SUM(A_volume) as A_volume, SUM(B_Volume) as B_Volume,
SUM(C_volume) as C_volume
FROM volume_rollup_1
GROUP BY scanid;
This will give Jeremy the "pivot" grid he's looking for.
> BTW, I don't believe the self-join approach proposed
> earlier will work, because joining on "scanid" will
> create a cartesian type join where the region values
> will be duplicated (multiplicated!).
Not if you're talking about my query, they won't. I use that query form
in many projects to create roll-ups; it's the "best" SQL92 approach to
the "pivot table" problem. However, it will not work in 7.0.3.
-Josh
______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh@agliodbs.com
and data management solutions (415) 565-7293
for law firms, small businesses fax 621-2533
and non-profit organizations. San Francisco
Jeremy,
> I see how your idea could work--the only thing I would change would
> be region needs to be scanid on the first line; however,
> I'm using PostgreSQL v7.03, so I can not do subqueries within FROM
> clause.
>
> Other than creating a temporary table, is there a way to
> restructure this solution to get around this limitation?
No, you need to upgrade. What's the obstacle to using 7.1.3, anyway?
-Josh
______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh@agliodbs.com
and data management solutions (415) 565-7293
for law firms, small businesses fax 621-2533
and non-profit organizations. San Francisco
How about:
SELECT scanid, sum ( CASE WHEN region = 'A' THEN volume ELSE NULL
) AS A_volume, sum ( CASE WHEN region = 'B' THEN volume ELSE NULL
) AS B_volume, sum ( CASE WHEN region = 'C' THEN volume ELSE NULL
) AS C_volume
FROM table
GROUP BY scanid;
Requires that you know in advance the range of region
values.
A bit shorter than some of the other approaches? :-)
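For readers following along outside SQL, the one-pass pivot that the CASE query performs can be sketched in Python; this is only an illustration, with the sample rows copied from the example at the top of the thread:

```python
# One-pass pivot mirroring SUM(CASE WHEN region = 'X' THEN volume END):
# each input row fills exactly one output column, named by its region.
rows = [  # (scanid, region, volume), from the thread's example table
    (1, 'A', 34.4), (1, 'B', 32.1), (1, 'C', 29.1),
    (2, 'A', 32.4), (2, 'B', 33.2), (2, 'C', 35.6),
]
regions = ('A', 'B', 'C')  # must be known in advance, as with the CASE query

pivot = {}
for scanid, region, volume in rows:
    row = pivot.setdefault(scanid, dict.fromkeys(regions))  # columns start as NULL
    # Like SUM over the CASE arms: accumulate into the matching column only.
    row[region] = volume if row[region] is None else row[region] + volume

print(pivot[1])  # {'A': 34.4, 'B': 32.1, 'C': 29.1}
```

As in the SQL version, a region absent for some scanid simply stays NULL (None) in that row.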
BTW, I don't believe the self-join approach proposed
earlier will work, because joining on "scanid" will
create a cartesian type join where the region values
will be duplicated (multiplicated!).
--- Aasmund Midttun Godal <postgresql@envisity.com>
wrote:
> I think this might do the trick...
>
> CREATE FUNCTION hori_view() RETURNS BOOLEAN AS '
> DECLARE
> view_select TEXT;
> view_from TEXT;
> view_where TEXT;
> column_name TEXT;
> last_column_name TEXT;
> g_row generalized_table%ROWTYPE;
> BEGIN
> SELECT region INTO column_name
> FROM generalized_table ORDER BY region LIMIT 1;
> view_select := ''SELECT '' || column_name ||
> ''.scanid, '' || column_name || ''.volume AS "''
> ||
> column_name || ''_volume"'';
> view_from := '' FROM generalized_table '' ||
> column_name;
> view_where := '' WHERE '' ||column_name ||
> ''.region = '''''' || column_name || '''''''';
> last_column_name := column_name;
> FOR g_row IN SELECT DISTINCT ON (region) *
> FROM generalized_table ORDER BY region OFFSET 1
> LOOP
> view_select := view_select || '', '' ||
> g_row.region ||
> ''.volume AS "'' || g_row.region || ''_volume"'';
> view_from := view_from || '' JOIN
> generalized_table '' ||
> g_row.region || '' ON ('' || last_column_name ||
> ''.scanid = '' || g_row.region || ''.scanid)'';
> view_where := view_where || '' AND '' ||
> g_row.region ||
> ''.region = '''''' || g_row.region || '''''''';
> last_column_name := g_row.region;
> END LOOP;
> EXECUTE ''CREATE VIEW generalized_view AS '' ||
> view_select ||
> view_from || view_where;
> RETURN TRUE;
> END;
> ' LANGUAGE 'plpgsql';
>
> SELECT hori_view();
>
> SELECT * FROM generalized_view;
>
> Ok, it may not be pretty but it works,
>
> Regards,
>
> Aasmund.
>
> On Wed, 31 Oct 2001 12:42:10 -0800, "Josh Berkus"
> <josh@agliodbs.com> wrote:
> > Jeremy,
> >
> > First, to do a pivot table, you have to be using Postgres 7.1.x. 7.0.x
> > will not do it. So upgrade now.
> >
> > There are two approaches, the simple approach and the complex. The
> > simple approach requires you to know in advance of building the view all
> > of the possible values for your category column. The complex approach,
> > which is dynamic, requires a rather sophisticated function (which I will
> > write eventually, really!) so we won't go into it here.
> >
> > The simple approach is to create each column as a sub-select in the FROM
> > clause of your statement. So, per the example above:
> >
> > SELECT scanid, A_volume, B_volume, C_volume
> > FROM (SELECT scanid FROM volumes GROUP BY scanid) scan
> > LEFT OUTER JOIN
> > (SELECT scanid, volume as A_volume FROM volumes WHERE region = 'A') av
> > ON scan.scanid = av.scanid LEFT OUTER JOIN
> > (SELECT scanid, volume as B_volume FROM volumes WHERE region = 'B') bv
> > ON scan.scanid = bv.scanid LEFT OUTER JOIN
> > (SELECT scanid, volume as C_volume FROM volumes WHERE region = 'C') cv
> > ON scan.scanid = cv.scanid
> > ORDER BY scanid;
> >
> > This approach can be adapted to include aggregates and the like.
> >
> > -Josh Berkus
> >
> > ______AGLIO DATABASE SOLUTIONS___________________________
> > Josh Berkus
> > Complete information technology      josh@agliodbs.com
> > and data management solutions        (415) 565-7293
> > for law firms, small businesses      fax 621-2533
> > and non-profit organizations.        San Francisco
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 4: Don't 'kill -9' the postmaster
>
> Aasmund Midttun Godal
>
> aasmund@godal.com - http://www.godal.com/
> +47 40 45 20 46
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
__________________________________________________
Do You Yahoo!?
Make a great connection at Yahoo! Personals.
http://personals.yahoo.com
Josh,

I don't see how you got the result you quote below. My query sums the volume figures, grouping by scanid: that worked as expected (one line per scanid) on my system when I tested it (version 7.1.2, though I don't think that matters).

Jeff

--- Josh Berkus <josh@agliodbs.com> wrote:
> Of course, it's only *half* a solution. Your query
> will result in:
>
> scanid A_volume B_volume C_volume
> 1 34.5
> 1 55.1
> 1 12.3
> 2 11.1
> etc.
Jeff,
> I don't see how you got the result you quote below.
> My query sums the volume figures, grouping by scanid:
> that worked as expected (one line per scanid) on my
> system when I tested it (version 7.1.2, though I don't
> think that matters).
Lemme try it ....
Yes, you're right. Sorry! Disregard my commentary about the second
view.
-Josh
______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh@agliodbs.com
and data management solutions (415) 565-7293
for law firms, small businesses fax 621-2533
and non-profit organizations. San Francisco
> > BTW, I don't believe the self-join approach proposed
> > earlier will work, because joining on "scanid" will
> > create a cartesian type join where the region values
> > will be duplicated (multiplicated!).
>
> Not if you're talking about my query, they won't. I use that query form
> in many projects to create roll-ups; it's the "best" SQL92 approach to
> the "pivot table" problem. However, it will not work in 7.0.3.

I think he might be talking about mine. The region values will not be duplicated, the WHERE clause prevents it.

I kind of prefer my own query aesthetically, is it as efficient internally?

- James
--- James Orr <james@lrgmail.com> wrote:
> I think he might be talking about mine. The region
> values will not be duplicated, the WHERE clause prevents it.

If you are saying that I didn't read the original query closely enough, you're probably right. Unfortunately I deleted the original message, so I can't check that.

> I kind of prefer my own query aesthetically, is it
> as efficient internally?

Someone more knowledgeable will have to answer that: though I would guess that working with three joined tables would slow things down somewhat.

> - James
(apologies if this was sent twice)
Hello all. I have an academic project that I’m working on and, as I’m relatively new to optimization techniques and database design, I’m wondering if some of you can give me some pointers. Below is the schema (to model baseball statistics), and I’m pretty much stuck with it at this point. If anyone has any suggestions involving changing the schema, I’d appreciate hearing them just for future reference…
There are several interesting queries that I’ve been constructing just to get a feel for this schema, and given that some of these tables contain 100,000-200,000 tuples, some queries are taking a good 5-10 seconds to execute. I’m just wondering if this is simply the fault of my schema or are these queries poorly constructed? Here are some queries I’ve been trying:
Here’s a query for the top ten all time home run leaders:
select P.first_name, P.last_name, S.player_id, sum(B.hr) as hr, (sum(B.h)::float / sum(B.ab)::float) ab
from statistics S, batting_stats B, players P
where S.id=B.id and S.player_id=P.id and B.ab>0
group by P.first_name, P.last_name, S.player_id
order by hr desc limit 10;
Select the 10 highest batting averages of all time where the batter had 600 or more at bats. Also gather the name, year, team, hits, at bats...
select (B.h::float / B.ab) as avg, B.h, B.ab, S.year, P.first_name, P.last_name, T.city, T.name
from batting_stats B, statistics S, players P, Teams T
where B.ab > 600 and S.id=B.id and S.player_id=P.id and S.team_id=T.id
order by avg desc limit 10;
To find the names of the single season home run leaders, along with the total number of home runs, the team name/city and the year:
select P.first_name, P.last_name, B.hr, T.name, T.city, S.year from statistics S, batting_stats B, players P, teams T
where (S.id=B.id) and
(S.player_id=P.id) and
(B.hr>30) and
(T.id=S.team_id) limit 10;
You get the idea. These queries take a while. Is this just the way it is or there things that can be done to optimize this?
One separate issue (the reason why the above examples are all about batting statistics) I’m having is representing the innings pitched statistic. The way it is often represented (and the way it is done in this schema) is something like this 123.0 means exactly 123 innings pitched, 123.1 means 123 1/3 innings, and 123.2 means 123 2/3 innings pitched. I’m contemplating the best way to write a function that knows how to sum these values accurately. Is this something that can be done with PL/PGSQL or should I go straight to something like PLPERL? Alternatively, I could research a way to represent fractions in the DB and write a script to convert all values in this column. Any advice here??
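One way to make such sums exact (sketched here in Python, with helper names invented for illustration) is to normalize innings pitched to a whole number of thirds of an inning, do all arithmetic in that unit, and only format back to the .1/.2 notation for display:

```python
# Innings pitched encoded as 123.1 = 123 and 1/3 innings, 123.2 = 123 and 2/3.
# Doing arithmetic in whole thirds sidesteps the decimal encoding entirely.

def to_thirds(ip: float) -> int:
    """123.1 -> 370: whole innings times 3, plus the tenths digit (0, 1 or 2)."""
    whole = int(ip)
    return whole * 3 + round((ip - whole) * 10)

def from_thirds(thirds: int) -> str:
    """370 -> '123.1': integer-divide back into the display notation."""
    return f"{thirds // 3}.{thirds % 3}"

total = sum(to_thirds(ip) for ip in (123.1, 0.2, 6.0))  # 370 + 2 + 18 = 390
print(from_thirds(total))  # 130.0  (123 1/3 + 2/3 + 6 innings exactly)
```

The same idea carries over to PL/pgSQL; the point is only that the summation happens in an integer unit.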
Thanks in advance for any thoughts, comments, suggestions…
-Jeff
create table leagues
(
-- create an integer id field for easier and efficient FK referencing
id serial,
name varchar(50),
first_year integer not null
check (first_year > 1850 and first_year < 2010),
last_year integer
check (last_year > 1850 and last_year < 2010),
primary key(id),
constraint chronological_order
check (last_year >= first_year)
);
create table teams
(
id serial,
name varchar(50) not null,
city varchar(50),
first_year integer not null
check (first_year > 1850 and first_year < 2010),
last_year integer
check (last_year > 1850 and last_year < 2010),
alt_id integer,
league_id integer not null,
primary key(id),
foreign key(alt_id) references teams(id),
foreign key(league_id) references leagues(id)
on delete cascade,
constraint chronological_order
check (last_year >= first_year)
);
create table players
(
id serial,
first_name varchar(30),
last_name varchar(30) not null,
bats char(1) check (bats in ('L','R','S')),
throws char(1) check (throws in ('L','R')),
dob date,
primary key(id)
);
create table statistics
(
id serial,
year integer not null,
g integer,
player_id integer not null,
team_id integer not null,
foreign key(player_id) references players(id)
on delete cascade,
foreign key(team_id) references teams(id)
on delete cascade,
primary key(id)
);
create table managing_stats
(
id integer not null,
w integer,
l integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);
create table fielding_stats
(
id integer not null,
pos char(5),
po integer,
a integer,
e integer,
dp integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);
create table batting_stats
(
id integer not null,
ab integer,
r integer,
h integer,
doubles integer,
triples integer,
hr integer,
rbi integer,
sb integer,
cs integer,
bb integer,
so integer,
sh integer,
sf integer,
ibb integer,
hbp integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);
create table pitching_stats
(
id integer not null,
w integer,
l integer,
gs integer,
cg integer,
sh integer,
sv integer,
ip numeric(5,1),
h integer,
er integer,
hr integer,
bb integer,
so integer,
primary key(id),
foreign key(id) references statistics(id)
on delete cascade
);
On Fri, 9 Nov 2001, Jeff Sack wrote:

> To find the names of the single season home run leaders, along with the
> total number of home runs, the team name/city and the year:
>
> select P.first_name, P.last_name, B.hr, T.name, T.city, S.year from
> statistics S, batting_stats B, players P, teams T
> where (S.id=B.id) and
> (S.player_id=P.id) and
> (B.hr>30) and
> (T.id=S.team_id) limit 10;
>
> You get the idea. These queries take a while. Is this just the way it
> is or there things that can be done to optimize this?

As a starting point, have you run vacuum analyze, and what does explain show for the query? Also, do you have indexes on fields that you're limiting on (like hr, etc...)?

> One separate issue (the reason why the above examples are all about
> batting statistics) I'm having is representing the innings pitched
> statistic. The way it is often represented (and the way it is done in
> this schema) is something like this: 123.0 means exactly 123 innings
> pitched, 123.1 means 123 1/3 innings, and 123.2 means 123 2/3 innings
> pitched. I'm contemplating the best way to write a function that knows
> how to sum these values accurately. Is this something that can be done
> with PL/PGSQL or should I go straight to something like PLPERL?
> Alternatively, I could research a way to represent fractions in the DB
> and write a script to convert all values in this column. Any advice
> here??

You'd probably be best off doing the conversion at insert time into an additional field, assuming that inserts are much less likely than selects on your data.

(col - floor(col)) * (10/3::numeric) seems to get back an appropriate value but is probably reasonably expensive.
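Stephan's conversion expression can be sanity-checked numerically; here is a small Python sketch (the function name is mine, and the parentheses are placed the way the expression seems intended):

```python
import math

def ip_to_true_innings(col: float) -> float:
    """Map the 123.1-style encoding to real innings pitched:
    the tenths digit counts thirds, so (col - floor(col)) * (10/3)
    turns .1 into 1/3 and .2 into 2/3 of an inning."""
    whole = math.floor(col)
    return whole + (col - whole) * (10 / 3)

print(round(ip_to_true_innings(123.2), 4))  # 123.6667
```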
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:

> On Fri, 9 Nov 2001, Jeff Sack wrote:
>
> > To find the names of the single season home run leaders, along with the
> > total number of home runs, the team name/city and the year:
> >
> > select P.first_name, P.last_name, B.hr, T.name, T.city, S.year from
> > statistics S, batting_stats B, players P, teams T
> > where (S.id=B.id) and
> > (S.player_id=P.id) and
> > (B.hr>30) and
> > (T.id=S.team_id) limit 10;
> >
> > You get the idea. These queries take a while. Is this just the way it
> > is or there things that can be done to optimize this?
>
> As a starting point, have you run vacuum analyze and what does
> explain show for the query. Also, do you have indexes on fields that
> you're limiting on (like hr, etc...).
>
> > One separate issue (the reason why the above examples are all about
> > batting statistics) I'm having is representing the innings pitched
> > statistic. The way it is often represented (and the way it is done in
> > this schema) is something like this: 123.0 means exactly 123 innings
> > pitched, 123.1 means 123 1/3 innings, and 123.2 means 123 2/3 innings
> > pitched. I'm contemplating the best way to write a function that knows
> > how to sum these values accurately. Is this something that can be done
> > with PL/PGSQL or should I go straight to something like PLPERL?
> > Alternatively, I could research a way to represent fractions in the DB
> > and write a script to convert all values in this column. Any advice
> > here??
>
> You'd probably be best off doing the conversion at insert time into an
> additional field assuming that inserts are much less likely than
> selects on your data.
>
> (col - floor(col)) * (10/3::numeric) seems to get back an appropriate value
> but is probably reasonably expensive.

Another thought would be to simply store this as an integer that is the number of thirds of an inning that the pitcher pitched. In other words, your 123.1 would be stored as 370.
That would allow you a very easy way to manipulate these values mathematically, and it would be easy to write a simple function to format these values so that your users would see the 123.1 that they expect. Heck, something like:

SELECT (370 / 3)::text || '.' || (370 % 3)::text;

would do exactly what you want. Throw in some indexes like Josh and Stephan suggest and you'll be cooking with gasoline.

You also might want to take a look at Bruce Momjian's article about performance tuning: http://www2.linuxjournal.com/lj-issues/issue88/4791.html

Jason
On Fri, 9 Nov 2001 07:57:41 -0800 (PST)
Stephan Szabo <sszabo@megazone23.bigpanda.com> wrote:
> > On Fri, 9 Nov 2001, Jeff Sack wrote:
> >
> > One separate issue (the reason why the above examples are all about
> > batting statistics) I'm having is representing the innings pitched
> > statistic. The way it is often represented (and the way it is done in
> > this schema) is something like this 123.0 means exactly 123 innings
> > pitched, 123.1 means 123 1/3 innings, and 123.2 means 123 2/3 innings
> > pitched. I'm contemplating the best way to write a function that knows
> > how to sum these values accurately. Is this something that can be done
> > with PL/PGSQL or should I go straight to something like PLPERL?
> > Alternatively, I could research a way to represent fractions in the DB
> > and write a script to convert all values in this column. Any advice
> > here??
>
> You'd probably be best off doing the conversion at insert time into an
> additional field assuming that inserts are much less likely that
> selects on your data.
>
> (col - floor(col)) * (10/3::numeric) seems to get back an appropriate value
> but is probably reasonably expensive.
Hi,

I tried to put your nice idea into the GROUP BY clause, since I wanted to know how expensive it is. There are several players in the players table and about 60k rows in pitching_stats, which has only the columns concerned with the SUM() aggregate function. On my PC with a 333MHz clock, 256M SDRAM, and an ATA33 HDD, the costs of executing the following queries are about 4 sec and 6 sec, respectively. It seems to me that they are reasonable. However, the sort appears to occupy quite a bit of the time.

CREATE TABLE players (
  id serial CONSTRAINT pkey_players PRIMARY KEY,
  first_name varchar(30)
);
CREATE TABLE pitching_stats (
  id integer NOT NULL CONSTRAINT key_pitching_stats REFERENCES players(id),
  ip numeric(5,1)
);
----- 1st version. It costs about 4 sec.

SELECT pl.id, pl.first_name, t.ret
  FROM (SELECT ps.id,
               sum(ps.ip - floor(ps.ip)) % 0.3::numeric
               + floor(sum(ps.ip - floor(ps.ip)) / 0.3::numeric)
               + sum(floor(ps.ip)) AS ret
          FROM pitching_stats AS ps
         GROUP BY ps.id) AS t
 INNER JOIN players AS pl ON (t.id = pl.id);

----- 2nd version. It costs about 6 sec.

SELECT pl.id, pl.first_name,
       sum(ps.ip - floor(ps.ip)) % 0.3::numeric
       + floor(sum(ps.ip - floor(ps.ip)) / 0.3::numeric)
       + sum(floor(ps.ip)) AS ret
  FROM players AS pl
 INNER JOIN pitching_stats AS ps ON (pl.id = ps.id)
 GROUP BY pl.id, pl.first_name;
QUERY PLAN:

----- 1st version.

Hash Join  (cost=6357.01..7197.06 rows=6000 width=32)
  ->  Subquery Scan t  (cost=6355.96..6955.96 rows=6000 width=16)
        ->  Aggregate  (cost=6355.96..6955.96 rows=6000 width=16)
              ->  Group  (cost=6355.96..6505.96 rows=60000 width=16)
                    ->  Sort  (cost=6355.96..6355.96 rows=60000 width=16)
                          ->  Seq Scan on pitching_stats ps  (cost=0.00..983.00 rows=60000 width=16)
  ->  Hash  (cost=1.04..1.04 rows=4 width=16)
        ->  Seq Scan on players pl  (cost=0.00..1.04 rows=4 width=16)

----- 2nd version.

Aggregate  (cost=9037.33..9787.33 rows=6000 width=32)
  ->  Group  (cost=9037.33..9337.33 rows=60000 width=32)
        ->  Sort  (cost=9037.33..9037.33 rows=60000 width=32)
              ->  Hash Join  (cost=1.05..3384.10 rows=60000 width=32)
                    ->  Seq Scan on pitching_stats ps  (cost=0.00..983.00 rows=60000 width=16)
                    ->  Hash  (cost=1.04..1.04 rows=4 width=16)
                          ->  Seq Scan on players pl  (cost=0.00..1.04 rows=4 width=16)
regards,
Masaru Sugawara