Discussion: Find "smallest common year"
Hi there,
my aim is to plot a line graph for a single country but for two or
three variables. Now, one could use two different y-axes (if there
are two data sets with different units), but that's not really the
right way to analyse the data.
One should rather normalize them to a base year to make them
comparable. To achieve this, one would set the values for a specific
year to 1 (or 100) and put the other values of all variables in
relation to it.
Thus, taking two variables, say,
Fish Catch
- in Thousand Metric Tons
- data available for 1975 to 2005
and
Total Population
- in Million People
- data available for 1960 to 2005
Country is Kenya.
The "smallest common year" would be 1975. One thus needs to set the
values for that year to 1 (or 100), and scale the other values of
the variables by the corresponding factor.
Now the question is how to find that "smallest common year"
automatically, via SQL or PHP?
Can anyone give me a hint? Thanks for any help!
Stef
____________________________________________________________________
Stefan Schwarzer
Lean Back and Relax - Enjoy some Nature Photography:
http://photoblog.la-famille-schwarzer.de
Appetite for Global Data? UNEP GEO Data Portal:
http://geodata.grid.unep.ch
____________________________________________________________________
If I understood your question, maybe this is what you want:
SELECT min(<desired data>) FROM table1 JOIN table2 ON (table1.data_field = table2.data_field);
--
William Leite Araújo
Fatherhood student - 24th week
Sorry,
I forgot to mention my table design, which is like this:
name 2001 2002 2003 2004 2005
-----------------------------------------------------------------
Afghanistan ....
Albania ....
(Yep, I know, bad table design... :-) I tried to change it to the more common "id | year | value" format, but there were too many SQL problems afterwards for the different queries/calculations we need.)
So, the mentioned SQL unfortunately doesn't work in that case....
Stef
On Sep 27, 2007, at 1:52 PM, William Leite Araújo wrote:
> If I understood your question, maybe this is what you want:
>
> SELECT min(<desired data>) FROM table1 JOIN table2 ON (table1.data_field = table2.data_field);
Stefan Schwarzer wrote:
> Sorry,
>
> I forgot to mention my table design, which is like this:
>
> name          2001   2002   2003   2004   2005
> -----------------------------------------------------------------
> Afghanistan   ....
> Albania       ....
>
> (Yep, I know, bad table design... :-) I tried to change it to the
> more common "id | year | value" format, but there were too many SQL
> problems afterwards for the different queries/calculations we need.)

May I suggest that you concentrate on solving *those* problems instead of the programmatically trivial computation of the lowest common value? Notice that a *really trivial* programming exercise becomes highly involved in your case. If I'm understanding correctly what you have, I assume you'd have to check the fields one by one for NULL or non-NULL values. That's intolerably ugly, IMHO, and it is a very simple and direct consequence of an as-unsound-as-it-gets db/table design.

Carlos
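With the normalized "id | year | value" layout the computation really is trivial: the "smallest common year" is simply the greatest of the per-variable minimum years. A sketch in Python with SQLite, using made-up table and column names (not the poster's actual schema); the subquery works verbatim in PostgreSQL too:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (variable TEXT, country TEXT, year INTEGER, value REAL);
-- Fish Catch for Kenya starts in 1975, Total Population in 1960
INSERT INTO data VALUES
  ('fish_catch', 'Kenya', 1975, 28.0),
  ('fish_catch', 'Kenya', 1976, 31.0),
  ('population', 'Kenya', 1960, 8.1),
  ('population', 'Kenya', 1975, 13.5);
""")

# Smallest common year = the latest of each variable's first year.
row = conn.execute("""
    SELECT max(first_year)
    FROM (SELECT variable, min(year) AS first_year
          FROM data
          WHERE country = 'Kenya'
          GROUP BY variable) AS per_variable
""").fetchone()
print(row[0])  # 1975
```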
I really tried it out. I changed my whole database to the "id_country | year | value" format, and then tried to build my SQL queries. But it was really, really difficult, and sometimes impossible (for me) to get what I wanted.

Actually, it remains quite difficult for me to remember the actual problems I had with it.

But, for example, I need to aggregate the national data on-the-fly to their regions. I need to calculate per-capita data on-the-fly for each variable. Although one would say this should be simple to accomplish, a semi-professional database expert and I could hardly solve these things.

In one case we came up with as many sub-selects as years were available (or selected by the user), which can be up to 60 years. Is this "efficient" SQL programming?

What would you recommend for, say, 500 global national statistical variables, 500 regional, 500 subregional and 500 global aggregations? The years covered range between 10 and 60 years for each of these variables, all available for 240 countries/territories.
Thanks for any recommendations!
Stef
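For what it's worth, both on-the-fly aggregations mentioned here (regional totals and per-capita values) collapse to a single GROUP BY over the normalized layout; no per-year sub-selects are needed. A sketch in Python with SQLite, with invented table/column names and toy numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (name TEXT PRIMARY KEY, region TEXT);
CREATE TABLE data (variable TEXT, country TEXT, year INTEGER, value REAL);
INSERT INTO countries VALUES ('Kenya', 'Africa'), ('Uganda', 'Africa');
INSERT INTO data VALUES
  ('gdp',        'Kenya',  2000, 12.7),
  ('gdp',        'Uganda', 2000,  6.2),
  ('population', 'Kenya',  2000, 31.3),
  ('population', 'Uganda', 2000, 24.2);
""")

# Pivot the two variables with conditional aggregates, then divide:
# regional GDP and regional GDP per capita in one query.
row = conn.execute("""
    SELECT c.region,
           sum(CASE WHEN d.variable = 'gdp' THEN d.value END) AS gdp,
           sum(CASE WHEN d.variable = 'gdp' THEN d.value END)
             / sum(CASE WHEN d.variable = 'population' THEN d.value END)
             AS gdp_per_capita
    FROM data d
    JOIN countries c ON c.name = d.country
    WHERE d.year = 2000
    GROUP BY c.region
""").fetchone()
print(row)  # region, summed GDP, GDP per capita
```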
On 9/27/07, Stefan Schwarzer <stefan.schwarzer@grid.unep.ch> wrote:
> I really tried it out. I changed my whole database to the "id_country
> | year | value" format. And then tried to build my SQL queries. But it
> was really, really difficult, and sometimes impossible (for me) to get
> what I wanted.
>
> But, for example, I need to aggregate the national data on-the-fly to
> their regions. I need to calculate per-capita data on-the-fly for each
> variable. Although one would say this should be simple to accomplish,
> a semi-professional database expert and I could hardly solve these
> things.

You should have brought your problem here. You'd be surprised what a fresh set of eyes can see.

> In one case we came up with as many sub-selects as years were
> available (or selected by the user), which can be up to 60 years. Is
> this "efficient" SQL programming?

Probably not. But that doesn't mean it was the right approach either. There may well have been a more efficient approach you didn't think of.

> What would you recommend for, say, 500 global national statistical
> variables, 500 regional, 500 subregional and 500 global aggregations?
> The years covered range between 10 and 60 years for each of these
> variables, all available for 240 countries/territories.

I generally approach such problems by putting the data right (normalized) at the start, then munging the data into summary tables to handle the problems you're seeing now.

I find it far easier to maintain normalized tables that produce non-normalized ones (for things like data warehousing) than it is to maintain non-normalized tables and try to produce normalized data from that.
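The "summary tables" idea above can be as simple as materializing the convenient shape from the normalized source. A minimal sketch in Python with SQLite (names and numbers invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (variable TEXT, country TEXT, year INTEGER, value REAL);
INSERT INTO data VALUES
  ('fish_catch', 'Kenya', 1975, 28.0),
  ('fish_catch', 'Kenya', 1976, 31.0);

-- Maintain the normalized table; derive the summary table from it
-- (re-run after each data load, or use a view instead).
CREATE TABLE summary AS
  SELECT variable, country,
         min(year)  AS first_year,
         max(year)  AS last_year,
         avg(value) AS mean_value
  FROM data
  GROUP BY variable, country;
""")

rows = conn.execute("SELECT * FROM summary").fetchall()
print(rows)  # [('fish_catch', 'Kenya', 1975, 1976, 29.5)]
```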
>> What would you recommend for, say, 500 global national statistical
>> variables, 500 regional, 500 subregional and 500 global aggregations?
>> The years covered range between 10 and 60 years for each of these
>> variables, all available for 240 countries/territories.
>
> I generally approach such problems by putting the data right
> (normalized) at the start, then munging the data into summary tables
> to handle the problems you're seeing now.
>
> I find it far easier to maintain normalized tables that produce
> non-normalized ones (for things like data warehousing) than it is to
> maintain non-normalized tables and try to produce normalized data
> from that.

Ok, I do understand that.

So, instead of the earlier mentioned database design, I would have something like this:

- one table for the country names/ids/etc. (Afghanistan, 1; Albania, 2; ...)
- one table for the variable names/ids/etc. (GDP, 1; Population, 2; Fish Catch, 3; ...)
- one table for the year names/ids/etc. (1970, 1; 1971, 2; 1972, 3; ...)
and
- one table for all "statistical data" with four fields: id_variable, id_country, id_year, and the actual value

It seems to me more difficult now to produce a non-normalized output based on the normalized tables. How would a query look if I now need to SELECT, say, 100 countries and 20 years? Something like this (simplified and without joins):

SELECT
    value,
    id_year
FROM
    main_table
WHERE
    (year = '1970' OR year = '1971' OR ...)
    AND (country_name = 'Afghanistan' OR country_name = 'Albania' OR ...)

Actually, last time we came up with sub-selects for each year. So, does this make sense?

Thanks a lot for your help!

Stef

____________________________________________________________________
Stefan Schwarzer
Lean Back and Relax - Enjoy some Nature Photography:
http://photoblog.la-famille-schwarzer.de
Appetite for Global Data? UNEP GEO Data Portal:
http://geodata.grid.unep.ch
____________________________________________________________________
Stefan Schwarzer wrote:
> So, instead of the earlier mentioned database design, I would have
> something like this:
>
> - one table for the country names/ids/etc. (Afghanistan, 1; Albania, 2; ...)
> - one table for the variable names/ids/etc. (GDP, 1; Population, 2; Fish Catch, 3; ...)
> - one table for the year names/ids/etc. (1970, 1; 1971, 2; 1972, 3; ...)
> and
> - one table for all "statistical data" with four fields: id_variable,
>   id_country, id_year, and the actual value

This is one possibility. Another is to have one table for each variable. This has the benefit of not mixing different units/data types in the same field. It does mean you cannot use the same (parameterized) query for getting different measures. Since it is easy to create views converting from one of these presentations to the other, which one you choose is not that important.

Also, there is no obvious need to have a lookup table for years - just store the year as an integer in your data table(s). If necessary, add a constraint indicating which years are valid. You can produce rows for missing years by left joining with generate_series(start_year, end_year).

Even if you choose to store the valid years in a table, the id_year is unnecessary - just use the year itself as the primary key.

More in another reply.

Nis
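The generate_series() trick, sketched end-to-end. SQLite is used here only so the example is self-contained; it lacks generate_series() by default, so a recursive CTE stands in for it, while in PostgreSQL the LEFT JOIN against generate_series(start_year, end_year) is direct:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (year INTEGER, value REAL);
INSERT INTO data VALUES (1970, 1.0), (1972, 3.0);  -- 1971 is missing
""")

# PostgreSQL equivalent of the CTE below:
#   SELECT y.year, d.value
#   FROM generate_series(1970, 1972) AS y(year)
#   LEFT JOIN data d ON d.year = y.year;
rows = conn.execute("""
    WITH RECURSIVE years(year) AS (
        SELECT 1970
        UNION ALL
        SELECT year + 1 FROM years WHERE year < 1972
    )
    SELECT y.year, d.value
    FROM years y
    LEFT JOIN data d ON d.year = y.year
    ORDER BY y.year
""").fetchall()
print(rows)  # [(1970, 1.0), (1971, None), (1972, 3.0)]
```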
Stefan Schwarzer wrote:
> So, instead of the earlier mentioned database design, I would have
> something like this:
>
> - one table for the country names/ids/etc. (Afghanistan, 1; Albania, 2; ...)
> - one table for the variable names/ids/etc. (GDP, 1; Population, 2; Fish Catch, 3; ...)
> - one table for the year names/ids/etc. (1970, 1; 1971, 2; 1972, 3; ...)

If you _do_ need this table (because you want to constrain your statistical data to only contain a specific set of years, or because you need a quick list of available years to select from): make the year the primary key and drop the artificial index.

Years are perfectly fine data to constrain on, and it saves you the joins with that table (the foreign key constraint constrains your data sufficiently).

You may want to constrain your years further by adding a check constraint, e.g. CHECK (year > 1900).

The same argument goes for the country names, but I generally don't like to have text data for a primary key. Contrary to years, their values might change (for which an ON UPDATE CASCADE FK constraint trigger would have to go through your data to update all the references). Plus they take up a bit more space than integers, although with country names that won't matter much. OTOH, if your country names contain non-ASCII characters and the database is configured to collate on those properly, it will be slower. That aside (these are kind of minor points), the country name is also a good candidate to become a key (primary and foreign respectively).

> and
> - one table for all "statistical data" with four fields: id_variable,
>   id_country, id_year, and the actual value

Some things may have changed here due to my above suggestions.

> It seems to me more difficult now to produce a non-normalized output
> based on the normalized tables. How would a query look if I now need
> to SELECT, say, 100 countries and 20 years? Something like this
> (simplified and without joins):
>
> SELECT value, id_year
> FROM main_table
> WHERE (year = '1970' OR year = '1971' OR ...)
>   AND (country_name = 'Afghanistan' OR country_name = 'Albania' OR ...)

I don't really see what the problem is; with the years as columns in your table they're just in different places in this query. Without knowing what problem you're trying to solve with this query it's a bit of a guess. I assume those years and countries are user input? I usually use BETWEEN or IN (...) for such cases, but that boils down to the same query. It's just shorter to write.

BTW, you didn't actually use type text for your year column, did you? No quotes needed then. Otherwise you'd have to make sure your year values are all the same length or sorting gets... interesting.

SELECT value, year
FROM main_table
WHERE year BETWEEN 1970 AND 1975
  AND country_name IN ('Afghanistan', 'Albania', ...)

But I think you'll usually be looking for aggregates, so GROUP BY and HAVING are probably more appropriate for your queries.

SELECT country_name, year, SUM(value)
FROM main_table
WHERE country_name IN ('Afghanistan', 'Albania', ...)
  AND year BETWEEN 1970 AND 1975
GROUP BY country_name, year
HAVING SUM(value) > 0;

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
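The GROUP BY query above, runnable end-to-end (Python with SQLite purely so the snippet is self-contained; the SQL itself is the same in PostgreSQL). Table contents are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (country_name TEXT, year INTEGER, value REAL);
INSERT INTO main_table VALUES
  ('Afghanistan', 1970, 2.0),
  ('Afghanistan', 1971, 3.0),
  ('Albania',     1970, 5.0),
  ('Kenya',       1970, 7.0);   -- filtered out by the IN list below
""")

rows = conn.execute("""
    SELECT country_name, year, sum(value)
    FROM main_table
    WHERE country_name IN ('Afghanistan', 'Albania')
      AND year BETWEEN 1970 AND 1975
    GROUP BY country_name, year
    HAVING sum(value) > 0
    ORDER BY country_name, year
""").fetchall()
print(rows)
# [('Afghanistan', 1970, 2.0), ('Afghanistan', 1971, 3.0), ('Albania', 1970, 5.0)]
```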
A few more comments on your table design.

Stefan Schwarzer wrote:
> So, instead of the earlier mentioned database design, I would have
> something like this:
>
> - one table for the country names/ids/etc. (Afghanistan, 1; Albania, 2; ...)

There is a well-established natural key for countries - ISO 3166. Use that instead of "id".

> - one table for the variable names/ids/etc. (GDP, 1; Population, 2;
>   Fish Catch, 3; ...)
> - one table for the year names/ids/etc. (1970, 1; 1971, 2; 1972, 3; ...)

No need for an ID column on this one. Just use the year as the primary key.

Nis
Alban Hertroys wrote:
> If you _do_ need this table (because you want to constrain your
> statistical data to only contain a specific set of years, or because
> you need a quick list of available years to select from): make the
> year the primary key and drop the artificial index.
>
> [...]
>
> The same argument goes for the country names, but I generally don't
> like to have text data for a primary key.

Ah, but there is a standardised list of country codes ideal for the DBA, since the code is (usually) easily understandable.
http://en.wikipedia.org/wiki/ISO_3166

The only problem might be if historical data uses old boundaries (e.g. Soviet Union, Yugoslavia).

--
Richard Huxton
Archonet Ltd
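Putting the two suggestions together (ISO 3166 code as the natural country key, the year stored directly and constrained by a CHECK), a schema sketch in Python with SQLite; the table names, codes and constraints are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;
-- ISO 3166-1 alpha-2 code as the natural primary key; no surrogate id.
CREATE TABLE countries (
    code TEXT PRIMARY KEY CHECK (length(code) = 2),
    name TEXT NOT NULL
);
-- Year stored directly, constrained by a CHECK instead of a lookup table.
CREATE TABLE data (
    country TEXT REFERENCES countries(code),
    year    INTEGER CHECK (year > 1900),
    value   REAL
);
INSERT INTO countries VALUES ('KE', 'Kenya'), ('AL', 'Albania');
INSERT INTO data VALUES ('KE', 1975, 28.0);
""")

rows = conn.execute("""
    SELECT c.name, d.year, d.value
    FROM data d JOIN countries c ON c.code = d.country
""").fetchall()
print(rows)  # [('Kenya', 1975, 28.0)]
```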
> Ah, but there is a standardised list of country codes ideal for the
> DBA, since the code is (usually) easily understandable.
> http://en.wikipedia.org/wiki/ISO_3166
> The only problem might be if historical data uses old boundaries
> (e.g. Soviet Union, Yugoslavia).

Yep, I have all of them (ISO-2, ISO-3, UN), but they don't cover all countries + territories. Nevertheless, in most cases the list is sufficient... Thanks for the link!

Stef
>> - one table for the year names/ids/etc. (1970, 1; 1971, 2; 1972, 3; ...)
>
> If you _do_ need this table (because you want to constrain your
> statistical data to only contain a specific set of years, or because
> you need a quick list of available years to select from): make the
> year the primary key and drop the artificial index.
>
> Years are perfectly fine data to constrain on, and it saves you the
> joins with that table (the foreign key constraint constrains your data
> sufficiently).

If my years are not only single years such as 1970, 1971... but also time spans, such as 1970-75... is your proposal still valid? It would no longer be an integer field, but a text field instead...

Stef
> BTW, you didn't actually use type text for your year column, did you?
> No quotes needed then. Otherwise you'd have to make sure your year
> values are all the same length or sorting gets... interesting.

Yep, my comment just before concerns especially this paragraph, I guess. With not only years but also time spans for my data - so, not only 1970, 1971... but also 1970-75, 1975-80 etc. - I would indeed need to use text for my year column. Why does sorting then get "interesting"?

Stef
>> I find it far easier to maintain normalized tables that produce
>> non-normalized ones (for things like data warehousing) than it is to
>> maintain non-normalized tables and try to produce normalized data
>> from that.

What is your view on (having 500 different variables/data sets) using a single table for all data versus one table for each variable? In terms of "readability" I guess the second solution would be better. But then... I don't know...

Thanks for any views...

Stef
Stefan Schwarzer wrote:
>> BTW, you didn't actually use type text for your year column, did you?
>> No quotes needed then. Otherwise you'd have to make sure your year
>> values are all the same length or sorting gets... interesting.
>
> Yep, my comment just before concerns especially this paragraph, I
> guess. With not only years but also time spans for my data - so, not
> only 1970, 1971... but also 1970-75, 1975-80 etc. - I would indeed
> need to use text for my year column. Why does sorting then get
> "interesting"?

Text sorts alphabetically, not numerically:

integer   text
-------   -----
1         '1'
2         '10'
3         '2'
4         '3'
5         '4'
6         '5'
7         '6'
8         '7'
9         '8'
10        '9'

I guess you won't have any problems with sorting, as you use full century prefixes and the differences in syntax ('1970' vs '1970-75') are at the rear of your values.

An entirely different question is whether it is a good idea to write a range as a value that the database cannot interpret correctly (referring to the '1970-75' notation). You cannot group records by value this way if you need to (for example) combine data from '1970' with data from '1970-75'.

But you seem to use these values just as labels, which I assume are unique across years (e.g. if you have a value '1970-75' you don't have values '1970', '1971'...'1974'), in which case this is safe to use. As pointed out by several people earlier, they make an excellent foreign key too (provided they're unique).

Other options to handle these years involve having a start/end year or date (to_date('01/01/1970', 'MM/DD/YYYY'), to_date('12/31/1975', 'MM/DD/YYYY')) or a start date and an interval (to_date('01/01/1970', 'MM/DD/YYYY'), INTERVAL '1 year'). That makes operators like OVERLAPS and BETWEEN available to your queries, which may be convenient ;)

--
Alban Hertroys
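The alphabetical-vs-numerical point can be seen in a few lines of Python, including the observation that full-century labels (and even the '1970-75' span labels) still sort in the expected order:

```python
# Integers sort numerically; the same values as text sort alphabetically.
nums = list(range(1, 11))
as_text = sorted(str(n) for n in nums)
print(as_text)  # ['1', '10', '2', '3', '4', '5', '6', '7', '8', '9']

# With full-century labels of the kind discussed above, plain
# alphabetical order still comes out right, span labels included.
labels = sorted(['1971', '1970-75', '1970'])
print(labels)   # ['1970', '1970-75', '1971']
```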
> An entirely different question is whether it is a good idea to write a
> range as a value that the database cannot interpret correctly
> (referring to the '1970-75' notation). You cannot group records by
> value this way if you need to (for example) combine data from '1970'
> with data from '1970-75'.
>
> But you seem to use these values just as labels, which I assume are
> unique across years (e.g. if you have a value '1970-75' you don't have
> values '1970', '1971'...'1974'), in which case this is safe to use. As
> pointed out by several people earlier, they make an excellent foreign
> key too (provided they're unique).

Yep, this is a question I posed myself too. At the moment, when doing for example "per capita" calculations on the fly for a variable which has something like 1970-75, I would sum up the Total Population over the given period, divide it by the number of years, and then use it with the selected variable to get the "per capita" data.

But if I would instead insert yearly data, it would mean that it had five lines with the same values. No problem with that?

Stef
Stefan Schwarzer wrote:
> Yep, this is a question I posed myself too. At the moment, when doing
> for example "per capita" calculations on the fly for a variable which
> has something like 1970-75, I would sum up the Total Population over
> the given period, divide it by the number of years, and then use it
> with the selected variable to get the "per capita" data.
>
> But if I would instead insert yearly data, it would mean that it had
> five lines with the same values. No problem with that?

Not entirely what I suggested, but also a viable solution, sure. I was suggesting adding a column to your yearly data marking the end of the range. Given your above examples, you could then do queries like:

SELECT population / num_years FROM my_data;

(Assuming you add the length of the interval as a number of years, which seems plausible because you don't seem to calculate with any intervals not divisible by a year.)

Adding this additional column may justify putting the years (and their durations) into their own table.

--
Alban Hertroys
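The num_years idea in the reply above, as a self-contained sketch (Python with SQLite; my_data and its contents are invented, and the span length follows Stef's convention of counting 1970-75 as five years):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per period; num_years marks the length of the range.
CREATE TABLE my_data (
    period     TEXT PRIMARY KEY,  -- '1970-75', '1976', ...
    num_years  INTEGER NOT NULL,
    population REAL               -- total over the whole period
);
INSERT INTO my_data VALUES ('1970-75', 5, 65.0), ('1976', 1, 14.2);
""")

# Per-year (and hence per-capita-ready) figures fall out of a division.
rows = conn.execute("""
    SELECT period, population / num_years
    FROM my_data
    ORDER BY period
""").fetchall()
print(rows)  # [('1970-75', 13.0), ('1976', 14.2)]
```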