Обсуждение: MEDIAN as custom aggregate?
Folks, Hey, anybody have a custom aggregate for median calucation? I'm doing this through a PL/pgSQL function, and a custom aggregate would probably be faster. For those whose stats terminology is rusty, the "median" is the "middle" value in a distribution. For example, if we had the following data: Table ages person age Jim 21 Rusty 24 Carol 37 Bob 62 Leah 78 Our Median would be Carol's age, 37. This is a different figure from the Mean, or Average, which is 44.4. Using the combination of the Mean and the Median you can do all kinds of interesting statistical analysis. -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Can't you do something like select age from ages order by age limit 1 offset (select count(*) from ages)/2; except you can't nest the select so you'll have to use a variable to hold it... Make sure it does the right thing when there is an odd number of rows. I don't understand why you want the median and not some parameters of your assumed distribution (mean and variance, for example)but each to his own... Allan. Josh Berkus wrote: > Folks, > > Hey, anybody have a custom aggregate for median calucation? I'm doing > this through a PL/pgSQL function, and a custom aggregate would probably > be faster. > > For those whose stats terminology is rusty, the "median" is the "middle" > value in a distribution. For example, if we had the following data: > > Table ages > person age > Jim 21 > Rusty 24 > Carol 37 > Bob 62 > Leah 78 > > Our Median would be Carol's age, 37. This is a different figure from > the Mean, or Average, which is 44.4. Using the combination of the Mean > and the Median you can do all kinds of interesting statistical analysis. > > -Josh Berkus > > ______AGLIO DATABASE SOLUTIONS___________________________ > Josh Berkus > Complete information technology josh@agliodbs.com > and data management solutions (415) 565-7293 > for law firms, small businesses fax 621-2533 > and non-profit organizations. San Francisco > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
Allan,
> Can't you do something like
>
> select age from ages order by age limit 1 offset (select count(*)
> from ages)/2;
>
> except you can't nest the select so you'll have to use a variable to
> hold it...
>
> Make sure it does the right thing when there is an odd number of
> rows.
Duuuuuuuh! I feel like a real idiot now.
The query I'll use is this:
SELECT site, COUNT(pageviews) as count_views, AVG(pageviews) AS
mean_views, median_views
FROM sites, (SELECT pageviews as median_view FROM pageviews LIMIT 1 OFFSET middlerec('pageviews')) med
GROUP BY site, median_views;
Where middlerec is a custom function that counts the records and returns
the middle one.
-Josh
______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete
informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small
businesses fax 621-2533 and non-profit organizations. San Francisco
"Josh Berkus" <josh@agliodbs.com> writes:
> The query I'll use is this:
> SELECT site, COUNT(pageviews) as count_views, AVG(pageviews) AS
> mean_views, median_views
> FROM sites,
> (SELECT pageviews as median_view
> FROM pageviews
> LIMIT 1 OFFSET middlerec('pageviews')) med
> GROUP BY site, median_views;
> Where middlerec is a custom function that counts the records and returns
> the middle one.
Um ... does that work? I thought LIMIT was fairly restrictive about
what it would take as a parameter --- like, constants or $n parameters
only.
I do not know of any median-finding algorithm that doesn't require a
depressingly large amount of storage...
regards, tom lane
On Fri, 12 Oct 2001 12:38:12 -0700
"Josh Berkus" wrote:
> For those whose stats terminology is rusty, the "median" is the "middle"
> value in a distribution. For example, if we had the following data:
>
> Table ages
> person age
> Jim 21
> Rusty 24
> Carol 37
> Bob 62
> Leah 78
>
> Our Median would be Carol's age, 37. This is a different figure from
> the Mean, or Average, which is 44.4. Using the combination of the Mean
> and the Median you can do all kinds of interesting statistical analysis.
>
In such case of this, there needs "nextval" in a query to deal with
a large number of rows. I think the following query, is not smart,
will return the necessary rows (or an average of the rows). But even
so it may need considerable time...
-- (on 7.1.3)
create sequence seq_ages start 1;
select a1.age, a1.rank -- or select avg(a1.age) from (select a0.person, a0.age, (nextval('seq_ages') - 1) as rank
from (select *, setval('seq_ages',1) -- to reset a sequence from ages order by age
-- this insignificant "order by" is -- needed in order to work "setval"
)as a0 order by a0.age ) as a1 where exists (select * from ages where a1.rank >=
(select(count(*)+1)/2 from ages) and a1.rank <= (select count(*)/2+1 from ages) )
;
Regards,
Masaru Sugawara
Tom, > Um ... does that work? I thought LIMIT was fairly restrictive about > what it would take as a parameter --- like, constants or $n > parameters > only. > > I do not know of any median-finding algorithm that doesn't require a > depressingly large amount of storage... Me neither. You're right; the query didn't work. Here's a link to the median-finding function I posted to the CookBook: http://www.brasileiro.net/postgres/cookbook/view-one-recipe.adp?recipe_id=1654 Anyone who can improve it is welcome! -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco