Обсуждение: SQL challenge--top 10 for each key value?
Offered up for anyone with time on their hands. I fiddled around with 
this for half an afternoon, then gave up and did it programmatically in 
Perl.
Given a table that looks something like this:
id       | INTEGER
query    | INTEGER
checksum | char(32)
score    | INTEGER
include  | BOOLEAN
The table is unique by "id". "Checksum" may be repeated, but I only care 
if it is repeated within a given group by "query". ("query" is non-null.)
I can get the top scorer for each "query" row by something like this:
SELECT * FROM (  SELECT DISTINCT ON (checksum) *  FROM my_table  ORDER BY checksum, score DESC)
ORDER BY query;
How would you go about getting the top N (say, the top 10) for each query?
And then, if that's too easy for you--consider a further case where I 
want every row for a given "query" that has "include" TRUE, and enough 
non-"include" rows to make N. I might end up with more than N rows for a 
given value of "query" if there were more than N with "include" set.
I headed off in the direction of groups of SELECTs and UNIONs, and quit 
when I got to something like four levels of "SELECT ... AS FOO" ...
-- 
Jeff Boes                                      vox 269.226.9550 ext 24
Database Engineer                                     fax 269.349.9076
Nexcerpt, Inc.                                 http://www.nexcerpt.com           ...Nexcerpt... Extend your Expertise
			
		
Jeff Boes <jboes@nexcerpt.com> writes:
> I headed off in the direction of groups of SELECTs and UNIONs, and quit when I
> got to something like four levels of "SELECT ... AS FOO" ...
four? wimp, that's nothing!
ok, seriously I think there's no way to do this directly with straight SQL.
You would have to define a non-immutable function that has some temporary
storage where it keeps track of how many it has seen. 
The generic function that would help here would be some kind of rank(value)
that would give you the equivalent of rownum except with a level break every
time value changes. I've been hoping to see something like this on the list
for a long time but haven't yet.
If the value of n is constant and small you could cheat with an aggregate
function with an array of the top n values.
db=> create function first3_accum(integer[],integer) returns integer[] as 'select case when array_upper($1,1) >= 3 then
$1else array_append($1,$2) end' language sql strict immutable;
 
CREATE FUNCTION
db=> create aggregate first3 (basetype = integer, sfunc = first3_accum, stype = integer[], initcond = '{}');
CREATE AGGREGATE
then something like:
SELECT first3(id) FROM (SELECT id         FROM my_table        ORDER BY query,                 CASE WHEN include THEN 1
ELSE2 END ASC,                 score DESC)GROUP BY query
 
But then you'll have to go back to the table to refetch the original records
that you've found. The best way I find to do that is with the int_array_enum()
function from the int_aggregate contrib module.
SELECT *  FROM my_table WHERE id IN (          SELECT int_array_enum(f3)            FROM (                  SELECT
first3(id)as f3                    FROM (SELECT id                            FROM my_table
ORDERBY query,                                    CASE WHEN include THEN 1 ELSE 2 END ASC,
     score DESC) as x                   GROUP BY query                ) as x          )
 
This last step is kind of annoying since you've already seen all those
records. And it requires writing a new aggregate function every time the value
of n changes though, which kind of sucks.
In theory if the new work in 7.5 handling structured datatypes is as cool as
it sounds you could have an array of complete records and when UNNEST is
eventually incorporated into the array code then you could expand those
instead of using the int_array_enum function. Neither of those things are
ready yet as far as I know though.
-- 
greg
			
		On Thu, 2004-04-08 at 19:33, Greg Stark wrote:
> Jeff Boes <jboes@nexcerpt.com> writes:
> 
> > I headed off in the direction of groups of SELECTs and UNIONs, and quit when I
> > got to something like four levels of "SELECT ... AS FOO" ...
> 
> four? wimp, that's nothing!
> 
> ok, seriously I think there's no way to do this directly with straight SQL.
> You would have to define a non-immutable function that has some temporary
> storage where it keeps track of how many it has seen. 
I don't believe that is true, though it is certainly is in PostgreSQL.
The spec has the ability to apply a progressive aggregate on the results
of a query (window function). Meaning you can accomplish things like
counting (ROW_NUMBER) or running totals.
Something along the lines of the below would accomplish what you want
according to spec. ROW_NUMBER() is a spec defined function. (6.10 of
SQL200N)
       SELECT *          FROM (SELECT ROW_NUMBER() OVER (DISTINCT query) AS counter                 <rest of query>
        )         WHERE counter > 10;
 
			
		Rod Taylor <pg@rbt.ca> writes:
> ROW_NUMBER() is a spec defined function. (6.10 of SQL200N)
If the spec doesn't even have a year number yet, you can hardly expect
real implementations to support it ;-).  There is no such thing in the
extant specs SQL92 or SQL99.
        regards, tom lane
			
		Rod, > Something along the lines of the below would accomplish what you want > according to spec. ROW_NUMBER() is a spec defined function. (6.10 of > SQL200N) Great leaping little gods! They added something called "row number" to the spec? Boy howdy, folks were right ... the ANSI committee really has completly blown off the relational model completely. First there was the addition of network-database functions so that IBM could make DB2 look more like a real database, now this .... When a standards committee becomes hostage to a handful of vendors, kiss real standards goodbye. -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Rod, > > > Something along the lines of the below would accomplish what you want > > according to spec. ROW_NUMBER() is a spec defined function. (6.10 of > > SQL200N) > > Great leaping little gods! They added something called "row number" to the > spec? > > Boy howdy, folks were right ... the ANSI committee really has completly blown > off the relational model completely. If it's like Oracle's rownum then it's the row number of the *output*, not the position on disk. So it's not entirely blowing off the relational model any more than ORDER BY does. The weird thing is the number of cases where you want ORDER BY or rownum inside subselects. Which the solution to the original question needed. > When a standards committee becomes hostage to a handful of vendors, kiss > real standards goodbye. In the case of SQL was there ever any pretension otherwise? Was the SQL standard ever really useful as a "real standard"? I can write useful ANSI C89 code that will compile and work on any C compiler. Trying to write portable SQL92 code that does any useful work is about as productive as stapling bagels to your forehead. -- greg
On Fri, 2004-04-09 at 18:43, Greg Stark wrote: > Josh Berkus <josh@agliodbs.com> writes: > > > Rod, > > > > > Something along the lines of the below would accomplish what you want > > > according to spec. ROW_NUMBER() is a spec defined function. (6.10 of > > > SQL200N) > > > > Great leaping little gods! They added something called "row number" to the > > spec? > > > > Boy howdy, folks were right ... the ANSI committee really has completly blown > > off the relational model completely. > > If it's like Oracle's rownum then it's the row number of the *output*, not the > position on disk. So it's not entirely blowing off the relational model any > more than ORDER BY does. > > The weird thing is the number of cases where you want ORDER BY or rownum > inside subselects. Which the solution to the original question needed. It's not really like Oracles row num at all, though I suppose you can emulate rownum using it. The intention is that you will use it for "aggregates" like running totals, moving averages, counting, etc. http://www.devx.com/getHelpOn/10MinuteSolution/16573/1954?pf=true
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> How would you go about getting the top N (say, the top 10) for each query?
Assume you have a table "ch" and three sequences 'aa', 'bb', and 'cc'.
(Only 'aa' and 'bb' need to be initially set)
SELECT setval('aa',1,'f');
SELECT setval('bb',1,'f');
SELECT nextval('cc') AS rating,q2 AS query, s2 AS score FROM
(SELECT 0 AS q1, 0 AS s1, NULL AS cs, nextval('aa') AS v1UNION ALL(SELECT *, nextval('aa') AS v1 FROM (SELECT query AS
q1,MAX(score) AS s1, checksum AS cs FROM ch GROUP BY 1,3 ORDER BY 1 ASC, 2 DESC) AS foo)
 
) AS uno,
((SELECT *, nextval('bb') AS v2 FROM (SELECT query AS q2, MAX(score) AS s2, checksum AS cs FROM ch GROUP BY 1,3 ORDER
BY1 ASC, 2 DESC) AS foo)UNION ALLSELECT NULL AS q2, 0 AS s2, NULL AS cs, nextval('bb') AS v2
 
) AS dos
WHERE v1 = v2 AND q2 IS NOT NULL
AND ((CASE WHEN q1 != q2 THEN setval('cc',1,'f') ELSE 0 END > 0)OR(CASE WHEN currval('cc')<10 THEN 1 ELSE 0 END >0)
);
- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200404101029
-----BEGIN PGP SIGNATURE-----
iD8DBQFAeAZ1vJuQZxSWSsgRAqYuAJ9HaYLotPYkyi1U76I9xnvi8AhLTQCfUyJq
+iVdbz5U7HKep89z0kp49U0=
=6+OH
-----END PGP SIGNATURE-----
			
		Rod, Greg > It's not really like Oracles row num at all, though I suppose you can > emulate rownum using it. The intention is that you will use it for > "aggregates" like running totals, moving averages, counting, etc. Yes, that makes a certain amount of sense. I just take exception to the name "Row Number" becuase it confuses newbies about the actual nature of the data being returned and gets them back in the bad space of believing in fixed data ordering -- something which RDBMSes are supposed to avoid. I'm also disgusted at the vendor partiality that this shows. PostgreSQL and MySQL both have working implementations of a very similar concept using the non-confusing term "Limit". However, since Oracle is on the committee, they had to use a more "Oracle-friendly" term, I guess. > In the case of SQL was there ever any pretension otherwise? Was the SQL > standard ever really useful as a "real standard"? I can write useful ANSI > C89 code that will compile and work on any C compiler. Trying to write > portable SQL92 code that does any useful work is about as productive as > stapling bagels to your forehead. Well, there *was* a pretense otherwise, which ended about 1994, just as we were getting SQL on this project. Now the big vendors -- mostly IBM and Oracle since Informix and Sybase are dying -- run everything and adapt the standard to what features their products already have. So, yes, SQL92 needed development and expansion. But we didn't need SQL99. -- Josh Berkus Aglio Database Solutions San Francisco
On Fri, 09 Apr 2004 02:11:44 -0400, Tom Lane wrote: >> ROW_NUMBER() is a spec defined function. (6.10 of SQL200N) > > If the spec doesn't even have a year number yet, you can hardly expect > real implementations to support it ;-) SQL:2003 is finished. Among its new (non-core) OLAP features are a set of "windows functions" (spec section 6.10), which include feature ID T611 (elementary OLAP):- ROW_NUMBER() OVER (...)- RANK() OVER (...)- DENSE_RANK() OVER (...) feature ID T612 (extended OLAP):- PERCENT_RANK() OVER (...)- CUME_DIST() OVER (...) See http://www.acm.org/sigmod/record/issues/0403/index.html#standards for an article which summarizes the news in SQL:2003. ROW_NUMBER() OVER may be used in queries where a PostgreSQL user which use LIMIT. RANK() OVER may be used in queries where a PostgreSQL user would have to come up with a somewhat strange query in order to get acceptable performance, see http://troels.arvin.dk/db/rdbms/#select-top-n-postgresql -- Greetings from Troels Arvin, Copenhagen, Denmark
On Sun, 11 Apr 2004 00:13:17 +0200, I wrote: > Among its new (non-core) OLAP features are a set of > "windows functions" Sorry - I meant "window functions"... (Microsoft don't seem to have had much influence in SQL:2003's OLAP-specifications; IBM seems to be the big influencer in those parts of the standard.) -- Greetings from Troels Arvin, Copenhagen, Denmark
Welcome to the real world, Josh. There are people who have full time salaried positions soley to attend standards meetings. Note that ROW_NUMBER() really is handy, regardless of the silly name. And there was a little python function of mine that did it fairly simply, except that you needed to initialize the counter in the connection before use. create or replace function pycounter(integer) returns integer as ' if args[0] == 0: SD["nextno"] = 1 return SD["nextno"] try: SD["nextno"] += 1 except: SD["nextno"] =1 return SD["nextno"] ' language 'plpythonu'; And clearly it can be done faster as a little C function. elein On Fri, Apr 09, 2004 at 09:06:39AM -0700, Josh Berkus wrote: > Rod, > > > Something along the lines of the below would accomplish what you want > > according to spec. ROW_NUMBER() is a spec defined function. (6.10 of > > SQL200N) > > Great leaping little gods! They added something called "row number" to the > spec? > > Boy howdy, folks were right ... the ANSI committee really has completly blown > off the relational model completely. First there was the addition of > network-database functions so that IBM could make DB2 look more like a real > database, now this .... > > When a standards committee becomes hostage to a handful of vendors, kiss real > standards goodbye. > > -- > Josh Berkus > Aglio Database Solutions > San Francisco > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend
This solution will be in Monday's edition of PostgreSQL General Bits (http://www.varlena.com/GeneralBits). (In other words, if it doesn't do what you mean, let me know now!) CREATE TYPE topscores AS (id integer, query integer, checksum char(32), score integer); CREATE OR REPLACE FUNCTION topscores(integer) RETURNS SETOF topscores AS ' DECLARE t topscores%ROWTYPE; r RECORD; q RECORD; n alias for $1; BEGIN FOR q IN SELECT distinct query from table70 order by query LOOP FOR t IN SELECT id , query, checksum, score FROM table70 where query = q.query ORDER BY query, score DESC LIMIT n LOOP RETURN NEXT t; END LOOP; END LOOP; RETURN; END; ' language 'plpgsql'; select * from topscores(1) ; select * from topscores(2) ; select * from topscores(3) ; On Thu, Apr 08, 2004 at 07:55:33PM +0000, Jeff Boes wrote: > Offered up for anyone with time on their hands. I fiddled around with > this for half an afternoon, then gave up and did it programmatically in > Perl. > > Given a table that looks something like this: > > id | INTEGER > query | INTEGER > checksum | char(32) > score | INTEGER > include | BOOLEAN > > > The table is unique by "id". "Checksum" may be repeated, but I only care > if it is repeated within a given group by "query". ("query" is non-null.) > > I can get the top scorer for each "query" row by something like this: > > SELECT * FROM ( > SELECT DISTINCT ON (checksum) * > FROM my_table > ORDER BY checksum, score DESC) > ORDER BY query; > > How would you go about getting the top N (say, the top 10) for each query? > > And then, if that's too easy for you--consider a further case where I > want every row for a given "query" that has "include" TRUE, and enough > non-"include" rows to make N. I might end up with more than N rows for a > given value of "query" if there were more than N with "include" set. > > I headed off in the direction of groups of SELECTs and UNIONs, and quit > when I got to something like four levels of "SELECT ... AS FOO" ... > > -- > Jeff Boes vox 269.226.9550 ext 24 > Database Engineer fax 269.349.9076 > Nexcerpt, Inc. http://www.nexcerpt.com > ...Nexcerpt... Extend your Expertise > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
elein <elein@varlena.com> writes: > create or replace function pycounter(integer) > returns integer as > ' > if args[0] == 0: > SD["nextno"] = 1 > return SD["nextno"] > try: > SD["nextno"] += 1 > except: > SD["nextno"] = 1 > return SD["nextno"] > ' language 'plpythonu'; > > And clearly it can be done faster as a little > C function. Does this approach have a hope of working if it's used twice in the same query? -- greg
No, it will not work twice in the same query as is. If you want to code two counter buckets and pass in some way to distinguish between the two yada yada yada it is possible. It is also possible to code this to do multi-level counting/breaks/calculations, etc. But the SD dictionary is by connection. So any values stored in it need to be initialized at the appropriate time *outside* of the first use. elein On Sun, Apr 11, 2004 at 12:38:20AM -0400, Greg Stark wrote: > > elein <elein@varlena.com> writes: > > > create or replace function pycounter(integer) > > returns integer as > > ' > > if args[0] == 0: > > SD["nextno"] = 1 > > return SD["nextno"] > > try: > > SD["nextno"] += 1 > > except: > > SD["nextno"] = 1 > > return SD["nextno"] > > ' language 'plpythonu'; > > > > And clearly it can be done faster as a little > > C function. > > Does this approach have a hope of working if it's used twice in the same > query? > > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match
Troels Arvin wrote: > > See http://www.acm.org/sigmod/record/issues/0403/index.html#standards for > an article which summarizes the news in SQL:2003. > This is a very useful page; thank you for creating it and for noting it in this thread! -- (Posted from an account used as a SPAM dump. If you really want to get in touch with me, dump the 'jboes' and substitute 'mur'.) ________ Jeffery Boes <>< jboes@qtm.net