Обсуждение: bug on aggregate function AVG()
prova=> select * from a; int_2| int_4| int_8 -----+----------+------------------- 32767|2147483647|9223372036620802086 32767|2147483647|9223372036620802086 (2 rows) prova=> select avg(int_2), avg(int_4), avg(int_8) from a; avg|avg| avg ---+---+-----------1| -1|-233973722 (1 row) Jose'
Jose' Soares wrote: > > prova=> select * from a; > int_2| int_4| int_8 > -----+----------+------------------- > 32767|2147483647|9223372036620802086 > 32767|2147483647|9223372036620802086 > (2 rows) > > prova=> select avg(int_2), avg(int_4), avg(int_8) from a; > avg|avg| avg > ---+---+---------- > -1| -1|-233973722 > (1 row) > > Jose' The same on SUM(): prova=> select sum(int_2), sum(int_4), sum(int_8) from a; sum|sum| sum ---+---+-----------2| -2|-467947444 (1 row) but it works as follow: prova=> select avg(int_2*1.0), avg(int_4*1.0), avg(int_8*1.0) from a; avg| avg|avg -----+----------+------------------- 32767|2147483647|9.2233720366208e+18 (1 row) prova=> select sum(int_2*1.0), sum(int_4*1.0), sum(int_8*1.0) from a; sum| sum|sum -----+----------+-------------------- 65534|4294967294|1.84467440732416e+19 (1 row) Jose'
> > prova=> select avg(int_2), avg(int_4), avg(int_8) from a; > The same on SUM(): Sure. For some reason, on most platforms integers are allowed to overflow in Postgres. Of course, both SUM() and AVG() take a running sum, and once they overflow you are hosed... - Tom
Hi Tom, mercoledì, 4 novembre 98, you wrote: >> > prova=> select avg(int_2), avg(int_4), avg(int_8) from a; >> The same on SUM(): TGL> Sure. For some reason, on most platforms integers are allowed to TGL> overflow in Postgres. Of course, both SUM() and AVG() take a running TGL> sum, and once they overflow you are hosed... ^^^^^ I don't know what does the word "hosed" mean Tom, I hope you don't want to tell me there's no solution for this problem. I see that AVG() and SUM() uses an accumulator not enough big to hold the result of calculation, but the point is: should we consider this thing a "terrible" bug or an acceptable feature ? What about to convert every accumulator to float8 ? select intero4 from a; intero4 ---------- 2147483647 2147483647 2147483647 (3 rows) select sum(inter04),sum(intero4*1.0) from a; sum| sum ----------+---------- 2147483645|6442450941 (1 row) select avg(intero4),avg(intero4*1.0) from a; avg| avg ---------+---------- 715827881|2147483647 (1 row) Anyway I think we need to work a little bit on aggregates: MIN() and MAX() doesn't accept a string as parameter. SUM() and AVG() gives a wrong result because it goes on overflow. and none of them allows the clause DISTINCT. What do you think about ? ;) Jose'
> I don't know what does the word "hosed" mean Tom, I hope you don't > want to tell me there's no solution for this problem. As you guessed, "hosed" isn't good ;) > I see that AVG() and SUM() uses an accumulator not enough big to hold > the result of calculation, but the point is: should we consider this > thing a "terrible" bug or an acceptable feature ? > What about to convert every accumulator to float8 ? imho we can't do that because we lose the exact qualities of integers. If you accumulate in float8, and if you take a sum over a very large table, you might start ignoring values. That is, if you have accumulated 15 or 16 digits worth of number, and then try adding 1 as the next number, the result will be the same as the input. With integers that is never the case, but we have to deal with overflows better. I would think we should start signalling overflows rather than silently overflowing, but I'm not sure what that entails. > Anyway I think we need to work a little bit on aggregates: > MIN() and MAX() doesn't accept a string as parameter. Yes, at the moment only numeric quantities are supported. > SUM() and AVG() gives a wrong result because it goes on overflow. > and none of them allows the clause DISTINCT. Yes, SELECT SUM(DISTINCT i) FROM t; is not yet supported. That's a project for v6.5. btw, I'm also planning on working on your "NULL problem" you mentioned earlier... - Tom
"Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes: >> I see that AVG() and SUM() uses an accumulator not enough big to hold >> the result of calculation, but the point is: should we consider this >> thing a "terrible" bug or an acceptable feature ? >> What about to convert every accumulator to float8 ? > imho we can't do that because we lose the exact qualities of integers. > If you accumulate in float8, and if you take a sum over a very large > table, you might start ignoring values. I think that SUM() on an int column ought to produce an exact result. AVG() is a different story --- I think you could make a good case that it ought to produce a float result even when the input is integers, since the exact right answer would typically not be integral anyway. (A programmer who wants the average rounded to integer should have to write something like ROUND(AVG(x)), I think.) One way you could postpone the overflow problem for SUM() is to accumulate the running sum in a "long", or even better "long long" where available, even if the input datatype is a smaller flavor of int. You might still find that the end result overflows, but if the incoming values are not all the same sign then this might avoid an unnecessary intermediate overflow. regards, tom lane
TL> "Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes: >>> I see that AVG() and SUM() uses an accumulator not enough big to hold >>> the result of calculation, but the point is: should we consider this >>> thing a "terrible" bug or an acceptable feature ? >>> What about to convert every accumulator to float8 ? >> imho we can't do that because we lose the exact qualities of integers. >> If you accumulate in float8, and if you take a sum over a very large >> table, you might start ignoring values. TL> I think that SUM() on an int column ought to produce an exact result. TL> AVG() is a different story --- I think you could make a good case that TL> it ought to produce a float result even when the input is integers, TL> since the exact right answer would typically not be integral anyway. TL> (A programmer who wants the average rounded to integer should have to TL> write something like ROUND(AVG(x)), I think.) TL> One way you could postpone the overflow problem for SUM() is to TL> accumulate the running sum in a "long", or even better "long long" where TL> available, even if the input datatype is a smaller flavor of int. TL> You might still find that the end result overflows, but if the incoming TL> values are not all the same sign then this might avoid an unnecessary TL> intermediate overflow. TL> regards, tom lane I see there are four new built in functions: int42pl,int42div, int84pl and int84div. I created four new aggregate functions SUM(int2), SUM(int4), AVG(int2) and AVG(int4) using these new functions, in this way the accumulator for int2 is int4 and the accumulator for int4 is int8, this reduce the probabilities of overflow cases. Remains just the overflow problem for int8. Is there any reason for not use these functions on SUM() and AVG() on official release ? ------------------------------------------------ drop aggregate sum2 int2; DROP create aggregate sum2( sfunc1 = int42pl, basetype = int2, stype1 = int4, initcond1 = '0' ); CREATE drop aggregate avg2 int2; DROP create aggregate avg2( sfunc1 = int42pl, basetype = int2, stype1 = int4, initcond1 = '0', sfunc2 = int2inc, stype2 = int2, finalfunc = int42div, initcond2 = '0' ); CREATE drop aggregate sum4 int4; DROP create aggregate sum4( sfunc1 = int84pl, basetype = int4, stype1 = int8, initcond1 = '0' ); CREATE drop aggregate avg4 int4; DROP create aggregate avg4( sfunc1 = int84pl, basetype = int4, stype1 = int8, initcond1 = '0', sfunc2 = int4inc, stype2 = int4, finalfunc = int84div, initcond2 = '0' ); CREATE drop table b; DROP create table b( i2 int2, i4 int4, i8 int8 ); CREATE insert into b values (32767,2147483647,9223372036620802086); NOTICE: Integer input '9223372036620802086' is out of range; promoted to float ^^^^^^^^^^^^^^^^^^^^ what's happen here? seems a test for an int4 not an int8. INSERT 246255 1 insert into b values (32767,2147483647,2147483647); INSERT 246256 1 select * from b; i2| i4| i8 -----+----------+------------------- 32767|2147483647|9223372036620802048 32767|2147483647| 2147483647 (2 rows) select sum2(i2) as new_sum2, sum(i2) as old_sum2, sum4(i4) as new_sum4, sum(i4) as old_sum4, sum(i8)as only_sum8 from b; new_sum2|old_sum2| new_sum4|old_sum4| only_sum8 --------+--------+----------+--------+-------------------- 65534| -2|4294967294| -2|-9223372034941265921 (1 row) select avg2(i2) as new_avg2, avg(i2) as old_avg2, avg4(i4) as new_avg4, avg(i4) as old_avg4, avg(i8)as only_sum8 from b; new_avg2|old_avg2| new_avg4|old_avg4| only_sum8 --------+--------+----------+--------+-------------------- 32767| -1|2147483647| -1|-4611686017470632960 (1 row) Is there any reason for not use these functions on SUM() and AVG() on official release ? - Jose' -
> Is there any reason for not use these functions on SUM() and AVG() on > official release ? It sounds like a good idea. The only hesitation I have at the moment is that not all platforms have int8 support, and I'm not certain which these are. Also, accumulating int4 into int8 is probably pretty slow since on 32-bit machines the "long long" is usually done in a s/w library, not in machine code. float8 might be a better choice for accumulating AVG(), but I'm worried about incorrect results with large tables (> 1M entries) which have pathological distributions of numbers (e.g. 1M entries with MAXINT and 1M entries with zero). int4 gives ~9.2 decimal places, float8 gives ~15 decimal places, so there is only about ~6 decimal places of headroom. Of course, why am I worried? That is much better than what we have currently. And someone reported that at least one commercial system (Sybase?) returns float8 for avg() (and sum()?) as I recall. So, your suggestion is that for AVG() at least we return something other than the input type; how about returning float8 for any input type? Don't know if SUM() could/should behave similarly... - Tom
> MIN() and MAX() doesn't accept a string as parameter. I've added capabilities to do min() and max() on strings. To help with this I've put the new implicit type coersion techniques into the aggregate function handling. So, for example, the same routine which handles min(text) will also handle min(char()) and min(varchar()). These changes are already in the main cvs tree, but are not in the v6.4.x branch since they require a small number of catalog changes to implement. Will post the patches on the patches list in case anyone needs them before v6.5 comes out... - Tom