Re: [HACKERS] Sum aggregate calculation for single precsion real

Поиск
Список
Период
Сортировка
От Konstantin Knizhnik
Тема Re: [HACKERS] Sum aggregate calculation for single precsion real
Дата
Msg-id 903a8b6f-8b96-1cfa-e8a3-4656a87eb89c@postgrespro.ru
обсуждение исходный текст
Ответ на Re: [HACKERS] Sum aggregate calculation for single precsion real  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Список pgsql-hackers


On 14.02.2017 16:59, Jim Nasby wrote:
On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:
It is not true - please notice query execution time of this two queries:

I bet you'd get even less difference if you simply cast to float8 instead of adding 0.0. Same result, no floating point addition.

The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.

From my point of your it is strange and wrong expectation.
I am choosing "float4" type for a column just because it is enough to
represent range of data I have and I need to minimize size of record.

In other words, you've decided to trade accuracy for performance...

Could not agree with it...
1. If I choose float4 type to store bid price (which usually has 5-6 significant digits) - I do not loose precision and accuracy is not suffered.
The accuracy is important when I am calculating sum of prices. But here the assumption that accuracy of sum calculation should depend on type of summed field
is non obvious. May be it is more or less clear for C programmers but not for SQL users.
In all database I have tested SUM  of single precision floats is calculated at least using double precision numbers (or using numeric type).

2. There is no huge gap in performance between accumulating  in float4 and float8. There are no "orders of magnitude":
postgres=# select sum(l_quantity) from lineitem_projection;
     sum    
-------------
 1.07374e+09
(1 row)

Time: 4659.509 ms (00:04.660)

postgres=# select sum(l_quantity::float8) from lineitem_projection;
    sum    
------------
 1529738036
(1 row)

Time: 5465.320 ms (00:05.465)


So do not think that there is actually compromise here between performance and accuracy.
But current implementation cause leads to many confusions and contradictions with users expectations:

1. The fact that sum(l_quantity) and sum(l_quantity::float8) are absolutely different (1.5 times!!! - we loose 0.5 milliard dollars:)
2. avg(l_quantity)*count(l_quantity) is not equal to sum(l_quantity)    But in case of casting to float8 result is the same.
3. sum of aggregates for groups is not equal to total sum (once again no problem for float8 type_/

But when I am calculating sum, I expect to receive more or less precise
result. Certainly I realize that even in case of using double it is

... but now you want to trade performance for accuracy? Why would you expect the database to magically come to that conclusion?

Se above. No trading here. Please notice that current Postgres implementation of AVG aggregates calculates at sum and sum of squares even if last one is not needed for AVG.
The comment in the code says:

 * It might seem attractive to optimize this by having multiple accumulator
 * functions that only calculate the sums actually needed.  But on most
 * modern machines, a couple of extra floating-point multiplies will be
 * insignificant compared to the other per-tuple overhead, so I've chosen
 * to minimize code space instead.

And it is true!
In the addition to the results above I can add AVG timing for AVG calculation:

postgres=# select avg(l_quantity) from lineitem_projection;
       avg       
------------------
 25.5015621964919
(1 row)

postgres=# select avg(l_quantity::float8) from lineitem_projection;
       avg       
------------------
 25.5015621964919
(1 row)

Please notice that avg for float is calculated using float4_accum which use float8 accumulator and also calculates sumX2!


Time: 6103.807 ms (00:06.104)



So I do not see reasonable arguments here for using float4pl for sum(float4)!
And I do not know any database which has such strange behavior.
I know that "be as others" or especially "be as Oracle" are never good argument for Postgres community but doing something differently (and IMHO  wrong) without any significant reasons seems to be very strange.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Ideriha, Takeshi"
Дата:
Сообщение: Re: [HACKERS] GUC for cleanup indexes threshold.
Следующее
От: "Seki, Eiji"
Дата:
Сообщение: Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitraryvacuum flags