Обсуждение: Strange results when casting string to double

Поиск
Список
Период
Сортировка

Strange results when casting string to double

От
Carsten Klein
Дата:
Hi there,

I'm using several (now unsupported) PostgreSQL 9.3.24 servers on 
different (ancient) Ubuntu 14.04 LTS machines. On only one of those 
servers, I get strange/wrong results when converting a string into a 
double value:

SELECT 1.56::double precision;

--> 1.55999999999999   (wrong!)

Although I do not find any differences in configuration, on all other 
servers the result looks like this (correct?):

SELECT 1.56::double precision;

--> 1.56               (correct!)

AFAIK, this conversion is done by internal function float8in, which, 
when called directly, yields the same results:

SELECT float8in('1.56');

--> 1.55999999999999   (wrong!)   on one server, and
--> 1.56               (correct!) on all other servers.

Option extra_float_digits is zero (0) while doing all these tests. Also, 
the problem seems to occur while converting text to double precision and 
not when displaying the obtained double precision value. Why? The binary 
representation of the double precision value is also different.

I've created a small to_bit function in Python to get the double 
precision value's binary representation:


CREATE OR REPLACE FUNCTION to_bit(value double precision)
   RETURNS bit AS
$BODY$
     if 'fn.to_bit_d64' in SD:
         return SD['fn.to_bit_d64'](value)

     import struct
     def to_bit_d64(value):
         return ''.join('{:0>8b}'.format(c) for c in struct.pack('!d', 
value))

     SD['fn.to_bit_d64'] = to_bit_d64
     return SD['fn.to_bit_d64'](value)
  $BODY$
   LANGUAGE plpython3u IMMUTABLE STRICT
   COST 100;


The fraction (mantissa) of both values is different by 1:

value            fraction
1.55999999999999 1000111101011100001010001111010111000010100011110101
1.56             1000111101011100001010001111010111000010100011110110

The fraction of the probably wrong value is one less than the fraction 
of the correct value.

Formatting both values with 20 digits right of the decimal separator 
(like printf("%.20f" ...) yields:

1.55999999999999983124 (wrong!)
1.56000000000000005329 (correct!)

Since even calling function float8in directly returns a probably wrong 
result on one of the servers makes me believe, that there's no custom 
cast in place being responsible for the wrong results.

Function float8in basically relies on C library function

double strtod(const char *str, char **endptr)

which I tested with a tiny C programm (calling strtod only and printing 
the result with printf("%.20f", val);). The result is

1.56000000000000005329 (correct!)

on every server. So, seems like the C library function works as expected 
on all servers.

Although I'm not a C expert, I don't find anything suspicious that 
function float8in does with the value returned from strtod.

In version 9.3.24, file /src/backend/utils/adt/float.c looks a bit 
different from the file in master branch. However, basically both 
versions do much the same things. The old 9.3.24 version does some more 
special error checks (#ifdef HAVE_BUGGY_IRIX_STRTOD, #ifdef 
HAVE_BUGGY_SOLARIS_STRTOD and CHECKFLOATVAL), but these either throw 
errors or set the converted value to return to a special value (if 
indicated).

Has anyone an idea of what's going on here? I know, this version is far 
from still being supported, however, there's not much real changes in 
file float.c between these versions (in other words, this may happen 
with recent versions as well?). The database instances on all servers 
are configured quite the same (more or less). All run with the same 
extensions installed; none is using any preloaded libraries (which may 
replace C library function strtod?).

-- 
Carsten Klein
c(dot)klein(@)datagis(dot)com




Re: Strange results when casting string to double

От
Adrian Klaver
Дата:
On 2/16/22 05:27, Carsten Klein wrote:
> Hi there,
> 
> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on 
> different (ancient) Ubuntu 14.04 LTS machines. On only one of those 
> servers, I get strange/wrong results when converting a string into a 
> double value:
> 

> Has anyone an idea of what's going on here? I know, this version is far 
> from still being supported, however, there's not much real changes in 
> file float.c between these versions (in other words, this may happen 
> with recent versions as well?). The database instances on all servers 
> are configured quite the same (more or less). 

What is the more or less for the problem server?

Are the hardware(CPU) architectures the same for all the servers?

If not how does the problem server differ?

Was Postgres installed from the same source/same way on all the server?


> All run with the same 
> extensions installed; none is using any preloaded libraries (which may 
> replace C library function strtod?).
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Strange results when casting string to double

От
Carsten Klein
Дата:

On Wed, Feb 16, 2022 at 05:46 PM Adrian Klaver wrote
> On 2/16/22 05:27, Carsten Klein wrote:
>> Hi there,
>>
>> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on 
>> different (ancient) Ubuntu 14.04 LTS machines. On only one of those 
>> servers, I get strange/wrong results when converting a string into a 
>> double value:
>>
> 
>> Has anyone an idea of what's going on here? I know, this version is 
>> far from still being supported, however, there's not much real changes 
>> in file float.c between these versions (in other words, this may 
>> happen with recent versions as well?). The database instances on all 
>> servers are configured quite the same (more or less). 


> What is the more or less for the problem server?

What? Didn't get that...


> Are the hardware(CPU) architectures the same for all the servers?

The problem server is a virtual machine (VMware). I've tested this on 
two other servers and do receive probably correct results. One of these 
is also a virtual machine (same VMware, running on different hardware), 
the other one is a physical box with an AMD Athlon(tm) II X4 640 64-bit 
processor.

Customer site:

Production System:  VMware (ESX Host A)   works as expected

Testing System:     VMware (ESX Host B)   DOES NOT WORK as expected


Our site:

Development System: AMD Athlon(tm) II     works as expected


However, since the strtod function works correctly from my tiny C 
program on all these machines, I don't believe that this problem has to 
do with hardware or architecture. I guess, PostgreSQL dynamically links 
to these C library functions, right?


> If not how does the problem server differ?

I certainly have no idea. Actually, I don't see any more relevant 
differences.


> Was Postgres installed from the same source/same way on all the server?

PostgreSQL was installed from Ubuntu's official repositories (14.04 LTS) 
on all servers.

-- 
Carsten Klein
c(dot)klein(@)datagis(dot)com



Re: Strange results when casting string to double

От
"David G. Johnston"
Дата:
On Wed, Feb 16, 2022 at 11:38 AM Carsten Klein <c.klein@datagis.com> wrote:


On Wed, Feb 16, 2022 at 05:46 PM Adrian Klaver wrote
> On 2/16/22 05:27, Carsten Klein wrote:
>> Hi there,
>>
>> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on
>> different (ancient) Ubuntu 14.04 LTS machines. On only one of those
>> servers, I get strange/wrong results when converting a string into a
>> double value:
>>
>
>> Has anyone an idea of what's going on here? I know, this version is
>> far from still being supported, however, there's not much real changes
>> in file float.c between these versions (in other words, this may
>> happen with recent versions as well?). The database instances on all
>> servers are configured quite the same (more or less).


> What is the more or less for the problem server?

What? Didn't get that...

You said they are more or less the same.  Problems like these tend to hide in the "less" portion of the inequality.
 
Testing System:     VMware (ESX Host B)   DOES NOT WORK as expected


Given that this isn't working as expected it doesn't make for a great testing system.  Install and initdb 14.2 on this machine and let's see what PostgreSQL produces.


> If not how does the problem server differ?

I certainly have no idea. Actually, I don't see any more relevant
differences.

What is the precise version of libc that is installed for one.  Exact ESX releases too.

This isn't really all that interesting a report for the project if it only exists in one ancient system that cannot be experimented with.  Maybe it's a faulty register on that machine's CPU.  There is more double-checking and comparing that can be done here but it seems unlikely to be productive.  It is more plausible that the snowflake machine in question just has issues and needs to be retired.  Installing a newer version of PostgreSQL on it before junking it is about the right amount of experimental effort.

David J.

Re: Strange results when casting string to double

От
Carsten Klein
Дата:

On Wed, Feb 16, 2022 at 08:11 PM David G. Johnston wrote:

> You said they are more or less the same.  Problems like these tend to 
> hide in the "less" portion of the inequality.

On of the virtualized servers was created as a clone of the other one 
(using VMware to clone the VM). So, basically, these are very equal. Of 
course, they diverged over time.

Focusing on PostgreSQL, here are the differences of postgresql.conf, 
comparing testing system and production system:

< work_mem = 8MB                                # min 64kB
---
 > work_mem = 4MB                                # min 64kB
417c417
< #log_statement = 'all'                        # none, ddl, mod, all
---
 > #log_statement = 'none'                       # none, ddl, mod, all

Both PostgreSQL server have the same roles and users, that same 
extensions installed and no preloaded libraries.

> Given that this isn't working as expected it doesn't make for a great 
> testing system.  Install and initdb 14.2 on this machine and let's see 
> what PostgreSQL produces.

The testing system runs since 2015. I don't know whether the problem was 
present from the beginning. But I don't think so, as we also have 
"correct" double values in that database. Now, since "binary equality" 
of the double precision values is a new requirement, we started to 
notice, that (at least not) newly added (UPDATEd) values, e. g. 1.56 are 
not binary equal to already present values:

Table abc, column xyz: currently (before) 1.56

UPDATE abc SET xyz = 1.56;

Table abc, column xyz: after 1.559999999

We have a trigger, that monitors such updates and it reports a changed 
value, which is not correct. The problem is, that the assignment

    SET xyz = 1.56

actually means

    SET xyz = 1.559999999

since getting a double value from the string 1.56 yields 1.559999999.

Yes, moving to the latest PostgreSQL version might fix that error. 
However, this is a customer's testing system. Actually, it is intended 
to be reinstalled with Ubuntu 22.04 LTS which brings PostgreSQL 14. But 
prior to that, we need to complete a project on the testing system that 
requires that "binary equality" of double values.

> What is the precise version of libc that is installed for one.  Exact 
> ESX releases too.

Both VM servers run on

ESXi 6   (correct behavior)
ESXi 6.5 (misbehaving)

All machines use libc version 2.19 (libc-2.19.so).

> This isn't really all that interesting a report for the project if it 
> only exists in one ancient system that cannot be experimented with.  
> Maybe it's a faulty register on that machine's CPU.  There is more 
> double-checking and comparing that can be done here but it seems 
> unlikely to be productive.  It is more plausible that the snowflake 
> machine in question just has issues and needs to be retired.  Installing 
> a newer version of PostgreSQL on it before junking it is about the right 
> amount of experimental effort.

I just wanted to ask whether someone knows something about this or has 
ever heard about such a behavior. You say, the snowflake machine has 
issues... I don't believe in hardware issues, since it runs in VMware 
and likely on many different CPUs. Isn't it more than unlikely that such 
a constantly occurring error is caused by one faulty CPU (among that 
many CPUs an ESX server typically has)?

And, keep in mind that strtod function works as expected from a simply C 
testing program. I guess that the parsed double's value gets modified 
somewhere in PostgreSQL after strtod was called. However, I do not yet 
see where and why. I was hoping that someone of you could help.

Carsten



Re: Strange results when casting string to double

От
Gavan Schneider
Дата:
On 17 Feb 2022, at 8:02, Carsten Klein wrote:

> We have a trigger, that monitors such updates and it reports a changed value, which is not correct. The problem is,
thatthe assignment 
>
> SET xyz = 1.56
>
> actually means
>
> SET xyz = 1.559999999
>
> since getting a double value from the string 1.56 yields 1.559999999.

<snip>

> I just wanted to ask whether someone knows something about this or has ever heard about such a behavior. You say, the
snowflakemachine has issues... I don't believe in hardware issues, since it runs in VMware and likely on many different
CPUs.Isn't it more than unlikely that such a constantly occurring error is caused by one faulty CPU (among that many
CPUsan ESX server typically has)? 
>
Harking back to my long distant formative years I was taught to never ever rely on equality tests when it came to
floatingpoint values. Nothing has changed in this regard. 

If binary exact is part of the requirement then the answer is INTEGER or NUMERIC never FLOAT, REAL, DOUBLE or anything
similar.

So, assuming the trigger function is the only source of this spurious grief one could apply the lesson taught in the
60sfrom the dawn of computing: 
    EQUALITY = absolute_value(op1 - op2)  < epsilon — where op1 & op2 are DOUBLE, and epsilon is smaller than you care
for

Given the intrinsic (standards compliant) uncertainty when converting from absolute (e.g., string representation) to
floatingpoint there will never be value resolving why there are differences. 

I suggest using the comparison that is appropriate to the representation of those values or fix the design by using the
properrepresentation. 

Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem —
neat,plausible, and wrong. 
— H. L. Mencken, 1920



Re: Strange results when casting string to double

От
Carsten Klein
Дата:


On Thu, Feb 17, 2022 at 00:07 Gavan Schneider wrote:

> Harking back to my long distant formative years I was taught to never ever rely on equality tests when it came to
floatingpoint values. Nothing has changed in this regard.
 
> 
> If binary exact is part of the requirement then the answer is INTEGER or NUMERIC never FLOAT, REAL, DOUBLE or
anythingsimilar.
 
> 
> So, assuming the trigger function is the only source of this spurious grief one could apply the lesson taught in the
60sfrom the dawn of computing:
 
>     EQUALITY = absolute_value(op1 - op2)  < epsilon — where op1 & op2 are DOUBLE, and epsilon is smaller than you
carefor
 
> 
> Given the intrinsic (standards compliant) uncertainty when converting from absolute (e.g., string representation) to
floatingpoint there will never be value resolving why there are differences.
 
> 
> I suggest using the comparison that is appropriate to the representation of those values or fix the design by using
theproper representation.
 

Just some(!) background:

I know that there are other ways to compare _any_ floating point values. 
However, doing that for a whole ROW in a fast manner is not trivial (e. 
g. it tends to get quite slow). With the hstore extension and (so 
called) binary equality I've found a very fast way which is also fully 
generic. hstore uses text representation, so comparisons depend on how 
values are converted to and from text.


But all that is not the point. Double precision conversion algorithms 
are well defined and deterministic, so it should yield the same results 
when called with the same arguments (every time and on every machine). 
That is

1.56::double precision == 1.56000000000000005329070518201E0

1.56::double precision != 1.55999999999999983124610025698E0

With reduced precision (as in PG), that gives you:

1.56::double precision == 1.56

1.56::double precision != 1.55999999999999

However, one of my ProstgreSQL servers returns the latter (wrong) value. 
You can test this with C library function 'strtod' or with any of the 
online converters around:

https://www.binaryconvert.com/result_double.html

Click 'New Conversion' and enter 1.56 into the 'Decimal' field. Then 
'Convert to binary' or hit enter.

So, the primary problem of that PostgreSQL server is, that it converts 
text to double in a wrong way. Apart from any triggers, "binary 
equality" and whatever else I'm doing in this project, this has dramatic 
effects on the database, as it's messing up the values that I'm storing:

Imagine I do:

INSERT INTO foo (my_col) VALUES ('Hello World');

But the database has happily stored a different string:

SELECT my_col FROM foo;
     my_col
-------------
  Hello Worlc
(1 row)

Finding that string again may be done with "fuzzy search" or regular 
expressions, but the much better approach is the database not to let 
mess up the string while storing it.

Double precision values are limited in precision (in binary) and there 
are numbers, that cannot be stored exactly (as it's true for many 
decimal numbers, like 1/3 ~ 0.33333 as well). Nevertheless, with a given 
maximum of precision, the same values should have the same (well 
defined) binary value so that conversions between text and double should 
not change the value at any time on any machine.

Carsten



Re: Strange results when casting string to double

От
Thomas Kellerer
Дата:
Carsten Klein schrieb am 16.02.2022 um 14:27:
> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on
> different (ancient) Ubuntu 14.04 LTS machines. On only one of those
> servers, I get strange/wrong results when converting a string into a
> double value:
>
> SELECT 1.56::double precision;
>
> --> 1.55999999999999   (wrong!)
>
> Although I do not find any differences in configuration, on all other servers the result looks like this (correct?):
>
> SELECT 1.56::double precision;
>
> --> 1.56               (correct!)
>

I wonder if extra_float_digits is different between those  systems
Maybe initialized by differently configured SQL clients.



Re: Strange results when casting string to double

От
Carsten Klein
Дата:
On Thu, Feb 17, 2022 at 09:41 AM Thomas Kellerer wrote:
> Carsten Klein schrieb am 16.02.2022 um 14:27:
Ah, man versteht sich :)
>> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on
>> different (ancient) Ubuntu 14.04 LTS machines. On only one of those
>> servers, I get strange/wrong results when converting a string into a
>> double value:
>>
>> SELECT 1.56::double precision;
>>
>> --> 1.55999999999999   (wrong!)
>>
>> Although I do not find any differences in configuration, on all other servers the result looks like this
(correct?):
>>
>> SELECT 1.56::double precision;
>>
>> --> 1.56               (correct!)
>>
> 
> I wonder if extra_float_digits is different between those  systems
> Maybe initialized by differently configured SQL clients.

As I've seen in the sources in file /src/backend/utils/adt/float.c, 
extra_float_digits is used when converting double precision values back 
to text only. The binary (BIT) representation if the double value's 
fraction (mantissa) tells me, that the actual double precision value is 
already wrong.

Also, extra_float_digits is constantly zero in my tests. I've only been 
using pgAdmin III for that and always ensured with

SELECT current_setting('extra_float_digits');

that it's zero.

Carsten




Re: Strange results when casting string to double

От
"Peter J. Holzer"
Дата:
On 2022-02-16 22:02:02 +0100, Carsten Klein wrote:
> On of the virtualized servers was created as a clone of the other one (using
> VMware to clone the VM). So, basically, these are very equal. Of course,
> they diverged over time.
>
> Focusing on PostgreSQL, here are the differences of postgresql.conf,
> comparing testing system and production system:

I don't think these explain the difference. I'd check whether the
postgresql binaries and all the the shared libraries are the same.

Or - since this only happens on the test system and not on the production
system - I'd just clone the production system again to create a new test
system and see if the problem happens there, too.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

Re: Strange results when casting string to double

От
Carsten Klein
Дата:

AOn Thu, Feb 17, 2022 at 10:27 AM Peter J. Holzer wrote
> 
> I don't think these explain the difference. I'd check whether the
> postgresql binaries and all the the shared libraries are the same.
> 
> Or - since this only happens on the test system and not on the production
> system - I'd just clone the production system again to create a new test
> system and see if the problem happens there, too.
> 

File postgresql and it's loaded libraries are identical. Same set of 
libraries loaded on all servers; all loaded library have identical MD5 sums.

It's the customer's VMware, so I can't just make another clone. However, 
I'm quite sure that it will work on a newly cloned testing system (since 
on production system everything is OK).

Carsten



Re: Strange results when casting string to double

От
Peter Eisentraut
Дата:
On 16.02.22 14:27, Carsten Klein wrote:
> AFAIK, this conversion is done by internal function float8in, which, 
> when called directly, yields the same results:
> 
> SELECT float8in('1.56');
> 
> --> 1.55999999999999   (wrong!)   on one server, and
> --> 1.56               (correct!) on all other servers.

float8in() really just calls the operating system's strtod() function. 
I would test that one directly with a small C program.

It's also possible that different compiler options lead to different 
optimizations.



Re: Strange results when casting string to double

От
Carsten Klein
Дата:


On 18.02.2022 13:28, Peter Eisentraut wrote:

> float8in() really just calls the operating system's strtod() function. I 
> would test that one directly with a small C program.
> 
> It's also possible that different compiler options lead to different 
> optimizations.

That's what I did. Here's my small C program: (nicht lachen *g*)

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {

     /* default string to convert */
     char buf[10] = "1.56\0";

     /* choose and print string to convert */
     char* sval = argc > 1 ? argv[1] : buf;
     printf("string value: %s\n", sval);

     /* convert and print */
     char* ptr;
     double dval = strtod(sval, &ptr);
     printf("double value: %.20f\n", dval);

     return 0;
}

It works correctly on all these servers. Here's its output:

string value: 1.56
double value: 1.56000000000000005329

I didn't test different compiler options. However, PostgreSQL was always 
installed from official Ubuntu 14.04 repositories (getting the binaries, 
not the source packages), so all binaries should have been compiled with 
the same options.

Carsten



Re: Strange results when casting string to double

От
Tom Lane
Дата:
Carsten Klein <c.klein@datagis.com> writes:
> On 18.02.2022 13:28, Peter Eisentraut wrote:
>> float8in() really just calls the operating system's strtod() function. I 
>> would test that one directly with a small C program.

> That's what I did. Here's my small C program: (nicht lachen *g*)

Yeah, you said that upthread, which makes the whole thing pretty
baffling.  One possible explanation is that your small program got linked
against a different version of libc than what the Postgres backend is
using ("ldd" would help you check that, but given the age of the Postgres
installation, this seems plausible).  Beyond that it's hard to think
of any explanation other than hardware fault or corrupted executable.

            regards, tom lane



Re: Strange results when casting string to double

От
Carsten Klein
Дата:


On 18.02.2022 16:32, Tom Lane wrote:
> Yeah, you said that upthread, which makes the whole thing pretty
> baffling.  One possible explanation is that your small program got linked
> against a different version of libc than what the Postgres backend is
> using ("ldd" would help you check that, but given the age of the Postgres
> installation, this seems plausible).  Beyond that it's hard to think
> of any explanation other than hardware fault or corrupted executable.

Tom,

both PostgreSQL and my C program are linked to the same libc.so.6. Same 
path, same MD5 sum. Since libc is a Shared Object (so), both processes 
should really run the identical code. Am I missing something? I've 
written and compiled the small C program on the same old Ubuntu OS.

So, you're not aware of any ways this behavior could be achieved from 
within PostgreSQL? Something like a custom cast (actually, there is 
none) or something that could intercept string to double conversion? 
That would be something to look at closer. The question is: how would 
you implement such an evil database wide text to double conversion (just 
to kid users) if you had to?

PostgreSQL is up for more than 480 days on that server. I'm thinking of 
giving a restart of the database a try. However, there's a long running 
import taking place, so this will not happen before mid or end of next week.

Regards, Carsten



Re: Strange results when casting string to double

От
Tom Lane
Дата:
Carsten Klein <c.klein@datagis.com> writes:
> So, you're not aware of any ways this behavior could be achieved from 
> within PostgreSQL? Something like a custom cast (actually, there is 
> none) or something that could intercept string to double conversion? 

Well, you asserted that these installations are all alike ... but
sure, it's conceivable that somebody could've replaced the default
numeric -> float8 cast (i.e., numeric_float8()) with some other code
that does it a bit inaccurately.  There's only a pg_cast entry
connecting that function to those types.  Have you tried stepping
through things with a debugger, to see if numeric_float8 and float8in
are actually reached on the problematic system?

> That would be something to look at closer. The question is: how would 
> you implement such an evil database wide text to double conversion (just 
> to kid users) if you had to?

At the SQL level, your example does not involve text -> double.
The literal 1.56 is of type numeric.

            regards, tom lane



Re: Strange results when casting string to double

От
"Peter J. Holzer"
Дата:
On 2022-02-18 18:07:35 +0100, Carsten Klein wrote:
> both PostgreSQL and my C program are linked to the same libc.so.6. Same
> path, same MD5 sum. Since libc is a Shared Object (so), both processes
> should really run the identical code. Am I missing something? I've written
> and compiled the small C program on the same old Ubuntu OS.
[...]
> PostgreSQL is up for more than 480 days on that server. I'm thinking of
> giving a restart of the database a try. However, there's a long running
> import taking place, so this will not happen before mid or end of next week.

If there has been a glibc update (or a postgresql update) in those 480
days (Ubuntu 14.04 is out of its normal maintenance period but ESM is
still available) the running processes may well run different code than
a newly started program. So it could be a bug which has since been
fixed.

Another idea: It could be the case that something (maybe a bug in
postgres, maybe an extension, maybe even a random bit flip in memory)
changed the FP rounding mode within the postgres process, which would
affect all FP computations until the rounding mode is reset. That would
have to have happened in the master to affect the worker processes as
consistently as you are seeing. I don't know if it is even possible for
a non-standard rounding mode to persist for any length of time, but if
it is it would certainly account for weird rounding errors.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

Re: Strange results when casting string to double

От
Tom Lane
Дата:
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> Another idea: It could be the case that something (maybe a bug in
> postgres, maybe an extension, maybe even a random bit flip in memory)
> changed the FP rounding mode within the postgres process, which would
> affect all FP computations until the rounding mode is reset. That would
> have to have happened in the master to affect the worker processes as
> consistently as you are seeing. I don't know if it is even possible for
> a non-standard rounding mode to persist for any length of time, but if
> it is it would certainly account for weird rounding errors.

Hmm, that is a pretty interesting theory.  On a RHEL8 box, I find
that fesetround(FE_DOWNWARD) causes strtod("1.56", NULL) to return

  1.55999999999999983124

rather than the usual

  1.56000000000000005329

which seems to square with Carsten's symptom.

Postgres itself contains no fesetround calls, but if you want
to believe a random bit flip changed that mode, maybe that'd
account for it.  It'd certainly be interesting to find out
whether the problem persists after a postmaster restart.

[ wanders away wondering if the troublesome machine has ECC
memory ... ]

            regards, tom lane



Re: Strange results when casting string to double

От
Tomas Pospisek
Дата:
On 18.02.22 22:42, Peter J. Holzer wrote:

> If there has been a glibc update (or a postgresql update) in those 480
> days (Ubuntu 14.04 is out of its normal maintenance period but ESM is
> still available) the running processes may well run different code than
> a newly started program. So it could be a bug which has since been
> fixed.

That would be visible via `lsof`. `libc. The file `...libc...so` that 
`postgres` is keeping open would have the text `DEL` (as in deleted) in 
the `FD` column of `lsof`'s output.

As opposed to a newly started program which would have `REG` (regular 
file) there.
*t



Re: Strange results when casting string to double

От
Merlin Moncure
Дата:
On Sat, Feb 19, 2022 at 8:35 AM Tomas Pospisek <tpo2@sourcepole.ch> wrote:
>
> On 18.02.22 22:42, Peter J. Holzer wrote:
>
> > If there has been a glibc update (or a postgresql update) in those 480
> > days (Ubuntu 14.04 is out of its normal maintenance period but ESM is
> > still available) the running processes may well run different code than
> > a newly started program. So it could be a bug which has since been
> > fixed.
>
> That would be visible via `lsof`. `libc. The file `...libc...so` that
> `postgres` is keeping open would have the text `DEL` (as in deleted) in
> the `FD` column of `lsof`'s output.
>
> As opposed to a newly started program which would have `REG` (regular
> file) there.

If this doesn't bear fruit, are there debugging symbols?  Setting a
breakpoint might produce some insight.

merlin



Re: Strange results when casting string to double

От
Carsten Klein
Дата:


On 19.02.2022 15:35, Tomas Pospisek wrote:

> That would be visible via `lsof`. `libc. The file `...libc...so` that 
> `postgres` is keeping open would have the text `DEL` (as in deleted) in 
> the `FD` column of `lsof`'s output.
> 
> As opposed to a newly started program which would have `REG` (regular 
> file) there.
> *t

Actually, PostgreSLQ uses the same libc than any recently started 
program. lsof reports REG and my /lib/x86_64-linux-gnu/libc-2.19.so has 
file date Mar 27, 2019, so it's much older than the point in time when 
PostgreSQL was started the last time (aprx. 480 days ago).

After all, I guess the idea of a wrong rounding setting (set to round 
down, raised by Peter J. Holzer) seems most plausible to me. In 
particular, since the "right" and "wrong" values caused by wrong 
rounding shown by Tom Lane are exactly the values I am seeing.

Also, there is a quite aggressive import script running for some days, 
which even tried to replace some of the core functions, like 
array_length (it added a pure PL/pgSQL version in public schema). Seems 
like they wanted to "Polyfill" some required functions. Maybe that 
script is responsible for changing the process' rounding mode? The 
customer started that script without thinking too much about it... :(

@Tom Lane: you say, PostgreSQL does not / cannot change rounding mode at 
any time? There is no function to do so? So, the script (it's more like 
a module) must provide a Shared Object module with a C function in order 
to change the FPU's rounding mode?

Because several people recommended using a debugger in order to see 
whats going on here: actually I have no expertise with debugging on 
Linux without an IDE. So I did not yet think of using a debugger so far.

I will try a restart of the DB ASAP. However, in order to prove the 
rounding mode thesis: someone knows whether fesetround(FE_DOWNWARD) just 
sets a flag in the C runtime environment or does this call actually set 
the rounding mode in the FPU's control word? Is there any chance to get 
that current rounding mode with or even without a debugger? (Maybe some 
file in /proc filesystem?)

I'm very unhappy with the customer starting that import script. In order 
to find the `smoking gun`, I will now have a closer look at the 
"product" that import is based on and what they have done to the database...

Regards, Carsten



Re: Strange results when casting string to double

От
Tom Lane
Дата:
Carsten Klein <c.klein@datagis.com> writes:
> @Tom Lane: you say, PostgreSQL does not / cannot change rounding mode at 
> any time? There is no function to do so? So, the script (it's more like 
> a module) must provide a Shared Object module with a C function in order 
> to change the FPU's rounding mode?

Per grep, there is no call of fesetround() in the Postgres source
tree.  I'm not sure offhand whether libc exposes any other APIs
that could change the rounding mode, but I am quite sure that we
wouldn't be intentionally changing it anywhere.

> I will try a restart of the DB ASAP. However, in order to prove the 
> rounding mode thesis: someone knows whether fesetround(FE_DOWNWARD) just 
> sets a flag in the C runtime environment or does this call actually set 
> the rounding mode in the FPU's control word? Is there any chance to get 
> that current rounding mode with or even without a debugger? (Maybe some 
> file in /proc filesystem?)

The OS would surely allow each process to have its own setting of the
rounding mode, so I doubt you can see it from outside.

Another point to keep in mind is that no matter how invasive that
import script might be, it's still hard to explain how it'd affect
the rounding mode in other backend processes.  You have to postulate
either that the rounding mode has been changed in the postmaster
process (and then inherited by session backends via fork()), or that
some code running at the time of child process creation changes the
mode, or that they replaced numeric_float8 with something else.

I think the only way that the postmaster's rounding mode could change
after postmaster start is the cosmic-ray hypothesis; while we do have
features that'd allow loading extra code into the postmaster, I'm
pretty sure they only take effect at postmaster start.  So even if
that import script tried to do that, it wouldn't have succeeded yet.

Of the other two hypotheses, "substitute numeric_float8" seems like
the most likely, especially given the other stuff you mentioned the
script doing.  Have you checked the relevant pg_cast entry to see
if it's been changed?  It'd also be interesting to see if the odd
rounding behavior happens in all databases of the cluster or just
one.

            regards, tom lane



Re: Strange results when casting string to double

От
Carsten Klein
Дата:

On 19.02.2022 20:34 Tom Lane wrote:

> Per grep, there is no call of fesetround() in the Postgres source
> tree.  I'm not sure offhand whether libc exposes any other APIs
> that could change the rounding mode, but I am quite sure that we
> wouldn't be intentionally changing it anywhere.
> 
> The OS would surely allow each process to have its own setting of the
> rounding mode, so I doubt you can see it from outside.
> 
> Another point to keep in mind is that no matter how invasive that
> import script might be, it's still hard to explain how it'd affect
> the rounding mode in other backend processes.  You have to postulate
> either that the rounding mode has been changed in the postmaster
> process (and then inherited by session backends via fork()), or that
> some code running at the time of child process creation changes the
> mode, or that they replaced numeric_float8 with something else.
> 
> I think the only way that the postmaster's rounding mode could change
> after postmaster start is the cosmic-ray hypothesis; while we do have
> features that'd allow loading extra code into the postmaster, I'm
> pretty sure they only take effect at postmaster start.  So even if
> that import script tried to do that, it wouldn't have succeeded yet.
> 
> Of the other two hypotheses, "substitute numeric_float8" seems like
> the most likely, especially given the other stuff you mentioned the
> script doing.  Have you checked the relevant pg_cast entry to see
> if it's been changed?  It'd also be interesting to see if the odd
> rounding behavior happens in all databases of the cluster or just
> one.

The script has finished!

After a restart of the database, everything works as expected again. 
Rounding as well as text/numeric to double precision works the same on 
all of my servers.

Prior to restarting, I've implemented my own Python based versions of both

int fegetround(void);

int fesetround(int rounding_mode integer);

Have a look a these:

CREATE OR REPLACE FUNCTION fegetround()
   RETURNS integer AS
$BODY$
     if 'fn.fegetround' in SD:
         return SD['fn.fegetround']()

     from ctypes import cdll
     from ctypes.util import find_library
     libm = cdll.LoadLibrary(find_library('m'))
     def fegetround():
         return libm.fegetround()

     SD['fn.fegetround'] = fegetround
     return SD['fn.fegetround']()
  $BODY$
   LANGUAGE plpython3u VOLATILE
   COST 100;

CREATE OR REPLACE FUNCTION fesetround(rounding_mode integer)
   RETURNS integer AS
$BODY$
     if 'fn.fesetround' in SD:
         return SD['fn.fesetround'](rounding_mode)

     from ctypes import cdll
     from ctypes.util import find_library
     libm = cdll.LoadLibrary(find_library('m'))
     def fesetround(rounding_mode):
         return libm.fesetround(rounding_mode)

     SD['fn.fesetround'] = fesetround
     return SD['fn.fesetround'](rounding_mode)
  $BODY$
   LANGUAGE plpython3u VOLATILE STRICT
   COST 100;

With those, I was able to proof, that actually the "wrong" rounding mode

FE_DOWNWARD (0x400)

was in effect for every new process/connection with all the described 
effects on casting from string or numeric to double precision:

SELECT 1.56::double precision
-> 1.55999999999999

Setting rounding mode to

FE_TONEAREST (0x0),

instantly lead back to the expected casting behavior:

SELECT 1.56::double precision
-> 1.56

Setting rounding mode after restarting the database is still possible, 
however, new sessions start off with the "correct" rounding mode 
FE_TONEAREST (0x0). So, the only thing that's really changed after the 
restart was, that the postmaster now has the "correct" rounding mode, 
which it promotes down when forking off child processes.

We'll likely never know, why ever the postmaster got tainted with that 
FE_DOWNWARD (0x400) rounding mode.

As Tom Lane said, no matter how aggressive the script could be, it can, 
if at all, only change its current session's rounding mode. So, maybe it 
actually was a random bit flip or a side effect caused by a quite rare 
error condition in postmaster.

Nearly the same is true for any core functions or casts hijacked by the 
script - these are only in effect for the database the script was ever 
connecting to. In my case, the script only used one database. However, 
the issue was present with any database.

Two official math functions to get and set the session's rounding mode 
provided by PostgreSQL could be a good add-on for any of the next 
versions of the database. Thinking about it again... maybe that's just 
too dangerous :-p

Finally, many thanks to all that supported me and came up with that many 
helpful ideas! :-)

Regards, Carsten



Re: Strange results when casting string to double

От
Tom Lane
Дата:
Carsten Klein <c.klein@datagis.com> writes:
> Prior to restarting, I've implemented my own Python based versions of both
> int fegetround(void);
> int fesetround(int rounding_mode integer);

Ah, good idea!

> With those, I was able to proof, that actually the "wrong" rounding mode
> FE_DOWNWARD (0x400)
> was in effect for every new process/connection with all the described 
> effects on casting from string or numeric to double precision:

Thanks for confirming that that was the source of the problem.

> We'll likely never know, why ever the postmaster got tainted with that 
> FE_DOWNWARD (0x400) rounding mode.

Indeed.  It's hard to see any other explanation than "random bit flip"
though.  The postmaster is designed to run only a very small amount of
code, almost none of which is user-controllable.  Even if there were
somewhere some code that intended to change the rounding mode, explaining
how the postmaster got to that without crashing is a tall order.

> Two official math functions to get and set the session's rounding mode 
> provided by PostgreSQL could be a good add-on for any of the next 
> versions of the database. Thinking about it again... maybe that's just 
> too dangerous :-p

No, we'd absolutely not ever provide a supported function to change
rounding modes.  That would require downgrading all float-related
functions from IMMUTABLE to STABLE, which would have severe performance
consequences, even for people with no interest in changing modes.

Thanks for closing out the thread with this info!

            regards, tom lane