Обсуждение: Strange results when casting string to double
Hi there,
I'm using several (now unsupported) PostgreSQL 9.3.24 servers on
different (ancient) Ubuntu 14.04 LTS machines. On only one of those
servers, I get strange/wrong results when converting a string into a
double value:
SELECT 1.56::double precision;
--> 1.55999999999999 (wrong!)
Although I do not find any differences in configuration, on all other
servers the result looks like this (correct?):
SELECT 1.56::double precision;
--> 1.56 (correct!)
AFAIK, this conversion is done by internal function float8in, which,
when called directly, yields the same results:
SELECT float8in('1.56');
--> 1.55999999999999 (wrong!) on one server, and
--> 1.56 (correct!) on all other servers.
Option extra_float_digits is zero (0) while doing all these tests. Also,
the problem seems to occur while converting text to double precision and
not when displaying the obtained double precision value. Why? The binary
representation of the double precision value is also different.
I've created a small to_bit function in Python to get the double
precision value's binary representation:
CREATE OR REPLACE FUNCTION to_bit(value double precision)
RETURNS bit AS
$BODY$
if 'fn.to_bit_d64' in SD:
return SD['fn.to_bit_d64'](value)
import struct
def to_bit_d64(value):
return ''.join('{:0>8b}'.format(c) for c in struct.pack('!d',
value))
SD['fn.to_bit_d64'] = to_bit_d64
return SD['fn.to_bit_d64'](value)
$BODY$
LANGUAGE plpython3u IMMUTABLE STRICT
COST 100;
The fraction (mantissa) of both values is different by 1:
value fraction
1.55999999999999 1000111101011100001010001111010111000010100011110101
1.56 1000111101011100001010001111010111000010100011110110
The fraction of the probably wrong value is one less than the fraction
of the correct value.
Formatting both values with 20 digits right of the decimal separator
(like printf("%.20f" ...) yields:
1.55999999999999983124 (wrong!)
1.56000000000000005329 (correct!)
Since even calling function float8in directly returns a probably wrong
result on one of the servers makes me believe, that there's no custom
cast in place being responsible for the wrong results.
Function float8in basically relies on C library function
double strtod(const char *str, char **endptr)
which I tested with a tiny C programm (calling strtod only and printing
the result with printf("%.20f", val);). The result is
1.56000000000000005329 (correct!)
on every server. So, seems like the C library function works as expected
on all servers.
Although I'm not a C expert, I don't find anything suspicious that
function float8in does with the value returned from strtod.
In version 9.3.24, file /src/backend/utils/adt/float.c looks a bit
different from the file in master branch. However, basically both
versions do much the same things. The old 9.3.24 version does some more
special error checks (#ifdef HAVE_BUGGY_IRIX_STRTOD, #ifdef
HAVE_BUGGY_SOLARIS_STRTOD and CHECKFLOATVAL), but these either throw
errors or set the converted value to return to a special value (if
indicated).
Has anyone an idea of what's going on here? I know, this version is far
from still being supported, however, there's not much real changes in
file float.c between these versions (in other words, this may happen
with recent versions as well?). The database instances on all servers
are configured quite the same (more or less). All run with the same
extensions installed; none is using any preloaded libraries (which may
replace C library function strtod?).
--
Carsten Klein
c(dot)klein(@)datagis(dot)com
On 2/16/22 05:27, Carsten Klein wrote: > Hi there, > > I'm using several (now unsupported) PostgreSQL 9.3.24 servers on > different (ancient) Ubuntu 14.04 LTS machines. On only one of those > servers, I get strange/wrong results when converting a string into a > double value: > > Has anyone an idea of what's going on here? I know, this version is far > from still being supported, however, there's not much real changes in > file float.c between these versions (in other words, this may happen > with recent versions as well?). The database instances on all servers > are configured quite the same (more or less). What is the more or less for the problem server? Are the hardware(CPU) architectures the same for all the servers? If not how does the problem server differ? Was Postgres installed from the same source/same way on all the server? > All run with the same > extensions installed; none is using any preloaded libraries (which may > replace C library function strtod?). > -- Adrian Klaver adrian.klaver@aklaver.com
On Wed, Feb 16, 2022 at 05:46 PM Adrian Klaver wrote > On 2/16/22 05:27, Carsten Klein wrote: >> Hi there, >> >> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on >> different (ancient) Ubuntu 14.04 LTS machines. On only one of those >> servers, I get strange/wrong results when converting a string into a >> double value: >> > >> Has anyone an idea of what's going on here? I know, this version is >> far from still being supported, however, there's not much real changes >> in file float.c between these versions (in other words, this may >> happen with recent versions as well?). The database instances on all >> servers are configured quite the same (more or less). > What is the more or less for the problem server? What? Didn't get that... > Are the hardware(CPU) architectures the same for all the servers? The problem server is a virtual machine (VMware). I've tested this on two other servers and do receive probably correct results. One of these is also a virtual machine (same VMware, running on different hardware), the other one is a physical box with an AMD Athlon(tm) II X4 640 64-bit processor. Customer site: Production System: VMware (ESX Host A) works as expected Testing System: VMware (ESX Host B) DOES NOT WORK as expected Our site: Development System: AMD Athlon(tm) II works as expected However, since the strtod function works correctly from my tiny C program on all these machines, I don't believe that this problem has to do with hardware or architecture. I guess, PostgreSQL dynamically links to these C library functions, right? > If not how does the problem server differ? I certainly have no idea. Actually, I don't see any more relevant differences. > Was Postgres installed from the same source/same way on all the server? PostgreSQL was installed from Ubuntu's official repositories (14.04 LTS) on all servers. -- Carsten Klein c(dot)klein(@)datagis(dot)com
On Wed, Feb 16, 2022 at 11:38 AM Carsten Klein <c.klein@datagis.com> wrote:
On Wed, Feb 16, 2022 at 05:46 PM Adrian Klaver wrote
> On 2/16/22 05:27, Carsten Klein wrote:
>> Hi there,
>>
>> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on
>> different (ancient) Ubuntu 14.04 LTS machines. On only one of those
>> servers, I get strange/wrong results when converting a string into a
>> double value:
>>
>
>> Has anyone an idea of what's going on here? I know, this version is
>> far from still being supported, however, there's not much real changes
>> in file float.c between these versions (in other words, this may
>> happen with recent versions as well?). The database instances on all
>> servers are configured quite the same (more or less).
> What is the more or less for the problem server?
What? Didn't get that...
You said they are more or less the same. Problems like these tend to hide in the "less" portion of the inequality.
Testing System: VMware (ESX Host B) DOES NOT WORK as expected
Given that this isn't working as expected it doesn't make for a great testing system. Install and initdb 14.2 on this machine and let's see what PostgreSQL produces.
> If not how does the problem server differ?
I certainly have no idea. Actually, I don't see any more relevant
differences.
What is the precise version of libc that is installed for one. Exact ESX releases too.
This isn't really all that interesting a report for the project if it only exists in one ancient system that cannot be experimented with. Maybe it's a faulty register on that machine's CPU. There is more double-checking and comparing that can be done here but it seems unlikely to be productive. It is more plausible that the snowflake machine in question just has issues and needs to be retired. Installing a newer version of PostgreSQL on it before junking it is about the right amount of experimental effort.
David J.
On Wed, Feb 16, 2022 at 08:11 PM David G. Johnston wrote:
> You said they are more or less the same. Problems like these tend to
> hide in the "less" portion of the inequality.
On of the virtualized servers was created as a clone of the other one
(using VMware to clone the VM). So, basically, these are very equal. Of
course, they diverged over time.
Focusing on PostgreSQL, here are the differences of postgresql.conf,
comparing testing system and production system:
< work_mem = 8MB # min 64kB
---
> work_mem = 4MB # min 64kB
417c417
< #log_statement = 'all' # none, ddl, mod, all
---
> #log_statement = 'none' # none, ddl, mod, all
Both PostgreSQL server have the same roles and users, that same
extensions installed and no preloaded libraries.
> Given that this isn't working as expected it doesn't make for a great
> testing system. Install and initdb 14.2 on this machine and let's see
> what PostgreSQL produces.
The testing system runs since 2015. I don't know whether the problem was
present from the beginning. But I don't think so, as we also have
"correct" double values in that database. Now, since "binary equality"
of the double precision values is a new requirement, we started to
notice, that (at least not) newly added (UPDATEd) values, e. g. 1.56 are
not binary equal to already present values:
Table abc, column xyz: currently (before) 1.56
UPDATE abc SET xyz = 1.56;
Table abc, column xyz: after 1.559999999
We have a trigger, that monitors such updates and it reports a changed
value, which is not correct. The problem is, that the assignment
SET xyz = 1.56
actually means
SET xyz = 1.559999999
since getting a double value from the string 1.56 yields 1.559999999.
Yes, moving to the latest PostgreSQL version might fix that error.
However, this is a customer's testing system. Actually, it is intended
to be reinstalled with Ubuntu 22.04 LTS which brings PostgreSQL 14. But
prior to that, we need to complete a project on the testing system that
requires that "binary equality" of double values.
> What is the precise version of libc that is installed for one. Exact
> ESX releases too.
Both VM servers run on
ESXi 6 (correct behavior)
ESXi 6.5 (misbehaving)
All machines use libc version 2.19 (libc-2.19.so).
> This isn't really all that interesting a report for the project if it
> only exists in one ancient system that cannot be experimented with.
> Maybe it's a faulty register on that machine's CPU. There is more
> double-checking and comparing that can be done here but it seems
> unlikely to be productive. It is more plausible that the snowflake
> machine in question just has issues and needs to be retired. Installing
> a newer version of PostgreSQL on it before junking it is about the right
> amount of experimental effort.
I just wanted to ask whether someone knows something about this or has
ever heard about such a behavior. You say, the snowflake machine has
issues... I don't believe in hardware issues, since it runs in VMware
and likely on many different CPUs. Isn't it more than unlikely that such
a constantly occurring error is caused by one faulty CPU (among that
many CPUs an ESX server typically has)?
And, keep in mind that strtod function works as expected from a simply C
testing program. I guess that the parsed double's value gets modified
somewhere in PostgreSQL after strtod was called. However, I do not yet
see where and why. I was hoping that someone of you could help.
Carsten
On 17 Feb 2022, at 8:02, Carsten Klein wrote:
> We have a trigger, that monitors such updates and it reports a changed value, which is not correct. The problem is,
thatthe assignment
>
> SET xyz = 1.56
>
> actually means
>
> SET xyz = 1.559999999
>
> since getting a double value from the string 1.56 yields 1.559999999.
<snip>
> I just wanted to ask whether someone knows something about this or has ever heard about such a behavior. You say, the
snowflakemachine has issues... I don't believe in hardware issues, since it runs in VMware and likely on many different
CPUs.Isn't it more than unlikely that such a constantly occurring error is caused by one faulty CPU (among that many
CPUsan ESX server typically has)?
>
Harking back to my long distant formative years I was taught to never ever rely on equality tests when it came to
floatingpoint values. Nothing has changed in this regard.
If binary exact is part of the requirement then the answer is INTEGER or NUMERIC never FLOAT, REAL, DOUBLE or anything
similar.
So, assuming the trigger function is the only source of this spurious grief one could apply the lesson taught in the
60sfrom the dawn of computing:
EQUALITY = absolute_value(op1 - op2) < epsilon — where op1 & op2 are DOUBLE, and epsilon is smaller than you care
for
Given the intrinsic (standards compliant) uncertainty when converting from absolute (e.g., string representation) to
floatingpoint there will never be value resolving why there are differences.
I suggest using the comparison that is appropriate to the representation of those values or fix the design by using the
properrepresentation.
Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem —
neat,plausible, and wrong.
— H. L. Mencken, 1920
On Thu, Feb 17, 2022 at 00:07 Gavan Schneider wrote: > Harking back to my long distant formative years I was taught to never ever rely on equality tests when it came to floatingpoint values. Nothing has changed in this regard. > > If binary exact is part of the requirement then the answer is INTEGER or NUMERIC never FLOAT, REAL, DOUBLE or anythingsimilar. > > So, assuming the trigger function is the only source of this spurious grief one could apply the lesson taught in the 60sfrom the dawn of computing: > EQUALITY = absolute_value(op1 - op2) < epsilon — where op1 & op2 are DOUBLE, and epsilon is smaller than you carefor > > Given the intrinsic (standards compliant) uncertainty when converting from absolute (e.g., string representation) to floatingpoint there will never be value resolving why there are differences. > > I suggest using the comparison that is appropriate to the representation of those values or fix the design by using theproper representation. Just some(!) background: I know that there are other ways to compare _any_ floating point values. However, doing that for a whole ROW in a fast manner is not trivial (e. g. it tends to get quite slow). With the hstore extension and (so called) binary equality I've found a very fast way which is also fully generic. hstore uses text representation, so comparisons depend on how values are converted to and from text. But all that is not the point. Double precision conversion algorithms are well defined and deterministic, so it should yield the same results when called with the same arguments (every time and on every machine). That is 1.56::double precision == 1.56000000000000005329070518201E0 1.56::double precision != 1.55999999999999983124610025698E0 With reduced precision (as in PG), that gives you: 1.56::double precision == 1.56 1.56::double precision != 1.55999999999999 However, one of my ProstgreSQL servers returns the latter (wrong) value. You can test this with C library function 'strtod' or with any of the online converters around: https://www.binaryconvert.com/result_double.html Click 'New Conversion' and enter 1.56 into the 'Decimal' field. Then 'Convert to binary' or hit enter. So, the primary problem of that PostgreSQL server is, that it converts text to double in a wrong way. Apart from any triggers, "binary equality" and whatever else I'm doing in this project, this has dramatic effects on the database, as it's messing up the values that I'm storing: Imagine I do: INSERT INTO foo (my_col) VALUES ('Hello World'); But the database has happily stored a different string: SELECT my_col FROM foo; my_col ------------- Hello Worlc (1 row) Finding that string again may be done with "fuzzy search" or regular expressions, but the much better approach is the database not to let mess up the string while storing it. Double precision values are limited in precision (in binary) and there are numbers, that cannot be stored exactly (as it's true for many decimal numbers, like 1/3 ~ 0.33333 as well). Nevertheless, with a given maximum of precision, the same values should have the same (well defined) binary value so that conversions between text and double should not change the value at any time on any machine. Carsten
Carsten Klein schrieb am 16.02.2022 um 14:27: > I'm using several (now unsupported) PostgreSQL 9.3.24 servers on > different (ancient) Ubuntu 14.04 LTS machines. On only one of those > servers, I get strange/wrong results when converting a string into a > double value: > > SELECT 1.56::double precision; > > --> 1.55999999999999 (wrong!) > > Although I do not find any differences in configuration, on all other servers the result looks like this (correct?): > > SELECT 1.56::double precision; > > --> 1.56 (correct!) > I wonder if extra_float_digits is different between those systems Maybe initialized by differently configured SQL clients.
On Thu, Feb 17, 2022 at 09:41 AM Thomas Kellerer wrote:
> Carsten Klein schrieb am 16.02.2022 um 14:27:
Ah, man versteht sich :)
>> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on
>> different (ancient) Ubuntu 14.04 LTS machines. On only one of those
>> servers, I get strange/wrong results when converting a string into a
>> double value:
>>
>> SELECT 1.56::double precision;
>>
>> --> 1.55999999999999 (wrong!)
>>
>> Although I do not find any differences in configuration, on all other servers the result looks like this
(correct?):
>>
>> SELECT 1.56::double precision;
>>
>> --> 1.56 (correct!)
>>
>
> I wonder if extra_float_digits is different between those systems
> Maybe initialized by differently configured SQL clients.
As I've seen in the sources in file /src/backend/utils/adt/float.c,
extra_float_digits is used when converting double precision values back
to text only. The binary (BIT) representation if the double value's
fraction (mantissa) tells me, that the actual double precision value is
already wrong.
Also, extra_float_digits is constantly zero in my tests. I've only been
using pgAdmin III for that and always ensured with
SELECT current_setting('extra_float_digits');
that it's zero.
Carsten
On 2022-02-16 22:02:02 +0100, Carsten Klein wrote:
> On of the virtualized servers was created as a clone of the other one (using
> VMware to clone the VM). So, basically, these are very equal. Of course,
> they diverged over time.
>
> Focusing on PostgreSQL, here are the differences of postgresql.conf,
> comparing testing system and production system:
I don't think these explain the difference. I'd check whether the
postgresql binaries and all the the shared libraries are the same.
Or - since this only happens on the test system and not on the production
system - I'd just clone the production system again to create a new test
system and see if the problem happens there, too.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Вложения
AOn Thu, Feb 17, 2022 at 10:27 AM Peter J. Holzer wrote > > I don't think these explain the difference. I'd check whether the > postgresql binaries and all the the shared libraries are the same. > > Or - since this only happens on the test system and not on the production > system - I'd just clone the production system again to create a new test > system and see if the problem happens there, too. > File postgresql and it's loaded libraries are identical. Same set of libraries loaded on all servers; all loaded library have identical MD5 sums. It's the customer's VMware, so I can't just make another clone. However, I'm quite sure that it will work on a newly cloned testing system (since on production system everything is OK). Carsten
On 16.02.22 14:27, Carsten Klein wrote:
> AFAIK, this conversion is done by internal function float8in, which,
> when called directly, yields the same results:
>
> SELECT float8in('1.56');
>
> --> 1.55999999999999 (wrong!) on one server, and
> --> 1.56 (correct!) on all other servers.
float8in() really just calls the operating system's strtod() function.
I would test that one directly with a small C program.
It's also possible that different compiler options lead to different
optimizations.
On 18.02.2022 13:28, Peter Eisentraut wrote:
> float8in() really just calls the operating system's strtod() function. I
> would test that one directly with a small C program.
>
> It's also possible that different compiler options lead to different
> optimizations.
That's what I did. Here's my small C program: (nicht lachen *g*)
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[]) {
/* default string to convert */
char buf[10] = "1.56\0";
/* choose and print string to convert */
char* sval = argc > 1 ? argv[1] : buf;
printf("string value: %s\n", sval);
/* convert and print */
char* ptr;
double dval = strtod(sval, &ptr);
printf("double value: %.20f\n", dval);
return 0;
}
It works correctly on all these servers. Here's its output:
string value: 1.56
double value: 1.56000000000000005329
I didn't test different compiler options. However, PostgreSQL was always
installed from official Ubuntu 14.04 repositories (getting the binaries,
not the source packages), so all binaries should have been compiled with
the same options.
Carsten
Carsten Klein <c.klein@datagis.com> writes:
> On 18.02.2022 13:28, Peter Eisentraut wrote:
>> float8in() really just calls the operating system's strtod() function. I
>> would test that one directly with a small C program.
> That's what I did. Here's my small C program: (nicht lachen *g*)
Yeah, you said that upthread, which makes the whole thing pretty
baffling. One possible explanation is that your small program got linked
against a different version of libc than what the Postgres backend is
using ("ldd" would help you check that, but given the age of the Postgres
installation, this seems plausible). Beyond that it's hard to think
of any explanation other than hardware fault or corrupted executable.
regards, tom lane
On 18.02.2022 16:32, Tom Lane wrote:
> Yeah, you said that upthread, which makes the whole thing pretty
> baffling. One possible explanation is that your small program got linked
> against a different version of libc than what the Postgres backend is
> using ("ldd" would help you check that, but given the age of the Postgres
> installation, this seems plausible). Beyond that it's hard to think
> of any explanation other than hardware fault or corrupted executable.
Tom,
both PostgreSQL and my C program are linked to the same libc.so.6. Same
path, same MD5 sum. Since libc is a Shared Object (so), both processes
should really run the identical code. Am I missing something? I've
written and compiled the small C program on the same old Ubuntu OS.
So, you're not aware of any ways this behavior could be achieved from
within PostgreSQL? Something like a custom cast (actually, there is
none) or something that could intercept string to double conversion?
That would be something to look at closer. The question is: how would
you implement such an evil database wide text to double conversion (just
to kid users) if you had to?
PostgreSQL is up for more than 480 days on that server. I'm thinking of
giving a restart of the database a try. However, there's a long running
import taking place, so this will not happen before mid or end of next week.
Regards, Carsten
Carsten Klein <c.klein@datagis.com> writes:
> So, you're not aware of any ways this behavior could be achieved from
> within PostgreSQL? Something like a custom cast (actually, there is
> none) or something that could intercept string to double conversion?
Well, you asserted that these installations are all alike ... but
sure, it's conceivable that somebody could've replaced the default
numeric -> float8 cast (i.e., numeric_float8()) with some other code
that does it a bit inaccurately. There's only a pg_cast entry
connecting that function to those types. Have you tried stepping
through things with a debugger, to see if numeric_float8 and float8in
are actually reached on the problematic system?
> That would be something to look at closer. The question is: how would
> you implement such an evil database wide text to double conversion (just
> to kid users) if you had to?
At the SQL level, your example does not involve text -> double.
The literal 1.56 is of type numeric.
regards, tom lane
On 2022-02-18 18:07:35 +0100, Carsten Klein wrote:
> both PostgreSQL and my C program are linked to the same libc.so.6. Same
> path, same MD5 sum. Since libc is a Shared Object (so), both processes
> should really run the identical code. Am I missing something? I've written
> and compiled the small C program on the same old Ubuntu OS.
[...]
> PostgreSQL is up for more than 480 days on that server. I'm thinking of
> giving a restart of the database a try. However, there's a long running
> import taking place, so this will not happen before mid or end of next week.
If there has been a glibc update (or a postgresql update) in those 480
days (Ubuntu 14.04 is out of its normal maintenance period but ESM is
still available) the running processes may well run different code than
a newly started program. So it could be a bug which has since been
fixed.
Another idea: It could be the case that something (maybe a bug in
postgres, maybe an extension, maybe even a random bit flip in memory)
changed the FP rounding mode within the postgres process, which would
affect all FP computations until the rounding mode is reset. That would
have to have happened in the master to affect the worker processes as
consistently as you are seeing. I don't know if it is even possible for
a non-standard rounding mode to persist for any length of time, but if
it is it would certainly account for weird rounding errors.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Вложения
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> Another idea: It could be the case that something (maybe a bug in
> postgres, maybe an extension, maybe even a random bit flip in memory)
> changed the FP rounding mode within the postgres process, which would
> affect all FP computations until the rounding mode is reset. That would
> have to have happened in the master to affect the worker processes as
> consistently as you are seeing. I don't know if it is even possible for
> a non-standard rounding mode to persist for any length of time, but if
> it is it would certainly account for weird rounding errors.
Hmm, that is a pretty interesting theory. On a RHEL8 box, I find
that fesetround(FE_DOWNWARD) causes strtod("1.56", NULL) to return
1.55999999999999983124
rather than the usual
1.56000000000000005329
which seems to square with Carsten's symptom.
Postgres itself contains no fesetround calls, but if you want
to believe a random bit flip changed that mode, maybe that'd
account for it. It'd certainly be interesting to find out
whether the problem persists after a postmaster restart.
[ wanders away wondering if the troublesome machine has ECC
memory ... ]
regards, tom lane
On 18.02.22 22:42, Peter J. Holzer wrote: > If there has been a glibc update (or a postgresql update) in those 480 > days (Ubuntu 14.04 is out of its normal maintenance period but ESM is > still available) the running processes may well run different code than > a newly started program. So it could be a bug which has since been > fixed. That would be visible via `lsof`. `libc. The file `...libc...so` that `postgres` is keeping open would have the text `DEL` (as in deleted) in the `FD` column of `lsof`'s output. As opposed to a newly started program which would have `REG` (regular file) there. *t
On Sat, Feb 19, 2022 at 8:35 AM Tomas Pospisek <tpo2@sourcepole.ch> wrote: > > On 18.02.22 22:42, Peter J. Holzer wrote: > > > If there has been a glibc update (or a postgresql update) in those 480 > > days (Ubuntu 14.04 is out of its normal maintenance period but ESM is > > still available) the running processes may well run different code than > > a newly started program. So it could be a bug which has since been > > fixed. > > That would be visible via `lsof`. `libc. The file `...libc...so` that > `postgres` is keeping open would have the text `DEL` (as in deleted) in > the `FD` column of `lsof`'s output. > > As opposed to a newly started program which would have `REG` (regular > file) there. If this doesn't bear fruit, are there debugging symbols? Setting a breakpoint might produce some insight. merlin
On 19.02.2022 15:35, Tomas Pospisek wrote: > That would be visible via `lsof`. `libc. The file `...libc...so` that > `postgres` is keeping open would have the text `DEL` (as in deleted) in > the `FD` column of `lsof`'s output. > > As opposed to a newly started program which would have `REG` (regular > file) there. > *t Actually, PostgreSLQ uses the same libc than any recently started program. lsof reports REG and my /lib/x86_64-linux-gnu/libc-2.19.so has file date Mar 27, 2019, so it's much older than the point in time when PostgreSQL was started the last time (aprx. 480 days ago). After all, I guess the idea of a wrong rounding setting (set to round down, raised by Peter J. Holzer) seems most plausible to me. In particular, since the "right" and "wrong" values caused by wrong rounding shown by Tom Lane are exactly the values I am seeing. Also, there is a quite aggressive import script running for some days, which even tried to replace some of the core functions, like array_length (it added a pure PL/pgSQL version in public schema). Seems like they wanted to "Polyfill" some required functions. Maybe that script is responsible for changing the process' rounding mode? The customer started that script without thinking too much about it... :( @Tom Lane: you say, PostgreSQL does not / cannot change rounding mode at any time? There is no function to do so? So, the script (it's more like a module) must provide a Shared Object module with a C function in order to change the FPU's rounding mode? Because several people recommended using a debugger in order to see whats going on here: actually I have no expertise with debugging on Linux without an IDE. So I did not yet think of using a debugger so far. I will try a restart of the DB ASAP. However, in order to prove the rounding mode thesis: someone knows whether fesetround(FE_DOWNWARD) just sets a flag in the C runtime environment or does this call actually set the rounding mode in the FPU's control word? Is there any chance to get that current rounding mode with or even without a debugger? (Maybe some file in /proc filesystem?) I'm very unhappy with the customer starting that import script. In order to find the `smoking gun`, I will now have a closer look at the "product" that import is based on and what they have done to the database... Regards, Carsten
Carsten Klein <c.klein@datagis.com> writes:
> @Tom Lane: you say, PostgreSQL does not / cannot change rounding mode at
> any time? There is no function to do so? So, the script (it's more like
> a module) must provide a Shared Object module with a C function in order
> to change the FPU's rounding mode?
Per grep, there is no call of fesetround() in the Postgres source
tree. I'm not sure offhand whether libc exposes any other APIs
that could change the rounding mode, but I am quite sure that we
wouldn't be intentionally changing it anywhere.
> I will try a restart of the DB ASAP. However, in order to prove the
> rounding mode thesis: someone knows whether fesetround(FE_DOWNWARD) just
> sets a flag in the C runtime environment or does this call actually set
> the rounding mode in the FPU's control word? Is there any chance to get
> that current rounding mode with or even without a debugger? (Maybe some
> file in /proc filesystem?)
The OS would surely allow each process to have its own setting of the
rounding mode, so I doubt you can see it from outside.
Another point to keep in mind is that no matter how invasive that
import script might be, it's still hard to explain how it'd affect
the rounding mode in other backend processes. You have to postulate
either that the rounding mode has been changed in the postmaster
process (and then inherited by session backends via fork()), or that
some code running at the time of child process creation changes the
mode, or that they replaced numeric_float8 with something else.
I think the only way that the postmaster's rounding mode could change
after postmaster start is the cosmic-ray hypothesis; while we do have
features that'd allow loading extra code into the postmaster, I'm
pretty sure they only take effect at postmaster start. So even if
that import script tried to do that, it wouldn't have succeeded yet.
Of the other two hypotheses, "substitute numeric_float8" seems like
the most likely, especially given the other stuff you mentioned the
script doing. Have you checked the relevant pg_cast entry to see
if it's been changed? It'd also be interesting to see if the odd
rounding behavior happens in all databases of the cluster or just
one.
regards, tom lane
On 19.02.2022 20:34 Tom Lane wrote:
> Per grep, there is no call of fesetround() in the Postgres source
> tree. I'm not sure offhand whether libc exposes any other APIs
> that could change the rounding mode, but I am quite sure that we
> wouldn't be intentionally changing it anywhere.
>
> The OS would surely allow each process to have its own setting of the
> rounding mode, so I doubt you can see it from outside.
>
> Another point to keep in mind is that no matter how invasive that
> import script might be, it's still hard to explain how it'd affect
> the rounding mode in other backend processes. You have to postulate
> either that the rounding mode has been changed in the postmaster
> process (and then inherited by session backends via fork()), or that
> some code running at the time of child process creation changes the
> mode, or that they replaced numeric_float8 with something else.
>
> I think the only way that the postmaster's rounding mode could change
> after postmaster start is the cosmic-ray hypothesis; while we do have
> features that'd allow loading extra code into the postmaster, I'm
> pretty sure they only take effect at postmaster start. So even if
> that import script tried to do that, it wouldn't have succeeded yet.
>
> Of the other two hypotheses, "substitute numeric_float8" seems like
> the most likely, especially given the other stuff you mentioned the
> script doing. Have you checked the relevant pg_cast entry to see
> if it's been changed? It'd also be interesting to see if the odd
> rounding behavior happens in all databases of the cluster or just
> one.
The script has finished!
After a restart of the database, everything works as expected again.
Rounding as well as text/numeric to double precision works the same on
all of my servers.
Prior to restarting, I've implemented my own Python based versions of both
int fegetround(void);
int fesetround(int rounding_mode integer);
Have a look a these:
CREATE OR REPLACE FUNCTION fegetround()
RETURNS integer AS
$BODY$
if 'fn.fegetround' in SD:
return SD['fn.fegetround']()
from ctypes import cdll
from ctypes.util import find_library
libm = cdll.LoadLibrary(find_library('m'))
def fegetround():
return libm.fegetround()
SD['fn.fegetround'] = fegetround
return SD['fn.fegetround']()
$BODY$
LANGUAGE plpython3u VOLATILE
COST 100;
CREATE OR REPLACE FUNCTION fesetround(rounding_mode integer)
RETURNS integer AS
$BODY$
if 'fn.fesetround' in SD:
return SD['fn.fesetround'](rounding_mode)
from ctypes import cdll
from ctypes.util import find_library
libm = cdll.LoadLibrary(find_library('m'))
def fesetround(rounding_mode):
return libm.fesetround(rounding_mode)
SD['fn.fesetround'] = fesetround
return SD['fn.fesetround'](rounding_mode)
$BODY$
LANGUAGE plpython3u VOLATILE STRICT
COST 100;
With those, I was able to proof, that actually the "wrong" rounding mode
FE_DOWNWARD (0x400)
was in effect for every new process/connection with all the described
effects on casting from string or numeric to double precision:
SELECT 1.56::double precision
-> 1.55999999999999
Setting rounding mode to
FE_TONEAREST (0x0),
instantly lead back to the expected casting behavior:
SELECT 1.56::double precision
-> 1.56
Setting rounding mode after restarting the database is still possible,
however, new sessions start off with the "correct" rounding mode
FE_TONEAREST (0x0). So, the only thing that's really changed after the
restart was, that the postmaster now has the "correct" rounding mode,
which it promotes down when forking off child processes.
We'll likely never know, why ever the postmaster got tainted with that
FE_DOWNWARD (0x400) rounding mode.
As Tom Lane said, no matter how aggressive the script could be, it can,
if at all, only change its current session's rounding mode. So, maybe it
actually was a random bit flip or a side effect caused by a quite rare
error condition in postmaster.
Nearly the same is true for any core functions or casts hijacked by the
script - these are only in effect for the database the script was ever
connecting to. In my case, the script only used one database. However,
the issue was present with any database.
Two official math functions to get and set the session's rounding mode
provided by PostgreSQL could be a good add-on for any of the next
versions of the database. Thinking about it again... maybe that's just
too dangerous :-p
Finally, many thanks to all that supported me and came up with that many
helpful ideas! :-)
Regards, Carsten
Carsten Klein <c.klein@datagis.com> writes:
> Prior to restarting, I've implemented my own Python based versions of both
> int fegetround(void);
> int fesetround(int rounding_mode integer);
Ah, good idea!
> With those, I was able to proof, that actually the "wrong" rounding mode
> FE_DOWNWARD (0x400)
> was in effect for every new process/connection with all the described
> effects on casting from string or numeric to double precision:
Thanks for confirming that that was the source of the problem.
> We'll likely never know, why ever the postmaster got tainted with that
> FE_DOWNWARD (0x400) rounding mode.
Indeed. It's hard to see any other explanation than "random bit flip"
though. The postmaster is designed to run only a very small amount of
code, almost none of which is user-controllable. Even if there were
somewhere some code that intended to change the rounding mode, explaining
how the postmaster got to that without crashing is a tall order.
> Two official math functions to get and set the session's rounding mode
> provided by PostgreSQL could be a good add-on for any of the next
> versions of the database. Thinking about it again... maybe that's just
> too dangerous :-p
No, we'd absolutely not ever provide a supported function to change
rounding modes. That would require downgrading all float-related
functions from IMMUTABLE to STABLE, which would have severe performance
consequences, even for people with no interest in changing modes.
Thanks for closing out the thread with this info!
regards, tom lane