Обсуждение: Strange results when casting string to double
Hi there, I'm using several (now unsupported) PostgreSQL 9.3.24 servers on different (ancient) Ubuntu 14.04 LTS machines. On only one of those servers, I get strange/wrong results when converting a string into a double value: SELECT 1.56::double precision; --> 1.55999999999999 (wrong!) Although I do not find any differences in configuration, on all other servers the result looks like this (correct?): SELECT 1.56::double precision; --> 1.56 (correct!) AFAIK, this conversion is done by internal function float8in, which, when called directly, yields the same results: SELECT float8in('1.56'); --> 1.55999999999999 (wrong!) on one server, and --> 1.56 (correct!) on all other servers. Option extra_float_digits is zero (0) while doing all these tests. Also, the problem seems to occur while converting text to double precision and not when displaying the obtained double precision value. Why? The binary representation of the double precision value is also different. I've created a small to_bit function in Python to get the double precision value's binary representation: CREATE OR REPLACE FUNCTION to_bit(value double precision) RETURNS bit AS $BODY$ if 'fn.to_bit_d64' in SD: return SD['fn.to_bit_d64'](value) import struct def to_bit_d64(value): return ''.join('{:0>8b}'.format(c) for c in struct.pack('!d', value)) SD['fn.to_bit_d64'] = to_bit_d64 return SD['fn.to_bit_d64'](value) $BODY$ LANGUAGE plpython3u IMMUTABLE STRICT COST 100; The fraction (mantissa) of both values is different by 1: value fraction 1.55999999999999 1000111101011100001010001111010111000010100011110101 1.56 1000111101011100001010001111010111000010100011110110 The fraction of the probably wrong value is one less than the fraction of the correct value. Formatting both values with 20 digits right of the decimal separator (like printf("%.20f" ...) yields: 1.55999999999999983124 (wrong!) 1.56000000000000005329 (correct!) Since even calling function float8in directly returns a probably wrong result on one of the servers makes me believe, that there's no custom cast in place being responsible for the wrong results. Function float8in basically relies on C library function double strtod(const char *str, char **endptr) which I tested with a tiny C programm (calling strtod only and printing the result with printf("%.20f", val);). The result is 1.56000000000000005329 (correct!) on every server. So, seems like the C library function works as expected on all servers. Although I'm not a C expert, I don't find anything suspicious that function float8in does with the value returned from strtod. In version 9.3.24, file /src/backend/utils/adt/float.c looks a bit different from the file in master branch. However, basically both versions do much the same things. The old 9.3.24 version does some more special error checks (#ifdef HAVE_BUGGY_IRIX_STRTOD, #ifdef HAVE_BUGGY_SOLARIS_STRTOD and CHECKFLOATVAL), but these either throw errors or set the converted value to return to a special value (if indicated). Has anyone an idea of what's going on here? I know, this version is far from still being supported, however, there's not much real changes in file float.c between these versions (in other words, this may happen with recent versions as well?). The database instances on all servers are configured quite the same (more or less). All run with the same extensions installed; none is using any preloaded libraries (which may replace C library function strtod?). -- Carsten Klein c(dot)klein(@)datagis(dot)com
On 2/16/22 05:27, Carsten Klein wrote: > Hi there, > > I'm using several (now unsupported) PostgreSQL 9.3.24 servers on > different (ancient) Ubuntu 14.04 LTS machines. On only one of those > servers, I get strange/wrong results when converting a string into a > double value: > > Has anyone an idea of what's going on here? I know, this version is far > from still being supported, however, there's not much real changes in > file float.c between these versions (in other words, this may happen > with recent versions as well?). The database instances on all servers > are configured quite the same (more or less). What is the more or less for the problem server? Are the hardware(CPU) architectures the same for all the servers? If not how does the problem server differ? Was Postgres installed from the same source/same way on all the server? > All run with the same > extensions installed; none is using any preloaded libraries (which may > replace C library function strtod?). > -- Adrian Klaver adrian.klaver@aklaver.com
On Wed, Feb 16, 2022 at 05:46 PM Adrian Klaver wrote > On 2/16/22 05:27, Carsten Klein wrote: >> Hi there, >> >> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on >> different (ancient) Ubuntu 14.04 LTS machines. On only one of those >> servers, I get strange/wrong results when converting a string into a >> double value: >> > >> Has anyone an idea of what's going on here? I know, this version is >> far from still being supported, however, there's not much real changes >> in file float.c between these versions (in other words, this may >> happen with recent versions as well?). The database instances on all >> servers are configured quite the same (more or less). > What is the more or less for the problem server? What? Didn't get that... > Are the hardware(CPU) architectures the same for all the servers? The problem server is a virtual machine (VMware). I've tested this on two other servers and do receive probably correct results. One of these is also a virtual machine (same VMware, running on different hardware), the other one is a physical box with an AMD Athlon(tm) II X4 640 64-bit processor. Customer site: Production System: VMware (ESX Host A) works as expected Testing System: VMware (ESX Host B) DOES NOT WORK as expected Our site: Development System: AMD Athlon(tm) II works as expected However, since the strtod function works correctly from my tiny C program on all these machines, I don't believe that this problem has to do with hardware or architecture. I guess, PostgreSQL dynamically links to these C library functions, right? > If not how does the problem server differ? I certainly have no idea. Actually, I don't see any more relevant differences. > Was Postgres installed from the same source/same way on all the server? PostgreSQL was installed from Ubuntu's official repositories (14.04 LTS) on all servers. -- Carsten Klein c(dot)klein(@)datagis(dot)com
On Wed, Feb 16, 2022 at 11:38 AM Carsten Klein <c.klein@datagis.com> wrote:
On Wed, Feb 16, 2022 at 05:46 PM Adrian Klaver wrote
> On 2/16/22 05:27, Carsten Klein wrote:
>> Hi there,
>>
>> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on
>> different (ancient) Ubuntu 14.04 LTS machines. On only one of those
>> servers, I get strange/wrong results when converting a string into a
>> double value:
>>
>
>> Has anyone an idea of what's going on here? I know, this version is
>> far from still being supported, however, there's not much real changes
>> in file float.c between these versions (in other words, this may
>> happen with recent versions as well?). The database instances on all
>> servers are configured quite the same (more or less).
> What is the more or less for the problem server?
What? Didn't get that...
You said they are more or less the same. Problems like these tend to hide in the "less" portion of the inequality.
Testing System: VMware (ESX Host B) DOES NOT WORK as expected
Given that this isn't working as expected it doesn't make for a great testing system. Install and initdb 14.2 on this machine and let's see what PostgreSQL produces.
> If not how does the problem server differ?
I certainly have no idea. Actually, I don't see any more relevant
differences.
What is the precise version of libc that is installed for one. Exact ESX releases too.
This isn't really all that interesting a report for the project if it only exists in one ancient system that cannot be experimented with. Maybe it's a faulty register on that machine's CPU. There is more double-checking and comparing that can be done here but it seems unlikely to be productive. It is more plausible that the snowflake machine in question just has issues and needs to be retired. Installing a newer version of PostgreSQL on it before junking it is about the right amount of experimental effort.
David J.
On Wed, Feb 16, 2022 at 08:11 PM David G. Johnston wrote: > You said they are more or less the same. Problems like these tend to > hide in the "less" portion of the inequality. On of the virtualized servers was created as a clone of the other one (using VMware to clone the VM). So, basically, these are very equal. Of course, they diverged over time. Focusing on PostgreSQL, here are the differences of postgresql.conf, comparing testing system and production system: < work_mem = 8MB # min 64kB --- > work_mem = 4MB # min 64kB 417c417 < #log_statement = 'all' # none, ddl, mod, all --- > #log_statement = 'none' # none, ddl, mod, all Both PostgreSQL server have the same roles and users, that same extensions installed and no preloaded libraries. > Given that this isn't working as expected it doesn't make for a great > testing system. Install and initdb 14.2 on this machine and let's see > what PostgreSQL produces. The testing system runs since 2015. I don't know whether the problem was present from the beginning. But I don't think so, as we also have "correct" double values in that database. Now, since "binary equality" of the double precision values is a new requirement, we started to notice, that (at least not) newly added (UPDATEd) values, e. g. 1.56 are not binary equal to already present values: Table abc, column xyz: currently (before) 1.56 UPDATE abc SET xyz = 1.56; Table abc, column xyz: after 1.559999999 We have a trigger, that monitors such updates and it reports a changed value, which is not correct. The problem is, that the assignment SET xyz = 1.56 actually means SET xyz = 1.559999999 since getting a double value from the string 1.56 yields 1.559999999. Yes, moving to the latest PostgreSQL version might fix that error. However, this is a customer's testing system. Actually, it is intended to be reinstalled with Ubuntu 22.04 LTS which brings PostgreSQL 14. But prior to that, we need to complete a project on the testing system that requires that "binary equality" of double values. > What is the precise version of libc that is installed for one. Exact > ESX releases too. Both VM servers run on ESXi 6 (correct behavior) ESXi 6.5 (misbehaving) All machines use libc version 2.19 (libc-2.19.so). > This isn't really all that interesting a report for the project if it > only exists in one ancient system that cannot be experimented with. > Maybe it's a faulty register on that machine's CPU. There is more > double-checking and comparing that can be done here but it seems > unlikely to be productive. It is more plausible that the snowflake > machine in question just has issues and needs to be retired. Installing > a newer version of PostgreSQL on it before junking it is about the right > amount of experimental effort. I just wanted to ask whether someone knows something about this or has ever heard about such a behavior. You say, the snowflake machine has issues... I don't believe in hardware issues, since it runs in VMware and likely on many different CPUs. Isn't it more than unlikely that such a constantly occurring error is caused by one faulty CPU (among that many CPUs an ESX server typically has)? And, keep in mind that strtod function works as expected from a simply C testing program. I guess that the parsed double's value gets modified somewhere in PostgreSQL after strtod was called. However, I do not yet see where and why. I was hoping that someone of you could help. Carsten
On 17 Feb 2022, at 8:02, Carsten Klein wrote: > We have a trigger, that monitors such updates and it reports a changed value, which is not correct. The problem is, thatthe assignment > > SET xyz = 1.56 > > actually means > > SET xyz = 1.559999999 > > since getting a double value from the string 1.56 yields 1.559999999. <snip> > I just wanted to ask whether someone knows something about this or has ever heard about such a behavior. You say, the snowflakemachine has issues... I don't believe in hardware issues, since it runs in VMware and likely on many different CPUs.Isn't it more than unlikely that such a constantly occurring error is caused by one faulty CPU (among that many CPUsan ESX server typically has)? > Harking back to my long distant formative years I was taught to never ever rely on equality tests when it came to floatingpoint values. Nothing has changed in this regard. If binary exact is part of the requirement then the answer is INTEGER or NUMERIC never FLOAT, REAL, DOUBLE or anything similar. So, assuming the trigger function is the only source of this spurious grief one could apply the lesson taught in the 60sfrom the dawn of computing: EQUALITY = absolute_value(op1 - op2) < epsilon — where op1 & op2 are DOUBLE, and epsilon is smaller than you care for Given the intrinsic (standards compliant) uncertainty when converting from absolute (e.g., string representation) to floatingpoint there will never be value resolving why there are differences. I suggest using the comparison that is appropriate to the representation of those values or fix the design by using the properrepresentation. Gavan Schneider —— Gavan Schneider, Sodwalls, NSW, Australia Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat,plausible, and wrong. — H. L. Mencken, 1920
On Thu, Feb 17, 2022 at 00:07 Gavan Schneider wrote: > Harking back to my long distant formative years I was taught to never ever rely on equality tests when it came to floatingpoint values. Nothing has changed in this regard. > > If binary exact is part of the requirement then the answer is INTEGER or NUMERIC never FLOAT, REAL, DOUBLE or anythingsimilar. > > So, assuming the trigger function is the only source of this spurious grief one could apply the lesson taught in the 60sfrom the dawn of computing: > EQUALITY = absolute_value(op1 - op2) < epsilon — where op1 & op2 are DOUBLE, and epsilon is smaller than you carefor > > Given the intrinsic (standards compliant) uncertainty when converting from absolute (e.g., string representation) to floatingpoint there will never be value resolving why there are differences. > > I suggest using the comparison that is appropriate to the representation of those values or fix the design by using theproper representation. Just some(!) background: I know that there are other ways to compare _any_ floating point values. However, doing that for a whole ROW in a fast manner is not trivial (e. g. it tends to get quite slow). With the hstore extension and (so called) binary equality I've found a very fast way which is also fully generic. hstore uses text representation, so comparisons depend on how values are converted to and from text. But all that is not the point. Double precision conversion algorithms are well defined and deterministic, so it should yield the same results when called with the same arguments (every time and on every machine). That is 1.56::double precision == 1.56000000000000005329070518201E0 1.56::double precision != 1.55999999999999983124610025698E0 With reduced precision (as in PG), that gives you: 1.56::double precision == 1.56 1.56::double precision != 1.55999999999999 However, one of my ProstgreSQL servers returns the latter (wrong) value. You can test this with C library function 'strtod' or with any of the online converters around: https://www.binaryconvert.com/result_double.html Click 'New Conversion' and enter 1.56 into the 'Decimal' field. Then 'Convert to binary' or hit enter. So, the primary problem of that PostgreSQL server is, that it converts text to double in a wrong way. Apart from any triggers, "binary equality" and whatever else I'm doing in this project, this has dramatic effects on the database, as it's messing up the values that I'm storing: Imagine I do: INSERT INTO foo (my_col) VALUES ('Hello World'); But the database has happily stored a different string: SELECT my_col FROM foo; my_col ------------- Hello Worlc (1 row) Finding that string again may be done with "fuzzy search" or regular expressions, but the much better approach is the database not to let mess up the string while storing it. Double precision values are limited in precision (in binary) and there are numbers, that cannot be stored exactly (as it's true for many decimal numbers, like 1/3 ~ 0.33333 as well). Nevertheless, with a given maximum of precision, the same values should have the same (well defined) binary value so that conversions between text and double should not change the value at any time on any machine. Carsten
Carsten Klein schrieb am 16.02.2022 um 14:27: > I'm using several (now unsupported) PostgreSQL 9.3.24 servers on > different (ancient) Ubuntu 14.04 LTS machines. On only one of those > servers, I get strange/wrong results when converting a string into a > double value: > > SELECT 1.56::double precision; > > --> 1.55999999999999 (wrong!) > > Although I do not find any differences in configuration, on all other servers the result looks like this (correct?): > > SELECT 1.56::double precision; > > --> 1.56 (correct!) > I wonder if extra_float_digits is different between those systems Maybe initialized by differently configured SQL clients.
On Thu, Feb 17, 2022 at 09:41 AM Thomas Kellerer wrote: > Carsten Klein schrieb am 16.02.2022 um 14:27: Ah, man versteht sich :) >> I'm using several (now unsupported) PostgreSQL 9.3.24 servers on >> different (ancient) Ubuntu 14.04 LTS machines. On only one of those >> servers, I get strange/wrong results when converting a string into a >> double value: >> >> SELECT 1.56::double precision; >> >> --> 1.55999999999999 (wrong!) >> >> Although I do not find any differences in configuration, on all other servers the result looks like this (correct?): >> >> SELECT 1.56::double precision; >> >> --> 1.56 (correct!) >> > > I wonder if extra_float_digits is different between those systems > Maybe initialized by differently configured SQL clients. As I've seen in the sources in file /src/backend/utils/adt/float.c, extra_float_digits is used when converting double precision values back to text only. The binary (BIT) representation if the double value's fraction (mantissa) tells me, that the actual double precision value is already wrong. Also, extra_float_digits is constantly zero in my tests. I've only been using pgAdmin III for that and always ensured with SELECT current_setting('extra_float_digits'); that it's zero. Carsten
On 2022-02-16 22:02:02 +0100, Carsten Klein wrote: > On of the virtualized servers was created as a clone of the other one (using > VMware to clone the VM). So, basically, these are very equal. Of course, > they diverged over time. > > Focusing on PostgreSQL, here are the differences of postgresql.conf, > comparing testing system and production system: I don't think these explain the difference. I'd check whether the postgresql binaries and all the the shared libraries are the same. Or - since this only happens on the test system and not on the production system - I'd just clone the production system again to create a new test system and see if the problem happens there, too. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Вложения
AOn Thu, Feb 17, 2022 at 10:27 AM Peter J. Holzer wrote > > I don't think these explain the difference. I'd check whether the > postgresql binaries and all the the shared libraries are the same. > > Or - since this only happens on the test system and not on the production > system - I'd just clone the production system again to create a new test > system and see if the problem happens there, too. > File postgresql and it's loaded libraries are identical. Same set of libraries loaded on all servers; all loaded library have identical MD5 sums. It's the customer's VMware, so I can't just make another clone. However, I'm quite sure that it will work on a newly cloned testing system (since on production system everything is OK). Carsten
On 16.02.22 14:27, Carsten Klein wrote: > AFAIK, this conversion is done by internal function float8in, which, > when called directly, yields the same results: > > SELECT float8in('1.56'); > > --> 1.55999999999999 (wrong!) on one server, and > --> 1.56 (correct!) on all other servers. float8in() really just calls the operating system's strtod() function. I would test that one directly with a small C program. It's also possible that different compiler options lead to different optimizations.
On 18.02.2022 13:28, Peter Eisentraut wrote: > float8in() really just calls the operating system's strtod() function. I > would test that one directly with a small C program. > > It's also possible that different compiler options lead to different > optimizations. That's what I did. Here's my small C program: (nicht lachen *g*) #include <stdio.h> #include <stdlib.h> int main(int argc, char* argv[]) { /* default string to convert */ char buf[10] = "1.56\0"; /* choose and print string to convert */ char* sval = argc > 1 ? argv[1] : buf; printf("string value: %s\n", sval); /* convert and print */ char* ptr; double dval = strtod(sval, &ptr); printf("double value: %.20f\n", dval); return 0; } It works correctly on all these servers. Here's its output: string value: 1.56 double value: 1.56000000000000005329 I didn't test different compiler options. However, PostgreSQL was always installed from official Ubuntu 14.04 repositories (getting the binaries, not the source packages), so all binaries should have been compiled with the same options. Carsten
Carsten Klein <c.klein@datagis.com> writes: > On 18.02.2022 13:28, Peter Eisentraut wrote: >> float8in() really just calls the operating system's strtod() function. I >> would test that one directly with a small C program. > That's what I did. Here's my small C program: (nicht lachen *g*) Yeah, you said that upthread, which makes the whole thing pretty baffling. One possible explanation is that your small program got linked against a different version of libc than what the Postgres backend is using ("ldd" would help you check that, but given the age of the Postgres installation, this seems plausible). Beyond that it's hard to think of any explanation other than hardware fault or corrupted executable. regards, tom lane
On 18.02.2022 16:32, Tom Lane wrote: > Yeah, you said that upthread, which makes the whole thing pretty > baffling. One possible explanation is that your small program got linked > against a different version of libc than what the Postgres backend is > using ("ldd" would help you check that, but given the age of the Postgres > installation, this seems plausible). Beyond that it's hard to think > of any explanation other than hardware fault or corrupted executable. Tom, both PostgreSQL and my C program are linked to the same libc.so.6. Same path, same MD5 sum. Since libc is a Shared Object (so), both processes should really run the identical code. Am I missing something? I've written and compiled the small C program on the same old Ubuntu OS. So, you're not aware of any ways this behavior could be achieved from within PostgreSQL? Something like a custom cast (actually, there is none) or something that could intercept string to double conversion? That would be something to look at closer. The question is: how would you implement such an evil database wide text to double conversion (just to kid users) if you had to? PostgreSQL is up for more than 480 days on that server. I'm thinking of giving a restart of the database a try. However, there's a long running import taking place, so this will not happen before mid or end of next week. Regards, Carsten
Carsten Klein <c.klein@datagis.com> writes: > So, you're not aware of any ways this behavior could be achieved from > within PostgreSQL? Something like a custom cast (actually, there is > none) or something that could intercept string to double conversion? Well, you asserted that these installations are all alike ... but sure, it's conceivable that somebody could've replaced the default numeric -> float8 cast (i.e., numeric_float8()) with some other code that does it a bit inaccurately. There's only a pg_cast entry connecting that function to those types. Have you tried stepping through things with a debugger, to see if numeric_float8 and float8in are actually reached on the problematic system? > That would be something to look at closer. The question is: how would > you implement such an evil database wide text to double conversion (just > to kid users) if you had to? At the SQL level, your example does not involve text -> double. The literal 1.56 is of type numeric. regards, tom lane
On 2022-02-18 18:07:35 +0100, Carsten Klein wrote: > both PostgreSQL and my C program are linked to the same libc.so.6. Same > path, same MD5 sum. Since libc is a Shared Object (so), both processes > should really run the identical code. Am I missing something? I've written > and compiled the small C program on the same old Ubuntu OS. [...] > PostgreSQL is up for more than 480 days on that server. I'm thinking of > giving a restart of the database a try. However, there's a long running > import taking place, so this will not happen before mid or end of next week. If there has been a glibc update (or a postgresql update) in those 480 days (Ubuntu 14.04 is out of its normal maintenance period but ESM is still available) the running processes may well run different code than a newly started program. So it could be a bug which has since been fixed. Another idea: It could be the case that something (maybe a bug in postgres, maybe an extension, maybe even a random bit flip in memory) changed the FP rounding mode within the postgres process, which would affect all FP computations until the rounding mode is reset. That would have to have happened in the master to affect the worker processes as consistently as you are seeing. I don't know if it is even possible for a non-standard rounding mode to persist for any length of time, but if it is it would certainly account for weird rounding errors. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Вложения
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes: > Another idea: It could be the case that something (maybe a bug in > postgres, maybe an extension, maybe even a random bit flip in memory) > changed the FP rounding mode within the postgres process, which would > affect all FP computations until the rounding mode is reset. That would > have to have happened in the master to affect the worker processes as > consistently as you are seeing. I don't know if it is even possible for > a non-standard rounding mode to persist for any length of time, but if > it is it would certainly account for weird rounding errors. Hmm, that is a pretty interesting theory. On a RHEL8 box, I find that fesetround(FE_DOWNWARD) causes strtod("1.56", NULL) to return 1.55999999999999983124 rather than the usual 1.56000000000000005329 which seems to square with Carsten's symptom. Postgres itself contains no fesetround calls, but if you want to believe a random bit flip changed that mode, maybe that'd account for it. It'd certainly be interesting to find out whether the problem persists after a postmaster restart. [ wanders away wondering if the troublesome machine has ECC memory ... ] regards, tom lane
On 18.02.22 22:42, Peter J. Holzer wrote: > If there has been a glibc update (or a postgresql update) in those 480 > days (Ubuntu 14.04 is out of its normal maintenance period but ESM is > still available) the running processes may well run different code than > a newly started program. So it could be a bug which has since been > fixed. That would be visible via `lsof`. `libc. The file `...libc...so` that `postgres` is keeping open would have the text `DEL` (as in deleted) in the `FD` column of `lsof`'s output. As opposed to a newly started program which would have `REG` (regular file) there. *t
On Sat, Feb 19, 2022 at 8:35 AM Tomas Pospisek <tpo2@sourcepole.ch> wrote: > > On 18.02.22 22:42, Peter J. Holzer wrote: > > > If there has been a glibc update (or a postgresql update) in those 480 > > days (Ubuntu 14.04 is out of its normal maintenance period but ESM is > > still available) the running processes may well run different code than > > a newly started program. So it could be a bug which has since been > > fixed. > > That would be visible via `lsof`. `libc. The file `...libc...so` that > `postgres` is keeping open would have the text `DEL` (as in deleted) in > the `FD` column of `lsof`'s output. > > As opposed to a newly started program which would have `REG` (regular > file) there. If this doesn't bear fruit, are there debugging symbols? Setting a breakpoint might produce some insight. merlin
On 19.02.2022 15:35, Tomas Pospisek wrote: > That would be visible via `lsof`. `libc. The file `...libc...so` that > `postgres` is keeping open would have the text `DEL` (as in deleted) in > the `FD` column of `lsof`'s output. > > As opposed to a newly started program which would have `REG` (regular > file) there. > *t Actually, PostgreSLQ uses the same libc than any recently started program. lsof reports REG and my /lib/x86_64-linux-gnu/libc-2.19.so has file date Mar 27, 2019, so it's much older than the point in time when PostgreSQL was started the last time (aprx. 480 days ago). After all, I guess the idea of a wrong rounding setting (set to round down, raised by Peter J. Holzer) seems most plausible to me. In particular, since the "right" and "wrong" values caused by wrong rounding shown by Tom Lane are exactly the values I am seeing. Also, there is a quite aggressive import script running for some days, which even tried to replace some of the core functions, like array_length (it added a pure PL/pgSQL version in public schema). Seems like they wanted to "Polyfill" some required functions. Maybe that script is responsible for changing the process' rounding mode? The customer started that script without thinking too much about it... :( @Tom Lane: you say, PostgreSQL does not / cannot change rounding mode at any time? There is no function to do so? So, the script (it's more like a module) must provide a Shared Object module with a C function in order to change the FPU's rounding mode? Because several people recommended using a debugger in order to see whats going on here: actually I have no expertise with debugging on Linux without an IDE. So I did not yet think of using a debugger so far. I will try a restart of the DB ASAP. However, in order to prove the rounding mode thesis: someone knows whether fesetround(FE_DOWNWARD) just sets a flag in the C runtime environment or does this call actually set the rounding mode in the FPU's control word? Is there any chance to get that current rounding mode with or even without a debugger? (Maybe some file in /proc filesystem?) I'm very unhappy with the customer starting that import script. In order to find the `smoking gun`, I will now have a closer look at the "product" that import is based on and what they have done to the database... Regards, Carsten
Carsten Klein <c.klein@datagis.com> writes: > @Tom Lane: you say, PostgreSQL does not / cannot change rounding mode at > any time? There is no function to do so? So, the script (it's more like > a module) must provide a Shared Object module with a C function in order > to change the FPU's rounding mode? Per grep, there is no call of fesetround() in the Postgres source tree. I'm not sure offhand whether libc exposes any other APIs that could change the rounding mode, but I am quite sure that we wouldn't be intentionally changing it anywhere. > I will try a restart of the DB ASAP. However, in order to prove the > rounding mode thesis: someone knows whether fesetround(FE_DOWNWARD) just > sets a flag in the C runtime environment or does this call actually set > the rounding mode in the FPU's control word? Is there any chance to get > that current rounding mode with or even without a debugger? (Maybe some > file in /proc filesystem?) The OS would surely allow each process to have its own setting of the rounding mode, so I doubt you can see it from outside. Another point to keep in mind is that no matter how invasive that import script might be, it's still hard to explain how it'd affect the rounding mode in other backend processes. You have to postulate either that the rounding mode has been changed in the postmaster process (and then inherited by session backends via fork()), or that some code running at the time of child process creation changes the mode, or that they replaced numeric_float8 with something else. I think the only way that the postmaster's rounding mode could change after postmaster start is the cosmic-ray hypothesis; while we do have features that'd allow loading extra code into the postmaster, I'm pretty sure they only take effect at postmaster start. So even if that import script tried to do that, it wouldn't have succeeded yet. Of the other two hypotheses, "substitute numeric_float8" seems like the most likely, especially given the other stuff you mentioned the script doing. Have you checked the relevant pg_cast entry to see if it's been changed? It'd also be interesting to see if the odd rounding behavior happens in all databases of the cluster or just one. regards, tom lane
On 19.02.2022 20:34 Tom Lane wrote: > Per grep, there is no call of fesetround() in the Postgres source > tree. I'm not sure offhand whether libc exposes any other APIs > that could change the rounding mode, but I am quite sure that we > wouldn't be intentionally changing it anywhere. > > The OS would surely allow each process to have its own setting of the > rounding mode, so I doubt you can see it from outside. > > Another point to keep in mind is that no matter how invasive that > import script might be, it's still hard to explain how it'd affect > the rounding mode in other backend processes. You have to postulate > either that the rounding mode has been changed in the postmaster > process (and then inherited by session backends via fork()), or that > some code running at the time of child process creation changes the > mode, or that they replaced numeric_float8 with something else. > > I think the only way that the postmaster's rounding mode could change > after postmaster start is the cosmic-ray hypothesis; while we do have > features that'd allow loading extra code into the postmaster, I'm > pretty sure they only take effect at postmaster start. So even if > that import script tried to do that, it wouldn't have succeeded yet. > > Of the other two hypotheses, "substitute numeric_float8" seems like > the most likely, especially given the other stuff you mentioned the > script doing. Have you checked the relevant pg_cast entry to see > if it's been changed? It'd also be interesting to see if the odd > rounding behavior happens in all databases of the cluster or just > one. The script has finished! After a restart of the database, everything works as expected again. Rounding as well as text/numeric to double precision works the same on all of my servers. Prior to restarting, I've implemented my own Python based versions of both int fegetround(void); int fesetround(int rounding_mode integer); Have a look a these: CREATE OR REPLACE FUNCTION fegetround() RETURNS integer AS $BODY$ if 'fn.fegetround' in SD: return SD['fn.fegetround']() from ctypes import cdll from ctypes.util import find_library libm = cdll.LoadLibrary(find_library('m')) def fegetround(): return libm.fegetround() SD['fn.fegetround'] = fegetround return SD['fn.fegetround']() $BODY$ LANGUAGE plpython3u VOLATILE COST 100; CREATE OR REPLACE FUNCTION fesetround(rounding_mode integer) RETURNS integer AS $BODY$ if 'fn.fesetround' in SD: return SD['fn.fesetround'](rounding_mode) from ctypes import cdll from ctypes.util import find_library libm = cdll.LoadLibrary(find_library('m')) def fesetround(rounding_mode): return libm.fesetround(rounding_mode) SD['fn.fesetround'] = fesetround return SD['fn.fesetround'](rounding_mode) $BODY$ LANGUAGE plpython3u VOLATILE STRICT COST 100; With those, I was able to proof, that actually the "wrong" rounding mode FE_DOWNWARD (0x400) was in effect for every new process/connection with all the described effects on casting from string or numeric to double precision: SELECT 1.56::double precision -> 1.55999999999999 Setting rounding mode to FE_TONEAREST (0x0), instantly lead back to the expected casting behavior: SELECT 1.56::double precision -> 1.56 Setting rounding mode after restarting the database is still possible, however, new sessions start off with the "correct" rounding mode FE_TONEAREST (0x0). So, the only thing that's really changed after the restart was, that the postmaster now has the "correct" rounding mode, which it promotes down when forking off child processes. We'll likely never know, why ever the postmaster got tainted with that FE_DOWNWARD (0x400) rounding mode. As Tom Lane said, no matter how aggressive the script could be, it can, if at all, only change its current session's rounding mode. So, maybe it actually was a random bit flip or a side effect caused by a quite rare error condition in postmaster. Nearly the same is true for any core functions or casts hijacked by the script - these are only in effect for the database the script was ever connecting to. In my case, the script only used one database. However, the issue was present with any database. Two official math functions to get and set the session's rounding mode provided by PostgreSQL could be a good add-on for any of the next versions of the database. Thinking about it again... maybe that's just too dangerous :-p Finally, many thanks to all that supported me and came up with that many helpful ideas! :-) Regards, Carsten
Carsten Klein <c.klein@datagis.com> writes: > Prior to restarting, I've implemented my own Python based versions of both > int fegetround(void); > int fesetround(int rounding_mode integer); Ah, good idea! > With those, I was able to proof, that actually the "wrong" rounding mode > FE_DOWNWARD (0x400) > was in effect for every new process/connection with all the described > effects on casting from string or numeric to double precision: Thanks for confirming that that was the source of the problem. > We'll likely never know, why ever the postmaster got tainted with that > FE_DOWNWARD (0x400) rounding mode. Indeed. It's hard to see any other explanation than "random bit flip" though. The postmaster is designed to run only a very small amount of code, almost none of which is user-controllable. Even if there were somewhere some code that intended to change the rounding mode, explaining how the postmaster got to that without crashing is a tall order. > Two official math functions to get and set the session's rounding mode > provided by PostgreSQL could be a good add-on for any of the next > versions of the database. Thinking about it again... maybe that's just > too dangerous :-p No, we'd absolutely not ever provide a supported function to change rounding modes. That would require downgrading all float-related functions from IMMUTABLE to STABLE, which would have severe performance consequences, even for people with no interest in changing modes. Thanks for closing out the thread with this info! regards, tom lane