Discussion: Upgrade 10.5->10.6 : db crash BUS ERROR (sig 10), reproducible
Hi,

after upgrading from 10.5 to 10.6.2, database now says this:

Mar 7 13:55:25 <local0.info> edge postgres[1820]: [7-1] :[] LOG: database system was shut down at 2019-03-07 13:43:29 CET
Mar 7 13:55:25 <local0.info> edge postgres[1816]: [7-1] :[] LOG: database system is ready to accept connections
Mar 7 13:58:43 <local0.info> edge postgres[1816]: [8-1] :[] LOG: worker process: parallel worker for PID 3526 (PID 3527) was terminated by signal 10: Bus error
Mar 7 13:58:43 <local0.info> edge postgres[1816]: [9-1] :[] LOG: terminating any other active server processes
Mar 7 13:58:43 <local0.info> edge postgres[1816]: [10-1] :[] LOG: archiver process (PID 1824) exited with exit code 1
Mar 7 13:58:43 <local0.info> edge postgres[1816]: [11-1] :[] LOG: all server processes terminated; reinitializing
Mar 7 13:58:45 <local0.info> edge postgres[3531]: [12-1] :[] LOG: database system was interrupted; last known up at 2019-03-07 13:55:25 CET
Mar 7 13:58:51 <local0.info> edge postgres[3534]: [12-1] [unknown]:[unknown][192.168.98.3(45111)] LOG: connection received: host=192.168.98.3 port=45111
Mar 7 13:58:51 <local0.err> edge postgres[3534]: [13-1] rapppmcf:fin[192.168.98.3(45111)] FATAL: the database system is in recovery mode
Mar 7 13:58:51 <local0.info> edge postgres[3535]: [12-1] [unknown]:[unknown][192.168.98.3(45112)] LOG: connection received: host=192.168.98.3 port=45112
Mar 7 13:58:51 <local0.err> edge postgres[3535]: [13-1] rapppmcf:fin[192.168.98.3(45112)] FATAL: the database system is in recovery mode
Mar 7 13:58:57 <local0.info> edge postgres[3531]: [13-1] :[] LOG: database system was not properly shut down; automatic recovery in progress
Mar 7 13:58:58 <local0.info> edge postgres[3531]: [14-1] :[] LOG: redo starts at 2C/C600008C
Mar 7 13:59:10 <local0.info> edge postgres[1816]: [12-1] :[] LOG: startup process (PID 3531) was terminated by signal 10: Bus error
Mar 7 13:59:10 <local0.info> edge postgres[1816]: [13-1] :[] LOG: aborting startup due to startup process failure
Mar 7 13:59:11 <local0.info> edge postgres[1816]: [14-1] :[] LOG: database system is shut down

This is repeatable. DB starts normally, I start first application fine,
start second application fine, start third application: KABOOM!

Reinstalled 10.5 for now, so it runs again.

Time to read the relnotes:

> When building on i386 with the clang compiler, require -msse2 to be
> used (Andres Freund)
> This avoids problems with missed floating point overflow checks.

What the hell does that mean? Does it concern the build process? Or the
operation? Why does it only concern the Clang? And what is SSE2
concerned with?

Or, is this a strangely cryptic statement, which, after proper
decryption, should actually read:

// "Beginning with the upgrade from 10.5 to 10.6, postgreSQL can no
// longer run on platforms that do not provide SSE2"

???

The point here is, my third application works with lots of floating
point stuff. The other two do not.

Further investigation to followup ASAP.

PMc
>>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:

Peter> Hi,
Peter> after upgrading from 10.5 to 10.6.2, database now says this:

Peter> Mar 7 13:58:43 <local0.info> edge postgres[1816]: [8-1] :[] LOG:
Peter> worker process: parallel worker for PID 3526 (PID 3527) was
Peter> terminated by signal 10: Bus error

I'm assuming from the CC that this is on FreeBSD, but on what
architecture?

Did it drop a core file (look in the data dir for postgres.core) and if
so can you get a backtrace?

Peter> Time to read the relnotes:

>> When building on i386 with the clang compiler, require -msse2 to be
>> used (Andres Freund) This avoids problems with missed floating point
>> overflow checks.

Peter> What the hell does that mean? Does it concern the build process?
Peter> Or the operation? Why does it only concern the Clang? And what
Peter> is SSE2 concerned with?

It concerns only overflow checks in floating-point computations.

Clang's __builtin_isinf(x) function, which is supposed to test if x is
infinite, does not work reliably on i386 when the x87 registers are used
for floating point. It does work if the SSE2 registers are used instead,
which clang will do if the -msse2 option is used. The downside of course
is that the code will no longer run on CPUs that are old enough to vote.

This is only a problem on clang because gcc has other options we can use
to force working infinity tests.

You can check whether your CPU supports SSE2 by looking at the Features=
line in /var/run/dmesg.boot. It seems unlikely that it does not, because
SSE2 was introduced in 2000 with the Pentium 4.

--
Andrew (irc:RhodiumToad)
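The dmesg.boot check suggested above can be scripted. What follows is only a minimal sketch: the two sample excerpts are made up in the shape FreeBSD prints its `Features=` lines (they are not taken from the poster's machine), and the names `cpu_features` and `has_sse2` are hypothetical helpers invented for this illustration.

```python
import re

# Illustrative dmesg.boot excerpts. Made-up samples, NOT from the poster's machine.
SAMPLE_WITHOUT_SSE2 = """\
CPU: hypothetical Pentium III class CPU
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
"""
SAMPLE_WITH_SSE2 = """\
CPU: hypothetical newer CPU
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x1<SSE3>
"""

# Matches "Features=0x...<A,B,C>" as well as "Features2=..." and "AMD Features=...".
FEATURE_LINE = re.compile(r"Features\d*=0x[0-9a-fA-F]+<([^>]*)>")


def cpu_features(dmesg_text):
    """Collect every flag name listed between <...> on the Features lines."""
    flags = set()
    for match in FEATURE_LINE.finditer(dmesg_text):
        flags.update(name.strip() for name in match.group(1).split(",") if name.strip())
    return flags


def has_sse2(dmesg_text):
    """True if the exact flag name SSE2 is listed (SSE or SSE3 alone do not count)."""
    return "SSE2" in cpu_features(dmesg_text)


if __name__ == "__main__":
    print(has_sse2(SAMPLE_WITHOUT_SSE2))  # False
    print(has_sse2(SAMPLE_WITH_SSE2))     # True
    try:
        # Path as given in the message above; present on FreeBSD only.
        with open("/var/run/dmesg.boot", errors="replace") as handle:
            print("this machine:", has_sse2(handle.read()))
    except OSError:
        print("no /var/run/dmesg.boot readable here")
```

Flag names are compared as whole set members rather than by substring search, so that `SSE` or `SSE3` is not mistaken for `SSE2`.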
On 7 March 2019 at 18:20, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:

> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:
>
> Peter> Hi,
> Peter> after upgrading from 10.5 to 10.6.2, database now says this:
>
> Peter> Mar 7 13:58:43 <local0.info> edge postgres[1816]: [8-1] :[] LOG:
> Peter> worker process: parallel worker for PID 3526 (PID 3527) was
> Peter> terminated by signal 10: Bus error
>
> I'm assuming from the CC that this is on FreeBSD, but on what
> architecture?
>
> Did it drop a core file (look in the data dir for postgres.core) and if
> so can you get a backtrace?
>
> Peter> Time to read the relnotes:
>
> >> When building on i386 with the clang compiler, require -msse2 to be
> >> used (Andres Freund) This avoids problems with missed floating point
> >> overflow checks.
>
> Peter> What the hell does that mean? Does it concern the build process?
> Peter> Or the operation? Why does it only concern the Clang? And what
> Peter> is SSE2 concerned with?
>
> It concerns only overflow checks in floating-point computations.
>
> Clang's __builtin_isinf(x) function, which is supposed to test if x is
> infinite, does not work reliably on i386 when the x87 registers are used
> for floating point. It does work if the SSE2 registers are used instead,
> which clang will do if the -msse2 option is used. The downside of course
> is that the code will no longer run on CPUs that are old enough to vote.

For this reason, we build i386 with gcc starting with version 10.6. The
CFLAGS+=--msse2 was not reliable for all CPU:s since not all i386 CPU:s
support SSE2.

We had one report of a user who had the SSE2 flag still on (in
/etc/make.conf) when building, and got the same problem.
[https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236025]

> This is only a problem on clang because gcc has other options we can use
> to force working infinity tests.
>
> You can check whether your CPU supports SSE2 by looking at the Features=
> line in /var/run/dmesg.boot. It seems unlikely that it does not, because
> SSE2 was introduced in 2000 with the Pentium 4.
>
> --
> Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:
> Peter> What the hell does that mean? Does it concern the build process?
> Peter> Or the operation? Why does it only concern the Clang? And what
> Peter> is SSE2 concerned with?

> It concerns only overflow checks in floating-point computations.

It seems pretty unlikely that that'd have anything to do with a
bus-error failure, anyway. But this report contains far too little
information to let anyone do anything but speculate.

regards, tom lane
>>>>> "Palle" == Palle Girgensohn <girgen@FreeBSD.org> writes:

>> Clang's __builtin_isinf(x) function, which is supposed to test if x is
>> infinite, does not work reliably on i386 when the x87 registers are used
>> for floating point. It does work if the SSE2 registers are used instead,
>> which clang will do if the -msse2 option is used. The downside of course
>> is that the code will no longer run on CPUs that are old enough to vote.

Palle> For this reason, we build i386 with gcc starting with version
Palle> 10.6. The CFLAGS+=--msse2 was not reliable for all CPU:s since
Palle> not all i386 CPU:s support SSE2.

Palle> We had one report of a user who had the SSE2 flag still on (in
Palle> /etc/make.conf) when building, and got the same problem.
Palle> [https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236025]

The user in that bug report was using a Pentium 4, which supports SSE2,
so it's not the lack of SSE2 that caused the problem.

Furthermore, the crash was in XLogReadRecord, which does not use floats.

So I'm going to guess that your bug 236025 is actually an alignment
problem, with the compiler making some assumption about alignment that
we're violating. I'll investigate and see what I can find.

--
Andrew.
On 3/7/19 5:30 PM, Andrew Gierth wrote:
>>>>>> "Palle" == Palle Girgensohn <girgen@FreeBSD.org> writes:
>
> >> Clang's __builtin_isinf(x) function, which is supposed to test if x is
> >> infinite, does not work reliably on i386 when the x87 registers are used
> >> for floating point. It does work if the SSE2 registers are used instead,
> >> which clang will do if the -msse2 option is used. The downside of course
> >> is that the code will no longer run on CPUs that are old enough to vote.
>
> Palle> For this reason, we build i386 with gcc starting with version
> Palle> 10.6. The CFLAGS+=--msse2 was not reliable for all CPU:s since
> Palle> not all i386 CPU:s support SSE2.
>
> Palle> We had one report of a user who had the SSE2 flag still on (in
> Palle> /etc/make.conf) when building, and got the same problem.
> Palle> [https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236025]
>
> The user in that bug report was using a Pentium 4, which supports SSE2,
> so it's not the lack of SSE2 that caused the problem.

But if it's compiled for i386 instead of i686?

> Furthermore, the crash was in XLogReadRecord, which does not use floats.
>
> So I'm going to guess that your bug 236025 is actually an alignment
> problem, with the compiler making some assumption about alignment that
> we're violating. I'll investigate and see what I can find.

--
Angular momentum makes the world go 'round.
Hi Tom, Andrew,

much thanks for the replies! Alright, let's fill in some concrete data:

> I'm assuming from the CC that this is on FreeBSD, but on what
> architecture?

When on my evening errands I recognized that I should have mentioned
this - FreeBSD is correct; it is built on amd64 for i386, and run on
i386.

Version: FreeBSD 11.2-RELEASE-p9 #0 r343946M#C51:82

Build-Options:
OPTIONS_FILE_UNSET+=DEBUG
OPTIONS_FILE_UNSET+=DOCS
OPTIONS_FILE_UNSET+=DTRACE
OPTIONS_FILE_SET+=GSSAPI
OPTIONS_FILE_SET+=INTDATE
OPTIONS_FILE_UNSET+=LDAP
OPTIONS_FILE_SET+=NLS
OPTIONS_FILE_UNSET+=OPTIMIZED_CFLAGS
OPTIONS_FILE_UNSET+=PAM
OPTIONS_FILE_SET+=SSL
OPTIONS_FILE_SET+=TZDATA
OPTIONS_FILE_SET+=XML

Extra Compiler-Options: -march=pentium3

Init-Options: --data-checksums --encoding=utf-8 --lc-collate=de_DE.UTF-8 --lc-ctype=de_DE.UTF-8 --lc-messages=en_US.UTF-8 --lc-monetary=en_US.UTF-8 --lc-numeric=en_US.UTF-8 --lc-time=en_US.UTF-8

Run-Options: -w -m fast -o --config_file=/usr/local/etc/postgresql/postgresql.conf

Furthermore, FreeBSD did impose a change for R. 10.6: it forces the use
of gcc on i386 (gcc-8 in this case). Earlier versions were built with
system compiler Clang. The commitlog says this about the matter:

! r484807 | girgen | 2018-11-12 16:54:19 +0100 (Mon, 12 Nov 2018) | 5 lines
!
! Fix build problems on i386
!
! Use GCC seems to be proper way to do it. SSE2 would not be available
! for all CPU:s.

> Did it drop a core file (look in the data dir for postgres.core) and if
> so can you get a backtrace?

Looking... yes, there is a core. Let's grab a first-fault core, as that
one obviously is from the failed recover:

! (gdb) core postgres.core.1st
! Core was generated by `postgres: bgworker: parallel worker for PID 68755 '.
! Program terminated with signal 10, Bus error.
! Reading symbols from <etc etc>
! #0 0x0838bdf2 in pg_checksum_page ()
! (gdb) bt
! #0 0x0838bdf2 in pg_checksum_page ()
! #1 0x0838a2b8 in PageIsVerified ()
! #2 0x5a824500 in ?? ()
! #3 0x00000000 in ?? ()

The second one looks this way:

! (gdb) core postgres.core
! Core was generated by `postgres: startup process recovering 000000010000002C000000C6'.
! Program terminated with signal 10, Bus error.
! Reading symbols from <lots of files>
! #0 0x0838bdf2 in pg_checksum_page ()
! (gdb) bt
! #0 0x0838bdf2 in pg_checksum_page ()
! #1 0x0838a2b8 in PageIsVerified ()
! #2 0x59e14500 in ?? ()
! #3 0x00000000 in ?? ()

Anything more I can do here? (Advice on how to build with debugging
support is appreciated.)

> You can check whether your CPU supports SSE2 by looking at the Features=
> line in /var/run/dmesg.boot. It seems unlikely that it does not, because
> SSE2 was introduced in 2000 with the Pentium 4.

No need to check; I am absolutely certain that it does NOT.
https://www.asus.com/supportonly/CUV4X-DLS/HelpDesk_CPU/

But, Your explanation seems not to answer the fundamental question: if
the database at 10.6 is still supposed to be able to run without SSE2?

> It seems pretty unlikely that that'd have anything to do with a
> bus-error failure, anyway. But this report contains far too little
> information to let anyone do anything but speculate.

Whatever information You like to have, just ask and I will gladly do my
best to obtain it, as I get around. (This is a reproducible on a very
well maintained piece of software - this is rather fun.)

Some more experiments & observations:

The crash happens at a specific query - I get parse, bind, but no
execute timing.

Furthermore, when I try and set

! max_parallel_workers_per_gather = 0

then the query goes thru and delivers proper results. But then after a
few minutes I get this one:

! postgres[71256]: [8-1] :[] LOG: 00000: checkpointer process (PID 71258)
! was terminated by signal 10: Bus error

Different approach, same result:

! dynamic_shared_memory_type = posix -> crash immediate
! dynamic_shared_memory_type = sysv -> crash immediate
! dynamic_shared_memory_type = mmap -> crash immediate
! dynamic_shared_memory_type = none -> crash later in checkpointer

regards,
PMc
>>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:

>> You can check whether your CPU supports SSE2 by looking at the Features=
>> line in /var/run/dmesg.boot. It seems unlikely that it does not, because
>> SSE2 was introduced in 2000 with the Pentium 4.

Peter> No need to check; I am absolutely certain that it does NOT.
Peter> https://www.asus.com/supportonly/CUV4X-DLS/HelpDesk_CPU/

Peter> But, Your explanation seems not to answer the fundamental
Peter> question: if the database at 10.6 is still supposed to be able
Peter> to run without SSE2?

Yes, the database is supposed to be able to run without SSE2, as long as
it is built with gcc and not clang, and without any architecture flags
that imply SSE2 support.

I'm pretty sure nothing in our buildfarm is i386 without SSE2 though.

Peter> Whatever information You like to have, just ask and I will
Peter> gladly do my best to obtain it, as I get around. (This is a
Peter> reproducible on a very well maintained piece of software - this
Peter> is rather fun.)

Your backtrace implies that you are running with checksums enabled;
true?

You should be able to build the port with debugging enabled by setting
WITH_DEBUG=1 in the environment or on the make command line. (I have not
yet tried this myself - I rarely build from the port.)

Peter> The crash happens at a specific query - I get parse,bind, but no
Peter> execute timing.

What is the exact query?

I'm doing some investigations of my own, I may have more questions
later.

--
Andrew (irc:RhodiumToad)
>>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:

Peter> Looking... yes, there is a core. Let's grab a first-fault core,
Peter> as that one obviously is from the failed recover:

Peter> ! (gdb) core postgres.core.1st
Peter> ! Core was generated by `postgres: bgworker: parallel worker for PID 68755 '.
Peter> ! Program terminated with signal 10, Bus error.
Peter> ! Reading symbols from <etc etc>
Peter> ! #0 0x0838bdf2 in pg_checksum_page ()
Peter> ! (gdb) bt
Peter> ! #0 0x0838bdf2 in pg_checksum_page ()
Peter> ! #1 0x0838a2b8 in PageIsVerified ()
Peter> ! #2 0x5a824500 in ?? ()
Peter> ! #3 0x00000000 in ?? ()

Can you do the command 'info reg' on this core, and also the command
'disass pg_checksum_page'

--
Andrew (irc:RhodiumToad)
On Fri, Mar 08, 2019 at 02:48:12AM +0000, Andrew Gierth wrote: ! >>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes: ! ! Peter> Looking... yes, there is a core. Lets grab a first-fault core, ! Peter> as that one obviousely is from the failed recover: ! ! Peter> ! (gdb) core postgres.core.1st ! Peter> ! Core was generated by `postgres: bgworker: parallel worker for PID 68755 '. ! Peter> ! Program terminated with signal 10, Bus error. ! Peter> ! Reading symbols from <etc etc> ! Peter> ! #0 0x0838bdf2 in pg_checksum_page () ! Peter> ! (gdb) bt ! Peter> ! #0 0x0838bdf2 in pg_checksum_page () ! Peter> ! #1 0x0838a2b8 in PageIsVerified () ! Peter> ! #2 0x5a824500 in ?? () ! Peter> ! #3 0x00000000 in ?? () ! ! Can you do the command 'info reg' on this core, and also the command ! 'disass pg_checksum_page' Okay - made a new one, slightly different backtrace: #0 0x0838bdf2 in pg_checksum_page () (gdb) bt #0 0x0838bdf2 in pg_checksum_page () #1 0x0838a2b8 in PageIsVerified () #2 0x5a538500 in ?? () #3 0x00000000 in ?? 
() (gdb) info reg eax 0x7fbfc1ec 2143273452 ecx 0x7f347fc4 2134147012 edx 0x2b5d7d81 727547265 ebx 0x2b5d3ca2 727530658 esp 0x7fbfc15c 0x7fbfc15c ebp 0x7f344cd0 0x7f344cd0 esi 0x5a538500 1515422976 edi 0x7fbfc1ec 2143273452 eip 0x838bdf2 0x838bdf2 eflags 0x210246 2163270 cs 0x33 51 ss 0x3b 59 ds 0x3b 59 es 0x3b 59 fs 0x3b 59 gs 0x1b 27 Dump of assembler code for function pg_checksum_page: 0x0838ba90 <pg_checksum_page+0>: push %ebp 0x0838ba91 <pg_checksum_page+1>: push %edi 0x0838ba92 <pg_checksum_page+2>: push %esi 0x0838ba93 <pg_checksum_page+3>: push %ebx 0x0838ba94 <pg_checksum_page+4>: xor %ebx,%ebx 0x0838ba96 <pg_checksum_page+6>: sub $0x9c,%esp 0x0838ba9c <pg_checksum_page+12>: mov 0xb0(%esp),%esi 0x0838baa3 <pg_checksum_page+19>: movzwl 0x8(%esi),%eax 0x0838baa7 <pg_checksum_page+23>: movw $0x0,0x8(%esi) 0x0838baad <pg_checksum_page+29>: mov %eax,0x8(%esp) 0x0838bab1 <pg_checksum_page+33>: mov 0x86122a0(%ebx),%edx 0x0838bab7 <pg_checksum_page+39>: mov 0x86122a4(%ebx),%ecx 0x0838babd <pg_checksum_page+45>: mov 0x86122a8(%ebx),%edi 0x0838bac3 <pg_checksum_page+51>: mov 0x86122ac(%ebx),%ebp 0x0838bac9 <pg_checksum_page+57>: mov %edx,0x10(%esp,%ebx,1) 0x0838bacd <pg_checksum_page+61>: mov 0x86122b0(%ebx),%eax 0x0838bad3 <pg_checksum_page+67>: mov %ecx,0x14(%esp,%ebx,1) 0x0838bad7 <pg_checksum_page+71>: mov 0x86122b4(%ebx),%edx 0x0838badd <pg_checksum_page+77>: mov %edi,0x18(%esp,%ebx,1) 0x0838bae1 <pg_checksum_page+81>: mov 0x86122b8(%ebx),%ecx 0x0838bae7 <pg_checksum_page+87>: mov %ebp,0x1c(%esp,%ebx,1) 0x0838baeb <pg_checksum_page+91>: mov 0x86122bc(%ebx),%edi 0x0838baf1 <pg_checksum_page+97>: mov %eax,0x20(%esp,%ebx,1) 0x0838baf5 <pg_checksum_page+101>: mov %edx,0x24(%esp,%ebx,1) 0x0838baf9 <pg_checksum_page+105>: mov %ecx,0x28(%esp,%ebx,1) 0x0838bafd <pg_checksum_page+109>: mov %edi,0x2c(%esp,%ebx,1) 0x0838bb01 <pg_checksum_page+113>: add $0x20,%ebx 0x0838bb04 <pg_checksum_page+116>: cmp $0x80,%ebx 0x0838bb0a <pg_checksum_page+122>: jb 0x838bab1 
<pg_checksum_page+33> 0x0838bb0c <pg_checksum_page+124>: lea 0x2000(%esi),%ebx 0x0838bb12 <pg_checksum_page+130>: mov %esi,%edx 0x0838bb14 <pg_checksum_page+132>: mov %ebx,0xc(%esp) 0x0838bb18 <pg_checksum_page+136>: lea 0x10(%esp),%ecx 0x0838bb1c <pg_checksum_page+140>: lea 0x0(%esi),%esi 0x0838bb20 <pg_checksum_page+144>: xor %eax,%eax 0x0838bb22 <pg_checksum_page+146>: mov (%ecx,%eax,4),%ebp 0x0838bb25 <pg_checksum_page+149>: mov (%edx,%eax,4),%edi 0x0838bb28 <pg_checksum_page+152>: xor %edi,%ebp 0x0838bb2a <pg_checksum_page+154>: imul $0x1000193,%ebp,%ebx 0x0838bb30 <pg_checksum_page+160>: shr $0x11,%ebp 0x0838bb33 <pg_checksum_page+163>: lea 0x1(%eax),%edi 0x0838bb36 <pg_checksum_page+166>: xor %ebx,%ebp 0x0838bb38 <pg_checksum_page+168>: mov %ebp,(%ecx,%eax,4) 0x0838bb3b <pg_checksum_page+171>: mov (%edx,%edi,4),%ebp 0x0838bb3e <pg_checksum_page+174>: mov (%ecx,%edi,4),%ebx 0x0838bb41 <pg_checksum_page+177>: xor %ebp,%ebx 0x0838bb43 <pg_checksum_page+179>: imul $0x1000193,%ebx,%ebp 0x0838bb49 <pg_checksum_page+185>: shr $0x11,%ebx 0x0838bb4c <pg_checksum_page+188>: xor %ebp,%ebx 0x0838bb4e <pg_checksum_page+190>: mov %ebx,(%ecx,%edi,4) 0x0838bb51 <pg_checksum_page+193>: lea 0x2(%eax),%edi 0x0838bb54 <pg_checksum_page+196>: mov (%ecx,%edi,4),%ebx 0x0838bb57 <pg_checksum_page+199>: mov (%edx,%edi,4),%ebp 0x0838bb5a <pg_checksum_page+202>: xor %ebp,%ebx 0x0838bb5c <pg_checksum_page+204>: imul $0x1000193,%ebx,%ebp 0x0838bb62 <pg_checksum_page+210>: shr $0x11,%ebx 0x0838bb65 <pg_checksum_page+213>: xor %ebp,%ebx 0x0838bb67 <pg_checksum_page+215>: mov %ebx,(%ecx,%edi,4) 0x0838bb6a <pg_checksum_page+218>: lea 0x3(%eax),%edi 0x0838bb6d <pg_checksum_page+221>: mov (%ecx,%edi,4),%ebx 0x0838bb70 <pg_checksum_page+224>: mov (%edx,%edi,4),%ebp 0x0838bb73 <pg_checksum_page+227>: xor %ebp,%ebx 0x0838bb75 <pg_checksum_page+229>: imul $0x1000193,%ebx,%ebp 0x0838bb7b <pg_checksum_page+235>: shr $0x11,%ebx 0x0838bb7e <pg_checksum_page+238>: xor %ebp,%ebx 0x0838bb80 
<pg_checksum_page+240>: mov %ebx,(%ecx,%edi,4) 0x0838bb83 <pg_checksum_page+243>: lea 0x4(%eax),%edi 0x0838bb86 <pg_checksum_page+246>: mov (%ecx,%edi,4),%ebx 0x0838bb89 <pg_checksum_page+249>: mov (%edx,%edi,4),%ebp 0x0838bb8c <pg_checksum_page+252>: xor %ebp,%ebx 0x0838bb8e <pg_checksum_page+254>: imul $0x1000193,%ebx,%ebp 0x0838bb94 <pg_checksum_page+260>: shr $0x11,%ebx 0x0838bb97 <pg_checksum_page+263>: xor %ebp,%ebx 0x0838bb99 <pg_checksum_page+265>: mov %ebx,(%ecx,%edi,4) 0x0838bb9c <pg_checksum_page+268>: lea 0x5(%eax),%edi 0x0838bb9f <pg_checksum_page+271>: mov (%ecx,%edi,4),%ebx 0x0838bba2 <pg_checksum_page+274>: mov (%edx,%edi,4),%ebp 0x0838bba5 <pg_checksum_page+277>: xor %ebp,%ebx 0x0838bba7 <pg_checksum_page+279>: imul $0x1000193,%ebx,%ebp 0x0838bbad <pg_checksum_page+285>: shr $0x11,%ebx 0x0838bbb0 <pg_checksum_page+288>: xor %ebp,%ebx 0x0838bbb2 <pg_checksum_page+290>: mov %ebx,(%ecx,%edi,4) 0x0838bbb5 <pg_checksum_page+293>: lea 0x6(%eax),%edi 0x0838bbb8 <pg_checksum_page+296>: mov (%ecx,%edi,4),%ebx 0x0838bbbb <pg_checksum_page+299>: mov (%edx,%edi,4),%ebp 0x0838bbbe <pg_checksum_page+302>: xor %ebp,%ebx 0x0838bbc0 <pg_checksum_page+304>: imul $0x1000193,%ebx,%ebp 0x0838bbc6 <pg_checksum_page+310>: shr $0x11,%ebx 0x0838bbc9 <pg_checksum_page+313>: xor %ebp,%ebx 0x0838bbcb <pg_checksum_page+315>: mov %ebx,(%ecx,%edi,4) 0x0838bbce <pg_checksum_page+318>: lea 0x7(%eax),%edi 0x0838bbd1 <pg_checksum_page+321>: mov (%ecx,%edi,4),%ebx 0x0838bbd4 <pg_checksum_page+324>: mov (%edx,%edi,4),%ebp 0x0838bbd7 <pg_checksum_page+327>: xor %ebp,%ebx 0x0838bbd9 <pg_checksum_page+329>: imul $0x1000193,%ebx,%ebp 0x0838bbdf <pg_checksum_page+335>: shr $0x11,%ebx 0x0838bbe2 <pg_checksum_page+338>: xor %ebp,%ebx 0x0838bbe4 <pg_checksum_page+340>: add $0x8,%eax 0x0838bbe7 <pg_checksum_page+343>: mov %ebx,(%ecx,%edi,4) ---Type <return> to continue, or q <return> to quit--- 0x0838bbea <pg_checksum_page+346>: cmp $0x20,%eax 0x0838bbed <pg_checksum_page+349>: jne 0x838bb22 
<pg_checksum_page+146> 0x0838bbf3 <pg_checksum_page+355>: sub $0xffffff80,%edx 0x0838bbf6 <pg_checksum_page+358>: cmp 0xc(%esp),%edx 0x0838bbfa <pg_checksum_page+362>: jne 0x838bb20 <pg_checksum_page+144> 0x0838bc00 <pg_checksum_page+368>: mov %ecx,%eax 0x0838bc02 <pg_checksum_page+370>: mov %ecx,%edx 0x0838bc04 <pg_checksum_page+372>: lea 0x90(%esp),%edi 0x0838bc0b <pg_checksum_page+379>: mov (%edx),%ebx 0x0838bc0d <pg_checksum_page+381>: add $0x20,%edx 0x0838bc10 <pg_checksum_page+384>: imul $0x1000193,%ebx,%ebp 0x0838bc16 <pg_checksum_page+390>: shr $0x11,%ebx 0x0838bc19 <pg_checksum_page+393>: xor %ebp,%ebx 0x0838bc1b <pg_checksum_page+395>: mov %ebx,-0x20(%edx) 0x0838bc1e <pg_checksum_page+398>: mov -0x1c(%edx),%ebx 0x0838bc21 <pg_checksum_page+401>: imul $0x1000193,%ebx,%ebp 0x0838bc27 <pg_checksum_page+407>: shr $0x11,%ebx 0x0838bc2a <pg_checksum_page+410>: xor %ebp,%ebx 0x0838bc2c <pg_checksum_page+412>: mov %ebx,-0x1c(%edx) 0x0838bc2f <pg_checksum_page+415>: mov -0x18(%edx),%ebx 0x0838bc32 <pg_checksum_page+418>: imul $0x1000193,%ebx,%ebp 0x0838bc38 <pg_checksum_page+424>: shr $0x11,%ebx 0x0838bc3b <pg_checksum_page+427>: xor %ebp,%ebx 0x0838bc3d <pg_checksum_page+429>: mov %ebx,-0x18(%edx) 0x0838bc40 <pg_checksum_page+432>: mov -0x14(%edx),%ebx 0x0838bc43 <pg_checksum_page+435>: imul $0x1000193,%ebx,%ebp 0x0838bc49 <pg_checksum_page+441>: shr $0x11,%ebx 0x0838bc4c <pg_checksum_page+444>: xor %ebp,%ebx 0x0838bc4e <pg_checksum_page+446>: mov %ebx,-0x14(%edx) 0x0838bc51 <pg_checksum_page+449>: mov -0x10(%edx),%ebx 0x0838bc54 <pg_checksum_page+452>: imul $0x1000193,%ebx,%ebp 0x0838bc5a <pg_checksum_page+458>: shr $0x11,%ebx 0x0838bc5d <pg_checksum_page+461>: xor %ebp,%ebx 0x0838bc5f <pg_checksum_page+463>: mov %ebx,-0x10(%edx) 0x0838bc62 <pg_checksum_page+466>: mov -0xc(%edx),%ebx 0x0838bc65 <pg_checksum_page+469>: imul $0x1000193,%ebx,%ebp 0x0838bc6b <pg_checksum_page+475>: shr $0x11,%ebx 0x0838bc6e <pg_checksum_page+478>: xor %ebp,%ebx 0x0838bc70 
<pg_checksum_page+480>: mov %ebx,-0xc(%edx) 0x0838bc73 <pg_checksum_page+483>: mov -0x8(%edx),%ebx 0x0838bc76 <pg_checksum_page+486>: imul $0x1000193,%ebx,%ebp 0x0838bc7c <pg_checksum_page+492>: shr $0x11,%ebx 0x0838bc7f <pg_checksum_page+495>: xor %ebp,%ebx 0x0838bc81 <pg_checksum_page+497>: mov %ebx,-0x8(%edx) 0x0838bc84 <pg_checksum_page+500>: mov -0x4(%edx),%ebx 0x0838bc87 <pg_checksum_page+503>: imul $0x1000193,%ebx,%ebp 0x0838bc8d <pg_checksum_page+509>: shr $0x11,%ebx 0x0838bc90 <pg_checksum_page+512>: xor %ebp,%ebx 0x0838bc92 <pg_checksum_page+514>: mov %ebx,-0x4(%edx) 0x0838bc95 <pg_checksum_page+517>: cmp %edx,%edi 0x0838bc97 <pg_checksum_page+519>: jne 0x838bc0b <pg_checksum_page+379> 0x0838bc9d <pg_checksum_page+525>: mov %edi,%edx 0x0838bc9f <pg_checksum_page+527>: sub %ecx,%edx 0x0838bca1 <pg_checksum_page+529>: sub $0x4,%edx 0x0838bca4 <pg_checksum_page+532>: shr $0x2,%edx 0x0838bca7 <pg_checksum_page+535>: inc %edx 0x0838bca8 <pg_checksum_page+536>: and $0x7,%edx 0x0838bcab <pg_checksum_page+539>: je 0x838bd60 <pg_checksum_page+720> 0x0838bcb1 <pg_checksum_page+545>: cmp $0x1,%edx 0x0838bcb4 <pg_checksum_page+548>: je 0x838bd44 <pg_checksum_page+692> 0x0838bcba <pg_checksum_page+554>: cmp $0x2,%edx 0x0838bcbd <pg_checksum_page+557>: je 0x838bd31 <pg_checksum_page+673> 0x0838bcbf <pg_checksum_page+559>: cmp $0x3,%edx 0x0838bcc2 <pg_checksum_page+562>: je 0x838bd1e <pg_checksum_page+654> 0x0838bcc4 <pg_checksum_page+564>: cmp $0x4,%edx 0x0838bcc7 <pg_checksum_page+567>: je 0x838bd0b <pg_checksum_page+635> 0x0838bcc9 <pg_checksum_page+569>: cmp $0x5,%edx 0x0838bccc <pg_checksum_page+572>: je 0x838bcf8 <pg_checksum_page+616> 0x0838bcce <pg_checksum_page+574>: cmp $0x6,%edx 0x0838bcd1 <pg_checksum_page+577>: je 0x838bce5 <pg_checksum_page+597> 0x0838bcd3 <pg_checksum_page+579>: mov (%ecx),%eax 0x0838bcd5 <pg_checksum_page+581>: imul $0x1000193,%eax,%ebx 0x0838bcdb <pg_checksum_page+587>: shr $0x11,%eax 0x0838bcde <pg_checksum_page+590>: xor %ebx,%eax 
0x0838bce0 <pg_checksum_page+592>: mov %eax,(%ecx) 0x0838bce2 <pg_checksum_page+594>: lea 0x4(%ecx),%eax 0x0838bce5 <pg_checksum_page+597>: mov (%eax),%ebp 0x0838bce7 <pg_checksum_page+599>: add $0x4,%eax 0x0838bcea <pg_checksum_page+602>: imul $0x1000193,%ebp,%ecx 0x0838bcf0 <pg_checksum_page+608>: shr $0x11,%ebp 0x0838bcf3 <pg_checksum_page+611>: xor %ecx,%ebp 0x0838bcf5 <pg_checksum_page+613>: mov %ebp,-0x4(%eax) 0x0838bcf8 <pg_checksum_page+616>: mov (%eax),%edx 0x0838bcfa <pg_checksum_page+618>: add $0x4,%eax 0x0838bcfd <pg_checksum_page+621>: imul $0x1000193,%edx,%ebx 0x0838bd03 <pg_checksum_page+627>: shr $0x11,%edx 0x0838bd06 <pg_checksum_page+630>: xor %ebx,%edx 0x0838bd08 <pg_checksum_page+632>: mov %edx,-0x4(%eax) 0x0838bd0b <pg_checksum_page+635>: mov (%eax),%ebp 0x0838bd0d <pg_checksum_page+637>: add $0x4,%eax 0x0838bd10 <pg_checksum_page+640>: imul $0x1000193,%ebp,%ecx 0x0838bd16 <pg_checksum_page+646>: shr $0x11,%ebp 0x0838bd19 <pg_checksum_page+649>: xor %ecx,%ebp 0x0838bd1b <pg_checksum_page+651>: mov %ebp,-0x4(%eax) 0x0838bd1e <pg_checksum_page+654>: mov (%eax),%edx 0x0838bd20 <pg_checksum_page+656>: add $0x4,%eax 0x0838bd23 <pg_checksum_page+659>: imul $0x1000193,%edx,%ebx 0x0838bd29 <pg_checksum_page+665>: shr $0x11,%edx 0x0838bd2c <pg_checksum_page+668>: xor %ebx,%edx 0x0838bd2e <pg_checksum_page+670>: mov %edx,-0x4(%eax) ---Type <return> to continue, or q <return> to quit--- 0x0838bd31 <pg_checksum_page+673>: mov (%eax),%ebp 0x0838bd33 <pg_checksum_page+675>: add $0x4,%eax 0x0838bd36 <pg_checksum_page+678>: imul $0x1000193,%ebp,%ecx 0x0838bd3c <pg_checksum_page+684>: shr $0x11,%ebp 0x0838bd3f <pg_checksum_page+687>: xor %ecx,%ebp 0x0838bd41 <pg_checksum_page+689>: mov %ebp,-0x4(%eax) 0x0838bd44 <pg_checksum_page+692>: mov (%eax),%edx 0x0838bd46 <pg_checksum_page+694>: add $0x4,%eax 0x0838bd49 <pg_checksum_page+697>: imul $0x1000193,%edx,%ebx 0x0838bd4f <pg_checksum_page+703>: shr $0x11,%edx 0x0838bd52 <pg_checksum_page+706>: xor %ebx,%edx 
0x0838bd54 <pg_checksum_page+708>: mov %edx,-0x4(%eax) 0x0838bd57 <pg_checksum_page+711>: cmp %edi,%eax 0x0838bd59 <pg_checksum_page+713>: je 0x838bdf2 <pg_checksum_page+866> 0x0838bd5f <pg_checksum_page+719>: nop 0x0838bd60 <pg_checksum_page+720>: mov (%eax),%ebp 0x0838bd62 <pg_checksum_page+722>: add $0x20,%eax 0x0838bd65 <pg_checksum_page+725>: mov -0x1c(%eax),%edx 0x0838bd68 <pg_checksum_page+728>: imul $0x1000193,%ebp,%ecx 0x0838bd6e <pg_checksum_page+734>: imul $0x1000193,%edx,%ebx 0x0838bd74 <pg_checksum_page+740>: shr $0x11,%ebp 0x0838bd77 <pg_checksum_page+743>: shr $0x11,%edx 0x0838bd7a <pg_checksum_page+746>: xor %ecx,%ebp 0x0838bd7c <pg_checksum_page+748>: mov %ebp,-0x20(%eax) 0x0838bd7f <pg_checksum_page+751>: xor %ebx,%edx 0x0838bd81 <pg_checksum_page+753>: mov -0x18(%eax),%ebp 0x0838bd84 <pg_checksum_page+756>: mov %edx,-0x1c(%eax) 0x0838bd87 <pg_checksum_page+759>: mov -0x14(%eax),%edx 0x0838bd8a <pg_checksum_page+762>: imul $0x1000193,%ebp,%ecx 0x0838bd90 <pg_checksum_page+768>: imul $0x1000193,%edx,%ebx 0x0838bd96 <pg_checksum_page+774>: shr $0x11,%ebp 0x0838bd99 <pg_checksum_page+777>: shr $0x11,%edx 0x0838bd9c <pg_checksum_page+780>: xor %ecx,%ebp 0x0838bd9e <pg_checksum_page+782>: mov %ebp,-0x18(%eax) 0x0838bda1 <pg_checksum_page+785>: xor %ebx,%edx 0x0838bda3 <pg_checksum_page+787>: mov -0x10(%eax),%ebp 0x0838bda6 <pg_checksum_page+790>: mov %edx,-0x14(%eax) 0x0838bda9 <pg_checksum_page+793>: mov -0xc(%eax),%edx 0x0838bdac <pg_checksum_page+796>: imul $0x1000193,%ebp,%ecx 0x0838bdb2 <pg_checksum_page+802>: imul $0x1000193,%edx,%ebx 0x0838bdb8 <pg_checksum_page+808>: shr $0x11,%ebp 0x0838bdbb <pg_checksum_page+811>: shr $0x11,%edx 0x0838bdbe <pg_checksum_page+814>: xor %ecx,%ebp 0x0838bdc0 <pg_checksum_page+816>: mov %ebp,-0x10(%eax) 0x0838bdc3 <pg_checksum_page+819>: xor %ebx,%edx 0x0838bdc5 <pg_checksum_page+821>: mov -0x8(%eax),%ebp 0x0838bdc8 <pg_checksum_page+824>: mov %edx,-0xc(%eax) 0x0838bdcb <pg_checksum_page+827>: mov -0x4(%eax),%edx 
0x0838bdce <pg_checksum_page+830>: imul $0x1000193,%ebp,%ecx 0x0838bdd4 <pg_checksum_page+836>: imul $0x1000193,%edx,%ebx 0x0838bdda <pg_checksum_page+842>: shr $0x11,%ebp 0x0838bddd <pg_checksum_page+845>: shr $0x11,%edx 0x0838bde0 <pg_checksum_page+848>: xor %ecx,%ebp 0x0838bde2 <pg_checksum_page+850>: mov %ebp,-0x8(%eax) 0x0838bde5 <pg_checksum_page+853>: xor %ebx,%edx 0x0838bde7 <pg_checksum_page+855>: mov %edx,-0x4(%eax) 0x0838bdea <pg_checksum_page+858>: cmp %edi,%eax 0x0838bdec <pg_checksum_page+860>: jne 0x838bd60 <pg_checksum_page+720> 0x0838bdf2 <pg_checksum_page+866>: movaps 0x20(%esp),%xmm0 0x0838bdf7 <pg_checksum_page+871>: mov $0x80008001,%ebx 0x0838bdfc <pg_checksum_page+876>: xorps 0x10(%esp),%xmm0 0x0838be01 <pg_checksum_page+881>: xorps 0x30(%esp),%xmm0 0x0838be06 <pg_checksum_page+886>: xorps 0x40(%esp),%xmm0 0x0838be0b <pg_checksum_page+891>: xorps 0x50(%esp),%xmm0 0x0838be10 <pg_checksum_page+896>: xorps 0x60(%esp),%xmm0 0x0838be15 <pg_checksum_page+901>: xorps 0x70(%esp),%xmm0 0x0838be1a <pg_checksum_page+906>: xorps 0x80(%esp),%xmm0 0x0838be22 <pg_checksum_page+914>: mov 0x8(%esp),%edi 0x0838be26 <pg_checksum_page+918>: movaps %xmm0,%xmm1 0x0838be29 <pg_checksum_page+921>: mov 0xb4(%esp),%ebp 0x0838be30 <pg_checksum_page+928>: shufps $0x55,%xmm0,%xmm1 0x0838be34 <pg_checksum_page+932>: movaps %xmm0,%xmm2 0x0838be37 <pg_checksum_page+935>: movss %xmm1,0x8(%esp) 0x0838be3d <pg_checksum_page+941>: unpckhps %xmm0,%xmm2 0x0838be40 <pg_checksum_page+944>: mov %di,0x8(%esi) 0x0838be44 <pg_checksum_page+948>: mov 0x8(%esp),%esi 0x0838be48 <pg_checksum_page+952>: movss %xmm0,0x8(%esp) 0x0838be4e <pg_checksum_page+958>: shufps $0xff,%xmm0,%xmm0 0x0838be52 <pg_checksum_page+962>: mov 0x8(%esp),%eax 0x0838be56 <pg_checksum_page+966>: movss %xmm2,0x8(%esp) 0x0838be5c <pg_checksum_page+972>: mov 0x8(%esp),%ecx 0x0838be60 <pg_checksum_page+976>: movss %xmm0,0x8(%esp) 0x0838be66 <pg_checksum_page+982>: xor %eax,%esi 0x0838be68 <pg_checksum_page+984>: xor 
%ebp,%esi 0x0838be6a <pg_checksum_page+986>: mov 0x8(%esp),%edx 0x0838be6e <pg_checksum_page+990>: add $0x9c,%esp 0x0838be74 <pg_checksum_page+996>: xor %ecx,%esi 0x0838be76 <pg_checksum_page+998>: xor %edx,%esi 0x0838be78 <pg_checksum_page+1000>: mov %esi,%eax 0x0838be7a <pg_checksum_page+1002>: mul %ebx 0x0838be7c <pg_checksum_page+1004>: pop %ebx 0x0838be7d <pg_checksum_page+1005>: shr $0xf,%edx 0x0838be80 <pg_checksum_page+1008>: mov %edx,%edi 0x0838be82 <pg_checksum_page+1010>: shl $0x10,%edi 0x0838be85 <pg_checksum_page+1013>: sub %edx,%edi 0x0838be87 <pg_checksum_page+1015>: sub %edi,%esi 0x0838be89 <pg_checksum_page+1017>: mov %esi,%eax 0x0838be8b <pg_checksum_page+1019>: pop %esi 0x0838be8c <pg_checksum_page+1020>: inc %eax 0x0838be8d <pg_checksum_page+1021>: pop %edi 0x0838be8e <pg_checksum_page+1022>: pop %ebp 0x0838be8f <pg_checksum_page+1023>: ret End of assembler dump.
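For readers who do not want to trace the listing by hand, here is a minimal C sketch of what the visible part of the dump appears to compute. It is read off the instructions only (the `imul $0x1000193` / `shr $0x11` / `xor` groups in the unrolled loop, and the `mul $0x80008001` / `shr $0xf` / `inc` tail, which is a compiler-generated division by 65535); the names `mix_word`, `fold_checksum` and `check_dump_reading` are hypothetical and are not taken from the PostgreSQL source.

```c
#include <assert.h>
#include <stdint.h>

/* 0x1000193 == 16777619, the multiplier used by every imul in the loop. */
#define MIX_PRIME 0x1000193u

/* One "mov / imul / shr $0x11 / xor / mov" group applied to a 32-bit word. */
static uint32_t mix_word(uint32_t x)
{
    return (x * MIX_PRIME) ^ (x >> 17);
}

/* Tail at +1000..+1020: remainder modulo 65535, then "inc %eax". */
static uint16_t fold_checksum(uint32_t sum)
{
    return (uint16_t)((sum % 65535u) + 1u);
}

/* Small sanity checks on the reading above. */
static void check_dump_reading(void)
{
    assert(mix_word(0u) == 0u);
    assert(mix_word(1u) == MIX_PRIME);
    assert(fold_checksum(0u) == 1u);
    assert(fold_checksum(65535u) == 1u);
}
```

None of this scalar arithmetic cares about stack alignment; the fault comes from the `movaps`/`xorps` instructions that combine the partial sums through 16-byte stack slots.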
On Fri, Mar 08, 2019 at 02:35:33AM +0000, Andrew Gierth wrote:
! Yes, the database is supposed to be able to run without SSE2, as long as
! it is built with gcc and not clang, and without any architecture flags
! that imply SSE2 support.

Okay, thank You, that's what I was worried about - as some developers make a strategic decision here.

! I'm pretty sure nothing in our buildfarm is i386 without SSE2 though.

*laugh* no problem with that. There probably wouldn't be any reason to have such.

Here I have a couple of good reasons: that machine does the job for which other people buy a little plastic box off the shelf called an internet access router - and besides being stupid, these pieces are full of bugs and get hacked (I doubt anybody would bother to write a spectre exploit for pentium3, although it should be possible).

And on the other side, this is a server board built for 365/24 running on regECC mem. A new one of that class would inevitably carry quite a big Xeon, and so would do nothing but idle here. I see no point in such an investment, at least not until Intel comes up with a really nice new design getting rid of the crap.
https://www.techradar.com/news/spoiler-flaw-in-intel-cpus-is-similar-to-spectre-yet-dangerously-different

! Peter> Whatever information You like to have, just ask and I will
! Peter> gladly do my best to obtain it, as I get around. (This is a
! Peter> reproducible on a very well maintained piece of software - this
! Peter> is rather fun.)
!
! Your backtrace implies that you are running with checksums enabled;
! true?

Correct.

! Peter> The crash happens at a specific query - I get parse,bind, but no
! Peter> execute timing.
!
! What is the exact query?

This will be some work. It includes about six different sql functions (and some of these are probably also old enough to vote).

! I'm doing some investigations of my own, I may have more questions
! later.

You're welcome.

rgds,
PMc
>>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:

 Peter> esp 0x7fbfc15c 0x7fbfc15c

And there we go; the stack is misaligned (only 4-byte alignment where 16 is expected).

 Peter> eip 0x838bdf2 0x838bdf2
 Peter> 0x0838bdf2 <pg_checksum_page+866>: movaps 0x20(%esp),%xmm0

MOVAPS is an SSE (not SSE2) instruction; it's enabled by virtue of the fact that you used -march=pentium3 (the pentium3 supports SSE but not SSE2). The "A" stands for "aligned"; an unaligned source address causes an exception. %esp+0x20 is not correctly aligned for the instruction.

GCC defaults to using a 16-byte stack alignment, but it relies on the caller to align the stack too, so if a GCC-compiled function is called from code that doesn't align the stack, then this kind of error can result. I do not know offhand (but I plan to find out) what clang's default stack alignment on i386 is.

You can tell GCC to realign the stack itself using the -mstackrealign option.

This problem shows up only with GCC and not with clang because clang does not attempt to use SSE to vectorize this particular piece of code. The non-vectorized implementation generated by clang has no special requirements for stack alignment. But at the end of the day this is not a problem with PostgreSQL - it would show up with any code compiled with GCC where the compiler had elected to use SSE instructions for optimization.

--
Andrew (irc:RhodiumToad)
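The alignment arithmetic in the message above can be replayed in a few lines of C. This is only a sketch of the address calculation for the quoted `esp` value and the `movaps 0x20(%esp),%xmm0` operand; the helper names (`misalignment16`, `movaps_operand`, `check_reported_esp`) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Distance of addr above the previous 16-byte boundary; 0 means aligned. */
static unsigned misalignment16(uint32_t addr)
{
    return (unsigned)(addr & 0xFu);
}

/* Effective address of the memory operand in: movaps 0x20(%esp),%xmm0 */
static uint32_t movaps_operand(uint32_t esp)
{
    return esp + 0x20u;
}

/* Replays the register value quoted in the thread. */
static void check_reported_esp(void)
{
    const uint32_t esp = 0x7fbfc15cu;            /* from the register dump */
    const uint32_t operand = movaps_operand(esp);

    assert(operand == 0x7fbfc17cu);
    assert(misalignment16(operand) == 12u);      /* not a multiple of 16 */
    assert(operand % 4u == 0u);                  /* but still 4-byte aligned */
}
```

Because 0x20 is itself a multiple of 16, the operand is aligned exactly when `%esp` is, which is why the register value alone is enough to diagnose the fault.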
On 2019-Mar-08, Peter wrote:

> On Fri, Mar 08, 2019 at 02:35:33AM +0000, Andrew Gierth wrote:
> ! I'm pretty sure nothing in our buildfarm is i386 without SSE2 though.
>
> *laugh* no problem with that. There probably wouldn't be any reason
> to have such.

Actually there *is* a very good reason to have one, which is that we would have discovered this bug right away. (Just ask Tom Lane for a tally of bugs that have been discovered due to his old HPUX 10.20 dinosaur he keeps running just for that purpose.) It seems difficult to grab hold of such hardware, however. I don't suppose you have many spare cycles on that machine of yours to run a buildfarm animal?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Mar-08, Peter wrote:
>> On Fri, Mar 08, 2019 at 02:35:33AM +0000, Andrew Gierth wrote:
>>> ! I'm pretty sure nothing in our buildfarm is i386 without SSE2 though.

>> *laugh* no problem with that. There probably wouldn't be any reason
>> to have such.

> Actually there *is* a very good reason to have one, which is that we
> would have discovered this bug right away. (Just ask Tom Lane for a
> tally of bugs that have been discovered due to his old HPUX 10.20
> dinosaur he keeps running just for that purpose). It seems difficult to
> grab hold of such hardware, however. I don't suppose you have many
> spare cycles on that machine of yours to run a buildfarm animal?

IIUC, this bug isn't actually down to the old hardware: any SSE-capable chip ought to exhibit the same problem. The bug is in the toolchain somewhere, in that some compiler or run-time infrastructure is failing to maintain 16-byte stack alignment as required by the ABI. Or, possibly, there is disagreement among relevant toolchain elements as to exactly what ABI they're using.

regards, tom lane
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> So I'm going to guess that your bug 236025 is actually an
 Andrew> alignment problem, with the compiler making some assumption
 Andrew> about alignment that we're violating. I'll investigate and see
 Andrew> what I can find.

OK, I have completed my analysis of both reports.

The bottom line is that this is a disagreement between gcc and the (clang-compiled) system libraries over what the stack alignment should be; GCC wants and assumes 16-byte alignment, but clang won't provide that. It's not any kind of bug in PostgreSQL.

For most applications there is no issue because GCC aligns the stack itself on entry into main(), so the only time it becomes an issue is if two conditions are met:

(1) the application must call into an outside (non-GCC-compiled) library which then calls _back_ into the application, AND

(2) the subsequent code executes instructions that rely on the stack alignment for correctness (and not just performance).

PostgreSQL compiled by GCC on i386 without architecture options will not rely on the alignment of the stack, so condition (2) is not met. Only if you specify an architecture such as -march=pentium3 (which enables SSE) will any instructions be used which require strict alignment.

It may not be obvious how condition (1) is met, but notice that the report from Peter has the crash happening in either a background worker or the checkpointer process; this is significant because those are spawned from postmaster while in a signal handler, and the signal handler's stack frame has disturbed the stack alignment (and with the system libraries compiled with clang and not gcc, no attempt is made to adjust that).

So the implications for the postgresql port on freebsd/i386 are:

1. If you compile with GCC and no architecture options you should have no problems on any cpu. This presumably covers the case of the packaged binaries.

2. If you compile with GCC and any of -msse, -msse2, -march=pentium3 or later, or any similar flag that enables use of SSE or later (I believe that no MMX instructions require special alignment), then you will also need -mstackrealign (or patch the source to add the equivalent attribute to every signal handler function or other callback, which I don't really recommend). (Maybe the port should add this option defensively?)

   The crash in (freebsd) bug #236025 is explained by the fact that the user had -msse2 set when compiling with GCC. Peter's crash is explained by the use of -march=pentium3 when compiling with GCC.

3. If you compile with clang and -msse2 then there should be no stack alignment issues (since clang doesn't assume the stack is aligned) but obviously you then can't run the binary on a pre-pentium4 cpu.

--
Andrew (irc:RhodiumToad)
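To make the "equivalent attribute" alternative in item 2 concrete, here is a hedged sketch in C. It assumes the attribute meant is GCC's per-function `force_align_arg_pointer` (documented for x86 targets); the message itself does not name it. The handler `on_signal`, the macro `REALIGN_STACK`, and the helper `run_handler_once` are hypothetical and have nothing to do with PostgreSQL's actual signal handlers.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Assumption: force_align_arg_pointer is the per-function counterpart of
 * -mstackrealign. It only exists on x86, so it is guarded here. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

/* 0 = handler not run yet, 1 = slot was 16-byte aligned, 2 = it was not. */
static volatile sig_atomic_t handler_result = 0;

/* Hypothetical callback entered from code that may not keep the stack
 * 16-byte aligned; the attribute makes the prologue realign it. */
static REALIGN_STACK void on_signal(int signo)
{
    _Alignas(16) volatile unsigned char slot[16];

    (void) signo;
    slot[0] = 0;
    handler_result = ((((uintptr_t) slot) & 0xFu) == 0u) ? 1 : 2;
}

/* Installs the handler, raises the signal once, and reports what it saw. */
static int run_handler_once(void)
{
    if (signal(SIGTERM, on_signal) == SIG_ERR)
        return -1;
    if (raise(SIGTERM) != 0)
        return -1;
    assert(handler_result != 0);   /* raise() returns after the handler ran */
    return (int) handler_result;
}
```

As the message says, doing this for every handler and callback is tedious and easy to get wrong, which is why the build-wide -mstackrealign flag is the simpler choice.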
>>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:

 Peter> Extra Compiler-Options:
 Peter> -march=pentium3

I CC'd you on my response to Palle, but to spell it out, what you need to do to fix this is either:

1. Remove the -march=pentium3 option.

2. Add the -mstackrealign option as well.

Either way should work. Could you try (one or both, your preference) and report back?

--
Andrew (irc:RhodiumToad)
Hi Andrew,

many thanks for Your efforts! Let's see what I get out of this.

First, it seems I can reproduce the fault on my build machine (IvyBridge core-i5) in the i386-chroot as well - which is not a surprise according to Your explanations.

On Fri, Mar 08, 2019 at 04:51:47PM +0000, Andrew Gierth wrote:
! MOVAPS is an SSE (not SSE2) instruction; it's enabled by virtue of the
! fact that you used -march=pentium3 (the pentium3 supports SSE but not
! SSE2). The "A" stands for "aligned"; an unaligned source address causes
! an exception. %esp+0x20 is not correctly aligned for the instruction.

Okay so far. I was occasionally wondering if that pentium3 option would affect anything at all. Now we see, it does. ;)

! GCC defaults to using a 16-byte stack alignment, but it relies on the
! caller to align the stack too, so if a GCC-compiled function is called
! from code that doesn't align the stack, then this kind of error can
! result. I do not know offhand (but I plan to find out) what clang's
! default stack alignment on i386 is.

Well, what caused me a headache this evening is: who would be the caller in this case, as - from my understanding - it is just postgreSQL running? Now from Your newer mail this riddle does clear up well.

In my build environment, I can now create and start a new db-cluster and issue only the single command "CREATE ROLE bacula;" and it will crash - but then again I have to wait for the next checkpointer.

! You can tell GCC to realign the stack itself using the -mstackrealign
! option.

Yepp, that appears to solve it.

So, as there is a fix now, I'm pondering who would be responsible for applying it:
* the system owner (alongside the CPU definition)
* the port maintainer (alongside the compiler choice)
* the postgres configure script

! This problem shows up only with GCC and not with clang because clang
! does not attempt to use SSE to vectorize this particular piece of code.
! The non-vectorized implementation generated by clang has no special
! requirements for stack alignment. But at the end of the day this is not
! a problem with PostgreSQL - it would show up with any code compiled with
! GCC where the compiler had elected to use SSE instructions for
! optimization.

Well, it's clearly my fault, coming up with that pentium3 option. *gg*

rgds,
P.
On Fri, Mar 08, 2019 at 08:22:33PM +0000, Andrew Gierth wrote:
! >>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:
!
! Peter> Extra Compiler-Options:
! Peter> -march=pentium3
!
! I CC'd you on my response to Palle, but to spell it out, what you need
! to do to fix this is either:
!
! 1. Remove the -march=pentium3 option.
!
! 2. Add the -mstackrealign option as well.

Yes, thank You, this got clear to me in the afternoon.

! Either way should work. Could you try (one or both, your preference) and
! report back?

Yes, -mstackrealign helps. So, as my local problem is now solved, I am thinking about that "greater good for mankind" thing. ;)

At least one other person did run into the issue, and from Your explanation I understand that this is not a postgres issue, but could happen to any piece of software that decides to use gcc.

Now it seems easy to put this compiler option into the port's specific makefile for postgreSQL, but it should actually be fixed systemwide.

But, while it gets recommended to set the cpu-type systemwide on FreeBSD, it is not recommended for the system owner to set CFLAGS systemwide (even less specific ones for gcc - and I didn't find a proper way to do that, anyway).

So my proposal is that this one belongs into /usr/ports/Mk/bsd.gcc.mk

+ .if ${MACHINE_CPU:Msse} && ${MACHINE_ARCH} == i386
+ CFLAGS+= -mstackrealign
+ .endif

That seems to work. Palle, Your opinion?

rgds,
PMc
> On 9 March 2019, at 01:17, Peter <pmc@citylink.dinoex.sub.org> wrote:
>
> On Fri, Mar 08, 2019 at 08:22:33PM +0000, Andrew Gierth wrote:
> ! >>>>> "Peter" == Peter <pmc@citylink.dinoex.sub.org> writes:
> !
> ! Peter> Extra Compiler-Options:
> ! Peter> -march=pentium3
> !
> ! I CC'd you on my response to Palle, but to spell it out, what you need
> ! to do to fix this is either:
> !
> ! 1. Remove the -march=pentium3 option.
> !
> ! 2. Add the -mstackrealign option as well.
> Yes, thank You, this got clear to me in the afternoon.
> ! Either way should work. Could you try (one or both, your preference) and
> ! report back?
> Yes, -mstackrealign helps. So, as my local problem is now solved,
> I am thinking about that "greater good for mankind" thing. ;)
> At least one other person did run into the issue, and from Your
> explanation I understand that this is not a postgres issue, but
> could happen to any piece of software that decides to use gcc.
> Now it seems easy to put this compiler option into the port's
> specific makefile for postgreSQL, but it should actually be fixed
> systemwide.
> But, while it gets recommended to set the cpu-type systemwide on
> FreeBSD, it is not recommended for the system owner to set CFLAGS
> systemwide (even less specific ones for gcc - and I didn't find
> a proper way to do that, anyway).
> So my proposal is that this one belongs into /usr/ports/Mk/bsd.gcc.mk
> + .if ${MACHINE_CPU:Msse} && ${MACHINE_ARCH} == i386
> + CFLAGS+= -mstackrealign
> + .endif
> That seems to work. Palle, Your opinion?
Perhaps, as Andrew suggests, just adding -mstackrealign defensively in the port for i386 is a reasonable tradeoff? It would not help for other ports though.
I'm cc:ing gerald@FreeBSD.org, maintainer of Mk/bsd.gcc.mk, to see if he has any thoughts about this. I will forward Andrew's analysis to Gerald as well.
Palle
Hi Palle!

On Sat, Mar 09, 2019 at 01:42:55PM +0100, Palle Girgensohn wrote:
! > On 9 March 2019, at 01:17, Peter <pmc@citylink.dinoex.sub.org> wrote:
! > Yes, -mstackrealign helps. So, as my local problem is now solved,
! > I am thinking about that "greater good for mankind" thing. ;)
! >
! > At least one other person did run into the issue, and from Your
! > explanation I understand that this is not a postgres issue, but
! > could happen to any piece of software that decides to use gcc.
! >
! > Now it seems easy to put this compiler option into the port's
! > specific makefile for postgreSQL, but it should actually be fixed
! > systemwide.
! >
! > But, while it gets recommended to set the cpu-type systemwide on
! > FreeBSD, it is not recommended for the system owner to set CFLAGS
! > systemwide (even less specific ones for gcc - and I didn't find
! > a proper way to do that, anyway).
! >
! > So my proposal is that this one belongs into /usr/ports/Mk/bsd.gcc.mk
! >
! > + .if ${MACHINE_CPU:Msse} && ${MACHINE_ARCH} == i386
! > + CFLAGS+= -mstackrealign
! > + .endif
! >
! > That seems to work. Palle, Your opinion?
!
! Well, we ideally want the binary packages built with the ports
! framework, for a specific platform, in this case i386, to work on all
! i386 machines. This makes it a bit more complicated.

Yes, that was my intention! The MACHINE_CPU variable already gets populated with "sse" (from somewhere in /usr/share/mk).

What I usually do: I only set the proper CPU type in /etc/make.conf, and then the "-march=pentium3" (or whatever CPU there is) appears magically in all builds. I didn't configure this!

/etc/make.conf:
! # Machine is a Pentium-3
! CPUTYPE?= p3

Nothing more is needed, and now in any port's directory, we get

! >postgresql10-server$ make -V MACHINE_CPU
! sse i686 mmx i586 i486

Conversely, if I remove the CPUTYPE from /etc/make.conf, then it says:

! >postgresql10-server$ make -V MACHINE_CPU
! i486

And I suppose these strings are there to be tested for such cases.

! We would have to
! check if the user has any of -msse, -msse2, -march=pentium3 or later,
! or any similar flag that enables use of SSE or later,

If the user explicitly drops in those options, then it indeed gets complicated. I don't do that - I do NOT set any -march or whatever options, I only set CPUTYPE in make.conf, as it gets recommended by developers, e.g. here:
https://forums.freebsd.org/threads/whats-in-your-make-conf.36150/post-199595

And that's why I am thinking of a global fix - because this fault is happening on recommended-practice ground!

! and if the
! underlying userland is built with clang

Ouch, that one I did neglect. One could have built the whole system with gcc, and then no action is needed. Hmm...

! Perhaps, as Andrew suggests, just adding -mstackrealign defensively
! in the port for i386 is a reasonable tradeoff? It would not help for
! other ports though.

That is reasonable, certainly. :)

PMc
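Whether SSE ended up enabled, by whichever route (CPUTYPE, -march, -msse), can also be seen from inside the compiler: GCC and clang predefine `__SSE__` when SSE code generation is on, and `__i386__` on 32-bit x86. The following is a rough C sketch of a check for the combination the thread identifies as risky; the function names are hypothetical, and it cannot tell whether -mstackrealign was also passed.

```c
#include <assert.h>

/* 1 if the compiler was allowed to emit SSE instructions
 * (for example via -msse or -march=pentium3), else 0. */
static int built_with_sse(void)
{
#if defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}

/* 1 for the combination discussed in the thread:
 * GCC (not clang) targeting i386 with SSE enabled. */
static int gcc_i386_sse_build(void)
{
#if defined(__i386__) && defined(__GNUC__) && !defined(__clang__)
    return built_with_sse();
#else
    return 0;
#endif
}

/* The risky combination implies SSE is enabled, never the reverse. */
static void check_flags_consistent(void)
{
    assert(!gcc_i386_sse_build() || built_with_sse());
}
```

A check like this only reports what the compiler was told; the decision where to add the flag (make.conf, the port, or bsd.gcc.mk) remains the open question of the thread.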