Обсуждение: strange failure in plpgsql_control tests (on fulmar, ICC 14.0.3)
Hi, I happened to be updating our machine running our buildfarm animals, and I noticed something quite strange - the machine was unexpectedly running out of disk space, which is rather suspicious as it's running just the regression tests :-/ After a bit of investigation, I found this: # pwd /mnt/fulmar/build/HEAD/pgsql.build/src/pl/plpgsql/src/results # ls -l total 57089152 -rw-r--r--. 1 pgbuild pgbuild 714 Jan 1 00:27 plpgsql_call.out -rw-r--r--. 1 pgbuild pgbuild 58458515597 Mar 17 13:54 plpgsql_control.out Right - the plpgsql_control.out output has about 58GB, which is somewhat excessive I guess. The reason is fairly simple: NOTICE: 1..3: i = 1 NOTICE: 1..3: i = 2 NOTICE: 1..3: i = 3 NOTICE: 1..10 by 3: i = 1 NOTICE: 1..10 by 3: i = 4 NOTICE: 1..10 by 3: i = 7 NOTICE: 1..10 by 3: i = 10 NOTICE: 1..11 by 3: i = 1 NOTICE: 1..11 by 3: i = 4 NOTICE: 1..11 by 3: i = 7 NOTICE: 1..11 by 3: i = 10 NOTICE: reverse 10..0 by 3: i = 10 NOTICE: reverse 10..0 by 3: i = 7 NOTICE: reverse 10..0 by 3: i = 4 NOTICE: reverse 10..0 by 3: i = 1 NOTICE: 2147483620..2147483647 by 10: i = 2147483620 NOTICE: 2147483620..2147483647 by 10: i = 2147483630 NOTICE: 2147483620..2147483647 by 10: i = 2147483640 NOTICE: 2147483620..2147483647 by 10: i = -2147483646 NOTICE: 2147483620..2147483647 by 10: i = -2147483636 NOTICE: 2147483620..2147483647 by 10: i = -2147483626 ... many more NOTICE messages ... Looking at the plpgsql_call.out timestamp, this seems to be running since January 1, which also correlates with the last results reported to pgbuildfarm. Does that remind some known issue to anyone? It seems a bit like an undetected overflow, but I don't see why should the behavior change suddenly (AFAICS there were no updates to the machine just before January 1st). I wonder if this might be related ICC 14.0.3, but again - it has been like that for a very long time. Ideas? -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > I happened to be updating our machine running our buildfarm animals, and > I noticed something quite strange - the machine was unexpectedly running > out of disk space, which is rather suspicious as it's running just the > regression tests :-/ > After a bit of investigation, I found this: > Right - the plpgsql_control.out output has about 58GB, which is somewhat > excessive I guess. The reason is fairly simple: > NOTICE: 2147483620..2147483647 by 10: i = 2147483620 > NOTICE: 2147483620..2147483647 by 10: i = 2147483630 > NOTICE: 2147483620..2147483647 by 10: i = 2147483640 > NOTICE: 2147483620..2147483647 by 10: i = -2147483646 > NOTICE: 2147483620..2147483647 by 10: i = -2147483636 > NOTICE: 2147483620..2147483647 by 10: i = -2147483626 > ... many more NOTICE messages ... > Looking at the plpgsql_call.out timestamp, this seems to be running > since January 1, which also correlates with the last results reported to > pgbuildfarm. Ouch. That test is in fact new as of 31 Dec, and what this seems to prove is that plpgsql's handling of loop-variable overflow doesn't work on fulmar. regards, tom lane
On 03/17/2018 03:32 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> I happened to be updating our machine running our buildfarm animals, and >> I noticed something quite strange - the machine was unexpectedly running >> out of disk space, which is rather suspicious as it's running just the >> regression tests :-/ > >> After a bit of investigation, I found this: >> Right - the plpgsql_control.out output has about 58GB, which is somewhat >> excessive I guess. The reason is fairly simple: > >> NOTICE: 2147483620..2147483647 by 10: i = 2147483620 >> NOTICE: 2147483620..2147483647 by 10: i = 2147483630 >> NOTICE: 2147483620..2147483647 by 10: i = 2147483640 >> NOTICE: 2147483620..2147483647 by 10: i = -2147483646 >> NOTICE: 2147483620..2147483647 by 10: i = -2147483636 >> NOTICE: 2147483620..2147483647 by 10: i = -2147483626 >> ... many more NOTICE messages ... > >> Looking at the plpgsql_call.out timestamp, this seems to be running >> since January 1, which also correlates with the last results reported to >> pgbuildfarm. > > Ouch. That test is in fact new as of 31 Dec, and what this seems to > prove is that plpgsql's handling of loop-variable overflow doesn't > work on fulmar. > Yeah :-( I'll try restarting the tests on fulmar, to see if it's reproducible. If yes, I'll see if a newer version of ICC fixes it. If not, I'll temporarily disable fulmar, so that the two other animals on the same machine (magpie, treepie) are not stuck behind it. That should tell us if the issue is limited to ICC, or if there's something else wrong with that system. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote: > Ouch. That test is in fact new as of 31 Dec, and what this seems to > prove is that plpgsql's handling of loop-variable overflow doesn't > work on fulmar. Some of the other icc-using critters haven't reported in since December, either :-( Looking at the code, we do this like so: /* * Increase/decrease loop value, unless it would overflow, in which * case exit the loop. */ if (stmt->reverse) { if ((int32) (loop_value - step_value) > loop_value) break; loop_value -= step_value; } else { if ((int32) (loop_value + step_value) < loop_value) break; loop_value += step_value; } I imagine what's happening is that the compiler is assuming no overflow occurs (due to lacking any equivalent of -fwrapv), then deducing that the if-tests are no-ops and throwing them away. We could avoid the dependency on -fwrapv with something like if (stmt->reverse) { if (loop_value < (PG_INT32_MIN + step_value)) break; loop_value -= step_value; } else { if (loop_value > (PG_INT32_MAX - step_value)) break; loop_value += step_value; } which is safe because we enforce step_value > 0. It's kind of ugly because it hard-codes knowledge of what the limits are, but we may not have much choice. regards, tom lane
On March 17, 2018 7:56:40 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >I wrote: >> Ouch. That test is in fact new as of 31 Dec, and what this seems to >> prove is that plpgsql's handling of loop-variable overflow doesn't >> work on fulmar. > >Some of the other icc-using critters haven't reported in since >December, either :-( > >Looking at the code, we do this like so: > > /* > * Increase/decrease loop value, unless it would overflow, in which > * case exit the loop. > */ > if (stmt->reverse) > { > if ((int32) (loop_value - step_value) > loop_value) > break; > loop_value -= step_value; > } > else > { > if ((int32) (loop_value + step_value) < loop_value) > break; > loop_value += step_value; > } > >I imagine what's happening is that the compiler is assuming no overflow >occurs (due to lacking any equivalent of -fwrapv), then deducing that >the >if-tests are no-ops and throwing them away. > >We could avoid the dependency on -fwrapv with something like > > if (stmt->reverse) > { > if (loop_value < (PG_INT32_MIN + step_value)) > break; > loop_value -= step_value; > } > else > { > if (loop_value > (PG_INT32_MAX - step_value)) > break; > loop_value += step_value; > } > >which is safe because we enforce step_value > 0. It's kind of ugly >because it hard-codes knowledge of what the limits are, but we may not >have much choice. On the current branch just using the new overflow safe functions in int.h should work. But unless we are OK leaving thisbroken in the back branches, or want to backport the functionality, that's probably not sufficient. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 03/17/2018 06:27 PM, Andres Freund wrote: > > > On March 17, 2018 7:56:40 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wrote: >>> Ouch. That test is in fact new as of 31 Dec, and what this seems to >>> prove is that plpgsql's handling of loop-variable overflow doesn't >>> work on fulmar. >> >> Some of the other icc-using critters haven't reported in since >> December, either :-( >> >> Looking at the code, we do this like so: >> >> /* >> * Increase/decrease loop value, unless it would overflow, in which >> * case exit the loop. >> */ >> if (stmt->reverse) >> { >> if ((int32) (loop_value - step_value) > loop_value) >> break; >> loop_value -= step_value; >> } >> else >> { >> if ((int32) (loop_value + step_value) < loop_value) >> break; >> loop_value += step_value; >> } >> >> I imagine what's happening is that the compiler is assuming no overflow >> occurs (due to lacking any equivalent of -fwrapv), then deducing that >> the >> if-tests are no-ops and throwing them away. >> >> We could avoid the dependency on -fwrapv with something like >> >> if (stmt->reverse) >> { >> if (loop_value < (PG_INT32_MIN + step_value)) >> break; >> loop_value -= step_value; >> } >> else >> { >> if (loop_value > (PG_INT32_MAX - step_value)) >> break; >> loop_value += step_value; >> } >> >> which is safe because we enforce step_value > 0. It's kind of ugly >> because it hard-codes knowledge of what the limits are, but we may not >> have much choice. > > On the current branch just using the new overflow safe functions in > int.h should work. But unless we are OK leaving this broken in the back > branches, or want to backport the functionality, that's probably not > sufficient. > Not sure, but the backbranches seem to be working fine, and the commit that triggers the issue is from December 31. Maybe the issue was there but we were lucky not to trip on it before. Anyway, I can confirm that the fix suggested by Tom does the trick (well, at least on Fulmar, which is running icc 14.0.3). I've also disassembled exec_stmt_fori both with and without the patch - reading assembly in not my strength, but if you're interested it's attached. The interesting part seems to be the last ~50 lines or so. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Hi, On 2018-03-17 18:55:11 +0100, Tomas Vondra wrote: > Not sure, but the backbranches seem to be working fine, and the commit > that triggers the issue is from December 31. Well, that added the test. Are you saying that if you execute similar code on an older branch it doesn't fail? > Anyway, I can confirm that the fix suggested by Tom does the trick Could you try the attached patch, too? I'm a bit afraid that we'll have to go a lot further if we can't make icc do safe signed overflow in all cases. This case is easy enough to fix, but we were lucky in a way to find it. Given that apparently we know that ICC also optimizes based on overflow sematics it seems we should really work towards making all of the backend safe with that :(. - Andres
Вложения
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > Not sure, but the backbranches seem to be working fine, and the commit > that triggers the issue is from December 31. Maybe the issue was there > but we were lucky not to trip on it before. Yeah, we were simply not testing that overflow-detection code before. Undoubtedly it would fail in the back branches too if we tested it. > Anyway, I can confirm that the fix suggested by Tom does the trick > (well, at least on Fulmar, which is running icc 14.0.3). I've also > disassembled exec_stmt_fori both with and without the patch - reading > assembly in not my strength, but if you're interested it's attached. The > interesting part seems to be the last ~50 lines or so. Hm, did you get the "master" and "patched" versions backwards? The allegedly-patched version does the !reverse case like this: 0x00007f71219457ae <+2200>: mov -0x108(%rbp),%eax 0x00007f71219457b4 <+2206>: test %eax,%eax 0x00007f71219457b6 <+2208>: jl 0x7f71219457cf <exec_stmt_fori+2233> 0x00007f71219457b8 <+2210>: mov -0x108(%rbp),%eax 0x00007f71219457be <+2216>: add -0x110(%rbp),%eax 0x00007f71219457c4 <+2222>: mov %eax,-0x110(%rbp) 0x00007f71219457ca <+2228>: jmpq 0x7f7121945573 <exec_stmt_fori+1629> so that it's apparently optimized if ((int32) (loop_value + step_value) < loop_value) break; into if (step_value < 0) break; which of course never exits the loop. That's slightly different (and stupider) than what I'd been hypothesizing, but it's a valid transformation if you ignore the possibility of integer overflow. It might be worth studying the icc manual to see if it has an equivalent of -fwrapv. Although we can and probably should fix this case by changing the code, I'm worried about what other bugs may exist only in icc builds. I know Andres would like to get rid of the need for -fwrapv but I suspect that's not really going to happen soon. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > On the current branch just using the new overflow safe functions in > int.h should work. But unless we are OK leaving this broken in the back > branches, or want to backport the functionality, that's probably not > sufficient. Yeah ... I don't like either of the last two things, so probably we should go with the patch as I had it. Yours might perform a shade better on compilers with the built-in, but it'll be a lot worse on those without. What I was wondering about was whether to back-patch a test case. It doesn't seem really necessary, and we'd have to put it someplace else than where it is in HEAD, so I'm leaning against. regards, tom lane
Hi, On 2018-03-17 14:20:26 -0400, Tom Lane wrote: > It might be worth studying the icc manual to see if it has an > equivalent of -fwrapv. Yes. A *quick* look through https://software.intel.com/en-us/node/522795 unfortunately didn't show anything. > Although we can and probably should fix this case by changing the > code, I'm worried about what other bugs may exist only in icc builds. Yea, I know that I can produce a number of "bugs" today by removing -fwrapv for gcc, and found a few more by manual inspection. I'm sure icc can trigger at least some of them. > I know Andres would like to get rid of the need for -fwrapv but I > suspect that's not really going to happen soon. And definitely not in anything released or close to it. I do want to get rid of it because various compilers don't have comparable flags, and because it causes slowdowns. But I really would like to do it without running headfirst into a wall, and that'll mean going a bit slower. I think it'd be good practice to get rid of the known overflow hazards by using int.h, but I don't want to drop fwrapv immediately. Greetings, Andres Freund
On 03/17/2018 07:20 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> Not sure, but the backbranches seem to be working fine, and the commit >> that triggers the issue is from December 31. Maybe the issue was there >> but we were lucky not to trip on it before. > > Yeah, we were simply not testing that overflow-detection code before. > Undoubtedly it would fail in the back branches too if we tested it. > >> Anyway, I can confirm that the fix suggested by Tom does the trick >> (well, at least on Fulmar, which is running icc 14.0.3). I've also >> disassembled exec_stmt_fori both with and without the patch - reading >> assembly in not my strength, but if you're interested it's attached. The >> interesting part seems to be the last ~50 lines or so. > > Hm, did you get the "master" and "patched" versions backwards? The > allegedly-patched version does the !reverse case like this: > > 0x00007f71219457ae <+2200>: mov -0x108(%rbp),%eax > 0x00007f71219457b4 <+2206>: test %eax,%eax > 0x00007f71219457b6 <+2208>: jl 0x7f71219457cf <exec_stmt_fori+2233> > 0x00007f71219457b8 <+2210>: mov -0x108(%rbp),%eax > 0x00007f71219457be <+2216>: add -0x110(%rbp),%eax > 0x00007f71219457c4 <+2222>: mov %eax,-0x110(%rbp) > 0x00007f71219457ca <+2228>: jmpq 0x7f7121945573 <exec_stmt_fori+1629> > > so that it's apparently optimized > > if ((int32) (loop_value + step_value) < loop_value) > break; > > into > > if (step_value < 0) > break; > > which of course never exits the loop. That's slightly different > (and stupider) than what I'd been hypothesizing, but it's a valid > transformation if you ignore the possibility of integer overflow. > Yeah, it seems I've mixed up the files by accident. Sorry. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On March 17, 2018 11:32:36 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Andres Freund <andres@anarazel.de> writes: >> On the current branch just using the new overflow safe functions in >> int.h should work. But unless we are OK leaving this broken in the >back >> branches, or want to backport the functionality, that's probably not >> sufficient. > >Yeah ... I don't like either of the last two things, so probably we >should >go with the patch as I had it. Yours might perform a shade better on >compilers with the built-in, but it'll be a lot worse on those without. I don't think performance is a prime driver here, or shouldn't be at least. Obviousness / grepability seem much more important. I'd vote for using my version in master, and yours in the back branches. I can do that, of you want. I'm OK with skipping the test for now. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > I don't think performance is a prime driver here, or shouldn't be at least. Obviousness / grepability seem much more important. I'd vote for using my version in master, and yours in the back branches. I can do that, of you want. I dunno, I think the code as I had it is quite obvious. It's just that I don't really like hard-coding references to INT_MIN/MAX, which I guess is a personal style thing rather than anything I can defend well. > I'm OK with skipping the test for now. If we're not putting a test into the back branches, then we darn well better be using the same code there as in HEAD, else we won't know that it actually solves the problem. regards, tom lane
On March 17, 2018 12:25:57 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Andres Freund <andres@anarazel.de> writes: >> I don't think performance is a prime driver here, or shouldn't be at >least. Obviousness / grepability seem much more important. I'd vote >for using my version in master, and yours in the back branches. I can >do that, of you want. > >I dunno, I think the code as I had it is quite obvious. It's just that >I don't really like hard-coding references to INT_MIN/MAX, which I >guess >is a personal style thing rather than anything I can defend well. Certainly harder to grep for. There's lots of other uses of the min/max macros. And the logic to get or right depends onan earlier piece of code ensuring the step is positive... >> I'm OK with skipping the test for now. > >If we're not putting a test into the back branches, then we darn well >better be using the same code there as in HEAD, else we won't know that >it actually solves the problem. I was thinking of committing your version everywhere and then revising in master after a bf cycle. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > On March 17, 2018 12:25:57 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If we're not putting a test into the back branches, then we darn well >> better be using the same code there as in HEAD, else we won't know that >> it actually solves the problem. > I was thinking of committing your version everywhere and then revising in master after a bf cycle. Meh. I don't really feel a need to change it at all, but if we do, I think it should be "after a release cycle" not just a few days. We need to see whether that regression test will expose any problems on a wider variety of compilers than exist in the buildfarm. Anyway, pushed for now --- Tomas, if you blocked fulmar, it should be safe to unblock. regards, tom lane
On 03/17/2018 08:41 PM, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On March 17, 2018 12:25:57 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> If we're not putting a test into the back branches, then we darn >>> well better be using the same code there as in HEAD, else we >>> won't know that it actually solves the problem. > >> I was thinking of committing your version everywhere and then >> revising in master after a bf cycle. > > Meh. I don't really feel a need to change it at all, but if we do, > I think it should be "after a release cycle" not just a few days. > We need to see whether that regression test will expose any problems > on a wider variety of compilers than exist in the buildfarm. > > Anyway, pushed for now --- Tomas, if you blocked fulmar, it should > be safe to unblock. > OK, enabled again. It'll take a while to run through the branches. I guess it might want to notify people running affected animals, because otherwise they may stay stuck for a long time. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > I guess it might want to notify people running affected animals, because > otherwise they may stay stuck for a long time. Yeah, I sent something out to buildfarm-members already. regards, tom lane
On Sun, Mar 18, 2018 at 7:33 AM, Andres Freund <andres@anarazel.de> wrote: > On 2018-03-17 14:20:26 -0400, Tom Lane wrote: >> It might be worth studying the icc manual to see if it has an >> equivalent of -fwrapv. > > Yes. > > A *quick* look through https://software.intel.com/en-us/node/522795 > unfortunately didn't show anything. Apparently it does support -fno-strict-overflow. Is that useful here? -- Thomas Munro http://www.enterprisedb.com
On Mon, Mar 19, 2018 at 10:32 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sun, Mar 18, 2018 at 7:33 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2018-03-17 14:20:26 -0400, Tom Lane wrote: >>> It might be worth studying the icc manual to see if it has an >>> equivalent of -fwrapv. >> >> Yes. >> >> A *quick* look through https://software.intel.com/en-us/node/522795 >> unfortunately didn't show anything. > > Apparently it does support -fno-strict-overflow. Is that useful here? Hmm. This was already discussed here: https://www.postgresql.org/message-id/51016409.2020808%40gmail.com Noah seemed to be at least slightly in favour of considering turning it on despite doubt about its precise meaning, but also pointed out that even -fwrapv doesn't mean exactly the same thing in GCC and Clang. Curiously -fno-strict-overflow doesn't seem to appear in the documentation that Andres posted (well I couldn't find it quickly, anyway), but we can see that it exists from this interaction between Xi Wang and Intel compiler engineering: https://software.intel.com/en-us/forums/intel-c-compiler/topic/358200 And we can see a reference here: https://www.spec.org/cpu2017/flags/Intel-ic18.0-official-linux64.html -- Thomas Munro http://www.enterprisedb.com