Обсуждение: longfin and tamandua aren't too happy but I'm not sure why
Hi, longfin and tamandua recently begun failing like this, quite possibly as a result of 05d4cbf9b6ba708858984b01ca0fc56d59d4ec7c: +++ regress check in contrib/test_decoding +++ test ddl ... FAILED (test process exited with exit code 2) 3276 ms (all other tests in this suite also fail, probably because the server crashed here) The server logs look like this: 2022-09-27 13:51:08.652 EDT [37090:4] LOG: server process (PID 37105) was terminated by signal 4: Illegal instruction: 4 2022-09-27 13:51:08.652 EDT [37090:5] DETAIL: Failed process was running: SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); Both animals are running with -fsanitize=alignment and it's not difficult to believe that the commit mentioned above could have introduced an alignment problem where we didn't have one before, but without a stack backtrace I don't know how to track it down. I tried running those tests locally with -fsanitize=alignment and they passed. Any ideas on how to track this down? -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Sep 27, 2022 at 02:55:18PM -0400, Robert Haas wrote: > Both animals are running with -fsanitize=alignment and it's not > difficult to believe that the commit mentioned above could have > introduced an alignment problem where we didn't have one before, but > without a stack backtrace I don't know how to track it down. I tried > running those tests locally with -fsanitize=alignment and they passed. There's one here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-09-27%2018%3A43%3A06 /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/access/rmgrdesc/xactdesc.c:102:30: runtime error: memberaccess within misaligned address 0x000004125074 for type 'xl_xact_invals' (aka 'struct xl_xact_invals'), which requires8 byte alignment #0 0x5b6702 in ParseCommitRecord /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/access/rmgrdesc/xactdesc.c:102:30 #1 0xb5264d in xact_decode /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/decode.c:201:5 #2 0xb521ac in LogicalDecodingProcessRecord /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/decode.c:119:3 #3 0xb5e868 in pg_logical_slot_get_changes_guts /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/logicalfuncs.c:271:5 #4 0xb5e25f in pg_logical_slot_get_changes /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/logicalfuncs.c:338:9 #5 0x896bba in ExecMakeTableFunctionResult /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/execSRF.c:234:13 #6 0x8c7660 in FunctionNext /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/nodeFunctionscan.c:95:5 #7 0x899048 in ExecScanFetch /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/execScan.c:133:9 #8 0x89896b in ExecScan /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/execScan.c:199:10 #9 0x8c6892 in ExecFunctionScan /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/nodeFunctionscan.c:270:9 #10 0x892f42 in ExecProcNodeFirst /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/execProcnode.c:464:9 #11 0x8802dd in ExecProcNode /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/include/executor/executor.h:259:9 #12 0x8802dd in ExecutePlan /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/execMain.c:1636:10 #13 0x8802dd in standard_ExecutorRun /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/execMain.c:363:3 #14 0x87ffbb in ExecutorRun /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/executor/execMain.c:307:3 #15 0xc36c07 in PortalRunSelect /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/tcop/pquery.c:924:4 #16 0xc364ca in PortalRun /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/tcop/pquery.c:768:18 #17 0xc34138 in exec_simple_query /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/tcop/postgres.c:1238:10 #18 0xc30953 in PostgresMain /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/tcop/postgres.c #19 0xb27e3f in BackendRun /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4482:2 #20 0xb2738d in BackendStartup /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4210:3 #21 0xb2738d in ServerLoop /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1804:7 #22 0xb24312 in PostmasterMain /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1476:11 #23 0x953694 in main /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:197:3 #24 0x7f834e39a209 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #25 0x7f834e39a2bb in __libc_start_main csu/../csu/libc-start.c:389:3 #26 0x4a40a0 in _start (/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/tmp_install/mnt/resource/bf/build/kestrel/HEAD/inst/bin/postgres+0x4a40a0) Note that cfbot is warning for a different reason now: https://cirrus-ci.com/task/5794615155490816
Justin Pryzby <pryzby@telsasoft.com> writes: > On Tue, Sep 27, 2022 at 02:55:18PM -0400, Robert Haas wrote: >> Both animals are running with -fsanitize=alignment and it's not >> difficult to believe that the commit mentioned above could have >> introduced an alignment problem where we didn't have one before, but >> without a stack backtrace I don't know how to track it down. I tried >> running those tests locally with -fsanitize=alignment and they passed. > There's one here: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-09-27%2018%3A43%3A06 On longfin's host, the test_decoding run produces two core files. One has a backtrace like this: * frame #0: 0x000000010a36af8c postgres`ParseCommitRecord(info='\x80', xlrec=0x00007fa0678a8090, parsed=0x00007ff7b5c50e78)at xactdesc.c:102:30 frame #1: 0x000000010a765f9e postgres`xact_decode(ctx=0x00007fa0680d9118, buf=0x00007ff7b5c51000) at decode.c:201:5 [opt] frame #2: 0x000000010a765d17 postgres`LogicalDecodingProcessRecord(ctx=0x00007fa0680d9118, record=<unavailable>) at decode.c:119:3[opt] frame #3: 0x000000010a76d890 postgres`pg_logical_slot_get_changes_guts(fcinfo=<unavailable>, confirm=true, binary=false)at logicalfuncs.c:271:5 [opt] frame #4: 0x000000010a76d320 postgres`pg_logical_slot_get_changes(fcinfo=<unavailable>) at logicalfuncs.c:338:9 [opt] frame #5: 0x000000010a5a521d postgres`ExecMakeTableFunctionResult(setexpr=<unavailable>, econtext=0x00007fa068098f50,argContext=<unavailable>, expectedDesc=0x00007fa06701ba38, randomAccess=<unavailable>) at execSRF.c:234:13[opt] frame #6: 0x000000010a5c405b postgres`FunctionNext(node=0x00007fa068098d40) at nodeFunctionscan.c:95:5 [opt] frame #7: 0x000000010a5a61b9 postgres`ExecScan(node=0x00007fa068098d40, accessMtd=(postgres`FunctionNext at nodeFunctionscan.c:61),recheckMtd=(postgres`FunctionRecheck at nodeFunctionscan.c:251)) at execScan.c:199:10 [opt] frame #8: 0x000000010a596ee0 postgres`standard_ExecutorRun [inlined] ExecProcNode(node=0x00007fa068098d40) at executor.h:259:9[opt] frame #9: 0x000000010a596eb8 postgres`standard_ExecutorRun [inlined] ExecutePlan(estate=<unavailable>, planstate=0x00007fa068098d40,use_parallel_mode=<unavailable>, operation=CMD_SELECT, sendTuples=<unavailable>, numberTuples=0,direction=1745456112, dest=0x00007fa067023848, execute_once=<unavailable>) at execMain.c:1636:10 [opt] frame #10: 0x000000010a596e2a postgres`standard_ExecutorRun(queryDesc=<unavailable>, direction=1745456112, count=0, execute_once=<unavailable>)at execMain.c:363:3 [opt] and the other * frame #0: 0x000000010a36af8c postgres`ParseCommitRecord(info='\x80', xlrec=0x00007fa06783a090, parsed=0x00007ff7b5c50040)at xactdesc.c:102:30 frame #1: 0x000000010a3cd24d postgres`xact_redo(record=0x00007fa0670096c8) at xact.c:6161:3 frame #2: 0x000000010a41770d postgres`ApplyWalRecord(xlogreader=0x00007fa0670096c8, record=0x00007fa06783a060, replayTLI=0x00007ff7b5c507f0)at xlogrecovery.c:1897:2 frame #3: 0x000000010a4154be postgres`PerformWalRecovery at xlogrecovery.c:1728:4 frame #4: 0x000000010a3e0dc7 postgres`StartupXLOG at xlog.c:5473:3 frame #5: 0x000000010a7498a0 postgres`StartupProcessMain at startup.c:267:2 [opt] frame #6: 0x000000010a73e2cb postgres`AuxiliaryProcessMain(auxtype=StartupProcess) at auxprocess.c:141:4 [opt] frame #7: 0x000000010a745b97 postgres`StartChildProcess(type=StartupProcess) at postmaster.c:5408:3 [opt] frame #8: 0x000000010a7487e2 postgres`PostmasterStateMachine at postmaster.c:4006:16 [opt] frame #9: 0x000000010a745804 postgres`reaper(postgres_signal_arg=<unavailable>) at postmaster.c:3256:2 [opt] frame #10: 0x00007ff815b16dfd libsystem_platform.dylib`_sigtramp + 29 frame #11: 0x00007ff815accd5b libsystem_kernel.dylib`__select + 11 frame #12: 0x000000010a74689c postgres`ServerLoop at postmaster.c:1768:13 [opt] frame #13: 0x000000010a743fbb postgres`PostmasterMain(argc=<unavailable>, argv=0x00006000006480a0) at postmaster.c:1476:11[opt] frame #14: 0x000000010a61c775 postgres`main(argc=8, argv=<unavailable>) at main.c:197:3 [opt] Looks like it might be the same bug, but perhaps not. I recompiled access/transam and access/rmgrdesc at -O0 to get the accurate line numbers shown for those files. Let me know if you need any more info; I can add -O0 in more places, or poke around in the cores. regards, tom lane
I wrote: > * frame #0: 0x000000010a36af8c postgres`ParseCommitRecord(info='\x80', xlrec=0x00007fa0678a8090, parsed=0x00007ff7b5c50e78)at xactdesc.c:102:30 Okay, so the problem is this: by widening RelFileNumber to 64 bits, you have increased the alignment requirement of struct RelFileLocator, and thereby also SharedInvalidationMessage, to 8 bytes where it had been 4. longfin's alignment check is therefore expecting that xl_xact_twophase will likewise be 8-byte-aligned, but it isn't: (lldb) p data (char *) $0 = 0x00007fa06783a0a4 "\U00000001" I'm not sure whether the code that generates commit WAL records is breaking a contract it should maintain, or xactdesc.c needs to be taught to not assume that this data is adequately aligned. There is a second problem that I am going to hold your feet to the fire about: (lldb) p sizeof(SharedInvalidationMessage) (unsigned long) $1 = 24 We have sweated a good deal for a long time to keep that struct to 16 bytes. I do not think 50% bloat is acceptable. regards, tom lane
On Tue, Sep 27, 2022 at 4:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > * frame #0: 0x000000010a36af8c postgres`ParseCommitRecord(info='\x80', xlrec=0x00007fa0678a8090, parsed=0x00007ff7b5c50e78)at xactdesc.c:102:30 > > Okay, so the problem is this: by widening RelFileNumber to 64 bits, > you have increased the alignment requirement of struct RelFileLocator, > and thereby also SharedInvalidationMessage, to 8 bytes where it had > been 4. longfin's alignment check is therefore expecting that > xl_xact_twophase will likewise be 8-byte-aligned, but it isn't: Yeah, I reached the same conclusion. > There is a second problem that I am going to hold your feet to the > fire about: > > (lldb) p sizeof(SharedInvalidationMessage) > (unsigned long) $1 = 24 > > We have sweated a good deal for a long time to keep that struct > to 16 bytes. I do not think 50% bloat is acceptable. I noticed that problem, too. The attached patch, which perhaps you can try out, fixes the alignment issue and also reduces the size of SharedInvalidationMessage from 24 bytes back to 20 bytes. I do not really see a way to do better than that. We use 1 type byte, 3 bytes for the backend ID, 4 bytes for the database OID, and 4 bytes for the tablespace OID. Previously, we then used 4 bytes for the relfilenode, but now we need 7, and there's no place from which we can plausibly steal those bits, at least not as far as I can see. If you have ideas, I'm all ears. Also, I don't really know what problem you think it's going to cause if that structure gets bigger. If we increased the size from 16 bytes even all the way to 32 or 64 bytes, what negative impact do you think that would have? It would use a little bit more shared memory, but on modern systems I doubt it would be enough to get excited about. The bigger impact would probably be that it would make commit records a bit bigger since those carry invalidations as payload. That is not great, but I think it only affects commit records for transactions that do DDL, so I'm struggling to see that as a big performance problem. -- Robert Haas EDB: http://www.enterprisedb.com
Вложения
... also, lapwing's not too happy [1]. The alter_table test expects this to yield zero rows, but it doesn't: SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL; I've reproduced that symptom in a 32-bit FreeBSD VM building with clang, so I suspect that it'll occur on any 32-bit build. mamba is a couple hours away from offering a confirmatory data point, though. (BTW, is that test case sane at all? I'm bemused by the symmetrical NOT NULL tests on a fundamentally not-symmetrical left join; what are those supposed to accomplish? Also, the fact that it doesn't deign to show any fields from "c" is sure making it hard to tell what's wrong.) regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2022-09-27%2018%3A40%3A18
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Sep 27, 2022 at 4:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There is a second problem that I am going to hold your feet to the >> fire about: >> (lldb) p sizeof(SharedInvalidationMessage) >> (unsigned long) $1 = 24 > Also, I don't really know what problem you think it's going to cause > if that structure gets bigger. If we increased the size from 16 bytes > even all the way to 32 or 64 bytes, what negative impact do you think > that would have? Maybe it wouldn't have any great impact. I don't know, but I don't think it's incumbent on me to measure that. You or the patch author should have had a handle on that question *before* committing. regards, tom lane
On Wed, Sep 28, 2022 at 2:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > ... also, lapwing's not too happy [1]. The alter_table test > expects this to yield zero rows, but it doesn't: By looking at regression diff as shown below, it seems that we are able to get the relfilenode from the Oid using pg_relation_filenode(oid) but the reverse mapping pg_filenode_relation(reltablespace, relfilenode) returned NULL. I am not sure but by looking at the code it is somehow related to alignment padding while computing the hash key size in the 32-bit machine in the function InitializeRelfilenumberMap(). I am still looking into this and will provide updates on this. + oid | mapped_oid | reltablespace | relfilenode | relname +-------+------------+---------------+-------------+------------------------------------------------ + 16385 | | 0 | 100000 | char_tbl + 16388 | | 0 | 100001 | float8_tbl + 16391 | | 0 | 100002 | int2_tbl + 16394 | | 0 | 100003 | int4_tbl -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
wrasse is also failing with a bus error, but I cannot get the stack trace. So it seems it is hitting some alignment issues during startup [1]. Is it possible to get the backtrace or lineno? [1] 2022-09-28 03:19:26.228 CEST [180:4] LOG: redo starts at 0/30FE9D8 2022-09-28 03:19:27.674 CEST [177:3] LOG: startup process (PID 180) was terminated by signal 10: Bus Error 2022-09-28 03:19:27.674 CEST [177:4] LOG: terminating any other active server processes 2022-09-28 03:19:27.677 CEST [177:5] LOG: shutting down due to startup process failure 2022-09-28 03:19:27.681 CEST [177:6] LOG: database system is shut down -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Dilip Kumar <dilipbalaut@gmail.com> writes: > wrasse is also failing with a bus error, Yeah. At this point I think it's time to call for this patch to get reverted. It should get tested *off line* on some non-Intel, non-64-bit, alignment-picky architectures before the rest of us have to deal with it any more. There may be a larger conversation to be had here about how much our CI infrastructure should be detecting. There seems to be a depressingly large gap between what that found and what the buildfarm is finding --- not only in portability issues, but in things like cpluspluscheck failures, which I had supposed CI would find. regards, tom lane
On Wed, Sep 28, 2022 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > There may be a larger conversation to be had here about how > much our CI infrastructure should be detecting. There seems > to be a depressingly large gap between what that found and > what the buildfarm is finding --- not only in portability > issues, but in things like cpluspluscheck failures, which > I had supposed CI would find. +1, Andres had some sanitizer flags in the works (stopped by a weird problem to be resolved first), and 32 bit CI would clearly be good. It also seems that ARM is now available to us via CI (either Amazon's or a Mac), which IIRC is more SIGBUS-y about alignment than x86? FTR CI reported that cpluspluscheck failure and more[1], so perhaps we just need to get clearer agreement on the status of CI, ie a policy that CI had better be passing before you get to the next stage. It's still pretty new... [1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/39/3711
On Wed, Sep 28, 2022 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dilip Kumar <dilipbalaut@gmail.com> writes: > > wrasse is also failing with a bus error, > > Yeah. At this point I think it's time to call for this patch > to get reverted. It should get tested *off line* on some > non-Intel, non-64-bit, alignment-picky architectures before > the rest of us have to deal with it any more. > > There may be a larger conversation to be had here about how > much our CI infrastructure should be detecting. There seems > to be a depressingly large gap between what that found and > what the buildfarm is finding --- not only in portability > issues, but in things like cpluspluscheck failures, which > I had supposed CI would find. Okay. Btw, I think the reason for the bus error on wrasse is the same as what is creating failure on longfin[1], I mean this unaligned access is causing Bus error during startup, IMHO. frame #0: 0x000000010a36af8c postgres`ParseCommitRecord(info='\x80', xlrec=0x00007fa06783a090, parsed=0x00007ff7b5c50040) at xactdesc.c:102:30 frame #1: 0x000000010a3cd24d postgres`xact_redo(record=0x00007fa0670096c8) at xact.c:6161:3 frame #2: 0x000000010a41770d postgres`ApplyWalRecord(xlogreader=0x00007fa0670096c8, record=0x00007fa06783a060, replayTLI=0x00007ff7b5c507f0) at xlogrecovery.c:1897:2 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Dilip Kumar <dilipbalaut@gmail.com> writes: > Btw, I think the reason for the bus error on wrasse is the same as > what is creating failure on longfin[1], I mean this unaligned access > is causing Bus error during startup, IMHO. Maybe, but there's not a lot of evidence for that. wrasse got through the test_decoding check where longfin, tamandua, kestrel, and now skink are failing. It's evidently not the same issue that the 32-bit animals are choking on, either. Looks like yet a third bug to me. regards, tom lane
On Wed, Sep 28, 2022 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Sep 28, 2022 at 2:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > ... also, lapwing's not too happy [1]. The alter_table test > > expects this to yield zero rows, but it doesn't: > > By looking at regression diff as shown below, it seems that we are > able to get the relfilenode from the Oid using > pg_relation_filenode(oid) but the reverse mapping > pg_filenode_relation(reltablespace, relfilenode) returned NULL. > It was a silly mistake, I used the F_OIDEQ function instead of F_INT8EQ. Although this was correct on the 0003 patch where we have removed the tablespace from key, but got missed in this :( I have locally reproduced this in a 32 bit machine consistently and the attached patch is fixing the issue for me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
On Wed, Sep 28, 2022 at 9:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > It was a silly mistake, I used the F_OIDEQ function instead of > F_INT8EQ. Although this was correct on the 0003 patch where we have > removed the tablespace from key, but got missed in this :( > > I have locally reproduced this in a 32 bit machine consistently and > the attached patch is fixing the issue for me. I tested this with an armhf (32 bit) toolchain, and it passes check-world, and was failing before. Robert's patch isn't needed on this system. I didn't look into this subject for long but it seems that SIGBUS on misaligned access (as typically seen on eg SPARC) requires a 32 bit Linux/ARM kernel, but I was testing with 32 bit processes and a 64 bit kernel. Apparently 32 bit Linux/ARM has a control /proc/cpu/alignment to select behaviour (options include emulation, SIGBUS) but 64 bit kernels don't have it and are happy with misaligned access.
On Wed, Sep 28, 2022 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Dilip Kumar <dilipbalaut@gmail.com> writes: > > Btw, I think the reason for the bus error on wrasse is the same as > > what is creating failure on longfin[1], I mean this unaligned access > > is causing Bus error during startup, IMHO. > > Maybe, but there's not a lot of evidence for that. wrasse got > through the test_decoding check where longfin, tamandua, kestrel, > and now skink are failing. It's evidently not the same issue > that the 32-bit animals are choking on, either. Looks like yet > a third bug to me. I think the reason is that "longfin" is configured with the -fsanitize=alignment option so it will report the failure for any unaligned access. Whereas "wrasse" actually generates the "Bus error" due to architecture. So the difference is that with -fsanitize=alignment, it will always complain for any unaligned access but all unaligned access will not end up in the "Bus error", and I think that could be the reason "wrasse" is not failing in the test decoding. Yeah but anyway this is just a theory behind why failing at different places but we still do not have evidence/call stack to prove that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 1:48 AM Thomas Munro <thomas.munro@gmail.com> wrote: > FTR CI reported that cpluspluscheck failure and more[1], so perhaps we > just need to get clearer agreement on the status of CI, ie a policy > that CI had better be passing before you get to the next stage. It's > still pretty new... Yeah, I suppose I have to get in the habit of looking at CI before committing anything. It's sort of annoying to me, though. Here's a list of the follow-up fixes I've so far committed: 1. headerscheck 2. typos 3. pg_buffercache's meson.build 4. compiler warning 5. alignment problem 6. F_INTEQ/F_OIDEQ problem CI caught (1), (3), and (4). The buildfarm caught (1), (5), and (6). The number of buildfarm failures that I would have avoided by checking CI is less than the number of extra things I had to fix to keep CI happy, and the serious problems were caught by the buildfarm, not by CI. It's not even clear to me how I was supposed to know that every future Makefile change is going to require adjusting a meson.build file as well. It's not like that was mentioned in the commit message for the meson build system, which also has no README anywhere in the source tree. I found the wiki page by looking up the commit and finding the URL in the commit message, but, you know, that wiki page ALSO doesn't mention the need to now update meson.build files going forward. So I guess the way you're supposed to know that you need to update meson.build that is by looking at CI, but CI is also the only reason it's necessary to carry about meson.build in the first place. I feel like CI has not really made it in any easier to not break the buildfarm -- it's just provided a second buildfarm that you can break independently of the first one. And like the existing buildfarm, it's severely under-documented. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 1:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Dilip Kumar <dilipbalaut@gmail.com> writes: > > wrasse is also failing with a bus error, > > Yeah. At this point I think it's time to call for this patch > to get reverted. It should get tested *off line* on some > non-Intel, non-64-bit, alignment-picky architectures before > the rest of us have to deal with it any more. I don't really understand how you expect me or Dilip to do this. Is every PostgreSQL hacker supposed to have a bunch of antiquated servers in their basement so that they can test this stuff? I don't think I have had easy access to non-Intel, non-64-bit, alignment-picky hardware in probably 25 years, unless my old Raspberry Pi counts. I admit that I should have checked the CI results before pushing this commit, but as you say yourself, that missed a bunch of stuff, and I'd say it was the important stuff. Unless and until CI is able to check all the same configurations that the buildfarm can check, it's not going to be possible to get test results on some of these platforms except by checking the code in and seeing what happens. If I revert this, I'm just going to be sitting here not knowing where any of the problems are and having no way to find them. Maybe I'm missing something here. Apart from visual inspection of the code and missing fewer mistakes than I did, how would you have avoided these problems in one of your commits? -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Sep 27, 2022 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > ... also, lapwing's not too happy [1]. The alter_table test > expects this to yield zero rows, but it doesn't: > > SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid > WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL; > > I've reproduced that symptom in a 32-bit FreeBSD VM building with clang, > so I suspect that it'll occur on any 32-bit build. mamba is a couple > hours away from offering a confirmatory data point, though. > > (BTW, is that test case sane at all? I'm bemused by the symmetrical > NOT NULL tests on a fundamentally not-symmetrical left join; what > are those supposed to accomplish? Also, the fact that it doesn't > deign to show any fields from "c" is sure making it hard to tell > what's wrong.) This was added by: commit f3fdd257a430ff581090740570af9f266bb893e3 Author: Noah Misch <noah@leadboat.com> Date: Fri Jun 13 19:57:59 2014 -0400 Harden pg_filenode_relation test against concurrent DROP TABLE. Per buildfarm member prairiedog. Back-patch to 9.4, where the test was introduced. Reviewed by Tom Lane. There seems to be a comment in that commit which explains the intent of those funny-looking NULL tests. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Sep 27, 2022 at 5:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe it wouldn't have any great impact. I don't know, but I don't > think it's incumbent on me to measure that. You or the patch author > should have had a handle on that question *before* committing. I agree. I should have gone through and checked that every place where RelFileLocator got embedded in some larger struct, there was no problem with making it bigger and increasing the alignment requirement. I'll go back and do that as soon as the immediate problems are fixed. This case would have stood out as something needing attention. Some of the cases are pretty subtle, though. tamandua is still unhappy even after pushing that fix, because xl_xact_relfilelocators embeds a RelFileLocator which now requires 8-byte alignment, and ParseCommitRecord has an undocumented assumption that none of the things embedded in a commit record require more than 4-byte alignment. Really, if it's critical for a struct to never require more than 4-byte alignment, there ought to be a comment about that *on the struct itself*. Having a comment on a function that does something with that struct is probably not really good enough, and we don't even have that. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 9:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > I agree. I should have gone through and checked that every place where > RelFileLocator got embedded in some larger struct, there was no > problem with making it bigger and increasing the alignment > requirement. I'll go back and do that as soon as the immediate > problems are fixed. This case would have stood out as something > needing attention. On second thought, I'm going to revert the whole thing. There's a bigger mess here than can be cleaned up on the fly. The alignment-related mess in ParseCommitRecord is maybe something for which I could just hack a quick fix, but what I've also just now realized is that this makes a huge number of WAL records larger by 4 bytes, since most WAL records will contain a block reference. I don't know whether that's OK or not, but I do know that it hasn't been thought about, and after commit is not the time to begin experimenting with such things. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 6:48 AM Robert Haas <robertmhaas@gmail.com> wrote: > On second thought, I'm going to revert the whole thing. There's a > bigger mess here than can be cleaned up on the fly. The > alignment-related mess in ParseCommitRecord is maybe something for > which I could just hack a quick fix, but what I've also just now > realized is that this makes a huge number of WAL records larger by 4 > bytes, since most WAL records will contain a block reference. It would be useful if there were generic tests that caught issues like this. There are various subtle effects related to how struct layout can impact WAL record size that might easily be missed. It's not like there are a huge number of truly critical WAL records to have tests for. The example that comes to mind is the XLOG_BTREE_INSERT_POST record type, which is used for B-Tree index tuple inserts with a posting list split. There is only an extra 2 bytes of payload for these record types compared to conventional XLOG_BTREE_INSERT_LEAF records, but we nevertheless tend to see a final record size that is consistently a full 8 bytes larger in many important cases, despite not needing to stored the IndexTuple with alignment padding. I believe that this is a consequence of the record header itself needing to be MAXALIGN()'d. Another important factor in this scenario is the general tendency for index tuple sizes to leave the final XLOG_BTREE_INSERT_LEAF record size at 64 bytes. It wouldn't have been okay if the deduplication work made that size jump up to 72 bytes for many kinds of indexes across the board, even when there was no accompanying posting list split (i.e. the vast majority of the time). Maybe it would have been okay if nbtree leaf page insert records were naturally rare, but that isn't the case at all, obviously. That's why we have two different record types here in the first place. Earlier versions of the deduplication patch just added an OffsetNumber field to XLOG_BTREE_INSERT_LEAF which could be set to InvalidOffsetNumber, resulting in a surprisingly large amount of waste in terms of WAL size. Because of the presence of 3 different factors. We don't bother doing this with the split records, which can also have accompanying posting list splits, because it makes hardly any difference at all (split records are much rarer than any kind of leaf insert record, and are far larger when considered individually). -- Peter Geoghegan
On 2022-Sep-28, Peter Geoghegan wrote: > It would be useful if there were generic tests that caught issues like > this. There are various subtle effects related to how struct layout > can impact WAL record size that might easily be missed. It's not like > there are a huge number of truly critical WAL records to have tests > for. What do you think would constitute a test here? Say: insert N records to a heapam table with one index of each kind (under controlled conditions: no checkpoint, no autovacuum, no FPIs), then measure the total number of bytes used by WAL records of each rmgr. Have a baseline and see how that changes over time. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 2022-Sep-28, Robert Haas wrote: > The number of buildfarm failures that I would have avoided by checking > CI is less than the number of extra things I had to fix to keep CI > happy, and the serious problems were caught by the buildfarm, not by > CI. [...] So I guess the way you're supposed to know that you need to > update meson.build that is by looking at CI, but CI is also the only > reason it's necessary to carry about meson.build in the first place. I > feel like CI has not really made it in any easier to not break the > buildfarm -- it's just provided a second buildfarm that you can break > independently of the first one. I have an additional, unrelated complaint about CI, which is that we don't have anything for past branches. I have a partial hack(*), but I wish we had something we could readily use. (*) I just backpatched the commit that added the .cirrus.yml file, plus some later fixes to it, and I keep that as a separate branch which I merge with whatever other changes I want to test. I then push that to github, and ignore the windows results when looking at cirrus-ci.com. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
On Thu, Sep 29, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > ... Here's a > list of the follow-up fixes I've so far committed: > > 1. headerscheck > 2. typos > 3. pg_buffercache's meson.build > 4. compiler warning > 5. alignment problem > 6. F_INTEQ/F_OIDEQ problem > > CI caught (1), (3), and (4). The buildfarm caught (1), (5), and (6). I think at least some of 5 and all of 6 would be covered by sanitizer flags and a 32 bit test respectively, and I think we should add those. We're feeling our way here, working out what's worth including at non-zero cost for each thing we could check. In other cases you and I have fought with, it's been Windows problems (mingw differences, or file handle semantics), which are frustrating to us all, but I see Meson as part of the solution to that: uniform testing on Windows (whereas the crusty perl would not run all the tests), and CI support for mingw is in the pipeline. > ... I > feel like CI has not really made it in any easier to not break the > buildfarm -- it's just provided a second buildfarm that you can break > independently of the first one. I don't agree with this. The build farm clearly has more ways to break than CI, because it has more CPUs, compilers, operating systems, combinations of configure options and rolls of the timing dice, but CI now catches a lot and, importantly, *before* it reaches the 'farm and everyone starts shouting a lot of stuff at you that you already knew, because it's impacting their work. Unless you don't look, and then it's just something that breaks with the build farm, and then you break CI on master for everyone else too and more people shout. I'm not personally aware of any significant project that isn't using CI, and although we're late to the party I happen to think that ours is getting pretty good considering the complexities. And it's going to keep improving.
On Wed, Sep 28, 2022 at 12:32 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I don't agree with this. The build farm clearly has more ways to > break than CI, because it has more CPUs, compilers, operating systems, > combinations of configure options and rolls of the timing dice, but CI > now catches a lot and, importantly, *before* it reaches the 'farm and > everyone starts shouting a lot of stuff at you that you already knew, > because it's impacting their work. Right. I really don't can't imagine how CI could be seen as anything less than a very significant improvement. It wasn't that long ago that commits that certain kinds of work that used OS facilities would routinely break Windows in some completely predictable way. Just breaking every single Windows buildfarm animal was almost a routine occurrence. It was normal. Remember that? Of course it is also true that anything that breaks the buildfarm today will be disproportionately difficult and subtle. You really do have 2 buildfarms to break -- it's just that one of those buildfarms can be broken and fixed without it bothering anybody else, which is typically enough to prevent breaking the real buildfarm. But only if you actually check both! -- Peter Geoghegan
Robert Haas <robertmhaas@gmail.com> writes: > Yeah, I suppose I have to get in the habit of looking at CI before > committing anything. It's sort of annoying to me, though. Here's a > list of the follow-up fixes I've so far committed: > 1. headerscheck > 2. typos > 3. pg_buffercache's meson.build > 4. compiler warning > 5. alignment problem > 6. F_INTEQ/F_OIDEQ problem > CI caught (1), (3), and (4). The buildfarm caught (1), (5), and (6). > The number of buildfarm failures that I would have avoided by checking > CI is less than the number of extra things I had to fix to keep CI > happy, and the serious problems were caught by the buildfarm, not by > CI. That seems like an unfounded complaint. You would have had to fix (3) and (4) in any case, on some time schedule or other. I agree that it'd be good if CI did some 32-bit testing so it could have caught (5) and (6), but that's being worked on. > So I guess the way you're supposed to know that you need to > update meson.build that is by looking at CI, but CI is also the only > reason it's necessary to carry about meson.build in the first place. Not so. People are already using meson in preference to the makefiles for some things, I believe. And we're expecting that meson will supplant the MSVC scripts pretty soon and the makefiles eventually. > And like the existing buildfarm, it's severely under-documented. That complaint I agree with. A wiki page is a pretty poor substitute for in-tree docs. regards, tom lane
On Wed, Sep 28, 2022 at 12:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > What do you think would constitute a test here? I would start with something simple. Focus on the record types that we know are the most common. It's very skewed towards heap and nbtree record types, plus some transaction rmgr types. > Say: insert N records to a heapam table with one index of each kind > (under controlled conditions: no checkpoint, no autovacuum, no FPIs), > then measure the total number of bytes used by WAL records of each rmgr. > Have a baseline and see how that changes over time. There are multiple flavors of alignment involved here, which makes it tricky. For example, in index AMs the lp_len field from each line pointer is always MAXALIGN()'d. It is only aligned as required for the underlying heap tuple attributes in the case of heap tuples, though. Of course you also have alignment considerations for the record itself -- buffer data can usually be stored without being aligned at all. But you can still have an impact from WAL header alignment, especially for record types that tend to be relatively small -- like nbtree index tuple inserts on leaf pages. I think that the most interesting variation is among boundary cases for those records that affect a variable number of page items. These record types may be impacted by alignment considerations in subtle though important ways. Things like PRUNE records often don't have that many items. So having coverage of the overhead of every variation of a small PRUNE record could be important as a way of catching regressions that would otherwise be hard to catch. Continuing with that example, we could probably cover every possible permutation of PRUNE records that affect 5 or so items. Let's say that we have a regression where PRUNE records that happen to have 3 items that must all be set LP_DEAD increase in size by one MAXALIGN() quantum. This will probably make a huge difference in many workloads, but it's difficult to spot after the fact when it only affects those records that happen to have a number of items that happen to fall in some narrow but critical range. It might not affect PRUNE records with (say) 5 items at all. So if we're looking at the macro picture with (say) pgbench and pg_waldump we'll tend to miss the regression right now; it'll be obscured by the fact that the regression only affects a minority of all PRUNE records, for whatever reason. This is just a made up example, so the specifics might be off significantly -- I'd have to work on it to be sure. Hopefully the example still gets the general idea across. -- Peter Geoghegan
Hi, On 2022-09-28 16:07:13 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > And like the existing buildfarm, it's severely under-documented. > > That complaint I agree with. A wiki page is a pretty poor substitute > for in-tree docs. I assume we're talking about CI? What would you like to see documented? There is some content in src/tools/ci/README and the wiki page links to that too. Should we lift it into the sgml docs? If we're talking about meson - there's a pending documentation commit. It'd be good to get some review for it! Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-09-28 16:07:13 -0400, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> And like the existing buildfarm, it's severely under-documented. >> That complaint I agree with. A wiki page is a pretty poor substitute >> for in-tree docs. > I assume we're talking about CI? I was thinking of meson when I wrote that ... but re-reading it, I think Robert meant CI. regards, tom lane
Hi, On 2022-09-28 22:14:11 -0400, Tom Lane wrote: > I was thinking of meson when I wrote that ... but re-reading it, > I think Robert meant CI. FWIW, I had planned to put the "translation table" between autoconf and meson into the docs, but Peter E. argued that the wiki is better for that. Happy to change course on that aspect if there's agreement on it. Greetings, Andres Freund
Hi, On 2022-09-28 21:22:26 +0200, Alvaro Herrera wrote: > I have an additional, unrelated complaint about CI, which is that we > don't have anything for past branches. I have a partial hack(*), but > I wish we had something we could readily use. > > (*) I just backpatched the commit that added the .cirrus.yml file, plus > some later fixes to it, and I keep that as a separate branch which I > merge with whatever other changes I want to test. I then push that to > github, and ignore the windows results when looking at cirrus-ci.com. I'd not be against backpatching the ci stuff if there were sufficient demand for it. It'd probably be a decent bit of initial work, but after that it shouldn't be too bad. Greetings, Andres Freund
Hi, On 2022-09-28 16:07:13 -0400, Tom Lane wrote: > I agree that it'd be good if CI did some 32-bit testing so it could have > caught (5) and (6), but that's being worked on. I wasn't aware of anybody doing so, thus here's a patch for that. I already added the necessary packages to the image. I didn't install llvm for 32bit because that'd have a) bloated the image unduly b) they can't currently be installed in parallel afaics. Attached is the patch adding it to CI. To avoid paying the task startup overhead twice, it seemed a tad better to build and test 32bit as part of an existing task. We could instead give each job fewer CPUs and run them concurrently. It might be worth changing one of the builds to use -Dwal_blocksize=4 and a few other flags, to increase our coverage. Greetings, Andres Freund
Вложения
Hi, On 2022-09-29 17:31:35 -0700, Andres Freund wrote: > I already added the necessary packages to the image. I didn't install llvm for > 32bit because that'd have a) bloated the image unduly b) they can't currently > be installed in parallel afaics. > Attached is the patch adding it to CI. To avoid paying the task startup > overhead twice, it seemed a tad better to build and test 32bit as part of an > existing task. We could instead give each job fewer CPUs and run them > concurrently. Ah, one thing I forgot to mention: The 32bit perl currently can't have a packaged IO:Pty installed. We could install it via cpan, but it doesn't seem worth the bother. Hence one skipped test in the 32bit build. Example run: https://cirrus-ci.com/task/4632556472631296?logs=test_world_32#L249 (scroll to the bottom) Onder if we should vary some build options like ICU or the system collation? Tom, wasn't there something recently that made you complain about not having coverage around collations due to system settings? Greetings, Andres Freund
On Thu, Sep 29, 2022 at 5:40 PM Andres Freund <andres@anarazel.de> wrote: > Onder if we should vary some build options like ICU or the system collation? > Tom, wasn't there something recently that made you complain about not having > coverage around collations due to system settings? That was related to TRUST_STRXFRM: https://postgr.es/m/CAH2-Wzmqrjqv9pgyzebgnqmcac1Ct+UxG3VQU7kSVUNDf_yF2A@mail.gmail.com -- Peter Geoghegan
Hi, On 2022-09-29 18:16:51 -0700, Peter Geoghegan wrote: > On Thu, Sep 29, 2022 at 5:40 PM Andres Freund <andres@anarazel.de> wrote: > > Onder if we should vary some build options like ICU or the system collation? > > Tom, wasn't there something recently that made you complain about not having > > coverage around collations due to system settings? > > That was related to TRUST_STRXFRM: > > https://postgr.es/m/CAH2-Wzmqrjqv9pgyzebgnqmcac1Ct+UxG3VQU7kSVUNDf_yF2A@mail.gmail.com It wasn't even that one, although I do recall it now that I reread the thread. But it did successfully jog my memory: https://www.postgresql.org/message-id/69170.1663425842%40sss.pgh.pa.us So possibly it could be worth running one of them with LANG=C? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Tom, wasn't there something recently that made you complain about not having > coverage around collations due to system settings? We found there was a gap for ICU plus LANG=C, IIRC. regards, tom lane
Hi, On 2022-09-29 21:24:44 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Tom, wasn't there something recently that made you complain about not having > > coverage around collations due to system settings? > > We found there was a gap for ICU plus LANG=C, IIRC. Using that then. Any opinions about whether to do this only in head or backpatch to 15? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Any opinions about whether to do this only in head or backpatch to 15? HEAD should be sufficient, IMO. regards, tom lane
Hi, On 2022-09-29 22:16:10 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Any opinions about whether to do this only in head or backpatch to 15? > > HEAD should be sufficient, IMO. Pushed. I think we should add some more divergent options to increase the coverage. E.g. a different xlog pagesize, a smaller segmement size so we can test the "segment boundary" code (although we don't currently allow < 1GB via normal means right now) etc. But that's for later. Greetings, Andres Freund
On Wed, Sep 28, 2022 at 08:45:31PM -0700, Andres Freund wrote: > Hi, > > On 2022-09-28 21:22:26 +0200, Alvaro Herrera wrote: > > I have an additional, unrelated complaint about CI, which is that we > > don't have anything for past branches. I have a partial hack(*), but > > I wish we had something we could readily use. > > > > (*) I just backpatched the commit that added the .cirrus.yml file, plus > > some later fixes to it, and I keep that as a separate branch which I > > merge with whatever other changes I want to test. I then push that to > > github, and ignore the windows results when looking at cirrus-ci.com. You wouldn't need to ignore Windows tap test failures if you also backpatch 76e38b37a, and either disable PG_TEST_USE_UNIX_SOCKETS, or include 45f52709d. > I'd not be against backpatching the ci stuff if there were sufficient demand > for it. It'd probably be a decent bit of initial work, but after that it > shouldn't be too bad. I just tried this, which works fine at least for v11-v14: | git checkout origin/REL_15_STABLE .cirrus.yml src/tools/ci https://cirrus-ci.com/task/5742859943936000 v15a https://cirrus-ci.com/task/6725412431593472 v15b https://cirrus-ci.com/task/5105320283340800 v13 https://cirrus-ci.com/task/4809469463887872 v12 https://cirrus-ci.com/task/6659971021537280 v11 (I still suggest my patches to run all tests using vcregress. The number of people who remember that, for v15, cirrusci runs incomplete tests is probably fewer than five.) https://www.postgresql.org/message-id/20220623193125.GB22452%40telsasoft.com https://www.postgresql.org/message-id/20220828144447.GA21897%40telsasoft.com If cirrusci were backpatched, it'd be kind of nice to use a ccache key that includes the branch name (but maybe the overhead of compilation is unimportant compared to the workload induced by cfbot). A gripe from me: the regression.diffs and other logs from the SQL regression tests are in a directory called "main" (same for "isolation"). I imagine I won't be the last person to spend minutes looking through the list of test dirs for the entry called "regress", conclude that it's inexplicably absent, and locate it only after reading src/test/regress/meson.build. -- Justin
Hi, On 2022-10-01 11:14:20 -0500, Justin Pryzby wrote: > I just tried this, which works fine at least for v11-v14: > | git checkout origin/REL_15_STABLE .cirrus.yml src/tools/ci > > https://cirrus-ci.com/task/5742859943936000 v15a > https://cirrus-ci.com/task/6725412431593472 v15b > https://cirrus-ci.com/task/5105320283340800 v13 > https://cirrus-ci.com/task/4809469463887872 v12 > https://cirrus-ci.com/task/6659971021537280 v11 Cool, thanks for trying that! I wonder if there's any problems on other branches... > (I still suggest my patches to run all tests using vcregress. The number of > people who remember that, for v15, cirrusci runs incomplete tests is probably > fewer than five.) > https://www.postgresql.org/message-id/20220623193125.GB22452%40telsasoft.com > https://www.postgresql.org/message-id/20220828144447.GA21897%40telsasoft.com Andrew, the defacto maintainer of src/tools/msvc, kind of NACKed those. But the reasoning might not hold with vcregress being on life support. OTOH, to me the basic advantage is to have *any* CI coverage. We don't need to put the bar for the backbranches higher than were we were at ~2 weeks ago. > If cirrusci were backpatched, it'd be kind of nice to use a ccache key > that includes the branch name (but maybe the overhead of compilation is > unimportant compared to the workload induced by cfbot). Hm. The branch name in general sounds like it might be too broad, particularly for cfbot. I think we possibly should just put the major version into .cirrus.yml and use that as the cache key. I think that'd also solve some of the "diff against what" arguments we've had around your CI improvements. > A gripe from me: the regression.diffs and other logs from the SQL regression > tests are in a directory called "main" (same for "isolation"). I imagine I > won't be the last person to spend minutes looking through the list of test dirs > for the entry called "regress", conclude that it's inexplicably absent, and > locate it only after reading src/test/regress/meson.build. I'd have no problem renaming main/isolation to isolation/isolation and main/regress to pg_regress/regress or such. FWIW, if you add --print-errorlogs meson test will show you the output of just failed tests, which for pg_regress style tests will include the path to regression.diffs: ... The differences that caused some tests to fail can be viewed in the file "/srv/dev/build/m/testrun/cube/regress/regression.diffs". A copy of the test summary that you see above is saved in the file "/srv/dev/build/m/testrun/cube/regress/regression.out". It's too bad the default of --print-errorlogs can't be changed. Unfortunately we don't print something as useful in the case of tap tests. I wonder if we should do something like diff --git i/src/test/perl/PostgreSQL/Test/Utils.pm w/src/test/perl/PostgreSQL/Test/Utils.pm index 99d33451064..acc18ca7c85 100644 --- i/src/test/perl/PostgreSQL/Test/Utils.pm +++ w/src/test/perl/PostgreSQL/Test/Utils.pm @@ -239,6 +239,8 @@ END # # Preserve temporary directories after (1) and after (2). $File::Temp::KEEP_ALL = 1 unless $? == 0 && all_tests_passing(); + + diag("test logfile: $test_logfile"); } =pod Potentially doing so only if $? != 0. This would make the output for a failing test end like this: ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― stderr: # Failed test at /home/andres/src/postgresql/contrib/amcheck/t/001_verify_heapam.pl line 20. # Failed test at /home/andres/src/postgresql/contrib/amcheck/t/001_verify_heapam.pl line 22. # test logfile: /srv/dev/build/m/testrun/amcheck/001_verify_heapam/log/regress_log_001_verify_heapam # Looks like you failed 2 tests of 275. (test program exited with status code 2) ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― which should make it a lot easier to find the log? Greetings, Andres Freund
On Sat, Oct 01, 2022 at 03:15:14PM -0700, Andres Freund wrote: > On 2022-10-01 11:14:20 -0500, Justin Pryzby wrote: > > (I still suggest my patches to run all tests using vcregress. The number of > > people who remember that, for v15, cirrusci runs incomplete tests is probably > > fewer than five.) > > https://www.postgresql.org/message-id/20220623193125.GB22452%40telsasoft.com > > https://www.postgresql.org/message-id/20220828144447.GA21897%40telsasoft.com > > Andrew, the defacto maintainer of src/tools/msvc, kind of NACKed those. But > the reasoning might not hold with vcregress being on life support. I think you're referring to comment here: 87a81b91-87bf-c0bc-7e4f-06dffadcf737@dunslane.net ..which I tried to discuss here: 20220528153741.GK19626@telsasoft.com | I think there was some confusion about the vcregress "alltaptests" | target. I said that it's okay to add it and make cirrus use it (and | that the buildfarm could use it too). Andrew responded that the | buildfarm wants to run different tests separately. But Andres seems | to have interpretted that as an objection to the addition of an | "alltaptests" target, which I think isn't what's intended - it's fine | if the buildfarm prefers not to use it. > OTOH, to me the basic advantage is to have *any* CI coverage. We don't need to > put the bar for the backbranches higher than were we were at ~2 weeks ago. I agree that something is frequently better than nothing. But it could be worse if it gives the impression that "CI showed that everything was green", when in fact it hadn't run 10% of the tests: https://www.postgresql.org/message-id/CA%2BhUKGLneD%2Bq%2BE7upHGwn41KGvbxhsKbJ%2BM-y9nvv7_Xjv8Qog%40mail.gmail.com > I'd have no problem renaming main/isolation to isolation/isolation and > main/regress to pg_regress/regress or such. +1 -- Justin