Discussion: Running tests under valgrind is getting slower at an alarming pace
Hi,

After a recent migration of skink and a few other animals (sorry for the
false reports on the BF, I forgot to adjust a path), I looked at the time
it takes to complete a valgrind run:

9.6:  Consumed 4h 53min 18.518s CPU time
10:   Consumed 5h 32min 50.839s CPU time
11:   Consumed 7h 7min 17.455s CPU time
14:   still going at 11h 51min 57.951s
HEAD: 14h 32min 29.571s CPU time

I changed it so that HEAD will be built in parallel, separately from the
other branches, so that HEAD gets results within a useful timeframe. But
even with that, the test times are increasing at a rate we're not going to
be able to keep up with.

Part of this is caused by a lot of tests running serially rather than in
parallel. I was pondering setting PROVE_FLAGS=-j5 or something to reduce
the impact of the tap tests a bit.

Greetings,

Andres Freund
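[For concreteness, a minimal sketch of how an animal owner could wire that
up: the buildfarm client reads PROVE_FLAGS from the environment (see the
client code quoted later in this thread), and build-farm.conf is itself a
Perl file. The package name and build_env key follow the sample config
shipped with the client; the flag values are assumptions for illustration,
not a verified recipe.]

	# Hypothetical build-farm.conf fragment (the config file is plain Perl).
	# Entries under build_env are exported into the environment of each
	# build step, so the client's $ENV{PROVE_FLAGS} check picks them up.
	package PGBuild;

	our %conf = (
		# ... the animal's usual settings elided ...

		build_env => {
			# -j5: run five TAP suites concurrently;
			# --timer: report per-test timing in the logs.
			PROVE_FLAGS => '-j5 --timer',
		},
	);

[Note the trade-off that comes up later in the thread: under valgrind,
running suites concurrently raises each individual test's wall-clock time
even as it shortens the run overall.]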
Andres Freund <andres@anarazel.de> writes:
> After a recent migration of skink and a few other animals (sorry for the
> false reports on the BF, I forgot to adjust a path), I looked at the time
> it takes to complete a valgrind run:
> 9.6:  Consumed 4h 53min 18.518s CPU time
> 10:   Consumed 5h 32min 50.839s CPU time
> 11:   Consumed 7h 7min 17.455s CPU time
> 14:   still going at 11h 51min 57.951s
> HEAD: 14h 32min 29.571s CPU time

I have observed similar slowdowns across versions on just-plain-slow
animals, too.  A while ago (last December, I think), I tried enabling
--enable-tap-tests across the board on prairiedog, and observed these
buildfarm cycle times:

9.5   01:50:24
9.6   02:06:32
10    02:26:34
11    02:54:44
12    03:41:11
13    04:46:31
HEAD  04:49:04

I went back to not running TAP tests in the back branches :-(

prairiedog's latest HEAD run consumed 5:30, so it's gotten way worse since
December.  In the same comparison, my other animal longfin had gone from 14
to 18 minutes (and it's now up to 22 minutes).  It's not clear to me
whether greater available parallelism (12 CPUs vs 1) is alone enough to
explain why the more modern machine isn't suffering so badly.  As you say,
the TAP tests are not well parallelized at all, so that doesn't seem to fit
the facts.

In any case, it seems like we do need to be paying more attention to how
long it takes to do the TAP tests.  We could try to shave more cycles off
the overhead, and we should think a little harder about the long-term value
of every test case we add.

			regards, tom lane
On Wed, Oct 6, 2021 at 1:57 AM Andres Freund <andres@anarazel.de> wrote:
> After a recent migration of skink and a few other animals (sorry for the
> false reports on the BF, I forgot to adjust a path), I looked at the time
> it takes to complete a valgrind run:
>
> 9.6:  Consumed 4h 53min 18.518s CPU time
> 10:   Consumed 5h 32min 50.839s CPU time
> 11:   Consumed 7h 7min 17.455s CPU time
> 14:   still going at 11h 51min 57.951s
> HEAD: 14h 32min 29.571s CPU time
>
> I changed it so that HEAD will be built in parallel, separately from the
> other branches, so that HEAD gets results within a useful timeframe. But
> even with that, the test times are increasing at a rate we're not going
> to be able to keep up with.

Is the problem here that we're adding a lot of new test cases? Or is the
problem that valgrind runs are getting slower for the same number of test
cases?

If it's taking longer because we have more test cases, I'm honestly not
sure that's really something we should try to fix. I mean, I'm sure we
have some bad test cases here and there, but overall I think we still have
too little test coverage, not too much. The recent discovery that
recovery_end_command had zero test coverage is one fine example of that.

But if we've done something that increases the relative cost of valgrind,
maybe we can fix that in a centralized way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-10-06 12:09:36 -0400, Robert Haas wrote:
> Is the problem here that we're adding a lot of new test cases? Or is the
> problem that valgrind runs are getting slower for the same number of
> test cases?

I don't know precisely. It's probably a combination of several factors. I
do think we regressed somewhere around valgrind specifically - the leak
origin tracking in older branches seems to work better than in newer
branches. But I don't know if that affects performance.

> If it's taking longer because we have more test cases, I'm honestly not
> sure that's really something we should try to fix.

I'm not arguing for having fewer tests. But clearly, executing them
serially is problematic when the times are going up like this. Skink is
hosted on a machine with a CPU clocking around ~3.9GHz for most of the
test run - getting a faster machine won't help that much. But most of the
time only a few cores are active.

This isn't just a problem with valgrind; the reporting times for other
animals also aren't getting shorter... It takes my workstation 2min20s to
execute check-world in parallel, but more than 16min sequentially. The BF
executes tap tests sequentially...

> I mean, I'm sure we have some bad test cases here and there, but overall
> I think we still have too little test coverage, not too much. The recent
> discovery that recovery_end_command had zero test coverage is one fine
> example of that.
>
> But if we've done something that increases the relative cost of
> valgrind, maybe we can fix that in a centralized way.

There's probably some of that.

The fact that the tap test infrastructure does all communication with the
server via psql sessions that each execute only a single query is a
problem - connection startup is expensive. I think this is particularly a
problem for things like PostgresNode::poll_query_until(), which is also
used by ::wait_for_catchup(), because a) process creation is more
expensive under valgrind and b) things take longer under valgrind, so we
pay that price many more times. At the same time, increasing the timeout
for the poll loop also makes the tests slower, because all the waits for
things that have already finished do add up.

I'd guess that at the very least driving individual poll_query_until()
calls via a psql that stays running across queries would reduce this
substantially, and perhaps allow us to reduce the polling interval. But
there are probably some nontrivial challenges around recognizing query
boundaries :/

Briefly looking at a profile of valgrind, it looks like a lot of the CPU
time is spent doing syscalls related to logging. So far skink had
log_statement=all, log_connections=on, log_disconnections=on - I've turned
them off for the next runs. We'll see if that helps.

I'll also try to figure out how to print a bit more detail about the
timing of each tap test; looks like I need to figure out how to pass
PROVE_FLAGS='--timer' through the buildfarm. Shouldn't be too hard.

One thing I think would really help is having the total time for each run
visible in an animal's run history. That way we could pinpoint regressions
reasonably efficiently; right now that's not easily possible without
writing nontrivial queries against the buildfarm database...

Greetings,

Andres Freund
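[To make the persistent-psql idea concrete, here is a minimal sketch
assuming IPC::Run, which the TAP framework already uses. The \echo
sentinel is one possible answer to the query-boundary problem mentioned
above - not the project's actual implementation. The helper name
poll_query_until_persistent and the connection parameters are invented
for illustration, and error handling (failed queries, stderr capture) is
deliberately omitted.]

	use strict;
	use warnings;
	use IPC::Run qw(start pump finish);

	# One long-lived psql instead of one fork+connect per poll iteration.
	# -X: skip .psqlrc; -A -t: unaligned, tuples-only output; -q: quiet.
	my ($in, $out) = ('', '');
	my $psql = start([ 'psql', '-XAtq', '-d', 'postgres' ], \$in, \$out);

	# Invented helper; mirrors the shape of PostgresNode::poll_query_until().
	sub poll_query_until_persistent
	{
		my ($query, $expected) = @_;
		foreach (1 .. 180)
		{
			$out = '';
			# The \echo line is the sentinel marking the query boundary.
			$in .= "$query\n\\echo POLL_SENTINEL\n";
			pump $psql until $out =~ /^POLL_SENTINEL$/m;
			my ($result) = $out =~ /(.*?)\nPOLL_SENTINEL/s;
			return 1 if defined $result && $result eq $expected;
			sleep 1;
		}
		return 0;
	}

	poll_query_until_persistent('SELECT pg_is_in_recovery()', 'f')
		or die 'timed out waiting for standby promotion';

	finish $psql;

[This saves a fork/exec and a backend connection per probe, which is
exactly the cost that valgrind multiplies; the open question is what to do
when a query in the shared session errors out mid-poll.]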
Andres Freund <andres@anarazel.de> writes:
> One thing I think would really help is having the total time for each
> run visible in an animal's run history. That way we could pinpoint
> regressions reasonably efficiently; right now that's not easily possible
> without writing nontrivial queries against the buildfarm database...

+1.  I've lost count of how often I've had to drill down to an individual
run just because I wanted to see how long it took.  If we could fit that
into the branch history pages like

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=jay&br=HEAD

it'd be really useful IMO.  Perhaps we could replace "OK" with the total
time, so as to avoid making these tables bigger?  (This presumes that the
time for a failed run isn't so interesting.)

			regards, tom lane
On Wed, Oct 6, 2021 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:
> There's probably some of that.
>
> The fact that the tap test infrastructure does all communication with
> the server via psql sessions that each execute only a single query is a
> problem - connection startup is expensive.

Agreed. safe_psql() is a poor-quality interface. I've been surprised in
the past that we were relying on something so primitive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-10-06 12:58:34 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > One thing I think would really help is having the total time for each
> > run visible in an animal's run history. That way we could pinpoint
> > regressions reasonably efficiently; right now that's not easily
> > possible without writing nontrivial queries against the buildfarm
> > database...
>
> +1.  I've lost count of how often I've had to drill down to an
> individual run just because I wanted to see how long it took.  If we
> could fit that into the branch history pages like

I queried this in the DB for skink using

	SELECT snapshot::date,
	       substring(git_head_ref, 1, 12) AS git_rev,
	       (SELECT sum(stage_duration)
	        FROM build_status_log_raw bslr
	        WHERE bslr.sysname = bsr.sysname
	          AND bslr.snapshot = bsr.snapshot)
	FROM build_status_raw bsr
	WHERE branch = 'HEAD'
	  AND sysname = 'skink'
	  AND stage = 'OK'
	  AND snapshot > '2021-01-01'
	ORDER BY snapshot DESC;

  snapshot  |   git_rev    |   sum
------------+--------------+----------
 2021-10-06 | ec2133a44731 | 12:09:17
 2021-10-05 | 0266e98c6b86 | 10:55:10
 2021-10-03 | 2903f1404df3 | 10:24:11
 2021-09-30 | 20f8671ef69b | 10:31:43
 ...
 2021-06-14 | 2d689babe3cb | 10:29:07
 2021-06-12 | f452aaf7d4a9 | 10:26:12
 2021-06-11 | d08237b5b494 | 10:50:53
 2021-06-09 | 845cad4d51cb | 10:58:31
 2021-06-08 | eab81953682d | 09:06:35
 2021-06-06 | a2dee328bbe5 | 09:02:36
 2021-06-05 | e6159885b78e | 08:59:14
 2021-06-03 | 187682c32173 | 09:39:07
 2021-06-02 | df466d30c6ca | 09:03:05
 2021-05-31 | 7c544ecdad81 | 09:09:42
 2021-05-30 | ba356a397de5 | 08:54:29
 2021-05-28 | d69fcb9caef1 | 09:00:36
 2021-05-27 | 388e75ad3348 | 09:39:14
 2021-05-25 | e30e3fdea873 | 08:51:04
 2021-05-24 | 99c5852e20a0 | 08:57:08
 ...
 2021-03-23 | 1e3e8b51bda8 | 09:19:40
 2021-03-21 | 96ae658e6238 | 08:29:05
 2021-03-20 | 61752afb2640 | 08:15:47
 2021-03-18 | da18d829c281 | 08:34:02
 2021-03-17 | 6b67d72b604c | 09:11:46
 2021-03-15 | 146cb3889c3c | 08:20:21
 2021-03-14 | 58f57490facd | 08:06:07
 2021-03-12 | d60e61de4fb4 | 08:41:12
 2021-03-11 | 3f0daeb02f8d | 08:04:44
 2021-03-08 | 8a812e5106c5 | 08:46:01
 2021-03-07 | f9a0392e1cf3 | 08:01:47
 2021-03-05 | 0ce4cd04da55 | 08:01:32
 2021-03-04 | 040af779382e | 07:56:31
 2021-03-02 | 5b2f2af3d9d5 | 08:20:50
 2021-03-01 | f5a5773a9dc4 | 07:59:14
 ...
 2021-01-02 | 4d3f03f42227 | 08:14:41
 2021-01-01 | 32d6287d2eef | 07:31:56

It's not too surprising that 2021-10-06 is slower; yesterday I changed
things so that more valgrind runs are done in parallel (increasing
individual test times, but still allowing us to get results faster than
testing branches one at a time).

I don't see anything immediately suspicious for the slowdown around
eab81953682d. Perhaps there was a system update at that time causing
changes. Unfortunately I don't have logs from back then anymore. OTOH, I
don't see a clear slowdown in 13 or 12 around that time.

Greetings,

Andres Freund
Hi,

On 2021-10-06 09:47:54 -0700, Andres Freund wrote:
> I'll also try to figure out how to print a bit more detail about the
> timing of each tap test; looks like I need to figure out how to pass
> PROVE_FLAGS='--timer' through the buildfarm. Shouldn't be too hard.

Turns out that the buildfarm already adds --timer. I added -j4 to allow
for some concurrency in tap tests, but unfortunately my animals fell over
after that (thanks Michael for noticing).

Looks like the buildfarm client code isn't careful enough quoting
PROVE_FLAGS?

	my $pflags = "PROVE_FLAGS=--timer";
	if (exists $ENV{PROVE_FLAGS})
	{
		$pflags =
		  $ENV{PROVE_FLAGS}
		  ? "PROVE_FLAGS=$ENV{PROVE_FLAGS}"
		  : "";
	}

	@makeout =
	  run_log("cd $dir && $make NO_LOCALE=1 $pflags $instflags $taptarget");

Which doesn't work if PROVE_FLAGS ends up as '-j4 --timer' or such - the
shell splits the unquoted assignment at the space, so make sees '--timer'
as a separate, unknown option...

Greetings,

Andres Freund
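[One hedged way the quoting could be repaired - the real fix is whatever
the commit Andrew links below does; this just illustrates the failure
mode. The surrounding variables ($dir, $make, etc.) come from the client
code quoted above, and values containing single quotes would still break
this naive version.]

	# Sketch only: single-quote the value so the shell that run_log()
	# invokes hands make one "PROVE_FLAGS=-j4 --timer" assignment instead
	# of splitting at the space and feeding make a bogus '--timer' option.
	my $pflags = "PROVE_FLAGS='--timer'";
	if (exists $ENV{PROVE_FLAGS})
	{
		$pflags =
		  $ENV{PROVE_FLAGS}
		  ? "PROVE_FLAGS='$ENV{PROVE_FLAGS}'"
		  : "";
	}

	@makeout =
	  run_log("cd $dir && $make NO_LOCALE=1 $pflags $instflags $taptarget");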
On 10/6/21 10:03 PM, Andres Freund wrote:
> Hi,
>
> On 2021-10-06 09:47:54 -0700, Andres Freund wrote:
>> I'll also try to figure out how to print a bit more detail about the
>> timing of each tap test; looks like I need to figure out how to pass
>> PROVE_FLAGS='--timer' through the buildfarm. Shouldn't be too hard.
> Turns out that the buildfarm already adds --timer. I added -j4 to allow
> for some concurrency in tap tests, but unfortunately my animals fell
> over after that (thanks Michael for noticing).
>
> Looks like the buildfarm client code isn't careful enough quoting
> PROVE_FLAGS?
>
> 	my $pflags = "PROVE_FLAGS=--timer";
> 	if (exists $ENV{PROVE_FLAGS})
> 	{
> 		$pflags =
> 		  $ENV{PROVE_FLAGS}
> 		  ? "PROVE_FLAGS=$ENV{PROVE_FLAGS}"
> 		  : "";
> 	}
>
> 	@makeout =
> 	  run_log("cd $dir && $make NO_LOCALE=1 $pflags $instflags $taptarget");
>
> Which doesn't work if PROVE_FLAGS ends up as '-j4 --timer' or such...

see
<https://github.com/PGBuildFarm/client-code/commit/85ba5866c334f16c8683b524743f4d714be28d77>

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2021-10-08 15:41:09 -0400, Andrew Dunstan wrote:
> see
> <https://github.com/PGBuildFarm/client-code/commit/85ba5866c334f16c8683b524743f4d714be28d77>

Thanks!