Discussion: reducing isolation tests runtime

reducing isolation tests runtime

From
Alvaro Herrera
Date:
On the subject of test total time, we could parallelize isolation tests.
Right now "make check" in src/test/isolation takes 1:16 on my machine.
Test "timeouts" takes full 40s of that, with nothing running in parallel
-- the machine is completely idle.

Seems like we can have a lot of time back just by changing the schedule
to use multiple tests per line (in particular, put the other slow tests
together with timeouts), per the attached; with this new schedule,
isolation takes 44 seconds on my machine -- a win of 32 seconds.  We can
win a couple of additional seconds by grouping a few other lines, but
this is the biggest win.

(This needs to be adjusted because some table names in the specs
conflict.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: reducing isolation tests runtime

From
Robert Haas
Date:
On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On the subject of test total time, we could parallelize isolation tests.
> Right now "make check" in src/test/isolation takes 1:16 on my machine.
> Test "timeouts" takes full 40s of that, with nothing running in parallel
> -- the machine is completely idle.
>
> Seems like we can have a lot of time back just by changing the schedule
> to use multiple tests per line (in particular, put the other slow tests
> together with timeouts), per the attached; with this new schedule,
> isolation takes 44 seconds on my machine -- a win of 32 seconds.  We can
> win a couple of additional seconds by grouping a few other lines, but
> this is the biggest win.
>
> (This needs to be adjusted because some table names in the specs
> conflict.)

Oh, cool.  Yes, the time the isolation tests take to run is quite
annoying.  I didn't realize it would be so easy to run it in parallel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> On the subject of test total time, we could parallelize isolation tests.

> Oh, cool.  Yes, the time the isolation tests take to run is quite
> annoying.  I didn't realize it would be so easy to run it in parallel.

+1 to both --- I hadn't realized we had enough infrastructure to do this
in parallel, either.

            regards, tom lane


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On the subject of test total time, we could parallelize isolation tests.
> Right now "make check" in src/test/isolation takes 1:16 on my machine.
> Test "timeouts" takes full 40s of that, with nothing running in parallel
> -- the machine is completely idle.

BTW, one small issue there is that the reason the timeouts test is so
slow is that we have to use multi-second timeouts to be sure slower
buildfarm critters (eg valgrind animals) will get the expected results.
So I'm worried that if the machine isn't otherwise idle, we will get
random failures.

We could parallelize the rest of those tests and leave timeouts in its own
group.  That cuts the payback a lot :-( but might still be worth doing.
Or maybe tweak things so that the buildfarm runs a serial schedule but
manual testing doesn't.  Or we could debate how important the timeout
tests really are ... or think harder about how to make them reproducible.

            regards, tom lane


Re: reducing isolation tests runtime

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On the subject of test total time, we could parallelize isolation tests.
> > Right now "make check" in src/test/isolation takes 1:16 on my machine.
> > Test "timeouts" takes full 40s of that, with nothing running in parallel
> > -- the machine is completely idle.
> 
> BTW, one small issue there is that the reason the timeouts test is so
> slow is that we have to use multi-second timeouts to be sure slower
> buildfarm critters (eg valgrind animals) will get the expected results.
> So I'm worried that if the machine isn't otherwise idle, we will get
> random failures.

I think we could solve this by putting in the same parallel group only
slow tests that mostly sleep, i.e. nothing that would monopolize CPU for
long enough to cause a problem.  Concretely:

test: timeouts tuplelock-update deadlock-hard deadlock-soft-2

All of these tests have lots of sleeps and don't go through a lot of
data.  (Compared to the previous patch, I removed alter-table-1, which
uses a thousand tuples, and multiple-row-versions, which uses 100k; also
removed receipt-report which uses a large number of permutations.)

Timings:
    timeouts            40.3s
    tuplelock-update    10.5s
    deadlock-hard       10.9s
    deadlock-soft-2      5.4s

alter-table-1 takes 1.5s, receipt-report 1.2s and there's nothing else
that takes above 1s, so I think this is good enough -- we can still have
the whole thing run in ~45 seconds without the hazard you describe.
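The arithmetic behind this: tests on one `test:` line start together and the next line waits for the slowest of them, so a schedule's wall-clock time is roughly the sum, over lines, of each line's slowest test. A minimal Python sketch using the timings quoted above; the model is an assumption that ignores server startup, per-test overhead and any CPU contention between concurrent tests.

```python
def schedule_time(groups, timings):
    """Estimated wall-clock seconds: each parallel group waits for its slowest test."""
    return sum(max(timings[name] for name in group) for group in groups)

# Per-test timings quoted above (seconds, measured on one laptop).
timings = {
    "timeouts": 40.3,
    "tuplelock-update": 10.5,
    "deadlock-hard": 10.9,
    "deadlock-soft-2": 5.4,
}

serial = [[name] for name in timings]  # one test per "test:" line
grouped = [list(timings)]              # all four on a single "test:" line

print(round(schedule_time(serial, timings), 1))   # 67.1
print(round(schedule_time(grouped, timings), 1))  # 40.3
```

By this estimate, putting the four sleep-heavy tests on one line saves about 27 seconds.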


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>>> On the subject of test total time, we could parallelize isolation tests.

>> BTW, one small issue there is that the reason the timeouts test is so
>> slow is that we have to use multi-second timeouts to be sure slower
>> buildfarm critters (eg valgrind animals) will get the expected results.
>> So I'm worried that if the machine isn't otherwise idle, we will get
>> random failures.

> I think we could solve this by putting in the same parallel group only
> slow tests that mostly sleep, i.e. nothing that would monopolize CPU for
> long enough to cause a problem.  Concretely:
> test: timeouts tuplelock-update deadlock-hard deadlock-soft-2

OK, but there'd better be a comment there explaining the concern
very precisely, or somebody will break it.

            regards, tom lane


Re: reducing isolation tests runtime

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> > I think we could solve this by putting in the same parallel group only
> > slow tests that mostly sleep, i.e. nothing that would monopolize CPU for
> > long enough to cause a problem.  Concretely:
> > test: timeouts tuplelock-update deadlock-hard deadlock-soft-2
> 
> OK, but there'd better be a comment there explaining the concern
> very precisely, or somebody will break it.

Here's a concrete proposal.  Runtime is 45.7 seconds on my laptop.  It
can be further reduced, but not by more than a second or two unless you
get in the business of modifying other tests.  (I only modified
deadlock-soft-2 because it saves 5 seconds).

Admittedly the new isolation_schedule file is a bit ugly.

Attachments

Re: reducing isolation tests runtime

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Here's a concrete proposal.  Runtime is 45.7 seconds on my laptop.  It
> can be further reduced, but not by more than a second or two unless you
> get in the business of modifying other tests.  (I only modified
> deadlock-soft-2 because it saves 5 seconds).

Looks reasonable to me, but do we want to set any particular convention
about the max number of tests to run in parallel?  If so there should
be at least a comment saying what.

> Admittedly the new isolation_schedule file is a bit ugly.

Meh, seems fine.

We won't know if this really works till it hits the buildfarm,
I suppose.

            regards, tom lane


Re: reducing isolation tests runtime

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Here's a concrete proposal.  Runtime is 45.7 seconds on my laptop.  It
> > can be further reduced, but not by more than a second or two unless you
> > get in the business of modifying other tests.  (I only modified
> > deadlock-soft-2 because it saves 5 seconds).
> 
> Looks reasonable to me, but do we want to set any particular convention
> about the max number of tests to run in parallel?  If so there should
> be at least a comment saying what.

Hmm, I ran this with a limited number of connections and found that it
fails with fewer than 27; and there's no MAX_CONNECTIONS like there is
for pg_regress.  So I'll put this back on the drawing board until I'm
back from vacation ...


Re: reducing isolation tests runtime

From
Andres Freund
Date:
Hi,

On 2018-01-25 18:27:28 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > > Here's a concrete proposal.  Runtime is 45.7 seconds on my laptop.  It
> > > can be further reduced, but not by more than a second or two unless you
> > > get in the business of modifying other tests.  (I only modified
> > > deadlock-soft-2 because it saves 5 seconds).
> > 
> > Looks reasonable to me, but do we want to set any particular convention
> > about the max number of tests to run in parallel?  If so there should
> > be at least a comment saying what.
> 
> Hmm, I ran this with a limited number of connections and found that it
> fails with fewer than 27; and there's no MAX_CONNECTIONS like there is
> for pg_regress.  So I'll put this back on the drawing board until I'm
> back from vacation ...

I'd like to see this revived; I'm getting a bit tired of waiting longer and
longer for isolationtester to complete.  Is it really a problem that we
require a certain number of connections? Something on the order of 30-50
connections ought not to be a real problem for realistic machines, and
if it's a problem for one, they can use a serialized schedule?

Greetings,

Andres Freund


Re: reducing isolation tests runtime

From
Alvaro Herrera
Date:
On 2018-Dec-04, Andres Freund wrote:

> Hi,
> 
> I'd like to see this revived; I'm getting a bit tired of waiting longer and
> longer for isolationtester to complete.  Is it really a problem that we
> require a certain number of connections? Something on the order of 30-50
> connections ought not to be a real problem for realistic machines, and
> if it's a problem for one, they can use a serialized schedule?

Hello

Yeah, me too.  Let me see about it.


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I'd like to see this revived; I'm getting a bit tired of waiting longer and
> longer for isolationtester to complete.  Is it really a problem that we
> require a certain number of connections? Something on the order of 30-50
> connections ought not to be a real problem for realistic machines, and
> if it's a problem for one, they can use a serialized schedule?

The longstanding convention in the main regression tests is 20 max.
Is there a reason to be different here?

            regards, tom lane


Re: reducing isolation tests runtime

From
Andres Freund
Date:
Hi,

On 2018-12-04 15:45:39 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I'd like to see this revived; I'm getting a bit tired of waiting longer and
> > longer for isolationtester to complete.  Is it really a problem that we
> > require a certain number of connections? Something on the order of 30-50
> > connections ought not to be a real problem for realistic machines, and
> > if it's a problem for one, they can use a serialized schedule?
> 
> The longstanding convention in the main regression tests is 20 max.
> Is there a reason to be different here?

It's a bit less obvious from the outside how many connections a test
spawns; IOW, it might be easier to maintain the schedule if the cap
isn't as tight.  And I'm doubtful that there's a good reason for the 20
limit these days, so going a bit higher seems reasonable.

Greetings,

Andres Freund


Re: reducing isolation tests runtime

From
Andres Freund
Date:
Hi,

On 2018-01-25 17:34:15 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> 
> > > I think we could solve this by putting in the same parallel group only
> > > slow tests that mostly sleep, i.e. nothing that would monopolize CPU for
> > > long enough to cause a problem.  Concretely:
> > > test: timeouts tuplelock-update deadlock-hard deadlock-soft-2
> > 
> > OK, but there'd better be a comment there explaining the concern
> > very precisely, or somebody will break it.
> 
> Here's a concrete proposal.  Runtime is 45.7 seconds on my laptop.  It
> can be further reduced, but not by more than a second or two unless you
> get in the business of modifying other tests.  (I only modified
> deadlock-soft-2 because it saves 5 seconds).

I'm working on an updated version of this. Adding the new tests is a bit
painful because conflicting names make it harder than necessary to
schedule tests. While it's possible to work out a schedule that doesn't
conflict, it's pretty annoying to do and, more importantly, seems fragile:
it's very easy to create schedules that succeed on one machine and
not on another, depending on which tests happen to be slow there.

I'm more inclined to be a bit more aggressive in renaming tables -
there's not much point in having a lot of "foo"s around.  So I'm
inclined to rename some of the names that are more likely to
conflict. If we agree on doing that, I'd like to do that first, and
commit that more aggressively than the schedule itself.
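A name clash only matters when two specs that use the same table end up on the same `test:` line. A small check along these lines could flag that before the buildfarm does. It is a sketch that assumes the set of table names each spec uses is already known (extracting them from the spec files is not shown); the spec and table names below are hypothetical.

```python
from itertools import combinations

def find_conflicts(groups, tables_by_spec):
    """List (group index, spec, spec, shared tables) for specs that would clash."""
    conflicts = []
    for index, group in enumerate(groups):
        for a, b in combinations(group, 2):
            shared = tables_by_spec.get(a, set()) & tables_by_spec.get(b, set())
            if shared:
                conflicts.append((index, a, b, sorted(shared)))
    return conflicts

# Hypothetical spec names and table sets, for illustration only.
tables_by_spec = {
    "spec-a": {"foo", "bar"},
    "spec-b": {"foo"},
    "spec-c": {"baz"},
}
print(find_conflicts([["spec-a", "spec-b", "spec-c"]], tables_by_spec))
# [(0, 'spec-a', 'spec-b', ['foo'])]
print(find_conflicts([["spec-a", "spec-c"], ["spec-b"]], tables_by_spec))
# []
```

Renaming the common "foo"-style tables shrinks the sets that can intersect, which is why it makes any schedule less fragile.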

An alternative approach would be to have isolationtester automatically
create a schema with the specfile's name, and place it in the search
path. But that'd make it impossible to use isolationtester against a
standby - which I think we currently don't do, but which probably would
be a good idea.


With regard to the schedule, I'm inclined to order it so that faster
test groups are earlier on, just to make it more likely to reach the
tests one is debugging faster.  Does that sound sane?
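Ordering groups fast-first amounts to sorting them by their slowest member. A Python sketch, assuming per-test timings are available from a previous run (the schedule file itself records none); under a simple "each line waits for its slowest test" model the total runtime is unchanged, only how early the quick groups report.

```python
def order_fast_first(groups, timings):
    """Sort schedule groups by their slowest member, fastest groups first."""
    return sorted(groups, key=lambda group: max(timings[name] for name in group))

# Timings quoted earlier in the thread (seconds).
timings = {
    "timeouts": 40.3,
    "deadlock-soft-2": 5.4,
    "alter-table-1": 1.5,
    "receipt-report": 1.2,
}
groups = [["timeouts", "deadlock-soft-2"], ["alter-table-1"], ["receipt-report"]]
print(order_fast_first(groups, timings))
# [['receipt-report'], ['alter-table-1'], ['timeouts', 'deadlock-soft-2']]
```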

Do we want to maintain a serial version of the schedule too? I'm
wondering if we should just generate both the isolationtester and plain
regression test schedule by either adding an option to pg_regress that
serializes test groups, or by generating the serial schedule file in a
few lines of perl.
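The "few lines of perl" option is a plain text transformation. Here it is in Python as a sketch, assuming a schedule file consists of `test:` lines with whitespace-separated test names plus `#` comments and blank lines; anything else passes through untouched.

```python
def serialize_schedule(text):
    """Rewrite each multi-test "test:" line as one "test:" line per test."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("test:"):
            for name in stripped[len("test:"):].split():
                out.append(f"test: {name}")
        else:
            out.append(line)  # comments and blank lines pass through unchanged
    return "\n".join(out)

parallel = (
    "# slow tests that mostly sleep\n"
    "test: timeouts tuplelock-update deadlock-hard deadlock-soft-2\n"
)
print(serialize_schedule(parallel))
```

Generating the serial variant at build time would mean only the parallel schedule needs hand maintenance.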

Greetings,

Andres Freund


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I'm working on an updated version of this. Adding the new tests is a bit
> painful because conflicting names make it harder than necessary to
> schedule tests. While it's possible to work out a schedule that doesn't
> conflict, it's pretty annoying to do and, more importantly, seems fragile:
> it's very easy to create schedules that succeed on one machine and
> not on another, depending on which tests happen to be slow there.

> I'm more inclined to be a bit more aggressive in renaming tables -
> there's not much point in having a lot of "foo"s around.  So I'm
> inclined to rename some of the names that are more likely to
> conflict. If we agree on doing that, I'd like to do that first, and
> commit that more aggressively than the schedule itself.

+1

> Do we want to maintain a serial version of the schedule too?

Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
the load on their hosts.  As long as the isolation tests honor that,
I don't see a real need for a separate serial schedule.

(We've talked about retiring the serial sched for the main regression
tests, and while that trigger's not been pulled yet, I think it's
just a matter of time.  So making the isolation tests follow that
precedent seems wrong anyway.)

            regards, tom lane


Re: reducing isolation tests runtime

From
Alvaro Herrera
Date:
On 2019-Feb-13, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
> > I'm working on an updated version of this. Adding the new tests is a bit
> > painful because conflicting names make it harder than necessary to
> > schedule tests. While it's possible to work out a schedule that doesn't
> > conflict, it's pretty annoying to do and, more importantly, seems fragile:
> > it's very easy to create schedules that succeed on one machine and
> > not on another, depending on which tests happen to be slow there.
> 
> > I'm more inclined to be a bit more aggressive in renaming tables -
> > there's not much point in having a lot of "foo"s around.  So I'm
> > inclined to rename some of the names that are more likely to
> > conflict. If we agree on doing that, I'd like to do that first, and
> > commit that more aggressively than the schedule itself.
> 
> +1

+1

(Using separate schemas sounds like a useful idea if we accumulate dozens of
tests, so I suggest that we do that for future tests, but for the time
being I wouldn't bother.)

> > Do we want to maintain a serial version of the schedule too?
> 
> Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
> the load on their hosts.  As long as the isolation tests honor that,
> I don't see a real need for a separate serial schedule.

MAX_CONNECTIONS was the only reason I didn't push this through.  Do you
(Andres) have any solution to that?


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Feb-13, Tom Lane wrote:
>> Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
>> the load on their hosts.  As long as the isolation tests honor that,
>> I don't see a real need for a separate serial schedule.

> MAX_CONNECTIONS was the only reason I didn't push this through.  Do you
> (Andres) have any solution to that?

Doesn't the common pg_regress.c infrastructure handle that?
We might need to improve isolation_main.c and/or the isolation
Makefile to make it accessible.

I suppose that in what I'm thinking about, MAX_CONNECTIONS would be
interpreted as "max number of concurrent isolation scripts", which
is not exactly number of connections.  A quick and dirty answer
would be to have isolation_main.c divide the limit by a factor of 4
or so.
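The "quick and dirty" rule can be written down in a few lines. This is a Python sketch of the arithmetic only; the function name and the default factor of 4 sessions per spec are hypothetical, and a non-positive limit is treated as "unlimited" on the assumption that this mirrors how MAX_CONNECTIONS is normally left unset.

```python
def isolation_script_limit(max_connections, sessions_per_script=4):
    """Rough cap on concurrently running isolation specs.

    max_connections <= 0 is treated as "no limit" (returned as 0); otherwise
    divide by an assumed average number of sessions per spec, never below 1.
    """
    if max_connections <= 0:
        return 0
    return max(1, max_connections // sessions_per_script)

print(isolation_script_limit(20))  # 5
print(isolation_script_limit(3))   # 1
print(isolation_script_limit(0))   # 0 (unlimited)
```

A fixed factor undercounts specs that open many sessions and wastes capacity on two-session ones; pre-parsing the spec files would give an exact per-spec count.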

            regards, tom lane


Re: reducing isolation tests runtime

From
Andres Freund
Date:
On 2019-02-13 10:58:50 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Feb-13, Tom Lane wrote:
> >> Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
> >> the load on their hosts.  As long as the isolation tests honor that,
> >> I don't see a real need for a separate serial schedule.
> 
> > MAX_CONNECTIONS was the only reason I didn't push this through.  Do you
> > (Andres) have any solution to that?
> 
> Doesn't the common pg_regress.c infrastructure handle that?
> We might need to improve isolation_main.c and/or the isolation
> Makefile to make it accessible.

> I suppose that in what I'm thinking about, MAX_CONNECTIONS would be
> interpreted as "max number of concurrent isolation scripts", which
> is not exactly number of connections.  A quick and dirty answer
> would be to have isolation_main.c divide the limit by a factor of 4
> or so.

I guess that could work, although it's certainly not too pretty.
Alternatively we could pre-parse the spec files, but that's a bit
annoying given that isolationtester.c is a separate C file...

Do you have an idea why we have both max_concurrent_tests *and*
max_connections in pg_regress? ISTM the former isn't really useful given
the latter?

Greetings,

Andres Freund


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Do you have an idea why we have both max_concurrent_tests *and*
> max_connections in pg_regress? ISTM the former isn't really useful given
> the latter?

No, the former is a static restriction on what the schedule file is
allowed to contain, the latter is a dynamic restriction (that typically
is unlimited anyway).
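The distinction can be modelled as two separate checks: one applied when the schedule file is read, one applied when a group is launched. This is a simplified Python sketch with hypothetical names, not pg_regress's C code; running a capped group as fixed-size batches is one simple policy assumed here for illustration.

```python
MAX_CONCURRENT_TESTS = 20  # static cap; value taken from the "20 max" convention

def check_group(group, limit=MAX_CONCURRENT_TESTS):
    """Static check at schedule-parse time: reject overlong 'test:' lines."""
    if len(group) > limit:
        raise ValueError(f"too many parallel tests ({len(group)}) in one schedule line")
    return group

def batches(group, max_connections=0):
    """Dynamic limit at run time: start at most max_connections tests at once.

    0 means unlimited, so the whole group starts together.
    """
    if max_connections <= 0:
        return [list(group)]
    return [list(group[i:i + max_connections])
            for i in range(0, len(group), max_connections)]

group = check_group(["a", "b", "c", "d", "e"])
print(batches(group))     # [['a', 'b', 'c', 'd', 'e']]
print(batches(group, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```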

            regards, tom lane


Re: reducing isolation tests runtime

From
Andres Freund
Date:
Hi,

On 2019-02-13 12:41:41 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Do you have an idea why we have both max_concurrent_tests *and*
> > max_connections in pg_regress? ISTM the former isn't really useful given
> > the latter?
> 
> No, the former is a static restriction on what the schedule file is
> allowed to contain, the latter is a dynamic restriction (that typically
> is unlimited anyway).

Right, but why don't we allow for more tests in a group, and then use a
default max_connections to limit concurrency? Having larger groups is
advantageous wrt test runtime: it reduces the number of artificial
serialization points where the slowest test slows things down.  Obviously
there are still a few groups that are needed for test interdependency
management, but that's comparatively rare. We have plenty of groups
that are just broken up to stay below max_concurrent_tests.
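The effect of artificial serialization points can be illustrated with a toy model: split groups wait at each barrier for their slowest test, whereas one large group throttled by a connection cap starts the next test as soon as a slot frees up. The test names, timings and slot model are hypothetical, and the model says nothing about the harder-to-reproduce conflicts that more concurrency can cause.

```python
import heapq

def barrier_time(groups, timings):
    """Split groups: each group waits for its slowest test before the next starts."""
    return sum(max(timings[t] for t in group) for group in groups)

def pooled_time(tests, timings, slots):
    """One big group: start each test (in list order) when any of `slots` frees up."""
    free_at = [0.0] * slots  # min-heap of times at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for t in tests:
        start = heapq.heappop(free_at)
        end = start + timings[t]
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

timings = {"a": 10, "b": 1, "c": 1, "d": 10}  # hypothetical seconds
print(barrier_time([["a", "b"], ["c", "d"]], timings))  # 20
print(pooled_time(["a", "b", "c", "d"], timings, 2))    # 12.0
```

With the same concurrency of two, the split schedule pays for two slow tests back to back, while the pooled one overlaps them.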

Greetings,

Andres Freund


Re: reducing isolation tests runtime

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Right, but why don't we allow for more tests in a group, and then use a
> default max_connections to limit concurrency? Having larger groups is
> advantageous wrt test runtime: it reduces the number of artificial
> serialization points where the slowest test slows things down.  Obviously
> there are still a few groups that are needed for test interdependency
> management, but that's comparatively rare. We have plenty of groups
> that are just broken up to stay below max_concurrent_tests.

Meh.  That would also greatly increase the scope for hard-to-reproduce
conflicts between concurrent tests.  I'm not especially excited about
going there.

            regards, tom lane