Re: a raft of parallelism-related bug fixes

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: a raft of parallelism-related bug fixes
Дата
Msg-id CA+TgmoaBtxWG_dq2LosKowU4XxALnGcTGEMh8i+Ws91YVkkc+w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: a raft of parallelism-related bug fixes  (Andres Freund <andres@anarazel.de>)
Ответы Re: a raft of parallelism-related bug fixes  ("Joshua D. Drake" <jd@commandprompt.com>)
Re: a raft of parallelism-related bug fixes  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: a raft of parallelism-related bug fixes  (Peter Geoghegan <pg@heroku.com>)
Re: a raft of parallelism-related bug fixes  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: a raft of parallelism-related bug fixes  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
On Mon, Feb 8, 2016 at 10:17 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-02 15:41:45 -0500, Robert Haas wrote:
>> group-locking-v1.patch is a vastly improved version of the group
>> locking patch that we discussed, uh, extensively last year.  I realize
>> that there was a lot of doubt about this approach, but I still believe
>> it's the right approach, I have put a lot of work into making it work
>> correctly, I don't think anyone has come up with a really plausible
>> alternative approach (except one other approach I tried which turned
>> out to work but with significantly more restrictions), and I'm
>> committed to fixing it in whatever way is necessary if it turns out to
>> be broken, even if that amounts to a full rewrite.  Review is welcome,
>> but I honestly believe it's a good idea to get this into the tree
>> sooner rather than later at this point, because automated regression
>> testing falls to pieces without these changes, and I believe that
>> automated regression testing is a really good idea to shake out
>> whatever bugs we may have in the parallel query stuff.  The code in
>> this patch is all mine, but Amit Kapila deserves credit as co-author
>> for doing a lot of prototyping (that ended up getting tossed) and
>> testing.  This patch includes comments and an addition to
>> src/backend/storage/lmgr/README which explain in more detail what this
>> patch does, how it does it, and why that's OK.
>
> I see you pushed group locking support. I do wonder if somebody has
> actually reviewed this? On a quick scrollthrough it seems fairly
> invasive, touching some parts where bugs are really hard to find.
>
> I realize that this stuff has all been brewing long, and that there's
> still a lot to do. So you gotta keep moving. And I'm not sure that
> there's anything wrong or if there's any actually better approach. But
> pushing an unreviewed, complex patch, that originated in a thread
> orginally about different relatively small/mundane items, for a
> contentious issue, a few days after the initial post. Hm. Not sure how
> you'd react if you weren't the author.

Probably not very well.  Do you want me to revert it?

I mean, look.  Without that patch, parallel query is definitely
broken.  Just revert the patch and try running the regression tests
with force_parallel_mode=regress and max_parallel_degree>0.  It hangs
all over the place.  With the patch, every regression test suite we
have runs cleanly with those settings.  Without the patch, it's
trivial to construct a test case where parallel query experiences an
undetected deadlock.  With the patch, it appears to work reliably.
Could there bugs someplace?  Yes, there absolutely could.  Do I really
think anybody was going to spend the time to understand deadlock.c
well enough to verify my changes?  No, I don't.  What I think would
have happened is that the patch would have sat around like an
albatross around my neck - totally ignored by everyone - until the end
of the last CF, and then the discussion would have gone one of three
ways:

1. Boy, this patch is complicated and I don't understand it.  Let's
reject it, even though without it parallel query is trivially broken!
Uh, we'll just let parallel query be broken.
2. Like #1, but we rip out parallel query in its entirety on the eve of beta.
3. Oh well, Robert says we need this, I guess we'd better let him commit it.

I don't find any of those options to be better than the status quo.
If the patch is broken, another two months of having in the tree give
us a better chance of finding the bugs, especially because, combined
with the other patch which I also pushed, it enables *actual automated
regression testing* of the parallelism code, which I personally think
is a really good thing - and I'd like to see the buildfarm doing that
as soon as possible, so that we can find some of those bugs before
we're deep in beta.  Not just bugs in group locking but all sorts of
parallelism bugs that might be revealed by end-to-end testing.  The
*entire stack of patches* that began this thread was a response to
problems that were found by the automated testing that you can't do
without this patch.  Those bug fixes resulted in a huge increase in
the robustness of parallel query, and that would not have happened
without this code.  Every single one of those problems, some of them
in commits dating back years, was detected by the same method: run the
regression tests with parallel mode and parallel workers used for
every query for which that seems to be safe.

And, by the way, the patch, aside from the deadlock.c portion, was
posted back in October, admittedly without much fanfare, but nobody
reviewed that or any other patch on this thread.  If I'd waited for
those reviews to come in, parallel query would not be committed now,
nor probably in 9.6, nor probably in 9.7 or 9.8 either.  The whole
project would just be impossible on its face.  It would be impossible
in the first instance if I did not have a commit bit, because there is
just not enough committer bandwidth - even reviewer bandwidth more
generally - to review the number of patches that I've submitted
related to parallelism, so in the end some, perhaps many, of those are
going to be committed mostly on the strength of my personal opinion
that committing them is better than not committing them.  I am going
to have a heck of a lot of egg on my face if it turns out that I've
been too aggressive in pushing this stuff into the tree.  But,
basically, the alternative is that we don't get the feature, and I
think the feature is important enough to justify taking some risk.

I think it's myopic to say "well, but this patch might have bugs".
Very true.  But also, all the other parallelism patches that are
already committed or that are still under review but which can't be
properly tested without this patch might have bugs, too, so you've got
to weigh the risk that this patch might get better if I wait longer to
commit it against the possibility that not having committed it reduces
the chances of finding bugs elsewhere.  I don't want it to seem like
I'm forcing this down the community's throat - I don't have a right to
do that, and I will certainly revert this patch if that is the
consensus.  But that is not what I think best myself.  What I think
would be better is to (1) make an effort to get the buildfarm testing
which this patch enables up and running as soon as possible and (2)
for somebody to read over the committed code and raise any issues that
they find.  Or, for that matter, to read over the committed code for
any of the *other* parallelism patches and raise any issues that they
find with *that* code.  There's certainly scads of code here and this
is far from the only bit that might have bugs.

Oh: another thing that I would like to do is commit the isolation
tests I wrote for the deadlock detector a while back, which nobody has
reviewed either, though Tom and Alvaro seemed reasonably positive
about the concept.  Right now, the deadlock.c part of this patch isn't
tested at all by any of our regression test suites, because NOTHING in
deadlock.c is tested by any of our regression test suites.  You can
blow it up with dynamite and the regression tests are perfectly happy,
and that's pretty scary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Daniel Verite"
Дата:
Сообщение: Re: [patch] Proposal for \crosstabview in psql
Следующее
От: Fabien COELHO
Дата:
Сообщение: Re: checkpointer continuous flushing - V16