Thread: Multi-xacts and our process problem

Multi-xacts and our process problem

From
Bruce Momjian
Date:
Multi-xacts were made durable in Postgres 9.3 (released 2013-09-09) to
allow primary-key-column-only locks.  1.7 years later, we are still
dealing with bugs related to this feature.  Obviously, something is
wrong.

There were many 9.3 minor releases containing multi-xacts fixes, and
these fixes have extended into 9.4.  After the first few bug-fix
releases, I questioned whether we needed to revert or rework the
feature, but got no positive response.  Only in the past few weeks have
we got additional people involved.

I think we now know that our inaction didn't serve us well.  The
question is how can we identify chronic problems and get resources
involved sooner.  I feel we have been "asleep at the wheel" to some
extent on this.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Multi-xacts and our process problem

From
"Joshua D. Drake"
Date:
On 05/11/2015 02:00 PM, Bruce Momjian wrote:

> I think we now know that our inaction didn't serve us well.  The
> question is how can we identify chronic problems and get resources
> involved sooner.  I feel we have been "asleep at the wheel" to some
> extent on this.

Here are some options

Slow down the release cycle
    The shortness of the release cycle puts a preference on adding features
versus providing a more mature outcome.

or

Increase the release cycle
    Moving to a Ubuntu style release cycle would allow us to have a window
to scratch the itch but with the eventual (and known) release of
something that is LTS.

JD    



-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Multi-xacts and our process problem

From
Bruce Momjian
Date:
On Mon, May 11, 2015 at 02:11:48PM -0700, Joshua Drake wrote:
> 
> On 05/11/2015 02:00 PM, Bruce Momjian wrote:
> 
> >I think we now know that our inaction didn't serve us well.  The
> >question is how can we identify chronic problems and get resources
> >involved sooner.  I feel we have been "asleep at the wheel" to some
> >extent on this.
> 
> Here are some options
> 
> Slow down the release cycle
>     The shortness of the release cycle puts a preference on adding
> features versus providing a more mature outcome.
> 
> or
> 
> Increase the release cycle
>     Moving to a Ubuntu style release cycle would allow us to have a
> window to scratch the itch but with the eventual (and known) release
> of something that is LTS.

The releases themselves are not the problem, but rather the volume of
bugs and our slowness in getting additional people involved to remove
data corruption bugs more quickly and systematically.  Our reputation
for reliability has been harmed by this inactivity.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Multi-xacts and our process problem

From
Heikki Linnakangas
Date:
On 05/12/2015 12:00 AM, Bruce Momjian wrote:
> Multi-xacts were made durable in Postgres 9.3 (released 2013-09-09) to
> allow primary-key-column-only locks.  1.7 years later, we are still
> dealing with bugs related to this feature.  Obviously, something is
> wrong.
>
> There were many 9.3 minor releases containing multi-xacts fixes, and
> these fixes have extended into 9.4.  After the first few bug-fix
> releases, I questioned whether we needed to revert or rework the
> feature, but got no positive response.  Only in the past few weeks have
> we got additional people involved.

The "revert or rework" ship had already sailed at that point. I don't 
think we had much choice than just soldier through the bugs after the 
release.

> I think we now know that our inaction didn't serve us well.  The
> question is how can we identify chronic problems and get resources
> involved sooner.  I feel we have been "asleep at the wheel" to some
> extent on this.

Yeah. I think the problem was that no-one realized that this was a 
significant change to the on-disk format. It was deceptively 
backwards-compatible. When it comes to permanent on-disk structures, we 
should all be more vigilant in the review.

- Heikki



Re: Multi-xacts and our process problem

From
Bruce Momjian
Date:
On Tue, May 12, 2015 at 12:29:56AM +0300, Heikki Linnakangas wrote:
> On 05/12/2015 12:00 AM, Bruce Momjian wrote:
> >Multi-xacts were made durable in Postgres 9.3 (released 2013-09-09) to
> >allow primary-key-column-only locks.  1.7 years later, we are still
> >dealing with bugs related to this feature.  Obviously, something is
> >wrong.
> >
> >There were many 9.3 minor releases containing multi-xacts fixes, and
> >these fixes have extended into 9.4.  After the first few bug-fix
> >releases, I questioned whether we needed to revert or rework the
> >feature, but got no positive response.  Only in the past few weeks have
> >we got additional people involved.
> 
> The "revert or rework" ship had already sailed at that point. I

True.

> don't think we had much choice than just soldier through the bugs
> after the release.

The problem is we "soldiered on" without adding any resources to the
problem or doing a systematic review once it became clear one was
necessary.

> >I think we now know that our inaction didn't serve us well.  The
> >question is how can we identify chronic problems and get resources
> >involved sooner.  I feel we have been "asleep at the wheel" to some
> >extent on this.
> 
> Yeah. I think the problem was that no-one realized that this was a
> significant change to the on-disk format. It was deceptively
> backwards-compatible. When it comes to permanent on-disk structures,
> we should all be more vigilant in the review.

Yes, and the size/age of the patch helped mask problems too.  Are these
the lessons we need to learn?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Multi-xacts and our process problem

From
"Joshua D. Drake"
Date:
On 05/11/2015 02:15 PM, Bruce Momjian wrote:
>
> On Mon, May 11, 2015 at 02:11:48PM -0700, Joshua Drake wrote:
>>

>> Here are some options
>>
>> Slow down the release cycle
>>     The shortness of the release cycle puts a preference on adding
>> features versus providing a more mature outcome.
>>
>> or
>>
>> Increase the release cycle
>>     Moving to a Ubuntu style release cycle would allow us to have a
>> window to scratch the itch but with the eventual (and known) release
>> of something that is LTS.
>
> The releases themselves are not the problem, but rather the volume of
> bugs and our slowness in getting additional people involved to remove
> data corruption bugs more quickly and systematically.  Our reputation
> for reliability has been harmed by this inactivity.
>

What I am arguing is that the release cycle is at least a big part of 
the problem. We are trying to get so many new features that bugs are 
increasing and quality is decreasing.

If we change the release cycle it will encourage an increase in eyeballs 
on code we are developing because people aren't going to be in such a 
rush to "get this feature done for this release".

Sincerely,

Joshua D. Drake


-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Multi-xacts and our process problem

From
Bruce Momjian
Date:
On Mon, May 11, 2015 at 03:42:26PM -0700, Joshua Drake wrote:
> >The releases themselves are not the problem, but rather the volume of
> >bugs and our slowness in getting additional people involved to remove
> >data corruption bugs more quickly and systematically.  Our reputation
> >for reliability has been harmed by this inactivity.
> >
> 
> What I am arguing is that the release cycle is at least a big part
> of the problem. We are trying to get so many new features that bugs
> are increasing and quality is decreasing.

Now that is an interesting observation --- are we too focused on patches
and features to realize when we need to seriously revisit an issue?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Multi-xacts and our process problem

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Mon, May 11, 2015 at 03:42:26PM -0700, Joshua Drake wrote:
>> What I am arguing is that the release cycle is at least a big part
>> of the problem. We are trying to get so many new features that bugs
>> are increasing and quality is decreasing.

> Now that is an interesting observation --- are we too focused on patches
> and features to realize when we need to seriously revisit an issue?

I think there's nobody, or at least very few people, who are getting
paid to find/fix bugs rather than write cool new features.  This is
problematic.  It doesn't help when key committers are overwhelmed by
trying to process other peoples' patches.  (And no, I'm not sure that
"appoint more committers" would improve matters.  What we've got is
too many barely-good-enough patches.  Tweaking the process to let those
into the tree faster will not result in better quality.)
        regards, tom lane



Re: Multi-xacts and our process problem

From
Simon Riggs
Date:
On 11 May 2015 at 23:47, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, May 11, 2015 at 03:42:26PM -0700, Joshua Drake wrote:
> > >The releases themselves are not the problem, but rather the volume of
> > >bugs and our slowness in getting additional people involved to remove
> > >data corruption bugs more quickly and systematically.  Our reputation
> > >for reliability has been harmed by this inactivity.
> > >
> >
> > What I am arguing is that the release cycle is at least a big part
> > of the problem. We are trying to get so many new features that bugs
> > are increasing and quality is decreasing.
>
> Now that is an interesting observation --- are we too focused on patches
> and features to realize when we need to seriously revisit an issue?

I think we are unused to bugs. We have a much lower bug rate than any other system.

I think we seriously need to review our policy of adding major new features and have them enabled by default with no parameter to disable them. In the early years of PostgreSQL everything had an off switch, e.g. stats, bgwriter and even autovacuum defaulted to off for many years.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Multi-xacts and our process problem

From
Andres Freund
Date:
On 2015-05-11 19:04:32 -0400, Tom Lane wrote:
> I think there's nobody, or at least very few people, who are getting
> paid to find/fix bugs rather than write cool new features.  This is
> problematic.  It doesn't help when key committers are overwhelmed by
> trying to process other peoples' patches.  (And no, I'm not sure that
> "appoint more committers" would improve matters.  What we've got is
> too many barely-good-enough patches.  Tweaking the process to let those
> into the tree faster will not result in better quality.)

+many

Except perhaps that I'd expand "find/fix bugs" to include "review and
integrate patches". Because I think few people are paid to do that
either.  I now partially am (which obviously isn't sufficient). There's
no way it's possible to e.g. work on integrating something like upsert
in a reasonable timeframe otherwise.

The lack of paid time to integrate stuff properly also leads to part of
the quality problem, besides delaying stuff.

Andres



Re: Multi-xacts and our process problem

From
"Joshua D. Drake"
Date:
On 05/11/2015 04:04 PM, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> On Mon, May 11, 2015 at 03:42:26PM -0700, Joshua Drake wrote:
>>> What I am arguing is that the release cycle is at least a big part
>>> of the problem. We are trying to get so many new features that bugs
>>> are increasing and quality is decreasing.
>
>> Now that is an interesting observation --- are we too focused on patches
>> and features to realize when we need to seriously revisit an issue?
>
> I think there's nobody, or at least very few people, who are getting
> paid to find/fix bugs rather than write cool new features.  This is
> problematic.  It doesn't help when key committers are overwhelmed by
> trying to process other peoples' patches.  (And no, I'm not sure that
> "appoint more committers" would improve matters.  What we've got is
> too many barely-good-enough patches.  Tweaking the process to let those
> into the tree faster will not result in better quality.)

Exactly correct.

JD


-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Multi-xacts and our process problem

From
"Joshua D. Drake"
Date:
On 05/11/2015 04:18 PM, Simon Riggs wrote:
> On 11 May 2015 at 23:47, Bruce Momjian <bruce@momjian.us
> <mailto:bruce@momjian.us>> wrote:
>
>     On Mon, May 11, 2015 at 03:42:26PM -0700, Joshua Drake wrote:
>     > >The releases themselves are not the problem, but rather the volume of
>     > >bugs and our slowness in getting additional people involved to remove
>     > >data corruption bugs more quickly and systematically.  Our reputation
>     > >for reliability has been harmed by this inactivity.
>     > >
>     >
>     > What I am arguing is that the release cycle is at least a big part
>     > of the problem. We are trying to get so many new features that bugs
>     > are increasing and quality is decreasing.
>
>     Now that is an interesting observation --- are we too focused on patches
>     and features to realize when we need to seriously revisit an issue?
>
>
> I think we are unused to bugs. We have a much lower bug rate than any
> other system.

True we are used to having extremely high quality releases but if you 
look at the release notes since say 9.2, we are seeing a much larger 
increase in bug rates.

It is true that generally speaking our bug rate is low in comparison to 
other databases. That said, I think we are also resting on some laurels 
here per my previous paragraph.

>
> I think we seriously need to review our policy of adding major new
> features and have them enabled by default with no parameter to disable
> them. In the early years of PostgreSQL everything had an off switch,
> e.g. stats, bgwriter and even autovacuum defaulted to off for many years.

That's interesting although I am unsure of the cost of such a thing.

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Multi-xacts and our process problem

From
Robert Haas
Date:
On Mon, May 11, 2015 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think there's nobody, or at least very few people, who are getting
> paid to find/fix bugs rather than write cool new features.  This is
> problematic.  It doesn't help when key committers are overwhelmed by
> trying to process other peoples' patches.  (And no, I'm not sure that
> "appoint more committers" would improve matters.  What we've got is
> too many barely-good-enough patches.  Tweaking the process to let those
> into the tree faster will not result in better quality.)

I agree, although generally I think committers are responsible for
fixing what they commit, and I've certainly dropped everything a few
times to do so.  And people who will someday become committers are
generally the sorts of people who do that, too.  Perhaps we've relied
overmuch on that in some cases - e.g. I really haven't paid much
attention to the multixact stuff until lately, because I assumed that
it was Alvaro's problem.  And maybe that's not right.  But I know that
when a serious bug is found in something I committed, I expect that if
anyone else fixes it, that's a bonus.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multi-xacts and our process problem

From
Amit Kapila
Date:
On Tue, May 12, 2015 at 4:55 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-05-11 19:04:32 -0400, Tom Lane wrote:
> > I think there's nobody, or at least very few people, who are getting
> > paid to find/fix bugs rather than write cool new features.  This is
> > problematic.  It doesn't help when key committers are overwhelmed by
> > trying to process other peoples' patches.  (And no, I'm not sure that
> > "appoint more committers" would improve matters.  What we've got is
> > too many barely-good-enough patches.  Tweaking the process to let those
> > into the tree faster will not result in better quality.)
>
> +many
>
> Except perhaps that I'd expand "find/fix bugs" to include "review and
> integrate patches". Because I think few people are paid to do that
> either.

Well said.  Another thing to add to your point is helping to support
other people's ideas by providing use cases and/or a much more robust
design that can be accepted by the community.
I think one of the reasons for this is that there is no reasonable
guarantee that if a person spends a good amount of time on review, helping
other patches in the design phase, and fixing bugs, his feature patch/es will
be given more priority, which makes it difficult to bargain with one's manager
or company to get more time for such activities.  I think if the
current development process included some form of prioritization for
the feature patches of people who spend more time helping with other
patches/maintenance, it could improve the situation.  Currently, we
do have a rule in the CF process which suggests that a person has
to review an equal number and complexity of patches as he or she submits
for others to review, but I am not sure that it is followed strictly or is
sufficient.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Multi-xacts and our process problem

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, May 11, 2015 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think there's nobody, or at least very few people, who are getting
> > paid to find/fix bugs rather than write cool new features.  This is
> > problematic.  It doesn't help when key committers are overwhelmed by
> > trying to process other peoples' patches.  (And no, I'm not sure that
> > "appoint more committers" would improve matters.  What we've got is
> > too many barely-good-enough patches.  Tweaking the process to let those
> > into the tree faster will not result in better quality.)
> 
> I agree, although generally I think committers are responsible for
> fixing what they commit, and I've certainly dropped everything a few
> times to do so.  And people who will someday become committers are
> generally the sorts of people who do that, too.  Perhaps we've relied
> overmuch on that in some cases - e.g. I really haven't paid much
> attention to the multixact stuff until lately, because I assumed that
> it was Alvaro's problem.  And maybe that's not right.  But I know that
> when a serious bug is found in something I committed, I expect that if
> anyone else fixes it, that's a bonus.

For the record, I share the responsibility over committed items
principle, and I adhere to it to as full an extent as possible.
Whenever possible I try to enlist the submitter's help for a fix, but if
they do not respond I consider whatever fix to be on me.  (I have
dropped everything to get fixes done, on several occasions.)

As for multixacts, since it's what brings up this thread, many of you
realize that the amount of time I have spent fixing issues post-facto is
enormous.  If I had a glimpse of the effort that the bugfixing would
cost, I would have certainly dropped it -- spending more time on it
before commit was out of the question.  I appreciate the involvement of
others in the fixes that became necessary.

One lesson I have learned from all this is to try to limit the
additional complexity from any individual patch.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Multi-xacts and our process problem

From
Noah Misch
Date:
On Mon, May 11, 2015 at 05:33:04PM -0400, Bruce Momjian wrote:
> On Tue, May 12, 2015 at 12:29:56AM +0300, Heikki Linnakangas wrote:
> > On 05/12/2015 12:00 AM, Bruce Momjian wrote:
> > >Multi-xacts were made durable in Postgres 9.3 (released 2013-09-09) to
> > >allow primary-key-column-only locks.  1.7 years later, we are still
> > >dealing with bugs related to this feature.  Obviously, something is
> > >wrong.
> > >
> > >There were many 9.3 minor releases containing multi-xacts fixes, and
> > >these fixes have extended into 9.4.  After the first few bug-fix
> > >releases, I questioned whether we needed to revert or rework the
> > >feature, but got no positive response.  Only in the past few weeks have
> > >we got additional people involved.
> > 
> > The "revert or rework" ship had already sailed at that point. I
> 
> True.
> 
> > don't think we had much choice than just soldier through the bugs
> > after the release.
> 
> The problem is we "soldiered on" without adding any resources to the
> problem or doing a systematic review once it became clear one was
> necessary.

In both this latest emergency and the Nov-Dec 2013 one, several people showed
up and helped.  We did well in that respect, but your idea that we should have
started a systematic post-commit review is a good one.  Hoping other parts of
the change were fine, the first team dispersed.


There's a lot we might try, but I'll focus on just a couple of points.

The adversarial relationship between author and committer is an important
guarantor of quality.  For the toughest patches, the principal author should
seek out an independent committer rather than self-commit the patch.

When I want to keep a soon-to-be-committed patch's bugs out of PostgreSQL, I
can review it to find as many bugs as possible, or I can express nonspecific
mistrust.  That first option is expensive; if I did full-time patch review, I
might cover 1/4 of the changes.  That second option boils down to me telling a
committer that he is practicing bad judgment, which is painful for both of us
and unlikely to modify the patch's fate.  At the PgCon 2014 Developer Meeting,
it came out that most people had identified fklocks as the highest-risk 9.3
patch.  Here's an idea.  Shortly after the 9.5 release notes draft, let's take
a secret ballot to identify the changes threatening the most damage through
undiscovered bugs.  (Let's say the electorate consists of every committer and
every person who reviewed at least one patch during the release cycle.)
Publish the three top vote totals.  This serves a few purposes.  It sends a
message to each original committer that the community doubts his handling of
the change.  The secret ballot helps voters be honest, and seven votes against
your commit is hard to ignore.  It's a hint to that committer to drum up more
reviews and testing, to pick a simpler project next time, or even to revert.
The poll results would also help target beta testing and post-commit reviews.
For example, I would plan to complete a full post-commit review of one patch
in the list.

Thanks,
nm



Re: Multi-xacts and our process problem

From
Peter Geoghegan
Date:
On Mon, May 11, 2015 at 11:42 PM, Noah Misch <noah@leadboat.com> wrote:
> it came out that most people had identified fklocks as the highest-risk 9.3
> patch.  Here's an idea.  Shortly after the 9.5 release notes draft, let's take
> a secret ballot to identify the changes threatening the most damage through
> undiscovered bugs.  (Let's say the electorate consists of every committer and
> every person who reviewed at least one patch during the release cycle.)
> Publish the three top vote totals.  This serves a few purposes.  It sends a
> message to each original committer that the community doubts his handling of
> the change.  The secret ballot helps voters be honest, and seven votes against
> your commit is hard to ignore.  It's a hint to that committer to drum up more
> reviews and testing, to pick a simpler project next time, or even to revert.
> The poll results would also help target beta testing and post-commit reviews.
> For example, I would plan to complete a full post-commit review of one patch
> in the list.

The highest risk item identified for 9.4 was the B-Tree bug fix
patches, IIRC. It was certainly mentioned this time last year as the
most likely candidate (during the 2014 developer meeting). I'm
suspicious of this kind of ballot. While 9.4 has not been out for that
long, evidence that that B-Tree stuff is in any way destabilizing is
still thin on the ground, a year later.

Anyone that identified fklocks as the highest risk 9.3 item shouldn't
be too proud of their correct prediction. If you just look at the
release notes, it's completely obvious, even to someone who doesn't
know what a MultiXact is.
-- 
Peter Geoghegan



Re: Multi-xacts and our process problem

From
Bruce Momjian
Date:
On Tue, May 12, 2015 at 02:42:16AM -0400, Noah Misch wrote:
> > > The "revert or rework" ship had already sailed at that point. I
> > 
> > True.
> > 
> > > don't think we had much choice than just soldier through the bugs
> > > after the release.
> > 
> > The problem is we "soldiered on" without adding any resources to the
> > problem or doing a systematic review once it became clear one was
> > necessary.
> 
> In both this latest emergency and the Nov-Dec 2013 one, several people showed
> up and helped.  We did well in that respect, but your idea that we should have
> started a systematic post-commit review is a good one.  Hoping other parts of
> the change were fine, the first team dispersed.

Yes, Alvaro faithfully addressed each bug that was reported and applied
a fix, and then we waited for the next reported bug, which happened
repeatedly.  What didn't happen is someone realizing that we had a
situation that required a _different_ approach.  We just stuck to the
approach that had always worked in the past and never revisited it.

Some people think we need a better plan, but a larger issue is that we
need to recognize when our existing plan _isn't_ working and do
something different.  No matter what plan or procedure we create, there
are going to be cases where it doesn't work, and if everyone is too busy
to recognize that our plan isn't working, these mistakes will be
repeated.  Our existing plan had worked so well for so many years that
it took very long for us to recognize we needed a new approach in this
case.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Multi-xacts and our process problem

From
Robert Haas
Date:
On Tue, May 12, 2015 at 3:12 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, May 11, 2015 at 11:42 PM, Noah Misch <noah@leadboat.com> wrote:
>> it came out that most people had identified fklocks as the highest-risk 9.3
>> patch.  Here's an idea.  Shortly after the 9.5 release notes draft, let's take
>> a secret ballot to identify the changes threatening the most damage through
>> undiscovered bugs.  (Let's say the electorate consists of every committer and
>> every person who reviewed at least one patch during the release cycle.)
>> Publish the three top vote totals.  This serves a few purposes.  It sends a
>> message to each original committer that the community doubts his handling of
>> the change.  The secret ballot helps voters be honest, and seven votes against
>> your commit is hard to ignore.  It's a hint to that committer to drum up more
>> reviews and testing, to pick a simpler project next time, or even to revert.
>> The poll results would also help target beta testing and post-commit reviews.
>> For example, I would plan to complete a full post-commit review of one patch
>> in the list.
>
> The highest risk item identified for 9.4 was the B-Tree bug fix
> patches, IIRC. It was certainly mentioned this time last year as the
> most likely candidate (during the 2014 developer meeting). I'm
> suspicious of this kind of ballot. While 9.4 has not been out for that
> long, evidence that that B-Tree stuff is in any way destabilizing is
> still thin on the ground, a year later.
>
> Anyone that identified fklocks as the highest risk 9.3 item shouldn't
> be too proud of their correct prediction. If you just look at the
> release notes, it's completely obvious, even to someone who doesn't
> know what a MultiXact is.

I think that's rather facile, and I really don't see how you would
know that from looking at those release notes.  I thought multixacts
had risk, but obviously nobody came close to predicting how bad things
were going to be.  If they had, I'm pretty sure we would have pulled
the patch.  The fact that the 9.4 btree changes weren't equally
destabilizing doesn't mean that they weren't risky.  There was a risk
that the Cuban missile crisis would start a nuclear war; in the end,
it didn't, but that doesn't mean there was no risk.

Part of what went wrong with multixacts is that neither Alvaro nor anyone
who reviewed the patch gave adequate thought to the vacuum
requirements.  There was a whole series of things that needed to be
done there which just weren't done.  I think if it had been realized
how much work remained to do there, and how necessary it was for every
single bit of machinery that we have for freezing xmin to also exist
for freezing xmax, we would not have gone forward.  Conceptual
failures, where there is a whole class of work that you just don't
even realize needs to be done, are much more damaging than mechanical
errors, where you realize that something needs to be done but you
don't do it correctly.

As an example, take Tom's patch to speed up the parameter setup hooks
for PL/pgsql.  Here's his initial analysis:

http://www.postgresql.org/message-id/4146.1425872254@sss.pgh.pa.us

Then a lot of arguing about non-technical points ensued, followed
eventually by this:

http://www.postgresql.org/message-id/25506.1426029880@sss.pgh.pa.us

We all have ideas like that - things that initially seem like good
ideas, but there's some crucial conceptual point that we're missing
that means that the patch doesn't just need bug fixes; but rather the
whole idea needs to be reconsidered.  If we find that out before
release, we can pull the whole thing back.  If we find it out after
release, things get a lot harder.  In the case of multixacts, we
didn't realize that we'd overlooked significant pieces of work until
after the thing was shipped.

Another crucial difference between the multixact patch and many other
patches is that it wasn't a feature you could turn off.  For example,
if BRIN has bugs, you can almost certainly avoid hitting them by not
using BRIN.  And many people won't, so even if the feature turns out
to be horrifically buggy, 90%+ of our users will not even notice.
ALTER TABLE .. SET LOGGED/UNLOGGED may easily have bugs that eat your
data, but if you don't use it, then you won't be affected.  Of the
major user-visible features committed to 9.5 that could hose our users
more broadly, I'd put RLS and UPSERT pretty high on the list.  We
might be lucky enough that any breakage there is confined to users of
those features, but the code is not as contained as it is for
something like BRIN, so there is a risk of breaking other stuff.
Departing from what's user-visible, Heikki's WAL format changes could
break recovery badly for everyone and we could just be screwed.  That
risk is particularly acute because we really can't change the WAL
format once the release is shipped.  If it's broken, we're probably in
big trouble.  Multixacts, too, fell into this category of things that
cannot be turned off: they touched the heap storage format, and anyone
who used foreign keys (which is nearly everyone) really had no choice
but to use them.

Finally, the multixact patch fell prey to reverse bikeshed syndrome.
It was a big complicated patch that most people couldn't really
understand (because it was big and complicated) so we just ignored it.
I certainly did that.  I may have participated in some mailing list
threads, but I didn't really understand what was going on in detail
and I didn't study it in a level of detail that would have led me to
find problems.  I was nervous about it, but instead of digging into
that, I just assumed it was probably OK.  I think many other people
probably did likewise.  The fact that it's been a long time since
we've done something that caused serious, hard-to-fix reliability
problems likely contributed to our sense that things wouldn't go too
far wrong.

All of these things combined in an explosive fashion.  If the patch
had been simple enough to be broadly understandable, or if it had been
something that could plausibly have come with an "off" switch, or if
anyone had realized that there were whole areas that had not been
thought through carefully, the consequences would have been much less
serious.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multi-xacts and our process problem

From
Peter Geoghegan
Date:
On Tue, May 12, 2015 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that's rather facile, and I really don't see how you would
> know that from looking at those release notes.  I thought multixacts
> had risk, but obviously nobody came close to predicting how bad things
> were going to be.  If they had, I'm pretty sure we would have pulled
> the patch.  The fact that the 9.4 btree changes weren't equally
> destabilizing doesn't mean that they weren't risky.  There was a risk
> that the Cuban missile crisis would start a nuclear war; in the end,
> it didn't, but that doesn't mean there was no risk.

I think you go on to make my argument for me, here. The fklocks patch
was particularly big and complicated, and slipped 9.2, and everyone
was more or less obligated to use it with their existing application.
It was not difficult to imagine that it was *the* highest risk item.
That wasn't a particularly useful observation at that point - I don't
think it made anyone very introspective about MultiXacts. My point, of
course, is that it was a concern about relative risk, as opposed to
absolute risk, and there's not that much you can do with that -
something has to be #1.

> Part of what went wrong with multixacts is neither Alvaro nor anyone
> who reviewed the patch gave adequate thought to the vacuum
> requirements.  There was a whole series of things that needed to be
> done there which just weren't done.  I think if it had been realized
> how much work remained to do there, and how necessary it was for every
> single bit of machinery that we have for freezing xmin to also exist
> for freezing xmax, we would not have gone forward.  Conceptual
> failures, where there is a whole class of work that you just don't
> even realize needs to be done, are much more damaging than mechanical
> errors, where you realize that something needs to be done but you
> don't do it correctly.

I agree, but no one really knew this at the time. Despite this,
everyone still would have identified fklocks as the highest risk item,
and indeed, some actually did. It's relatively easy to say that
something is the highest risk item in an anonymous poll. That's what
makes it easy to not take it seriously.

> Another crucial difference between the multixact patch and many other
> patches is that it wasn't a feature you could turn off.  For example,
> if BRIN has bugs, you can almost certainly avoid hitting them by not
> using BRIN.  And many people won't, so even if the feature turns out
> to be horrifically buggy, 90%+ of our users will not even notice.
> ALTER TABLE .. SET LOGGED/UNLOGGED may easily have bugs that eat your
> data, but if you don't use it, then you won't be affected.  Of the
> major user-visible features committed to 9.5 that could hose our users
> more broadly, I'd put RLS and UPSERT pretty high on the list.  We
> might be lucky enough that any breakage there is confined to users of
> those features, but the code is not as contained as it is for
> something like BRIN, so there is a risk of breaking other stuff.

I think that the chances of UPSERT seriously affecting those that
don't use it are extremely low. For those that use the feature, we
haven't repeated the mistakes of Multixacts: the on-disk
representation of tuples that are committed is always identical to the
historic representation of ordinary tuples, because speculative
insertions are explicitly "confirmed". VACUUM does not need to care.

> Departing from what's user-visible, Heikki's WAL format changes could
> break recovery badly for everyone and we could just be screwed.  That
> risk is particularly acute because we really can't change the WAL
> format once the release is shipped.  If it's broken, we're probably in
> big trouble.  Multixacts, too, fell into this category of things that
> cannot be turned off: they touched the heap storage format, and anyone
> who used foreign keys (which is nearly everyone) really had no choice
> but to use them.

It seems like you're just saying that because it's a complicated patch
that touches the WAL format. It's not a specific concern, and it's not
a concern about a systematic defect or "conceptual failure", as you
put it. That makes it of limited value - you can't hold up progress
because of a very vague concern like that.

> All of these things combined in an explosive fashion.  If the patch
> had been simple enough to be broadly understandable, or if it had been
> something that could plausibly have come with an "off" switch, or if
> anyone had realized that there were whole areas that had not been
> thought through carefully, the consequences would have been much less
> serious.

Agreed.

-- 
Peter Geoghegan



Re: Multi-xacts and our process problem

From
Stephen Frost
Date:
* Peter Geoghegan (pg@heroku.com) wrote:
> On Tue, May 12, 2015 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Another crucial difference between the multixact patch and many other
> > patches is that it wasn't a feature you could turn off.  For example,
> > if BRIN has bugs, you can almost certainly avoid hitting them by not
> > using BRIN.  And many people won't, so even if the feature turns out
> > to be horrifically buggy, 90%+ of our users will not even notice.
> > ALTER TABLE .. SET LOGGED/UNLOGGED may easily have bugs that eat your
> > data, but if you don't use it, then you won't be affected.  Of the
> > major user-visible features committed to 9.5 that could hose our users
> > more broadly, I'd put RLS and UPSERT pretty high on the list.  We
> > might be lucky enough that any breakage there is confined to users of
> > those features, but the code is not as contained as it is for
> > something like BRIN, so there is a risk of breaking other stuff.
>
> I think that the chances of UPSERT seriously affecting those that
> don't use it are extremely low. For those that use the feature, we
> haven't repeated the mistakes of Multixacts: the on-disk
> representation of tuples that are committed is always identical to the
> historic representation of ordinary tuples, because speculative
> insertions are explicitly "confirmed". VACUUM does not need to care.

I feel more-or-less the same about RLS, but then again, it can be
difficult to see issues when you're so close to a piece of work.

Reviewing UPSERT is on my list of things to do post-feature freeze and
I'd certainly welcome additional review of RLS.  Thankfully, it's gotten
review, comments, and rework since it went in and is in quite a bit
better shape than it was originally.  It would have been better to get
some of that before it went in.  On the flip side, we could have
ended up spending a lot more time trying to hash through the UPSERT+RLS
bits if RLS hadn't already gone in and been worked through by that
point, and I'm also not sure we would have gotten the other
improvements and changes in (particularly things like the improvement of
qual push-down through RLS and SB views; changing when RLS on
INSERT/UPDATE happens would also have been more difficult post
feature-freeze).
Thanks!
    Stephen