Re: [CORE] postpone next week's release

From Andres Freund
Subject Re: [CORE] postpone next week's release
Date
Msg-id 20150530204727.GJ13944@alap3.anarazel.de
In response to Re: [CORE] postpone next week's release  (Bruce Momjian <bruce@momjian.us>)
Responses Re: [CORE] postpone next week's release  (Bruce Momjian <bruce@momjian.us>)
Re: [CORE] postpone next week's release  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Hi Bruce, Everyone,

On 2015-05-30 11:45:59 -0400, Bruce Momjian wrote:
> Let me share something that people have told me privately but don't want
> to state publicly (at least with attribution), and that is that we have
> seen great increases in feature development (often funded), without a
> corresponding increase in development efforts focused on stability.

Yes, I have seen and heard that too. What I think is also important is
that our adoption has, in turn, outpaced feature development (and thus,
transitively, stability work).

> The bottom line is that we just can't keep going on like this.  The fact
> we put out a release two weeks ago, then need to put out a fix release
> for that, but we have more multi-xact bugs to fix and can't decide if we
> should do one or two minor releases, and are pushing out an alpha of 9.5
> because we know we aren't ready for a beta, just confirms my analysis.

I don't think that alone confirms very much.

> I hate to be the bearer of bad news, but I think bad news is what we
> must face.

Well, the question is what we do with that observation. Personally I
think it's not a new one. This point has been made repeatedly, including
at most if not all developer meetings I attended. I definitely had
conversations around it in person, on IM, and on list.


I don't think it's primarily a problem of lack of review, although that
is a large problem.  I think the biggest systemic problem is that the
compound complexity of postgres has increased dramatically over the
years.  Features have added complexity little by little, each increment
not looking that bad on its own.  Since 8.0 the code size has roughly
doubled, but very little has been done to manage the increased
complexity. Few new abstractions have been introduced and the structure
of the code is largely the same.

As a somewhat extreme example, let's look at StartupXLOG(). In 8.0 it
was ~500 LOC, in master it's ~1500.  The interactions in 8.0 were
complex; they have gotten much more complex since.  It fulfills lots of
different roles, all in one function:

(roughly in the order things happen, but simplified)
* Read the control file/determine whether we crashed
* recovery.conf handling
* backup label handling
* tablespace map handling (huh, I missed that this was added directly to StartupXLOG. What a bad idea)
* Determine whether we're doing archive recovery, read the relevant checkpoint if so
* relcache init file removal
* timeline switch handling
* Loading the checkpoint we're starting from
* Initialization of a lot of subsystems
* crash recovery/replay
  * Including pgstat, unlogged table, exported snapshot handling
  * iff hot standby, some more subsystems are initialized here
  * hot standby state handling
  * replay process initialization
  * crash replay itself, including
    * progress tracking
    * recovery pause handling
    * nextxid tracking
    * timeline increase handling
    * hot standby state handling
  * unlogged relations handling
  * archive recovery handling
  * creation/initialization of the end of recovery checkpoint
  * timeline increment if failover
* subsystem initialization iff !hot_standby
* end of recovery actions

Yes, that's one routine. And, to make things even funnier, half of that
routine isn't exercised by our tests.
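
To make the contrast concrete, here is a minimal sketch of the kind of
structure that is missing: each role from the list above becomes a small
phase function, and a table drives them in order. This is purely
illustrative. It is not PostgreSQL code and not a concrete proposal; every
name in it (StartupState, StartupPhase, run_startup, the phase_* functions)
is hypothetical, and the phases are stubs that only flip flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state passed between phases; not a PostgreSQL struct. */
typedef struct StartupState
{
    bool shut_down_cleanly; /* input: what the control file would say */
    bool in_recovery;       /* set when replay is required */
    bool replayed;          /* set once the (stubbed) replay has run */
    int  phases_run;        /* number of phases that completed */
} StartupState;

/* A phase returns false to abort startup. */
typedef bool (*StartupPhaseFn) (StartupState *state);

typedef struct StartupPhase
{
    const char     *name;
    StartupPhaseFn  run;
} StartupPhase;

/* Stub: decide whether we crashed, based on the recorded shutdown state. */
bool phase_read_control_file(StartupState *s)
{
    s->in_recovery = !s->shut_down_cleanly;
    return true;
}

/* Stub: replay only happens when recovery is needed. */
bool phase_replay(StartupState *s)
{
    if (s->in_recovery)
        s->replayed = true;
    return true;
}

/* Stub: leave recovery mode. */
bool phase_end_of_recovery(StartupState *s)
{
    s->in_recovery = false;
    return true;
}

/* The ordering lives in one table instead of one 1500-line function. */
const StartupPhase startup_phases[] = {
    {"read control file", phase_read_control_file},
    {"replay", phase_replay},
    {"end of recovery", phase_end_of_recovery},
};

#define N_STARTUP_PHASES (sizeof(startup_phases) / sizeof(startup_phases[0]))

/* Run phases in order; return the failing phase's name, or NULL on success. */
const char *run_startup(StartupState *state, const StartupPhase *phases,
                        size_t nphases)
{
    assert(state != NULL && phases != NULL);
    for (size_t i = 0; i < nphases; i++)
    {
        if (!phases[i].run(state))
            return phases[i].name;
        state->phases_run++;
    }
    return NULL;
}
```

The point of the sketch is only that a phase like phase_replay() can be
called directly from a test with a hand-built StartupState, which is not
possible when everything lives in one routine.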

You can argue that this is an outlier, but I don't think so. Heapam, the
planner, etc. have similar cases.

And I think this, to some degree, explains a lot of the multixact
problems. While there were a few "simple bugs", most of them were
interactions between the various subsystems that are rather intricate.


So, I think we have built up a lot of technical debt. And very little
effort has been made to fix that; in the cases where people have tried,
the reception has often been cool, because refactoring things obviously
will destabilize in the short term, even if it fixes problems in the
long term.  I don't think that's sustainable.

We can't improve the situation by just delaying the 9.5 release or
something like that. We need to actively work on making the codebase
easier to understand and better tested. But that is actual development
work, and shouldn't happen at the tail end of a release.


Regards,

Andres


