Re: Improvement of checkpoint IO scheduler for stable transaction responses

From Greg Smith
Subject Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date
Msg-id 51E943A1.9030702@2ndQuadrant.com
In reply to Re: Improvement of checkpoint IO scheduler for stable transaction responses  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
Responses Re: Improvement of checkpoint IO scheduler for stable transaction responses  (didier <did447@gmail.com>)
Re: Improvement of checkpoint IO scheduler for stable transaction responses  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
List pgsql-hackers
On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:
> Recently, a user who think system availability is important uses
> synchronous replication cluster.

If your argument for why it's OK to ignore bounding crash recovery on 
the master is that it's possible to failover to a standby, I don't think 
that is acceptable.  PostgreSQL users certainly won't like it.

> I want you to read especially point that is line 631, 651, and 656.
> MAX_WRITEBACK_PAGES is 1024 (1024 * 4096 byte).

You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm 
to see that everything you're telling me about the writeback code and its 
congestion logic I already knew back in 2007.  The situation is even worse 
than you describe, because this section of Linux has gone through multiple, 
major revisions since then.  You can't just say "here is the writeback 
source code"; you have to reference each of the commonly deployed 
versions of the writeback feature to tell how this is going to play out 
if released.  There are four major ones I pay attention to:  the old 
kernel style as seen in RHEL5/2.6.18--that's what my 2007 paper 
discussed--the similar code but with very different defaults in 2.6.22, 
the writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then 
there are newer kernels.  (The newer ones separate out into a few 
branches too; I haven't mapped those as carefully yet.)

If you tried to model your feature on Linux's approach here, what that 
means is that the odds of an ugly feedback loop here are even higher. 
You're increasing the feedback on what's already a bad situation that 
triggers trouble for people in the field.  When Linux's congestion logic 
causes checkpoint I/O spikes to get worse than they otherwise might be, 
people panic because it looks as though writes have stopped altogether.  
There are 
some examples of what really bad checkpoints look like in 
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf 
if you want to see some of them.  That's the talk I did around the same 
time I was trying out spreading the database fsync calls out over a 
longer period.

When I did that, checkpoints became even less predictable, and that was 
a major reason why I rejected the approach.  I think your 
suggestion will have the same problem.  You just aren't generating test 
cases with really large write workloads yet to see it.  You also don't 
yet seem to appreciate why exceeding the checkpoint timeout is a very bad 
thing.

> In addition, if you set a large value of a checkpoint_timeout or
> checkpoint_complete_taget, you have said that performance is improved,
> but is it true in all the cases?

The timeout, yes.  Throughput is always improved by increasing 
checkpoint_timeout.  Fewer checkpoints per unit of time increases 
efficiency.  Fewer writes of the most heavily accessed buffers happen per 
transaction.  It is faster because you are doing less work, which on 
average is always faster than doing more work.  Doing less work 
usually beats doing more work, even when the extra work is done more 
cleverly.

If you want to see how much work per transaction a test is doing, track 
the numbers of buffers written at the beginning/end of your test via 
pg_stat_bgwriter.  Tests that delay checkpoints will show a lower total 
number of writes per transaction.  That seems more efficient, but it's 
efficiency mainly gained by ignoring checkpoint_timeout.
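The bookkeeping suggested above is simple arithmetic over pg_stat_bgwriter counters. Here is a minimal sketch; the column names match the pre-9.x-through-14 pg_stat_bgwriter view, but the snapshot numbers and transaction count are made up for illustration:

```python
# Illustrative sketch: compare write efficiency between benchmark runs
# using before/after snapshots of pg_stat_bgwriter counters.
# Dict keys mirror pg_stat_bgwriter columns; all numbers are invented.

def writes_per_txn(before, after, transactions):
    """Total buffers written per transaction over a test run."""
    written = sum(after[k] - before[k]
                  for k in ("buffers_checkpoint", "buffers_clean",
                            "buffers_backend"))
    return written / transactions

# Hypothetical snapshots taken at test start and test end.
before = {"buffers_checkpoint": 1000, "buffers_clean": 200,
          "buffers_backend": 50}
after = {"buffers_checkpoint": 61000, "buffers_clean": 5200,
         "buffers_backend": 2050}

print(writes_per_txn(before, after, transactions=100_000))  # → 0.67
```

A run that delays checkpoints will show this number dropping, which is exactly the "fewer total writes" effect described above rather than a genuine scheduling gain.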

> When a checkpoint complication target is actually enlarged,
> performance may fall in some cases. I think this as the last fsync
> having become heavy owing to having write in slowly.

I think you're confusing throughput and latency here.  Increasing the 
checkpoint timeout, or to a lesser extent the completion target, 
increases throughput on average.  It results in less work, and the 
amount of work is much more important than worrying about 
scheduler details.  No matter how efficient a given write is, whether 
you've sorted it across elevator horizon boundary A or boundary B, it's 
better not to do it at all.

But having fewer checkpoints sometimes makes latency worse too.  Whether 
latency or throughput is considered the more important thing is very 
complicated.  Having checkpoint_completion_target as the knob to control 
the latency/throughput trade-off hasn't worked out very well.  No one 
has done a really comprehensive look at this trade-off since the 8.3 
development.  I got halfway through one for 9.1; we figured out that the 
fsync queue filling was actually responsible for most of my result 
variation, and then Robert fixed that.  It was a big enough change that 
I had to throw out all my data from before it as no longer relevant.

By the way:  if you have a theory like "the last fsync having become 
heavy" for why something is happening, measure it.  Set log_min_messages 
to debug2 and you'll get details about every single fsync in your logs.  
I did that for all my tests, and that's what led me to conclude that 
delaying the fsyncs on its own didn't help that problem.  I was 
measuring my theories as directly as possible.
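Once those messages are in the logs, extracting per-file fsync durations is a small parsing job. A sketch follows; the "checkpoint sync: ... time=... msec" message format is an assumption based on one PostgreSQL version and may differ in yours, so treat the regex as a starting point:

```python
import re

# Sketch: pull per-file fsync durations out of server logs written with
# log_min_messages=debug2. The message format matched here is an
# assumption; adjust the regex to whatever your server version emits.
SYNC_RE = re.compile(r"checkpoint sync:.*?file=(\S+).*?time=([\d.]+)\s*msec")

def fsync_times(log_lines):
    """Return (file, milliseconds) pairs for each logged checkpoint fsync."""
    out = []
    for line in log_lines:
        m = SYNC_RE.search(line)
        if m:
            out.append((m.group(1), float(m.group(2))))
    return out

# Invented sample log lines, including one non-matching LOG line.
sample = [
    "DEBUG:  checkpoint sync: number=1 file=base/16384/1234 time=12.345 msec",
    "DEBUG:  checkpoint sync: number=2 file=base/16384/5678 time=2501.002 msec",
    "LOG:  checkpoint complete: wrote 6700 buffers",
]
for name, ms in fsync_times(sample):
    print(name, ms)
```

A single multi-second entry, like the 2501 ms one in the sample, is the "last fsync having become heavy" pattern made visible instead of guessed at.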

> I would like to make a itemizing list which can be proof of my patch
> from you. Because DBT-2 benchmark spent lot of time about 1 setting test
> per 3 - 4 hours.

That's great, but to add some perspective here I have spent over 1 year 
of my life running tests like this.  The development cycle to do 
something useful in this area is normally measured in months of machine 
time running benchmarks, not hours or days.  You're doing well so far, 
but you're just getting started.

My itemized list is simple:  throw out all results where the checkpoint 
end goes more than 5% beyond its target.  When that happens, no matter 
what you think is causing your gain, I will assume it's actually fewer 
total writes that are improving things.
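That filtering rule is mechanical enough to sketch directly. The field names below (target_s, actual_s) are invented for illustration; the point is only the 5% slack test:

```python
# Sketch of the filtering rule above: discard any benchmark run in which
# a checkpoint finished more than 5% past its scheduled end.
# Field names target_s/actual_s are hypothetical.

def within_target(checkpoints, slack=0.05):
    """True if every checkpoint ended within (1 + slack) of its target."""
    return all(c["actual_s"] <= c["target_s"] * (1 + slack)
               for c in checkpoints)

run_a = [{"target_s": 270, "actual_s": 268},
         {"target_s": 270, "actual_s": 275}]   # worst overrun ~1.9%
run_b = [{"target_s": 270, "actual_s": 268},
         {"target_s": 270, "actual_s": 300}]   # 300 s is ~11% past target

print(within_target(run_a))  # → True
print(within_target(run_b))  # → False
```

Only runs that pass this check say anything about the scheduling change itself; the rest are measuring delayed checkpoints.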

I'm willing to consider an optional, sloppy checkpoint approach that 
uses heavy load to adjust how often checkpoints happen.  But if we're 
going to do that, it has to be extremely clear that the reason for the 
gain is the checkpoint spacing--and there is going to be a crash 
recovery time penalty paid for it.  And this patch is not how I would do 
that.

It's not really clear yet where the gains you're seeing are really 
coming from.  If you re-ran all your tests with pg_stat_bgwriter 
before/after snapshots, logged every fsync call, and then built some 
tools to analyze the fsync call latency, then you'd have enough data to 
talk about this usefully.  That's what I consider the bare minimum 
evidence to consider changing something here.  I have all of those 
features in pgbench-tools with checkpoint logging turned way up, but 
they're not all in the dbt2 toolset yet as far as I know.
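As a sketch of what "analyze the fsync call latency" might boil down to, here is a tiny summary over a list of durations in milliseconds; the durations are hard-coded stand-ins for values parsed from real logs:

```python
# Sketch: the kind of fsync-latency summary worth reporting alongside
# before/after pg_stat_bgwriter snapshots. Durations (ms) are invented.

def latency_summary(ms):
    """Count, max, and rough p50/p99 of a list of fsync durations."""
    xs = sorted(ms)
    pct = lambda p: xs[min(len(xs) - 1, int(p * len(xs)))]
    return {"count": len(xs), "max": xs[-1],
            "p50": pct(0.50), "p99": pct(0.99)}

durations = [3.1, 4.7, 5.0, 6.2, 8.9, 12.4, 250.0, 2600.0]
print(latency_summary(durations))
```

A run whose tail (max, p99) collapses while writes per transaction stay flat would be real evidence for a scheduling improvement; a tail that only shrinks when total writes also drop points back at checkpoint spacing.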

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


