Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

From Heikki Linnakangas
Subject Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Date
Msg-id 5598E14E.50700@iki.fi
In response to Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Fabien COELHO <coelho@cri.ensmp.fr>)
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On 07/04/2015 07:34 PM, Fabien COELHO wrote:
>
>>>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't
>>>>> completely eliminate the problem, but it makes it a lot better.
>
> I've looked at the maths.
>
> I think that the load is distributed as the derivative of this function,
> that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
> passes 1.0 (the average load) at around 40% progress, and ends up at 1.5, that
> is, the finishing load is 1.5 times the average load, just before fsyncing files.
> This looks like a recipe for a bad time: I would say this is too large an
> overload. I would suggest a much lower value, say around 1.1...

Hmm. The load is distributed as a derivative of that, but probably not the way 
you think. Note that X means the amount of WAL consumed, not time. The 
goal is that I/O is constant over time, but the consumption of WAL over 
time is non-linear, with a lot more WAL consumed in the beginning of a 
checkpoint cycle. The function compensates for that.

> The other issue with this function is that it should only degrade
> performance, by disrupting the write distribution, if someone has WAL on a
> different disk. As I understand it, this thing only makes sense if the
> WAL & the data are on the same disk. This really suggests a GUC.

No, the I/O storm caused by full-page-writes is a problem even if WAL is 
on a different disk. Even though the burst of WAL I/O then happens on a 
different disk, the fact that we consume a lot of WAL in the beginning 
of a checkpoint makes the checkpointer think that it needs to hurry up, 
in order to meet the deadline. It will flush a lot of pages in a rush, 
so you get a burst of I/O on the data disk too. Yes, it's even worse 
when WAL and data are on the same disk, but even then, I think the 
random I/O caused by the checkpointer hurrying is more significant than 
the extra WAL I/O, which is sequential.

To illustrate that, imagine that the checkpoint begins now. The 
checkpointer calculates that it has 10 minutes to complete the 
checkpoint (checkpoint_timeout), or until 1 GB of WAL has been generated 
(derived from max_wal_size), whichever happens first. Immediately after 
the Redo-point has been established, in the very beginning of the 
checkpoint, the WAL storm begins. Every backend that dirties a page also 
writes a full-page image. After just 10 seconds, those backends have 
already written 200 MB of WAL. That's 1/5 of the quota, and based on 
that, the checkpointer will quickly flush 1/5 of all buffers. In 
reality, the WAL consumption is not linear, and will slow down as time 
passes and fewer full-page writes happen. So in practice, the checkpointer 
would have a lot more time to complete the checkpoint - it is 
unnecessarily aggressive in the beginning of the checkpoint.
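
To put rough numbers on that, here is a minimal, self-contained sketch of the 
unpatched behaviour (this is not the actual checkpointer code; the names and 
the fixed 1 GB / 200 MB figures are just the example above):

    #include <stdio.h>

    int
    main(void)
    {
        /* Example from above: ~1 GB WAL quota, 200 MB consumed after 10 seconds. */
        double  wal_quota = 1000.0;     /* MB, derived from max_wal_size */
        double  wal_consumed = 200.0;   /* MB of WAL written since the redo point */

        /*
         * Unpatched behaviour: the buffer-flush target follows WAL consumption
         * linearly, so consuming 1/5 of the quota makes the checkpointer aim
         * to have 1/5 of the dirty buffers flushed already.
         */
        double  target = wal_consumed / wal_quota;

        printf("flush target: %.0f%% of dirty buffers\n", target * 100.0);
        return 0;
    }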

The correction factor in the patch compensates for that. With the X^1.5 
formula, when 20% of the WAL has already been consumed, the checkpointer 
will have flushed only ~9% of the buffers, not 20% as it would without the patch.
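
In terms of the sketch above, the correction amounts to a one-line change 
(again only illustrative, not the exact patch code):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  wal_fraction = 0.2;     /* 1/5 of the WAL quota consumed */

        /*
         * Patched behaviour: raise the WAL fraction to the power 1.5, so the
         * flush target lags behind WAL consumption early in the checkpoint,
         * when full-page writes inflate the WAL volume.
         */
        double  target = pow(wal_fraction, 1.5);

        printf("flush target: %.1f%% of dirty buffers\n", target * 100.0);
        return 0;
    }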

The ideal correction formula, f(x), would be such that f(g(X)) = X, where:
 X is time, 0 = beginning of checkpoint, 1.0 = targeted end of 
checkpoint (checkpoint_timeout), and
 g(X) is the amount of WAL generated, 0 = beginning of checkpoint, 1.0 
= targeted end of checkpoint (derived from max_wal_size).

Unfortunately, we don't know the shape of g(X), as that depends on the 
workload. It might be linear, if there is no effect at all from 
full_page_writes. Or it could be a step-function, where every write 
causes a full page write, until all pages have been touched, and after 
that none do (something like an UPDATE without a where-clause might 
cause that). In pgbench-like workloads, it's something like sqrt(x). I 
picked X^1.5 as a reasonable guess. It's close enough to linear that it 
shouldn't hurt too much if g(x) is linear. But it cuts the worst spike 
at the very beginning, if g(x) is more like sqrt(x).
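
To spell that out (my arithmetic, not something from the patch): with the 
definition above, if g(X) = sqrt(X) then the ideal correction would be 
f(y) = y^2, because f(g(X)) = (sqrt(X))^2 = X. X^1.5 sits between that and no 
correction at all. At 20% of the WAL quota consumed, the three give roughly:

    f(y) = y      -> 0.20  (no correction)
    f(y) = y^1.5  -> 0.09  (the patch)
    f(y) = y^2    -> 0.04  (ideal if g(x) = sqrt(x))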

This is all assuming that the application load is constant. If it's 
not, g(x) can obviously have any shape, and there's no way we can 
predict that. But that's a different story, nothing to do with 
full_page_writes.

>> I have run some tests with this patch, and the detailed results of the
>> runs are attached with this mail.
>
> I do not really understand the aggregated figures in the attached files.

Me neither. It looks like Amit measured the time spent in mdread and 
mdwrite, but I'm not sure what conclusions one can draw from that.

>> I thought the patch should show a difference if I keep max_wal_size at
>> a somewhat lower or moderate value, so that the checkpoint gets triggered
>> due to WAL size, but I am not seeing any major difference in how the
>> writes are spread.
>
> I'm not sure I understand your point. I would say that with full-speed
> pgbench the disk is always busy writing as much as possible, either
> checkpoint writes or WAL writes, so the write load as such should not be
> that different anyway?
>
> I understood that the point of the patch is to check whether there is a
> tps dip or not when the checkpoint begins, but I'm not sure how this can
> be inferred from all the aggregated data you sent, and from my recent
> tests the tps is very variable on HDD anyway.

Right, that's my understanding too. If the disk is not saturated, 
perhaps because you used pgbench's rate-limiting option, then measuring 
the disk I/O would be useful too: flatter is better.

- Heikki



