Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

From: Fabien COELHO
Subject: Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Date:
Msg-id: alpine.DEB.2.10.1512231554490.22350@sto
In reply to: Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Robert Haas <robertmhaas@gmail.com>)
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers
Hello Robert,

> On a pgbench test, and probably many other workloads, the impact of
> FPWs declines exponentially (or maybe geometrically, but I think
> exponentially) as we get further into the checkpoint.

Indeed. If the probability of hitting a page is uniform, I think that the 
FPW probability is exp(-n/N) for the n-th page access.
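
As a quick sanity check of that estimate (a toy simulation of my own, not
something from the thread; N and the write count are arbitrary): pick pages
uniformly at random out of N and count, per window of writes, how many are
first touches since the checkpoint started. The observed first-touch (i.e.
FPW) rate should track exp(-n/N):

/*
 * Toy simulation: empirical FPW probability under uniform page access,
 * compared with exp(-n/N).  An FPW is needed only on the first touch of
 * a page after the checkpoint starts.  Illustration only, not PostgreSQL
 * code; N and the number of writes are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int
main(void)
{
	const int	N = 100000;			/* working-set size in pages */
	const int	writes = 3 * N;		/* page writes to simulate */
	const int	window = N / 10;	/* report every N/10 writes */
	char	   *touched = calloc(N, 1);
	int			fpw = 0;

	srandom(20151223);
	for (int n = 1; n <= writes; n++)
	{
		int			page = random() % N;

		if (!touched[page])
		{
			touched[page] = 1;
			fpw++;
		}
		if (n % window == 0)
		{
			printf("n = %6d  observed FPW rate %.3f  exp(-n/N) %.3f\n",
				   n, (double) fpw / window,
				   exp(-(n - window / 2.0) / N));
			fpw = 0;
		}
	}
	free(touched);
	return 0;
}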

> The first write is dead certain to need an FPW; after that, if access is 
> more or less random, the chance of needing an FPW for the next write 
> decreases in proportion to the number of FPWs already written.  As the 
> chances of NOT needing an FPW grow higher, the tps rate starts to 
> increase, initially just a bit, but then faster and faster as the 
> percentage of the working set that has already had an FPW grows.  If the 
> working set is large, we're still doing FPWs pretty frequently when the 
> next checkpoint hits - if it's small, then it'll tail off sooner.

Yes.

>> My actual point is that it should be tested with different and especially
>> smaller values, because 1.5 changes the overall load distribution *a lot*.
>> For testing purposes I suggested that a GUC would help, but the patch author
>> has never come back to intervene on the thread, discuss the arguments, or
>> provide another patch.
>
> Well, somebody else should be able to hack a GUC into the patch.

Yep. But I'm so far behind everything that I was basically waiting for the 
author to do it :-)

> I think one thing that this conversation exposes is that the size of
> the working set matters a lot.   For example, if the workload is
> pgbench, you're going to see a relatively short FPW-related spike at
> scale factor 100, but at scale factor 3000 it's going to be longer and
> at some larger scale factor it will be longer still.  Therefore you're
> probably right that 1.5 is unlikely to be optimal for everyone.
>
> Another point (which Jan Wieck made me think of) is that the optimal
> behavior here likely depends on whether xlog and data are on the same
> disk controller.  If they aren't, the FPW spike and background writes
> may not interact as much.

Yep, I pointed that out as well. In that case the patch just disrupts the 
checkpoint load for no benefit... which would make a GUC mandatory.

>> [...]. I think that it makes sense for xlog-triggered checkpoints, but 
>> less so with time-triggered checkpoints. I may be wrong, but I think 
>> that this deserves careful analysis.
>
> Hmm, off-hand I don't see why that should make any difference.  No
> matter what triggers the checkpoint, there is going to be a spike of
> FPI activity at the beginning.

Hmmm. Let me try to explain, with some hand-waving:

AFAICR with xlog-triggered checkpoints, the checkpointer's progress is 
measured against the amount of WAL consumed since the checkpoint started, 
which does not grow linearly in time for the reason you pointed out above 
(a lot of FPWs at the beginning, fewer towards the end). While the WAL is 
growing quickly, the checkpointer thinks it is late and has some catching 
up to do, so it starts trying to write quickly as well. This is a double 
whammy: both are trying to write more at the same time, and probably 
neither is succeeding.

For time-triggered checkpoints, the WAL still fills up quickly at first, 
*but* the checkpointer's load is balanced against time. This is a "simple" 
whammy: the checkpointer uses IO bandwidth that the WAL needs right now, 
and it could afford to wait a little because the WAL will need less of it 
later, but at least it is not trying to catch up by writing even more. So 
the load shifting needed in this case is not the same as in the previous 
case.

As you point out there is a WAL spike in both cases, but in one case there 
is also a checkpointer spike while in the other the checkpointer load is 
flat.
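
To put that asymmetry in pseudo-code, here is a simplified standalone model 
of the two "on schedule" checks, in the spirit of IsCheckpointOnSchedule but 
not the actual checkpointer.c code; in particular, exactly where the 1.5 
exponent is applied is my guess at the patch's intent, not a quote of the 
patch:

/*
 * Simplified model of the checkpoint pacing decision, for illustration
 * only -- this is not PostgreSQL source.  "progress" is the fraction of
 * dirty buffers written so far; the checkpointer throttles itself only
 * while it is on schedule against BOTH the WAL-based and the time-based
 * measures.
 */
#include <stdbool.h>
#include <stdio.h>
#include <math.h>

typedef struct
{
	double		elapsed_time_frac;	/* time since start / checkpoint_timeout */
	double		elapsed_xlog_frac;	/* WAL consumed since start / CheckPointSegments */
	double		completion_target;	/* checkpoint_completion_target */
} CkptModel;

/* Unpatched behaviour: compare raw progress against both measures. */
static bool
on_schedule_unpatched(double progress, const CkptModel *s)
{
	if (progress < s->elapsed_xlog_frac / s->completion_target)
		return false;			/* behind the WAL-based schedule */
	if (progress < s->elapsed_time_frac / s->completion_target)
		return false;			/* behind the time-based schedule */
	return true;
}

/*
 * Patched behaviour (assumed shape): distort only the WAL-based check so
 * that the early FPW-driven burst of WAL does not make the checkpointer
 * believe it is late.  The time-based check is untouched, which is why a
 * single exponent cannot fit the xlog-triggered and time-triggered cases
 * equally well.
 */
static bool
on_schedule_patched(double progress, const CkptModel *s, double exponent)
{
	double		boosted = pow(progress, 1.0 / exponent);	/* e.g. 1.5 */

	if (boosted < s->elapsed_xlog_frac / s->completion_target)
		return false;
	if (progress < s->elapsed_time_frac / s->completion_target)
		return false;
	return true;
}

int
main(void)
{
	/* WAL consumption running well ahead of buffer writes (FPW burst). */
	CkptModel	s = {0.30, 0.50, 0.9};
	double		progress = 0.5;

	printf("unpatched: %s\n",
		   on_schedule_unpatched(progress, &s) ? "on schedule" : "late");
	printf("patched:   %s\n",
		   on_schedule_patched(progress, &s, 1.5) ? "on schedule" : "late");
	return 0;
}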

So I think that the correction should not be the same in both cases. 
Moreover, no correction is needed if WAL & relations are on different 
disks. Also, as you pointed out, it depends on the load (for a large 
database the FPWs are spread more evenly, for smaller ones there is a 
spike), so the corrective formula should take that information into 
account, which means that some evaluation of the FPW distribution would 
have to be collected...

All this is non-trivial. I may do some math to try to solve this, but I'm 
pretty sure that a blanket 1.5 correction in all cases is not the solution.
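
For what it is worth, a back-of-the-envelope version of that math, building 
on the exp(-n/N) estimate above (my own rough model, not something worked 
out in the thread): with W page writes per checkpoint over a working set of 
N pages, c = W/N, r bytes of ordinary WAL per write and B bytes per 
full-page image,

   FPWs emitted after a fraction p of the writes:   F(p) ~= N * (1 - exp(-c*p))
   WAL consumed after a fraction p of the writes:   w(p) ~= p*W*r + B*F(p)

Under this model the xlog-based schedule should compare the buffer-write 
progress p against the inverse of w evaluated at the WAL consumed so far; 
since w depends on c (the working-set size) and on B/r, a single fixed 
distortion such as 1.5 can at best match one particular load.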

-- 
Fabien.


