Re: checkpointer continuous flushing

From: Tomas Vondra
Subject: Re: checkpointer continuous flushing
Date:
Msg-id: cf885de8-0473-f35d-791c-de06217614ad@2ndquadrant.com
In reply to: Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses: Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
List: pgsql-hackers
Hi,

On 03/17/2016 10:14 PM, Fabien COELHO wrote:
>
...
>>> I would have suggested using the --latency-limit option to filter out
>>> very slow queries, otherwise if the system is stuck it may catch up
>>> later, but then this is not representative of "sustainable" performance.
>>>
>>> When pgbench is running under a target rate, in both runs the
>>> transaction distribution is expected to be the same, around 5000 tps,
>>> and the green run looks pretty ok with respect to that. The magenta one
>>> shows that about 25% of the time, things are not good at all, and the
>>> higher figures just show the catching up, which is not really
>>> interesting if you asked for a web page and it is finally delivered a
>>> minute later.
>>
>> Maybe. But that'd only increase the stress on the system, possibly
>> causing more issues, no? And the magenta line is the old code, thus it
>> would only increase the improvement of the new code.
>
> Yes and no. I agree that it stresses the system a little more, but
> the fact that you have 5000 tps in the end does not show that you can
> really sustain 5000 tps with reasonable latency. I find this latter
> information more interesting than knowing that you can get 5000 tps
> on average, thanks to some catching up. Moreover the non-throttled
> runs already showed that the system could do 8000 tps, so the
> bandwidth is already there.

Sure, but thanks to the tps charts we *do know* that for the vast majority 
of the intervals (each second) the number of completed transactions is 
very close to 5000. And that wouldn't be possible if a large part of the 
latencies were close to the maximums.

With 5000 tps and 32 clients, the average latency has to stay below 
~6.4ms, otherwise the clients couldn't make ~156 tps each. But we do see 
that the maximum latency for most intervals is way higher. Only ~10% of 
the intervals have max latency below 10ms, for example.
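
Just to spell out the arithmetic behind that bound (a trivial sketch, 
assuming each client runs transactions back to back with no think time):

    # back-of-the-envelope check of the ~6.4ms bound above
    clients = 32
    target_tps = 5000
    per_client_tps = target_tps / clients          # ~156 tps per client
    max_avg_latency_ms = 1000.0 / per_client_tps   # ~6.4 ms per transaction
    print(per_client_tps, max_avg_latency_ms)      # 156.25 6.4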

>
>> Notice the max latency is in microseconds (as logged by pgbench),
>> so according to the "max latency" charts the latencies are below
>> 10 seconds (old) and 1 second (new) about 99% of the time.
>
> AFAICS, the max latency is aggregated by second, but then it does
> not say much about the distribution of individual latencies in the
> interval, that is whether they were all close to the max or not.
> Having the same chart with the median or average might help. Also, with
> the stddev chart, the percentages do not correspond with the latency
> one, so it may be that the latency is high but the stddev is low, i.e.
> all transactions are equally bad in the interval, or not.
>
> So I must admit that I'm not clear at all how to interpret the max
> latency & stddev charts you provided.

You're right that those charts describe the aggregated per-interval 
metrics, not the distribution of the individual latencies. And it's not 
particularly simple to deduce information about the underlying statistics 
from them, for example because all the intervals have the same "weight" 
even though the number of transactions that completed in each interval 
may differ.

But I do think it's a very useful tool when it comes to measuring the 
consistency of behavior over time, assuming you're asking questions 
about the intervals and not the original transactions.

For example, had there been intervals with vastly different transaction 
rates, we'd see that on the tps charts (i.e. the chart would be much 
more gradual or wobbly, just like the "unpatched" one). Or if there were 
intervals with much higher variance of latencies, we'd see that on the 
STDDEV chart.
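
To make that concrete, per-interval aggregates of this kind can be derived 
from a pgbench per-transaction log roughly like this (a sketch only, 
assuming the per-transaction log format of this pgbench version; it's not 
the exact script that produced the charts):

    # sketch: aggregate a pgbench per-transaction log into 1-second intervals
    # assumed line format: client_id xact_no latency_us script_no time_epoch time_us
    import sys
    from collections import defaultdict
    from statistics import mean, pstdev

    buckets = defaultdict(list)
    for line in sys.stdin:
        fields = line.split()
        latency_ms = int(fields[2]) / 1000.0   # latency logged in microseconds
        epoch_sec = int(fields[4])             # completion time, unix seconds
        buckets[epoch_sec].append(latency_ms)

    for sec in sorted(buckets):
        lat = buckets[sec]
        # one row per interval: tps, max, mean and stddev of latency (ms);
        # each interval gets the same "weight" no matter how many
        # transactions actually completed in it
        print(sec, len(lat), max(lat), mean(lat), pstdev(lat))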

I'll consider repeating the benchmark and logging some reasonable sample 
of transactions - for the 24h run, the unthrottled benchmark did ~670M 
transactions. Assuming ~30B per line, that's ~20GB, so a 5% sample should 
be ~1GB of data, which I think is enough.
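
pgbench can produce such a sample directly (-l combined with 
--sampling-rate), and the sizing really is just:

    # rough sizing of a sampled transaction log
    # (e.g. pgbench -l --sampling-rate=0.05 ... to log ~5% of transactions)
    total_xacts = 670e6          # ~670M transactions in the 24h unthrottled run
    bytes_per_line = 30          # rough average size of a log line
    sampling_rate = 0.05
    full_log_gb = total_xacts * bytes_per_line / 1e9   # ~20 GB
    sampled_gb = full_log_gb * sampling_rate           # ~1 GB
    print(full_log_gb, sampled_gb)                     # 20.1 1.005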

But of course, that's useful for answering questions about the 
distribution of the individual latencies globally, not about consistency 
over time.

>
>> So I don't think this would make any measurable difference in practice.
>
> I think that it may show that 25% of the time the system could not
> match the target tps, even if it can handle much more on average, so
> the tps achieved when discarding late transactions would be under
> 4000 tps.

You mean the 'throttled-tps' chart? Yes, that one shows that without the 
patches there are a lot of intervals where the tps was much lower - 
presumably due to a lot of slow transactions.
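
If the per-transaction sample materializes, the "tps with late 
transactions discarded" figure should be easy to approximate from it, 
e.g. (a sketch under the same log-format assumptions as above; it filters 
on latency only, ignoring the scheduling lag that --latency-limit also 
accounts for under throttling, and the 100ms cap is just an example):

    # per-second tps counting only transactions below a latency cap
    import sys
    from collections import Counter

    limit_ms = 100.0                            # arbitrary example cap
    per_sec = Counter()
    for line in sys.stdin:
        fields = line.split()
        if int(fields[2]) / 1000.0 <= limit_ms: # latency in microseconds
            per_sec[int(fields[4])] += 1        # bucket by completion second

    for sec in sorted(per_sec):
        print(sec, per_sec[sec])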

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


