Re: [HACKERS] Speedup twophase transactions

From: Stas Kelvich
Subject: Re: [HACKERS] Speedup twophase transactions
Date:
Msg-id: 06A44F22-B58D-4FF1-BEED-5447A63D2A11@postgrespro.ru
In reply to: Re: [HACKERS] Speedup twophase transactions  (Michael Paquier <michael.paquier@gmail.com>)
Responses: Re: [HACKERS] Speedup twophase transactions  (Nikhil Sontakke <nikhils@2ndquadrant.com>)
List: pgsql-hackers
> On 24 Jan 2017, at 09:42, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke
> <nikhils@2ndquadrant.com> wrote:
>> Speeding up recovery or failover activity via a faster promote is a
>> desirable thing. So, maybe, we should look at teaching the relevant
>> code about using "KnownPreparedList"? I know that would increase the
>> size of this patch and would mean more testing, but this seems to be
>> last remaining optimization in this code path.
>
> That's a good idea, worth having in this patch. Actually we may not
> want to call KnownPreparedRecreateFiles() here as promotion is not
> synonym of end-of-recovery checkpoint for a couple of releases now.

Thanks for the review, Nikhil and Michael.

I don't follow here. We move the data out of WAL into files at checkpoint because, after a checkpoint,
there is no guarantee that the WAL segment containing our prepared transaction will still be available.
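To make that concrete, here is a minimal sketch (not the actual patch code) of what flushing the in-memory
list of prepared transactions to pg_twophase state files at checkpoint could look like. KnownPreparedXact,
KnownPreparedList and ReadPrepareRecordFromWAL() are names I am assuming for illustration;
RecreateTwoPhaseFile() is the existing twophase.c helper that writes a state file.

    /* Sketch only: assumed shape of the patch's in-memory bookkeeping. */
    #include "postgres.h"
    #include "access/twophase.h"    /* RecreateTwoPhaseFile() */
    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "lib/ilist.h"          /* dlist_* */

    typedef struct KnownPreparedXact
    {
        TransactionId   xid;            /* prepared transaction id */
        XLogRecPtr      prepare_lsn;    /* LSN of its PREPARE record */
        dlist_node      list_node;
    } KnownPreparedXact;

    static dlist_head KnownPreparedList = DLIST_STATIC_INIT(KnownPreparedList);

    /* Hypothetical helper: re-read a PREPARE record from WAL into palloc'd memory. */
    extern char *ReadPrepareRecordFromWAL(XLogRecPtr lsn, int *len);

    /*
     * At checkpoint the WAL segment holding a PREPARE record may be recycled
     * afterwards, so every prepared transaction whose record lies before the
     * redo horizon gets dumped to a pg_twophase state file.
     */
    static void
    KnownPreparedRecreateFiles(XLogRecPtr redo_horizon)
    {
        dlist_iter  iter;

        dlist_foreach(iter, &KnownPreparedList)
        {
            KnownPreparedXact *xact =
                dlist_container(KnownPreparedXact, list_node, iter.cur);

            if (xact->prepare_lsn <= redo_horizon)
            {
                char   *buf;
                int     len;

                buf = ReadPrepareRecordFromWAL(xact->prepare_lsn, &len);
                RecreateTwoPhaseFile(xact->xid, buf, len);
                pfree(buf);
            }
        }
    }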

> The difference between those two is likely noise.
>
> By the way, in those measurements, the OS cache is still filled with
> the past WAL segments, which is a rather best case, no? What happens
> if you do the same kind of tests on a box where memory is busy doing
> something else and replayed WAL segments get evicted from the OS cache
> more aggressively once the startup process switches to a new segment?
> This could be tested for example on a VM with few memory (say 386MB or
> less) so as the startup process needs to access again the past WAL
> segments to recover the 2PC information it needs to get them back
> directly from disk... One trick that you could use here would be to
> tweak the startup process so as it drops the OS cache once a segment
> is finished replaying, and see the effects of an aggressive OS cache
> eviction. This patch is showing really nice improvements with the OS
> cache backing up the data, still it would make sense to test things
> with a worse test case and see if things could be done better. The
> startup process now only reads records sequentially, not randomly
> which is a concept that this patch introduces.
>
> Anyway, perhaps this does not matter much, the non-recovery code path
> does the same thing as this patch, and the improvement is too much to
> be ignored. So for consistency's sake we could go with the approach
> proposed which has the advantage to not put any restriction on the
> size of the 2PC file contrary to what an implementation saving the
> contents of the 2PC files into memory would need to do.

Maybe I'm missing something, but I don't see how the OS cache can affect anything here.

Total WAL size was 0x44 * 16 = 1088 MB (0x44 = 68 segments of 16 MB each), and recovery time is about 20s.
Sequentially reading 1 GB of data is an order of magnitude faster even on an old HDD, not to mention an SSD.
Also, you can take a look at the flame graphs attached to the previous message: the majority of recovery time
is spent in pg_qsort while replaying PageRepairFragmentation, while the whole xact_redo_commit() takes about
1% of the time. That share can grow with uncached disk reads, but given the total recovery time it should not
matter much.

If you are talking about uncached access only during checkpoint, then we are bounded by
max_prepared_transactions, so at most we will read about a hundred small files (each usually fitting into one
filesystem page), which will also be barely noticeable compared to the recovery time between checkpoints (see
the rough estimate below). Also, eviction of WAL segments from the cache during replay doesn't seem like a
standard scenario to me.
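As a back-of-the-envelope check (assuming on the order of 100 prepared transactions and one 4 KB filesystem
page per state file, both numbers being my assumptions):

    $100 \times 4\,\mathrm{KB} \approx 0.4\,\mathrm{MB}$

which is negligible next to the roughly 1 GB of WAL replayed in this test.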

Anyway, I took a machine with an HDD to slow down read speed and ran the tests again. During one of the runs I
launched in parallel a bash loop that dropped the OS cache every second (replaying one WAL segment also takes
about one second here).
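For reference, the cache-dropping side job was just a shell loop writing to /proc/sys/vm/drop_caches; a C
equivalent of what it does (my sketch, not the exact command used; Linux-only and needs root) looks like this:

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* Runs until interrupted; drop the OS cache once per second. */
        for (;;)
        {
            FILE   *f = fopen("/proc/sys/vm/drop_caches", "w");

            if (f != NULL)
            {
                fputs("3\n", f);    /* drop page cache, dentries and inodes */
                fclose(f);
            }
            sleep(1);               /* roughly the time to replay one WAL segment */
        }
    }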

1.5M transactions, start segment: 0x06, last segment: 0x47

patched, with constant cache_drop: total recovery time: 86s

patched, without constant cache_drop: total recovery time: 68s

(While the difference is significant, I bet it happens mostly because the database file segments have to be
re-read after each cache drop.)

master, without constant cache_drop: time to recover 35 segments: 2h 25m (after that I got tired of waiting);
expected total recovery time: 4.5 hours

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




