Re: Proposal of PITR performance improvement for 8.4.

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Proposal of PITR performance improvement for 8.4.
Дата
Msg-id 490703B6.9060203@enterprisedb.com
обсуждение исходный текст
Ответ на Re: Proposal of PITR performance improvement for 8.4.  (Gregory Stark <stark@enterprisedb.com>)
Ответы Re: Proposal of PITR performance improvement for 8.4.  ("Koichi Suzuki" <koichi.szk@gmail.com>)
Re: Proposal of PITR performance improvement for 8.4.  (Simon Riggs <simon@2ndQuadrant.com>)
Список pgsql-hackers
Gregory Stark wrote:
> "Koichi Suzuki" <koichi.szk@gmail.com> writes:
> 
>> This is my first proposal of PITR performance improvement for
>> PostgreSQL 8.4 development.   This proposal includes readahead
>> mechanism of data pages which will be read by redo() routines in the
>> recovery.   This is especially effective in the recovery without full
>> page write.   Readahead is done by posix_fadvise() as proposed in
>> index scan improvement.
> 
> Incidentally: a bit of background for anyone who wasn't around when last this
> came up: prefetching is especially for our recovery code because it's
> single-threaded. If you have a raid array you're effectively limited to using
> a single drive at a time. This is a major problem because the logs could have
> been written by many processes hammering the raid array concurrently. In other
> words your warm standby database might not be able to keep up with the logs
> from the master database even on identical (or even better) hardware.
> 
> Simon (I think?) proposed allowing our recovery code to be multi-threaded.
> Heikki suggested using prefetching.

I actually played around with the prefetching, and even wrote a quick 
prototype of it, about a year ago. It read ahead a fixed number of the 
WAL records in xlog.c, calling posix_fadvise() for all pages that were 
referenced in them. I never got around to finish it, as I wanted to see 
Greg's posix_fadvise() patch get done first and rely on the same 
infrastructure, but here's some lessons I learned:

1. You should avoid useless posix_fadvise() calls. In the naive 
implementation, where you simply call posix_fadvise() for every page 
referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls 
per WAL record, and that's a lot of overhead. We face the same design 
question as with Greg's patch to use posix_fadvise() to prefetch index 
and bitmap scans: what should the interface to the buffer manager look 
like? The simplest approach would be a new function call like 
AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the 
page if it's not in the buffer cache, but is a no-op otherwise. But that 
means more overhead, since for every page access, we need to find the 
page twice in the buffer cache; once for the AdviseBuffer() call, and 
2nd time for the actual ReadBuffer(). It would be more efficient to pin 
the buffer in the AdviseBuffer() call already, but that requires much 
more changes to the callers.

2. The format of each WAL record is different, so you need a "readahead 
handler" for every resource manager, for every record type. It would be 
a lot simpler if there was a standardized way to store that information 
in the WAL records.

3. IIRC I tried to handle just a few most important WAL records at 
first, but it turned out that you really need to handle all WAL records 
(that are used at all) before you see any benefit. Otherwise, every time 
you hit a WAL record that you haven't done posix_fadvise() on, the 
recovery "stalls", and you don't need much of those to diminish the gains.

Not sure how these apply to your approach, it's very different. You seem 
to handle 1. by collecting all the page references for the WAL file, and 
sorting and removing the duplicates. I wonder how much CPU time is spent 
on that?

>> Details of the implementation will be found in README file in the material.
> 
> I've read through this and I think I disagree with the idea of using a
> separate program. It's a lot of extra code -- and duplicated code from the
> normal recovery too.

Agreed, this belongs into core. The nice thing about a separate process 
is that you could hook it into recovery_command, with no changes to the 
server, but as you note in the README, we'd want to use this in crash 
recovery as well, and the interaction between the external command and 
the startup process seems overly complex for that. Besides, we want to 
use the posix_fadvise() stuff in the backend anyway, so we should use 
the same infrastructure during WAL replay as well.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Hannu Krosing
Дата:
Сообщение: Re: VACUUMs and WAL
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Window Functions: v07 APIs and buffering strateties