Discussion: Read-ahead and parallelism in redo recovery
I remember Heikki mentioning improving redo recovery in one of his emails in the past, so I know people are already thinking about this. I have some ideas and just wanted to get comments here.

ISTM that it's important to keep the redo recovery time as small as possible in order to reduce the downtime in case of unplanned maintenance. One way to do this is to take checkpoints very aggressively to keep the amount of redo work small. But the current checkpoint logic writes all the dirty buffers to disk and hence generates lots of IO, which limits our ability to take very frequent checkpoints.

The current redo recovery is a single-threaded, synchronous process. The XLOG is read sequentially, and each log record is examined and replayed if required. This requires reading disk blocks into shared buffers and applying changes to those buffers. The reads happen synchronously, and that usually makes the redo process very slow.

What I am thinking is that if we can read ahead these blocks into shared buffers and then apply the redo changes to them, it could potentially improve things a lot. If there are multiple read requests outstanding, the kernel (or the controller?) can probably schedule the reads more efficiently. One way to do this is to read ahead in the XLOG and make asynchronous read requests for these blocks, but I am not sure if we support asynchronous reads yet. Another (and maybe easier) way is to fork another process which just reads ahead in the XLOG and gets the blocks into memory while the other process does the normal redo recovery. One obvious downside of reading ahead is that we may need to jump backward and forward in the XLOG file, which is otherwise read sequentially; but that can be handled by using XLOG buffers for redo.

Btw, isn't our redo recovery completely physical in nature? I mean, can we replay redo logs related to a block independently of other blocks? The reason I am asking is that if that's the case, ISTM we can introduce parallelism in recovery by splitting and reordering the xlog records and then running multiple processes to do the redo recovery.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
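A rough sketch of the kind of read-ahead hint being proposed, assuming posix_fadvise(POSIX_FADV_WILLNEED) is available and that the block numbers touched by upcoming XLOG records have already been extracted; the function and variable names here are illustrative, not actual backend code:

    #include <fcntl.h>

    #define BLCKSZ 8192

    static void
    prefetch_blocks(int datafile_fd, const unsigned *blknos, int nblocks)
    {
        int     i;

        for (i = 0; i < nblocks; i++)
        {
            off_t   offset = (off_t) blknos[i] * BLCKSZ;

            /* Non-blocking hint: tells the kernel we will soon read this range. */
            (void) posix_fadvise(datafile_fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
        }
    }

The later synchronous reads of those same blocks would then hopefully hit pages the kernel has already fetched.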
* Pavan Deolasee:

> The current redo recovery is a single-threaded, synchronous process.
> The XLOG is read sequentially, and each log record is examined and
> replayed if required. This requires reading disk blocks into shared
> buffers and applying changes to those buffers. The reads happen
> synchronously, and that usually makes the redo process very slow.

Are you sure that it's actually slow for that reason? Sequential I/O on the log is typically quite fast, and if the pages dirtied since the last checkpoint fit into the cache (shared buffers or OS cache), even that part of recovery does not result in lots of random I/O (with 8.3 and full-page writes active; this is a relatively recent change).

In the end, I wouldn't be surprised if for most loads, cache warming effects dominated recovery times, at least when the machine is not starved for RAM.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH        http://www.bfk.de/
Kriegsstraße 100               tel: +49-721-96201-1
D-76133 Karlsruhe              fax: +49-721-96201-99
Pavan Deolasee wrote:

> What I am thinking is that if we can read ahead these blocks into
> shared buffers and then apply the redo changes to them, it could
> potentially improve things a lot. If there are multiple read requests
> outstanding, the kernel (or the controller?) can probably schedule the
> reads more efficiently.

The same holds true for index scans, though. Maybe we can find a solution that benefits both cases - something along the lines of a bgreader process.

> Btw, isn't our redo recovery completely physical in nature? I mean,
> can we replay redo logs related to a block independently of other
> blocks? The reason I am asking is that if that's the case, ISTM we
> can introduce parallelism in recovery by splitting and reordering the
> xlog records and then running multiple processes to do the redo
> recovery.

I'd say it's "physical" on the tuple level (we just log the new tuple on an update, not how to calculate it from the old one), but "logical" on the page level (we log the fact that a tuple was inserted on a page, but e.g. the physical location of the tuple on the page can come out differently upon replay). It's even "more logical" for indexes, because we log page splits as multiple WAL records, letting the recovery process deal with synthesizing upper-level updates should we crash in the middle of a page split.

Additionally, we log full-page images as a safeguard against torn page writes. Those would need to be treated as a kind of "reorder barrier" in any parallel restore scenario, I guess.

I know that Simon has some ideas about parallel restore, though I don't know how he wants to solve the dependency issues involved. Perhaps by not parallelizing within one table or index...
"Florian G. Pflug" <fgp@phlo.org> writes: > I know that Simon has some ideas about parallel restored, though I don't > know how he wants to solve the dependency issues involved. Perhaps by > not parallelizing withon one table or index... I think we should be *extremely* cautious about introducing any sort of parallelism or other hard-to-test behavior into xlog recovery. Bugs in that area will by definition bite people at the worst possible time. And we already know that we don't have very good testing ability for xlog recovery, because some pretty nasty bugs have gone undetected for long periods. regards, tom lane
On Fri, 2008-02-29 at 15:49 +0100, Florian G. Pflug wrote:

> I know that Simon has some ideas about parallel restore, though I don't
> know how he wants to solve the dependency issues involved. Perhaps by
> not parallelizing within one table or index...

Well, I think that problem is secondary to making progress with your work on hot standby. I don't want to tune the existing setup and then make it harder to introduce new features.

I'm aiming to review your patches in this commit fest, with a view to getting the work fully committed 4-6 months from now, assuming you're happy to make any changes we identify. That still leaves us time to tune things before the next release.

The hope is to increase the level of functionality here. We may not be able to move forward in just one more stride. Warm Standby has taken the last 4 releases to mature to where we are now, and the work ahead is at least as difficult as what has gone before.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:

> In the end, I wouldn't be surprised if for most loads, cache warming
> effects dominated recovery times, at least when the machine is not
> starved for RAM.

Uh... that's exactly what all the synchronous reads are doing: warming the cache. And synchronous reads are only fast if the system understands what's going on and reads a good chunk of data in at once. I don't know that that happens.

Perhaps a good short-term measure would be to have recovery allocate a 16 MB buffer and read in entire xlog files at once.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
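A minimal sketch of what that short-term measure might look like, assuming a standard 16 MB WAL segment and plain POSIX I/O; the function name and the very reduced error handling are illustrative only:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define XLOG_SEG_SIZE (16 * 1024 * 1024)

    static char *
    read_whole_segment(const char *segpath)
    {
        int     fd = open(segpath, O_RDONLY);
        char   *buf;
        ssize_t total = 0;

        if (fd < 0)
            return NULL;
        buf = malloc(XLOG_SEG_SIZE);
        if (buf == NULL)
        {
            close(fd);
            return NULL;
        }
        /* Slurp the whole segment with a few large reads instead of many
         * small ones; loop because read() may return short counts. */
        while (total < XLOG_SEG_SIZE)
        {
            ssize_t n = read(fd, buf + total, XLOG_SEG_SIZE - total);

            if (n <= 0)
                break;          /* EOF or error; caller can check how much arrived */
            total += n;
        }
        close(fd);
        return buf;
    }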
Decibel! wrote:

> On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:
>> In the end, I wouldn't be surprised if for most loads, cache warming
>> effects dominated recovery times, at least when the machine is not
>> starved for RAM.
>
> Uh... that's exactly what all the synchronous reads are doing: warming
> the cache. And synchronous reads are only fast if the system understands
> what's going on and reads a good chunk of data in at once. I don't know
> that that happens.
>
> Perhaps a good short-term measure would be to have recovery allocate a
> 16 MB buffer and read in entire xlog files at once.

The problem isn't reading the WAL. The OS prefetches that just fine. The problem is the random reads, when we read in the blocks mentioned in the WAL records to replay the changes to them. The OS has no way of guessing and prefetching those blocks, and we read them synchronously, one block at a time, no matter how big your RAID array is.

I used to think it's a big problem, but I believe the full-page-write optimization in 8.3 made it much less so. Especially with the smoothed checkpoints: as checkpoints have less impact on response times, you can shorten the checkpoint interval, which helps to keep the recovery time reasonable.

It'd still be nice to do the prefetching; I'm sure there are still workloads where it would be a big benefit. But as Tom pointed out, we shouldn't invent something new just for recovery. I think we should look at doing prefetching for index accesses etc. first, and once we have the infrastructure in place and tested, we can consider using it for recovery as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Decibel! <decibel@decibel.org> writes:

> Perhaps a good short-term measure would be to have recovery allocate
> a 16 MB buffer and read in entire xlog files at once.

If that isn't entirely useless, you need a better kernel. The system should *certainly* be bright enough to do read-ahead for our reads of the source xlog file. The fetches that are likely to be problematic are the ones for pages in the data area, which will be a lot less regular for typical workloads.

regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [080229 15:49]:

> If that isn't entirely useless, you need a better kernel. The system
> should *certainly* be bright enough to do read-ahead for our reads of
> the source xlog file. The fetches that are likely to be problematic are
> the ones for pages in the data area, which will be a lot less regular
> for typical workloads.

How difficult is it to parse the WAL logs with enough knowledge to know what heap page (file/offset) a WAL record touches (I haven't looked into any WAL code)? There are "compression/decompression" archive_command/restore_command programs with rudimentary knowledge of the WAL record formats.

Could a restore_command parse the WAL records as it copies them over, noting which file pages need to be read, and then, just before it exits, fork() and read each page in order? This child doesn't need to do anything with the blocks it reads - it just needs to read them to "pre-warm" the kernel buffer cache... If the restoration is doing any writing, this dumb reader would hopefully be able to keep a block ahead... And since it's separated enough from the backend, any experiments in async I/O or fadvise could easily be done.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
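A sketch of such a "dumb reader" child, assuming the restore_command has already collected the list of referenced pages somehow; the page_ref structure and relation paths are placeholders, and the child merely reads and discards each block to warm the kernel cache:

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    struct page_ref
    {
        const char *relpath;        /* path to the relation's data file */
        unsigned    blkno;          /* block number within that file */
    };

    static void
    prewarm_pages(const struct page_ref *refs, int nrefs)
    {
        pid_t   pid = fork();
        int     i;

        if (pid != 0)
            return;                 /* parent (or failed fork): just carry on */

        for (i = 0; i < nrefs; i++)
        {
            char    buf[BLCKSZ];
            int     fd = open(refs[i].relpath, O_RDONLY);

            if (fd < 0)
                continue;
            /* Read and discard: the point is only to populate the OS cache. */
            (void) pread(fd, buf, BLCKSZ, (off_t) refs[i].blkno * BLCKSZ);
            close(fd);
        }
        _exit(0);
    }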
Florian G. Pflug wrote:

> The same holds true for index scans, though. Maybe we can find a
> solution that benefits both cases - something along the lines of a
> bgreader process.

I posted a patch to do readahead for bitmap index scans using posix_fadvise. Experiments showed it works great on RAID arrays on Linux. Solaris will need to use libaio, though, which I haven't tried yet.

Incidentally, I'm off on a ski vacation this week so I won't be around much on email for the first half of the commit fest. If anyone's putting together a list of patches queued up for review, I would like this patch considered. The main feedback I'm blocked on is whether others are interested in restructuring the buffer manager to allow buffers to be allocated and pinned with only the posix_fadvise I/O initiated. That would avoid the redundant trip into the buffer manager for the usual case, at the expense of a few buffers being taken out of the cache. The patch I posted is the minimally invasive approach of not altering the buffer management at all and just passing through a readahead request.

I looked at doing it for normal index scans and couldn't think of any convenient way. I can see getting a single buffer of read-ahead from the index block's next pointer, but that's about it. Luckily it seems to me that bitmap index scans are much more likely to be chosen in the cases where there's a big gain anyway.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Greg Stark wrote:

> Florian G. Pflug wrote:
>> The same holds true for index scans, though. Maybe we can find a
>> solution that benefits both cases - something along the lines of a
>> bgreader process.
>
> I posted a patch to do readahead for bitmap index scans using
> posix_fadvise. Experiments showed it works great on RAID arrays on
> Linux. Solaris will need to use libaio, though, which I haven't tried
> yet.

Cool! I'd like to try it out - is that patch available in the pg-patches archives?

> Doing it for normal index scans is much much harder. You can
> readahead a single page by using the next pointer if it looks like
> you'll need it. But I don't see a convenient way to get more than
> that.

I was thinking that after reading a page from the index, the backend could post a list of heap pages referenced from that index page to shmem. A background process would repeatedly scan that list and load those pages into the buffer cache.

regards, Florian Pflug
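A toy sketch of that hand-off, assuming a fixed-size request ring in shared memory; locking, sizing, and the background reader's side are left out, and all names here are hypothetical rather than existing backend structures:

    #define PREFETCH_QUEUE_SIZE 1024

    typedef struct PrefetchRequest
    {
        unsigned    relfilenode;    /* which relation file */
        unsigned    blkno;          /* which block within it */
    } PrefetchRequest;

    typedef struct PrefetchQueue
    {
        unsigned        head;       /* next slot the background reader consumes */
        unsigned        tail;       /* next slot a backend fills */
        PrefetchRequest reqs[PREFETCH_QUEUE_SIZE];
    } PrefetchQueue;                /* would live in shared memory */

    /* Backend side: drop a request, silently skipping if the ring is full. */
    static void
    prefetch_enqueue(PrefetchQueue *q, unsigned relfilenode, unsigned blkno)
    {
        unsigned    next = (q->tail + 1) % PREFETCH_QUEUE_SIZE;

        if (next == q->head)
            return;                 /* full: prefetching is only a hint anyway */
        q->reqs[q->tail].relfilenode = relfilenode;
        q->reqs[q->tail].blkno = blkno;
        q->tail = next;
    }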
On Fri, Feb 29, 2008 at 8:19 PM, Florian G. Pflug <fgp@phlo.org> wrote:

> Pavan Deolasee wrote:
> > What I am thinking is that if we can read ahead these blocks into
> > shared buffers and then apply the redo changes to them, it could
> > potentially improve things a lot. If there are multiple read requests
> > outstanding, the kernel (or the controller?) can probably schedule the
> > reads more efficiently.
>
> The same holds true for index scans, though. Maybe we can find a
> solution that benefits both cases - something along the lines of a
> bgreader process.

I agree. Something like a bgreader process would make good sense as a general solution. ISTM that this would be a first and easy step towards making recovery faster, without too much complexity in the recovery code path.

> > Btw, isn't our redo recovery completely physical in nature? I mean,
> > can we replay redo logs related to a block independently of other
> > blocks? The reason I am asking is that if that's the case, ISTM we
> > can introduce parallelism in recovery by splitting and reordering the
> > xlog records and then running multiple processes to do the redo
> > recovery.
>
> I'd say it's "physical" on the tuple level (we just log the new tuple on an
> update, not how to calculate it from the old one), but "logical" on the
> page level (we log the fact that a tuple was inserted on a page, but
> e.g. the physical location of the tuple on the page can come out
> differently upon replay).

I think it would be OK if the recovery is logical at the page level. As long as we can apply redo logs in order for a given page, but out of order with respect to some other page, there is great scope for introducing parallelism. Though I would agree with Tom that we need to be extremely cautious before we do anything like this: I remember Heikki caught a few bugs in HOT redo recovery while code reviewing which had escaped the manual crash recovery testing I did, proving Tom's point that it's hard to catch such bugs.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
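As an illustration of that ordering constraint, one simple way to keep per-page ordering while allowing cross-page parallelism would be to route every record for a given page to the same worker, e.g. by hashing the page identity; full-page images and multi-page records would still need to act as barriers, as Florian noted. The names and hash below are purely illustrative:

    #define N_REDO_WORKERS 4

    static int
    redo_worker_for_block(unsigned relfilenode, unsigned blkno)
    {
        /*
         * Any deterministic hash will do: the same page always maps to the
         * same worker, so its records are replayed there in WAL order,
         * while records for other pages can proceed in parallel.
         */
        return (int) ((relfilenode * 2654435761u + blkno) % N_REDO_WORKERS);
    }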
On Sat, Mar 1, 2008 at 2:13 AM, Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> I used to think it's a big problem, but I believe the full-page-write
> optimization in 8.3 made it much less so. Especially with the smoothed
> checkpoints: as checkpoints have less impact on response times, you can
> shorten the checkpoint interval, which helps to keep the recovery time
> reasonable.

I agree that smoothed checkpoints have considerably reduced the response time spikes we used to see in TPC-C tests. What I still don't like about the current checkpoint mechanism is that it writes all the dirty buffers to disk. With very large shared buffers, this could still be a problem.

Someday we may want to implement LAZY checkpoints which do not require writing dirty pages and hence can be taken much more frequently. But lazy checkpoints can increase the amount of redo work to be done at recovery time. If we can significantly improve the recovery logic, we can then think of reducing the work done at checkpoint time (either through lazy checkpoints or less frequent hard checkpoints), which would benefit normal database operation.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Aidan Van Dyk wrote:

> How difficult is it to parse the WAL logs with enough knowledge to know
> what heap page (file/offset) a WAL record touches (I haven't looked
> into any WAL code)?

Unfortunately there's no common format for that. All the heap-related WAL records - insert, update and delete - have a RelFileNode+ItemPointerData at the beginning of the WAL payload, but update records have another ItemPointerData for the tid of the new tuple in addition to that. And all the indexam WAL records use formats of their own.

It would be nice to refactor that so that there was a common format to store the file+block number touched by a WAL record, like we have for full-page images. That would be useful for all kinds of external tools that parse WAL files, like the read-ahead restore_command you envisioned.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
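Purely as a sketch of what such a common format might look like (this is not the actual 8.3 record layout, and the struct name is hypothetical), a fixed header carrying the affected page could be as simple as:

    typedef struct XLogPageRef
    {
        unsigned    spcNode;        /* tablespace */
        unsigned    dbNode;         /* database */
        unsigned    relNode;        /* relation's file node */
        unsigned    blkno;          /* block number within the relation */
    } XLogPageRef;

With something like this at a known offset in every record, an external tool (or a read-ahead restore_command) could find the affected page without knowing each resource manager's private record layout.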
I have added the following TODO:

* Speed WAL recovery by allowing more than one page to be prefetched

  This involves having a separate process that can be told which pages the
  recovery process will need in the near future.

  http://archives.postgresql.org/pgsql-hackers/2008-02/msg01279.php

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Aidan Van Dyk wrote:
> > How difficult is it to parse the WAL logs with enough knowledge to know
> > what heap page (file/offset) a WAL record touches (I haven't looked
> > into any WAL code)?
>
> Unfortunately there's no common format for that. All the heap-related
> WAL records - insert, update and delete - have a
> RelFileNode+ItemPointerData at the beginning of the WAL payload, but
> update records have another ItemPointerData for the tid of the new tuple
> in addition to that. And all the indexam WAL records use formats of their own.
>
> It would be nice to refactor that so that there was a common format to
> store the file+block number touched by a WAL record, like we have for
> full-page images. That would be useful for all kinds of external tools
> that parse WAL files, like the read-ahead restore_command you envisioned.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Florian G. Pflug wrote:

> Greg Stark wrote:
> > Florian G. Pflug wrote:
> >> The same holds true for index scans, though. Maybe we can find a
> >> solution that benefits both cases - something along the lines of a
> >> bgreader process.
> >
> > I posted a patch to do readahead for bitmap index scans using
> > posix_fadvise. Experiments showed it works great on RAID arrays on
> > Linux. Solaris will need to use libaio, though, which I haven't tried
> > yet.
>
> Cool! I'd like to try it out - is that patch available in the pg-patches
> archives?
>
> > Doing it for normal index scans is much much harder. You can
> > readahead a single page by using the next pointer if it looks like
> > you'll need it. But I don't see a convenient way to get more than
> > that.
>
> I was thinking that after reading a page from the index, the backend
> could post a list of heap pages referenced from that index page to
> shmem. A background process would repeatedly scan that list and load
> those pages into the buffer cache.

Agreed. Lots of databases do the index/heap readahead via threads --- I think we will probably use a separate read-ahead process that knows more about all the concurrent reads and the tablespaces involved.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +