Discussion: Read-ahead and parallelism in redo recovery
I remember Heikki mentioning improving redo recovery in one of his emails in the past, so I know people are already thinking about this. I have some ideas and just wanted to get comments here.

ISTM that it's important to keep the redo recovery time as small as possible in order to reduce the downtime in case of unplanned maintenance. One way to do this is to take checkpoints very aggressively to keep the amount of redo work small. But the current checkpoint logic writes all the dirty buffers to disk and hence generates lots of IO, which limits our ability to take very frequent checkpoints.

The current redo recovery is a single-threaded, synchronous process. The XLOG is read sequentially, and each log record is examined and replayed if required. This requires reading disk blocks into shared buffers and applying changes to those buffers. The reads happen synchronously, and that usually makes the redo process very slow.

What I am thinking is that if we can read ahead these blocks into shared buffers and then apply the redo changes to them, it could potentially improve things a lot. If there are multiple read requests outstanding, the kernel (or the controller?) can probably schedule the reads more efficiently. One way to do this is to read ahead in the XLOG and make asynchronous read requests for these blocks, but I am not sure if we support asynchronous reads yet. Another (and maybe easier) way is to fork another process which just reads ahead in the XLOG and gets the blocks into memory while the other process does the normal redo recovery. One obvious downside of reading ahead is that we may need to jump backward and forward in the XLOG file, which is otherwise read sequentially; but that can be handled by using XLOG buffers for redo.

Btw, isn't our redo recovery completely physical in nature? I mean, can we replay redo logs related to a block independently of other blocks? The reason I am asking is that if that's the case, ISTM we can introduce parallelism in recovery by splitting and reordering the xlog records and then running multiple processes to do the redo recovery.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
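A rough sketch of the kind of read-ahead hint being proposed, assuming posix_fadvise(POSIX_FADV_WILLNEED) is available and that the block numbers touched by upcoming XLOG records have already been extracted; the function and variable names here are illustrative, not actual backend code:

    #include <fcntl.h>

    #define BLCKSZ 8192

    static void
    prefetch_blocks(int datafile_fd, const unsigned *blknos, int nblocks)
    {
        int     i;

        for (i = 0; i < nblocks; i++)
        {
            off_t   offset = (off_t) blknos[i] * BLCKSZ;

            /* Non-blocking hint: tells the kernel we will soon read this range. */
            (void) posix_fadvise(datafile_fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
        }
    }

The later synchronous reads of those same blocks would then hopefully hit pages the kernel has already fetched.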
* Pavan Deolasee:

> The current redo recovery is a single-threaded, synchronous process.
> The XLOG is read sequentially, and each log record is examined and
> replayed if required. This requires reading disk blocks into shared
> buffers and applying changes to those buffers. The reads happen
> synchronously, and that usually makes the redo process very slow.

Are you sure that it's actually slow for that reason? Sequential I/O on the log is typically quite fast, and if the pages dirtied since the last checkpoint fit into the cache (shared buffers or OS cache), even that part of recovery does not result in lots of random I/O (with 8.3 and full-page writes active; this is a relatively recent change).

In the end, I wouldn't be surprised if for most loads, cache warming effects dominated recovery times, at least when the machine is not starved for RAM.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH        http://www.bfk.de/
Kriegsstraße 100               tel: +49-721-96201-1
D-76133 Karlsruhe              fax: +49-721-96201-99
Pavan Deolasee wrote:

> What I am thinking is that if we can read ahead these blocks into
> shared buffers and then apply the redo changes to them, it could
> potentially improve things a lot. If there are multiple read requests
> outstanding, the kernel (or the controller?) can probably schedule the
> reads more efficiently.

The same holds true for index scans, though. Maybe we can find a solution that benefits both cases - something along the lines of a bgreader process.

> Btw, isn't our redo recovery completely physical in nature? I mean,
> can we replay redo logs related to a block independently of other
> blocks? The reason I am asking is that if that's the case, ISTM we
> can introduce parallelism in recovery by splitting and reordering the
> xlog records and then running multiple processes to do the redo
> recovery.

I'd say it's "physical" on the tuple level (we just log the new tuple on an update, not how to calculate it from the old one), but "logical" on the page level (we log the fact that a tuple was inserted on a page, but e.g. the physical location of the tuple on the page can come out differently upon replay). It's even "more logical" for indexes, because we log page splits as multiple WAL records, letting the recovery process deal with synthesizing upper-level updates should we crash in the middle of a page split.

Additionally, we log full-page images as a safeguard against torn page writes. Those would need to be treated as a kind of "reorder barrier" in any parallel restore scenario, I guess.

I know that Simon has some ideas about parallel restore, though I don't know how he wants to solve the dependency issues involved. Perhaps by not parallelizing within one table or index...
"Florian G. Pflug" <fgp@phlo.org> writes: > I know that Simon has some ideas about parallel restored, though I don't > know how he wants to solve the dependency issues involved. Perhaps by > not parallelizing withon one table or index... I think we should be *extremely* cautious about introducing any sort of parallelism or other hard-to-test behavior into xlog recovery. Bugs in that area will by definition bite people at the worst possible time. And we already know that we don't have very good testing ability for xlog recovery, because some pretty nasty bugs have gone undetected for long periods. regards, tom lane
On Fri, 2008-02-29 at 15:49 +0100, Florian G. Pflug wrote:

> I know that Simon has some ideas about parallel restore, though I don't
> know how he wants to solve the dependency issues involved. Perhaps by
> not parallelizing within one table or index...

Well, I think that problem is secondary to making progress with your work on hot standby. I don't want to tune the existing setup and then make it harder to introduce new features.

I'm aiming to review your patches in this commit fest, with a view to getting the work fully committed 4-6 months from now, assuming you're happy to make any changes we identify. That still leaves us time to tune things before the next release.

The hope is to increase the level of functionality here. We may not be able to move forward in just one more stride. Warm Standby has taken the last 4 releases to mature to where we are now, and the work ahead is at least as difficult as what has gone before.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:

> In the end, I wouldn't be surprised if for most loads, cache warming
> effects dominated recovery times, at least when the machine is not
> starved for RAM.

Uh... that's exactly what all the synchronous reads are doing: warming the cache. And synchronous reads are only fast if the system understands what's going on and reads a good chunk of data in at once. I don't know that that happens.

Perhaps a good short-term measure would be to have recovery allocate a 16 MB buffer and read in entire xlog files at once.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
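A minimal sketch of what that short-term measure might look like, assuming a standard 16 MB WAL segment and plain POSIX I/O; the function name and the very reduced error handling are illustrative only:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define XLOG_SEG_SIZE (16 * 1024 * 1024)

    static char *
    read_whole_segment(const char *segpath)
    {
        int     fd = open(segpath, O_RDONLY);
        char   *buf;
        ssize_t total = 0;

        if (fd < 0)
            return NULL;
        buf = malloc(XLOG_SEG_SIZE);
        if (buf == NULL)
        {
            close(fd);
            return NULL;
        }
        /* Slurp the whole segment with a few large reads instead of many
         * small ones; loop because read() may return short counts. */
        while (total < XLOG_SEG_SIZE)
        {
            ssize_t n = read(fd, buf + total, XLOG_SEG_SIZE - total);

            if (n <= 0)
                break;          /* EOF or error; caller can check how much arrived */
            total += n;
        }
        close(fd);
        return buf;
    }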
Decibel! wrote:

> On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:
>> In the end, I wouldn't be surprised if for most loads, cache warming
>> effects dominated recovery times, at least when the machine is not
>> starved for RAM.
>
> Uh... that's exactly what all the synchronous reads are doing: warming
> the cache. And synchronous reads are only fast if the system understands
> what's going on and reads a good chunk of data in at once. I don't know
> that that happens.
>
> Perhaps a good short-term measure would be to have recovery allocate a
> 16 MB buffer and read in entire xlog files at once.

The problem isn't reading the WAL. The OS prefetches that just fine. The problem is the random reads, when we read in the blocks mentioned in the WAL records to replay the changes to them. The OS has no way of guessing and prefetching those blocks, and we read them synchronously, one block at a time, no matter how big your RAID array is.

I used to think it's a big problem, but I believe the full-page-write optimization in 8.3 made it much less so. Especially with the smoothed checkpoints: as checkpoints have less impact on response times, you can shorten the checkpoint interval, which helps to keep the recovery time reasonable.

It'd still be nice to do the prefetching; I'm sure there are still workloads where it would be a big benefit. But as Tom pointed out, we shouldn't invent something new just for recovery. I think we should look at doing prefetching for index accesses etc. first, and once we have the infrastructure in place and tested, we can consider using it for recovery as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Decibel! <decibel@decibel.org> writes:

> Perhaps a good short-term measure would be to have recovery allocate
> a 16 MB buffer and read in entire xlog files at once.

If that isn't entirely useless, you need a better kernel. The system should *certainly* be bright enough to do read-ahead for our reads of the source xlog file. The fetches that are likely to be problematic are the ones for pages in the data area, which will be a lot less regular for typical workloads.

regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [080229 15:49]:

> If that isn't entirely useless, you need a better kernel. The system
> should *certainly* be bright enough to do read-ahead for our reads of
> the source xlog file. The fetches that are likely to be problematic are
> the ones for pages in the data area, which will be a lot less regular
> for typical workloads.

How difficult is it to parse the WAL logs with enough knowledge to know what heap page (file/offset) a WAL record touches (I haven't looked into any WAL code)? There are "compression/decompression" archive_command/restore_command programs with rudimentary knowledge of the WAL record formats.

Could a restore_command parse the WAL records as it copies them over, noting which file pages need to be read, and then, just before it exits, fork() and read each page in order? This child doesn't need to do anything with the blocks it reads - it just needs to read them to "pre-warm" the kernel buffer cache... If the restoration is doing any writing, this dumb reader would hopefully be able to keep a block ahead... And since it's separated enough from the backend, any experiments in async I/O or fadvise could easily be done.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
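A sketch of such a "dumb reader" child, assuming the restore_command has already collected the list of referenced pages somehow; the page_ref structure and relation paths are placeholders, and the child merely reads and discards each block to warm the kernel cache:

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    struct page_ref
    {
        const char *relpath;        /* path to the relation's data file */
        unsigned    blkno;          /* block number within that file */
    };

    static void
    prewarm_pages(const struct page_ref *refs, int nrefs)
    {
        pid_t   pid = fork();
        int     i;

        if (pid != 0)
            return;                 /* parent (or failed fork): just carry on */

        for (i = 0; i < nrefs; i++)
        {
            char    buf[BLCKSZ];
            int     fd = open(refs[i].relpath, O_RDONLY);

            if (fd < 0)
                continue;
            /* Read and discard: the point is only to populate the OS cache. */
            (void) pread(fd, buf, BLCKSZ, (off_t) refs[i].blkno * BLCKSZ);
            close(fd);
        }
        _exit(0);
    }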
Florian G. Pflug wrote:

> The same holds true for index scans, though. Maybe we can find a
> solution that benefits both cases - something along the lines of a
> bgreader process.

I posted a patch to do readahead for bitmap index scans using posix_fadvise. Experiments showed it works great on RAID arrays on Linux. Solaris will need to use libaio, though, which I haven't tried yet.

Incidentally, I'm off on a ski vacation this week so I won't be around much on email for the first half of the commit fest. If anyone's putting together a list of patches queued up for review, I would like this patch considered. The main feedback I'm blocked on is whether others are interested in restructuring the buffer manager to allow buffers to be allocated and pinned with only the posix_fadvise I/O initiated. That would avoid the redundant trip into the buffer manager for the usual case, at the expense of a few buffers being taken out of the cache. The patch I posted is the minimally invasive approach of not altering the buffer management at all and just passing through a readahead request.

I looked at doing it for normal index scans and couldn't think of any convenient way. I can see getting a single buffer of read-ahead from the index block's next pointer, but that's about it. Luckily it seems to me that bitmap index scans are much more likely to be chosen in the cases where there's a big gain anyway.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Greg Stark wrote:

> Florian G. Pflug wrote:
>> The same holds true for index scans, though. Maybe we can find a
>> solution that benefits both cases - something along the lines of a
>> bgreader process.
>
> I posted a patch to do readahead for bitmap index scans using
> posix_fadvise. Experiments showed it works great on RAID arrays on
> Linux. Solaris will need to use libaio, though, which I haven't tried
> yet.

Cool! I'd like to try it out - is that patch available in the pg-patches archives?

> Doing it for normal index scans is much much harder. You can
> readahead a single page by using the next pointer if it looks like
> you'll need it. But I don't see a convenient way to get more than
> that.

I was thinking that after reading a page from the index, the backend could post a list of heap pages referenced from that index page to shmem. A background process would repeatedly scan that list and load those pages into the buffer cache.

regards, Florian Pflug
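A toy sketch of that hand-off, assuming a fixed-size request ring in shared memory; locking, sizing, and the background reader's side are left out, and all names here are hypothetical rather than existing backend structures:

    #define PREFETCH_QUEUE_SIZE 1024

    typedef struct PrefetchRequest
    {
        unsigned    relfilenode;    /* which relation file */
        unsigned    blkno;          /* which block within it */
    } PrefetchRequest;

    typedef struct PrefetchQueue
    {
        unsigned        head;       /* next slot the background reader consumes */
        unsigned        tail;       /* next slot a backend fills */
        PrefetchRequest reqs[PREFETCH_QUEUE_SIZE];
    } PrefetchQueue;                /* would live in shared memory */

    /* Backend side: drop a request, silently skipping if the ring is full. */
    static void
    prefetch_enqueue(PrefetchQueue *q, unsigned relfilenode, unsigned blkno)
    {
        unsigned    next = (q->tail + 1) % PREFETCH_QUEUE_SIZE;

        if (next == q->head)
            return;                 /* full: prefetching is only a hint anyway */
        q->reqs[q->tail].relfilenode = relfilenode;
        q->reqs[q->tail].blkno = blkno;
        q->tail = next;
    }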
On Fri, Feb 29, 2008 at 8:19 PM, Florian G. Pflug <fgp@phlo.org> wrote:

> Pavan Deolasee wrote:
> > What I am thinking is that if we can read ahead these blocks into
> > shared buffers and then apply the redo changes to them, it could
> > potentially improve things a lot. If there are multiple read requests
> > outstanding, the kernel (or the controller?) can probably schedule the
> > reads more efficiently.
>
> The same holds true for index scans, though. Maybe we can find a
> solution that benefits both cases - something along the lines of a
> bgreader process.

I agree. Something like a bgreader process would make good sense as a general solution. ISTM that this would be a first and easy step towards making recovery faster, without too much complexity in the recovery code path.

> > Btw, isn't our redo recovery completely physical in nature? I mean,
> > can we replay redo logs related to a block independently of other
> > blocks? The reason I am asking is that if that's the case, ISTM we
> > can introduce parallelism in recovery by splitting and reordering the
> > xlog records and then running multiple processes to do the redo
> > recovery.
>
> I'd say it's "physical" on the tuple level (we just log the new tuple on an
> update, not how to calculate it from the old one), but "logical" on the
> page level (we log the fact that a tuple was inserted on a page, but
> e.g. the physical location of the tuple on the page can come out
> differently upon replay).

I think it would be OK if the recovery is logical at the page level. As long as we can apply redo logs in order for a given page, but out of order with respect to some other page, there is great scope for introducing parallelism. Though I would agree with Tom that we need to be extremely cautious before we do anything like this: I remember Heikki caught a few bugs in HOT redo recovery while code reviewing which had escaped the manual crash recovery testing I did, proving Tom's point that it's hard to catch such bugs.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
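As an illustration of that ordering constraint, one simple way to keep per-page ordering while allowing cross-page parallelism would be to route every record for a given page to the same worker, e.g. by hashing the page identity; full-page images and multi-page records would still need to act as barriers, as Florian noted. The names and hash below are purely illustrative:

    #define N_REDO_WORKERS 4

    static int
    redo_worker_for_block(unsigned relfilenode, unsigned blkno)
    {
        /*
         * Any deterministic hash will do: the same page always maps to the
         * same worker, so its records are replayed there in WAL order,
         * while records for other pages can proceed in parallel.
         */
        return (int) ((relfilenode * 2654435761u + blkno) % N_REDO_WORKERS);
    }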
On Sat, Mar 1, 2008 at 2:13 AM, Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> I used to think it's a big problem, but I believe the full-page-write
> optimization in 8.3 made it much less so. Especially with the smoothed
> checkpoints: as checkpoints have less impact on response times, you can
> shorten the checkpoint interval, which helps to keep the recovery time
> reasonable.

I agree that smoothed checkpoints have considerably reduced the response time spikes we used to see in TPC-C tests. What I still don't like about the current checkpoint mechanism is that it writes all the dirty buffers to disk. With very large shared buffers, this could still be a problem.

Someday we may want to implement LAZY checkpoints which do not require writing dirty pages and hence can be taken much more frequently. But lazy checkpoints can increase the amount of redo work to be done at recovery time. If we can significantly improve the recovery logic, we can then think of reducing the work done at checkpoint time (either through lazy checkpoints or less frequent hard checkpoints), which would benefit normal database operation.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Aidan Van Dyk wrote:

> How difficult is it to parse the WAL logs with enough knowledge to know
> what heap page (file/offset) a WAL record touches (I haven't looked
> into any WAL code)?

Unfortunately there's no common format for that. All the heap-related WAL records - insert, update and delete - have a RelFileNode+ItemPointerData at the beginning of the WAL payload, but update records have another ItemPointerData for the tid of the new tuple in addition to that. And all the indexam WAL records use formats of their own.

It would be nice to refactor that so that there was a common format to store the file+block number touched by a WAL record, like we have for full-page images. That would be useful for all kinds of external tools that parse WAL files, like the read-ahead restore_command you envisioned.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
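Purely as a sketch of what such a common format might look like (this is not the actual 8.3 record layout, and the struct name is hypothetical), a fixed header carrying the affected page could be as simple as:

    typedef struct XLogPageRef
    {
        unsigned    spcNode;        /* tablespace */
        unsigned    dbNode;         /* database */
        unsigned    relNode;        /* relation's file node */
        unsigned    blkno;          /* block number within the relation */
    } XLogPageRef;

With something like this at a known offset in every record, an external tool (or a read-ahead restore_command) could find the affected page without knowing each resource manager's private record layout.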
I have added the following TODO:

* Speed WAL recovery by allowing more than one page to be prefetched

  This involves having a separate process that can be told which pages the
  recovery process will need in the near future.

  http://archives.postgresql.org/pgsql-hackers/2008-02/msg01279.php

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Aidan Van Dyk wrote:
> > How difficult is it to parse the WAL logs with enough knowledge to know
> > what heap page (file/offset) a WAL record touches (I haven't looked
> > into any WAL code)?
>
> Unfortunately there's no common format for that. All the heap-related
> WAL records - insert, update and delete - have a
> RelFileNode+ItemPointerData at the beginning of the WAL payload, but
> update records have another ItemPointerData for the tid of the new tuple
> in addition to that. And all the indexam WAL records use formats of their own.
>
> It would be nice to refactor that so that there was a common format to
> store the file+block number touched by a WAL record, like we have for
> full-page images. That would be useful for all kinds of external tools
> that parse WAL files, like the read-ahead restore_command you envisioned.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Florian G. Pflug wrote:

> Greg Stark wrote:
> > Florian G. Pflug wrote:
> >> The same holds true for index scans, though. Maybe we can find a
> >> solution that benefits both cases - something along the lines of a
> >> bgreader process.
> >
> > I posted a patch to do readahead for bitmap index scans using
> > posix_fadvise. Experiments showed it works great on RAID arrays on
> > Linux. Solaris will need to use libaio, though, which I haven't tried
> > yet.
>
> Cool! I'd like to try it out - is that patch available in the pg-patches
> archives?
>
> > Doing it for normal index scans is much much harder. You can
> > readahead a single page by using the next pointer if it looks like
> > you'll need it. But I don't see a convenient way to get more than
> > that.
>
> I was thinking that after reading a page from the index, the backend
> could post a list of heap pages referenced from that index page to
> shmem. A background process would repeatedly scan that list and load
> those pages into the buffer cache.

Agreed. Lots of databases do the index/heap readahead via threads --- I think we will probably use a separate read-ahead process that knows more about all the concurrent reads and the tablespaces involved.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +