Discussion: xlog.c: WALInsertLock vs. WALWriteLock
Hello guys,

I'm writing a function that will read data from the buffer in xlog (i.e. from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am doing it correctly.
For reading from the buffer, do I need to lock WALInsertLock or WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we use it for read purposes?

Thanks a lot.
On Fri, Oct 22, 2010 at 12:08:54PM -0700, fazool mein wrote:
> Hello guys,
>
> I'm writing a function that will read data from the buffer in xlog
> (i.e. from XLogCtl->pages and XLogCtl->xlblocks). I want to make
> sure that I am doing it correctly.

Got an example of what the function might look like?

> For reading from the buffer, do I need to lock WALInsertLock or
> WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'.
> Can we use it for read purposes?

Help me understand. Do you foresee some kind of concurrency issue, and if so, what?

Cheers,
David.

> Thanks a lot.

--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>> I'm writing a function that will read data from the buffer in xlog
>> (i.e. from XLogCtl->pages and XLogCtl->xlblocks). I want to make
>> sure that I am doing it correctly.
>
> Got an example of what the function might look like?

Say something like this:

bool ReadLogFromBuffer(char *buf, int len, XLogRecPtr p)

which would mean that we want to read len bytes of the log (records), starting at position p, into buf.

>> For reading from the buffer, do I need to lock WALInsertLock or
>> WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'.
>> Can we use it for read purposes?
>
> Help me understand. Do you foresee some kind of concurrency issue,
> and if so, what?

Yes. For example, while a process is reading from the buffer, another process may insert new records into the buffer. To give a specific example, walsender might want to read data from the buffer instead of reading log from disk. In parallel, there might be transactions on the server that modify the buffer.

Regards,
Tallat
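
To make the proposed signature concrete, here is a minimal, self-contained sketch of what such a read could look like against a toy ring of pages. It is not PostgreSQL code: ToyXLogCtl, ToyRecPtr, TOY_PAGE_SIZE, TOY_NPAGES and both functions are hypothetical stand-ins, the record pointer is reduced to a plain byte offset, and the idea that each slot remembers the end position of the page it holds is an assumption modelled loosely on XLogCtl->xlblocks. No locking is shown; the sketch only illustrates how a reader might notice that the page it wants has already been recycled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8     /* hypothetical; stands in for XLOG_BLCKSZ */
#define TOY_NPAGES    4     /* hypothetical number of WAL buffer slots */

typedef uint64_t ToyRecPtr; /* simplified stand-in for XLogRecPtr */

typedef struct ToyXLogCtl
{
    char      pages[TOY_NPAGES][TOY_PAGE_SIZE]; /* page contents */
    ToyRecPtr xlblocks[TOY_NPAGES];  /* end position of the page in each slot; 0 = empty */
} ToyXLogCtl;

/* Install the page that starts at log position 'start', recycling its slot. */
void
toy_install_page(ToyXLogCtl *ctl, ToyRecPtr start, const char *data)
{
    int slot = (int) ((start / TOY_PAGE_SIZE) % TOY_NPAGES);

    assert(start % TOY_PAGE_SIZE == 0);
    memcpy(ctl->pages[slot], data, TOY_PAGE_SIZE);
    ctl->xlblocks[slot] = start + TOY_PAGE_SIZE;
}

/*
 * Copy len bytes of log starting at position p into buf.
 * Returns false if a needed page is not (or no longer) in the buffers;
 * buf may then hold a partial copy. No locking is done here.
 */
bool
toy_read_log_from_buffer(const ToyXLogCtl *ctl, char *buf, int len, ToyRecPtr p)
{
    while (len > 0)
    {
        ToyRecPtr page_start = p - (p % TOY_PAGE_SIZE);
        int       slot = (int) ((page_start / TOY_PAGE_SIZE) % TOY_NPAGES);
        int       off = (int) (p % TOY_PAGE_SIZE);
        int       n = TOY_PAGE_SIZE - off;

        if (n > len)
            n = len;
        /* The slot must still hold the page we expect, not a newer one. */
        if (ctl->xlblocks[slot] != page_start + TOY_PAGE_SIZE)
            return false;
        memcpy(buf, ctl->pages[slot] + off, (size_t) n);
        buf += n;
        p += (ToyRecPtr) n;
        len -= n;
    }
    return true;
}
```

In a real server the check and the copy would race with concurrent inserters, which is exactly the locking question raised in this thread.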
On Fri, Oct 22, 2010 at 3:08 PM, fazool mein <fazoolmein@gmail.com> wrote:
> I'm writing a function that will read data from the buffer in xlog (i.e.
> from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
> doing it correctly.
> For reading from the buffer, do I need to lock WALInsertLock or
> WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we
> use it for read purposes?

Holding WALInsertLock in shared mode prevents other processes from inserting WAL, or in other words it keeps the "end" position from moving, while holding WALWriteLock in shared mode prevents other processes from writing the WAL from the buffers out to the operating system, or in other words it keeps the "start" position from moving. So you could probably take WALInsertLock in shared mode, figure out the current end of WAL position, release the lock; then take WALWriteLock in shared mode, read any WAL before the end of WAL position, and release the lock. But note that this wouldn't guarantee that you read all WAL as it's generated....

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 25, 2010 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 22, 2010 at 3:08 PM, fazool mein <fazoolmein@gmail.com> wrote:
>> I'm writing a function that will read data from the buffer in xlog (i.e.
>> from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
>> doing it correctly.
>> For reading from the buffer, do I need to lock WALInsertLock or
>> WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we
>> use it for read purposes?
>
> Holding WALInsertLock in shared mode prevents other processes from
> inserting WAL, or in other words it keeps the "end" position from
> moving, while holding WALWriteLock in shared mode prevents other
> processes from writing the WAL from the buffers out to the operating
> system, or in other words it keeps the "start" position from moving.
> So you could probably take WALInsertLock in shared mode, figure out
> the current end of WAL position, release the lock;

Once you release the WALInsertLock, someone else can grab it and overwrite the part of the buffer you think you are reading. So I think you have to hold WALInsertLock throughout the duration of the operation.

Of course it couldn't be overwritten if the wal record itself is not yet written from buffer to the OS/disk. But since you are not yet holding the WALWriteLock, this could be happening at any time.

> then take
> WALWriteLock in shared mode, read any WAL before the end of WAL
> position, and release the lock. But note that this wouldn't guarantee
> that you read all WAL as it's generated....

I don't think that holding WALWriteLock accomplishes much. It prevents part of the buffer from being written out to OS/disk, and thus becoming eligible for being overwritten in the buffer, but the WALInsertLock prevents it from actually being overwritten. And what if the part of the buffer you want to read was already eligible for overwriting but not yet actually overwritten? WALWriteLock won't allow you to safely access it, but WALInsertLock will (assuming you have a safe way to identify the record in the first place). For either case, holding it in shared mode would be sufficient.

Jeff
Excerpts from Jeff Janes's message of Tue Oct 26 12:22:38 -0300 2010:
> I don't think that holding WALWriteLock accomplishes much. It
> prevents part of the buffer from being written out to OS/disk, and
> thus becoming eligible for being overwritten in the buffer, but the
> WALInsertLock prevents it from actually being overwritten. And what
> if the part of the buffer you want to read was already eligible for
> overwriting but not yet actually overwritten? WALWriteLock won't
> allow you to safely access it, but WALInsertLock will (assuming you
> have a safe way to identify the record in the first place). For
> either case, holding it in shared mode would be sufficient.

And horrible for performance, I imagine. Those locks are highly trafficked.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Jeff Janes's message of Tue Oct 26 12:22:38 -0300 2010:
>> I don't think that holding WALWriteLock accomplishes much. It
>> prevents part of the buffer from being written out to OS/disk, and
>> thus becoming eligible for being overwritten in the buffer, but the
>> WALInsertLock prevents it from actually being overwritten. And what
>> if the part of the buffer you want to read was already eligible for
>> overwriting but not yet actually overwritten? WALWriteLock won't
>> allow you to safely access it, but WALInsertLock will (assuming you
>> have a safe way to identify the record in the first place). For
>> either case, holding it in shared mode would be sufficient.

> And horrible for performance, I imagine. Those locks are highly trafficked.

I think you might actually need *both* locks to ensure the WAL buffers aren't changing underneath you. If you don't have the walwriter locked out, it is free to change the state of a buffer from "dirty" to "written" and then to "prepared to receive next page of WAL". If the latter doesn't involve changing the content of the buffer today, it still could tomorrow.

And on top of all that, there remains the problem that the piece of WAL you want might already be gone from the buffers.

Might I suggest adopting the same technique walsender does, ie just read the data back from disk? There's a reason why we gave up trying to have walsender read directly from the buffers.

regards, tom lane
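
For comparison, reading back from disk needs no coordination with the WAL buffers at all. The sketch below is only an illustration of a positional read with POSIX pread(); the function name and the file path passed to it are hypothetical, and it says nothing about how walsender actually locates or names WAL segment files.

```c
#define _XOPEN_SOURCE 700   /* makes pread() visible under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes starting at byte offset 'off' of the file at 'path'
 * into buf. Returns the number of bytes read (short at end of file),
 * or -1 on error.
 */
ssize_t
toy_read_wal_from_file(const char *path, off_t off, char *buf, size_t len)
{
    int     fd = open(path, O_RDONLY);
    ssize_t total = 0;

    if (fd < 0)
        return -1;

    while ((size_t) total < len)
    {
        /* pread() does not move the file position, so no seek is needed. */
        ssize_t n = pread(fd, buf + total, len - (size_t) total, off + total);

        if (n < 0)
        {
            close(fd);
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        total += n;
    }
    close(fd);
    return total;
}
```

Data that has been written out stays readable at the same offset for as long as the file exists, so the reader can go at its own pace, at the cost of going through the kernel instead of shared memory.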
> Might I suggest adopting the same technique walsender does, ie just read
> the data back from disk? There's a reason why we gave up trying to have
> walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as the piece of WAL is available in the buffers. Can you please describe why walsender reading directly from the buffers was given up? To avoid a lot of locking?

The locking issue might not be a problem considering synchronous replication. In synchronous replication, the primary will anyways wait for the standby to send a confirmation before it can do more WAL inserts. Hence, reading from buffers might be better in this case.

So, as I understand from the emails, we need to lock both WALWriteLock and WALInsertLock in exclusive mode for reading from buffers. Agreed?

Thanks.
On 26.10.2010 21:03, fazool mein wrote:
>> Might I suggest adopting the same technique walsender does, ie just read
>> the data back from disk? There's a reason why we gave up trying to have
>> walsender read directly from the buffers.
>
> That is exactly what I do not want to do, i.e. read from disk, as long as
> the piece of WAL is available in the buffers.

Why not? If the reason is performance, I'd like to see some performance numbers to show that it's worth the trouble. You could perhaps do a quick and dirty hack that doesn't do the locking 100% correctly first, and do some benchmarking on that to get a ballpark number of how much potential there is. Or run oprofile on the current walsender implementation to see how much time is currently spent reading WAL from the kernel buffers.

> Can you please describe why
> walsender reading directly from the buffers was given up? To avoid a lot of
> locking?

To avoid locking yes, and complexity in general.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 26.10.2010 21:03, fazool mein wrote:
>>> Might I suggest adopting the same technique walsender does, ie just read
>>> the data back from disk? There's a reason why we gave up trying to have
>>> walsender read directly from the buffers.
>>
>> That is exactly what I do not want to do, i.e. read from disk, as long as
>> the piece of WAL is available in the buffers.
>
> Why not? If the reason is performance, I'd like to see some performance
> numbers to show that it's worth the trouble. You could perhaps do a quick
> and dirty hack that doesn't do the locking 100% correctly first, and do some
> benchmarking on that to get a ballpark number of how much potential there
> is. Or run oprofile on the current walsender implementation to see how much
> time is currently spent reading WAL from the kernel buffers.
>
>> Can you please describe why
>> walsender reading directly from the buffers was given up? To avoid a lot
>> of locking?
>
> To avoid locking yes, and complexity in general.

And the fact that it might allow the standby to get ahead of the master, leading to silent database corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 26, 2010 at 11:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> >
> >> Can you please describe why
> >> walsender reading directly from the buffers was given up? To avoid a lot
> >> of locking?
> >
> > To avoid locking yes, and complexity in general.
>
> And the fact that it might allow the standby to get ahead of the
> master, leading to silent database corruption.

I agree that the standby might get ahead, but this doesn't necessarily lead to database corruption. Here, the interesting case is what happens when the primary fails, which can lead to *either* of the following two cases:

1) The standby, due to some triggering mechanism, becomes the new primary. In this case, even if the standby was ahead, it's fine.
2) The primary comes back as primary. In this case, the standby will connect again to the primary. At this point, *if* somehow we are able to detect that the standby is ahead, then we should abort the standby and create a standby from scratch.

I agree with Heikki that going through all this trouble only makes sense if there is a huge performance boost.
On Tue, Oct 26, 2010 at 2:57 PM, fazool mein <fazoolmein@gmail.com> wrote:
> On Tue, Oct 26, 2010 at 11:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>> >> Can you please describe why
>> >> walsender reading directly from the buffers was given up? To avoid a
>> >> lot of locking?
>> >
>> > To avoid locking yes, and complexity in general.
>>
>> And the fact that it might allow the standby to get ahead of the
>> master, leading to silent database corruption.
>
> I agree that the standby might get ahead, but this doesn't necessarily lead
> to database corruption. Here, the interesting case is what happens when the
> primary fails, which can lead to *either* of the following two cases:
> 1) The standby, due to some triggering mechanism, becomes the new primary.
> In this case, even if the standby was ahead, its fine.

True.

> 2) The primary comes back as primary. In this case, the standby will connect
> again to the primary. At this point, *if* somehow we are able to detect that
> the standby is ahead, then we should abort the standby and create a standby
> from scratch.

Unless you set restart_after_crash=off, the master could crash-and-restart before you can do anything about it. But that doesn't exist in the 9.0 branch.

> I agree with Heikki that going through all this trouble only makes sense if
> there is a huge performance boost.

There's probably quite a large performance boost in the sync rep case from allowing the master and standby to fsync() in parallel, but first we need to get something that works at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> I agree that the standby might get ahead, but this doesn't necessarily
> lead to database corruption. Here, the interesting case is what happens
> when the primary fails, which can lead to *either* of the following two
> cases:
> 1) The standby, due to some triggering mechanism, becomes the new
> primary. In this case, even if the standby was ahead, its fine.
> 2) The primary comes back as primary. In this case, the standby will
> connect again to the primary. At this point, *if* somehow we are able to
> detect that the standby is ahead, then we should abort the standby and
> create a standby from scratch.

Yes. And we weren't able to implement that for 9.0. It's worth revisiting for 9.1. In fact, the issue of "is the standby ahead of the master" has come up repeatedly in potential failure scenarios; I think we're going to need a fairly bulletproof method to determine this.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Tue, Oct 26, 2010 at 3:00 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> I agree that the standby might get ahead, but this doesn't necessarily
>> lead to database corruption. Here, the interesting case is what happens
>> when the primary fails, which can lead to *either* of the following two
>> cases:
>> 1) The standby, due to some triggering mechanism, becomes the new
>> primary. In this case, even if the standby was ahead, its fine.
>> 2) The primary comes back as primary. In this case, the standby will
>> connect again to the primary. At this point, *if* somehow we are able to
>> detect that the standby is ahead, then we should abort the standby and
>> create a standby from scratch.
>
> Yes. And we weren't able to implement that for 9.0. It's worth
> revisiting for 9.1. In fact, the issue of "is the standby ahead of the
> master" has come up repeatedly in potential failure scenarios; I think
> we're going to need a fairly bulletproof method to determine this.

Agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/26/2010 05:52 PM, Alvaro Herrera wrote:
> And horrible for performance, I imagine. Those locks are highly trafficked.

Note, however, that offloading this to the file-system just moves congestion there. So we are effectively saying that we expect filesystems to do a better job (in that aspect) than our WAL implementation.

(Note that I'm not claiming that is or is not true - I didn't measure).

Regards

Markus Wanner
Excerpts from Markus Wanner's message of Wed Oct 27 11:44:20 -0300 2010:
> On 10/26/2010 05:52 PM, Alvaro Herrera wrote:
> > And horrible for performance, I imagine. Those locks are highly trafficked.
>
> Note, however, that offloading this to the file-system just moves
> congestion there. So we are effectively saying that we expect
> filesystems to do a better job (in that aspect) than our WAL implementation.

Well, you can just read at your pace from the filesystem; the data is going to stay there for a long time. WAL buffers are constantly moving, and aren't as big.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Oct 27, 2010 at 3:03 AM, fazool mein <fazoolmein@gmail.com> wrote:
>> Might I suggest adopting the same technique walsender does, ie just read
>> the data back from disk? There's a reason why we gave up trying to have
>> walsender read directly from the buffers.
>
> That is exactly what I do not want to do, i.e. read from disk, as long as
> the piece of WAL is available in the buffers.

I previously implemented a patch that makes walsender read WAL from the buffer without holding either WALInsertLock or WALWriteLock. That might be helpful for you. Please see the following post.

http://archives.postgresql.org/pgsql-hackers/2010-06/msg00661.php

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center