Обсуждение: crash-safe visibility map, take three
Last week, I posted a couple of possible designs for making the visibility map crash-safe, which did not elicit much comment. Since this is an important prerequisite to index-only scans, I'm trying again. http://archives.postgresql.org/pgsql-hackers/2010-11/msg01474.php http://archives.postgresql.org/pgsql-hackers/2010-11/msg01529.php Here's a third possible design. Instead of representing each page with a single bit in the visibility map, use two bits. One bit indicates whether all tuples on the page are all-visible (same as the current bit) - call this VM_ALL_VISIBLE. The other bit is only used during VACUUM and indicates whether VACUUM is trying to set the all-visible bit - call this VM_BECOMING_ALL_VISIBLE. We observe the rule that any operation that clears PD_ALL_VISIBLE must clear both the VM_ALL_VISIBLE and VM_BECOMING_ALL_VISIBLE bits for that page in the visibility map. VACUUM precedes as follows: 1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits are set, take the exclusive content lock for long enough to clear them. 2. Scan the heap. When a page is observed to be all-visible, set VM_BECOMING_ALL_VISIBLE and PD_ALL_VISIBLE. 3. Loop over shared buffers and write out every page to the OS which belongs to the target relation, was marked all-visible in step 2, and is still dirty. Note that this may require a WAL flush. 4. fsync() the heap. 5. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits are set, take the exclusive content lock, clear each such bit, set the corresponding VM_ALL_VISIBLE bits and XLOG the page. One might actually want to do steps 2-5 incrementally, in 1GB chunks, so that you don't fsync() too much of the relation all at once. If you tilt your head just right, the recurring problem in all of this is that the heap page and the visibility map page can go to disk in either order, and we have no way of knowing which one. A more radical solution to this problem (and, thus, a fourth possible design) would be to add a field to the buffer descriptor allowing one page to "wire" another page into shared buffers. If the value is >0, it's the number of a buffer it's currently wiring. If the value is <0, it's the number of other buffers that have wired this buffer. A buffer both wire another buffer and itself be wired at the same time. If the value is =0, everything's normal. To make this work, though, you'd have to complicate the checkpoint logic pretty considerably - make sure all the unwired buffers are written and fsync'd first, thus unwiring the remaining ones to be written and fsync'd in a second pass; and there are also possible problems with very small relations, where the number of wired buffers might grow to an uncomfortably high percentage of the total. Waving my hands at all this complexity, you could then make the heap pages wire the visibility map pages. I can't say I'm totally in love with any of these designs. Anyone else have any ideas, or any opinions about which one is best? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30.11.2010 06:57, Robert Haas wrote: > I can't say I'm totally in love with any of these designs. Anyone > else have any ideas, or any opinions about which one is best? Well, the design I've been pondering goes like this: At vacuum: 1. Write an "intent" XLOG record listing a chunk of visibility map bits that are not currently set, that we are going to try to set. A chunk of say 100 bits would be about right. 2. Scan the 100 heap pages as we currently do, setting the visibility map bits as we go. 3. After the scan, lock the visibility map page, check which of the bits that we set in step 2 are still set (concurrent updates might've cleared some), and write a final XLOG record listing the set bits. This step isn't necessary for correctness, BTW, but without it you lose all the set bits if you crash before next checkpoint. At replay, when we see the intent XLOG record, clear all the bits listed in it. This ensures that if we crashed and some of the visibility map bits were flushed to disk but the corresponding changes to the heap pages were not, the bits are cleared. When we see the final XLOG record, we set the bits. Some care is needed with checkpoints. Setting visibility map bits in step 2 is safe because crash recovery will replay the intent XLOG record and clear any incorrectly set bits. But if a checkpoint has happened after the intent XLOG record was written, that's not true. This can be avoided by checking RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed since the last intent XLOG record was written. There's a small race condition in the way a visibility map bit is currently cleared. When a heap page is updated, it is locked, the update is WAL-logged, and the lock is released. The visibility map page is updated only after that. If the final vacuum XLOG record is written just after updating the heap page, but before the visibility map bit is cleared, replaying the final XLOG record will set a bit that should not have been set. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Nov 29, 2010 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote: > 1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits > are set, take the exclusive content lock for long enough to clear > them. I wonder what the performance hit will be to workloads with contention and if this feature should be optional. -- Rob Wultsch wultsch@gmail.com
On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Some care is needed with checkpoints. Setting visibility map bits in step 2 > is safe because crash recovery will replay the intent XLOG record and clear > any incorrectly set bits. But if a checkpoint has happened after the intent > XLOG record was written, that's not true. This can be avoided by checking > RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed > since the last intent XLOG record was written. It seems like you'll need to hold some kind of lock between the time you examine RedoRecPtr and the time you actually examine the bit. WALInsertLock in shared mode, maybe? > There's a small race condition in the way a visibility map bit is currently > cleared. When a heap page is updated, it is locked, the update is > WAL-logged, and the lock is released. The visibility map page is updated > only after that. If the final vacuum XLOG record is written just after > updating the heap page, but before the visibility map bit is cleared, > replaying the final XLOG record will set a bit that should not have been > set. Well, if that final XLOG record isn't necessary for correctness anyway, the obvious thing to do seems to be - don't write it. Crashes are not so common that loss of even a full hour's visibility map bits in the event that we have one seems worth killing ourselves over. And not everybody sets checkpoint_timeout to an hour, and not all checkpoints are triggered by checkpoint_timeout, and not all crashes happen just before it expires. Seems like we might be better off writing that much less WAL. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 30.11.2010 06:57, Robert Haas wrote:
>> I can't say I'm totally in love with any of these designs.  Anyone
>> else have any ideas, or any opinions about which one is best?
> Well, the design I've been pondering goes like this:
Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions?  That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values.  All of these other schemes seem too complicated and not
provably correct.
Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.
        regards, tom lane
			
		On 30.11.2010 17:32, Robert Haas wrote: > On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Some care is needed with checkpoints. Setting visibility map bits in step 2 >> is safe because crash recovery will replay the intent XLOG record and clear >> any incorrectly set bits. But if a checkpoint has happened after the intent >> XLOG record was written, that's not true. This can be avoided by checking >> RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed >> since the last intent XLOG record was written. > > It seems like you'll need to hold some kind of lock between the time > you examine RedoRecPtr and the time you actually examine the bit. > WALInsertLock in shared mode, maybe? It's enough to hold an exclusive lock on the visibility map page. You have to set the bit first, and then check RedoRecPtr, and if it changed, write the XLOG record before releasing the lock. If RedoRecPtr changes any time before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it changes after that, we're safe because the checkpoint will flush the updated heap page and visibility map page. >> There's a small race condition in the way a visibility map bit is currently >> cleared. When a heap page is updated, it is locked, the update is >> WAL-logged, and the lock is released. The visibility map page is updated >> only after that. If the final vacuum XLOG record is written just after >> updating the heap page, but before the visibility map bit is cleared, >> replaying the final XLOG record will set a bit that should not have been >> set. > > Well, if that final XLOG record isn't necessary for correctness > anyway, the obvious thing to do seems to be - don't write it. Crashes > are not so common that loss of even a full hour's visibility map bits > in the event that we have one seems worth killing ourselves over. And > not everybody sets checkpoint_timeout to an hour, and not all > checkpoints are triggered by checkpoint_timeout, and not all crashes > happen just before it expires. Seems like we might be better off > writing that much less WAL. Yeah, possibly. It also means that the set bits will not propagate to standby servers, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 30, 2010 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> On 30.11.2010 06:57, Robert Haas wrote: >>> I can't say I'm totally in love with any of these designs. Anyone >>> else have any ideas, or any opinions about which one is best? > >> Well, the design I've been pondering goes like this: > > Wouldn't it be easier and more robust to just consider VM bit changes to > be part of the WAL-logged actions? That would include updating LSNs on > VM pages and flushing VM pages to disk during checkpoint based on their > LSN values. All of these other schemes seem too complicated and not > provably correct. What WAL-logged actions? The problem case is where a page has no tuples or line pointers that need to be removed, and all we need to do is mark it all-visible. We don't current WAL-log anything in that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30.11.2010 17:38, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 30.11.2010 06:57, Robert Haas wrote: >>> I can't say I'm totally in love with any of these designs. Anyone >>> else have any ideas, or any opinions about which one is best? > >> Well, the design I've been pondering goes like this: > > Wouldn't it be easier and more robust to just consider VM bit changes to > be part of the WAL-logged actions? That would include updating LSNs on > VM pages and flushing VM pages to disk during checkpoint based on their > LSN values. All of these other schemes seem too complicated and not > provably correct. The vm bit can be set once all the tuples on the page become visible to everyone. There is no WAL-logged action at that point we could piggyback on. Clearing the bit is already handled like that - replay of heap insert/update/delete records clear the visibility map bit. > Of course, that'd mean doing the bit changes inside the critical > sections for the related actions, so it's not a trivial change > code-wise, but neither are these other ideas. Yeah, I'm not terribly excited about any of these schemes. The "intent" record seems like the simplest one, but even that is quite different from the traditional WAL-logging we do that it makes me slightly nervous. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 30, 2010 at 10:43 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> It seems like you'll need to hold some kind of lock between the time >> you examine RedoRecPtr and the time you actually examine the bit. >> WALInsertLock in shared mode, maybe? > > It's enough to hold an exclusive lock on the visibility map page. You have > to set the bit first, and then check RedoRecPtr, and if it changed, write > the XLOG record before releasing the lock. If RedoRecPtr changes any time > before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it > changes after that, we're safe because the checkpoint will flush the updated > heap page and visibility map page. Brilliant. I assume that we need to call GetRedoRecPtr() after taking the exclusive lock on the page, though? > Yeah, possibly. It also means that the set bits will not propagate to > standby servers, though. That's definitely sucky, but in some ways it would be more complicated if they did, because I don't think all-visible on the master implies all-visible on the standby. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Here's one more idea: The trivial solution to this is to WAL-log setting the visibility map bit, like we WAL-log any other operation. Lock the heap page, lock the visibility map page, write WAL-record, and release locks. That works, but the problem is that it creates quite a lot of new WAL traffic. We could reduce the WAL traffic by simply updating multiple pages at a time. Lock N pages, lock the visibility map page, write one WAL record, and release locks. If N=10, for example, we only need to WAL-log a couple of bytes per page, so the WAL volume should be acceptable. The downside is that you need to keep more pages locked at the same time, but maybe that's not too bad. This wouldn't require anything special, which means fewer hard-to-debug visibility & recovery bugs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 30.11.2010 17:38, Tom Lane wrote:
>> Wouldn't it be easier and more robust to just consider VM bit changes to
>> be part of the WAL-logged actions?  That would include updating LSNs on
>> VM pages and flushing VM pages to disk during checkpoint based on their
>> LSN values.  All of these other schemes seem too complicated and not
>> provably correct.
> The vm bit can be set once all the tuples on the page become visible to 
> everyone. There is no WAL-logged action at that point we could piggyback on.
So you start emitting a WAL entry for the act of setting the VM bit
(and I guess the page header hint bit would be included in that too).
> Yeah, I'm not terribly excited about any of these schemes. The "intent" 
> record seems like the simplest one, but even that is quite different 
> from the traditional WAL-logging we do that it makes me slightly nervous.
I'm not convinced it works at all.  Consider write intent record,
checkpoint, set bit, crash before completing vacuum.  There will be
no second intent record at which you could clean up if things are
inconsistent.
        regards, tom lane
			
		Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> The trivial solution to this is to WAL-log setting the visibility map 
> bit, like we WAL-log any other operation. Lock the heap page, lock the 
> visibility map page, write WAL-record, and release locks. That works, 
> but the problem is that it creates quite a lot of new WAL traffic.
How much is "quite a lot"?  Do we have any real reason to think that
this solution is unacceptable performance-wise?
I'd also suggest that if you want to prevent torn-page syndrome on VM
pages (and if you want to rely on their LSN values, you do) then you
probably don't have any choice anyway.  VM pages will have to adhere to
the same write-full-page-on-first-mod-after-checkpoint rule as any other
page.  I'd guess that this will swamp any savings from cutesy schemes
for reducing the number of WAL records.
> We could reduce the WAL traffic by simply updating multiple pages at a 
> time. Lock N pages, lock the visibility map page, write one WAL record, 
> and release locks.
I don't think that will work, because you have to hold the lock on a
page from the time you check that it's all-visible to the time you apply
the update.  The loss of concurrency against updates would be pretty
bad, and I think you'd be creating significant risk of deadlocks from
holding multiple buffer locks at once.
        regards, tom lane
			
		On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > How much is "quite a lot"? Do we have any real reason to think that > this solution is unacceptable performance-wise? Well, let's imagine a 1GB insert-only table. It has 128K pages. If you XLOG setting the bit on each page, you'll need to write 128K WAL records, each containing a 12-byte relfilenode and a 4-byte block offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL. But you did just dirty a gigabyte of data. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 30, 2010 at 11:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> How much is "quite a lot"? Do we have any real reason to think that >> this solution is unacceptable performance-wise? > > Well, let's imagine a 1GB insert-only table. It has 128K pages. If > you XLOG setting the bit on each page, you'll need to write 128K WAL > records, each containing a 12-byte relfilenode and a 4-byte block > offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL. > > But you did just dirty a gigabyte of data. Oh, but it's worse than that. When you XLOG a WAL record for each of those pages, you're going to trigger full-page writes for all of them.So now you've turned 1GB of data to write into 2+ GBof data to write. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30.11.2010 18:22, Robert Haas wrote: > On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> How much is "quite a lot"? Do we have any real reason to think that >> this solution is unacceptable performance-wise? > > Well, let's imagine a 1GB insert-only table. It has 128K pages. If > you XLOG setting the bit on each page, you'll need to write 128K WAL > records, each containing a 12-byte relfilenode and a 4-byte block > offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL. Plus WAL headers, I think it's something like 32 or 40 bytes of WAL per page. > But you did just dirty a gigabyte of data. Good point. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 30.11.2010 18:10, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> Yeah, I'm not terribly excited about any of these schemes. The "intent" >> record seems like the simplest one, but even that is quite different >> from the traditional WAL-logging we do that it makes me slightly nervous. > > I'm not convinced it works at all. Consider write intent record, > checkpoint, set bit, crash before completing vacuum. There will be > no second intent record at which you could clean up if things are > inconsistent. That's why you need to check the RedoRecPtr when you set the bit. If it has changed, ie. a checkpoint has happened, the set bit step will write a new intent record. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> Oh, but it's worse than that.  When you XLOG a WAL record for each of
> those pages, you're going to trigger full-page writes for all of them.
>  So now you've turned 1GB of data to write into 2+ GB of data to
> write.
No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.
        regards, tom lane
			
		Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 30.11.2010 18:10, Tom Lane wrote:
>> I'm not convinced it works at all.  Consider write intent record,
>> checkpoint, set bit, crash before completing vacuum.  There will be
>> no second intent record at which you could clean up if things are
>> inconsistent.
> That's why you need to check the RedoRecPtr when you set the bit. If it 
> has changed, ie. a checkpoint has happened, the set bit step will write 
> a new intent record.
Oh, you explained the proposal poorly then.  I thought you meant recheck
and write another intent record just once, immediately before sending
the final xlog record.
It still seems rickety and not clearly correct, especially when you
start thinking about all the other constraints we have on xlog behavior
(eg, does this work while taking a base backup).
        regards, tom lane
			
		Robert Haas <robertmhaas@gmail.com> writes:
> That's definitely sucky, but in some ways it would be more complicated
> if they did, because I don't think all-visible on the master implies
> all-visible on the standby.
Ouch.  That seems like it could shoot down all these proposals.  There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.
I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.
        regards, tom lane
			
		On Tue, Nov 30, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Oh, but it's worse than that. When you XLOG a WAL record for each of >> those pages, you're going to trigger full-page writes for all of them. >> So now you've turned 1GB of data to write into 2+ GB of data to >> write. > > No, because only the first mod of each VM page would trigger a full page > write, at least assuming a reasonable ordering of the operations. I'm not worried about the full-page writes from updating the visibility map - I'm worried about the full-page writes from updating the heap. It doesn't matter a whit if we fail to set a bit in the visibility map. What matters is if we DO set the bit in the visibility map but FAIL TO set the bit in the heap, because then a subsequent update to the heap page won't check the visibility map and clear the bit. The *heap* updates are the ones that have to be guaranteed to make it to disk. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30.11.2010 18:40, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> That's definitely sucky, but in some ways it would be more complicated >> if they did, because I don't think all-visible on the master implies >> all-visible on the standby. > > Ouch. That seems like it could shoot down all these proposals. There > definitely isn't any way to make VM crash-safe if there is no WAL-driven > mechanism for setting the bits. Note that this is only a problem for *hot* standby. After failover, all the tuples that were visible to everyone in the master are also visible to all new transactions in the standby. We dealt with this in 9.0 already, with the "killed" flag in index tuples and the PD_ALL_VISIBLE flag in heap scans. We simply don't believe them in hot standby mode, and check visibility even if the flag is set. > I guess what we need is a way to delay the application of such a WAL > record on the slave until it's safe, which means the record also has to > carry some indication of the youngest XMIN on the page. Something like that would certainly be nice. With index-only scans, it can be a big disappointment if you can't do an index-only scan in hot standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 30.11.2010 18:33, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> Oh, but it's worse than that. When you XLOG a WAL record for each of >> those pages, you're going to trigger full-page writes for all of them. >> So now you've turned 1GB of data to write into 2+ GB of data to >> write. > > No, because only the first mod of each VM page would trigger a full page > write, at least assuming a reasonable ordering of the operations. If you change the LSN on the heap pages, you have to write full page images of those as well. Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it usually isn't), and then set the bit in the VM while keeping the heap page locked. Can we get away with not setting the LSN on the heap page, even though we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be flushed to disk before the WAL record, but I think that's fine because it's OK to have the flag set in the heap page even if the VM bit is not set. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> That's definitely sucky, but in some ways it would be more complicated >> if they did, because I don't think all-visible on the master implies >> all-visible on the standby. > > Ouch. That seems like it could shoot down all these proposals. There > definitely isn't any way to make VM crash-safe if there is no WAL-driven > mechanism for setting the bits. Heikki's intent method works fine, because the WAL record only clears the visibility map bits on redo; it never sets them. > I guess what we need is a way to delay the application of such a WAL > record on the slave until it's safe, which means the record also has to > carry some indication of the youngest XMIN on the page. I'm unexcited about inventing more ways to delay XLOG application on the standby. We have enough of those already. We could actually allow the slave to set the visibility map bits based on its own xmin horizon. The only problem is that you wouldn't be able to write the intent XLOG records. I suppose you could have a separate file which is just used to store the intent records, and designate a range of very-high numbered LSNs to mean blocks of the intent file rather than a position in the regular WAL stream. VACUUM is so much fun on the master, let's have it on the standby too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Can we get away with not setting the LSN on the heap page, even though 
> we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page 
> can be flushed to disk before the WAL record, but I think that's fine 
> because it's OK to have the flag set in the heap page even if the VM bit 
> is not set.
Why is that fine?  It's certainly not fine from the standpoint of
someone wondering why his index-only scan performs so badly.
I think all this hair-splitting about cases where it's okay to have one
bit set and not the other is misguided.  To me, crash-safety of the VM
means that its copy of the page-header bit is right.  Period.  Yes, it
will cost something to ensure that; so what?  If we don't get more than
enough compensating performance gain from index-only scans, the whole
patch is going to end up reverted.
        regards, tom lane
			
		Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ouch. �That seems like it could shoot down all these proposals. �There
>> definitely isn't any way to make VM crash-safe if there is no WAL-driven
>> mechanism for setting the bits.
> Heikki's intent method works fine, because the WAL record only clears
> the visibility map bits on redo; it never sets them.
Uh, no, because he also had that final WAL record that would set the
bits.
> We could actually allow the slave to set the visibility map bits based
> on its own xmin horizon.
Not in a crash-safe way, which is exactly the problem here.
        regards, tom lane
			
		On Tue, Nov 30, 2010 at 11:49 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 30.11.2010 18:33, Tom Lane wrote: >> >> Robert Haas<robertmhaas@gmail.com> writes: >>> >>> Oh, but it's worse than that. When you XLOG a WAL record for each of >>> those pages, you're going to trigger full-page writes for all of them. >>> So now you've turned 1GB of data to write into 2+ GB of data to >>> write. >> >> No, because only the first mod of each VM page would trigger a full page >> write, at least assuming a reasonable ordering of the operations. > > If you change the LSN on the heap pages, you have to write full page images > of those as well. > > Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE > flag on the heap page (assuming it's not set already, it usually isn't), and > then set the bit in the VM while keeping the heap page locked. > > Can we get away with not setting the LSN on the heap page, even though we > set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be > flushed to disk before the WAL record, but I think that's fine because it's > OK to have the flag set in the heap page even if the VM bit is not set. I don't immediately see why that wouldn't work. As long as you bump the LSN on the visibility map page, and also bump the LSN of the visibility map page every time you clear a bit, I think you should be OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 30, 2010 at 11:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Can we get away with not setting the LSN on the heap page, even though >> we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page >> can be flushed to disk before the WAL record, but I think that's fine >> because it's OK to have the flag set in the heap page even if the VM bit >> is not set. > > Why is that fine? It's certainly not fine from the standpoint of > someone wondering why his index-only scan performs so badly. > > I think all this hair-splitting about cases where it's okay to have one > bit set and not the other is misguided. To me, crash-safety of the VM > means that its copy of the page-header bit is right. Period. Yes, it > will cost something to ensure that; so what? If we don't get more than > enough compensating performance gain from index-only scans, the whole > patch is going to end up reverted. We're not going to double the cost of VACUUM to get index-only scans. And that's exactly what will happen if you do full-page writes of every heap page to set a single bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 30, 2010 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Ouch. That seems like it could shoot down all these proposals. There >>> definitely isn't any way to make VM crash-safe if there is no WAL-driven >>> mechanism for setting the bits. > >> Heikki's intent method works fine, because the WAL record only clears >> the visibility map bits on redo; it never sets them. > > Uh, no, because he also had that final WAL record that would set the > bits. Well, as already discussed upthread, that WAL record causes some other problems, so make it Heikki's intent method, without the final WAL record that breaks things. >> We could actually allow the slave to set the visibility map bits based >> on its own xmin horizon. > > Not in a crash-safe way, which is exactly the problem here. Brilliant selective quoting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> We're not going to double the cost of VACUUM to get index-only scans.
> And that's exactly what will happen if you do full-page writes of
> every heap page to set a single bit.
It's ridiculous to claim that that "doubles the cost of VACUUM".  In the
worst case, it will add 25% to the cost of setting an all-visible bit on
a page where there is no other work to do.  (You already are writing out
the heap page and the VM page, plus a WAL image of the heap page, so a
WAL image of the VM page adds 25%.  But only if you did not set any
other bits on the same VM page, which is probably not a real common
case.)  Given that VACUUM has a lot of other cleanup besides visibility
bit setting, I'm not convinced that this would even be noticeable.
I think the burden is on people who are proposing complicated mechanisms
to show that there's actually a strong need for them.
        regards, tom lane
			
		On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> We're not going to double the cost of VACUUM to get index-only scans. >> And that's exactly what will happen if you do full-page writes of >> every heap page to set a single bit. > > It's ridiculous to claim that that "doubles the cost of VACUUM". In the > worst case, it will add 25% to the cost of setting an all-visible bit on > a page where there is no other work to do. (You already are writing out > the heap page and the VM page, True. > plus a WAL image of the heap page, so a False. That is exactly what we are NOT doing now and what we must find a way to avoid doing. > WAL image of the VM page adds 25%. But only if you did not set any > other bits on the same VM page, which is probably not a real common > case.) Given that VACUUM has a lot of other cleanup besides visibility > bit setting, I'm not convinced that this would even be noticeable. > > I think the burden is on people who are proposing complicated mechanisms > to show that there's actually a strong need for them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's ridiculous to claim that that "doubles the cost of VACUUM". �In the
>> worst case, it will add 25% to the cost of setting an all-visible bit on
>> a page where there is no other work to do. �(You already are writing out
>> the heap page and the VM page,
> True.
>> plus a WAL image of the heap page, so a
> False.  That is exactly what we are NOT doing now and what we must
> find a way to avoid doing.
I do not accept that argument.  You can't make an omelette without
breaking eggs, and the cost of index-only scans is going to be that
it costs more to get the visibility bits set in the first place.
But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN.  Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)
        regards, tom lane
			
		On 30.11.2010 19:22, Tom Lane wrote: > But having said that, I wonder whether we need a full-page image for > a WAL-logged action that is known to involve only setting a single bit > and updating LSN. Would omitting the FPI be any more risky than what > happens now (ie, the page does get written back to disk at some point, > without any image from which it can be rewritten if the write fails...) You have to write a full-page image if you update the LSN, because otherwise the next update that comes along will not write a full page image. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > But having said that, I wonder whether we need a full-page image for > a WAL-logged action that is known to involve only setting a single bit > and updating LSN. Would omitting the FPI be any more risky than what > happens now (ie, the page does get written back to disk at some point, > without any image from which it can be rewritten if the write fails...) That's pretty much exactly what Heikki proposed 35 minutes ago, and you objected 6 minutes later. I still think it might work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 30, 2010 at 12:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But having said that, I wonder whether we need a full-page image for >> a WAL-logged action that is known to involve only setting a single bit >> and updating LSN. Would omitting the FPI be any more risky than what >> happens now (ie, the page does get written back to disk at some point, >> without any image from which it can be rewritten if the write fails...) > > That's pretty much exactly what Heikki proposed 35 minutes ago, and > you objected 6 minutes later. I still think it might work. Oh, I see the difference now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 30.11.2010 19:22, Tom Lane wrote:
>> But having said that, I wonder whether we need a full-page image for
>> a WAL-logged action that is known to involve only setting a single bit
>> and updating LSN.
> You have to write a full-page image if you update the LSN, because 
> otherwise the next update that comes along will not write a full page image.
Um.  Drat.  I was thinking about the replay side, where I think it would
actually work --- but you're right, it would break the logic on the
generation side.  Unless you want to put in some kind of flag saying
"this was only a visibility bit update, any bigger update still needs
to write an FPI".
        regards, tom lane
			
		Heikki Linnakangas wrote: > On 30.11.2010 18:33, Tom Lane wrote: > > Robert Haas<robertmhaas@gmail.com> writes: > >> Oh, but it's worse than that. When you XLOG a WAL record for each of > >> those pages, you're going to trigger full-page writes for all of them. > >> So now you've turned 1GB of data to write into 2+ GB of data to > >> write. > > > > No, because only the first mod of each VM page would trigger a full page > > write, at least assuming a reasonable ordering of the operations. > > If you change the LSN on the heap pages, you have to write full page > images of those as well. > > Let's recap what happens when a VM bit is set: You set the > PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it > usually isn't), and then set the bit in the VM while keeping the heap > page locked. What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to happen so the heap page is guaranteed to be on disk, then on next read, if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the VM bit. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 01.12.2010 03:35, Bruce Momjian wrote: > Heikki Linnakangas wrote: >> Let's recap what happens when a VM bit is set: You set the >> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it >> usually isn't), and then set the bit in the VM while keeping the heap >> page locked. > > What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to > happen so the heap page is guaranteed to be on disk, then on next read, > if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the > VM bit. Hmm, you'd somehow have to know if a checkpoint has happened since the flag was set. And it might be a long wait, which makes it less attractive for index-only scans. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > On 01.12.2010 03:35, Bruce Momjian wrote: > > Heikki Linnakangas wrote: > >> Let's recap what happens when a VM bit is set: You set the > >> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it > >> usually isn't), and then set the bit in the VM while keeping the heap > >> page locked. > > > > What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to > > happen so the heap page is guaranteed to be on disk, then on next read, > > if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the > > VM bit. > > Hmm, you'd somehow have to know if a checkpoint has happened since the > flag was set. And it might be a long wait, which makes it less Well, doesn't the page LSN tell you that already? > attractive for index-only scans. My assumption is that this page will remain read-only for a while, so I don't see the big downside in a delay of max 5-10 minutes. For sites where pages go frequently in and out of read-only status, I don't think index-only scans are every going to be a big win. Long-running queries alone are going to delay how quickly we can set PD_ALL_VISIBLE, so I don't see an additional 5-10 minutes as a big problem. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 01.12.2010 15:39, Bruce Momjian wrote: > Heikki Linnakangas wrote: >> On 01.12.2010 03:35, Bruce Momjian wrote: >>> Heikki Linnakangas wrote: >>>> Let's recap what happens when a VM bit is set: You set the >>>> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it >>>> usually isn't), and then set the bit in the VM while keeping the heap >>>> page locked. >>> >>> What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to >>> happen so the heap page is guaranteed to be on disk, then on next read, >>> if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the >>> VM bit. >> >> Hmm, you'd somehow have to know if a checkpoint has happened since the >> flag was set. And it might be a long wait, which makes it less > > Well, doesn't the page LSN tell you that already? If we update the LSN when we set the flag, then we have to write the full-page-image. That's very expensive. If we don't update the LSN, then the LSN says nothing about when the flag was set. >> attractive for index-only scans. > > My assumption is that this page will remain read-only for a while, so I > don't see the big downside in a delay of max 5-10 minutes. For sites > where pages go frequently in and out of read-only status, I don't think > index-only scans are every going to be a big win. Long-running queries > alone are going to delay how quickly we can set PD_ALL_VISIBLE, so I > don't see an additional 5-10 minutes as a big problem. Perhaps we could live with it, but it would be annoying to have to checkpoint after a data load, before index-only scans kick in. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > it would be annoying to have to checkpoint after a data load Heck, in my world it's currently pretty much a necessity to run VACUUM FREEZE ANALYZE on a table after a data load before it's reasonable to expose the table to production use. It would hardly be an inconvenience to also run a CHECKPOINT. -Kevin
Heikki Linnakangas wrote: > On 01.12.2010 15:39, Bruce Momjian wrote: > > Heikki Linnakangas wrote: > >> On 01.12.2010 03:35, Bruce Momjian wrote: > >>> Heikki Linnakangas wrote: > >>>> Let's recap what happens when a VM bit is set: You set the > >>>> PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it > >>>> usually isn't), and then set the bit in the VM while keeping the heap > >>>> page locked. > >>> > >>> What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to > >>> happen so the heap page is guaranteed to be on disk, then on next read, > >>> if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the > >>> VM bit. > >> > >> Hmm, you'd somehow have to know if a checkpoint has happened since the > >> flag was set. And it might be a long wait, which makes it less > > > > Well, doesn't the page LSN tell you that already? > > If we update the LSN when we set the flag, then we have to write the > full-page-image. That's very expensive. If we don't update the LSN, then > the LSN says nothing about when the flag was set. Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag? OK, please let me think some more. Thanks. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Dec 1, 2010 at 9:57 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > >> it would be annoying to have to checkpoint after a data load > > Heck, in my world it's currently pretty much a necessity to run > VACUUM FREEZE ANALYZE on a table after a data load before it's > reasonable to expose the table to production use. It would hardly > be an inconvenience to also run a CHECKPOINT. The goal here is to make that better, not worse! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 1, 2010 at 10:36 AM, Bruce Momjian <bruce@momjian.us> wrote: > Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag? OK, > please let me think some more. Thanks. As far as I can tell, there are basically two viable solutions on the table here. 1. Every time we observe a page as all-visible, (a) set the PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the bit in the visibility map page, bumping the LSN as usual, and (c) emit a WAL record indicating the relation and block number. On redo of this record, set both the page-level bit and the visibility map bit. The heap page may hit the disk before the WAL record, but that's OK; it just might result in a little extra work until some subsequent operation gets the visibility map bit set. The visibility map page page may hit the disk before the heap page, but that's OK too, because the WAL record will already be on disk due to the LSN interlock. If a crash occurs before the heap page is flushed, redo will fix the heap page. (The heap page will get flushed as part of the next checkpoint, if not sooner, so by the time the redo pointer advances past the WAL record, there's no longer a risk.) 2. Every time we observe a page as all-visible, (a) set the PD_ALL_VISIBLE bit on the page, without bumping the LSN, (b) set the bit in the visibility map page, bumping the LSN if a WAL record is issued (which only happens sometimes, read on), and (c) emit a WAL record indicating the "chunk" of 128 visibility map bits which contains the bit we just set - but only if we're now dealing with a new group of 128 visibility map bits or if a checkpoint has intervened since the last such record we emitted. On redo of this record, clear the visibility map bits in each chunk. The heap page may hit the disk before the WAL record, but that's OK for the same reasons as in plan #1. The visibility map page may hit the disk before the heap page, but that's OK too, because the WAL record will already be on disk to due the LSN interlock. If a crash occurs before the heap page makes it to disk, then redo will clear the visibility map bits, leaving them to be reset by a subsequent VACUUM. As is typical with good ideas, neither of these seems terribly complicated in retrospect. Kudos to Heikki for thinking them up and explaining them. After some thought, I think that approach #1 is probably better, because it propagates visibility map bits to the standby. During index-only scans, the standby will have to ignore them during HS operation just as it currently ignores the PD_ALL_VISIBLE page-level bit, but if and when the standby is promoted to master, it's important to have those bits already set, both for index-only scans and also because, absent that, the first autovacuum on each table will end up scanning the whole things and dirtying tremendous gobs of data setting all those bits, which is just the sort of ugly surprise that we don't want to give people right after they've been forced to perform a failover. I think we can improve this a bit further by also introducing a HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with FrozenXID. This allows us to freeze tuples aggressively - if we want - without losing any forensic information. We can then modify the above algorithm slightly, so that when we observe that a page is all visible, we not only set PD_ALL_VISIBLE on the page but also HEAP_XMIN_FROZEN on each tuple. The WAL record marking the page as all-visible then doubles as a WAL record marking it frozen, eliminating the need to dirty the page yet again at anti-wraparound vacuum time. It'll still be a net increase in WAL volume (as Heikki pointed out) but the added WAL volume is small compared with the I/O involved in writing out the dirty heap pages (as Tom pointed out), so it should hopefully be OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01.12.2010 18:25, Robert Haas wrote: > I think we can improve this a bit further by also introducing a > HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with > FrozenXID. This allows us to freeze tuples aggressively - if we want > - without losing any forensic information. We can then modify the > above algorithm slightly, so that when we observe that a page is all > visible, we not only set PD_ALL_VISIBLE on the page but also > HEAP_XMIN_FROZEN on each tuple. Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page header over the xmin/xmax on the tuples, we could simply not bother doing anti-wraparound vacuums for pages that have the flag set. I'm not sure what changes that would require outside heapam.c, as we'd have to be careful to not trust the xmin/xmax if the flag was set. The first update on the page that clears the flag would need to freeze all the tuples in that scheme. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> As far as I can tell, there are basically two viable solutions on the
> table here.
> 1. Every time we observe a page as all-visible, (a) set the
> PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
> bit in the visibility map page, bumping the LSN as usual, and (c) emit
> a WAL record indicating the relation and block number.  On redo of
> this record, set both the page-level bit and the visibility map bit.
> The heap page may hit the disk before the WAL record, but that's OK;
Um, no it isn't.  Suppose the heap page gets to disk but we crash before
the WAL record does.  Now we have a persistent state where the heap page
is marked PD_ALL_VISIBLE but the corresponding VM bit is not set.  The
VM bit will never become set, either, because operations on the heap
page will see PD_ALL_VISIBLE and assume it already is set.  This state
of affairs might be acceptable from a correctness standpoint, but not
from a performance standpoint.
        regards, tom lane
			
		Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page 
> header over the xmin/xmax on the tuples, we could simply not bother 
> doing anti-wraparound vacuums for pages that have the flag set. I'm not 
> sure what changes that would require outside heapam.c, as we'd have to 
> be careful to not trust the xmin/xmax if the flag was set.
That seems pretty ugly/dangerous.  If we're going to try to do something
here, I much prefer Robert's approach of marking each tuple in the tuple
header.
        regards, tom lane
			
		On 01.12.2010 18:40, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> As far as I can tell, there are basically two viable solutions on the >> table here. > >> 1. Every time we observe a page as all-visible, (a) set the >> PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the >> bit in the visibility map page, bumping the LSN as usual, and (c) emit >> a WAL record indicating the relation and block number. On redo of >> this record, set both the page-level bit and the visibility map bit. >> The heap page may hit the disk before the WAL record, but that's OK; > > Um, no it isn't. Suppose the heap page gets to disk but we crash before > the WAL record does. Now we have a persistent state where the heap page > is marked PD_ALL_VISIBLE but the corresponding VM bit is not set. The > VM bit will never become set, either, because operations on the heap > page will see PD_ALL_VISIBLE and assume it already is set. This state > of affairs might be acceptable from a correctness standpoint, but not > from a performance standpoint. The next vacuum will fix it. We already handle that. It's no different from the situation where neither change makes it to the disk. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 01.12.2010 18:40, Tom Lane wrote:
>> Um, no it isn't.  Suppose the heap page gets to disk but we crash before
>> the WAL record does.  Now we have a persistent state where the heap page
>> is marked PD_ALL_VISIBLE but the corresponding VM bit is not set.  The
>> VM bit will never become set, either, because operations on the heap
>> page will see PD_ALL_VISIBLE and assume it already is set.  This state
>> of affairs might be acceptable from a correctness standpoint, but not
>> from a performance standpoint.
> The next vacuum will fix it. We already handle that. It's no different 
> from the situation where neither change makes it to the disk.
Well, as long as we're careful not to optimize away setting the VM bit
on the grounds that PD_ALL_VISIBLE is already set, I suppose that will
work.
        regards, tom lane
			
		Robert Haas <robertmhaas@gmail.com> writes:
> I think we can improve this a bit further by also introducing a
> HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
> FrozenXID.  This allows us to freeze tuples aggressively - if we want
> - without losing any forensic information.
So far so good ...
> We can then modify the
> above algorithm slightly, so that when we observe that a page is all
> visible, we not only set PD_ALL_VISIBLE on the page but also
> HEAP_XMIN_FROZEN on each tuple.  The WAL record marking the page as
> all-visible then doubles as a WAL record marking it frozen,
> eliminating the need to dirty the page yet again at anti-wraparound
> vacuum time.
but this seems a lot more dubious/fragile.  The basic problem is that
it's not clear whether HEAP_XMIN_FROZEN is a hint bit or essential
data.  If you want to set it without the overhead of an LSN bump or a
possible FPI in WAL, then it's a hint bit.  But if you're using it to
protect clog truncation then it's essential data.  Perhaps you can make
this work but there are some nonobvious requirements:
1. Seeing PD_ALL_VISIBLE set does not excuse vacuum from having to
iterate through all the tuples on the page checking for
HEAP_XMIN_FROZEN.  This is because the non-logged update of the page
might have been torn on the way to disk, such that PD_ALL_VISIBLE got
set but not all of the FROZEN bits did.
2. During an anti-wraparound vacuum, you *need to* emit a WAL record
when setting HEAP_XMIN_FROZEN.  It's not a hint, any more than writing
FrozenXID is now.
Actually, #2 isn't even good enough.  What if vacuum passes over a page
and finds all the FROZEN bits set, but the reason they're set is that
somebody else updated them in hint fashion microseconds before?  It
seems possible that those bits might not make it to disk before a
subsequent crash.  The only way to be really sure those bits are set is
to emit a WAL record that says to set them, whether or not they seem to
be set already.  While the WAL record could be small, you'd need one for
every page, making the argument that this saves I/O somewhat dubious.
        regards, tom lane
			
		On Wed, Dec 1, 2010 at 11:40 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 01.12.2010 18:25, Robert Haas wrote: >> >> I think we can improve this a bit further by also introducing a >> HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with >> FrozenXID. This allows us to freeze tuples aggressively - if we want >> - without losing any forensic information. We can then modify the >> above algorithm slightly, so that when we observe that a page is all >> visible, we not only set PD_ALL_VISIBLE on the page but also >> HEAP_XMIN_FROZEN on each tuple. > > Hmm, actually, if we're willing to believe PD_ALL_VISIBLE in the page header > over the xmin/xmax on the tuples, we could simply not bother doing > anti-wraparound vacuums for pages that have the flag set. I'm not sure what > changes that would require outside heapam.c, as we'd have to be careful to > not trust the xmin/xmax if the flag was set. > > The first update on the page that clears the flag would need to freeze all > the tuples in that scheme. That seems more complex for no particular gain. I guess it saves an infomask bit, but I'm willing to burn one to reduce code complexity. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 1, 2010 at 12:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I think we can improve this a bit further by also introducing a >> HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with >> FrozenXID. This allows us to freeze tuples aggressively - if we want >> - without losing any forensic information. > > So far so good ... > >> We can then modify the >> above algorithm slightly, so that when we observe that a page is all >> visible, we not only set PD_ALL_VISIBLE on the page but also >> HEAP_XMIN_FROZEN on each tuple. The WAL record marking the page as >> all-visible then doubles as a WAL record marking it frozen, >> eliminating the need to dirty the page yet again at anti-wraparound >> vacuum time. > > but this seems a lot more dubious/fragile. The basic problem is that > it's not clear whether HEAP_XMIN_FROZEN is a hint bit or essential > data. If you want to set it without the overhead of an LSN bump or a > possible FPI in WAL, then it's a hint bit. But if you're using it to > protect clog truncation then it's essential data. Perhaps you can make > this work but there are some nonobvious requirements: > > 1. Seeing PD_ALL_VISIBLE set does not excuse vacuum from having to > iterate through all the tuples on the page checking for > HEAP_XMIN_FROZEN. This is because the non-logged update of the page > might have been torn on the way to disk, such that PD_ALL_VISIBLE got > set but not all of the FROZEN bits did. Good point. If we see the bit set in the visibility map set, it should be safe to infer that the PD_ALL_VISIBLE bit and all HEAP_XMIN_FROZEN bits are set. But if the visibility map bit is NOT set, we must check PD_ALL_VISIBLE and, whether it's set or not, each individual HEAP_XMIN_FROZEN bit. > 2. During an anti-wraparound vacuum, you *need to* emit a WAL record > when setting HEAP_XMIN_FROZEN. It's not a hint, any more than writing > FrozenXID is now. > > Actually, #2 isn't even good enough. What if vacuum passes over a page > and finds all the FROZEN bits set, but the reason they're set is that > somebody else updated them in hint fashion microseconds before? It > seems possible that those bits might not make it to disk before a > subsequent crash. The only way to be really sure those bits are set is > to emit a WAL record that says to set them, whether or not they seem to > be set already. While the WAL record could be small, you'd need one for > every page, making the argument that this saves I/O somewhat dubious. I think that we would only ever allow HEAP_XMIN_FROZEN to be set as part of a WAL-logged operation. Either we are marking the page all-visible - in which case we're emitting the new WAL record type XLOG_HEAP_ALLVISIBLE - or we're freezing individual tuples on a page where very old and very new tuples are intermixed - in which case we emit the existing XLOG_HEAP2_FREEZE. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 2010-12-01 at 11:25 -0500, Robert Haas wrote: > 1. Every time we observe a page as all-visible, (a) set the > PD_ALL_VISIBLE bit on the page, without bumping the LSN; ... > 2. Every time we observe a page as all-visible, (a) set the > PD_ALL_VISIBLE bit on the page, without bumping the LSN, My concern here is that both of these proposals introduce something new that is hint-bit-like, in the sense that we change a data page and potentially write it to disk without necessarily writing WAL. This will make it even harder to do CRCs in the future, which are also an important feature. I understand the motivation: we don't want to write a page to disk yet one more time. I think this is only a problem for bulk loading, where it does seem pretty ridiculous. But for cases other than bulk loading, it doesn't seem like a problem (likely, there will be other changes to the page anyway). Is it worth taking a step back, and saying that we're trying to hard to use the exact same mechanism for OLTP and bulk loading? I know bulk loading poses a number of problems. But it might be easier to make a passable bulk-loading mechanism (not necessarily a general one) than to continue to fight the WAL-before-data invariant on every new feature[1]. I'm not proposing anything in specific, just a potential re-framing of the problem. Regards,Jeff Davis [1] If we obey WAL-before-data everywhere WAL is required, the correctness can be reasonably understood by mere mortals. But when we start breaking it for special cases, observe how few people can even participate in such discussions.
On Wed, Dec 1, 2010 at 3:31 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2010-12-01 at 11:25 -0500, Robert Haas wrote: >> 1. Every time we observe a page as all-visible, (a) set the >> PD_ALL_VISIBLE bit on the page, without bumping the LSN; > > ... > >> 2. Every time we observe a page as all-visible, (a) set the >> PD_ALL_VISIBLE bit on the page, without bumping the LSN, > > My concern here is that both of these proposals introduce something new > that is hint-bit-like, in the sense that we change a data page and > potentially write it to disk without necessarily writing WAL. This will > make it even harder to do CRCs in the future, which are also an > important feature. Well, there's an intermediate concept between "logged" and "unlogged", which is an update that is WAL-logged but doesn't require torn-page protection because writing some but not all of the 512-byte sectors in the page won't break anything. We're using the LSN to track both whether the page is logged and whether the update necessitates torn-page protection. That might be fixable. As for CRCs, there's a pretty direct chain of inference here: 1. CRCs are hard (really impossible) because we have hint bits. 2. Hint bits are necessary because an old XID can't be viewed as guaranteed committed. 3. An old XID can't be viewed as guaranteed committed because we clean up aborted transactions lazily rather than eagerly. Changing (3) would amount to a rewrite of our whole MVCC architecture.If we switched from per-tuple MVCC based on XIDs toper-page MVCC based on LSNs and a rollback segment, all of this stuff would go out the window. Hint bits, gone. Anti-wraparound VACUUM, gone. CRCs, feasible. Visibility map... we might still need that, but the page-level bits go away. Of course, it would also create new problems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>  If we switched from per-tuple MVCC based on XIDs to per-page MVCC
> based on LSNs and a rollback segment, all of this stuff would go out
> the window.  Hint bits, gone.  Anti-wraparound VACUUM, gone.  CRCs,
> feasible.  Visibility map... we might still need that, but the
> page-level bits go away.
> Of course, it would also create new problems.
Yup, we've seen that proposal before.  It's called Oracle.  There's
no good reason to believe that we'd have a net win after we were done
switching over ... not to mention the likelihood that they hold a ton
of patents about particular aspects of doing things that way.
        regards, tom lane
			
		On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote: > As for CRCs, there's a pretty direct chain of inference here: > > 1. CRCs are hard (really impossible) because we have hint bits. I would disagree with "impossible". If we don't set hint bits during reading; and when we do set them, we log them (including full page writes); then we can do CRCs. Those things have costs, but we might be willing to pay them if we had a bulk loading strategy that avoids or mitigates the costs. The reason we can't do CRCs now is because hint bits violate the WAL-before-data rule; not because of hint bits themselves. We're talking about adding another feature that breaks the rule, in a more complex way than hint bits. I just wanted to step back for a second and consider the problem from a different angle before we committed to that. Regards,Jeff Davis
On Dec 1, 2010, at 2:59 PM, Robert Haas wrote: > 2. Hint bits are necessary because an old XID can't be viewed as > guaranteed committed. Hmm... I thought hint bits were necessary because it's too expensive to query CLOG for every tuple. If my understanding iscorrect then if we fix the CLOG performance issue we don't need hint bits anymore. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Wed, Dec 1, 2010 at 5:24 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote: >> As for CRCs, there's a pretty direct chain of inference here: >> >> 1. CRCs are hard (really impossible) because we have hint bits. > > I would disagree with "impossible". If we don't set hint bits during > reading; and when we do set them, we log them (including full page > writes); then we can do CRCs. > > Those things have costs, but we might be willing to pay them if we had a > bulk loading strategy that avoids or mitigates the costs. > > The reason we can't do CRCs now is because hint bits violate the > WAL-before-data rule; not because of hint bits themselves. We're talking > about adding another feature that breaks the rule, in a more complex way > than hint bits. > > I just wanted to step back for a second and consider the problem from a > different angle before we committed to that. Well, let's think about what we'd need to do to make CRCs work reliably. There are two problems. 1. Currently, hint bits are not vulnerable to the torn-page problem, because the hint bit change is to single byte, and neither of the two possible values for the affected byte invalidate the contents of the block. Thus, they do not need to be WAL-logged - we're happy if they all make it to disk, but if some or none of them make it to disk, that's OK. If we CRC the entire page, the torn pages are never acceptable, so every action that modifies the page must be WAL-logged. 2. Currently, we allow hint bits on a page to be updated while holding a shared-content lock; we also allow the page to be written while holding only a shared-content lock. This makes it a bit nondeterministic whether the hint bit update is included in the write, but we don't care. If we were to compute a CRC and write that into the page before writing it out to the OS, it would be unacceptable for the page contents to change thereafter in any way. So, to make CRCs work, we'd need to (a) WAL-log every hint bit update and (b) change either hint bit updates or page write-outs to require an exclusive content lock rather than a shared one. The first would result in an increase in I/O, while the second would result in a reduction in concurrency. Thinking about it a bit, I wonder if we couldn't mitigate (b) quite a bit by adding a new level for buffer content locks, share exclusive. This would conflict with itself and with exclusive but not with share locks, and would be required to set hint bits or write the buffer. When setting hint bits with only a share lock, we'd attempt to do a non-blocking upgrade to share exclusive. If that failed - because someone else already held a share-exclusive lock - we'd just skip the hint bit update. I have no idea what to do about (a), though. *thinks some more* Or maybe I do. One other thing I've been thinking about with regard to hint bit updates is that we might choose to mark that are hint-bit-updated as "untidy" rather than "dirty". The background writer could treat these pages as dirty, but checkpoints and backends doing desperation-buffer-reclamation could treat them as clean. This would allow hint bit updates to trickle out to disk in the background, without letting them bottleneck anything on the critical path. Maybe we could do this - if CRCs are enabled and we are the background writer cleaning scan, write dirty buffers in the usual way and write untidy buffers to a "double-write buffer" (to borrow a page from InnoDB) along with the current LSN. At the conclusion of the scan, fsync() the double-write buffer and then write the buffers a second time in the normal fashion if their mappings haven't changed and they are still untidy. On redo, when you reach an LSN recorded in the double-write buffer, restore the FPI. In general, a double-write buffer is inferior to our existing FPI system, because you end up needing to fsync both the double-write buffer and the WAL stream. But it might be OK in this case, if it's all happening as background work. -- With respect to your concerns about this method, after some thought, I think #2 isn't an issue at all, because I don't believe we can risk having our update to HEAP_XMIN_FROZEN stomped on by someone else trying to set HEAP_XMIN_COMMITTED, so I think that when making a page all-visible we'll need an exclusive (or share-exclusive) content lock anyway. As to #1, I think we could restore the WAL-before-data rules if we kept a bit somewhere in the buffer descriptor indicating whether a given buffer has had an FPI since the last checkpoint. Then, perhaps, WAL records that are torn-page-safe could bump the TLI without emitting a FPI. The next WAL record to come along would be able to determine that one was still needed. Of course, to make CRCs work with this, you still need to emit FPIs or use a double-write buffer. That sucks, and I don't know what to do about it. Since our current hint-bit updates are not WAL-logged, a CRC implementation over it could try to get by with chunking untidy buffers (either all the time or just sometimes) without actually writing them. But these updates WILL be WAL-logged, so you can't just refuse to write them after the fact. Hmm... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 1, 2010 at 6:41 PM, Jim Nasby <jim@nasby.net> wrote: > On Dec 1, 2010, at 2:59 PM, Robert Haas wrote: >> 2. Hint bits are necessary because an old XID can't be viewed as >> guaranteed committed. > > Hmm... I thought hint bits were necessary because it's too expensive to query CLOG for every tuple. If my understandingis correct then if we fix the CLOG performance issue we don't need hint bits anymore. My point is - in InnoDB, when an XID aborts, it's not removed from their equivalent of the ProcArray until it has been fully rolled back.So if you see an XID prior to GlobalXmin, you don'tneed a hint bit to tell you whether it's committed. Go directly to yes. It is also correct to say that it's too expensive to query CLOG for every tuple. But InnoDB isn't solving that by having a cheaper CLOG; they're solving it by not having CLOG at all. I doubt that it's possible to make CLOG lookups so cheap that we don't need hint bits any more. Take a look at HeapTupleSatisfiesMVCC(). Those hint bit tests are a single machine instruction. It's tough to beat that. It's tough to get within two orders of magnitude. I'd like to, but I don't see how. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Or maybe I do. One other thing I've been thinking about with regard > to hint bit updates is that we might choose to mark that are > hint-bit-updated as "untidy" rather than "dirty". The background Please review archives, you'll find the idea discussed and some patches to implement it, by Simon. I suppose you could begin here: http://archives.postgresql.org/pgsql-patches/2008-06/msg00113.php Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Dec 2, 2010 at 6:37 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Or maybe I do. One other thing I've been thinking about with regard >> to hint bit updates is that we might choose to mark that are >> hint-bit-updated as "untidy" rather than "dirty". The background > > Please review archives, you'll find the idea discussed and some patches > to implement it, by Simon. I suppose you could begin here: > > http://archives.postgresql.org/pgsql-patches/2008-06/msg00113.php Thanks for the pointer. I guess that demonstrates that good ideas will keep floating back up to the surface. It seems like the idea was met with generally positive feedback, except that most people seemed to want a slightly simpler system than what Simon was proposing. I suspect that this is mostly suffering from a lack of round tuits. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 2010-12-01 at 23:22 -0500, Robert Haas wrote: > Well, let's think about what we'd need to do to make CRCs work > reliably. There are two problems. > > 1. [...] If we CRC the entire page, the torn pages are never > acceptable, so every action that modifies the page must be WAL-logged. > > 2. Currently, we allow hint bits on a page to be updated while holding [...] The way I see it, here are the rules we are breaking, and why: * We don't get an exclusive lock when dirtying a page with hint bits - Why: we write while reading, and we want good concurrency. - Why': because after a bulk load, we don't have any hint bits, and the only way to get them set without VACUUM is to write while reading. I've never been entirely sure why VACUUM isn't good enough in this case, aside from the fact that a user might not run VACUUM (and autovacuum might not either, if it was only a bulk load and no updates/deletes). * We don't WAL log setting hint bits (which dirties a page) - Why: because after a bulk load, we don't want to write the data a 4th time Hypothetically, if we had a bulk loading strategy, these problems would go away, and we could follow the rules. Right? Is there a case other than bulk loading which demands that we break these rules? And, if we had a bulk loading path, we could probably get away with writing the data only twice (today, we write it 3 times including the hint bits) or maybe once if WAL archiving is off. So, is there a case other than bulk loading for which we need to break these rules? If not, perhaps we should consider bulk loading a different problem, and simplify the design of all of these other features (and allow new storage-touching features to come about, like CRCs, without exponentially increasing the complexity with each one). Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote: > And, if we had a bulk loading path, we could probably get away > with writing the data only twice (today, we write it 3 times > including the hint bits) or maybe once if WAL archiving is off. If you're counting WAL writes, you're low. If you don't go out of your way to avoid it, you are likely to write the data to the table once during the bulk load, a second time on first read to set the hint bits, and a third time to freeze data to prevent wrap-around. The initial write may or may not be WAL-logged. The freezing typically is WAL-logged. So, you can easily write the data to disk four or five times. With luck these are spread out uniformly or happen during off-peak periods. Unmanaged, a WAL-logged freeze of bulk-loaded data is somewhat more likely to occur, however, during hours of heavy OLTP load, because transaction numbers are consumed so quickly. Currently, a VACUUM FREEZE after a bulk load collapses at least two of those writes to one. With luck, some pages might still be dirty in cache, and you can save two of the writes. -Kevin
On Thu, Dec 2, 2010 at 2:01 PM, Jeff Davis <pgsql@j-davis.com> wrote: > * We don't get an exclusive lock when dirtying a page with hint bits > - Why: we write while reading, and we want good concurrency. > - Why': because after a bulk load, we don't have any hint bits, and the > only way to get them set without VACUUM is to write while reading. I've > never been entirely sure why VACUUM isn't good enough in this case, > aside from the fact that a user might not run VACUUM (and autovacuum > might not either, if it was only a bulk load and no updates/deletes). > > * We don't WAL log setting hint bits (which dirties a page) > - Why: because after a bulk load, we don't want to write the data a 4th > time > > Hypothetically, if we had a bulk loading strategy, these problems would > go away, and we could follow the rules. Right? Is there a case other > than bulk loading which demands that we break these rules? I'm not really convinced that this problem is confined to bulk loading. Every INSERT or UPDATE results in a new tuple that may need hit bits set and eventually to be frozen. A bulk load is just a time when you do lots of inserts all at once; it seems to me that a large update would cause all the same problems, plus bloat. The triple I/O problem exists for small transactions as well (and isn't desirable there either); it's just less noticeable because the second and third writes are, like the first one, small. > And, if we had a bulk loading path, we could probably get away with > writing the data only twice (today, we write it 3 times including the > hint bits) or maybe once if WAL archiving is off. It seems to me that a COPY command executed in a transaction with no other open snapshots writing to a table created or truncated within the same transaction should be able to write frozen tuples from the get-go, regardless of anything else we do. > So, is there a case other than bulk loading for which we need to break > these rules? If not, perhaps we should consider bulk loading a different > problem, and simplify the design of all of these other features (and > allow new storage-touching features to come about, like CRCs, without > exponentially increasing the complexity with each one). I don't think we're exponentially increasing complexity - I think we're incrementally improving our algorithms. If you want to propose a bulk loading path, great. Propose away! But without something a bit more concrete, I don't think it would be appropriate to hold off making the visibility map crash-safe, on the off chance that our design for so doing might complicate something else we want to do later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 2010-12-02 at 17:00 -0500, Robert Haas wrote: > I'm not really convinced that this problem is confined to bulk > loading. Every INSERT or UPDATE results in a new tuple that may need > hit bits set and eventually to be frozen. A bulk load is just a time > when you do lots of inserts all at once; it seems to me that a large > update would cause all the same problems, plus bloat. A big UPDATE does a lot of work, I don't see any near-term approach for solving that. Avoiding WAL (and full page writes) for the updates to PD_ALL_VISIBLE, hint bits, VM bits, etc., would probably be the least of my concerns in that case. > The triple I/O > problem exists for small transactions as well (and isn't desirable > there either); it's just less noticeable because the second and third > writes are, like the first one, small. Bulk loading poses some unique challenges because there is no opportunity to set PD_ALL_VISIBLE or hint bits before the loading is complete; and by that time, many checkpoints will have already happened, and the pages have already hit disk. That means we need to re-read them, modify them, and write them again (plus WAL, if we were following the rules). Small transactions don't suffer from the same problems. They generally begin and end without an intervening checkpoint. That means that you have an opportunity to set PD_ALL_VISIBLE or hint bits before the checkpoint happens, thus avoiding unnecessary extra writes. Additionally, small transaction workloads will generally have, to some approximation, some working set of pages. So, even if you do a read of a single tuple, write PD_ALL_VISIBLE and hint bits (thus dirtying the page), there is a reasonable chance that someone will come by later and do an insert/update/delete (thus forcing WAL anyway). And if the small transaction workload is completely random and you aren't touching the same pages between checkpoints, then setting hint bits one-at-a-time is not a good strategy anyway. It would be much better to do it in bulk with a VACUUM. And if VACUUM does anything significant to a page, it's going to WAL anyway. I'm having trouble seeing a case other than bulk-loading which causes a real problem. Maybe a small-transaction workload with a few long-running transactions? Even that doesn't seem so bad. > It seems to me that a COPY command executed in a transaction with no > other open snapshots writing to a table created or truncated within > the same transaction should be able to write frozen tuples from the > get-go, regardless of anything else we do. Well, some transaction might pick up a snapshot between the time you begin the copy and the time it commits. We'd need to prevent such a transaction from actually reading the table. > I don't think it would be appropriate to hold off > making the visibility map crash-safe, on the off chance that our > design for so doing might complicate something else we want to do > later. I'm not suggesting we hold off on it at all. To the contrary, I'm suggesting that we simply log updates of PD_ALL_VISIBLE as well as VM bits, at least until a performance problem presents itself. That will _simplify_ the design. Then, when a performance problem does present itself for a certain use case, we can see how to fix it. If many cases are affected, then we might choose one of these more creative solutions that breaks the rules in controlled ways, understanding the trade-offs. If only bulk loading is affected, we might choose to address that case directly. Regards,Jeff Davis
On Thu, Dec 2, 2010 at 6:37 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> It seems to me that a COPY command executed in a transaction with no >> other open snapshots writing to a table created or truncated within >> the same transaction should be able to write frozen tuples from the >> get-go, regardless of anything else we do. > > Well, some transaction might pick up a snapshot between the time you > begin the copy and the time it commits. We'd need to prevent such a > transaction from actually reading the table. Oh, hmm. That's awkward. I guess under present semantics it can see the table - but not its contents - once the inserting transaction has committed. That stinks. >> I don't think it would be appropriate to hold off >> making the visibility map crash-safe, on the off chance that our >> design for so doing might complicate something else we want to do >> later. > > I'm not suggesting we hold off on it at all. To the contrary, I'm > suggesting that we simply log updates of PD_ALL_VISIBLE as well as VM > bits, at least until a performance problem presents itself. That will > _simplify_ the design. > > Then, when a performance problem does present itself for a certain use > case, we can see how to fix it. If many cases are affected, then we > might choose one of these more creative solutions that breaks the rules > in controlled ways, understanding the trade-offs. If only bulk loading > is affected, we might choose to address that case directly. I don't think that you can seriously suggest that emitting that volume of FPIs isn't going to be a problem immediately. We have to have some solution to that problem out of the gate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 2010-12-02 at 19:06 -0500, Robert Haas wrote: > I don't think that you can seriously suggest that emitting that volume > of FPIs isn't going to be a problem immediately. We have to have some > solution to that problem out of the gate. Fair enough. I think you understand my point, and it's easy enough to remove complexity later if it makes sense to do so. Regards,Jeff Davis
* Robert Haas: > Those hint bit tests are a single machine instruction. It's tough > to beat that. It's tough to get within two orders of magnitude. > I'd like to, but I don't see how. For some scans, it might be possible to hoist the checks out of inner loops. (At least in principle, I'm not sure how much that would interfere with the executor architecture.) -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
On 2010-11-30 05:57, Robert Haas wrote: > Last week, I posted a couple of possible designs for making the > visibility map crash-safe, which did not elicit much comment. Since > this is an important prerequisite to index-only scans, I'm trying > again. The logic seems to be: * If the visibillity map should be crash-safe if should be WAL-logged. * PD_ALL_VISIBLE is currently not being WAL-logged when vacuum sets it. * WAL logging the visibillity map bit is not "that" bad (small size). * WAL-logging the PD_ALL_VISIBLE bit would can WAL-record for the entire relation to be written out (potentially huge). Would the problem not be solved by not "trying to keep the two bits in sync" but simply removing the PD_ALL_VISIBLE bit in the page-header in favor for the bit in the visibillity map, that is now WAL-logged and thus safe to trust? Then vacuum could emit WAL records for setting the visibillity map bits, combined with changes on the page could clear it? The question probably boils down to: Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit serve? I've probably just missed some logic? Having index-only scans per-table ajustable would make quite some sense.. I have a couple of tables with a high turn-over rate that never get out of the OS-cache anyway, the benefit of index-only scans are quite small, especially if they come with additional overhead on INSERT/UPDATE/DELETE operations, whereas I also have huge tables with a very small amount of changes. Just the saved IO of not having to go to the heap in some cases would be good. I could see some benefits in having the index-only scan work on tuple-level visibillity information and not page-level, but that would require a bigger map (allthough still less than 1% of the heap size if my calculations are correct), but would enable visibillity testing of a tuple without going to the heap even other (unrelated) changes happend on the same page. Jesper -- Jesper
On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh <jesper@krogh.cc> wrote: > Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit > serve? If we modify a page on which PD_ALL_VISIBLE isn't set, we don't attempt to update the visibility map. In theory, this is an important optimization to reduce contention on the visibility map page, since there are something like 64K heap pages per visibility map page. In practice, I'm not sure in what workloads it matters or by how much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2011-01-06 03:10, Robert Haas wrote: > On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh<jesper@krogh.cc> wrote: >> Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit >> serve? > If we modify a page on which PD_ALL_VISIBLE isn't set, we don't > attempt to update the visibility map. In theory, this is an important > optimization to reduce contention on the visibility map page, since > there are something like 64K heap pages per visibility map page. In > practice, I'm not sure in what workloads it matters or by how much If that is the only reason, I would suggest just making the visibillity map pages more sparse. If you just had 500 or 1K heap pages per visibillity map page, then it would not change a thing for the performance. With 1K heap pages per VM map page a relation with 100GB of data would have a VM map of the VM map of 100MB, even 100 heap pages per VM-map page would still enable the database to have visibillity information of 100GB data stored in 1GB of memory. But it would need testing and benchmarking to find out. Jesper -- Jesper
On Jan 5, 2011, at 8:10 PM, Robert Haas wrote: > On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh <jesper@krogh.cc> wrote: >> Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit >> serve? > > If we modify a page on which PD_ALL_VISIBLE isn't set, we don't > attempt to update the visibility map. In theory, this is an important > optimization to reduce contention on the visibility map page, since > there are something like 64K heap pages per visibility map page. In > practice, I'm not sure in what workloads it matters or by how much. What specific locking are you worried about? The page locks themselves? Isn't changing the bit essentially a single instructionoperation? This is sounding like premature optimization... ;) -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Fri, Jan 7, 2011 at 1:28 PM, Jim Nasby <jim@nasby.net> wrote: > On Jan 5, 2011, at 8:10 PM, Robert Haas wrote: >> On Wed, Jan 5, 2011 at 3:22 PM, Jesper Krogh <jesper@krogh.cc> wrote: >>> Given a crash-safe visibility map, what purpuse does the PD_ALL_VISIBLE bit >>> serve? >> >> If we modify a page on which PD_ALL_VISIBLE isn't set, we don't >> attempt to update the visibility map. In theory, this is an important >> optimization to reduce contention on the visibility map page, since >> there are something like 64K heap pages per visibility map page. In >> practice, I'm not sure in what workloads it matters or by how much. > > What specific locking are you worried about? The page locks themselves? Isn't changing the bit essentially a single instructionoperation? > > This is sounding like premature optimization... ;) I'm not quite invested enough in this to get worried about it, but if I were, I'd probably start with the buffer content lock, and move on to the buffer header spinlock and the buf mapping locks. Changing the bit is a single instruction once you've got the page pinned and locked, but that's not free. (And even if you could hypothetically figure out some clever lock-free algorithm to avoid some of this work, there's still going to be cache line contention, which is quite expensive as it turns out. See the relatively recent discussions of why our backend startup cost is so high.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company