Re: Savepoints
From: Bruce Momjian
Subject: Re: Savepoints
Date:
Msg-id: 200201241922.g0OJMJf13378@candle.pha.pa.us
In reply to: Re: Savepoints ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>)
List: pgsql-hackers
OK, I have had time to think about this, and I think I can put the two proposals into perspective. I will use Vadim's terminology.

In our current setup, rollback/undo data is kept in the same file as our live data. This data is used for two purposes: first, for rollback of transactions (and perhaps subtransactions in the future), and second, for MVCC visibility for backends making changes. So the real question is whether a database modification should write the old data into a separate rollback segment and modify the heap data in place, or just create a new row and require the old row to be removed later by vacuum.

Let's look at this behavior without MVCC. In that case, if someone tries to read a modified row, it blocks and waits for the modifying backend to commit or roll back, at which point it continues. There is no reason for the waiting transaction to read the old data in the redo segment because it can't continue anyway. With MVCC, however, the backend has to read through the redo segment to get the original data value for that row.

Now, while rollback segments do help with cleaning out old UPDATE rows, how do they improve DELETE performance? It seems a DELETE would just mark the row as expired, like we do now.

One objection I always had to redo segments was that if I start a transaction in the morning and walk away, none of the redo segments can be recycled. I was going to ask if we could force some type of redo segment compaction to keep old active rows and delete rows no longer visible to any transaction. However, I now realize that our VACUUM has the same problem: tuples with XID >= GetOldestXmin() are not recycled, meaning we have this problem in our current implementation too. (I wonder if our vacuum could be smarter about knowing which rows are visible, perhaps by creating a sorted list of xid's and doing a binary search on the list to determine visibility.)
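[Editor's note: the sorted-xid idea in the parenthetical above could be sketched roughly as below. This is a minimal illustration in C under assumed snapshot semantics; the names `xid_in_progress` and `row_visible` and the snapshot fields are hypothetical, not PostgreSQL's actual API.]

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Binary search over a sorted snapshot of in-progress xids.
   If the xid is found, that transaction was still running when the
   snapshot was taken, so its changes are not visible to us. */
static bool
xid_in_progress(TransactionId xid, const TransactionId *xips, size_t n)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (xips[mid] == xid)
            return true;
        else if (xips[mid] < xid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}

/* A row's creating transaction is visible if it committed before the
   snapshot: older than every running xid (xmin), or below the snapshot
   horizon (xmax) and not in the in-progress list. */
static bool
row_visible(TransactionId row_xid,
            TransactionId xmin, TransactionId xmax,
            const TransactionId *xips, size_t n)
{
    if (row_xid < xmin)
        return true;            /* committed before all running TXs */
    if (row_xid >= xmax)
        return false;           /* started after the snapshot was taken */
    return !xid_in_progress(row_xid, xips, n);
}
```

With the list sorted once per snapshot, each visibility check costs O(log n) instead of a linear scan over running transactions.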
So, I guess the issue is: do we want to keep redo information in the main table, or split it out into redo segments? Certainly we would have to eliminate the Oracle restriction that redo segment size is fixed at install time.

The advantage of a redo segment is that, hopefully, transactions don't read through irrelevant undo information. The disadvantage is that right now we have redo information grouped into table files where a sequential scan can be performed (index scans of redo info are a performance problem currently); we would have to somehow efficiently access redo information once it is grouped into the redo segments. Perhaps a hash based on relid would help here.

Another disadvantage is concurrency. When we start modifying heap data in place, we have to prevent other backends from seeing that modification while we move the old data to the redo segment.

I guess my feeling is that if we can get vacuum to happen automatically, how is our current non-overwriting storage manager different from redo segments? One big advantage of redo segments would be that right now, if someone updates a row repeatedly, there are lots of heap versions of the row that are difficult to shrink in the table, while if they are in the redo segments, we can remove them more efficiently, and there is only one heap row.

How is recovery handled with rollback segments? Do we write old and new data to WAL? We just write new data to WAL now, right? Do we fsync rollback segments?

Have I outlined this accurately?

---------------------------------------------------------------------------

Mikheev, Vadim wrote:

> > > How about: use overwriting smgr + put old records into rollback
> > > segments - RS - (you have to keep them somewhere till TX's running
> > > anyway) + use WAL only as REDO log (RS will be used to rollback TX'
> > > changes and WAL will be used for RS/data files recovery).
> > > Something like what Oracle does.
> >
> > I am sorry. I see what you are saying now.
> > I missed the words "overwriting smgr". You are suggesting going to
> > an overwriting storage manager. Is this to be done only because of
> > savepoints?
>
> And I'm sorry for missing your notes about storing relid+tid only.
>
> No. One point I made a few months ago (and never got objections)
> is: why keep old data in data files sooooo long?
> Imagine a long-running TX (eg pg_dump). Why must other TX-s read
> again and again completely useless (for them) old data we keep
> for pg_dump?
>
> > Doesn't seem worth it when I have a possible solution without
> > such a drastic change.
> > Also, overwriting storage manager will require MVCC to read
> > through there to get accurate MVCC visibility, right?
>
> Right... just like now the non-overwriting smgr requires *ALL*
> TX-s to read old data in data files. But with an overwriting smgr
> a TX will read the RS only when it is required and only as far
> (much) as it is required.
>
> Simple solutions are not always the best ones.
> Compare Oracle and InterBase. Both have MVCC.
> Smgr-s are different. Which RDBMS is more cool?
> Why doesn't Oracle use the simpler non-overwriting smgr
> (as InterBase... and we do)?
>
> Vadim

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,  |  830 Blythe Avenue
  +  Christ can be your backup.     |  Drexel Hill, Pennsylvania 19026
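[Editor's note: the "hash based on relid" idea Bruce floats for locating a relation's undo records inside a rollback segment could be sketched as below. This is a hypothetical illustration, not anything from PostgreSQL or Oracle; `UndoRecord`, `rs_remember`, and `rs_lookup` are invented names, and the rollback segment is reduced to an offset.]

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

#define RS_HASH_BUCKETS 256

typedef uint32_t Oid;

/* Each bucket heads a chain of undo records whose relid hashes there;
   newest record first, so a lookup finds the most recent old version. */
typedef struct UndoRecord
{
    Oid                relid;      /* relation the old tuple belonged to */
    uint64_t           rs_offset;  /* where the old version lives in the RS */
    struct UndoRecord *next;
} UndoRecord;

static UndoRecord *rs_buckets[RS_HASH_BUCKETS];

static unsigned
rs_hash(Oid relid)
{
    /* Knuth multiplicative hash; top 8 bits select one of 256 buckets. */
    return (unsigned) ((relid * 2654435761u) >> 24);
}

/* Record that a tuple of 'relid' had its old version moved to 'off'. */
static void
rs_remember(Oid relid, uint64_t off)
{
    UndoRecord *rec = malloc(sizeof(UndoRecord));
    unsigned    h = rs_hash(relid);

    rec->relid = relid;
    rec->rs_offset = off;
    rec->next = rs_buckets[h];
    rs_buckets[h] = rec;
}

/* Find the newest undo record for a relation, or NULL if none. */
static UndoRecord *
rs_lookup(Oid relid)
{
    for (UndoRecord *r = rs_buckets[rs_hash(relid)]; r != NULL; r = r->next)
        if (r->relid == relid)
            return r;
    return NULL;
}
```

The point of the hash is that a backend needing an old version for MVCC can jump straight to its relation's undo chain instead of scanning the whole rollback segment.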