Discussion: Volunteer: Large Tuples / Tuple chaining
Hello,

I'll donate some (read all freely available) of my spare time to
implementing tuple chaining. It looks like this feature is most wanted and
it would be a pity to hold this until post 7.0. Personally I don't need it,
yet ... But I will definitely find a use for it once available ;-) And it
looks like a good start for hacking on pgsql.

I already dived into the depth of pgsql's page and tuple structures and
it looks like it is possible. But before I start coding I would like to
hear some more experienced opinions on how to implement it.

Did you already discuss technical matters about the implementation? How
can I get in touch with it? (Simply browse the mailing list archives?)

Here's a layout of how I imagine the work:

What is needed:
- lay out a tuple continuation structure
- put a tuple into multiple chunks when pages are considered, reconcile
  when loaded from disk
  (how to continue a tuple - need a structure)
  How is a tuple (read: page item) addressed? ItemPointerData
  I imagine storing a continuation address as the last bytes of the
  tuple unless it fits into one page.
  I need to mark large tuples (how? just one flag in the tuple)
  How to tell a maximum possible size last block from a continued one
  (which carries a pointer to the next one at its end)?
  Or don't care: make the item continued and put the last 6(?) bytes into
  a new block
- note that the continued tuples are not referenced directly (vacuum?),
  mark them as used. I hope vacuum operates on a tuple basis and has no
  concept of pages
- I guess that the tuple pointer points into page memory; if multiple
  pages are concatenated for a tuple, these pages must not reside in
  memory but the full tuple's memory must be allocated (from a memory
  similar to pages) (shared mem?)
- should be possible for memory-only pages
  see PageGetPageSize, but od_pagesize is 16 bit!
  Reuse another variable?
  Another type of page? (32 bit od_pagesize)

Very fascinated by this large beast of ancient code to explore
Christof

PS: I think the documentation on page layout is far outdated (or points
into the future, since it speaks about ItemContinuationData structures).
Should I update it?
The table doesn't match the actual structure components. At least I don't
understand what it's about. The source code mentions a different page
layout.

PPS: Do not pity me, I have ten+ years of coding experience in C.

PPPS: Could someone tell me in a few words what an access method is (a
tuple is an access method, log pages are another?)
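Editor's sketch of the continuation idea in the message above, under stated assumptions: the names ChunkLink, make_link, link_block and chunks_needed are hypothetical and are not PostgreSQL definitions. The 6-byte link only mirrors the shape of ItemPointerData (a block number stored as two 16-bit halves plus a 16-bit line pointer number), and the rule "every item except the last gives up its final 6 bytes to the link" is one possible reading of the proposal, not an agreed design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical continuation address stored at the end of every chunk
 * except the last one.  Shaped like ItemPointerData, but not taken from
 * the PostgreSQL sources.
 */
typedef struct ChunkLink
{
    uint16_t blk_hi;    /* high 16 bits of the next chunk's block number */
    uint16_t blk_lo;    /* low 16 bits of the next chunk's block number */
    uint16_t offset;    /* line pointer number inside that block */
} ChunkLink;

#define CHUNK_LINK_SIZE ((size_t) 6)

/* Build a link from a 32-bit block number and a line pointer number. */
ChunkLink make_link(uint32_t block, uint16_t offset)
{
    ChunkLink link;

    link.blk_hi = (uint16_t) (block >> 16);
    link.blk_lo = (uint16_t) (block & 0xFFFF);
    link.offset = offset;
    return link;
}

/* Recover the 32-bit block number from a link. */
uint32_t link_block(ChunkLink link)
{
    return ((uint32_t) link.blk_hi << 16) | link.blk_lo;
}

/*
 * Number of items needed for a tuple of tuple_len bytes when one item can
 * hold item_capacity bytes.  A tuple that fits is stored as a single,
 * unlinked item; otherwise each non-final item carries
 * item_capacity - CHUNK_LINK_SIZE payload bytes followed by the link.
 */
size_t chunks_needed(size_t tuple_len, size_t item_capacity)
{
    size_t payload;
    size_t remaining = tuple_len;
    size_t n = 1;

    assert(item_capacity > CHUNK_LINK_SIZE);
    payload = item_capacity - CHUNK_LINK_SIZE;
    while (remaining > item_capacity)
    {
        remaining -= payload;   /* this chunk is full and carries a link */
        n++;
    }
    return n;
}
```

With this rule a tuple exactly as large as one item needs no link, while a tuple one byte larger spills its last few bytes into a second item. That is the "maximum possible size last block" case the message asks about, and it shows why a separate "continued" flag is still needed to tell the two apart.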
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Christof Petig
>
> Hello,
>
> I'll donate some (read all freely available) of my spare time to
> implementing tuple chaining. It looks like this feature is most wanted
> and it would be a pity to hold this until post 7.0. Personally I don't
> need it, yet ... But I will definitely find a use for it once available
> ;-) And it looks like a good start for hacking on pgsql.
>
> I already dived into the depth of pgsql's page and tuple structures and
> it looks like it is possible. But before I start coding I would like to
> hear some more experienced opinions on how to implement it.
>

Will you put a long tuple into a long logical page (continued multiple
physical(?) pages)?
I'm suspicious about the way that allows non-page-formatted pages.

Anyway, it would need a big change around bufmgr/smgr etc.
Could someone estimate the influence/danger before going forward?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Thanks. Seems like Jan is going to be doing this.

> Hello,
>
> I'll donate some (read all freely available) of my spare time to
> implementing tuple chaining. It looks like this feature is most wanted
> and it would be a pity to hold this until post 7.0.
> [...]

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Hiroshi Inoue wrote:
>
> Will you put a long tuple into a long logical page (continued multiple
> physical(?) pages)?
> I'm suspicious about the way that allows non-page-formatted pages.
>
> Anyway, it would need a big change around bufmgr/smgr etc.
> Could someone estimate the influence/danger before going forward?
>

I planned to use as many of PostgreSQL data structures unaltered as
possible. Storing one Tuple in multiple Items should not pose too much
danger on bufmgr and smgr unless they access tuple internals. (I didn't
check that yet). This would mean that on disk Items no longer
correspond to Tuples. (Some of them might form one tuple).

I dropped the plan of unformatted pages very early on. But the issue of
in-memory tuple storage remains (I don't know the internals of
allocating/freeing yet).

Christof
> -----Original Message-----
> From: christof@to.wtal.de [mailto:christof@to.wtal.de]On Behalf Of
> Christof Petig
>
> Hiroshi Inoue wrote:
> >
> > Will you put a long tuple into a long logical page (continued multiple
> > physical(?) pages)?
> > I'm suspicious about the way that allows non-page-formatted pages.
> >
> > Anyway, it would need a big change around bufmgr/smgr etc.
> > Could someone estimate the influence/danger before going forward?
> >
>
> I planned to use as many of PostgreSQL data structures unaltered as
> possible. Storing one Tuple in multiple Items should not pose too much
> danger on bufmgr and smgr unless they access tuple internals. (I didn't
> check that yet). This would mean that on disk Items no longer
> correspond to Tuples. (Some of them might form one tuple).
>

Hmm, we have discussed LONG.
Change by LONG is transparent to users and would resolve
the big tuple problem mostly.
I doubt that tuple chaining is worth the work now.

At least a consensus is needed before going, I think.
Bad design would only introduce confusion.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> > I planned to use as many of PostgreSQL data structures unaltered as
> > possible. Storing one Tuple in multiple Items should not pose too much
> > danger on bufmgr and smgr unless they access tuple internals. (I didn't
> > check that yet). This would mean that on disk Items no longer
> > correspond to Tuples. (Some of them might form one tuple).
> >
>
> Hmm, we have discussed LONG.
> Change by LONG is transparent to users and would resolve
> the big tuple problem mostly.
> I doubt that tuple chaining is worth the work now.
>
> At least a consensus is needed before going, I think.
> Bad design would only introduce confusion.

Agreed.
Bruce Momjian wrote:

> > > I planned to use as many of PostgreSQL data structures unaltered as
> > > possible. Storing one Tuple in multiple Items should not pose too much
> > > danger on bufmgr and smgr unless they access tuple internals. (I didn't
> > > check that yet). This would mean that on disk Items no longer
> > > correspond to Tuples. (Some of them might form one tuple).
> > >
> >
> > Hmm, we have discussed LONG.
> > Change by LONG is transparent to users and would resolve
> > the big tuple problem mostly.
> > I doubt that tuple chaining is worth the work now.
> >
> > At least a consensus is needed before going, I think.
> > Bad design would only introduce confusion.
>
> Agreed.

Me too.

I think that only a combination of LONG attributes and split
tuples will be a complete solution.

What I'm worried about is making the segments of a large
tuple specialized things in the main table. The reliability
of vacuum is one of the most important things for any system
in production. While the general operation of vacuum seems to
be well known, its requirements for atomicity of some actions
appear to be less so. The more chunks a tuple consists of,
the more likely an abort of vacuum in the middle of moving
them becomes. So keeping the links of chained tuples intact
in a fail-safe way is IMHO an issue that is a little
underestimated in this discussion.

Maybe we can split tuples in another way, must think about it
for another hour - 'til later.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
wieck@debis.com (Jan Wieck) writes:
> I think that only a combination of LONG attributes and split
> tuples will be a complete solution.

If we can do a good job with long attributes, I really think we
will not need to have split tuples too. You'd be able to put perhaps
400 LONG attributes into an 8K tuple, more than that if they are float8
or int or bool attributes. If someone needs tables with even more
columns than that, they could bump BLCKSZ up to 32K and quadruple
the number of columns. How many people are really going to be bumping
into that limit? Is it worth the work and reliability risk to support
long tuples for a few applications that are about three sigmas out on
the bell curve? I doubt it.

I think the effort this would take would be *much* more profitably
spent on tuning the LONG-attribute support. If we can make that
fast and robust, we will have very few complaints.

			regards, tom lane
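Editor's sketch of the arithmetic behind the estimate above. The function max_columns is hypothetical, and the 20-byte size of an in-tuple LONG reference is an assumption back-derived from the "perhaps 400" figure (8192 / 400 is roughly 20); the thread does not state the real reference size. Header overhead is a parameter and is set to zero in the usage note, which makes the result an upper bound.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound on the number of fixed-width columns that fit into one
 * block.  overhead stands for page and tuple header bytes;
 * bytes_per_column is the in-tuple width of one attribute (for a LONG
 * attribute, the assumed size of its reference).
 */
size_t max_columns(size_t block_size, size_t overhead, size_t bytes_per_column)
{
    assert(bytes_per_column > 0);
    if (overhead >= block_size)
        return 0;               /* nothing left for attribute data */
    return (block_size - overhead) / bytes_per_column;
}
```

Under these assumptions max_columns(8192, 0, 20) gives 409 and max_columns(32768, 0, 20) gives 1638, about four times as many, which matches the "quadruple the number of columns" remark. With 8-byte float8 attributes, max_columns(8192, 0, 8) gives 1024.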
Tom Lane wrote:

> wieck@debis.com (Jan Wieck) writes:
> > I think that only a combination of LONG attributes and split
> > tuples will be a complete solution.
>
> If we can do a good job with long attributes, I really think we
> will not need to have split tuples too.

I really hope so, because there will be very severe problems
coming up with a real tuple split at arbitrary cut points
that can occur somewhere in the middle of an attribute.
Arbitrary cut points are the only way to support single
values over BLCKSZ.

Just to name one problem: the scan key tests during
heap_getnext() are handed down into heapgettup() and
performed with HeapTupleSatisfies, a macro using the in-buffer
tuple here. IIRC it was turned into a macro in one of our
last releases for performance reasons. If now faced with a
tuple living in multiple pages, these checks will need to
reconstruct the tuple in memory, to concatenate the
attributes again. This now needs to lock multiple buffers
at once during heapgettup(), where I'm not sure whether they
must all stay with the bumped refcount when returning the
tuple or not. So ReleaseBuffer() might need to be changed
into something where the HeapTuple remembers all the buffers
that were locked for it.

Also, this separate ReleaseBuffer() reminds me that there are
some places in the backend that assume a tuple returned by
the heap AM always is in a buffer! But that can't be true any
more, because a buffer always has BLCKSZ.

> I think the effort this would take would be *much* more profitably
> spent on tuning the LONG-attribute support. If we can make that
> fast and robust, we will have very few complaints.

*MUCH*!

Jan
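Editor's sketch of the bookkeeping problem described above: a tuple spread over several pages has to be copied into contiguous memory, and the result has to remember every buffer that was held for it so that all of them can be released later. All names here (Chunk, AssembledTuple, assemble_tuple, release_tuple, count_release) are hypothetical. The release callback only stands in for whatever ReleaseBuffer() would turn into, and plain malloc() stands in for the backend's own memory management.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One piece of a chained tuple as found in some buffer (hypothetical). */
typedef struct Chunk
{
    const char *data;       /* payload bytes inside the buffer */
    size_t      len;        /* number of payload bytes */
    int         buffer_id;  /* buffer that holds this chunk */
} Chunk;

/* Reassembled tuple plus the list of buffers held for it (hypothetical). */
typedef struct AssembledTuple
{
    char   *data;
    size_t  len;
    int    *buffers;
    size_t  nbuffers;
} AssembledTuple;

/* Copy all chunks into one allocation; returns 0 on success, -1 on OOM. */
int assemble_tuple(const Chunk *chunks, size_t nchunks, AssembledTuple *out)
{
    size_t total = 0;
    size_t pos = 0;
    size_t i;

    assert(out != NULL);
    for (i = 0; i < nchunks; i++)
        total += chunks[i].len;

    out->data = (char *) malloc(total > 0 ? total : 1);
    out->buffers = (int *) malloc((nchunks > 0 ? nchunks : 1) * sizeof(int));
    if (out->data == NULL || out->buffers == NULL)
    {
        free(out->data);
        free(out->buffers);
        out->data = NULL;
        out->buffers = NULL;
        return -1;
    }

    for (i = 0; i < nchunks; i++)
    {
        memcpy(out->data + pos, chunks[i].data, chunks[i].len);
        pos += chunks[i].len;
        out->buffers[i] = chunks[i].buffer_id;  /* remember for release */
    }
    out->len = total;
    out->nbuffers = nchunks;
    return 0;
}

/* Release every remembered buffer, then free the assembled copy. */
void release_tuple(AssembledTuple *t, void (*release_buffer)(int))
{
    size_t i;

    for (i = 0; i < t->nbuffers; i++)
        release_buffer(t->buffers[i]);
    free(t->data);
    free(t->buffers);
    t->data = NULL;
    t->buffers = NULL;
    t->len = 0;
    t->nbuffers = 0;
}

/* Stand-in release callback that only counts calls. */
int released_count = 0;

void count_release(int buffer_id)
{
    (void) buffer_id;
    released_count++;
}
```

The sketch also makes the second concern visible: the assembled tuple lives in separately allocated memory and not in any single buffer, so code that assumes "a tuple returned by the heap AM is always in a buffer" would no longer hold.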
> -----Original Message-----
> From: Jan Wieck [mailto:wieck@debis.com]
> Sent: Wednesday, December 15, 1999 3:45 AM
>
> Bruce Momjian wrote:
>
> > > > I planned to use as many of PostgreSQL data structures unaltered as
> > > > possible. Storing one Tuple in multiple Items should not pose too much
> > > > danger on bufmgr and smgr unless they access tuple internals. (I didn't
> > > > check that yet). This would mean that on disk Items no longer
> > > > correspond to Tuples. (Some of them might form one tuple).
> > > >
> > >
> > > Hmm, we have discussed LONG.
> > > Change by LONG is transparent to users and would resolve
> > > the big tuple problem mostly.
> > > I doubt that tuple chaining is worth the work now.
> > >
> > > At least a consensus is needed before going, I think.
> > > Bad design would only introduce confusion.
> >
> > Agreed.
>
> Me too.
>
> I think that only a combination of LONG attributes and split
> tuples will be a complete solution.
>
> What I'm worried about is making the segments of a large
> tuple specialized things in the main table. The reliability
> of vacuum is one of the most important things for any system
> in production. While the general operation of vacuum seems to
> be well known, its requirements for atomicity of some actions
> appear to be less so. The more chunks a tuple consists of,
> the more likely an abort of vacuum in the middle of moving
> them becomes. So keeping the links of chained tuples intact
> in a fail-safe way is IMHO an issue that is a little
> underestimated in this discussion.
>

There exists another related problem.
Vacuum could hardly move big tuples if some tuples on each page live long.
Though we have to move a long tuple at once, there won't be so many
clean pages.

Probably vacuum couldn't move even an 8K tuple in some cases.
The problem is already there, more or less.

But it seems very difficult to solve this problem without giving up
preserving consistency in case of a crash.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Remember, chaining tuples had all sorts of performance, vacuum, code
handling, and UPDATE problems. They buy us very little, and almost
nothing if we have LONG tables.
Bruce Momjian wrote:
>
> Remember, chaining tuples had all sorts of performance, vacuum, code
> handling, and UPDATE problems. They buy us very little, and almost
> nothing if we have LONG tables.
>

I had already contacted Jan by private email. Since we share country,
native language and time zone, this is even the most comfortable way. I
agree with the concerns you mailed and will (most likely) start helping
Jan to implement LONG.

As I saw your LONG discussion only _after_ my original post, this was a
strange coincidence. But I have been following it with interest.

Christof