Discussion: Arbitrary tuple size
Well, doing arbitrary tuple size should be as generic as possible. Thus I think the best place to do it is down in the heapam routines (heap_fetch(), heap_getnext(), heap_insert(), ...). I'm not 100% sure, but nothing should access a heap relation by going around them. Anyway, if there are places that do, then it's time to clean them up.

What about adding one more ItemPointerData to the tuple header which holds the ctid of a DATA continuation tuple? If a tuple doesn't fit into one block, this will tell where to get the next chunk of tuple data, building a chain until an invalid ctid is found. The continuation tuples can have a negative t_natts to be easily identified and ignored by scanning routines.

By doing it this way we could also squeeze out some currently wasted space. All tuples that get inserted/updated are added to the end of the relation. If a tuple currently doesn't fit into the free space of the actual last block, that free space is wasted and the tuple is placed into a newly allocated block at the end. So if there is 5K of free space and another 5.5K tuple is added, the relation grows effectively by 10.5K!

I'm not sure how to handle this with vacuum, but I believe Vadim is able to put some well placed goto's that make it.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
Jan Wieck wrote:
> What about adding one more ItemPointerData to the tuple
> header which holds the ctid of a DATA continuation tuple.

Oh no. Fortunately we need not do this: we can just add a new flag to t_infomask and add the continuation tid at the end of the tuple chunk. Ok?

> If a tuple doesn't fit into one block, this will tell where to
> get the next chunk of tuple data, building a chain until an
> invalid ctid is found. The continuation tuples can have a
> negative t_natts to be easily identified and ignored by
> scanning routines.
> ...
> I'm not sure how to handle this with vacuum, but I believe
> Vadim is able to put some well placed goto's that make it.

-:)))
Ok, ok - I have a great number of goto-s in my pocket -:)

Vadim
> Well, doing arbitrary tuple size should be as generic as possible.
> Thus I think the best place to do it is down in the heapam
> routines (heap_fetch(), heap_getnext(), heap_insert(), ...).
> ...
> What about adding one more ItemPointerData to the tuple
> header which holds the ctid of a DATA continuation tuple. If
> a tuple doesn't fit into one block, this will tell where to
> get the next chunk of tuple data, building a chain until an
> invalid ctid is found.
> ...
> I'm not sure how to handle this with vacuum, but I believe
> Vadim is able to put some well placed goto's that make it.

I agree this is the way to go. There is nothing I can think of that is limited to how large a tuple can be. It is just accessing it from the heap routines that is the problem. If the tuple is alloc'ed to be used, we can paste together the parts on disk and return one tuple. If they are accessing the buffer copy directly, we would have to be smart about going off the end of the disk copy and moving to the next segment. The code is very clear now about accessing tuples or tuple copies.

Hopefully locking will not be an issue because you only need to lock the main tuple. No one is going to see the secondary part of the tuple.

If Vadim can do MVCC, he certainly can handle this, with the help of goto.
:-)

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Jan Wieck wrote:
> > What about adding one more ItemPointerData to the tuple
> > header which holds the ctid of a DATA continuation tuple.
>
> Oh no. Fortunately we need not do this: we can just add a new flag
> to t_infomask and add the continuation tid at the end of the tuple chunk.
> Ok?

Sounds good. I would rather not add stuff to the tuple header if we can avoid it.

> > I'm not sure how to handle this with vacuum, but I believe
> > Vadim is able to put some well placed goto's that make it.
>
> -:)))
> Ok, ok - I have a great number of goto-s in my pocket -:)

I can send you more.

--
  Bruce Momjian
Bruce Momjian wrote:
> > -:)))
> > Ok, ok - I have a great number of goto-s in my pocket -:)
>
> I can send you more.

I have some cheap, spare longjmp()'s over here - anyone need them?

:-)

Jan
Bruce Momjian wrote:
> > What about adding one more ItemPointerData to the tuple
> > header which holds the ctid of a DATA continuation tuple. If
> > a tuple doesn't fit into one block, this will tell where to
> > get the next chunk of tuple data, building a chain until an
> > invalid ctid is found. The continuation tuples can have a
> > negative t_natts to be easily identified and ignored by
> > scanning routines.

Yes, Vadim, putting a flag into the bits already there to tell it is much better. The information that a particular tuple is an extension tuple should also go there instead of misusing t_natts.

> I agree this is the way to go. There is nothing I can think of that is
> limited to how large a tuple can be. It is just accessing it from the
> heap routines that is the problem. If the tuple is alloc'ed to be used,
> we can paste together the parts on disk and return one tuple. If they
> are accessing the buffer copy directly, we would have to be smart about
> going off the end of the disk copy and moving to the next segment.

Who's accessing tuple attributes directly inside the buffer copy (not only the header, which will still be unsplit at the top of the chain)? Aren't the situations where it is done restricted to system catalogs?

I think we can live with the restriction that the tuple split will not be available for system relations, because the only place where the limit hits us is pg_rewrite, and that can be handled by redesigning the storage of rules, which is already required by the rule recompilation TODO item. I can't think that anywhere in the code a buffer from a user relation (except for sequences, and that's another story) is accessed so clumsily.

> The code is very clear now about accessing tuples or tuple copies.
> Hopefully locking will not be an issue because you only need to lock the
> main tuple. No one is going to see the secondary part of the tuple.
Jan
Bruce Momjian wrote:
> I agree this is the way to go. There is nothing I can think of that is
> limited to how large a tuple can be.

Ouch - I can.

Having an index on a varlen field that now doesn't fit any more into an index block. Wouldn't this cause problems? Well, it's bad database design to index fields that will receive such long data, because indexing them will blow up the database, but it must work anyway.

Jan
wieck@debis.com (Jan Wieck) writes:
> Bruce Momjian wrote:
>> I agree this is the way to go. There is nothing I can think of that is
>> limited to how large a tuple can be.

> Ouch - I can.

> Having an index on a varlen field that now doesn't fit any
> more into an index block. Wouldn't this cause problems?

Aren't index tuples still tuples? Can't they be split just like regular tuples?

regards, tom lane
Tom Lane wrote:
> Aren't index tuples still tuples? Can't they be split just like
> regular tuples?

Don't know, maybe.

While looking for some places where tuple data might be accessed directly inside of the buffers, I've searched for WriteBuffer() and friends. These are mostly used in the index access methods and some other places where I expected them, so the index AM's at least have to be carefully visited when implementing tuple split.

Jan
I wrote:
> Tom Lane wrote:
> > Aren't index tuples still tuples? Can't they be split just like
> > regular tuples?
>
> Don't know, maybe.

Actually we have some problems with indices on text attributes when the content exceeds HALF of the blocksize:

    FATAL 1: btree: failed to add item to the page

It crashes the backend AND seems to corrupt the index! Looks to me that at least the btree code needs to be able to store a minimum of two items in one block, and painfully fails if it can't.

And just another one:

    pgsql=> create table t1 (a int4, b char(4000));
    CREATE
    pgsql=> create index t1_b on t1 (b);
    CREATE
    pgsql=> insert into t1 values (1, 'a');

    TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
          File: "nbtinsert.c", Line: 361)

Bruce: One more TODO item!

Jan
Going toward >8k tuples would be really good, but I suspect we may have some difficulties with the LO stuff once we implement it. Also, it seems that it's not worth it to adapt LOs to the newly designed tuples. I think the design of the current LOs is so broken that we need to redesign them.

o it's slow: accessing a LO needs an open() that is not cheap; creating
  many LOs makes the data/base/DBNAME/ directory fat.

o it consumes lots of i-nodes

o it breaks the tuple abstraction: this makes the code difficult to
  maintain.

I would propose the following for the new version of LO:

o create a new data type that represents the LO

o when defining the LO data type in a table, it actually points to a
  LO "body" in another place where it is physically stored.

o the storage for LO bodies would be a hidden table that contains
  several LOs, not a single one.

o we can have several tables for the LO bodies. Probably a LO body
  table for each corresponding table (where the LO data type is defined)
  is appropriate.

o it would be nice to place a LO table on a separate
  directory/partition from the original table where the LO data type is
  defined, since a LO body table could become huge.

Comments? Opinions?
---
Tatsuo Ishii
Jan Wieck wrote:
> Bruce Momjian wrote:
> > I agree this is the way to go. There is nothing I can think of that is
> > limited to how large a tuple can be.
>
> Ouch - I can.
>
> Having an index on a varlen field that now doesn't fit any
> more into an index block. Wouldn't this cause problems? Well,
> it's bad database design to index fields that will receive
> such long data, because indexing them will blow up the
> database, but it must work anyway.

It seems that in other DBMSes the length of an index tuple is more restricted than that of a heap one. So I think we shouldn't worry about this case.

Vadim
At 10:12 9/07/99 +0900, Tatsuo Ishii wrote:
>
> o create a new data type that represents the LO
>
> o when defining the LO data type in a table, it actually points to a
>   LO "body" in another place where it is physically stored.

Much as the purist in me hates the concept of hard links (as in Leon's suggestions), this *may* be a good application for them. Certainly that's how Dec(Oracle)/Rdb does it. Since most blobs will be totally rewritten when they are updated, this represents a slightly smaller problem in terms of MVCC.

> o we can have several tables for the LO bodies. Probably a LO body
>   table for each corresponding table (where the LO data type is defined)
>   is appropriate.

Did you mean a table for each field? Or a table for each table (which may have more than 1 LO field)? See comments below.

> o it would be nice to place a LO table on a separate
>   directory/partition from the original table where the LO data type is
>   defined, since a LO body table could become huge.

I would very much like to see the ability to have multi-file databases and tables - ie. the ability to store a table or index in a separate file. Perhaps with a user-defined partitioning function for table rows. The idea being that:

1. User specifies that a table can be stored in one of (say) three files.

2. When a record is first stored, the partitioning function is called to determine the file 'storage area' to use. [or a random selection method is used]

If you are going to allow LOs to be stored in multiple files, it seems a pity not to add some or all of this feature.

Additionally, the issue of pg_dump support for LOs needs to be addressed.

That's about it for me,

Philip Warner.

----------------------------------------------------------------
Philip Warner                     | __---_____
Albatross Consulting Pty. Ltd.    |----/       -  \
(A.C.N. 008 659 498)              |       /(@)   ______---_
Tel: +61-03-5367 7422             |                 _________  \
Fax: +61-03-5367 7430             |     ___________ |
Http://www.rhyme.com.au           |                /           \|
                                  |    --________--
PGP key available upon request,   |  /
and from pgp5.ai.mit.edu:11371    |/
> > I agree this is the way to go. There is nothing I can think of that is
> > limited to how large a tuple can be. It is just accessing it from the
> > heap routines that is the problem. If the tuple is alloc'ed to be used,
> > we can paste together the parts on disk and return one tuple. If they
> > are accessing the buffer copy directly, we would have to be smart about
> > going off the end of the disk copy and moving to the next segment.
>
> Who's accessing tuple attributes directly inside the buffer
> copy (not only the header, which will still be unsplit at
> the top of the chain)?

Every call to heap_getnext(), for one. It locks the buffer, and returns a pointer to the tuple. The next heap_getnext(), or heap_endscan(), releases the lock. The cost of returning every tuple as palloc'ed memory would be huge. We may be able to get away with just returning palloc'ed stuff for long tuples. That may be a simple, clean solution that would be isolated.

In fact, if we want a copy, we call heap_copytuple() to return a palloc'ed copy. This interface has been cleaned up, so it should be clear what is happening. The old code was messy about this.

See my comments from heap_fetch(), which does require the user to supply a buffer variable, so they can unlock it when they are done. The old code allowed you to pass a NULL as a buffer pointer, so there was no locking done, and that is bad!

---------------------------------------------------------------------------

    /* ----------------
     * heap_fetch - retrieve tuple with tid
     *
     * Currently ignores LP_IVALID during processing!
     *
     * Because this is not part of a scan, there is no way to
     * automatically lock/unlock the shared buffers.
     * For this reason, we require that the user retrieve the buffer
     * value, and they are required to BufferRelease() it when they
     * are done.  If they want to make a copy of it before releasing it,
     * they can call heap_copytuple().
     * ----------------
     */
    void
    heap_fetch(Relation relation,
               Snapshot snapshot,
               HeapTuple tuple,
               Buffer *userbuf)

--
  Bruce Momjian
> > Aren't index tuples still tuples? Can't they be split just like
> > regular tuples?
>
> Don't know, maybe.
>
> While looking for some places where tuple data might be
> accessed directly inside of the buffers, I've searched for
> WriteBuffer() and friends. These are mostly used in the index
> access methods and some other places where I expected them,
> so the index AM's at least have to be carefully visited when
> implementing tuple split.

See my recent mail: heap_getnext() and heap_fetch(). Can't get lower access than that.

--
  Bruce Momjian
I knew there had to be a reason that some tests were BLCKSZ/2 and some BLCKSZ. Added to TODO:

    * Allow index on tuple greater than 1/2 block size

Seems we have to allow columns over 1/2 block size for now. Most people wouldn't index on them.

> Actually we have some problems with indices on text
> attributes when the content exceeds HALF of the blocksize:
>
>     FATAL 1: btree: failed to add item to the page
>
> It crashes the backend AND seems to corrupt the index! Looks
> to me that at least the btree code needs to be able to store
> a minimum of two items in one block, and painfully fails if it
> can't.
>
> And just another one:
>
>     pgsql=> create table t1 (a int4, b char(4000));
>     CREATE
>     pgsql=> create index t1_b on t1 (b);
>     CREATE
>     pgsql=> insert into t1 values (1, 'a');
>
>     TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
>           File: "nbtinsert.c", Line: 361)
>
> Bruce: One more TODO item!

--
  Bruce Momjian
If we get wide tuples, we could just throw all large objects into one table, and have an index on it. We can then vacuum it to compact space, etc.

> Going toward >8k tuples would be really good, but I suspect we may have
> some difficulties with the LO stuff once we implement it. Also, it seems
> that it's not worth it to adapt LOs to the newly designed tuples. I think
> the design of the current LOs is so broken that we need to redesign them.
> ...
> Comments? Opinions?
> ---
> Tatsuo Ishii

--
  Bruce Momjian
> If we get wide tuples, we could just throw all large objects into one
> table, and have an index on it. We can then vacuum it to compact space, etc.

I thought about that too. But if a table contains lots of LOs, scanning it will take a long time. On the other hand, if LOs are stored outside the table, the scanning time will be shorter, as long as we don't need to read the content of each LO type field.

--
Tatsuo Ishii
Bruce Momjian wrote:
> If we get wide tuples, we could just throw all large objects into one
> table, and have an index on it. We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim
> > If we get wide tuples, we could just throw all large objects into one
> > table, and have an index on it. We can then vacuum it to compact space, etc.
>
> I thought about that too. But if a table contains lots of LOs,
> scanning it will take a long time. On the other hand, if LOs are
> stored outside the table, the scanning time will be shorter, as long
> as we don't need to read the content of each LO type field.

Use an index to get to the LO's in the table.

--
  Bruce Momjian
> Bruce Momjian wrote:
> > If we get wide tuples, we could just throw all large objects into one
> > table, and have an index on it. We can then vacuum it to compact space, etc.
>
> Storing a 2Gb LO in a table is not a good thing.
>
> Vadim

Ah, but we have segmented tables now. It will auto-split at 1 gig.

--
  Bruce Momjian
Bruce Momjian wrote:
> > Storing a 2Gb LO in a table is not a good thing.
> >
> > Vadim
>
> Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row! I worry not about non-overwriting but about writing a 2Gb log record to WAL - we'll not be able to do it, for sure.

Isn't this why Informix restricts tuple length to 32k only? And the same is what Oracle does. Both of them have the ability to use > 1 page for a single row, but they have this restriction anyway.

I don't like _arbitrary_ tuple size. I vote for some limit. 32K or 64K, at max.

Vadim
Vadim Mikheev wrote:
> Well, now consider an update of a 2Gb row!
> I worry not about non-overwriting but about writing
> a 2Gb log record to WAL - we'll not be able to do it, for sure.

Can't we write just some kind of diff (only changed pages) in WAL, either starting at some threshold or just based on the seek/write logic of LOs?

It will add complexity, but having some arbitrary limits seems very wrong.

It will also make indexing LOs more complex, but as we don't currently index them anyway, it's not a big problem yet.

Setting the limit higher (like 16M, where all my current LOs would fit :) ) is just postponing the problems. Does "who will need more than 640k of RAM" sound familiar?

> Isn't this why Informix restricts tuple length to 32k only?
> And the same is what Oracle does.

Does anyone know what the limit for Oracle8i is? As they advertise it as a replacement file system among other things, I guess it can't be too low - I suspect 2G at minimum.

> Both of them have the ability to use > 1 page for a single row,
> but they have this restriction anyway.
>
> I don't like _arbitrary_ tuple size.

Why not?

IMHO we should allow _arbitrary_ (in reality probably <= MAXINT), but optimize for some known size and tell the users that if they exceed it the performance will suffer. So when I have 99% of my LOs in the 10k-80k range but have a few 512k-2M ones, I can just live with the bigger ones having bad performance instead of implementing an additional LO manager in the frontend too.

> I vote for some limit.

Why limit?

> 32K or 64K, at max.

Why so low? Please make it at least configurable, preferably at runtime.

And if you go that way, please assume this limit (in code) for tuple size only, and not in the FE/BE protocol - it will make it easier for someone to fix the backend to work with larger ones later.

The LOs should remain load-on-demand anyway, just it should be made more transparent for end-users.

> Vadim
Tatsuo Ishii wrote:
> Going toward >8k tuples would be really good, but I suspect we may have
> some difficulties with the LO stuff once we implement it. Also, it seems
> that it's not worth it to adapt LOs to the newly designed tuples. I think
> the design of the current LOs is so broken that we need to redesign them.
>
> [... LO stuff deleted ...]

I wasn't talking about a new datatype that can exceed the tuple limit. The general tuple split I want will also handle the case where a row with 40 text attributes of 1K each gets stored. That's something different.

Jan
Vadim wrote:
> I don't like _arbitrary_ tuple size.
> I vote for some limit. 32K or 64K, at max.

To have some limit seems reasonable to me (I've also read the other comments).

When dealing with regular tuples, first off the query to insert or update them will get read into memory. Next the querytree with the Const vars is built, rewritten, and planned. Then the tuple is built in memory and maybe copied somewhere else (fulltext index trigger). So that amount of memory will be allocated many times! There is some natural limit on the tuple size depending on the available swap space, and not everyone has a multiple-GB swap space setup. Making it a well known hard limit that doesn't hurt even if 20 backends do things simultaneously is better.

I vote for a limit too. 64K should be enough.

Jan
> > Ah, but we have segmented tables now. It will auto-split at 1 gig.
>
> Well, now consider an update of a 2Gb row!
> I worry not about non-overwriting but about writing
> a 2Gb log record to WAL - we'll not be able to do it, for sure.
> ...
> I don't like _arbitrary_ tuple size.
> I vote for some limit. 32K or 64K, at max.

Yes, but having it all in one table prevents an fopen() call for every access and inode use for every large object, and allows vacuum to clean up multiple versions. Just an idea. I realize your point.

--
  Bruce Momjian
> > Well, now consider an update of a 2Gb row!
> > I worry not about non-overwriting but about writing
> > a 2Gb log record to WAL - we'll not be able to do it, for sure.
>
> Can't we write just some kind of diff (only changed pages) in WAL,
> either starting at some threshold or just based on the seek/write
> logic of LOs?
>
> It will add complexity, but having some arbitrary limits seems very
> wrong.
>
> It will also make indexing LOs more complex, but as we don't currently
> index them anyway, it's not a big problem yet.

Well, we do indexing of large objects by using the OS directory code to find a given directory entry.

> Why not?
>
> IMHO we should allow _arbitrary_ (in reality probably <= MAXINT), but
> optimize for some known size and tell the users that if they exceed it
> the performance will suffer.

If they go over a certain size, they can decide to store it in the file system, as many users are doing now.

--
  Bruce Momjian
> I don't like _arbitrary_ tuple size.
> I vote for some limit. 32K or 64K, at max.

> To have some limit seems reasonable to me (I've also read the
> other comments). When dealing with regular tuples, first [...]

Isn't anything other than arbitrary sizes just making us encounter the same problem later? Clearly, there are real hardware limits, but we shouldn't build that into the code.

It seems to me the solution is to have arbitrary (e.g., hardware driven) limits, document what is necessary to support certain operations, and let the fanatics buy mega-systems if they need to support huge tuples. As long as the code is optimized for more reasonable situations, there should be no penalty.

Cheers,
Brook
Jan Wieck wrote: > > Bruce Momjian wrote: > > > I agree this is the way to go. There is nothing I can think of that is > > limited to how large a tuple can be. > > Outch - I can. > > Having an index on a varlen field that now doesn't fit any > more into an index block. Wouldn't this cause problems? Well > it's bad database design to index fields that will receive > that long data because indexing them will blow up the > database but it must work anyway. Seems that in other DBMSes len of index tuple is more restricted than len of heap one. So I think we shouldn't worry about this case. Vadim
> > Aren't index tuples still tuples? Can't they be split just like > > regular tuples? > > Don't know, maybe. > > While looking for some places where tuple data might be > accessed directly inside of the buffers I've searched for > WriteBuffer() and friends. These are mostly used in the index > access methods and some other places where I expected them, > so index AM's have at least to be carefully visited when > implementing tuple split. See my recent mail. heap_getnext and heap_fetch(). Can't get lower access than that. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
If we get wide tuples, we could just throw all large objects into one table, and have an index on it. We can then vacuum it to compact space, etc.

> Going toward >8k tuples would be really good, but I suspect we may have
> some difficulties with LO stuff once we implement it. Also it seems
> that it's not worth it to adapt LOs to the newly designed tuples. I think
> the design of the current LOs is so broken that we need to redesign them.
>
> o it's slow: accessing a LO needs an open() that is not cheap. Creating
>   many LOs makes the data/base/DBNAME/ directory fat.
>
> o it consumes lots of i-nodes
>
> o it breaks the tuple abstraction: this makes the code difficult to
>   maintain.
>
> I would propose the following for the new version of LO:
>
> o create a new data type that represents the LO
>
> o when defining the LO data type in a table, it actually points to a
>   LO "body" in another place where it is physically stored.
>
> o the storage for LO bodies would be a hidden table that contains
>   several LOs, not a single one.
>
> o we can have several tables for the LO bodies. Probably a LO body
>   table for each corresponding table (where the LO data type is defined)
>   is appropriate.
>
> o it would be nice to place a LO body table on a separate
>   directory/partition from the original table where the LO data type is
>   defined, since a LO body table could become huge.
>
> Comments? Opinions?
> ---
> Tatsuo Ishii

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > If we get wide tuples, we could just throw all large objects into one
> > table, and have an index on it. We can then vacuum it to compact space, etc.
>
> I thought about that too. But if a table contains lots of LOs,
> scanning it will take a long time. On the other hand, if LOs are
> stored outside the table, scanning time will be shorter as long as we
> don't need to read the content of each LO type field.

Use an index to get to the LOs in the table.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
>
> > Bruce Momjian wrote:
> > >
> > > If we get wide tuples, we could just throw all large objects into one
> > > table, and have an index on it. We can then vacuum it to compact space, etc.
> >
> > Storing a 2Gb LO in a table is not a good thing.
> >
> > Vadim
>
> Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row! I worry not about the non-overwriting but about writing a 2Gb log record to WAL - we'll not be able to do it, for sure. Isn't that why Informix restricts tuple length to 32k only? And the same is what Oracle does. Both of them have the ability to use more than one page for a single row, but they have this restriction anyway.

I don't like _arbitrary_ tuple size.
I vote for some limit. 32K or 64K, at max.

Vadim
> If we get wide tuples, we could just throw all large objects into one
> table, and have an index on it. We can then vacuum it to compact space, etc.

I thought about that too. But if a table contains lots of LOs, scanning it will take a long time. On the other hand, if LOs are stored outside the table, scanning time will be shorter as long as we don't need to read the content of each LO type field.

--
Tatsuo Ishii
> > I agree this is the way to go. There is nothing I can think of that is
> > limited to how large a tuple can be. It is just accessing it from the
> > heap routines that is the problem. If the tuple is alloc'ed to be used,
> > we can paste together the parts on disk and return one tuple. If they
> > are accessing the buffer copy directly, we would have to be smart about
> > going off the end of the disk copy and moving to the next segment.
>
> Who's accessing tuple attributes directly inside the buffer
> copy (not only the header, which will still be unsplit at
> the top of the chain)?

Every call to heap_getnext(), for one. It locks the buffer, and returns a pointer to the tuple. The next heap_getnext(), or heap_endscan(), releases the lock. The cost of returning every tuple as palloc'ed memory would be huge. We may be able to get away with just returning palloc'ed stuff for long tuples. That may be a simple, clean solution that would be isolated.

In fact, if we want a copy, we call heap_copytuple() to return a palloc'ed copy. This interface has been cleaned up so it should be clear what is happening. The old code was messy about this.

See my comments from heap_fetch(), which does require the user to supply a buffer variable, so they can unlock it when they are done. The old code allowed you to pass a NULL as a buffer pointer, so there was no locking done, and that is bad!

---------------------------------------------------------------------------

/* ----------------
 *	heap_fetch - retrieve tuple with tid
 *
 *	Currently ignores LP_IVALID during processing!
 *
 *	Because this is not part of a scan, there is no way to
 *	automatically lock/unlock the shared buffers.
 *	For this reason, we require that the user retrieve the buffer
 *	value, and they are required to BufferRelease() it when they
 *	are done.  If they want to make a copy of it before releasing it,
 *	they can call heap_copytuple().
 * ----------------
 */
void
heap_fetch(Relation relation,
           Snapshot snapshot,
           HeapTuple tuple,
           Buffer *userbuf)

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tatsuo Ishii wrote:
> Going toward >8k tuples would be really good, but I suspect we may have
> some difficulties with LO stuff once we implement it. Also it seems
> that it's not worth it to adapt LOs to the newly designed tuples. I think
> the design of the current LOs is so broken that we need to redesign them.
>
> [... LO stuff deleted ...]

I wasn't talking about a new datatype that can exceed the tuple limit. The general tuple split I want will also handle it if a row with 40 text attributes of 1K each gets stored. That's something different.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
On Fri, 9 Jul 1999, Vadim Mikheev wrote:

> Bruce Momjian wrote:
> >
> > > Bruce Momjian wrote:
> > > >
> > > > If we get wide tuples, we could just throw all large objects into one
> > > > table, and have an index on it. We can then vacuum it to compact space, etc.
> > >
> > > Storing a 2Gb LO in a table is not a good thing.
> > >
> > > Vadim
> >
> > Ah, but we have segmented tables now. It will auto-split at 1 gig.
>
> Well, now consider an update of a 2Gb row!
> I worry not about the non-overwriting but about writing
> a 2Gb log record to WAL - we'll not be able to do it, for sure.

What I'm kinda curious about is *why* you would want to store a LO in the table in the first place? And, consequently, as Bruce had suggested... index it? Unless something has changed recently that I totally missed, the only time the index would be used is if a query was based on a) the start of the string (ie. ^<string>) or b) the complete string (ie. ^<string>$) ...

So what benefit would an index be on a LO?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
At 09:04 28/07/99 -0300, The Hermit Hacker wrote:
>On Fri, 9 Jul 1999, Vadim Mikheev wrote:
>
>> Bruce Momjian wrote:
>> >
>> > > Bruce Momjian wrote:
>> > > >
>> > > > If we get wide tuples, we could just throw all large objects into one
>> > > > table, and have an index on it. We can then vacuum it to compact space, etc.
>> > >
>> > > Storing a 2Gb LO in a table is not a good thing.
>> > >
>> > > Vadim
>> >
>> > Ah, but we have segmented tables now. It will auto-split at 1 gig.
>>
>> Well, now consider an update of a 2Gb row!
>> I worry not about the non-overwriting but about writing
>> a 2Gb log record to WAL - we'll not be able to do it, for sure.
>
>What I'm kinda curious about is *why* you would want to store a LO in the
>table in the first place? And, consequently, as Bruce had
>suggested... index it? Unless something has changed recently that I
>totally missed, the only time the index would be used is if a query was
>based on a) the start of the string (ie. ^<string>) or b) the complete
>string (ie. ^<string>$) ...
>
>So what benefit would an index be on a LO?

Some systems (Dec RDB) won't even let you index the contents of an LO. Anyone know what other systems do?

Also, to repeat a question from an earlier post: is there a plan for the BLOB implementation that is available for comment/contribution?

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |
and from pgp5.ai.mit.edu:11371   |/