Обсуждение: tuptoaster.c must *not* use SnapshotAny
The tuple toaster currently fetches toast-table rows using SnapshotAny. This is quite uncool, because it will in fact find *any* row. Including dead rows resulting from an aborted (or, in 7.2, still-in-progress) VACUUM. I believe this is the explanation for a problem report I'm currently chasing from Joshua Drake: digivision2=# select * from change_log; ERROR: chunk 1 for toast value 3388774 appears multiple times (there are some thousands of rows with similar problems) I suggest SnapshotSelf instead. Comments anyone? regards, tom lane
Josh Drake here: Although I know that development has pretty much ceased on the 7.1 series. This bug, is potentially very serious for those who have the problem. The last time this came up (that I can find) is in August of 2001, so the bug is not frequent that we are aware of. However, as it appears the fix is simple should we apply a 7.1.3.1 or 7.1.4 that fixes this issue? J > On Tue, 15 Jan 2002, Tom Lane wrote: > The tuple toaster currently fetches toast-table rows using SnapshotAny. > This is quite uncool, because it will in fact find *any* row. Including > dead rows resulting from an aborted (or, in 7.2, still-in-progress) > VACUUM. > > I believe this is the explanation for a problem report I'm currently > chasing from Joshua Drake: > > digivision2=# select * from change_log; > ERROR: chunk 1 for toast value 3388774 appears multiple times > > (there are some thousands of rows with similar problems) > > I suggest SnapshotSelf instead. Comments anyone? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- by way of pgsql-hackers@commandprompt.com
"Command Prompt, Inc." <pgsql-hackers@commandprompt.com> writes: > Josh Drake here: Although I know that development has pretty much ceased > on the 7.1 series. This bug, is potentially very serious for those who > have the problem. Well, based on our experiments last night, recovery is pretty simple: vacuum the affected table. So it doesn't qualify as a critical problem in my eyes, although surely it's a "must fix" for 7.2. > The last time this came up (that I can find) is in August of 2001, so the > bug is not frequent that we are aware of. However, as it appears the fix > is simple should we apply a 7.1.3.1 or 7.1.4 that fixes this issue? I don't think anyone has the energy to put out a 7.1.4. We seem to have enough trouble getting 7.2 out the door. regards, tom lane
> The tuple toaster currently fetches toast-table rows using SnapshotAny. > This is quite uncool, because it will in fact find *any* row. Well, doesn't it need to find *any* row, since the accesses to heap and toast are not "atomic" (not interlocked) ? > Including dead rows resulting from an aborted But then also the heap tuple is aborted, and thus a normal select will not select that toast anyway. > (or, in 7.2, still-in-progress) > VACUUM. I thought new vacuum did not move tuples ? Why should it then produce two rows ? And VACUUM FULL locks the table so ... I think SnapshotSelf could have concurrency problems (not see a toast that someone else committed between our heap read and toast read) IIRC, at the time at least SnapshotAny was chosen on purpose. Might it be a problem, that VACUUM sets xids to FrozenXid, and we always consider that one Valid, even if that vacuum is still in progress ? I can only see a problem with a partway through vacuum full, that aborted. Andreas PS: short desc between Any and Self, anybody ?
> You're right, lazy VACUUM doesn't create this issue. My mistake. > But VACUUM FULL does. > > > I can only see a problem with a partway through vacuum > full, that aborted. > > Exactly. Can you try give me a hint, why an aborted VACUUM FULL will not allways be a problem (also for other operations) until you run another VACUUM FULL that succeeds ? How do we know, that a (newly) FrozenXid tuple does not still have a (visible) duplicate ? Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: >> The tuple toaster currently fetches toast-table rows using SnapshotAny. >> This is quite uncool, because it will in fact find *any* row. > Well, doesn't it need to find *any* row, since the accesses to heap and toast > are not "atomic" (not interlocked) ? No; it needs to *not* find dead rows resulting from a failed vacuum. You are right that for the most part TOAST relies on time qualification of the main-table row, and no doubt that was why Jan thought he could code it like this; but vacuuming the TOAST table is not an operation that affects the main table. >> Including dead rows resulting from an aborted > But then also the heap tuple is aborted, and thus a normal select will not > select that toast anyway. Wrong; see above. We are talking about reshuffling rows in the TOAST table that belong (presumably) to a live row in the main table. During the VACUUM there will be more than one copy of such rows. >> (or, in 7.2, still-in-progress) >> VACUUM. > I thought new vacuum did not move tuples ? Why should it then produce > two rows ? And VACUUM FULL locks the table so ... You're right, lazy VACUUM doesn't create this issue. My mistake. But VACUUM FULL does. > I can only see a problem with a partway through vacuum full, that aborted. Exactly. regards, tom lane
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > Can you try give me a hint, why an aborted VACUUM FULL will not allways be > a problem (also for other operations) until you run another VACUUM FULL > that succeeds ? Because noplace else uses SnapshotAny, basically. The only really legitimate use I can see for it is in Tatsuo's contrib/pgstattuple hack: that wants to count both visible and invisible tuples. > How do we know, that a (newly) FrozenXid tuple does not still have > a (visible) duplicate ? It's *not* visible, if you are applying any visibility checks whatever. But SnapshotAny bypasses all visibility checking. I forgot to answer your question about what these mean. See src/backend/utils/time/tqual.c for the gory details, but if you are willing to accept the summary comments in that file: Any No visibility check whatever Self Other committed xacts + effects of this xact up to *and including* current command Now Other committed xacts + effects of this xact up to *but not including* current command Dirty like Self, but can see effects of other in-progress xacts too Snapshot like Now, but specified other xacts are considered uncommitted even if they've actually committed by now There are also special visibility rules for Update and Vacuum, which are not really different from the normal rules, they just return more information than "visible" or "not visible". (Dirty returns some side information too.) Hmm. Now that I look at it, Self probably isn't quite right either; if we are reading a main-table tuple that's committed dead but is still visible according to our snapshot, we need to be able to see its toast tuples too; but they're committed dead as well. Sigh. I think we need a special visibility rule for TOAST, that only checks for HEAP_MOVED_IN/HEAP_MOVED_OFF (the bits that take care of VACUUM moves). Comments? regards, tom lane
Tom Lane wrote: > The tuple toaster currently fetches toast-table rows using SnapshotAny. > This is quite uncool, because it will in fact find *any* row. Including > dead rows resulting from an aborted (or, in 7.2, still-in-progress) > VACUUM. Really? So what you're saying means that while vacuum is running there's a window where one tuple can exist multiple times and just the fact that other backends can see the vacuum transaction alive prevents them from showingup? That doesn't sound really good to me. Jan > > I believe this is the explanation for a problem report I'm currently > chasing from Joshua Drake: > > digivision2=# select * from change_log; > ERROR: chunk 1 for toast value 3388774 appears multiple times > > (there are some thousands of rows with similar problems) > > I suggest SnapshotSelf instead. Comments anyone? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
Jan Wieck <janwieck@yahoo.com> writes: > Really? So what you're saying means that while vacuum is > running there's a window where one tuple can exist multiple > times Moving a tuple in VACUUM isn't fundamentally different from an UPDATE. You need a visibility rule to tell you which version of the tuple to pay attention to. Right now, TOAST hasn't got one. Per later discussion, Andreas is right that SnapshotSelf won't do. I think we need a new tqual.c routine just for TOAST. (The alternative would be for TOAST to know which snapshot was used to find the main- table tuple, but that doesn't seem workable.) regards, tom lane
I have added more comments to tqual.c to the stuff Tom just added. I added visibility items to the top of each function comment. --------------------------------------------------------------------------- Tom Lane wrote: > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > > Can you try give me a hint, why an aborted VACUUM FULL will not allways be > > a problem (also for other operations) until you run another VACUUM FULL > > that succeeds ? > > Because noplace else uses SnapshotAny, basically. The only really > legitimate use I can see for it is in Tatsuo's contrib/pgstattuple > hack: that wants to count both visible and invisible tuples. > > > How do we know, that a (newly) FrozenXid tuple does not still have > > a (visible) duplicate ? > > It's *not* visible, if you are applying any visibility checks whatever. > But SnapshotAny bypasses all visibility checking. > > I forgot to answer your question about what these mean. See > src/backend/utils/time/tqual.c for the gory details, but if you > are willing to accept the summary comments in that file: > > Any No visibility check whatever > > Self Other committed xacts + effects of this xact up to > *and including* current command > > Now Other committed xacts + effects of this xact up to > *but not including* current command > > Dirty like Self, but can see effects of other in-progress > xacts too > > Snapshot like Now, but specified other xacts are considered > uncommitted even if they've actually committed by now > > There are also special visibility rules for Update and Vacuum, which > are not really different from the normal rules, they just return more > information than "visible" or "not visible". (Dirty returns some > side information too.) > > Hmm. Now that I look at it, Self probably isn't quite right either; > if we are reading a main-table tuple that's committed dead but is > still visible according to our snapshot, we need to be able to see > its toast tuples too; but they're committed dead as well. Sigh. > I think we need a special visibility rule for TOAST, that only > checks for HEAP_MOVED_IN/HEAP_MOVED_OFF (the bits that take care > of VACUUM moves). > > Comments? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > How do we know, that a (newly) FrozenXid tuple does not still have > > a (visible) duplicate ? > > It's *not* visible, if you are applying any visibility checks whatever. > But SnapshotAny bypasses all visibility checking. I am concerned about the case where VACUUM FULL: 1. inserts heap tuple to new location using FrozenXid 2. updates original heap tuples's xmax What if we crash/abort between step 1 and 2 but we used FrozenXid for 1. Don't know if we actually do this, but imho we are only allowed to use FrozenXid for an inplace vacuum operation. Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > I am concerned about the case where VACUUM FULL: > 1. inserts heap tuple to new location using FrozenXid > 2. updates original heap tuples's xmax It doesn't matter whether it's FrozenXid or not. The tuple is not visible if it's got the wrong setting of HEAP_MOVED_OFF/IN. regards, tom lane
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: >> It doesn't matter whether it's FrozenXid or not. The tuple is not >> visible if it's got the wrong setting of HEAP_MOVED_OFF/IN. > But the FrozenXid tuple has HEAP_MOVED_IN and the original has > not yet been altered to HEAP_MOVED_OFF because of abort. > Is the HEAP_MOVED_IN tuple not visible ? Right. Actually it doesn't matter whether the old tuple has HEAP_MOVED_OFF or not; it's still visible *until* the VACUUM commits. The commit atomically switches us from the state where the unmoved tuples are good to the state where the moved ones are good. This is all exactly the same whether FrozenXid is involved or not. regards, tom lane
> > I am concerned about the case where VACUUM FULL: > > 1. inserts heap tuple to new location using FrozenXid > > 2. updates original heap tuples's xmax > > It doesn't matter whether it's FrozenXid or not. The tuple is not > visible if it's got the wrong setting of HEAP_MOVED_OFF/IN. But the FrozenXid tuple has HEAP_MOVED_IN and the original has not yet been altered to HEAP_MOVED_OFF because of abort. Is the HEAP_MOVED_IN tuple not visible ? Andreas
Tom Lane wrote: > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > >> It doesn't matter whether it's FrozenXid or not. The tuple is not > >> visible if it's got the wrong setting of HEAP_MOVED_OFF/IN. > > > But the FrozenXid tuple has HEAP_MOVED_IN and the original has > > not yet been altered to HEAP_MOVED_OFF because of abort. > > Is the HEAP_MOVED_IN tuple not visible ? > > Right. Actually it doesn't matter whether the old tuple has > HEAP_MOVED_OFF or not; it's still visible *until* the VACUUM commits. > The commit atomically switches us from the state where the unmoved > tuples are good to the state where the moved ones are good. > > This is all exactly the same whether FrozenXid is involved or not. Originally I added SnapshotAny only to solve the problem that the lookup of the toast table happens independant from the access to the main tuple, and that in fact the main tuple could have been deleted by the own transactionin between. Would changing SnapshotAny to honor HEAP_MOVED_* fix the problem? If nothing else by now uses this snapshot, itcould be done in place. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
Jan Wieck <janwieck@yahoo.com> writes: > Originally I added SnapshotAny only to solve the problem that > the lookup of the toast table happens independant from the > access to the main tuple, and that in fact the main tuple > could have been deleted by the own transaction in between. > Would changing SnapshotAny to honor HEAP_MOVED_* fix the > problem? If nothing else by now uses this snapshot, it could > be done in place. What I've done is add an additional snapshot class SnapshotToast that behaves that way. There are a couple other uses of SnapshotAny that I didn't think I wanted to mess with. regards, tom lane
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > SnapshotToast doesn't seem a proper solution to me Do you have a better idea in mind? regards, tom lane
Tom Lane wrote: > > Jan Wieck <janwieck@yahoo.com> writes: > > Originally I added SnapshotAny only to solve the problem that > > the lookup of the toast table happens independant from the > > access to the main tuple, and that in fact the main tuple > > could have been deleted by the own transaction in between. > > > Would changing SnapshotAny to honor HEAP_MOVED_* fix the > > problem? If nothing else by now uses this snapshot, it could > > be done in place. > > What I've done is add an additional snapshot class SnapshotToast > that behaves that way. There are a couple other uses of SnapshotAny > that I didn't think I wanted to mess with. AFAIR SnapshotAny was introduced for Foreign Key stuff. SnapshotToast doesn't seem a proper solution to me but 7.2 should be released ASAP anyway. regards, Hiroshi Inoue
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > SnapshotToast doesn't seem a proper solution to me > > Do you have a better idea in mind? The same snapshot for both the main and the toast table seems better though I don't know how it's difficult to change. For example is it possible to update a toast chunk partially using SnapshotToast ? regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > The same snapshot for both the main and the toast table > seems better though I don't know how it's difficult to > change. That would be cleaner but I don't see any way to do it. The toasted value gets passed all over the system, potentially, before someone asks to detoast it. At that point you really don't have a good way to know which snapshot was used to fetch the main-table row. > For example is it possible to update a toast > chunk partially using SnapshotToast ? As things stand (with either SnapshotToast or the old SnapshotAny way) it is never possible to update an individual toast value, either partially or wholly. All you can do is lay down a new toast value (with a new identifying OID) and then delete the old one. But I'm not sure that this is wrong, or fixable. Trying to update part of a toasted value is very much like wanting to update part of an existing row in-place, which we cannot possibly do. We have to lay down a whole new row whenever any part of it is updated. regards, tom lane
Tom Lane wrote: > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > The same snapshot for both the main and the toast table > > seems better though I don't know how it's difficult to > > change. > > That would be cleaner but I don't see any way to do it. The toasted > value gets passed all over the system, potentially, before someone asks > to detoast it. At that point you really don't have a good way to know > which snapshot was used to fetch the main-table row. That's IMHO not the point. The design of toast is that if a main heap tuple points to a toast id, and the heap tuple at scan time was visible, then no matter what, the value is readable. Regardless of any isolation level or command counter or whatnot. That's achieved by allways assigning a new toast ID on change of a toasted value.So if the columns value changed, the old and new rows point to different toast values not by visibility but bytoast id. If it did not change, they point to the same. The problem arises only because SnapshotAny is *too* blind and ignores the special vacuum HEAP_MOVED_* states, sothat a scan after a crashed vacuum full, that left yet unborn moved- in tuples in the toast table, can see them. Since the design of toast is NO NEED to care about snapshots during toast value fetching, I don't see why doingso now would be a cleaner solution. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > > For example is it possible to update a toast > > chunk partially using SnapshotToast ? > > As things stand (with either SnapshotToast or the old SnapshotAny way) > it is never possible to update an individual toast value, either > partially or wholly. All you can do is lay down a new toast value (with > a new identifying OID) and then delete the old one. > > But I'm not sure that this is wrong, or fixable. Trying to update part > of a toasted value is very much like wanting to update part of an > existing row in-place, which we cannot possibly do. Bytea seems to be considered as a candidate for BLOB though I think the entirely new type is preferable. It seems impossible to implement a functionality like inv_write using bytea which the current large object stuff has. regards, Hiroshi Inoue
> Tom Lane wrote: > > > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > > > > For example is it possible to update a toast > > > chunk partially using SnapshotToast ? > > > > As things stand (with either SnapshotToast or the old SnapshotAny way) > > it is never possible to update an individual toast value, either > > partially or wholly. All you can do is lay down a new toast value (with > > a new identifying OID) and then delete the old one. > > > > But I'm not sure that this is wrong, or fixable. Trying to update part > > of a toasted value is very much like wanting to update part of an > > existing row in-place, which we cannot possibly do. > > Bytea seems to be considered as a candidate for BLOB > though I think the entirely new type is preferable. > It seems impossible to implement a functionality like > inv_write using bytea which the current large object > stuff has. Agreed. I think that was the reason we kept TOAST and large objects, because large objects were designed for random read-write. If we can get large objects to auto-delete, probably with pg_depend, we can then use them seamlessly with BLOB I/O routines. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > Agreed. I think that was the reason we kept TOAST and large objects, > > because large objects were designed for random read-write. If we can > > get large objects to auto-delete, probably with pg_depend, we can then > > use them seamlessly with BLOB I/O routines. > > Oops I seem to have missed the discussion about excluding > bytea from the candidate from BLOB. Yes now we seem to have > a good reason to exclude existent type from the candidate > of BLOB. Well, we had the discussion when Jan was adding TOAST, and Jan was saying we still need large objects for I/O purposes and for very large items. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > > Tom Lane wrote: > > > > > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > > > > > > For example is it possible to update a toast > > > > chunk partially using SnapshotToast ? > > > > > > As things stand (with either SnapshotToast or the old SnapshotAny way) > > > it is never possible to update an individual toast value, either > > > partially or wholly. All you can do is lay down a new toast value (with > > > a new identifying OID) and then delete the old one. > > > > > > But I'm not sure that this is wrong, or fixable. Trying to update part > > > of a toasted value is very much like wanting to update part of an > > > existing row in-place, which we cannot possibly do. > > > > Bytea seems to be considered as a candidate for BLOB > > though I think the entirely new type is preferable. > > It seems impossible to implement a functionality like > > inv_write using bytea which the current large object > > stuff has. > > Agreed. I think that was the reason we kept TOAST and large objects, > because large objects were designed for random read-write. If we can > get large objects to auto-delete, probably with pg_depend, we can then > use them seamlessly with BLOB I/O routines. Oops I seem to have missed the discussion about excluding bytea from the candidate from BLOB. Yes now we seem to have a good reason to exclude existent type from the candidate of BLOB. regards, Hiroshi Inoue
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Well, we had the discussion when Jan was adding TOAST, and Jan was > saying we still need large objects for I/O purposes and for very large > items. Yeah. I hadn't realized until this conversation that TOAST couldn't support partial updates of huge values. That is a definite problem for supporting lo_read/lo_write-type operations. regards, tom lane
Jan Wieck wrote: > Bruce Momjian wrote: > > > > Agreed. I think that was the reason we kept TOAST and large objects, > > because large objects were designed for random read-write. If we can > > get large objects to auto-delete, probably with pg_depend, we can then > > use them seamlessly with BLOB I/O routines. > > Not entirely. BLOB's should have copy semantics, so that > doing > > INSERT INTO tab1 SELECT id, blob_column FROM tab2 ... > > and subsequently changing the blob in either tab1 or tab2 > does NOT change the blob in the other table. Currently, even > if you make the pg_depend stuff smart enough to let the blob > live until all references are gone, the two tables would > share the same blob. And that's not only IMHO wrong, it's > also incompatible to Oracle :-) Yes, we have to add duplication large object stuff hard-wired to that new data type. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > > > Agreed. I think that was the reason we kept TOAST and large objects, > > > because large objects were designed for random read-write. If we can > > > get large objects to auto-delete, probably with pg_depend, we can then > > > use them seamlessly with BLOB I/O routines. > > > > Oops I seem to have missed the discussion about excluding > > bytea from the candidate from BLOB. Yes now we seem to have > > a good reason to exclude existent type from the candidate > > of BLOB. > > Well, we had the discussion when Jan was adding TOAST, and Jan was > saying we still need large objects for I/O purposes and for very large > items. Though I've often seen the reference to bytea BLOB I remember no clear negation. Don't we have to negate it clearly from the first ? regards, Hiroshi Inoue
Bruce Momjian wrote: > > Agreed. I think that was the reason we kept TOAST and large objects, > because large objects were designed for random read-write. If we can > get large objects to auto-delete, probably with pg_depend, we can then > use them seamlessly with BLOB I/O routines. Not entirely. BLOB's should have copy semantics, so that doing INSERT INTO tab1 SELECT id, blob_column FROM tab2 ... and subsequently changing the blob in either tab1 or tab2 does NOT change the blob in the other table. Currently,even if you make the pg_depend stuff smart enough to let the blob live until all references are gone,the two tables would share the same blob. And that's not only IMHO wrong, it's also incompatible to Oracle :-) Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
> > Well, we had the discussion when Jan was adding TOAST, and Jan was > > saying we still need large objects for I/O purposes and for very large > > items. > > Though I've often seen the reference to bytea BLOB > I remember no clear negation. Don't we have to negate > it clearly from the first ? As I remember, we only mention BLOB when people ask. We don't refer to either as BLOB, only as TOAST/bytea and large objects. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Hiroshi Inoue wrote: > Though I've often seen the reference to bytea BLOB > I remember no clear negation. Don't we have to negate > it clearly from the first ? > > regards, > Hiroshi Inoue With the changes in 7.2, bytea is actually very close to matching the SQL99 definition of binary strings, which it also refers to as BLOB. BLOBS should include support for (section 4.3): <comparison predicate> - yes <blob concatenation> - yes <blob substring function> - yes <blob overlay function> - no <trim function> - yes <length expression> - yes <position expression> - yes <like predicate> - yes Other than the overlay function, the primary thing missing is the use ofhex as the I/O representation. See table 3-9 at http://candle.pha.pa.us/main/writings/pgsql/sgml/datatype-binary.html for more info. Joe
Hiroshi Inoue wrote: [Charset iso-2022-jp unsupported, skipping...] > Though I've often seen the reference to bytea BLOB > I remember no clear negation. Don't we have to negate > it clearly from the first ? Any datatype that has the potential to be loaded into memory entirely by the backend is dangerous. It leads to problems like "out of swapspace", which aren't really funny in production. We need something that is restricted to streaming access only. Yet, it needs to have copy semantics. That means to me, that this will not only be a new datatype, but some more infrastructure too. We need INSERT ... RETURNING ... where the values for BLOBs are given as NEW_BLOB() and the returned identifier is subsequently used to open the created BLOBs for writing and pump the data in. We need heap access level tuple support to duplicate those values on INSERT ... SELECT, to remove them on DELETEand to manipulate them on UPDATE. We need new functionality in procedural languages to support streaming in and out of these guy's, where the actualdata is buffered in temp files and only blob identifiers passed around. I see the final storage of blob data to happen in the toast table, sliced, diced and handy (most probably not compressed). But I don't see an easy way to abuse the bytea datatype for real blob support. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
Jan Wieck wrote: > > Any datatype that has the potential to be loaded into memory > entirely by the backend is dangerous. It leads to problems > like "out of swapspace", which aren't really funny in > production. > > We need something that is restricted to streaming access > only. Yet, it needs to have copy semantics. > > That means to me, that this will not only be a new datatype, > but some more infrastructure too. We need INSERT ... > RETURNING ... where the values for BLOBs are given as > NEW_BLOB() and the returned identifier is subsequently used > to open the created BLOBs for writing and pump the data in. > > We need heap access level tuple support to duplicate those > values on INSERT ... SELECT, to remove them on DELETE and to > manipulate them on UPDATE. > > We need new functionality in procedural languages to support > streaming in and out of these guy's, where the actual data is > buffered in temp files and only blob identifiers passed > around. > > I see the final storage of blob data to happen in the toast > table, sliced, diced and handy (most probably not > compressed). But I don't see an easy way to abuse the bytea > datatype for real blob support. > All this may be needed for BLOBs that are *huge*, but there are at least two classes of BLOB data that are not typically huge in practice. One is encrypted or hashed data, the other is image data for web pages. Either of these work very well with the current bytea support. Joe
Joe Conway wrote: > > Hiroshi Inoue wrote: > > > Though I've often seen the reference to bytea BLOB > > I remember no clear negation. Don't we have to negate > > it clearly from the first ? > > > > regards, > > Hiroshi Inoue > > With the changes in 7.2, bytea is actually very close to matching the > SQL99 definition of binary strings, which it also refers to as BLOB. > BLOBS should include support for (section 4.3): > > <comparison predicate> - yes > <blob concatenation> - yes > <blob substring function> - yes > <blob overlay function> - no > <trim function> - yes > <length expression> - yes > <position expression> - yes > <like predicate> - yes > Other than the overlay function, It seems good as long as they are not large(huge ?). It seems a significant flaw for bytea to have no possibility of partial update. > the primary thing missing is the use of > hex as the I/O representation. I don't love the current I/O representation of bytea. It seems worth adding a new data type only to change the binary I/O representation. regards, Hiroshi Inoue