Discussion: TOAST performance (was Re: [GENERAL] Delete Performance)
"P.J. \"Josh\" Rovero" <rovero@sonalysts.com> writes: > [ complains that deletes are slow in table containing toasted data ] I did some experimentation here and found a rather surprising dependency: the time to delete a bunch of data is pretty much directly proportional to the disk space it occupies. This says that we're paying through the nose for having XLOG make copies of about-to-be-modified pages. I did: create table foo (f1 text); insert into foo values ('a short entry'); insert into foo select * from foo; -- repeat enough times to build up 32K rows total delete from foo; The "delete" took about 2 seconds. I then did it over with the 'value' being a 5K chunk of text, which according to octet_length got compressed to 3900 bytes. (This'd require two rows in the TOAST table.) This time the delete took 127 seconds. I was expecting about a 3X penalty since we needed to delete three rows not one, but what I got was a 60X penalty. Trying to understand this, I did some profiling and found that most of the time was going into XLogInsert and XLOG I/O. That's when I remembered that the actual data volume involved is considerably different in the two cases. Allowing for tuple header overhead and so forth, the small-data case involves about 1.8MB, the large-data case about 131MB, or about 70 times as much data. I believe this indicates that what's determining the runtime is the fact that the XLOG code writes out an image of each page modified in the transaction. These page images will be the bulk of the XLOG traffic for the TOAST table (since there are only four or so tuples on each TOAST page, the actual XLOG delete records take little space by comparison). I've worried for some time that the decision to XLOG page images was costing us a lot more performance than could be justified... 
One trick we could perhaps pull is to postpone deletion of TOAST tuples until VACUUM, so that the bulk of the work is done in a noncritical path (from the point of view of the application anyway). I'm not sure how this interacts with the way that we re-use a TOAST entry when other fields in the row are updated, however. It might be too difficult for VACUUM to tell when to delete a TOAST item. regards, tom lane
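[Editorial note: Tom's data-volume arithmetic above can be sanity-checked
with a few assumed per-row byte counts.  The row sizes below are guesses
chosen to be plausible for 2001-era PostgreSQL, not measurements; only
the resulting ratio matters.]

```python
# Back-of-the-envelope check of the data volumes in the delete test.
# Per-row byte counts are assumptions, not measured values.

NROWS = 32 * 1024  # 32K rows built up by repeated INSERT ... SELECT

# Small case: a short text value plus roughly 40-50 bytes of tuple
# header and item-pointer overhead per row (assumed).
small_row = 58
small_total = NROWS * small_row          # ~1.9 MB

# Large case: the main-table row plus two TOAST-table rows holding the
# 3900 compressed bytes, each TOAST row with its own overhead (assumed).
large_row = 120 + 3900 + 2 * 40
large_total = NROWS * large_row          # ~134 MB

print(small_total / 1e6, large_total / 1e6, large_total / small_total)
```

With these guesses the ratio comes out near 70, matching the observed
60X-plus penalty once full page images dominate the XLOG traffic.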
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Is it because we take a snapshot of the page before we write it in case
> we only write part of the page?

AFAIR, the partial-page-write problem is the entire reason for doing it.
If we could be certain that writes to datafile pages were atomic, we'd
not need this.

Of course we can't be certain of that.  But I'm wondering if there isn't
a cheaper solution.

			regards, tom lane
> The "delete" took about 2 seconds. I then did it over with the > 'value' being a 5K chunk of text, which according to octet_length > got compressed to 3900 bytes. (This'd require two rows in the TOAST > table.) This time the delete took 127 seconds. I was expecting > about a 3X penalty since we needed to delete three rows not one, > but what I got was a 60X penalty. Wow. Can someone remind me why we take page images on delete? We aren't really writing anything special to the page except a transction id. > I've worried for some time that the decision to XLOG page images was > costing us a lot more performance than could be justified... Is it because we take a snapshot of the page before we write it in case we only write part of the page? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Is it because we take a snapshot of the page before we write it in case
> > we only write part of the page?
>
> AFAIR, the partial-page-write problem is the entire reason for doing it.
> If we could be certain that writes to datafile pages were atomic, we'd
> not need this.
>
> Of course we can't be certain of that.  But I'm wondering if there isn't
> a cheaper solution.

Could we add code to detect a partial write when we recover from one
using WAL, so we can know if these partial writes are ever happening?

I am with you on this.  There has to be a better way.
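[Editorial note: one way recovery could detect a torn page, as Bruce
asks, is a whole-page checksum verified at read time; this is the
approach PostgreSQL's much later data-checksums feature took.  The page
layout and function names below are invented for illustration, not
actual PostgreSQL code.]

```python
# Sketch: store a CRC of the page body in the page header on write,
# verify it on read.  A partial write leaves a mismatched CRC.

import zlib

PAGE_SIZE = 8192
CKSUM_LEN = 4  # checksum kept in the first 4 bytes of the page (assumed layout)

def write_page(payload: bytes) -> bytes:
    assert len(payload) == PAGE_SIZE - CKSUM_LEN
    cksum = zlib.crc32(payload).to_bytes(4, "big")
    return cksum + payload

def page_is_torn(page: bytes) -> bool:
    stored = int.from_bytes(page[:CKSUM_LEN], "big")
    return stored != zlib.crc32(page[CKSUM_LEN:])

page = write_page(b"x" * (PAGE_SIZE - CKSUM_LEN))
assert not page_is_torn(page)

# Simulate a power failure where only the first 512-byte sector hit disk.
torn = page[:512] + b"\x00" * (PAGE_SIZE - 512)
assert page_is_torn(torn)
```

Note this only *detects* the torn page; as Tom points out downthread,
without a full page image in WAL there is still nothing to restore it
from.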
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Could we add code to detect a partial write when we recover from one
> using WAL so we can know if these partial writes are ever happening?

What's your point?  It clearly *can* happen during power-failure
scenarios.  All the monitoring in the world won't disprove that.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we add code to detect a partial write when we recover from one
> > using WAL so we can know if these partial writes are ever happening?
>
> What's your point?  It clearly *can* happen during power-failure
> scenarios.  All the monitoring in the world won't disprove that.

My point is uh, um, eh, I think it is a very important point that I
should make ... um.  :-)

Seriously, how do OS's handle partial page writes, especially to
directories?

Another item I was considering is that INSERT and UPDATE, because they
append to the tables, don't really cause lots of pre-page writes, while
DELETE could affect all pages in a table and would require pre-page
writes on all of them.  However, deletes are only marking the XID status
of the rows.  Unfortunately I can't think of a way of recording those
new XID's in WAL and preventing a possible failure while the XID's are
written to the page.  Can someone help me here?
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seriously, how do OS's handle partial page writes, especially to
> > directories?

I realize UPDATE also requires pre-page writes for the old tuples.  What
bothers me is that unlike INSERT and UPDATE of new rows, DELETE and
UPDATE of old rows are not writing new data, but just setting
transaction ID's.  I wish there was a way to store those XID's somewhere
else so the page wouldn't have to be modified.
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seriously, how do OS's handle partial page writes, especially to
> > directories?
>
> ... fsck ...

But how can it handle partial writes to a directory when many files
exist in that single block?
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Seriously, how do OS's handle partial page writes, especially to
> directories?

... fsck ...

			regards, tom lane
Tom Lane wrote:

> I did some experimentation here and found a rather surprising
> dependency: the time to delete a bunch of data is pretty much
> directly proportional to the disk space it occupies.  This says
> that we're paying through the nose for having XLOG make copies
> of about-to-be-modified pages.

At least now I know I wasn't imagining things.... :-)

Which brings up the question: what is the best way to deal with many
thousands of variable-length binary chunks?  Net input == net output
over the course of a day.

The new vacuum should help (both lo_ and toasted tables take a long time
to vacuum full), but I'm running into the "Hotel California" situation.
Data goes in fast, but can't be deleted fast enough to keep the database
from continuously growing in size.
Josh Rovero wrote:
> Tom Lane wrote:
> > I did some experimentation here and found a rather surprising
> > dependency: the time to delete a bunch of data is pretty much
> > directly proportional to the disk space it occupies.  This says
> > that we're paying through the nose for having XLOG make copies
> > of about-to-be-modified pages.

Can't we somehow WAL only metadata and not the actual pages for DELETEs?
As delete is essentially (though currently not technically) just
metadata, it should be a possible thing to do.

------------------
Hannu
On Tue, 2001-11-20 at 10:11, Hannu Krosing wrote:
> > Tom Lane wrote:
> > > I did some experimentation here and found a rather surprising
> > > dependency: the time to delete a bunch of data is pretty much
> > > directly proportional to the disk space it occupies.  This says
> > > that we're paying through the nose for having XLOG make copies
> > > of about-to-be-modified pages.
>
> Can't we somehow WAL only metadata and not the actual pages for
> DELETEs - as delete is essentially (though currently not technically)
> just metadata it should be a possible thing to do.

Is it possible to do ordered writes, the way ext3 does?

http://www-106.ibm.com/developerworks/linux/library/l-fs7/

is an interesting article discussing the approach.

Regards,
					Andrew.
-- 
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/         PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201   MOB: +64(21)635-694   OFFICE: +64(4)499-2267
Andrew McMillan wrote:
> On Tue, 2001-11-20 at 10:11, Hannu Krosing wrote:
> > Can't we somehow WAL only metadata and not the actual pages for
> > DELETEs - as delete is essentially (though currently not technically)
> > just metadata it should be a possible thing to do.
>
> Is it possible to do ordered writes, the way ext3 does?

I remember it being discussed on this list that you have very little
control over writing order if you operate above the filesystem/cache
level.

> http://www-106.ibm.com/developerworks/linux/library/l-fs7/

I guess that is the article that sparked the idea of journalling only
metadata for deletes (including the delete half of update).

Using the Journaling Block Device described there could actually be a
good (though currently not portable) solution if you run Linux.

-------------
Hannu
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we add code to detect a partial write when we recover from one
> > using WAL so we can know if these partial writes are ever happening?
>
> What's your point?  It clearly *can* happen during power-failure
> scenarios.  All the monitoring in the world won't disprove that.

What bothers me about this is that we have the original page with the
old data.  It would be nice if we could write the new page in a
different location, make the new page active, and recycle the old page
at some later time.

We are storing the pre-page image in WAL, but it seems like a waste
because we already have a pre-image.  It is just that we are overwriting
it.
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> What bothers me about this is that we have the original page with the
> old data.  It would be nice if we could write the new page in a
> different location, make the new page active and recycle the old page
> at some later time.

I don't see how that reduces the total amount of disk traffic?

It's also kind of unclear how to do it without doubling (or worse) the
amount of table space used in many common scenarios.  I doubt many
people will be happy if "DELETE FROM foo" requires transient space equal
to twice the original size of foo.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > What bothers me about this is that we have the original page with the
> > old data.  It would be nice if we could write the new page in a
> > different location, make the new page active and recycle the old page
> > at some later time.
>
> I don't see how that reduces the total amount of disk traffic?
>
> It's also kind of unclear how to do it without doubling (or worse) the
> amount of table space used in many common scenarios.  I doubt many
> people will be happy if "DELETE FROM foo" requires transient space equal
> to twice the original size of foo.

Well, right now we write the pre-image to WAL, then write the new page
over the old one.  In my case, you just write the new, and somewhere
record that the old page is no longer active.  Sounds a little like
VACUUM, but for pages.

With DELETE FROM foo, let's suppose you have 10 pages in the table.  To
modify page 1, you write to page 11, then record in WAL that page 1 is
inactive.  To write page 2, you write to page 1 and record page 2 as
inactive, etc.  You basically are writing your new data one behind.

One problem I see is that you don't really know the pages are on disk,
so I am not sure how to be safe when over-writing the inactive pages.
Of course, I am just throwing out ideas, looking for a solution.  Help!
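[Editorial note: Bruce's "write one behind" scheme can be modeled in a
few lines.  This is a toy model of the proposal only - not how
PostgreSQL works, and the names are invented - and it makes the later
objections in the thread concrete: the old image is preserved at the
cost of an indirection map and a recycling problem.]

```python
# Toy shadow-paging model: rewrite page N into a spare slot, log only a
# tiny "page N moved to slot S" record, and let N's old slot become the
# next spare.  Total slots stay at table size + 1.

table = {i: f"old contents of page {i}" for i in range(1, 11)}  # pages 1..10
spare = 11                          # one slot past the end of the table
location = {i: i for i in table}    # logical page -> physical slot
wal = []                            # small metadata records, no page images

def rewrite_page(page: int, new_contents: str) -> None:
    global spare
    table[spare] = new_contents                 # write new version elsewhere
    wal.append(("moved", page, "to", spare))    # tiny WAL record
    old_slot, location[page] = location[page], spare
    spare = old_slot                            # old slot becomes the spare

rewrite_page(1, "new contents of page 1")
rewrite_page(2, "new contents of page 2")

assert table[location[1]] == "new contents of page 1"
assert spare == 2   # page 2's old slot is free for the next rewrite
```

The catch, as Tom replies below, is that the spare slot cannot safely be
overwritten until the previous relocation is known durable, and every
reader must see the up-to-date location map.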
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> I don't see how that reduces the total amount of disk traffic?

> Well, right now we write the pre-image to WAL, then write the new page
> over the old one.  In my case, you just write the new, and somewhere
> record that the old page is no longer active.

The devil is in the details of that last little bit.  How is "mark a
page inactive" cheaper than "mark a tuple dead"?  More specifically, how
do you propose to avoid WAL-logging the page you are going to do this
marking in?  Seems you still end up with a WAL page image for something.

			regards, tom lane
> The devil is in the details of that last little bit.  How is "mark a
> page inactive" cheaper than "mark a tuple dead"?  More specifically,
> how do you propose to avoid WAL-logging the page you are going to do
> this marking in?  Seems you still end up with a WAL page image for
> something.

I was thinking of just throwing the inactive page number into WAL.  Much
smaller than the entire page image.  You don't touch the page.  Does
that help?
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I was thinking of just throwing the inactive page number into WAL.
> Much smaller than the entire page image.  You don't touch the page.
> Does that help?

I don't think so.  Somehow you have to tell the other backends that that
page is dead; merely recording it in WAL doesn't do that.  More to the
point, you can't recycle (overwrite) that page until you've checkpointed
or WAL-logged the replacement page; so you still end up with disk I/O
for the replacement.

			regards, tom lane
Tom Lane wrote:
> It's also kind of unclear how to do it without doubling (or worse) the
> amount of table space used in many common scenarios.  I doubt many
> people will be happy if "DELETE FROM foo" requires transient space
> equal to twice the original size of foo.

IIRC the double space requirement is what has kept us from implementing
DROP COLUMN.

-----------
Hannu
Tom Lane wrote:
> The devil is in the details of that last little bit.  How is "mark a
> page inactive" cheaper than "mark a tuple dead"?  More specifically,
> how do you propose to avoid WAL-logging the page you are going to do
> this marking in?  Seems you still end up with a WAL page image for
> something.

Assuming that we WAL with the granularity of a disk sector (512b), I
think that restructuring the database heap page (8kb) would be a big win
for delete/update.

The idea is to move the metadata (oid, tableoid, xmin, cmin, xmax, cmax,
ctid) to the beginning of the heap page, into the same space as the
tuple pointers.  It's easy (<grin>) as all of it is fixed length.  Then
a change in metadata, like setting xmax for a deleted/updated tuple,
will dirty only the first disk sectors of the page and not all of them.

The new structure of ItemId will be:

	(itemId-pointer:  n bits,
	 itemId-flags:    32-n bits,
	 oid, tableoid, xmin, cmin, xmax, cmax, ctid)

Assuming that we do accounting of dirty pages, and WAL with the
granularity of a database page, we may get a big win by just moving to a
smaller granularity.  The win from finer granularity was not very big
before WAL, as the database pages are continuous on disk, but will be
significant when we have to log all dirty pages.

------------------
Hannu
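[Editorial note: the payoff Hannu describes is easy to quantify under
his assumptions.  The tuple counts and header sizes below are
illustrative guesses, not PostgreSQL's actual layout constants.]

```python
# How many 512-byte sectors does a whole-page DELETE dirty?
# Current layout: tuple headers are scattered through the page, so
# setting xmax on every tuple touches every sector.  Proposed layout:
# all per-tuple metadata packed at the front of the page.

PAGE = 8192
SECTOR = 512
TUPLES = 40          # tuples per page (assumed)
HEADER = 32          # bytes of per-tuple metadata (assumed)

# Headers scattered across the page: every sector is dirtied.
dirty_current = PAGE // SECTOR                  # 16 sectors

# Headers packed at the page start: only the header area is dirtied.
header_area = TUPLES * HEADER                   # 1280 bytes
dirty_proposed = -(-header_area // SECTOR)      # ceiling division -> 3 sectors

print(dirty_current, dirty_proposed)
```

Under these assumptions the sector-granularity WAL volume for a
delete-everything pass drops by roughly a factor of five per page.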
> With DELETE FROM foo, let's suppose you have 10 pages in the table.
> To modify page 1, you write to page 11

But what about the indexes?  They would all need to be modified
accordingly.  If you did something like chaining, then before long all
tuples would be chained, even those that were not touched.

If you really want to avoid the page writes to WAL, imho the best way
would be to revive the original PG page design, where the physical
position of slots in a heap page was only changed by vacuum.  Then a
heap page that was only partly written would only be a problem iff the
hardware wrote wrong data, not if it only skipped part of the write.
Reasonable hardware does detect such corrupted pages.  E.g. on AIX, if
you reduce the PG pagesize to 4k, an only partly written page that
stays undetected can be ruled out.  Then you would only need to write
index pages to WAL, but not heap pages.

Maybe a better idea would be to only conditionally write pages to WAL if
slot positions changed.  In the "delete" example, heap slot positions
certainly do not need to change.

To be extra safe it would probably be necessary to not split tuple
headers (at least the xact info) across physical pages.  Then it would
also be safe to use a pg pagesize that is a multiple of the physical
page size.

or so ? ...

Andreas
> > It's also kind of unclear how to do it without doubling (or worse)
> > the amount of table space used in many common scenarios.  I doubt
> > many people will be happy if "DELETE FROM foo" requires transient
> > space equal to twice the original size of foo.
>
> IIRC the double space requirement is what has kept us from implementing
> DROP COLUMN.

The correct solution then, according methinks to my old Human Computer
Interaction lecturer, is to implement the feature anyway, and warn the
DBA what the consequences are.  That way, the DBA can do it if she
wants, unlike the current situation where it's next to impossible (with
lots of referencing foreign keys).

Chris
> The correct solution then, according methinks to my old Human Computer
> Interaction lecturer, is to implement the feature anyway, and warn the
> DBA what the consequences are.  That way, the DBA can do it if she
> wants, unlike the current situation where it's next to impossible
> (with lots of referencing foreign keys).

Yes, I personally am going to try this for 7.3, as well as fix CLUSTER.
I think someone has already started on CLUSTER anyway.
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Is it because we take a snapshot of the page before we write it in case
> > we only write part of the page?
>
> AFAIR, the partial-page-write problem is the entire reason for doing it.
> If we could be certain that writes to datafile pages were atomic, we'd
> not need this.
>
> Of course we can't be certain of that.  But I'm wondering if there isn't
> a cheaper solution.

I have added these TODO items to summarize this discussion:

	* Reduce number of pre-page WAL writes; they exist only to guard
	  against partial page writes
	* Turn off pre-page writes if fsync is disabled (?)
If "pre-page WAL write" means the value of the page before the current changes, then there is generally another reason for writing it out. Imagine this sequence of events: 1. transaction A starts 2. transaction B starts 3. tran A makes a change 4. tran B makes a change 5. tran A commits 6. all changes get written to disk (this can happen even without fsync, for example tran C might do a full table scan whichfills the buffer cache before B commits) 7. the system crashes When the system comes back up, we need to do a rollback on transaction B since it did not commit and we need the "pre-page" to know how to undo the change for B that got saved in step 6 above. At least this is what happens in most DBMSs... Brian Beuning Bruce Momjian wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Is it because we take a snapshot of the page before we write it in case > > > we only write part of the page? > > > > AFAIR, the partial-page-write problem is the entire reason for doing it. > > If we could be certain that writes to datafile pages were atomic, we'd > > not need this. > > > > Of course we can't be certain of that. But I'm wondering if there isn't > > a cheaper solution. > > I have added these TODO items to summarize this discussion: > > * Reduce number of pre-page WAL writes; they exist only to gaurd against > partial page writes > * Turn off pre-page writes if fsync is disabled (?) > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> If "pre-page WAL write" means the value of the page before the current > changes, then there is generally another reason for writing it out. > > Imagine this sequence of events: > 1. transaction A starts > 2. transaction B starts > 3. tran A makes a change > 4. tran B makes a change > 5. tran A commits > 6. all changes get written to disk (this can happen even without fsync, > for example tran C might do a full table scan which fills the buffer cache > before B commits) > 7. the system crashes > > When the system comes back up, we need to do a rollback on > transaction B since it did not commit and we need the "pre-page" > to know how to undo the change for B that got saved in step 6 above. > > At least this is what happens in most DBMSs... Because we have a non-overwriting storage manager, I don't think this issue applies to us. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026