Обсуждение: TOAST performance (was Re: [GENERAL] Delete Performance)

Поиск
Список
Период
Сортировка

TOAST performance (was Re: [GENERAL] Delete Performance)

От
Tom Lane
Дата:
"P.J. \"Josh\" Rovero" <rovero@sonalysts.com> writes:
> [ complains that deletes are slow in table containing toasted data ]

I did some experimentation here and found a rather surprising
dependency: the time to delete a bunch of data is pretty much
directly proportional to the disk space it occupies.  This says
that we're paying through the nose for having XLOG make copies
of about-to-be-modified pages.

I did:

create table foo (f1 text);
insert into foo values ('a short entry');
insert into foo select * from foo;
-- repeat enough times to build up 32K rows total
delete from foo;

The "delete" took about 2 seconds.  I then did it over with the
'value' being a 5K chunk of text, which according to octet_length
got compressed to 3900 bytes.  (This'd require two rows in the TOAST
table.)  This time the delete took 127 seconds.  I was expecting
about a 3X penalty since we needed to delete three rows not one,
but what I got was a 60X penalty.

Trying to understand this, I did some profiling and found that most
of the time was going into XLogInsert and XLOG I/O.  That's when I
remembered that the actual data volume involved is considerably
different in the two cases.  Allowing for tuple header overhead and
so forth, the small-data case involves about 1.8MB, the large-data
case about 131MB, or about 70 times as much data.

I believe this indicates that what's determining the runtime is the fact
that the XLOG code writes out an image of each page modified in the
transaction.  These page images will be the bulk of the XLOG traffic
for the TOAST table (since there are only four or so tuples on each
TOAST page, the actual XLOG delete records take little space by
comparison).

I've worried for some time that the decision to XLOG page images was
costing us a lot more performance than could be justified...

One trick we could perhaps pull is to postpone deletion of TOAST tuples
until VACUUM, so that the bulk of the work is done in a noncritical path
(from the point of view of the application anyway).  I'm not sure how
this interacts with the way that we re-use a TOAST entry when other
fields in the row are updated, however.  It might be too difficult for
VACUUM to tell when to delete a TOAST item.
        regards, tom lane


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Is it because we take a snapshot of the page before we write it in case
> we only write part of the page?

AFAIR, the partial-page-write problem is the entire reason for doing it.
If we could be certain that writes to datafile pages were atomic, we'd
not need this.

Of course we can't be certain of that.  But I'm wondering if there isn't
a cheaper solution.
        regards, tom lane


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> The "delete" took about 2 seconds.  I then did it over with the
> 'value' being a 5K chunk of text, which according to octet_length
> got compressed to 3900 bytes.  (This'd require two rows in the TOAST
> table.)  This time the delete took 127 seconds.  I was expecting
> about a 3X penalty since we needed to delete three rows not one,
> but what I got was a 60X penalty.

Wow.  Can someone remind me why we take page images on delete?  We
aren't really writing anything special to the page except a transction
id.

> I've worried for some time that the decision to XLOG page images was
> costing us a lot more performance than could be justified...

Is it because we take a snapshot of the page before we write it in case
we only write part of the page?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Is it because we take a snapshot of the page before we write it in case
> > we only write part of the page?
> 
> AFAIR, the partial-page-write problem is the entire reason for doing it.
> If we could be certain that writes to datafile pages were atomic, we'd
> not need this.
> 
> Of course we can't be certain of that.  But I'm wondering if there isn't
> a cheaper solution.

Could we add code to detect a partial write when we recover from one
using WAL so we can know if these partial writes are ever
happening?


I am with you on this.  There has to be a better way.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Could we add code to detect a partial write when we recover from one
> using WAL so we can know if these partial writes are ever
> happening?

What's your point?  It clearly *can* happen during power-failure
scenarios.  All the monitoring in the world won't disprove that.
        regards, tom lane


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we add code to detect a partial write when we recover from one
> > using WAL so we can know if these partial writes are ever
> > happening?
> 
> What's your point?  It clearly *can* happen during power-failure
> scenarios.  All the monitoring in the world won't disprove that.

My point is uh, um, eh, I think it is a very important point that I
should make ... um.  :-)

Seriously, how do OS's handle partial page write, especially to
directories?

Another item I was considering is that INSERT and UPDATE, because they
append to the tables, don't really cause lots of pre-page writes, while
DELETE could affect all page in a table and would require pre-page
writes on all of them.  

However, deletes are only marking the XID status of the rows. 
Unfortunately I can't think of a way of recording those new XID's in WAL
and preventing a possible failure while the XID's are written to the
page.  Can someone help me here?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seriously, how do OS's handle partial page write, especially to
> > directories?


I realize UPDATE also requires pre-page writes for the old tuples.  What
bothers me is that unlike INSERT and UPDATE of new rows, DELETE and
UPDATE of old rows is not writing new data but just setting transaction
ID's.  I wish there was a way to store those XID's somewhere else so the
page wouldn't have to be modified.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seriously, how do OS's handle partial page write, especially to
> > directories?
> 
> ... fsck ...

But how can it handle partial writes to a directory when many files
exist in that single block?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Seriously, how do OS's handle partial page write, especially to
> directories?

... fsck ...
        regards, tom lane


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Josh Rovero
Дата:
Tom Lane wrote

>
>I did some experimentation here and found a rather surprising
>dependency: the time to delete a bunch of data is pretty much
>directly proportional to the disk space it occupies.  This says
>that we're paying through the nose for having XLOG make copies
>of about-to-be-modified pages.
>
At least now I know I wasn't imagining things....            :-)

Which brings up the question, what is the best way to deal with many
thousands of variable-length binary chunks.  Net input == net output
over the course of a day.   The new vacuum should help (both lo_ and
toasted tables take a long time to vacuum full), but I'm running into
the "Hotel California" situation.  Data goes in fast, but can't be
deleted fast enough to keep the database from continuously growing
in size.






Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Hannu Krosing
Дата:

Josh Rovero wrote:

> Tom Lane wrote
>
>>
>> I did some experimentation here and found a rather surprising
>> dependency: the time to delete a bunch of data is pretty much
>> directly proportional to the disk space it occupies.  This says
>> that we're paying through the nose for having XLOG make copies
>> of about-to-be-modified pages.
>
Can't we somehow WAL only metadata and not the actual pages for
DELETEs - as delete is essentially (though currently not technically)
just metadata it should be a possible thing to do.

>> ------------------
>
Hannu




Re: TOAST performance (was Re: [GENERAL] Delete

От
Andrew McMillan
Дата:
On Tue, 2001-11-20 at 10:11, Hannu Krosing wrote:
> 
> > Tom Lane wrote
> >
> >> I did some experimentation here and found a rather surprising
> >> dependency: the time to delete a bunch of data is pretty much
> >> directly proportional to the disk space it occupies.  This says
> >> that we're paying through the nose for having XLOG make copies
> >> of about-to-be-modified pages.
> >
> Can't we somehow WAL only metadata and not the actual pages for
> DELETEs - as delete is essentially (though currently not technically)
> just metadata it should be a possible thing to do.

Is it possible to do ordered writes, the way ext3 does?

http://www-106.ibm.com/developerworks/linux/library/l-fs7/

Is an interesting article discussing the approach.

Regards,                Andrew.
-- 
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267



Re: TOAST performance (was Re: [GENERAL] Delete

От
Hannu Krosing
Дата:
Andrew McMillan wrote:
> 
> On Tue, 2001-11-20 at 10:11, Hannu Krosing wrote:
> >
> > > Tom Lane wrote
> > >
> > >> I did some experimentation here and found a rather surprising
> > >> dependency: the time to delete a bunch of data is pretty much
> > >> directly proportional to the disk space it occupies.  This says
> > >> that we're paying through the nose for having XLOG make copies
> > >> of about-to-be-modified pages.
> > >
> > Can't we somehow WAL only metadata and not the actual pages for
> > DELETEs - as delete is essentially (though currently not technically)
> > just metadata it should be a possible thing to do.
> 
> Is it possible to do ordered writes, the way ext3 does?

I remember it being discussed on this list that you have very little 
control over writing order if you operate above filesystem/cache level.

> http://www-106.ibm.com/developerworks/linux/library/l-fs7/

I guess that is the article that sparked the idea of journalling only 
metadata for deletes (including the delete half of update)

Using the Journaling Block Device described there could actually be 
a good (though currently not portable) solution if you run linux.

-------------
Hannu


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Could we add code to detect a partial write when we recover from one
> > using WAL so we can know if these partial writes are ever
> > happening?
> 
> What's your point?  It clearly *can* happen during power-failure
> scenarios.  All the monitoring in the world won't disprove that.

What bothers me about this is that we have the original page with the
old data.  It would be nice if we could write the new page in a
different location, make the new page active and recycle the old page at
some later time.

We are storing the pre-page image in WAL, but it seems like a waste
because we already have a pre-image.  It is just that we are overwriting
it.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> What bothers me about this is that we have the original page with the
> old data.  It would be nice if we could write the new page in a
> different location, make the new page active and recycle the old page at
> some later time.

I don't see how that reduces the total amount of disk traffic?

It's also kind of unclear how to do it without doubling (or worse) the
amount of table space used in many common scenarios.  I doubt many
people will be happy if "DELETE FROM foo" requires transient space equal
to twice the original size of foo.
        regards, tom lane


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > What bothers me about this is that we have the original page with the
> > old data.  It would be nice if we could write the new page in a
> > different location, make the new page active and recycle the old page at
> > some later time.
> 
> I don't see how that reduces the total amount of disk traffic?
> 
> It's also kind of unclear how to do it without doubling (or worse) the
> amount of table space used in many common scenarios.  I doubt many
> people will be happy if "DELETE FROM foo" requires transient space equal
> to twice the original size of foo.

Well, right now we write the pre-image to WAL, then write the new page
over the old one.  In my case, you just write the new, and somewhere
record that the old page is no longer active.  Sounds a little like
VACUUM, but for pages.

With DELETE FROM foo, let's suppose you have 10 pages in the table.  To
modify page 1, you write to page 11, then record in WAL that page 1 is
inactive.  To write page 2, you write to page 1 and record page 2 as
inactive, etc.  You basically are writing your new data one behind.

One problem I see is that you don't really know the pages are on disk so
I am not sure how to be safe when over-writing the inactive pages.

Of course, I am just throwing out ideas, looking for a solution.  Help!

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> I don't see how that reduces the total amount of disk traffic?

> Well, right now we write the pre-image to WAL, then write the new page
> over the old one.  In my case, you just write the new, and somewhere
> record that the old page is no longer active.

The devil is in the details of that last little bit.  How is "mark a
page inactive" cheaper than "mark a tuple dead"?  More specifically,
how do you propose to avoid WAL-logging the page you are going to do
this marking in?  Seems you still end up with a WAL page image for
something.
        regards, tom lane


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> I don't see how that reduces the total amount of disk traffic?
> 
> > Well, right now we write the pre-image to WAL, then write the new page
> > over the old one.  In my case, you just write the new, and somewhere
> > record that the old page is no longer active.
> 
> The devil is in the details of that last little bit.  How is "mark a
> page inactive" cheaper than "mark a tuple dead"?  More specifically,
> how do you propose to avoid WAL-logging the page you are going to do
> this marking in?  Seems you still end up with a WAL page image for
> something.

I was thinking of just throwing the inactive page number into WAL.  Much
smaller than the entire page image.  You don't touch the page.  Does
that help?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I was thinking of just throwing the inactive page number into WAL.  Much
> smaller than the entire page image.  You don't touch the page.  Does
> that help?

I don't think so.  Somehow you have to tell the other backends that that
page is dead; merely recording it in WAL doesn't do that.

More to the point, you can't recycle (overwrite) that page until you've
checkpointed or WAL-logged the replacement page; so you still end up
with disk I/O for the replacement.
        regards, tom lane


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Hannu Krosing
Дата:
Tom Lane wrote:
> 
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > What bothers me about this is that we have the original page with the
> > old data.  It would be nice if we could write the new page in a
> > different location, make the new page active and recycle the old page at
> > some later time.
> 
> I don't see how that reduces the total amount of disk traffic?
> 
> It's also kind of unclear how to do it without doubling (or worse) the
> amount of table space used in many common scenarios.  I doubt many
> people will be happy if "DELETE FROM foo" requires transient space equal
> to twice the original size of foo.

IIRC the double space requrement is what has kept us from implementing 
DROP COLUMN.

-----------
Hannu


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Hannu Krosing
Дата:
Tom Lane wrote:
> 
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> I don't see how that reduces the total amount of disk traffic?
> 
> > Well, right now we write the pre-image to WAL, then write the new page
> > over the old one.  In my case, you just write the new, and somewhere
> > record that the old page is no longer active.
> 
> The devil is in the details of that last little bit.  How is "mark a
> page inactive" cheaper than "mark a tuple dead"?  More specifically,
> how do you propose to avoid WAL-logging the page you are going to do
> this marking in?  Seems you still end up with a WAL page image for
> something.

Assuming that we WAL with the granularity of disk sector (512b) I think 
that restructuring of database heap page (8kb) would be a big win for 
delete/update.

The idea is to move metadata (oid,tableoid,xmin,cmin,xmax,cmax,ctid) 
to the beginning of heap page to the same space with tuple pointers. 
It's easy (<grin>) as all of it is fixed length.
Then a change in metadata like setting xmax for deleted/updated tuple
will dirty only the first disk page and not all of them.

The new structure of ItemId will be (itemId-pointer nbitsitemId-flags   32-n bitsoid,tableoid,xmin,cmin,xmax,cmax,ctid
)

Assuming that we do account of dirty pages and WAL with the granularity 
of database page we may get a big win by just moving to smaller
gramularity.

The win from increasing cranularity was not very big before WAL, as the 
database pages are continuous on disk, but will be significant when we 
have to log all dirty pages.

------------------
Hannu


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
"Zeugswetter Andreas SB SD"
Дата:
> With DELETE FROM foo, let's suppose you have 10 pages in the table.
To
> modify page 1, you write to page 11

But what with the indexes ? They would all need to be modified
accordingly.
If you did something like chaining, then before long all tuples would be

chained, even those that were not touched.

If you really want to avoid the page writes to WAL, imho the best way
would be 
to revive the original PG page design where the physical position of
slots in a 
heap page where only changed by vacuum.

Then, a heap page that was only partly written would only be a problem
iff 
the hardware wrote wrong data, not if it only skipped part of the write.

Reasonable hardware does detect such corrupted pages.
E.g. on AIX if you reduce the PG pagesize to 4k, an only partly written
page 
that stays undetected can be ruled out.

Then you would only need to write index pages to WAL, but not heap
pages.

Maybe a better idea would be to only conditionally write pages to WAL if
slot 
positions changed. In the "delete" example heap slot positions certainly
do 
not need to change. 
To be extra safe it would probably be necessary to not split tuple
headers
(at least the xact info) across physical pages. Then it would also be
safe to 
use a pg pagesize that is a multiple of the physical page size.

or so ? ...
Andreas


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
"Christopher Kings-Lynne"
Дата:
> > It's also kind of unclear how to do it without doubling (or worse) the
> > amount of table space used in many common scenarios.  I doubt many
> > people will be happy if "DELETE FROM foo" requires transient space equal
> > to twice the original size of foo.
>
> IIRC the double space requrement is what has kept us from implementing
> DROP COLUMN.

The correct solution then, according methinks to my old Human Computer
Interaction lecturer, is to implement the feature anyway, and warn the DBA
what the consequences are.  That way, the DBA can do it if she wants, unlike
the current situation where it's next to impossible (with lots of
referencing foreign keys).

Chris



Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> > > It's also kind of unclear how to do it without doubling (or worse) the
> > > amount of table space used in many common scenarios.  I doubt many
> > > people will be happy if "DELETE FROM foo" requires transient space equal
> > > to twice the original size of foo.
> >
> > IIRC the double space requrement is what has kept us from implementing
> > DROP COLUMN.
> 
> The correct solution then, according methinks to my old Human Computer
> Interaction lecturer, is to implement the feature anyway, and warn the DBA
> what the consequences are.  That way, the DBA can do it if she wants, unlike
> the current situation where it's next to impossible (with lots of
> referencing foreign keys).

Yes, I personally am going to try this for 7.3, as well as fix CLUSTER. 
I think someone has already started on CLUSTER anyway.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Is it because we take a snapshot of the page before we write it in case
> > we only write part of the page?
> 
> AFAIR, the partial-page-write problem is the entire reason for doing it.
> If we could be certain that writes to datafile pages were atomic, we'd
> not need this.
> 
> Of course we can't be certain of that.  But I'm wondering if there isn't
> a cheaper solution.

I have added these TODO items to summarize this discussion:

* Reduce number of pre-page WAL writes; they exist only to gaurd against partial page writes
* Turn off pre-page writes if fsync is disabled (?)

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Brian Beuning
Дата:
If "pre-page WAL write" means the value of the page before the current
changes, then there is generally another reason for writing it out.

Imagine this sequence of events:
1. transaction A starts
2. transaction B starts
3. tran A makes a change
4. tran B makes a change
5. tran A commits
6. all changes get written to disk (this can happen even without fsync,   for example tran C might do a full table scan
whichfills the buffer cache   before B commits)
 
7. the system crashes

When the system comes back up, we need to do a rollback on
transaction B since it did not commit and we need the "pre-page"
to know how to undo the change for B that got saved in step 6 above.

At least this is what happens in most DBMSs...

Brian Beuning



Bruce Momjian wrote:

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Is it because we take a snapshot of the page before we write it in case
> > > we only write part of the page?
> >
> > AFAIR, the partial-page-write problem is the entire reason for doing it.
> > If we could be certain that writes to datafile pages were atomic, we'd
> > not need this.
> >
> > Of course we can't be certain of that.  But I'm wondering if there isn't
> > a cheaper solution.
>
> I have added these TODO items to summarize this discussion:
>
> * Reduce number of pre-page WAL writes; they exist only to gaurd against
>   partial page writes
> * Turn off pre-page writes if fsync is disabled (?)
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)




Re: TOAST performance (was Re: [GENERAL] Delete Performance)

От
Bruce Momjian
Дата:
> If "pre-page WAL write" means the value of the page before the current
> changes, then there is generally another reason for writing it out.
> 
> Imagine this sequence of events:
> 1. transaction A starts
> 2. transaction B starts
> 3. tran A makes a change
> 4. tran B makes a change
> 5. tran A commits
> 6. all changes get written to disk (this can happen even without fsync,
>     for example tran C might do a full table scan which fills the buffer cache
>     before B commits)
> 7. the system crashes
> 
> When the system comes back up, we need to do a rollback on
> transaction B since it did not commit and we need the "pre-page"
> to know how to undo the change for B that got saved in step 6 above.
> 
> At least this is what happens in most DBMSs...

Because we have a non-overwriting storage manager, I don't think this
issue applies to us.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026