Discussion: Re: [PATCHES] WAL logging freezing


Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I would've liked to give freezing a new opcode,
>   but we've run out of them (see htup.h).

Hardly ... we have plenty of unused rmgr id's still.

The real issue that still has to be resolved is the interaction of all
this stuff with PITR scenarios --- is it still safe to not log hint-bit
updates when PITR is on?  I think it's premature to start writing
patches until we've decided how this really needs to work.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I would've liked to give freezing a new opcode,
>>   but we've run out of them (see htup.h).
>
> Hardly ... we have plenty of unused rmgr id's still.

Good point.

> The real issue that still has to be resolved is the interaction of all
> this stuff with PITR scenarios --- is it still safe to not log hint-bit
> updates when PITR is on?  I think it's premature to start writing
> patches until we've decided how this really needs to work.

Not logging hint-bit updates seems safe to me. As long as we have the
clog, the hint-bit is just a hint. The problem with freezing is that
after freezing tuples, the corresponding clog page can go away.
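As a rough C sketch of the point (illustrative names only, not the actual PostgreSQL code): the hint bit is just a cache of the authoritative clog answer, so a lost hint-bit update is harmless as long as the clog page still exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define HEAP_XMIN_COMMITTED 0x0100   /* hint: xmin known committed */

typedef struct
{
    TransactionId xmin;
    uint16_t      infomask;
} TupleHeaderSketch;

/* Stand-in for a pg_clog lookup; only usable for xids whose clog
 * page has not been truncated away. Dummy rule for this sketch:
 * even xids committed, odd xids aborted. */
static bool
clog_says_committed(TransactionId xid)
{
    return (xid % 2) == 0;
}

static bool
xmin_committed(const TupleHeaderSketch *tup)
{
    if (tup->infomask & HEAP_XMIN_COMMITTED)
        return true;                        /* hint answers directly */
    return clog_says_committed(tup->xmin);  /* authoritative fallback */
}
```

Once tuples are frozen and the clog page truncated, the fallback path is gone, which is why freezing (unlike hint-bit setting) must be WAL-logged.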

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I think it's premature to start writing
>> patches until we've decided how this really needs to work.

> Not logging hint-bit updates seems safe to me. As long as we have the
> clog, the hint-bit is just a hint. The problem with freezing is that
> after freezing tuples, the corresponding clog page can go away.

Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog.  Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation.  Maybe we should somehow not xlog updates of datvacuumxid?

Another thing I'm concerned about is the scenario where a PITR
hot-standby machine tracks a master over a period of more than 4 billion
transactions.  I'm not sure what will happen in the slave's pg_clog
directory, but I'm afraid it won't be good :-(

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> [I've just coded the relcache invalidation WAL logging patch also.]

What?  That doesn't make any sense to me.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
"Simon Riggs"
Date:
On Fri, 2006-10-27 at 12:01 -0400, Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> > Tom Lane wrote:
> >> I think it's premature to start writing
> >> patches until we've decided how this really needs to work.
>
> > Not logging hint-bit updates seems safe to me. As long as we have the
> > clog, the hint-bit is just a hint. The problem with freezing is that
> > after freezing tuples, the corresponding clog page can go away.
>
> Actually clog can go away much sooner than that, at least in normal
> operation --- that's what datvacuumxid is for, to track where we can
> truncate clog.

So we definitely have a nasty problem here.

VACUUM FREEZE is just a loaded gun right now.

> Maybe it's OK to say that during WAL replay we keep it
> all the way back to the freeze horizon, but I'm not sure how we keep the
> system from wiping clog it still needs right after switching to normal
> operation.  Maybe we should somehow not xlog updates of datvacuumxid?

Thinking...

Also, we should probably be setting all the hint bits for pages during
recovery then, so we don't need to re-write them again later.

> Another thing I'm concerned about is the scenario where a PITR
> hot-standby machine tracks a master over a period of more than 4 billion
> transactions.  I'm not sure what will happen in the slave's pg_clog
> directory, but I'm afraid it won't be good :-(

I think we'll need to error-out at that point, plus produce messages
when we pass 2 billion transactions recovered. It makes sense to produce
a new base backup regularly anyway.

We'll also need to produce an error message on the primary server so
that we take a new base backup every 2 billion transactions.

There are better solutions, but I'm not sure it makes sense to try to
fix this right now, since that could well delay the release. If we think
a better fix is necessary for the 8.2 line, we could get one into 8.2.1.

[I've just coded the relcache invalidation WAL logging patch also.]

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] WAL logging freezing

From
"Simon Riggs"
Date:
On Fri, 2006-10-27 at 22:19 +0100, Simon Riggs wrote:

> So we definitely have a nasty problem here.
>
> VACUUM FREEZE is just a loaded gun right now.
>
> > Maybe it's OK to say that during WAL replay we keep it
> > all the way back to the freeze horizon, but I'm not sure how we keep the
> > system from wiping clog it still needs right after switching to normal
> > operation.  Maybe we should somehow not xlog updates of datvacuumxid?
>
> Thinking...

Suggestions:

1. Create a new Utility rmgr that can issue XLOG_UTIL_FREEZE messages
for each block that has had any tuples frozen on it during normal
VACUUMs. We need log only the relid, blockid and vacuum's xid to redo
the freeze operation.
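For suggestion 1, the redo record could indeed be tiny. A hypothetical C layout (type and field names are illustrative, not the actual implementation):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t BlockNumber;

typedef struct
{
    uint32_t spcNode;   /* tablespace */
    uint32_t dbNode;    /* database */
    uint32_t relNode;   /* relation */
} RelFileNodeSketch;

/* Everything redo needs to re-freeze a page: which heap block,
 * and the vacuum's freeze cutoff xid. */
typedef struct
{
    RelFileNodeSketch node;       /* relation whose page was frozen */
    BlockNumber       block;      /* heap block with frozen tuples */
    TransactionId     cutoff_xid; /* cutoff used to redo the freeze */
} xl_util_freeze_sketch;
```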

2. VACUUM FREEZE need not generate any additional WAL records, but will
do an immediate sync following execution and before clog truncation.
That way the large number of changed blocks will all reach disk before
we do the updates to the catalog.

3. We don't truncate the clog during WAL replay, so the clog will grow
during recovery. Nothing to do there to make things safe.

4. When InArchiveRecovery we should set all of the datminxid and
datvacuumxid fields to be the Xid from where recovery started, so that
clog is not truncated soon after recovery. Performing a VACUUM FREEZE
after a recovery would be mentioned as an optional task at the end of a
PITR recovery on a failover/second server.

5. At 3.5 billion records during recovery we should halt the replay, do
a full database scan to set hint bits, truncate clog, then restart
replay (automatically, within the recovery process).

6. During WAL replay, put out a warning message every 1 billion records
saying that a hint bit scan will eventually be required if recovery
continues.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] WAL logging freezing

From
"Heikki Linnakangas"
Date:
We just discussed this in detail with Simon, and it looks like we have
5 (!) different but related problems:

1) The original problem of freeze then crash, leaving too high values in
relminxid and datminxid. If you then run vacuum, it might truncate CLOG
and you lose the commit status of the records that were supposed to be
frozen.

To fix this, we need to WAL log freezing as already discussed.

2) vactuple_get_minxid doesn't take into account xmax's of tuples that
have HEAP_XMAX_INVALID set. That's a problem:

transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
transaction 1001 - ROLLBACK;
transaction 1002 - VACUUM foo;
crash

VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set
(possibly by vacuum itself) on the tuple touched by the rolled-back
deletion. However, that hint-bit update hasn't hit the disk yet, so
after recovery the tuple will have an xmax of 1001 with no hint bit,
while relminxid is 1002.

The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint
bit, and take any xmax other than InvalidXid into account when
calculating the relminxid.

3) If you recover from a PITR backup (or have a hot stand-by), with base
backup that's more than 4 billion transactions older than the newest WAL
record, the clog entries of old transactions in the base backup will
overlap with the clog entries of new transactions that are in the WAL
records. This is the problem you also pointed out below.

To fix this, we need to emit a WAL record when truncating the clog. We
must also make sure that recovery of any WAL record type doesn't rely on
clog, because if we truncate the clog and then crash, recovery won't
have the clog available for the old transactions. At the moment,
TruncateCLog issues a checkpoint to protect from that but that's not
going to work when rolling forward logs in PITR, right?

4) If we fix issue 2 so that vactuple_get_minxid always takes xmax into
account, even if HEAP_XMAX_INVALID is set, a tuple with an aborted xmax
will keep us from advancing relminxid and truncating clog etc. That
doesn't lead to data corruption, but you will eventually hit the
transaction wrap-around limit. We don't have the same problem with xmin,
because we freeze tuples that are older than FreezeLimit to avoid it,
but we don't do that for xmax.

To fix this, replace any xmax older than FreezeLimit with InvalidXid
during vacuum. That also needs to be WAL logged.

5) We don't freeze tuples that are in RECENTLY_DEAD or
DELETE_IN_PROGRESS state. That doesn't lead to data corruption, but it
might make you hit the transaction wrap-around limit. That can happen if
you have a transaction that deletes or updates a very old, but not yet
frozen tuple. If you run vacuum while the deleting transaction is in
progress, vacuum won't freeze the tuple, and won't advance the
wrap-around limit because of the old tuple. That's not serious if the
deleting transaction commits, because the next vacuum will then remove
the tuple, but if it aborts, we might run into the same problem on the
next vacuum, and the next one, and the next one, until we reach the
wrap-around.

To fix this, simply do the freezing for tuples in RECENTLY_DEAD and
DELETE_IN_PROGRESS states as well.

Am I missing something? Finding this many bugs makes me nervous...

Simon volunteered to make the clog changes for 3 because it's a PITR
related issue. I can write a patch/patches for the other changes if it
helps.

Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I think it's premature to start writing
>>> patches until we've decided how this really needs to work.
>
>> Not logging hint-bit updates seems safe to me. As long as we have the
>> clog, the hint-bit is just a hint. The problem with freezing is that
>> after freezing tuples, the corresponding clog page can go away.
>
> Actually clog can go away much sooner than that, at least in normal
> operation --- that's what datvacuumxid is for, to track where we can
> truncate clog.  Maybe it's OK to say that during WAL replay we keep it
> all the way back to the freeze horizon, but I'm not sure how we keep the
> system from wiping clog it still needs right after switching to normal
> operation.  Maybe we should somehow not xlog updates of datvacuumxid?
>
> Another thing I'm concerned about is the scenario where a PITR
> hot-standby machine tracks a master over a period of more than 4 billion
> transactions.  I'm not sure what will happen in the slave's pg_clog
> directory, but I'm afraid it won't be good :-(
>
>             regards, tom lane
>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [PATCHES] WAL logging freezing

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I think it's premature to start writing
>>> patches until we've decided how this really needs to work.
>
>> Not logging hint-bit updates seems safe to me. As long as we have the
>> clog, the hint-bit is just a hint. The problem with freezing is that
>> after freezing tuples, the corresponding clog page can go away.
>
> Actually clog can go away much sooner than that, at least in normal
> operation --- that's what datvacuumxid is for, to track where we can
> truncate clog.  Maybe it's OK to say that during WAL replay we keep it
> all the way back to the freeze horizon, but I'm not sure how we keep the
> system from wiping clog it still needs right after switching to normal
> operation.  Maybe we should somehow not xlog updates of datvacuumxid?

Oh, I just understood what you were trying to say. We really don't keep
clog around all the way back to the freeze horizon.

Well, how about doing just that? With a FreezeLimit of 1 billion
transactions, that's 250 megabytes of clog (two bits per transaction).
We could freeze more aggressively to make it less.

We can't just not xlog updates of datvacuumxid, because that buffer
might still be written to disk before all the hint-bit updates. One
trick would be to include the old value in the WAL record. WAL recovery
would have to keep track of those records, and reset any datvacuumxid to
the old value if there's no checkpoint record afterwards.
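A hypothetical record layout for that trick (names illustrative only): replay installs the new value, but keeps the old one so the field can be rolled back if no checkpoint record follows before recovery ends.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
    uint32_t      db_oid;            /* pg_database row being updated */
    TransactionId old_datvacuumxid;  /* value before this vacuum */
    TransactionId new_datvacuumxid;  /* value being installed */
} xl_datvacuumxid_sketch;

/* Which value should recovery leave behind? Only keep the new one
 * if a checkpoint record was replayed after this update. */
static TransactionId
value_after_recovery(const xl_datvacuumxid_sketch *rec,
                     int checkpoint_followed)
{
    return checkpoint_followed ? rec->new_datvacuumxid
                               : rec->old_datvacuumxid;
}
```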

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [PATCHES] WAL logging freezing

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> We just discussed this in detail with Simon, and it looks like we have
> 5 (!) different but related problems:

Wow, four of them are mine :-(

> 2) vactuple_get_minxid doesn't take into account xmax's of tuples that
> have HEAP_XMAX_INVALID set. That's a problem:
>
> transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
> transaction 1001 - ROLLBACK;
> transaction 1002 - VACUUM foo;
> crash
>
> VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set
> (possibly by vacuum itself) on the tuple touched by the rolled-back
> deletion. However, that hint-bit update hasn't hit the disk yet, so
> after recovery the tuple will have an xmax of 1001 with no hint bit,
> while relminxid is 1002.
>
> The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint
> bit, and take any xmax other than InvalidXid into account when
> calculating the relminxid.

Ugh.  Is there another solution to this?  Say, sync the buffer so that
the hint bits are written to disk?  The bug (4) below is problematic if
you take this approach; basically it removes all the optimization won by
the relminxid patch.


> Simon volunteered to make the clog changes for 3 because it's a PITR
> related issue. I can write a patch/patches for the other changes if it
> helps.

I'm swamped at the moment, so I'd appreciate it.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Ugh.  Is there another solution to this?  Say, sync the buffer so that
> the hint bits are written to disk?

Yeah.  The original design for all this is explained by the notes for
TruncateCLOG:

 * When this is called, we know that the database logically contains no
 * reference to transaction IDs older than oldestXact.    However, we must
 * not truncate the CLOG until we have performed a checkpoint, to ensure
 * that no such references remain on disk either; else a crash just after
 * the truncation might leave us with a problem.

The pre-8.2 coding is actually perfectly safe within a single database,
because TruncateCLOG is only called at the end of a database-wide
vacuum, and so the checkpoint is guaranteed to have flushed valid hint
bits for all tuples to disk.  There is a risk in other databases though.
I think that in the 8.2 structure the equivalent notion must be that
VACUUM has to flush and fsync a table before it can advance the table's
relminxid.

That still leaves us with the problem of hint bits not being updated
during WAL replay.  I think the best solution for this is for WAL replay
to force relvacuumxid to equal relminxid (btw, these field names seem
poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
rather than adopting the value shown in the WAL record.  This probably
is best done by abandoning the generic "overwrite tuple" WAL record type
in favor of something specific to minxid updates.  The effect would then
be that a PITR slave would not truncate its clog beyond the freeze
horizon until it had performed a vacuum of its own.

The point about aborted xmax being a risk factor is a good one.  I don't
think the risk is material for ordinary crash recovery scenarios,
because ordinarily we'd have many opportunities to set the hint bit
before anything really breaks, but it's definitely an issue for
long-term PITR replay scenarios.

I'll work on this as soon as I get done with the btree-index issue I'm
messing with now.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
"Simon Riggs"
Date:
On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Ugh.  Is there another solution to this?  Say, sync the buffer so that
> > the hint bits are written to disk?
>
> Yeah.  The original design for all this is explained by the notes for
> TruncateCLOG:
>
>  * When this is called, we know that the database logically contains no
>  * reference to transaction IDs older than oldestXact.    However, we must
>  * not truncate the CLOG until we have performed a checkpoint, to ensure
>  * that no such references remain on disk either; else a crash just after
>  * the truncation might leave us with a problem.
>
> The pre-8.2 coding is actually perfectly safe within a single database,
> because TruncateCLOG is only called at the end of a database-wide
> vacuum, and so the checkpoint is guaranteed to have flushed valid hint
> bits for all tuples to disk.  There is a risk in other databases though.
> I think that in the 8.2 structure the equivalent notion must be that
> VACUUM has to flush and fsync a table before it can advance the table's
> relminxid.

Ouch! We did discuss that also. Flushing the buffer cache is nasty with
very large caches, so this makes autovacuum much less friendly - and
could take a seriously long time if you enforce the vacuum delay
costings.

ISTM we only need to flush iff the clog would be truncated when we
update relminxid. Otherwise we are safe to update even if we crash,
since the clog will not have been truncated.

> That still leaves us with the problem of hint bits not being updated
> during WAL replay.  I think the best solution for this is for WAL replay
> to force relvacuumxid to equal relminxid (btw, these field names seem
> poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
> rather than adopting the value shown in the WAL record.  This probably
> is best done by abandoning the generic "overwrite tuple" WAL record type
> in favor of something specific to minxid updates.  The effect would then
> be that a PITR slave would not truncate its clog beyond the freeze
> horizon until it had performed a vacuum of its own.

Sounds good. Methinks we do still need the TruncateCLOG patch to ensure
we do WAL replay for the truncation? I'm posting that now to -patches as
a prototype.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Attachments

Re: [PATCHES] WAL logging freezing

From
Alvaro Herrera
Date:
Simon Riggs wrote:
> On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > Ugh.  Is there another solution to this?  Say, sync the buffer so that
> > > the hint bits are written to disk?
> >
> > Yeah.  The original design for all this is explained by the notes for
> > TruncateCLOG:
> >
> >  * When this is called, we know that the database logically contains no
> >  * reference to transaction IDs older than oldestXact.    However, we must
> >  * not truncate the CLOG until we have performed a checkpoint, to ensure
> >  * that no such references remain on disk either; else a crash just after
> >  * the truncation might leave us with a problem.
> >
> > The pre-8.2 coding is actually perfectly safe within a single database,
> > because TruncateCLOG is only called at the end of a database-wide
> > vacuum, and so the checkpoint is guaranteed to have flushed valid hint
> > bits for all tuples to disk.  There is a risk in other databases though.
> > I think that in the 8.2 structure the equivalent notion must be that
> > VACUUM has to flush and fsync a table before it can advance the table's
> > relminxid.
>
> Ouch! We did discuss that also. Flushing the buffer cache is nasty with
> very large caches, so this makes autovacuum much less friendly - and
> could take a seriously long time if you enforce the vacuum delay
> costings.
>
> ISTM we only need to flush iff the clog would be truncated when we
> update relminxid. Otherwise we are safe to update even if we crash,
> since the clog will not have been truncated.

I don't understand.  When clog is actually going to be truncated, if
it's determined that there's any page that can be truncated, then a
checkpoint is forced.  If no page is going to be removed then there's no
checkpoint, which makes a lot of sense and of course avoids the problem
of useless flushes.

In fact I don't understand the point about multiple databases vs.
a single database.  Surely a checkpoint would flush all buffers in all
databases, no?  This would flush all hint bits, everywhere.  So this bug
does not really exist.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> ISTM we only need to flush iff the clog would be truncated when we
> update relminxid.

Wrong :-(  If the relvacuumxid change (not relminxid ... as I said, these
names aren't very transparent) makes it to disk but not all the hint
bits do, you're at risk.  Crash, restart, vacuum some other table, and
*now* the global min vacuumxid advances.  The fact that we're
WAL-logging the relvacuumxid change makes this scenario exceedingly
probable, if no action is taken to force out the hint bits.

The only alternative I can see is the one Heikki suggested: don't
truncate clog until the freeze horizon.  That's safe (given the planned
change to WAL-log tuple freezing) and clean and simple, but a permanent
requirement of 250MB+ for pg_clog would put the final nail in the coffin
of PG's usability in small-disk-footprint environments.  So I don't like
it much.  I suppose it could be made more tolerable by reducing the
freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
GUC parameter?  In a high-volume DB you'd want the larger setting to
minimize the amount of tuple freezing work.  OTOH it seems like making
this configurable creates a nasty risk for PITR situations: a slave
that's configured with a smaller freeze window than the master is
probably not safe.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> In fact I don't understand what's the point about multiple databases vs.
> a single database.  Surely a checkpoint would flush all buffers in all
> databases, no?

Yeah --- all the ones that are dirty *now*.  Consider the case where you
vacuum DB X, update its datvacuumxid, and don't checkpoint because the
global min didn't advance.  Now you crash, possibly leaving some hint
bits unwritten; but the datvacuumxid change did make it to disk.  After
restart, vacuum DB Y, update its datvacuumxid, and find that the global
min *did* advance.  You checkpoint, and that guarantees that DB Y is
clean for the clog truncation.  But DB X isn't.

The 8.2 changes have created the equivalent risk at the level of each
individual table.  We can't write a vacuumxid change unless we are sure
that the hint-bit changes it promises are actually down to disk.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
"Simon Riggs"
Date:
On Mon, 2006-10-30 at 16:58 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > ISTM we only need to flush iff the clog would be truncated when we
> > update relminxid.
>
> Wrong :-(  If the relvacuumxid change (not relminxid ... as I said, these
> names aren't very transparent) makes it to disk but not all the hint
> bits do, you're at risk.  Crash, restart, vacuum some other table, and
> *now* the global min vacuumxid advances.  The fact that we're
> WAL-logging the relvacuumxid change makes this scenario exceedingly
> probable, if no action is taken to force out the hint bits.

I don't agree: If the truncation points are at 1 million, 2 million etc,
then if we advance the relvacuumxid from 1.2 million to 1.5 million,
then crash, the hint bits for that last vacuum are lost. Sounds bad,
but we have not truncated clog, so there is no danger. In order to
truncate up to 2 million we need to re-vacuum; at that point we discover
that the 1.5 million setting was wrong, realise it should have been 1.2
million but don't care because we now set it to 1.8 million etc. No
problem, even with repeated crashes. We only flush when we move the
counter past a truncation point.

If you look at this another way, maybe you'll see what I'm saying: Only
update relvacuumxid iff the update would allow us to truncate the clog.
That way we leap forwards in 1 million Xid chunks, rounded down. No
change to clog => no update => no danger that we need to flush to avoid.
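The rule being argued for can be sketched in a few lines of C, using the thread's round numbers (a 1-million-xid truncation granularity; this ignores xid wraparound and is illustrative only): flush only when the old and new relvacuumxid fall in different chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define TRUNC_CHUNK 1000000u   /* example truncation granularity */

/* An advance of relvacuumxid requires flushing hint bits only if
 * it crosses a truncation point, i.e. old and new values land in
 * different TRUNC_CHUNK-sized chunks. */
static bool
crosses_truncation_point(TransactionId oldxid, TransactionId newxid)
{
    return (oldxid / TRUNC_CHUNK) != (newxid / TRUNC_CHUNK);
}
```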

> The only alternative I can see is the one Heikki suggested: don't
> truncate clog until the freeze horizon.  That's safe (given the planned
> change to WAL-log tuple freezing) and clean and simple, but a permanent
> requirement of 250MB+ for pg_clog would put the final nail in the coffin
> of PG's usability in small-disk-footprint environments.  So I don't like
> it much.  I suppose it could be made more tolerable by reducing the
> freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
> GUC parameter?  In a high-volume DB you'd want the larger setting to
> minimize the amount of tuple freezing work.  OTOH it seems like making
> this configurable creates a nasty risk for PITR situations: a slave
> that's configured with a smaller freeze window than the master is
> probably not safe.

If we need to, just put the CLOG seg size in pg_config_manual.h

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> I don't agree: If the truncation points are at 1 million, 2 million etc,
> then if we advance the relvacuumxid from 1.2 million to 1.5 million,
> then crash, the hint bits for that last vacuum are lost. Sounds bad,
> but we have not truncated clog, so there is no danger.

You're still wrong though.  Suppose that VACUUM moves a particular rel's
relvacuumxid from 1.9 to 2.1 million, but because this rel is not
currently the oldest vacuumxid, it doesn't truncate clog.  Then we crash
and lose hint bits, but not the relvacuumxid change.  Then VACUUM
vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
million --- but this time that *was* the globally oldest value, and now
we think we can truncate clog at 2 million.  But the first rel might
still have some unhinted xids around 1.9 million.

> If you look at this another way, maybe you'll see what I'm saying: Only
> update relvacuumxid iff the update would allow us to truncate the clog.

Then you'll never update it at all, because there will always be some
other rel constraining the global min.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
"Simon Riggs"
Date:
On Mon, 2006-10-30 at 19:18 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > I don't agree: If the truncation points are at 1 million, 2 million etc,
> > then if we advance the relvacuumxid from 1.2 million to 1.5 million,
> > then crash, the hint bits for that last vacuum are lost. Sounds bad,
> > but we have not truncated clog, so there is no danger.
>
> You're still wrong though.

Frequently, I'd say :-)

>  Suppose that VACUUM moves a particular rel's
> relvacuumxid from 1.9 to 2.1 million, but because this rel is not
> currently the oldest vacuumxid, it doesn't truncate clog.  Then we crash
> and lose hint bits, but not the relvacuumxid change.  Then VACUUM
> vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
> million --- but this time that *was* the globally oldest value, and now
> we think we can truncate clog at 2 million.  But the first rel might
> still have some unhinted xids around 1.9 million.

That was understood; in the above example I agree you need to flush. If
you don't pass a truncation point, you don't need to flush whether or
not you actually truncate. So we don't need to flush *every* time, so
IMHO we don't need to play safe and keep clog the size of an iceberg.

Anyway, if PITR is safe again, I'd like to sleep....zzzzzzzzzzzzzz

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] WAL logging freezing

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> That was understood; in the above example I agree you need to flush. If
> you don't pass a truncation point, you don't need to flush whether or
> not you actually truncate. So we don't need to flush *every* time,

OK, but does that actually do much of anything for your performance
complaint?  Just after GlobalXmin has passed a truncation point, *every*
vacuum the system does will start performing a flush-n-fsync, which
seems like exactly what you didn't like.  If the syncs were spread out
in time for different rels then maybe this idea would help, but AFAICS
they won't be.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> The only alternative I can see is the one Heikki suggested: don't
> truncate clog until the freeze horizon.  That's safe (given the planned
> change to WAL-log tuple freezing) and clean and simple, but a permanent
> requirement of 250MB+ for pg_clog would put the final nail in the coffin
> of PG's usability in small-disk-footprint environments.  So I don't like
> it much.  I suppose it could be made more tolerable by reducing the
> freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
> GUC parameter?  In a high-volume DB you'd want the larger setting to
> minimize the amount of tuple freezing work.  OTOH it seems like making
> this configurable creates a nasty risk for PITR situations: a slave
> that's configured with a smaller freeze window than the master is
> probably not safe.

If we go down that route, we really should make it a GUC parameter, and
reduce the default at least for 8_1_STABLE.

I got another idea. If we make sure that vacuum removes any aborted xid
older than OldestXmin from the table, we can safely assume that any xid
< the current clog truncation point we are going to be interested in is
committed. Vacuum already removes any tuple with an aborted xmin. If we
also set any aborted xmax (and xvac) to InvalidXid, and WAL logged that,
we would know that after vacuum commits, any xid < relvacuumxid in the
vacuumed table was committed, regardless of the hint bits. We could then
safely truncate the clog without flushing anything. This also seems safe
for PITR.
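The proposed invariant can be sketched as follows (illustrative names; the real change would be inside vacuum's tuple scan and would be WAL-logged): after vacuum, any xid older than the truncation point still present in the table must have committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidXid ((TransactionId) 0)

/* During vacuum: an xmax that belongs to an aborted transaction and
 * is older than OldestXmin is reset to InvalidXid, so no aborted
 * xid older than the clog truncation point survives in the heap. */
static TransactionId
maybe_clear_xmax(TransactionId xmax, TransactionId oldest_xmin,
                 bool xmax_aborted)
{
    if (xmax_aborted && xmax != InvalidXid && xmax < oldest_xmin)
        return InvalidXid;   /* aborted and old enough: wipe it */
    return xmax;             /* committed or still recent: keep it */
}
```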

The only performance hit would be the clearing of xmax values of aborted
transactions, but that doesn't seem too bad to me because most
transactions commit.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [PATCHES] WAL logging freezing

From
"Simon Riggs"
Date:
On Mon, 2006-10-30 at 20:40 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > That was understood; in the above example I agree you need to flush. If
> > you don't pass a truncation point, you don't need to flush whether or
> > not you actually truncate. So we don't need to flush *every* time,
>
> OK, but does that actually do much of anything for your performance
> complaint?  Just after GlobalXmin has passed a truncation point, *every*
> vacuum the system does will start performing a flush-n-fsync, which
> seems like exactly what you didn't like.  If the syncs were spread out
> in time for different rels then maybe this idea would help, but AFAICS
> they won't be.

Makes sense, so we shouldn't do it that way after all.

Are you OK with the other patches I've submitted? My understanding was
that you're gonna have a look at those and this general area? I don't
want to hold up the release because of a PITR patch.

Feedback welcome ;-)

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] WAL logging freezing

От
Tom Lane
Дата:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I got another idea. If we make sure that vacuum removes any aborted xid
> older than OldestXmin from the table, we can safely assume that any xid
> < the current clog truncation point we are going to be interested in is
> committed. Vacuum already removes any tuple with an aborted xmin. If we
> also set any aborted xmax (and xvac) to InvalidXid, and WAL logged that,

The problem with that is all the extra WAL log volume it creates.  I'm
also concerned about the loss of forensic information --- xmax values
are frequently useful for inferring what's been going on in a database.
(This is another reason for not wanting a very short freeze interval BTW.)

            regards, tom lane

Re: [PATCHES] WAL logging freezing

От
Tom Lane
Дата:
It seems that we're converging on the conclusion that not truncating
clog early is the least bad alternative.  This has the advantage of
making things a lot simpler --- we won't need to track minxid at all.
Allow me to summarize what I think has to happen:

* VACUUM will determine a freeze cutoff XID the same way it does now,
except that instead of using a hard-wired freeze window of 1G
transactions, we'll either reduce the window to (say) 100M transactions
or provide a GUC variable that can be adjusted over some reasonable
range.

* All XIDs present in the table that are older than the cutoff XID will
be replaced by FrozenXid or InvalidXid as required, and such actions
will be WAL-logged.  (I think we need to consider all 3 of xmin, xmax,
and xvac here.)

* On successful completion, the cutoff XID is stored in
pg_class.relvacuumxid, and pg_database.datvacuumxid is updated
if appropriate.  (The minxid columns are now useless, but unless there
is another reason to force initdb before 8.2, I'm inclined to leave them
there and unused.  We can remove 'em in 8.3.)

* pg_clog is truncated according to the oldest pg_database.datvacuumxid.
We should WAL-log this action, because replaying such an entry will
allow a PITR slave to truncate its own clog and thereby avoid wraparound
conflicts.  Note that we no longer need a checkpoint before truncating
--- what we need is XLogFlush, instead.  ("WAL before data")

These changes get us back into the regime where the hint bits truly are
hints, because the underlying pg_clog data is still there, both in a
master database and in a PITR slave.  So we don't need to worry about
WAL-logging hint bits.  We also avoid needing any flushes/fsyncs or
extra checkpoints.  The added WAL volume should be pretty minimal,
because only tuples that have gone untouched for a long time incur extra
work.  The added storage space for pg_clog could be annoying for a small
database, but reducing the freeze window ameliorates that objection.
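[Editorial note: the freeze step summarized above can be sketched in Python as follows. This is an illustrative model, not PostgreSQL source; `FROZEN_XID`, `freeze_tuple`, and the tuple layout are made up, though the signed-difference comparison mirrors the idea behind PostgreSQL's TransactionIdPrecedes.]

```python
# Illustrative sketch of the freeze pass -- not PostgreSQL code.
FROZEN_XID = 2       # xmin replacement for frozen tuples
INVALID_XID = 0      # xmax replacement once the (aborted) deleter no longer matters

def xid_precedes(a, b):
    """Wraparound-aware 'a < b': compare the 32-bit difference as signed."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000    # negative as signed 32-bit => a precedes b

def freeze_tuple(tup, cutoff_xid):
    """Replace XIDs older than the cutoff; the real thing would WAL-log this."""
    xmin, xmax = tup
    if xmin > 2 and xid_precedes(xmin, cutoff_xid):   # XIDs 0..2 are reserved
        xmin = FROZEN_XID
    if xmax > 2 and xid_precedes(xmax, cutoff_xid):
        xmax = INVALID_XID
    return (xmin, xmax)
```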

Comments?  Anyone see any remaining holes?

            regards, tom lane

Re: [PATCHES] WAL logging freezing

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Simon Riggs wrote:
>> Ouch! We did discuss that also. Flushing the buffercache is nasty with
>> very large caches, so this makes autovacuum much less friendly - and
>> could take a seriously long time if you enforce the vacuum delay
>> costings.

> Hmm, isn't the buffer cache aware of a vacuum operation?

Yeah.  What would probably happen is that we'd dump off most of the
dirtied pages to the kernel, which would likely still have a lot of them
in kernel buffers pending write.  But then we'd have to fsync the table
--- so a physical write storm would ensue, which we have no way to
throttle.

I think the don't-truncate-clog approach is a much better answer.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

От
Gregory Stark
Дата:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> The added WAL volume should be pretty minimal, because only tuples that have
> gone untouched for a long time incur extra work.

That seems like a weak point in the logic. It seems like it would make VACUUM
which is already an i/o hog even more so. Perhaps something clever can be done
with vacuum_cost_delay and commit_siblings.

Something like inserting the delay between WAL logging and syncing the log and
writing to the heap. So if another transaction commits in the meantime we can
skip the extra fsync and continue.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [PATCHES] WAL logging freezing

От
Alvaro Herrera
Дата:
Gregory Stark wrote:
>
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> > The added WAL volume should be pretty minimal, because only tuples that have
> > gone untouched for a long time incur extra work.
>
> That seems like a weak point in the logic. It seems like it would make VACUUM
> which is already an i/o hog even more so. Perhaps something clever can be done
> with vacuum_cost_delay and commit_siblings.
>
> Something like inserting the delay between WAL logging and syncing the log and
> writing to the heap. So if another transaction commits in the meantime we can
> skip the extra fsync and continue.

Huh, but the log would not be flushed for each operation that the vacuum
logs.  Only when it's going to commit.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [PATCHES] WAL logging freezing

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Huh, but the log would not be flushed for each operation that the vacuum
> logs.  Only when it's going to commit.

It strikes me that the vacuum cost delay feature omits to consider
generation of WAL records as a cost factor.  It may not be a big problem
though, as long as we can limit the number of records created to one or
two per page --- then you can see it as just a component of the "dirtied
a page" cost.  If we made a separate WAL record for each tuple then it
could be important to account for.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

От
"Simon Riggs"
Дата:
On Tue, 2006-10-31 at 11:04 -0500, Tom Lane wrote:

> It seems that we're converging on the conclusion that not truncating
> clog early is the least bad alternative.  This has the advantage of
> making things a lot simpler --- we won't need to track minxid at all.
> Allow me to summarize what I think has to happen:
>
> * VACUUM will determine a freeze cutoff XID the same way it does now,
> except that instead of using a hard-wired freeze window of 1G
> transactions, we'll either reduce the window to (say) 100M transactions
> or provide a GUC variable that can be adjusted over some reasonable
> range.
>
> * All XIDs present in the table that are older than the cutoff XID will
> be replaced by FrozenXid or InvalidXid as required, and such actions
> will be WAL-logged.  (I think we need to consider all 3 of xmin, xmax,
> and xvac here.)
>
> * On successful completion, the cutoff XID is stored in
> pg_class.relvacuumxid, and pg_database.datvacuumxid is updated
> if appropriate.  (The minxid columns are now useless, but unless there
> is another reason to force initdb before 8.2, I'm inclined to leave them
> there and unused.  We can remove 'em in 8.3.)
>
> * pg_clog is truncated according to the oldest pg_database.datvacuumxid.
> We should WAL-log this action, because replaying such an entry will
> allow a PITR slave to truncate its own clog and thereby avoid wraparound
> conflicts.  Note that we no longer need a checkpoint before truncating
> --- what we need is XLogFlush, instead.  ("WAL before data")
>
> These changes get us back into the regime where the hint bits truly are
> hints, because the underlying pg_clog data is still there, both in a
> master database and in a PITR slave.  So we don't need to worry about
> WAL-logging hint bits.  We also avoid needing any flushes/fsyncs or
> extra checkpoints.  The added WAL volume should be pretty minimal,
> because only tuples that have gone untouched for a long time incur extra
> work.  The added storage space for pg_clog could be annoying for a small
> database, but reducing the freeze window ameliorates that objection.

Complete agreement, nice summary.

Do we need another GUC? I thought your observation about a PITR slave
having that set lower than its master still remains unresolved. Perhaps
we should do that via pg_config_manual.h, so it's much less likely that
the two would be out of step?

Thanks to Heikki, for spotting the original bug before it was reported
in production.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] WAL logging freezing

От
Tom Lane
Дата:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> Do we need another GUC? I thought your observation about a PITR slave
> having that set lower than its master still remains unresolved.

No, AFAICS that's not an issue in this design.  The facts-on-the-ground
are whatever is recorded in pg_class.relvacuumxid, and whether a
particular table has been vacuumed with a shorter or longer freeze
window doesn't affect correctness.  In particular, a slave with
ambitions towards having a shorter window would still be unable to
truncate its clog before having re-vacuumed everything.

So, not only could we have a GUC variable, but it could be USERSET;
there's no breakage risk as long as we constrain the value range to
something sane.

It strikes me that VACUUM FREEZE could be replaced by
    SET vacuum_freeze_limit = 0
    VACUUM ...
which would be a good thing because the FREEZE keyword has to be
partially reserved in this syntax, and that is contrary to spec.

            regards, tom lane

Re: [PATCHES] WAL logging freezing

От
Tom Lane
Дата:
I wrote:
> * pg_clog is truncated according to the oldest pg_database.datvacuumxid.

While testing this patch I realized that there's a bit of an issue here.
It's usually going to be the case that the oldest datvacuumxid is
template0's, meaning that it will never be possible to truncate clog
until autovacuum decides that template0 is at risk of wraparound and
goes and vacuums it.  Shortening the freeze horizon will reduce the size
that pg_clog occupies just *after* that happens, but we're still going
to see pg_clog bloating up to something close to 256MB before autovacuum
kicks in.  And there is nothing a user can do about it, unless he's
willing to override the datallowconn = false safety cover on template0
so he can connect to it and vacuum it manually.

This wasn't a problem in the pre-8.2 logic because we ignored
non-connectable databases while determining the global minimum
datvacuumxid, but it's a real problem now.

Seems like either we go back to ignoring non-connectable databases
(with the risks that entails), or adopt some more-aggressive policy
for launching autovacuums on them, or give up the idea of keeping
pg_clog small.  A more-aggressive policy seems like the best option,
but I'm not entirely sure what it should look like.  Maybe force
autovacuum when age(datvacuumxid) exceeds twice the freeze horizon,
or some such?  Comments?
        regards, tom lane
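[Editorial note: the policy floated above could be sketched as below. Illustrative only; the function name and the factor of two are taken from the suggestion in the message, not from committed code.]

```python
def needs_forced_autovacuum(current_xid, datvacuumxid, freeze_horizon):
    """One possible policy from the discussion: force a vacuum of a
    non-connectable database once age(datvacuumxid) exceeds twice the
    freeze horizon."""
    age = (current_xid - datvacuumxid) & 0xFFFFFFFF   # 32-bit XID age
    return age > 2 * freeze_horizon
```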


Re: [PATCHES] WAL logging freezing

От
"Simon Riggs"
Дата:
On Thu, 2006-11-02 at 16:50 -0500, Tom Lane wrote:
> I wrote:
> > * pg_clog is truncated according to the oldest pg_database.datvacuumxid.

> Shortening the freeze horizon will reduce the size
> that pg_clog occupies just *after* that happens, but we're still going
> to see pg_clog bloating up to something close to 256MB before autovacuum
> kicks in. 

Well, by default a Windows install is about 80MB, plus 7x 16MB WAL gives
nearly 200MB, so we're talking about doubling the basic on-disk
footprint for every user if we let that happen.

> This wasn't a problem in the pre-8.2 logic because we ignored
> non-connectable databases while determining the global minimum
> datvacuumxid, but it's a real problem now.
> 
> Seems like either we go back to ignoring non-connectable databases
> (with the risks that entails), or adopt some more-aggressive policy
> for launching autovacuums on them, or give up the idea of keeping
> pg_clog small.  A more-aggressive policy seems like the best option,
> but I'm not entirely sure what it should look like.  Maybe force
> autovacuum when age(datvacuumxid) exceeds twice the freeze horizon,
> or some such?  Comments?

Given many users are individual PCs, or at least stand-alone servers not
in constant use, then I think more aggressive vacuuming makes sense as a
way to keep clog smaller. In many situations the time lost through
continually virus scanning databases will be more than if we do a more
regular autovacuum, so we shouldn't really worry about that.

Sounds like we need a GUC for those who don't care about 256MB though,
but may care about autovacuum switching in at bad moments.

Also, that solution doesn't square with the current autovacuum defaults:
We should set autovacuum on by default, with
autovacuum_vacuum_cost_delay = 10 by default.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com




Re: [PATCHES] WAL logging freezing

От
"Zeugswetter Andreas ADI SD"
Дата:
> Seems like either we go back to ignoring non-connectable
> databases (with the risks that entails), or adopt some
> more-aggressive policy for launching autovacuums on them, or

Um, I think we would still want the ability to avoid frequently
vacuuming known-frozen databases.

Could we set datvacuumxid to frozenxid after freezing template0
and making it non-connectable ? There was discussion about making
a non-connectable db connectable, one step would be to set datvacuumxid
to currentxid.

Then the rule would be: ignore db's that have datvacuumxid == frozenxid.
(only readonly db's would be allowed to have datvacuumxid == frozenxid)

Andreas

PS: I agree that the rename from datfrozenxid to datvacuumxid should be
undone


Re: [PATCHES] WAL logging freezing

От
Alvaro Herrera
Дата:
Zeugswetter Andreas ADI SD wrote:
> 
> > Seems like either we go back to ignoring non-connectable 
> > databases (with the risks that entails), or adopt some 
> > more-aggressive policy for launching autovacuums on them, or
> 
> Um, I think we would want to still have the ability to not need to
> frequently vacuum known frozen databases.
> 
> Could we set datvacuumxid to frozenxid after freezing template0
> and making it non-connectable ? There was discussion about making 
> a non-connectable db connectable, one step would be to set datvacuumxid
> to currentxid.

I tried to do this at some point, but it doesn't work.  I think the
problem appears when you create a new database after the template
database has been frozen for a while.  IIRC one of my proposed solutions
was to force a vacuum of the new database right away, but that was shot
down (for good reasons, I think).  Another one was to mark the tables as
frozen, which led to the suggestion of pg_class_nt, which eventually we
shot down for other reasons.  The last suggestion was to force the
vacuuming of frozen databases, which is where we are now.

I think the best solution for now is to force the vacuum of template
databases.  They are generally small, so this _shouldn't_ be an issue.
(There _would_ be an issue if someone made big databases non-connectable
for the purposes of avoiding vacuum; this was mentioned back then, but
no one said that this would be an issue.)  We can change that in 8.3 by
reintroducing the concept of pg_class_nt and possibly pg_database_nt,
which would allow us to really mark tables as completely frozen, and
databases as not needing vacuum at all.


> PS: I agree that the rename from datfrozenxid to datvacuumxid should be
> undone

Actually, the one that was renamed from datfrozenxid was datminxid,
which is now being proposed to be removed.  We are now changing the
semantics of datvacuumxid to the old datfrozenxid.  I also agree with
the change (but keep in mind that I'm not affected as a beta tester, so
my opinion may not count).

PS -- yeah, I neglected the maintenance.sgml docs at the time :-(  I
noticed some days ago and I was going to submit appropriate changes.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Proposal: vacuum and autovacuum parameters to control freezing

От
Tom Lane
Дата:
I wrote:
> It's usually going to be the case that the oldest datvacuumxid is
> template0's, meaning that it will never be possible to truncate clog
> until autovacuum decides that template0 is at risk of wraparound and
> goes and vacuums it.  Shortening the freeze horizon will reduce the size
> that pg_clog occupies just *after* that happens, but we're still going
> to see pg_clog bloating up to something close to 256MB before autovacuum
> kicks in.

After further thought I see that there are actually two parameters
involved in this process:

1. the critical age (currentXID - relfrozenxid) beyond which autovacuum
will force a vacuum of a particular table to forestall XID wraparound.
(Note: as the 8.2 code stands, the system will launch autovacuums even
when autovac is nominally disabled in order to fix tables that have
exceeded a hard-wired critical age.)

2. the freeze distance vacuum (whether auto or normal) uses to determine
the new cutoff point, ie, the new relfrozenxid for the table.

We can make a few observations:

* For a table that otherwise goes unvacuumed, the interval between
forced anti-wraparound vacuums will be critical_age - freeze_distance.
Therefore, for large static tables there is value in being able to
adjust this difference to be as large as possible.

* The size of pg_clog is determined by the system-wide maximum of
critical_age + number-of-transactions-needed-to-finish-vacuuming.
Therefore, critical_age is the knob we must expose if we want to
provide user control of pg_clog growth.

* It might seem that there's no point in per-table adjustment of
critical_age, since only the system-wide maximum means anything for
resource consumption.  I'm not so sure though --- for a really large
table, the time needed to finish vacuuming it could be significant,
meaning it would need a lower critical age than other tables.  With the
current one-process-at-a-time autovac infrastructure, this probably
isn't very important, but we've been talking about allowing multiple
parallel autovacuums specifically to deal with the problem of some
tables being much larger than others.

So it seems to me that an argument can be made for creating two new
GUC variables and adding two columns to pg_autovacuum:

vacuum_freeze_distance: number of transactions back from current
that a VACUUM will use as the freeze cutoff point, ie, XIDs older
than that will be replaced by FrozenXID, and the cutoff point will
become the table's new relfrozenxid value.  Valid range zero to
perhaps 1 billion.  VACUUM FREEZE is a shorthand for doing a
vacuum with vacuum_freeze_distance = 0.

autovacuum_freeze_limit: maximum age of relfrozenxid before autovacuum
will force a vacuum for anti-wraparound purposes.  Valid range perhaps
100 million to (2 billion - 100 million).

pg_autovacuum.freeze_distance: per-table vacuum_freeze_distance setting
for autovacuum to use.

pg_autovacuum.freeze_limit: per-table autovacuum_freeze_limit for
autovacuum to use.

(I'm not wedded to these names, anyone have better ideas?)

I'd propose default values of 200 million for autovacuum_freeze_limit
and half that for vacuum_freeze_distance, resulting in a maximum pg_clog
size of 50MB and forced autovacs about every 100 million transactions.
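[Editorial note: the 50MB figure follows from clog storing two status bits per transaction. A back-of-the-envelope check; the exact on-disk size also rounds up to whole SLRU segment files.]

```python
# clog stores 2 bits of commit status per transaction id.
def clog_bytes(xids):
    return xids * 2 // 8

assert clog_bytes(200_000_000) == 50_000_000   # ~50MB at the proposed limit
assert clog_bytes(2**30) == 2**28              # ~256MB with a 1G-transaction window
```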

One minor point is that while the values of these variables have to have
sane relationships to each other, the GUC infrastructure doesn't really
allow us to enforce such a constraint directly (the behavior would be
too dependent on which variable got set first).  I'd suggest making
vacuum just silently limit the effective freeze_distance to not more
than half of the system's autovacuum_freeze_limit, rather than trying
to enforce any relationship within GUC.
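[Editorial note: the suggested silent clamp could look like this. Illustrative; names follow the proposal in this message, not committed code.]

```python
def effective_freeze_distance(freeze_distance, autovacuum_freeze_limit):
    """Silently limit freeze_distance to half of autovacuum_freeze_limit,
    rather than enforcing the relationship inside GUC assignment."""
    return min(freeze_distance, autovacuum_freeze_limit // 2)
```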

This is kind of a lot to be inventing in late beta, but if we want to
have a really credible solution to the WAL-versus-freezing problem
I think we need to do all of this.

Comments?
        regards, tom lane


Re: Proposal: vacuum and autovacuum parameters to control freezing

От
"Simon Riggs"
Дата:
On Sat, 2006-11-04 at 12:35 -0500, Tom Lane wrote:
> I wrote:
> > It's usually going to be the case that the oldest datvacuumxid is
> > template0's, meaning that it will never be possible to truncate clog
> > until autovacuum decides that template0 is at risk of wraparound and
> > goes and vacuums it.  Shortening the freeze horizon will reduce the size
> > that pg_clog occupies just *after* that happens, but we're still going
> > to see pg_clog bloating up to something close to 256MB before autovacuum
> > kicks in.
> 
> After further thought I see that there are actually two parameters
> involved in this process:
> 
> 1. the critical age (currentXID - relfrozenxid) beyond which autovacuum
> will force a vacuum of a particular table to forestall XID wraparound.
> (Note: as the 8.2 code stands, the system will launch autovacuums even
> when autovac is nominally disabled in order to fix tables that have
> exceeded a hard-wired critical age.)
> 
> 2. the freeze distance vacuum (whether auto or normal) uses to determine
> the new cutoff point, ie, the new relfrozenxid for the table.

I guess I don't fully understand the way you've described this, so I'd
like to set out my own understanding to see whether the two agree. Getting
that right is the key (for me) to understanding the proposal more fully - I
do get the bit about 2 parameters...

If we take the current Xid as zero, we can go back in time to various
Xids using a timeline. That timeline can be divided into various Eras,
as with Geologic time (Jurassic, Triassic, Ice Ages etc). With
PostgreSQL, we used to have 5 eras:

1. In Progress Transactions            0 - OldestXmin
   <---- limit is OldestXmin
2. Completed, normal Xids, status in clog
   <---- limit is oldest Xid in clog (no specific name in code)
3. Completed, normal Xids, status marked on tuple
   <---- limit is Freeze distance?
4. Frozen Xids, tuples frozen early to ensure nothing passes Wrap
   <---- critical age? - practical last point to forestall Wrap
   <---- limit is Wraparound
5. Frozen Xids (PreHistory)

now we have agreed to have only 4 eras, IIRC:

1. In Progress Transactions            0 - OldestXmin
   <---- limit is OldestXmin
2. Completed, normal Xids, status in clog
   <---- limit is Freeze distance?
3. Frozen Xids, tuples frozen early to ensure nothing passes Wrap
   <---- critical age? - practical last point to forestall Wrap
   <---- limit is Wraparound
4. Frozen Xids (PreHistory)

Perhaps you could edit the above if needed? I'm not making a separate
proposal, just trying to get it very clear.

If we can explain this simply now, then we stand a chance of other
people understanding it as well and setting these parameters correctly.
Otherwise we'll be doing the explanation hundreds of times on list/IRC.

AFAICS freeze_limit and freeze_distance are both expressed in number of
Xids before current, so the "units" are the same for both. In that case,
maybe slightly more differentiated names would be appropriate.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com




Re: Proposal: vacuum and autovacuum parameters to control freezing

От
"Heikki Linnakangas"
Дата:
Tom Lane wrote:

> * It might seem that there's no point in per-table adjustment of
> critical_age, since only the system-wide maximum means anything for
> resource consumption.  I'm not so sure though --- for a really large
> table, the time needed to finish vacuuming it could be significant,
> meaning it would need a lower critical age than other tables.  With the
> current one-process-at-a-time autovac infrastructure, this probably
> isn't very important, but we've been talking about allowing multiple
> parallel autovacuums specifically to deal with the problem of some
> tables being much larger than others.

I think a global critical_age parameter is just fine. If you have one 
huge table that takes a long time to vacuum, just adjust critical_age so 
that there's enough time for the huge table vacuum to finish before 
wrap-around. That means that other smaller tables are vacuumed more 
frequently than would otherwise be necessary, but that's not a big deal 
if the other tables really are much smaller.

> pg_autovacuum.freeze_distance: per-table vacuum_freeze_distance setting
> for autovacuum to use.

Shouldn't this be used for manual vacuums as well?

> I'd propose default values of 200 million for autovacuum_freeze_limit
> and half that for vacuum_freeze_distance, resulting in a maximum pg_clog
> size of 50MB and forced autovacs about every 100 million transactions.

Sounds fine to me.

> One minor point is that while the values of these variables have to have
> sane relationships to each other, the GUC infrastructure doesn't really
> allow us to enforce such a constraint directly (the behavior would be
> too dependent on which variable got set first).  I'd suggest making
> vacuum just silently limit the effective freeze_distance to not more
> than half of the system's autovacuum_freeze_limit, rather than trying
> to enforce any relationship within GUC.

Makes sense.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Proposal: vacuum and autovacuum parameters to control freezing

От
Tom Lane
Дата:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> now we have agreed to have only 4 eras, IIRC:

> 1. In Progress Transactions            0 - OldestXmin
>     <---- limit is OldestXmin
> 2. Completed, normal Xids, status in clog
>     <---- limit is Freeze distance?
> 3. Frozen Xids, tuples frozen early to ensure nothing passes Wrap
>     <---- critical age? - practical last point to forestall Wrap
>     <---- limit is Wraparound
> 4. Frozen Xids (PreHistory)

You've got the eras right but not the boundaries.  The limit on era 2 is
the current system-wide minimum relfrozenxid (or equivalently, the
current system-wide minimum datfrozenxid), which is where we have
truncated clog.  That will normally be a little more than
autovacuum_freeze_limit --- the difference corresponding to the
"reaction time" needed to fire up autovac and get through all the tables
that have exceeded autovacuum_freeze_limit.  The freeze_distance has to
be considerably *less* than this, else we'll constantly be firing new
autovac cycles each of which will freeze just a few more tuples.

> Perhaps you could edit the above if needed?

1. In Progress Transactions            0 - OldestXmin
   <---- limit is OldestXmin
2. Completed, normal Xids, status in clog
   <---- limit is length of clog (a bit more than freeze_limit)
   <---- Xids older than freeze_distance will be frozen whenever
         next visited by VACUUM, but there is no forcing function
         until they exceed freeze_limit
3. Frozen Xids, tuples frozen early to ensure nothing passes Wrap
   <---- limit is Wraparound
4. Frozen Xids (PreHistory)

We also have that the time between forced autovacuums of unchanging
tuples is approximately freeze_limit - freeze_distance.

> AFAICS freeze_limit and freeze_distance are both expressed in number of
> Xids before current, so the "units" are the same for both.

Check.  Actually, as coded there's a bit of difference: freeze_distance
is subtracted from OldestXmin whereas freeze_limit is subtracted from
ReadNextTransactionId().  Normally OldestXmin should be so much smaller
than these parameters that it won't matter, but vacuum.c does contain
logic to do something reasonable if not.
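[Editorial note: illustratively, the two subtractions differ only in their starting point, and normal-XID arithmetic must skip the reserved XIDs on wraparound. This toy model is not the actual vacuum.c logic; `xid_sub` is a made-up helper.]

```python
FIRST_NORMAL_XID = 3   # XIDs 0..2 are reserved (Invalid, Bootstrap, Frozen)

def xid_sub(xid, delta):
    """Subtract delta from a 32-bit xid; if the result lands in the reserved
    range, step past it (one toy way to keep the result a normal XID)."""
    result = (xid - delta) % (2**32)
    if result < FIRST_NORMAL_XID:
        result = (result - FIRST_NORMAL_XID) % (2**32)
    return result

# freeze cutoff:        xid_sub(oldest_xmin, freeze_distance)
# forced-vacuum limit:  xid_sub(read_next_transaction_id, freeze_limit)
```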

> In that case,
> maybe slightly more differentiated names would be appropriate.

Got a suggestion?  I think the names *should* be clearly related, but
as I said, I'm by no means wedded to these particular ones.
        regards, tom lane


Re: Proposal: vacuum and autovacuum parameters to control freezing

От
Tom Lane
Дата:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> I think a global critical_age parameter is just fine. If you have one 
> huge table that takes a long time to vacuum, just adjust critical_age so 
> that there's enough time for the huge table vacuum to finish before 
> wrap-around. That means that other smaller tables are vacuumed more 
> frequently than would otherwise be necessary, but that's not a big deal 
> if the other tables really are much smaller.

Well, that's what you have to do right now, but as soon as we support
multiple autovac processes it'll be useful to do the other; so I figured
we might as well add the pg_autovacuum column while we're at it.

>> pg_autovacuum.freeze_distance: per-table vacuum_freeze_distance setting
>> for autovacuum to use.

> Shouldn't this be used for manual vacuums as well?

I thought about that but it seemed pretty dubious --- manual vacuums
don't look at pg_autovacuum for anything else, eg not for the cost
variables.
        regards, tom lane


Re: Proposal: vacuum and autovacuum parameters to control freezing

От
Tom Lane
Дата:
I wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> Perhaps you could edit the above if needed?

>         <---- Xids older than freeze_distance will be frozen whenever
>               next visited by VACUUM, but there is no forcing function
>               until they exceed freeze_limit

>> In that case,
>> maybe slightly more differentiated names would be appropriate.

> Got a suggestion?  I think the names *should* be clearly related, but
> as I said, I'm by no means wedded to these particular ones.

After re-reading the above, it strikes me that maybe names based around
"freeze_min" and "freeze_max" would be useful?
        regards, tom lane


Re: Proposal: vacuum and autovacuum parameters to control freezing

От
"Simon Riggs"
Дата:
On Sun, 2006-11-05 at 12:01 -0500, Tom Lane wrote:

> After re-reading the above, it strikes me that maybe names based around
> "freeze_min" and "freeze_max" would be useful?

Works for me. They are clearly related, yet different and allow a
straightforward explanation of their need and use.

e.g.

vacuum_freeze_min
The latest TransactionId that will be "frozen" during a VACUUM is
calculated by CurrentTransactionId - vacuum_freeze_min.

vacuum_freeze_max
The maximum age, calculated as distance from CurrentTransactionId, that
will be allowed before an autovacuum will be forced for that database
object.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com




Re: Proposal: vacuum and autovacuum parameters to control freezing

От
Tom Lane
Дата:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Sun, 2006-11-05 at 12:01 -0500, Tom Lane wrote:
>> After re-reading the above, it strikes me that maybe names based around
>> "freeze_min" and "freeze_max" would be useful?

> Works for me. They are clearly related, yet different and allow a
> straightforward explanation of their need and use.

Sold, I'll rename them before committing the patch.
        regards, tom lane


Re: Proposal: vacuum and autovacuum parameters to control freezing

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> vacuum_freeze_min    The latest TransactionId that will be "frozen" during
> a VACUUM is calculated by CurrentTransactionId - vacuum_freeze_min.
>
> vacuum_freeze_max
> The maximum age, calculated as distance from CurrentTransactionId, that
> will be allowed before an autovacuum is forced for that database
> object.

I think it's clearer if "min" and "max" are treated as adjectives that
always have a noun to modify. Otherwise it's unclear what they refer to.

So "vacuum_freeze_min_age" and "vacuum_freeze_max_age" instead.

That way it's unambiguous which is which, i.e. that it's the minimum and
maximum age, and not the minimum and maximum transaction ID, which would
be the other way around.
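The inversion Greg describes can be shown with a tiny sketch (hypothetical numbers, wraparound ignored): because age is measured backwards from the current XID, the *minimum* age that triggers freezing corresponds to the *maximum* XID that still gets frozen.

```python
# Age runs opposite to XID: a smaller (older) XID has a greater age.
# Hypothetical numbers; 32-bit XID wraparound ignored.

current_xid = 1_000_000
min_freeze_age = 100_000          # hypothetical vacuum_freeze_min_age

def xid_age(xid):
    # How many transactions ago this XID was assigned.
    return current_xid - xid

# The newest (largest) XID that is still old enough to be frozen:
newest_frozen_xid = current_xid - min_freeze_age

print(xid_age(newest_frozen_xid))           # 100000
print(xid_age(400_000) > xid_age(900_000))  # True: smaller XID, greater age
```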

--
   Gregory Stark
   EnterpriseDB   http://www.enterprisedb.com


Re: Proposal: vacuum and autovacuum parameters to control freezing

From
"Simon Riggs"
Date:
On Sun, 2006-11-05 at 14:47 -0500, Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
> > vacuum_freeze_min    The latest TransactionId that will be "frozen" during
> > a VACUUM is calculated by CurrentTransactionId - vacuum_freeze_min.
> >
> > vacuum_freeze_max
> > The maximum age, calculated as distance from CurrentTransactionId, that
> > will be allowed before an autovacuum is forced for that database
> > object.
> 
> I think it's clearer if "min" and "max" are treated as adjectives that
> always have a noun to modify. Otherwise it's unclear what they refer to.
> 
> So "vacuum_freeze_min_age" and "vacuum_freeze_max_age" instead.
> 
> That way it's unambiguous which is which, i.e. that it's the minimum and
> maximum age, and not the minimum and maximum transaction ID, which would
> be the other way around.

Sounds logical. Hadn't seen that you could take it both ways.

--
   Simon Riggs
   EnterpriseDB   http://www.enterprisedb.com




Re: Proposal: vacuum and autovacuum parameters to control freezing

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Sun, 2006-11-05 at 14:47 -0500, Gregory Stark wrote:
>> That way it's unambiguous which is which, i.e. that it's the minimum and
>> maximum age, and not the minimum and maximum transaction ID, which would
>> be the other way around.

> Sounds logical. Hadn't seen that you could take it both ways.

Agreed --- I've adopted Greg's suggestion.
        regards, tom lane