Thread: AW: AW: WAL-based allocation of XIDs is insecure

AW: AW: WAL-based allocation of XIDs is insecure

From
Zeugswetter Andreas SB
Date:
> > > 5. We will now run a new transaction with the same XID that was in use
> > > before the crash.  If that transaction commits, then we have a tuple on
> > > disk that will be considered valid --- and should not be.
> > 
> > I do not think this is true. Before any modification to a page the original page will be
> > written to the log (aka physical log).
> 
> Yes there must be XLogFlush() before writing buffers.
> BTW how do we get the next XID if WAL files are corrupted ?

Normally:
1. pg_control checkpoint info
2. checkpoint record in WAL ?
3. then rollforward of WAL
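The three steps above can be sketched roughly as follows. This is only an illustration, not the backend's code: the `Checkpoint` and `WalRecord` types, their fields, and the rule "prefer the checkpoint record in WAL, else fall back to the copy in pg_control" are assumptions made for the example, and XID wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Checkpoint:
    next_xid: int  # hypothetical field: next XID to hand out, as saved at checkpoint time


@dataclass
class WalRecord:
    xid: int  # hypothetical field: XID of the transaction that wrote this record


def recover_next_xid(control_checkpoint: Checkpoint,
                     wal_checkpoint: Optional[Checkpoint],
                     wal_records: Iterable[WalRecord]) -> int:
    """Return the first XID that is safe to assign after recovery."""
    # Steps 1 and 2: start from the checkpoint info (assumption: the WAL
    # checkpoint record wins when it is readable, else the pg_control copy).
    start = wal_checkpoint if wal_checkpoint is not None else control_checkpoint
    next_xid = start.next_xid
    # Step 3: roll forward, advancing past every XID seen in later records.
    for rec in wal_records:
        if rec.xid >= next_xid:
            next_xid = rec.xid + 1
    return next_xid
```

If the rollforward cannot read the records written after the checkpoint, this calculation stops short of XIDs that were actually used, which is the case the question above is about.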

If WAL is corrupt, the only way to get a consistent state is to bring the
db back to the state it was in at the last good checkpoint. But this is only possible
if you can at least read all "physical log" records from WAL.

Failing that, the only way would probably be to scan all heap files for XIDs that are
greater than the XID from the checkpoint.
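A minimal sketch of such a scan, assuming each heap file can be read as a sequence of (xmin, xmax) pairs with 0 meaning "no XID". The function name and the data layout are hypothetical, and wraparound is ignored.

```python
from typing import Iterable, Tuple


def next_xid_from_heap_scan(checkpoint_next_xid: int,
                            heap_files: Iterable[Iterable[Tuple[int, int]]]) -> int:
    """Find a next-XID value beyond every XID stored in any heap tuple."""
    next_xid = checkpoint_next_xid
    for tuples in heap_files:
        for xmin, xmax in tuples:
            # Both the inserting and the deleting XID count as "used".
            for xid in (xmin, xmax):
                if xid >= next_xid:
                    next_xid = xid + 1
    return next_xid
```

Such a scan only sees XIDs that reached the heap files on disk, so it gives a lower bound that is as good as the surviving data allows.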

I think the utility Tom has in mind, which resets WAL, will allow you to dump the db
so you can initdb and reload. I don't think it is intended that you can immediately
resume operation (except, of course, in the mentioned case of an upgrade with
a good checkpoint as the last WAL record, i.e. a proper shutdown).

Andreas


Re: AW: AW: WAL-based allocation of XIDs is insecure

From
Tom Lane
Date:
Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
>> Hmm.  Actually, what is written to the log is the *modified* page not
>> its original contents.

> I thus really doubt above statement.

Read the code.

> Each page about to be modified should be written to the txlog once,
> and only once before the first modification after each checkpoint.

Yes, there's only one page dump per page per checkpoint.  But the
sequence is (1) make the modification in shmem buffers then (2) make
the XLOG entry. 
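A toy model of that sequence, for illustration only. The `Page` and `Xlog` classes, the use of a record count as the LSN, and the test "page LSN not newer than the last checkpoint" are assumptions of this sketch, not a description of the actual source.

```python
class Page:
    def __init__(self, size=8):
        self.data = bytearray(size)
        self.lsn = 0  # LSN of the last XLOG record that touched this page


class Xlog:
    def __init__(self):
        self.records = []   # each entry: (lsn, page_image_or_None, change)
        self.redo_ptr = 0   # LSN at the time of the latest checkpoint

    def checkpoint(self):
        self.redo_ptr = len(self.records)

    def insert(self, page, change):
        lsn = len(self.records) + 1
        # Dump the whole page only on its first change since the checkpoint.
        needs_image = page.lsn <= self.redo_ptr
        image = bytes(page.data) if needs_image else None
        self.records.append((lsn, image, change))
        page.lsn = lsn
        return lsn


def modify(page, offset, value, xlog):
    page.data[offset] = value                  # (1) change the shared buffer
    return xlog.insert(page, (offset, value))  # (2) then make the XLOG entry
```

Because step (1) runs before step (2), the image stored in the record is the modified page, not its original contents.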

I believe this is OK since the XLOG entry is flushed before any of
the pages it affects are written out from shmem.  Since we have not
changed the storage management policy, it's OK if heap pages contain
changes from uncommitted transactions --- all we must avoid is
inconsistencies (eg not all three pages of a btree split written out),
and redo of the XLOG entry will ensure that for us.
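The flush-before-write rule and the btree-split case can be sketched like this. It is a simplified model under stated assumptions: `Log`, `write_page_out`, and `redo` are hypothetical names, a record is just a dict of page contents, and the LSN is a record count.

```python
class Log:
    def __init__(self):
        self.buffered = []  # records still in memory
        self.durable = []   # records that have reached stable storage

    def append(self, record):
        self.buffered.append(record)
        return len(self.durable) + len(self.buffered)  # LSN of this record

    def flush(self, upto_lsn):
        # Move records to stable storage until upto_lsn is covered.
        while len(self.durable) < upto_lsn and self.buffered:
            self.durable.append(self.buffered.pop(0))

    def crash(self):
        self.buffered.clear()  # unflushed records are lost


def write_page_out(page_id, page_lsn, contents, log, disk):
    log.flush(page_lsn)       # the XLOG entry must be durable first
    disk[page_id] = contents  # only then may the page reach disk


def redo(log, disk):
    for record in log.durable:
        disk.update(record)   # reapply every page the record covers
```

For example, if one record covers the three pages of a split and only one of them was written out before a crash, redo of that record restores the other two, because the record was flushed before the first page write.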

>> However, I'd just as soon have the NEXTXID log records too to be doubly
>> sure.  I do now agree that we needn't fsync the NEXTXID records,
>> however.

> I do not really see an additional benefit. If the WAL is busted those
> records are likely busted too.

The point is to make the allocation of XIDs and OIDs work the same way.
In particular, if we are forced to reset the XLOG using what's stored in
pg_control, it would be good if what's stored in pg_control is a value
beyond the last-used XID/OID, not a value less than the last-used ones.
        regards, tom lane
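
One way to picture the last point is an allocator that logs a limit ahead of the XIDs it hands out, so the most recently recorded value is always beyond the last-used XID. This is a sketch under assumptions: the `XidAllocator` class, the batch size, and the in-memory `log` list are hypothetical, and the durability of the record itself is not modelled.

```python
class XidAllocator:
    def __init__(self, start_xid, batch=1024):
        self.next_xid = start_xid
        self.logged_limit = start_xid  # XIDs below this are covered by a record
        self.batch = batch             # hypothetical batch size
        self.log = []

    def allocate(self):
        if self.next_xid >= self.logged_limit:
            # Record a value beyond anything handed out before the next record.
            self.logged_limit = self.next_xid + self.batch
            self.log.append(("NEXTXID", self.logged_limit))
        xid = self.next_xid
        self.next_xid += 1
        return xid

    def restart_value(self):
        # Value to resume from if only the recorded limit survives.
        return self.log[-1][1] if self.log else self.next_xid
```

Restarting from `restart_value()` skips some unused XIDs but never reuses one that was handed out.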