Discussion: Are ctid chaining loops safe without relation size checks?
Hi,
While looking at [1] I was rephrasing this comment + check in
heap_get_latest_tid():
- * Since this can be called with user-supplied TID, don't trust the input
- * too much. (RelationGetNumberOfBlocks is an expensive check, so we
- * don't check t_ctid links again this way. Note that it would not do to
- * call it just once and save the result, either.)
*/
- blk = ItemPointerGetBlockNumber(tid);
- if (blk >= RelationGetNumberOfBlocks(relation))
- elog(ERROR, "block number %u is out of range for relation \"%s\"",
- blk, RelationGetRelationName(relation));
Which I dutifully rewrote. But I'm actually not sure it's safe at all
for heap to rely on t_ctid links to be valid. What prevents a ctid link
to point to a page that's since been truncated away?
And it's not just heap_get_latest_tid() afaict. As far as I can tell
just about every ctid chaining code ought to test the t_ctid link
against the relation size - otherwise it seems entirely possible to get
"could not read block %u in file \"%s\": %m" or
"could not read block %u in file \"%s\": read only 0 of %d bytes"
style errors, no?
These loops are of such long-standing vintage that I feel like I must
be missing something.
Greetings,
Andres Freund
[1] https://www.postgresql.org/message-id/20190515185447.gno2jtqxyktylyvs%40alap3.anarazel.de
On 2019-May-15, Andres Freund wrote:

> -	blk = ItemPointerGetBlockNumber(tid);
> -	if (blk >= RelationGetNumberOfBlocks(relation))
> -		elog(ERROR, "block number %u is out of range for relation \"%s\"",
> -			 blk, RelationGetRelationName(relation));
>
> Which I dutifully rewrote. But I'm actually not sure it's safe at all
> for heap to rely on t_ctid links to be valid. What prevents a ctid link
> to point to a page that's since been truncated away?

Umm .. IIUC all index entries for truncated pages should have been
removed prior to the truncation. Otherwise, how would those index
entries not become immediately data corruption the instant the heap is
re-grown to cover those truncated pages? So I think if the TID comes
directly from user then this is a check worth doing, but if the TID
comes from an index, then it isn't.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@anarazel.de> writes:
> Which I dutifully rewrote. But I'm actually not sure it's safe at all
> for heap to rely on t_ctid links to be valid. What prevents a ctid link
> to point to a page that's since been truncated away?
Nothing, but when would the issue come up? The updated tuple must be
newer than the one pointing at it, so if it's dead then the one pointing
at it must be too, no?
(If we're not checking liveness of x_max before following the link,
we'd have trouble ...)
regards, tom lane
Hi,
On 2019-05-15 15:09:34 -0400, Tom Lane wrote:
> (If we're not checking liveness of x_max before following the link,
> we'd have trouble ...)
I don't think we do everywhere - e.g. in the heap_get_latest_tid() case
that made me think about this, there's only this as an xmax-based loop
termination:
/*
* After following a t_ctid link, we might arrive at an unrelated
* tuple. Check for XMIN match.
*/
if (TransactionIdIsValid(priorXmax) &&
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(tp.t_data)))
{
UnlockReleaseBuffer(buffer);
break;
}
but that's after we have already followed the link and read the page
(and it obviously isn't a liveness check).
Additionally, note that heap_get_latest_tid() does *not* terminate when
it finds a visible tuple:
/*
* Check tuple visibility; if visible, set it as the new result
* candidate.
*/
valid = HeapTupleSatisfiesVisibility(&tp, snapshot, buffer);
CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot);
if (valid)
*tid = ctid;
it just continues to the next tuple version, if any.
EvalPlanQualFetch() in <= 11, and heapam_tuple_lock() and
heap_lock_updated_tuple_rec() in master, don't have the problem that
xmax might suddenly abort while following the chain, because they have
code like:
/*
* If tuple is being updated by other transaction then we
* have to wait for its commit/abort, or die trying.
*/
if (TransactionIdIsValid(SnapshotDirty.xmax))
{
ReleaseBuffer(buffer);
switch (wait_policy)
{
case LockWaitBlock:
XactLockTableWait(SnapshotDirty.xmax,
relation, &tuple->t_self,
XLTW_FetchUpdated);
break;
case LockWaitSkip:
if (!ConditionalXactLockTableWait(SnapshotDirty.xmax))
/* skip instead of waiting */
return TM_WouldBlock;
break;
case LockWaitError:
if (!ConditionalXactLockTableWait(SnapshotDirty.xmax))
ereport(ERROR,
(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
errmsg("could not obtain lock on row in relation \"%s\"",
RelationGetRelationName(relation))));
break;
}
continue; /* loop back to repeat heap_fetch */
}
but heap_get_latest_tid() doesn't have that logic.
So I think the problem is just that heap_get_latest_tid() is missing
this type of check. The reason for which presumably is this piece of
intended functionality:
* Actually, this gets the latest version that is visible according to
* the passed snapshot. You can pass SnapshotDirty to get the very latest,
* possibly uncommitted version.
which means that neither can it block when xmax is still running, nor
can it terminate when HeapTupleSatisfiesVisibility() returns true.
There's no core code using a !mvcc snapshot however.
Because it's relevant for other work we've talked about for v13, and for
a potential fix:
> Andres Freund <andres@anarazel.de> writes:
> > Which I dutifully rewrote. But I'm actually not sure it's safe at all
> > for heap to rely on t_ctid links to be valid. What prevents a ctid link
> > to point to a page that's since been truncated away?
>
> Nothing, but when would the issue come up? The updated tuple must be
> newer than the one pointing at it, so if it's dead then the one pointing
> at it must be too, no?
Well, the current tuple might not be dead, it might be
UPDATE_IN_PROGRESS when we start following the ctid chain. By the time
we get to the next tuple, that UPDATE might have rolled back, vacuum
might have come along and removed the new version of the tuple (which
then becomes DEAD, not RECENTLY_DEAD) and then truncated the relation.
Currently that's not possible in the nodeTidscan.c case, because we'll
have a lock preventing truncations. But if we were to allow truncations
without an AEL, that'd be different.
I went through a few possible ways to fix this:
1) Break out of the heap_get_latest_tid() loop, if the ctid to be
   chained to points past the current end of the relation. It can't be
   visible by any definition except SnapshotAny, and we could just
   disallow that. As there's a lock on the relation
   heap_get_latest_tid() is operating on, we can rely on that value not
   getting too outdated.
But I don't think that's correct, because the newest version of the
tuple *actually* might be beyond the end of the table at the
beginning of the scan - we *do* allow extension of the table while
somebody holds a lock on the table after all.
I also don't like adding more assumptions that depend on preventing
truncations while any other lock is held.
2) Just disallow SnapshotDirty/SnapshotAny for heap_get_latest_tid(),
and break out of the loop if the current tuple is visible. There
can't be a newer visible version anyway, and as long as the input tid
for heap_get_latest_tid() points to something the calling transaction
could see (even if in an earlier snapshot), there has to be a visible
version somewhere.
That'd not fix the tid.c callers, but they're essentially unused and
weird anyway.
I'm not sure if WHERE CURRENT OF with a WITH HOLD cursor is possible
/ would be a problem. In that case we might need to add a
XactLockTableWait too.
3) Declare these problems as esoteric, and don't care.
Greetings,
Andres Freund
Hi,

On 2019-05-15 15:07:13 -0400, Alvaro Herrera wrote:
> On 2019-May-15, Andres Freund wrote:
>
> > -	blk = ItemPointerGetBlockNumber(tid);
> > -	if (blk >= RelationGetNumberOfBlocks(relation))
> > -		elog(ERROR, "block number %u is out of range for relation \"%s\"",
> > -			 blk, RelationGetRelationName(relation));
> >
> > Which I dutifully rewrote. But I'm actually not sure it's safe at all
> > for heap to rely on t_ctid links to be valid. What prevents a ctid link
> > to point to a page that's since been truncated away?
>
> Umm .. IIUC all index entries for truncated pages should have been
> removed prior to the truncation. Otherwise, how would those index
> entries not become immediately data corruption the instant the heap is
> re-grown to cover those truncated pages? So I think if the TID comes
> directly from user then this is a check worth doing, but if the TID
> comes from an index, then it isn't.

I'm not sure how indexes come into play here? For one, I don't think
heap_get_latest_tid() is called straight on a tuple returned from an
index scan. But also, I don't think that'd change much - it's not the
tid that's passed to heap_get_latest_tid() that's the problem, it's the
tuples it chains to via t_ctid.

Greetings,

Andres Freund