heapam_index_build_range_scan's anyvisible

Поиск
Список
Период
Сортировка
От Robert Haas
Тема heapam_index_build_range_scan's anyvisible
Дата
Msg-id CA+TgmoYgwjmmjK24Qxb_vWAu8_Hh7gfVFcr3+R7ocdLvYOWJXg@mail.gmail.com
обсуждение исходный текст
Ответы Re: heapam_index_build_range_scan's anyvisible  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: heapam_index_build_range_scan's anyvisible  (Ashwin Agrawal <aagrawal@pivotal.io>)
Список pgsql-hackers
I spent some time today studying heapam_index_build_range_scan and
quickly reached the conclusion that it's kind of a mess.  At heart
it's pretty simple: loop over all the table, check each tuple against
any qual, and pass the visible ones to the callback.  However, in an
attempt to make it cater to various needs slightly outside of its
original design purpose, various warts have been added, and there are
enough of them now that I at least find it fairly difficult to
understand.  One of those warts is anyvisible, which I gather was
added in support of BRIN.

I first spent some time looking at how the 'anyvisible' flag affects
the behavior of the function. AFAICS, setting the flag to true results
in three behavior changes:

1. The elog(WARNING, ...) calls about a concurrent insert/delete calls
in progress can't be reached.
2. In some cases, reltuples += 1 might not occur where it would've
happened otherwise.
3. If we encounter a HOT-updated which was deleted by our own
transaction, we index it instead of skipping it.

Change #2 doesn't matter because the only caller that passes
anyvisible = true seems to be BRIN, and BRIN ignores the return value.
I initially thought that change #1 must not matter either, because
function has comments in several places saying that the caller must
hold ShareLock or better.  And I thought change #3 must also not
matter, because as the comments explain, this function is used to
build indexes, and if our CREATE INDEX command commits, then any
deletions that it has already performed will commit too, so the fact
that we haven't indexed the now-deleted tuples will be fine.  Then I
realized that brin_summarize_new_values() is calling this function
*without* ShareLock and for *not* for the purpose of creating a new
index but rather for the purpose of updating an existing index, which
means #1 and #3 do matter after all. But I think it's kind of
confusing because anyvisible doesn't change anything about which
tuples are visible.  SnapshotAny is already making "any" tuple
"visible." This flag really means "caller is holding a
lower-than-normal lock level and is not inserting into a brand new
relfilnode".

There may be more than just a cosmetic problem here, because the comments say:

         * It might look unsafe to use this information across buffer
         * lock/unlock.  However, we hold ShareLock on the table so no
         * ordinary insert/update/delete should occur; and we hold pin on the
         * buffer continuously while visiting the page, so no pruning
         * operation can occur either.

In the BRIN case that doesn't apply; I don't know whether this is safe
in that case for some other reason.

I also note that amcheck's bt_check_every_level can also call this
without ShareLock. It doesn't need to set anyvisible because passing a
snapshot bypasses the WARNINGs anyway, but it might have whatever
problem the above comment is thinking about.

Also, it's just cosmetic, but this comment definitely needs updating:

            /*
             * We could possibly get away with not locking the buffer here,
             * since caller should hold ShareLock on the relation, but let's
             * be conservative about it.  (This remark is still correct even
             * with HOT-pruning: our pin on the buffer prevents pruning.)
             */
            LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);

One more thing.  Assuming that there are no live bugs here, or that we
fix them, another possible simplification would be to remove the
anyvisible = true flag and have BRIN pass SnapshotNonVacuumable.
SnapshotNonVacuumable returns true when HeapTupleSatisfiesVacuum
doesn't return HEAPTUPLE_DEAD, so I think we'd get exactly the same
behavior (again, modulo reltuples, which doesn't matter).
heap_getnext() would perform functionally the same check as the
bespoke code internally, and just wouldn't return the dead tuples in
the first place.  There's an assertion that would trip, but we could
probably just change it.  BRIN's callback might also get a different
value for tupleIsAlive in some cases, but it ignores that value
anyway.

So to summarize:

1. Is this function really safe with < ShareLock?  Both BRIN and
amcheck think so, but the function itself isn't sure. If yes, we need
to adapt the comments.  If no, we need to think about some other fix.

2. anyvisible is a funny name given what the flag really does. Maybe
we can simplify by replacing it with SnapshotNonVacuumable().
Otherwise maybe we should rename the flag.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: pg_dump: fail to restore partition table with serial type
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: heapam_index_build_range_scan's anyvisible