Re: GiST index corruption with large tuples

From: Tom Lane
Subject: Re: GiST index corruption with large tuples
Msg-id: 2424.1560365079@sss.pgh.pa.us
In reply to: GiST index corruption with large tuples (Gabríel Arthúr Pétursson <gabriel.arthur.petursson@advania.is>)
List: pgsql-bugs
Gabríel Arthúr Pétursson <gabriel.arthur.petursson@advania.is> writes:
> We have been seeing a GiST index becoming corrupt when large tuples are
> inserted into an indexed table. On our production database we've seen the
> corruption manifest itself as various obscure errors, including: "failed to
> add item to index page", "fixing incomplete split in index", "no unpinned
> buffers available", and "stack depth limit exceeded" during INSERTs and
> REINDEXes.
> We've also observed that if we ignore these errors and retry our transaction
> repeatedly, the index will sometimes start to grow without bounds -- until we
> run out of disk space.

The short answer here seems to be that btree_gist, and maybe GiST in
general, isn't defending itself against index entries that are beyond
its capacity.  btree_gist will accept text values that are up to a little
under one page in size, but that's completely unsafe, because as soon as
a page split happens it *must* be able to represent a range with such
a value as either or both ends.  So there's a hard limit that the values
had better not be wider than a little under half a page.  I'm doubtful
that even that restriction is enough, because it'd still imply that
upper-page index entries could be so large that you could only fit one per
page, meaning you have no fanout.  Likely we need to restrict the input
values to less than a quarter page to ensure sane behavior.
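
To put rough numbers on that (a back-of-the-envelope sketch, assuming
the default BLCKSZ of 8192 and ignoring page and tuple header overhead,
so the real ceilings would be a bit lower):

    SHOW block_size;   -- 8192 on a default build
    -- "half a page" vs. "quarter page", before header overhead:
    SELECT current_setting('block_size')::int / 2 AS half_page_bytes,
           current_setting('block_size')::int / 4 AS quarter_page_bytes;
    -- => half_page_bytes = 4096, quarter_page_bytes = 2048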

(We could be laxer about this if we had some sort of truncation scheme to
make the upper-page tuples smaller.  I think Peter G. has been working on
ideas related to that for btree; but it's not going to be done for v12,
let alone be something we'd consider back-patching.  In any case there's
still going to be *some* limit that's a good deal smaller than what's
in your example.)

FWIW, I doubt that the index is "corrupt" per se; it's just in a state
where the code doesn't know how to perform a valid page split.

> Here's a small reproducer:
>     CREATE EXTENSION IF NOT EXISTS btree_gist;
>     CREATE TABLE aoeu(id INT GENERATED BY DEFAULT AS IDENTITY, value TEXT NOT NULL);
>     CREATE INDEX aoeu_value_idx ON aoeu USING GIST (value);
>     INSERT INTO aoeu (value) VALUES (repeat('a', 65536*9));
>     INSERT INTO aoeu (value) VALUES (repeat('a', 65536));

The index entry created for that first insertion is around 6750 bytes
(after compression), so it's going to cause trouble as soon as any
page split or reindexing happens.
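
If you want to see that number for yourself, pg_column_size() on the
stored heap value is a reasonable proxy for the compressed datum size
(reusing the aoeu table from the reproducer above):

    -- pg_column_size() reports the stored, possibly-compressed size of
    -- the datum, which should be close to the leaf index entry's size
    SELECT id,
           octet_length(value)   AS raw_bytes,    -- 589824 for the first row
           pg_column_size(value) AS stored_bytes  -- ~6750 after compression
    FROM aoeu
    ORDER BY id;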

I'm not very sure where a convenient place to enforce an
input-value-size limit would be, unfortunately.  Ideally we'd prefer to
enforce the
limit post-compression, but it doesn't look like btree_gist deals with the
compressed representation of the input at all.  Maybe it needs to do
compression in gbt_var_compress?

The situation is probably even worse when you consider multicolumn
GIST indexes.  Then we need an overall limit that is tighter than
what any one column opclass would think it needs to enforce.  That
leads to wanting to just check the overall size of the leaf tuple
... but the generic code that would do that doesn't know whether
the opclass(es) are going to form upper-page values that are larger
than the leaf value.  We definitely need to ensure that any failure
of this sort happens at leaf-tuple insertion, not later.
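
For illustration, a hypothetical shape of that problem (table and index
names made up; the md5 chains are just incompressible filler, and the
exact point of failure would depend on compression and split timing):

    CREATE TABLE pair (a text NOT NULL, b text NOT NULL);
    CREATE INDEX pair_gist_idx ON pair USING gist (a, b);  -- needs btree_gist
    -- each column is ~1600 bytes of incompressible hex: under a ~2 kB
    -- per-column ceiling on its own, but the combined leaf tuple is ~3.2 kB
    INSERT INTO pair
    SELECT string_agg(md5(i::text), ''),
           string_agg(md5((i + 100)::text), '')
    FROM generate_series(1, 50) AS i;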

            regards, tom lane


