Обсуждение: GiST index corruption with large tuples

Поиск
Список
Период
Сортировка

GiST index corruption with large tuples

От
Gabríel Arthúr Pétursson
Дата:
Hi,

We have been seeing a GiST index becoming corrupt when large tuples are
inserted into an indexed table. On our production database we've seen the
corruption manifest itself as various obscure errors, including: "failed to
add item to index page", "fixing incomplete split in index", "no unpinned
buffers available", and "stack depth limit exceeded" during INSERT-s and
REINDEX-es.

We've also observed that if we ignore these errors and retry our transaction
repeatedly, the index will sometimes start to grow without bounds -- until we
run out of disk space.

We've reproduced the bug on both CentOS 7 and Fedora 30 using the official RPM repositories.

    * PostgreSQL 11.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
    * PostgreSQL 11.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 9.0.1 20190312 (Red Hat 9.0.1-0.10), 64-bit

Here's a small reproducer:

    CREATE EXTENSION IF NOT EXISTS btree_gist;
    CREATE TABLE aoeu(id INT GENERATED BY DEFAULT AS IDENTITY, value TEXT NOT NULL);
    CREATE INDEX aoeu_value_idx ON aoeu USING GIST (value);

    INSERT INTO aoeu (value) VALUES (repeat('a', 65536*9));
    INSERT INTO aoeu (value) VALUES (repeat('a', 65536));

    -- ERROR:  XX000: failed to add item to index page in "aoeu_value_idx"
    -- LOCATION:  gistplacetopage, gist.c:417
    INSERT INTO aoeu (value) VALUES (repeat('a', 65536));

    -- ERROR:  54000: index row requires 13536 bytes, maximum size is 8191
    -- LOCATION:  index_form_tuple, indextuple.c:177
    REINDEX INDEX aoeu_value_idx;

Please let us know if there's any additional information requested.

Thanks in advance,
Gabríel


Re: GiST index corruption with large tuples

От
Tom Lane
Дата:
=?iso-8859-1?Q?Gabr=EDel_Arth=FAr_P=E9tursson?= <gabriel.arthur.petursson@advania.is> writes:
> We have been seeing a GiST index becoming corrupt when large tuples are
> inserted into an indexed table. On our production database we've seen the
> corruption manifest itself as various obscure errors, including: "failed to
> add item to index page", "fixing incomplete split in index", "no unpinned
> buffers available", and "stack depth limit exceeded" during INSERT-s and
> REINDEX-es.
> We've also observed that if we ignore these errors and retry our transaction
> repeatedly, the index will sometimes start to grow without bounds -- until we
> run out of disk space.

The short answer here seems to be that btree_gist, and maybe GiST in
general, isn't defending itself against index entries that are beyond
its capacity.  btree_gist will accept text values that are up to a little
under one page in size, but that's completely unsafe, because as soon as
a page split happens it *must* be able to represent a range with such
a value as either or both ends.  So there's a hard limit that the values
had better not be wider than a little under half a page.  I'm doubtful
that even that restriction is enough, because it'd still imply that
upper-page index entries could be so large that you could only fit one per
page, meaning you have no fanout.  Likely we need to restrict the input
values to less than a quarter page to ensure sane behavior.

(We could be laxer about this if we had some sort of truncation scheme to
make the upper-page tuples smaller.  I think Peter G. has been working on
ideas related to that for btree; but it's not going to be done for v12,
let alone be something we'd consider back-patching.  In any case there's
still going to be *some* limit that's a good deal smaller than what's
in your example.)

FWIW, I doubt that the index is "corrupt" per se, it's just in a state
where the code doesn't know how to perform a valid page split.

> Here's a small reproducer:
>     CREATE EXTENSION IF NOT EXISTS btree_gist;
>     CREATE TABLE aoeu(id INT GENERATED BY DEFAULT AS IDENTITY, value TEXT NOT NULL);
>     CREATE INDEX aoeu_value_idx ON aoeu USING GIST (value);
>     INSERT INTO aoeu (value) VALUES (repeat('a', 65536*9));
>     INSERT INTO aoeu (value) VALUES (repeat('a', 65536));

The index entry created for that first insertion is around 6750 bytes
(after compression), so it's going to cause trouble as soon as any
page split or reindexing happens.

I'm not very sure where's a convenient place to enforce an
input-value-size limit, unfortunately.  Ideally we'd prefer to enforce the
limit post-compression, but it doesn't look like btree_gist deals with the
compressed representation of the input at all.  Maybe it needs to do
compression in gbt_var_compress?

The situation is probably even worse when you consider multicolumn
GIST indexes.  Then we need an overall limit that is tighter than
what any one column opclass would think it needs to enforce.  That
leads to wanting to just check the overall size of the leaf tuple
... but the generic code that would do that doesn't know whether
the opclass(es) are going to form upper-page values that are larger
than the leaf value.  We definitely need to ensure that any failure
of this sort happens at leaf-tuple insertion, not later.

            regards, tom lane