Обсуждение: Correct docs about GiST leaf page structure

Поиск
Список
Период
Сортировка

Correct docs about GiST leaf page structure

От
Paul A Jungwirth
Дата:
Our docs for GiST indexes say the compress function is only used for
internal pages, not leaf pages, but actually it is used everywhere.
Here are two patches to clean things up.

You can see that we store compressed values with the pageinspect
extension. For instance, multiranges are compressed to ranges. Here
they are in leaf pages:

[v19devel:5432][314069] postgres=# create table mr (id int4multirange);
CREATE TABLE
[v19devel:5432][314069] postgres=# create index idx_mr on mr using gist (id);
CREATE INDEX
[v19devel:5432][314069] postgres=# insert into mr values ('{[1,2)}'),
('{[2,3)}');
INSERT 0 2
[v19devel:5432][314069] postgres=# select * from
gist_page_opaque_info(get_raw_page('idx_mr', 0));
    lsn     |    nsn     | rightlink  | flags
------------+------------+------------+--------
 0/01917320 | 0/00000000 | 4294967295 | {leaf}
(1 row)

[v19devel:5432][314069] postgres=# select * from
gist_page_items(get_raw_page('idx_mr', 0), 'idx_mr');
 itemoffset | ctid  | itemlen | dead |      keys
------------+-------+---------+------+----------------
          1 | (0,1) |      24 | f    | (id)=("[1,2)")
          2 | (0,2) |      24 | f    | (id)=("[2,3)")

Similarly, btree_gist stores gbtreekey entries in leaf pages:

[v19devel:5432][314069] postgres=# create table i (id int);
CREATE TABLE
[v19devel:5432][314069] postgres=# create index idx_i on i using gist (id);
CREATE INDEX
[v19devel:5432][314069] postgres=# insert into i values (1), (2), (3);
INSERT 0 3
[v19devel:5432][314069] postgres=# select * from
gist_page_items(get_raw_page('idx_i', 0), 'idx_i');
ERROR:  cannot display a value of type gbtreekey?
[v19devel:5432][314069] postgres=# select * from
gist_page_items_bytea(get_raw_page('idx_i', 0));
 itemoffset | ctid  | itemlen | dead |              key_data
------------+-------+---------+------+------------------------------------
          1 | (0,1) |      16 | f    | \x00000000010010000100000001000000
          2 | (0,2) |      16 | f    | \x00000000020010000200000002000000
          3 | (0,3) |      16 | f    | \x00000000030010000300000003000000

I think this error goes back to the second GiST paper referenced in
access/gist/README, titled "Concurrency and Recovery in Generalized
Search Trees" (from 1997). On page 2 it says that internal pages store
the predicate and leaf pages store the key. (The original 1995 paper
doesn't differentiate like that though.) Since our README has a list
of ways that our implementation diverges from the research, I added a
note there as well.

I've also supplied a patch to clarify that there are two papers. The
old wording is a bit confusing:

> GiST stands for Generalized Search Tree. It was introduced in the seminal paper
> "Generalized Search Trees for Database Systems", 1995, Joseph M. Hellerstein,
> Jeffrey F. Naughton, Avi Pfeffer:
>
>     http://www.sai.msu.su/~megera/postgres/gist/papers/gist.ps
>     https://dsf.berkeley.edu/papers/sigmod97-gist.pdf

Clarifying the two papers helps me call out the leaf page difference
in the main patch.

Yours,

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com

Вложения

Re: Correct docs about GiST leaf page structure

От
Tom Lane
Дата:
Paul A Jungwirth <pj@illuminatedcomputing.com> writes:
> Our docs for GiST indexes say the compress function is only used for
> internal pages, not leaf pages, but actually it is used everywhere.
> Here are two patches to clean things up.

> You can see that we store compressed values with the pageinspect
> extension. For instance, multiranges are compressed to ranges. Here
> they are in leaf pages:

Actually I think it's more complicated than that.  A GiST opclass
can choose whether to compress leaf-key entries, and if it does it
can use a different representation than it does on internal pages.
You can see that in action in compress/decompress functions that
pay attention to the GISTENTRY.leafkey flag, which many do.

So I'm inclined to propose text more like the attached.  I merged
your two patches into one (didn't seem all that useful to separate).
Also, I dropped the adjacent sentence suggesting using the STORAGE
option.  AFAIK that's pretty useless here: I don't think any GiST
code pays attention to it.  At least part of the reason is that it's
inadequate to describe the possibility that leaf and internal datums
are different.

Thoughts?

            regards, tom lane

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 5c0a0c48bab..7a5e664db68 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -273,13 +273,15 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
    index will depend on the <function>penalty</function> and <function>picksplit</function>
    methods.
    Two optional methods are <function>compress</function> and
-   <function>decompress</function>, which allow an index to have internal tree data of
-   a different type than the data it indexes. The leaves are to be of the
-   indexed data type, while the other tree nodes can be of any C struct (but
+   <function>decompress</function>, which allow an index to store keys that
+   are of a different type than the data it indexes. The index entries can be
+   any valid Datums (but
    you still have to follow <productname>PostgreSQL</productname> data type rules here,
-   see about <literal>varlena</literal> for variable sized data). If the tree's
-   internal data type exists at the SQL level, the <literal>STORAGE</literal> option
-   of the <command>CREATE OPERATOR CLASS</command> command can be used.
+   see about <literal>varlena</literal> for variable sized data).
+   Furthermore, since <function>compress</function> and
+   <function>decompress</function> are told whether they are working on
+   Datums for leaf-level or internal pages, different representations
+   can be used for leaf keys than higher-level pages.
    The optional eighth method is <function>distance</function>, which is needed
    if the operator class wishes to support ordered scans (nearest-neighbor
    searches). The optional ninth method <function>fetch</function> is needed if the
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 76e0e11f228..75445b07455 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -10,9 +10,13 @@ GiST stands for Generalized Search Tree. It was introduced in the seminal paper
 Jeffrey F. Naughton, Avi Pfeffer:

     http://www.sai.msu.su/~megera/postgres/gist/papers/gist.ps
+
+Concurrency support was described in "Concurrency and Recovery in Generalized
+Search Trees", 1997, Marcel Kornacker, C. Mohan, Joseph M. Hellerstein:
+
     https://dsf.berkeley.edu/papers/sigmod97-gist.pdf

-and implemented by J. Hellerstein and P. Aoki in an early version of
+GiST was implemented by J. Hellerstein and P. Aoki in an early version of
 PostgreSQL (more details are available from The GiST Indexing Project
 at Berkeley at http://gist.cs.berkeley.edu/). As a "university"
 project it had a limited number of features and was in rare use.
@@ -55,6 +59,9 @@ The original algorithms were modified in several ways:
   it is now a single-pass algorithm.
 * Since the papers were theoretical, some details were omitted and we
   had to find out ourself how to solve some specific problems.
+* The 1997 paper above (but not the 1995 one) states that leaf pages should
+  store the original key.  While that can be done in PostgreSQL, it is
+  also possible to use a compressed representation in leaf pages.

 Because of the above reasons, we have revised the interaction of GiST
 core and PostgreSQL WAL system. Moreover, we encountered (and solved)

Re: Correct docs about GiST leaf page structure

От
Paul A Jungwirth
Дата:
On Sun, Mar 29, 2026 at 12:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Paul A Jungwirth <pj@illuminatedcomputing.com> writes:
> > Our docs for GiST indexes say the compress function is only used for
> > internal pages, not leaf pages, but actually it is used everywhere.
> > Here are two patches to clean things up.
>
> > You can see that we store compressed values with the pageinspect
> > extension. For instance, multiranges are compressed to ranges. Here
> > they are in leaf pages:
>
> Actually I think it's more complicated than that.  A GiST opclass
> can choose whether to compress leaf-key entries, and if it does it
> can use a different representation than it does on internal pages.
> You can see that in action in compress/decompress functions that
> pay attention to the GISTENTRY.leafkey flag, which many do.

Ah, thanks for pointing that out!

> So I'm inclined to propose text more like the attached.  I merged
> your two patches into one (didn't seem all that useful to separate).
> Also, I dropped the adjacent sentence suggesting using the STORAGE
> option.  AFAIK that's pretty useless here: I don't think any GiST
> code pays attention to it.  At least part of the reason is that it's
> inadequate to describe the possibility that leaf and internal datums
> are different.

I think your changes are great. I agree about not needing two commits.
My only hesitation is removing the line about STORAGE. In btree_gist
we do declare the storage of many opclasses. But I'm not sure why. Is
it necessary? Does an opclass gain some advantage from it? Does core
use that information somehow? Especially if leaf keys might or might
not be the same type as internal keys, I'm not sure what value
declaring STORAGE can provide. (It must be for core's sake, not the
opclass's, right?)

Yours,

--
Paul              ~{:-)
pj@illuminatedcomputing.com



Re: Correct docs about GiST leaf page structure

От
Tom Lane
Дата:
Paul A Jungwirth <pj@illuminatedcomputing.com> writes:
> On Sun, Mar 29, 2026 at 12:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Actually I think it's more complicated than that.  A GiST opclass
>> can choose whether to compress leaf-key entries, and if it does it
>> can use a different representation than it does on internal pages.
>> You can see that in action in compress/decompress functions that
>> pay attention to the GISTENTRY.leafkey flag, which many do.

> I think your changes are great. I agree about not needing two commits.
> My only hesitation is removing the line about STORAGE. In btree_gist
> we do declare the storage of many opclasses. But I'm not sure why. Is
> it necessary? Does an opclass gain some advantage from it? Does core
> use that information somehow? Especially if leaf keys might or might
> not be the same type as internal keys, I'm not sure what value
> declaring STORAGE can provide. (It must be for core's sake, not the
> opclass's, right?)

Excellent questions, and thanks for holding my feet to the fire about
that ;-).  Reading more closely, the STORAGE option does indeed do
something: it determines the declared data type of the index's column,
as stored in pg_attribute.  And that's important because GiST uses
that datatype while forming or deforming index tuples.  So it has to
be accurate --- but only to the extent of having the right
typlen/typbyval/typalign properties, because that's as much as
index_form_tuple() and related functions really care about.  They
don't look into the contents of the entries, except for the length
word if it's typlen -1.

My claim that the leaf key representation can be different from upper
levels is still accurate, but both representations have to match the
typlen/typbyval/typalign properties of whatever type is mentioned
in STORAGE.  (I was misled by the fact that the GiST code has
different "leafTupdesc" and "nonLeafTupdesc" tuple descriptors.
But leafTupdesc is just the standard rd_att descriptor made from
the index's pg_attribute entries, and nonLeafTupdesc differs from
it only in having removed any INCLUDE attributes.)

So here's a v3 that accounts for that.  I also decided that we were
going in quite the wrong direction by cramming more info into the
summary paragraph early in gist.sgml.  The general plan there is to
offer about a one-sentence description of each opclass method, and
then go into more detail as necessary in the per-method text below.
So I moved all this info down into the compress method's section.
This seems to me to read noticeably better.

            regards, tom lane

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 5c0a0c48bab..bedbc84d7b0 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -273,13 +273,9 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
    index will depend on the <function>penalty</function> and <function>picksplit</function>
    methods.
    Two optional methods are <function>compress</function> and
-   <function>decompress</function>, which allow an index to have internal tree data of
-   a different type than the data it indexes. The leaves are to be of the
-   indexed data type, while the other tree nodes can be of any C struct (but
-   you still have to follow <productname>PostgreSQL</productname> data type rules here,
-   see about <literal>varlena</literal> for variable sized data). If the tree's
-   internal data type exists at the SQL level, the <literal>STORAGE</literal> option
-   of the <command>CREATE OPERATOR CLASS</command> command can be used.
+   <function>decompress</function>, which allow an index to store keys that
+   are of a different type than the data it indexes, or are a compressed
+   representation of that type.
    The optional eighth method is <function>distance</function>, which is needed
    if the operator class wishes to support ordered scans (nearest-neighbor
    searches). The optional ninth method <function>fetch</function> is needed if the
@@ -484,6 +480,24 @@ my_union(PG_FUNCTION_ARGS)
        in the index without modification.
       </para>

+      <para>
+       Use the <literal>STORAGE</literal> option of the <command>CREATE
+       OPERATOR CLASS</command> command to define the data type that is
+       stored in the index, if it is different from the data type being
+       indexed.  Be aware however that the <literal>STORAGE</literal> data
+       type is only used to define the physical properties of the index
+       entries (their <replaceable>typlen</replaceable>,
+       <replaceable>typbyval</replaceable>,
+       and <replaceable>typalign</replaceable> attributes).  What is
+       actually in the index datums is under the control of the
+       <function>compress</function> and <function>decompress</function>
+       methods, so long as the stored datums fit those properties.  It is
+       allowed for <function>compress</function> to produce different
+       representations for leaf keys than for keys on higher-level index
+       pages, so long as both representations fit
+       the <literal>STORAGE</literal> data type.
+      </para>
+
       <para>
         The <acronym>SQL</acronym> declaration of the function must look like this:

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 76e0e11f228..75445b07455 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -10,9 +10,13 @@ GiST stands for Generalized Search Tree. It was introduced in the seminal paper
 Jeffrey F. Naughton, Avi Pfeffer:

     http://www.sai.msu.su/~megera/postgres/gist/papers/gist.ps
+
+Concurrency support was described in "Concurrency and Recovery in Generalized
+Search Trees", 1997, Marcel Kornacker, C. Mohan, Joseph M. Hellerstein:
+
     https://dsf.berkeley.edu/papers/sigmod97-gist.pdf

-and implemented by J. Hellerstein and P. Aoki in an early version of
+GiST was implemented by J. Hellerstein and P. Aoki in an early version of
 PostgreSQL (more details are available from The GiST Indexing Project
 at Berkeley at http://gist.cs.berkeley.edu/). As a "university"
 project it had a limited number of features and was in rare use.
@@ -55,6 +59,9 @@ The original algorithms were modified in several ways:
   it is now a single-pass algorithm.
 * Since the papers were theoretical, some details were omitted and we
   had to find out ourself how to solve some specific problems.
+* The 1997 paper above (but not the 1995 one) states that leaf pages should
+  store the original key.  While that can be done in PostgreSQL, it is
+  also possible to use a compressed representation in leaf pages.

 Because of the above reasons, we have revised the interaction of GiST
 core and PostgreSQL WAL system. Moreover, we encountered (and solved)

Re: Correct docs about GiST leaf page structure

От
Paul A Jungwirth
Дата:
On Mon, Mar 30, 2026 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Paul A Jungwirth <pj@illuminatedcomputing.com> writes:
> > On Sun, Mar 29, 2026 at 12:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Actually I think it's more complicated than that.  A GiST opclass
> >> can choose whether to compress leaf-key entries, and if it does it
> >> can use a different representation than it does on internal pages.
> >> You can see that in action in compress/decompress functions that
> >> pay attention to the GISTENTRY.leafkey flag, which many do.
>
> > I think your changes are great. I agree about not needing two commits.
> > My only hesitation is removing the line about STORAGE. In btree_gist
> > we do declare the storage of many opclasses. But I'm not sure why. Is
> > it necessary? Does an opclass gain some advantage from it? Does core
> > use that information somehow? Especially if leaf keys might or might
> > not be the same type as internal keys, I'm not sure what value
> > declaring STORAGE can provide. (It must be for core's sake, not the
> > opclass's, right?)
>
> Excellent questions, and thanks for holding my feet to the fire about
> that ;-).  Reading more closely, the STORAGE option does indeed do
> something: it determines the declared data type of the index's column,
> as stored in pg_attribute.  And that's important because GiST uses
> that datatype while forming or deforming index tuples.  So it has to
> be accurate --- but only to the extent of having the right
> typlen/typbyval/typalign properties, because that's as much as
> index_form_tuple() and related functions really care about.  They
> don't look into the contents of the entries, except for the length
> word if it's typlen -1.
>
> My claim that the leaf key representation can be different from upper
> levels is still accurate, but both representations have to match the
> typlen/typbyval/typalign properties of whatever type is mentioned
> in STORAGE.  (I was misled by the fact that the GiST code has
> different "leafTupdesc" and "nonLeafTupdesc" tuple descriptors.
> But leafTupdesc is just the standard rd_att descriptor made from
> the index's pg_attribute entries, and nonLeafTupdesc differs from
> it only in having removed any INCLUDE attributes.)
>
> So here's a v3 that accounts for that.  I also decided that we were
> going in quite the wrong direction by cramming more info into the
> summary paragraph early in gist.sgml.  The general plan there is to
> offer about a one-sentence description of each opclass method, and
> then go into more detail as necessary in the per-method text below.
> So I moved all this info down into the compress method's section.
> This seems to me to read noticeably better.

I appreciate the explanation! Your v3 text has a lot of good info. I like it.

--
Paul              ~{:-)
pj@illuminatedcomputing.com



Re: Correct docs about GiST leaf page structure

От
Tom Lane
Дата:
Paul A Jungwirth <pj@illuminatedcomputing.com> writes:
> I appreciate the explanation! Your v3 text has a lot of good info. I like it.

OK.  Pushed after some trivial additional wordsmithing.

            regards, tom lane