Обсуждение: Yet another fast GiST build

Поиск

Список

Период

Сортировка

Yet another fast GiST build

От

Andrey Borodin

Дата:

26 августа 2019 г., 10:59:11

Hi!

In many cases GiST index can be build fast using z-order sorting.

I've looked into proof of concept by Nikita Glukhov [0] and it looks very interesting.
So, I've implemented yet another version of B-tree-like GiST build.
It's main use case and benefits can be summarized with small example:

postgres=# create table x as select point (random(),random()) from generate_series(1,3000000,1);
SELECT 3000000
Time: 5061,967 ms (00:05,062)
postgres=# create index ON x using gist (point ) with (fast_build_sort_function=gist_point_sortsupport);
CREATE INDEX
Time: 6140,227 ms (00:06,140)
postgres=# create index ON x using gist (point );
CREATE INDEX
Time: 32061,200 ms (00:32,061)

As you can see, Z-order build is on order of magnitude faster. Select performance is roughly the same. Also, index is
significantlysmaller. 

Nikita's PoC is faster because it uses parallel build, but it intervenes into B-tree code a lot (for reuse). This
patchsetis GiST-isolated. 
My biggest concern is that passing function to relation option seems a bit hacky. You can pass there any function
matchingsort support signature. 
Embedding this function into opclass makes no sense: it does not affect scan anyhow.

In current version, docs and tests are not implemented. I want to discuss overall design. Do we really want yet another
GiSTbuild, if it is 3-10 times faster? 

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/compare/master...glukhovn:gist_btree_build

Вложения

Re: Yet another fast GiST build

От

Darafei "Komяpa" Praliaskouski

Дата:

26 августа 2019 г., 13:27:27

Hello,

This is very interesting. In my pipeline currently GiST index rebuild is the biggest time consuming step.

I believe introducing optional concept of order in the GiST opclass will be beneficial not only for fast build, but for other tasks later:
- CLUSTER can order the table using that notion, in parallel way.
- btree_gist can be even closer to btree by getting the tuples sorted inside page.
- tree descend on insertion in future can traverse the list in more opportunistic way, calculating penalty for siblings-by-order first.

I believe everywhere the idea of ordering is needed it's provided by giving a btree opclass.

How about giving a link to btree opclass inside a gist opclass?

On Mon, Aug 26, 2019 at 10:59 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi!

In many cases GiST index can be build fast using z-order sorting.

I've looked into proof of concept by Nikita Glukhov [0] and it looks very interesting.
So, I've implemented yet another version of B-tree-like GiST build.
It's main use case and benefits can be summarized with small example:

postgres=# create table x as select point (random(),random()) from generate_series(1,3000000,1);
SELECT 3000000
Time: 5061,967 ms (00:05,062)
postgres=# create index ON x using gist (point ) with (fast_build_sort_function=gist_point_sortsupport);
CREATE INDEX
Time: 6140,227 ms (00:06,140)
postgres=# create index ON x using gist (point );
CREATE INDEX
Time: 32061,200 ms (00:32,061)

As you can see, Z-order build is on order of magnitude faster. Select performance is roughly the same. Also, index is significantly smaller.

Nikita's PoC is faster because it uses parallel build, but it intervenes into B-tree code a lot (for reuse). This patchset is GiST-isolated.
My biggest concern is that passing function to relation option seems a bit hacky. You can pass there any function matching sort support signature.
Embedding this function into opclass makes no sense: it does not affect scan anyhow.

In current version, docs and tests are not implemented. I want to discuss overall design. Do we really want yet another GiST build, if it is 3-10 times faster?

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/compare/master...glukhovn:gist_btree_build

Darafei Praliaskouski

Support me: http://patreon.com/komzpa

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

26 августа 2019 г., 13:47:22

On 26/08/2019 10:59, Andrey Borodin wrote:
> Hi!
> 
> In many cases GiST index can be build fast using z-order sorting.
> 
> I've looked into proof of concept by Nikita Glukhov [0] and it looks very interesting.
> So, I've implemented yet another version of B-tree-like GiST build.

Cool!

> My biggest concern is that passing function to relation option seems
> a bit hacky. You can pass there any function matching sort support
> signature. Embedding this function into opclass makes no sense: it
> does not affect scan anyhow.

I think it should be a new, optional, GiST "AM support function", in 
pg_amproc. That's how the sort support functions are defined for B-tree 
opfamilies, too.

- Heikki

Re: Yet another fast GiST build

От

Alexander Korotkov

Дата:

30 августа 2019 г., 01:47:49

On Mon, Aug 26, 2019 at 10:59 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> In many cases GiST index can be build fast using z-order sorting.
>
> I've looked into proof of concept by Nikita Glukhov [0] and it looks very interesting.
> So, I've implemented yet another version of B-tree-like GiST build.
> It's main use case and benefits can be summarized with small example:
>
> postgres=# create table x as select point (random(),random()) from generate_series(1,3000000,1);
> SELECT 3000000
> Time: 5061,967 ms (00:05,062)
> postgres=# create index ON x using gist (point ) with (fast_build_sort_function=gist_point_sortsupport);
> CREATE INDEX
> Time: 6140,227 ms (00:06,140)
> postgres=# create index ON x using gist (point );
> CREATE INDEX
> Time: 32061,200 ms (00:32,061)
>
> As you can see, Z-order build is on order of magnitude faster. Select performance is roughly the same. Also, index is
significantlysmaller.

Cool!  These experiments bring me to following thoughts.  Can we not
only build, but also maintain GiST indexes in B-tree-like manner?  If
we put Z-value together with MBR to the non-leaf keys, that should be
possible.  Maintaining it in B-tree-like manner would have a lot of
advantages.
1) Binary search in non-leaf pages instead of probing each key is much faster.
2) Split algorithm is also simpler and faster.
3) We can refind existing index tuple in predictable time of O(log N).
And this doesn't depend on MBR overlapping degree.  This is valuable
for future table AMs, which would have notion of deleting individual
index tuples (for instance, zheap promises to eventually support
delete-marking indexes).

Eventually, we may come to the idea of B-tree indexes with
user-defined additional keys in non-leaf tuples, which could be used
for scanning.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Yet another fast GiST build

От

Peter Geoghegan

Дата:

30 августа 2019 г., 02:33:48

On Thu, Aug 29, 2019 at 3:48 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> > As you can see, Z-order build is on order of magnitude faster. Select performance is roughly the same. Also, index
issignificantly smaller.

>
> Cool!  These experiments bring me to following thoughts.  Can we not
> only build, but also maintain GiST indexes in B-tree-like manner?  If
> we put Z-value together with MBR to the non-leaf keys, that should be
> possible.  Maintaining it in B-tree-like manner would have a lot of
> advantages.

I'm not an expert on GiST, but that seems like it would have a lot of
advantages in the long term. It is certainly theoretically appealing.

Could this make it easier to use merge join with containment
operators? I'm thinking of things like geospatial joins, which can
generally only be performed as nested loop joins at the moment. This
is often wildly inefficient.

-- 
Peter Geoghegan

Re: Yet another fast GiST build

От

Alexander Korotkov

Дата:

30 августа 2019 г., 06:21:55

On Fri, Aug 30, 2019 at 2:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Aug 29, 2019 at 3:48 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > > As you can see, Z-order build is on order of magnitude faster. Select performance is roughly the same. Also,
indexis significantly smaller.

> >
> > Cool!  These experiments bring me to following thoughts.  Can we not
> > only build, but also maintain GiST indexes in B-tree-like manner?  If
> > we put Z-value together with MBR to the non-leaf keys, that should be
> > possible.  Maintaining it in B-tree-like manner would have a lot of
> > advantages.
>
> I'm not an expert on GiST, but that seems like it would have a lot of
> advantages in the long term. It is certainly theoretically appealing.
>
> Could this make it easier to use merge join with containment
> operators? I'm thinking of things like geospatial joins, which can
> generally only be performed as nested loop joins at the moment. This
> is often wildly inefficient.

AFAICS, spatial joins aren't going to be as easy as just merge joins
on Z-value.  When searching for close points in two datasets (closer
than given threshold) we can scan some ranges of Z-value in one
dataset while iterating on another.  But dealing with prolonged
spatial objects in not that easy.  In order to determine if there are
matching values in given Z-range you also need to be aware on size of
objects inside that Z-range.  So, before merge-like join you need to
not only sort, but build some index-like structure.

Alternatively you can encode size in Z-value.  But this increases
dimensionality of space and decreases efficiency of join.  Also,
spatial join can be made using two indexes, even just current GiST
without Z-values.  We've prototyped that, see [1].

Links
1. https://github.com/pgsphere/pgsphere/blob/crossmatch_cnode/crossmatch.c

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Yet another fast GiST build

От

Peter Geoghegan

Дата:

30 августа 2019 г., 06:28:02

On Thu, Aug 29, 2019 at 8:22 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Alternatively you can encode size in Z-value.  But this increases
> dimensionality of space and decreases efficiency of join.  Also,
> spatial join can be made using two indexes, even just current GiST
> without Z-values.  We've prototyped that, see [1].

I'm pretty sure that spatial joins generally need two spatial indexes
(usually R-Trees). There seems to have been quite a lot of research in
it in the 1990s.

-- 
Peter Geoghegan

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

30 августа 2019 г., 14:44:43

30 авг. 2019 г., в 3:47, Alexander Korotkov <a.korotkov@postgrespro.ru> написал(а):

1) Binary search in non-leaf pages instead of probing each key is much faster.

That's a neat idea, but key union breaks ordering, even for z-order.

for two sets of tuples X and Y

if for any i,o from N, Xi < Yo

does not guaranty union(X) < union (Y)

For example consider this z-ordered keyspace (picture attached)

union(5, 9) is z-order-smaller than union(4,4)

I'm not even sure we can use sorted search for choosing subtree for insertion.

How do you think, should I supply GiST-build patch with docs and tests and add it to CF? Or do we need more design discussion before?

Best regards, Andrey Borodin.

Вложения

PastedGraphic-2.png

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

01 сентября 2019 г., 13:53:26


> 30 авг. 2019 г., в 16:44, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
>
> How do you think, should I supply GiST-build patch with docs and tests and add it to CF? Or do we need more design
discussionbefore? 

PFA v2: now sort support is part of opclass.
There's a problem with Z-ordering NaN which causes regression tests to fail. So I decided not to add patch to CF
(despitehaving few minutes to do so). 
How to correctly Z-order NaN? So that it would be consistent with semantics of union() and consistent() functions. If
oneof values is NaN, then we consider all it's bits to be 1? 

BTW patch uses
union {
        float f;
        uint32 i;
    }
I hope it's OK, because AFAIK we do not have non-IEEE-754 platforms now.

Thanks!

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build

От

Alexander Korotkov

Дата:

01 сентября 2019 г., 20:17:48

On Fri, Aug 30, 2019 at 2:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> 30 авг. 2019 г., в 3:47, Alexander Korotkov <a.korotkov@postgrespro.ru> написал(а):
>
> 1) Binary search in non-leaf pages instead of probing each key is much faster.
>
>
> That's a neat idea, but key union breaks ordering, even for z-order.
> for two sets of tuples X and Y
> if for any i,o from N, Xi < Yo
> does not guaranty union(X) < union (Y)
>
> For example consider this z-ordered keyspace (picture attached)
>
> union(5, 9) is z-order-smaller than union(4,4)
>
> I'm not even sure we can use sorted search for choosing subtree for insertion.

Sorry, I didn't explain my proposal in enough details.  I didn't mean
B-tree separator keys would be the same as union key (MBR).  I mean
B-tree on Z-values, which maintains union key in addition to separator
keys.  So, you select downlink to insert using separator Z-values and
then also extend union key (if needed).  It's probably still not
enough detail yet.  I'll try to spend more time for more detailed
description later.

>
> How do you think, should I supply GiST-build patch with docs and tests and add it to CF? Or do we need more design
discussionbefore? 

+1 for adding to CF.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Yet another fast GiST build

От

Alexander Korotkov

Дата:

01 сентября 2019 г., 20:20:51

On Fri, Aug 30, 2019 at 6:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Aug 29, 2019 at 8:22 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Alternatively you can encode size in Z-value.  But this increases
> > dimensionality of space and decreases efficiency of join.  Also,
> > spatial join can be made using two indexes, even just current GiST
> > without Z-values.  We've prototyped that, see [1].
>
> I'm pretty sure that spatial joins generally need two spatial indexes
> (usually R-Trees). There seems to have been quite a lot of research in
> it in the 1990s.

Sure, our prototype was an implementation of one of such papers.  My
point is that advantages of Z-value ordered GiST for spatial joins are
not yet clear for me.  Except faster build and smaller index, which
are general advantages.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

08 сентября 2019 г., 11:54:35


> 1 сент. 2019 г., в 15:53, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
>
>
<v2-0001-Add-sort-support-for-point-gist_point_sortsupport.patch><v2-0002-Implement-GiST-build-using-sort-support.patch>

Here's V3 of the patch set.
Changes:
1. Added some documentation of new sort support routines
2. Fixed bug with dirty pages

I did not add sort support procs to built-in boxes, circles and polys, since it may be not optimal index for them.
However,for points Z-order is quite good as a default. 

Tests only pass with fixes for GiST KNN from Alexander in other thread.

Thanks!

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build

От

Michael Paquier

Дата:

01 декабря 2019 г., 05:06:35

On Sun, Sep 08, 2019 at 01:54:35PM +0500, Andrey Borodin wrote:
> Here's V3 of the patch set.
> Changes:
> 1. Added some documentation of new sort support routines
> 2. Fixed bug with dirty pages
>
> I did not add sort support procs to built-in boxes, circles and
> polys, since it may be not optimal index for them. However, for
> points Z-order is quite good as a defaul t.

The latest patch does not apply.  Could you send a rebase?  I have
moved the patch to next CF, waiting on author for now.
--
Michael

Вложения

signature.asc

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

30 декабря 2019 г., 09:43:04


> 1 дек. 2019 г., в 7:06, Michael Paquier <michael@paquier.xyz> написал(а):
>
> On Sun, Sep 08, 2019 at 01:54:35PM +0500, Andrey Borodin wrote:
>> Here's V3 of the patch set.
>> Changes:
>> 1. Added some documentation of new sort support routines
>> 2. Fixed bug with dirty pages
>>
>> I did not add sort support procs to built-in boxes, circles and
>> polys, since it may be not optimal index for them. However, for
>> points Z-order is quite good as a defaul t.
>
> The latest patch does not apply.  Could you send a rebase?  I have
> moved the patch to next CF, waiting on author for now.

Thanks, Michael!

PFA rebased patch.

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build

От

Thomas Munro

Дата:

19 февраля 2020 г., 10:00:49

On Mon, Dec 30, 2019 at 7:43 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> PFA rebased patch.

Hi Andrey,

This looks really interesting, and I am sure there are a lot of GIS
people who would love to see dramatically faster and smaller indexes
in PG13.  I don't know enough to comment on the details, but here are
some superficial comments:

+   method is also optional and is used diring fast GiST build.

-> during

+    /* esteblish order between x and y */

-> establish

+/* Compute Z-oder for point */
 static inline uint64
 point_zorder_internal(Point *p)

-> order

Could this function please have a comment that explains why it works?
I mean, just a breadcrumb... the name of the technique or something...
so that uninitiated hackers can google their way to a clue (is it
"Morton encoding"?)

MSVC says:

src/backend/access/gist/gistproc.c(1582): error C2065: 'INT32_MAX' :
undeclared identifier

GCC says:

gistbuild.c: In function ‘gist_indexsortbuild’:
gistbuild.c:256:4: error: ISO C90 forbids mixed declarations and code
[-Werror=declaration-after-statement]
    IndexTuple *itvec = gistextractpage(lower_page, &vect_len);
    ^

Re: Yet another fast GiST build

От

Thomas Munro

Дата:

20 февраля 2020 г., 00:14:02

On Wed, Feb 19, 2020 at 8:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Could this function please have a comment that explains why it works?
> I mean, just a breadcrumb... the name of the technique or something...
> so that uninitiated hackers can google their way to a clue (is it
> "Morton encoding"?)

Ok I think I get it now after doing some homework.

1.  We expect floats to be in IEEE format, and the sort order of IEEE
floats is mostly correlated to the binary sort order of the bits
reinterpreted as an int.  It isn't in some special cases, but for this
use case we don't really care about that, we're just trying to
encourage locality.
2.  We generate a Morton code that interleaves the bits of N integers
to produce a single integer that preserves locality: things that were
close in the N dimensional space are close in the resulting integer.

Cool.

+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+    /* esteblish order between x and y */
+
+    return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}

This example code from the documentation looks wrong, probably missing
eg int64 z1 = DatumGetInt64(x).

Re: Yet another fast GiST build

От

Thomas Munro

Дата:

23 февраля 2020 г., 23:58:18

On Thu, Feb 20, 2020 at 10:14 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> 1.  We expect floats to be in IEEE format, and the sort order of IEEE
> floats is mostly correlated to the binary sort order of the bits
> reinterpreted as an int.  It isn't in some special cases, but for this
> use case we don't really care about that, we're just trying to
> encourage locality.

I suppose there is a big jump in integer value (whether signed or
unsigned) as you cross from positive to negative floats, and then the
sort order is reversed.  I have no idea if either of those things is a
problem worth fixing.  That made me wonder if there might also be an
endianness problem.  It seems from some quick googling that all
current architectures have integers and floats of the same endianness.
Apparently this wasn't always the case, and some ARMs have a weird
half-flipped arrangement for 64 bit floats, but not 32 bit floats as
you are using here.

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

24 февраля 2020 г., 11:50:31

Hi Thomas!

Thanks for looking into this! I’ll fix your notices asap.

> On 24 февр. 2020 г., at 01:58, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Feb 20, 2020 at 10:14 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> 1.  We expect floats to be in IEEE format, and the sort order of IEEE
>> floats is mostly correlated to the binary sort order of the bits
>> reinterpreted as an int.  It isn't in some special cases, but for this
>> use case we don't really care about that, we're just trying to
>> encourage locality.
>
> I suppose there is a big jump in integer value (whether signed or
> unsigned) as you cross from positive to negative floats, and then the
> sort order is reversed.  I have no idea if either of those things is a
> problem worth fixing.  That made me wonder if there might also be an
> endianness problem.  It seems from some quick googling that all
> current architectures have integers and floats of the same endianness.
> Apparently this wasn't always the case, and some ARMs have a weird
> half-flipped arrangement for 64 bit floats, but not 32 bit floats as
> you are using here.

Yes, this leap is a problem for point as generic data type. And I do not know
how to fix it. It can cause inefficient Index Scans when searching near (0,0) and query
window touches simultaneously all quadrants (4x slower).
But everything will be just fine when all data is in 2nd quadrant.

Actually, we do not need to add this hacky code to core: we can provide colum-wise
ordering or something similar as an example.
This feature is aimed at PostGIS and they already possess bit tricks tricks [0].
I’ve taken this union code from PostGIS.

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/postgis/postgis/blob/master/postgis/gserialized_gist_nd.c#L1150

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

29 февраля 2020 г., 15:13:20

Hi!

> On 24 февр. 2020 г., at 13:50, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi Thomas!
>
> Thanks for looking into this! I’ll fix your notices asap.

PFA v5.
Thomas, I've used your wording almost exactly with explanation how
point_zorder_internal() works. It has more explanation power than my attempts
to compose good comment.

There is one design decision that worries me most:
should we use opclass function or index option to provide this sorting information?
It is needed only during index creation, actually. And having extra i-class only for fast build
seems excessive.
I think we can provide both ways and let opclass developers decide?

Thanks!

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

Erik Rijkers

Дата:

29 февраля 2020 г., 15:20:02

On 2020-02-29 13:13, Andrey M. Borodin wrote:
> Hi!
> 
>> On 24 февр. 2020 г., at 13:50, Andrey M. Borodin 
>> <x4mmm@yandex-team.ru> wrote:
>> 
>> Hi Thomas!
>> 
>> Thanks for looking into this! I’ll fix your notices asap.
> 
> PFA v5.
> Thomas, I've used your wording almost exactly with explanation how
> point_zorder_internal() works. It has more explanation power than my 
> attempts
> to compose good comment.

Small typo alert:
In v5-0002-Implement-GiST-build-using-sort-support.patch there is:

+   method is also optional and is used diring fast GiST build.

'diring' should be 'during'

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

29 февраля 2020 г., 20:24:38

> On 29 февр. 2020 г., at 17:20, Erik Rijkers <er@xs4all.nl> wrote:
>
> Small typo alert:
> In v5-0002-Implement-GiST-build-using-sort-support.patch there is:
>
> +   method is also optional and is used diring fast GiST build.
>
> 'diring' should be 'during'
>

Thanks!

I've fixed this and added patch with GiST reloption to provide sort support function.

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build

От

Darafei "Komяpa" Praliaskouski

Дата:

01 марта 2020 г., 13:09:22

Hello,

Thanks for the patch and working on GiST infrastructure, it's really
valuable for PostGIS use cases and I wait to see this improvement in
PG13.

On Sat, Feb 29, 2020 at 3:13 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

> Thomas, I've used your wording almost exactly with explanation how
> point_zorder_internal() works. It has more explanation power than my attempts
> to compose good comment.

PostGIS uses this trick to ensure locality. In PostGIS 3 we enhanced
that trick to have the Hilbert curve instead of Z Order curve.

For visual representation have a look at these links:
 - http://blog.cleverelephant.ca/2019/08/postgis-3-sorting.html - as
it's implemented in PostGIS btree sorting opclass
 - https://observablehq.com/@mourner/hilbert-curve-packing - to
explore general approach.

Indeed if it feels insecure to work with bit magic that implementation
can be left out to extensions.

> There is one design decision that worries me most:
> should we use opclass function or index option to provide this sorting information?
> It is needed only during index creation, actually. And having extra i-class only for fast build
> seems excessive.
> I think we can provide both ways and let opclass developers decide?

Reloption variant looks dirty. It won't cover an index on (id uuid,
geom geometry) where id is duplicated (say, tracked car identifier)
but present in every query - no way to pass such thing as reloption.
I'm also concerned about security of passing a sortsupport function
manually during index creation (what if that's not from the same
extension or even (wrong-)user defined something).

We know for sure it's a good idea for all btree_gist types and
geometry and I don't see a case where user would want to disable it.
Just make it part of operator class, and that would also allow fast
creation of multi-column index.

Thanks.

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa

Re: Yet another fast GiST build

От

Yuri Astrakhan

Дата:

03 апреля 2020 г., 06:17:40

Awesome addition!  Would it make sense to use x86's BMI2's PDEP instruction, or is the interleave computation too small
ofa percentage to introduce not-so-easy-to-port code?  Also, I think it needs a bit more documentation to explain the
logic,i.e. a link to https://stackoverflow.com/questions/39490345/interleave-bits-efficiently ?  Thx for making it
faster:)

Re: Yet another fast GiST build

От

Darafei "Komяpa" Praliaskouski

Дата:

05 апреля 2020 г., 12:48:33

Hello Yuri,

PDEP is indeed first thing that comes up when you start googling
z-curve and bit interleaving :)
We had the code with z-curve generating PDEP instruction in PostGIS,
and dropped it since. In sorting, we now utilize sort support / prefix
search, and key generated as Hilbert curve, with fine tuning it for
different projections' geometric properties.

From this patch the most valuable thing for us is the sorting build
infrastructure itself. Maybe to get it a bit more understandable for
people not deep in geometry it makes sense to first expose the
btree_gist datatypes to this thing? So that btree_gist index on
integer will be built exactly the same way the btree index on integer
is built. This will also get everyone a reference point on the
bottlenecks and optimality of patch.

On Fri, Apr 3, 2020 at 10:56 AM Yuri Astrakhan <yuriastrakhan@gmail.com> wrote:
>
> Awesome addition!  Would it make sense to use x86's BMI2's PDEP instruction, or is the interleave computation too
smallof a percentage to introduce not-so-easy-to-port code?  Also, I think it needs a bit more documentation to explain
thelogic, i.e. a link to https://stackoverflow.com/questions/39490345/interleave-bits-efficiently ?  Thx for making it
faster:) 

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa

Re: Yet another fast GiST build (typo)

От

Daniel Gustafsson

Дата:

01 июля 2020 г., 15:05:22

> On 29 Feb 2020, at 18:24, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

> I've fixed this and added patch with GiST reloption to provide sort support function.

0002 in this patchset fails to apply, please submit a rebased version.  I've
marked the entry Waiting for Author in the meantime.

cheers ./daniel

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

06 июля 2020 г., 15:55:44


> 1 июля 2020 г., в 17:05, Daniel Gustafsson <daniel@yesql.se> написал(а):
>
>> On 29 Feb 2020, at 18:24, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>> I've fixed this and added patch with GiST reloption to provide sort support function.
>
> 0002 in this patchset fails to apply, please submit a rebased version.  I've
> marked the entry Waiting for Author in the meantime.

Thanks, Daniel!

PFA v7.

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

07 июля 2020 г., 10:03:20


> 6 июля 2020 г., в 17:55, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>
>
>
>> 1 июля 2020 г., в 17:05, Daniel Gustafsson <daniel@yesql.se> написал(а):
>>
>>> On 29 Feb 2020, at 18:24, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>
>>> I've fixed this and added patch with GiST reloption to provide sort support function.
>>
>> 0002 in this patchset fails to apply, please submit a rebased version.  I've
>> marked the entry Waiting for Author in the meantime.
>
> Thanks, Daniel!
>
> PFA v7.

Oops. I've mismerged docs and did not notice this with check world. PFA v8 with fixed docs.

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

Thomas Munro

Дата:

10 июля 2020 г., 08:53:50

On Tue, Jul 7, 2020 at 7:03 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> Oops. I've mismerged docs and did not notice this with check world. PFA v8 with fixed docs.

It looks like point_zorder_internal() has the check for NaN in the wrong place.

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

10 июля 2020 г., 09:55:47


> 10 июля 2020 г., в 10:53, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> On Tue, Jul 7, 2020 at 7:03 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>> Oops. I've mismerged docs and did not notice this with check world. PFA v8 with fixed docs.
>
> It looks like point_zorder_internal() has the check for NaN in the wrong place.
Thanks! Fixed.

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

Thomas Munro

Дата:

30 июля 2020 г., 04:26:42

On Fri, Jul 10, 2020 at 6:55 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> Thanks! Fixed.

It's not a bug, but I think those 64 bit constants should be wrapped
in UINT64CONST(), following our convention.

I'm confused about these two patches: 0001 introduces
gist_point_fastcmp(), but then 0002 changes it to gist_bbox_fastcmp().
Maybe you intended to keep both of them?  Also 0002 seems to have
fixups for 0001 squashed into it.

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

04 августа 2020 г., 20:28:22


> 30 июля 2020 г., в 06:26, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> On Fri, Jul 10, 2020 at 6:55 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>> Thanks! Fixed.
>
> It's not a bug, but I think those 64 bit constants should be wrapped
> in UINT64CONST(), following our convention.
Thanks, fixed!

> I'm confused about these two patches: 0001 introduces
> gist_point_fastcmp(), but then 0002 changes it to gist_bbox_fastcmp().
> Maybe you intended to keep both of them?  Also 0002 seems to have
> fixups for 0001 squashed into it.
Indeed, that were fixups: point converted to GiST representation is a bbox already, and the function expects only
bboxes.

Also I've fixed some mismerges in documentation.

Thanks!

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

Pavel Borisov

Дата:

14 августа 2020 г., 12:21:32

I see this feature quite useful in concept and decided to test it.

On a real database of 7 million rows I observed speedup of 4 times in case of single column index on points only and 2.5 times speedup in case of index on points with several included columns. Standard deviation between in series of measurements being of 10%. Index size saving was of 1.7-1.5 times respectively. Points were in all four quadrants.

On random points same as query in the original message it was observer 3 times speedup with the patch. Then I generated same points set but so they will get into one quadrant didn't observe a difference with the previous case. So probably anomaly in Morton curve not so big to induce noticeable slowdown in a whole random set. But as the ordering is done only for index and not used outside index it seems to me possible to introduce shifting floating point coordinates respective to leftmost-bottom corner point and thus make all of them positive to avoid anomaly of Morton curve near quadrants transitions.

Of course speed measurements depend on machine and configuration a lot, but I am sure anyway there is a noticeable difference in index build time and this is quite valuable for end-user who build GiSt index on point type of data. Furthermore same speedup is also for REBUILD INDEX CONCURRENTLY. There short rebuild time also mean fewer user modified table rows during rebuild which should be integrated in a newer index after rebuild.

This patch can be also seen as a step to futher introduce the other ordering algoritms e.g. Gilbert curve and I consider this feature is useful and is worth to be committed.

Both patches 0001 and 0002 when applied on version 14dev compile and work cleanly. Regression tests are passed.

Code seems clean and legible for me.

In declaration I see little bit different style in similar argument pointers/arrays:

extern IndexTuple gistFormTuple(GISTSTATE *giststate, Relation r, Datum *attdata, bool *isnull, bool isleaf);

extern IndexTuple gistCompressValusAndFormTuple(GISTSTATE *giststate, Relation r, Datum attdata[], bool isnull[], bool isleaf, Datum compatt[]);

I suppose this is because gistFormTuple previously had different style in declaration and definition. Maybe it would be nice to change them to one style in all code, I propose pointers instead of [].

In a big comment

/*
+ * In this function we need to compute Morton codes for non-integral
+ * components p->x and p->y. But Morton codes are defined only for
+ * integral values.

i don't quite caught meaning of "non-integral" and "integral" and propose to replace it to "float" and "integers".

Also there are some extra spaces before line

"prev_level_start = level_start;"

and after

"The argument is a pointer to a <structname>SortSupport</structname> struct."

Overall I see the patch useful and almost ready for commit.

вт, 4 авг. 2020 г. в 21:28, Andrey M. Borodin <x4mmm@yandex-team.ru>:

> 30 июля 2020 г., в 06:26, Thomas Munro <thomas.munro@gmail.com> написал(а):
>
> On Fri, Jul 10, 2020 at 6:55 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>> Thanks! Fixed.
>
> It's not a bug, but I think those 64 bit constants should be wrapped
> in UINT64CONST(), following our convention.
Thanks, fixed!

> I'm confused about these two patches: 0001 introduces
> gist_point_fastcmp(), but then 0002 changes it to gist_bbox_fastcmp().
> Maybe you intended to keep both of them? Also 0002 seems to have
> fixups for 0001 squashed into it.
Indeed, that were fixups: point converted to GiST representation is a bbox already, and the function expects only bboxes.

Also I've fixed some mismerges in documentation.

Thanks!

Best regards, Andrey Borodin.

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Yet another fast GiST build

От

Pavel Borisov

Дата:

18 августа 2020 г., 18:57:33

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I consider this patch almost ready for commit with minor corrections (see previous message)

The new status of this patch is: Waiting on Author

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

23 августа 2020 г., 12:39:38


> 14 авг. 2020 г., в 14:21, Pavel Borisov <pashkin.elfe@gmail.com> написал(а):
>
> I see this feature quite useful in concept and decided to test it.

Thanks for reviewing and benchmarking, Pavel!

I agree with your suggestions. PFA patch with relevant changes.

Thanks a lot.

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

30 августа 2020 г., 15:04:46


> 23 авг. 2020 г., в 14:39, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>
> Thanks for reviewing and benchmarking, Pavel!

Pavel sent me few typos offlist. PFA v12 fixing these typos.
Thanks!

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

Pavel Borisov

Дата:

31 августа 2020 г., 15:02:22

Pavel sent me few typos offlist. PFA v12 fixing these typos.
Thanks!

Now I consider the patch ready to be committed and mark it so on CF.

Thank you!

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Yet another fast GiST build (typo)

От

Heikki Linnakangas

Дата:

03 сентября 2020 г., 21:40:07

On 30/08/2020 15:04, Andrey M. Borodin wrote:
>> 23 авг. 2020 г., в 14:39, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>>
>> Thanks for reviewing and benchmarking, Pavel!
> 
> Pavel sent me few typos offlist. PFA v12 fixing these typos.

In gist_indexsortbuild(), you first build all the leaf pages. Then, you 
read through all the index pages you just built, to form the tuples for 
the next level, and repeat for all the upper levels. That seems 
inefficient, it would be more better to form the tuples for the 
downlinks as you go, when you build the leaf pages in the first place. 
That's how nbtsort.c works. Also, you could WAL-log the pages as you go.

In gist_indexsortbuild_flush(), can't you just memcpy() the page from
memory to the buffer?

- Heikki

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

05 сентября 2020 г., 14:53:31


> 3 сент. 2020 г., в 23:40, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 30/08/2020 15:04, Andrey M. Borodin wrote:
>>> 23 авг. 2020 г., в 14:39, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>>>
>>> Thanks for reviewing and benchmarking, Pavel!
>> Pavel sent me few typos offlist. PFA v12 fixing these typos.
>
> In gist_indexsortbuild(), you first build all the leaf pages. Then, you read through all the index pages you just
built,to form the tuples for the next level, and repeat for all the upper levels. That seems inefficient, it would be
morebetter to form the tuples for the downlinks as you go, when you build the leaf pages in the first place. That's how
nbtsort.cworks. Also, you could WAL-log the pages as you go. 
>
> In gist_indexsortbuild_flush(), can't you just memcpy() the page from
> memory to the buffer?
>
> - Heikki
Thanks for ideas, Heikki. Please see v13 with proposed changes.
But I've found out that logging page-by-page slows down GiST build by approximately 15% (when CPU constrained).
Though In think that this is IO-wise.

Thanks!

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

Heikki Linnakangas

Дата:

06 сентября 2020 г., 16:26:08

On 05/09/2020 14:53, Andrey M. Borodin wrote:
> Thanks for ideas, Heikki. Please see v13 with proposed changes.

Thanks, that was quick!

> But I've found out that logging page-by-page slows down GiST build by
> approximately 15% (when CPU constrained). Though In think that this
> is IO-wise.
Hmm, any ideas why that is? log_newpage_range() writes one WAL record 
for 32 pages, while now you're writing one record per page, so you'll 
have a little bit more overhead from that. But 15% seems like a lot.

- Heikki

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

06 сентября 2020 г., 21:33:25


> 6 сент. 2020 г., в 18:26, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 05/09/2020 14:53, Andrey M. Borodin wrote:
>> Thanks for ideas, Heikki. Please see v13 with proposed changes.
>
> Thanks, that was quick!
>
>> But I've found out that logging page-by-page slows down GiST build by
>> approximately 15% (when CPU constrained). Though In think that this
>> is IO-wise.
> Hmm, any ideas why that is? log_newpage_range() writes one WAL record for 32 pages, while now you're writing one
recordper page, so you'll have a little bit more overhead from that. But 15% seems like a lot. 
I do not know. I guess this can be some effect of pglz compression during cold stage. It can be slower and less
compressivethan pglz with cache table? But this is pointing into the sky. 
Nevertheless, here's the patch identical to v13, but with 3rd part: log flushed pages with bunches of 32.
This brings CPU performance back and slightly better than before page-by-page logging.

Some details about test:
MacOS, 6-core i7
psql -c '\timing' -c "create table x as select point (random(),random()) from generate_series(1,10000000,1);" -c
"createindex on x using gist (point);" 

With patch v13 this takes 20,567 seconds, with v14 18,149 seconds, v12 ~18,3s (which is closer to 10% btw, sorry for
miscomputation).This was not statistically significant testing, just a quick laptop benchmark with 2-3 tests to verify
stability.

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

07 сентября 2020 г., 10:14:35

> 6 сент. 2020 г., в 18:26, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 05/09/2020 14:53, Andrey M. Borodin wrote:
>> Thanks for ideas, Heikki. Please see v13 with proposed changes.
>
> Thanks, that was quick!
>
>> But I've found out that logging page-by-page slows down GiST build by
>> approximately 15% (when CPU constrained). Though In think that this
>> is IO-wise.
> Hmm, any ideas why that is? log_newpage_range() writes one WAL record for 32 pages, while now you're writing one
recordper page, so you'll have a little bit more overhead from that. But 15% seems like a lot. 

Hmm, this works for B-tree too.
this index creation
psql -c '\timing' -c "create table x as select random() from generate_series(1,10000000,1);" -c "create index ON x
(random);" 
takes 7 seconds on may machine, but with one weird patch it takes only 6 :)

Maybe I'm missing something? Like forgot to log 10% of pages, or something like that...

Best regards, Andrey Borodin.

Вложения

0001-nbtree-faster-logging.patch

Re: Yet another fast GiST build (typo)

От

"Andrey M. Borodin"

Дата:

07 сентября 2020 г., 11:36:46


> 7 сент. 2020 г., в 12:14, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>
> Maybe I'm missing something? Like forgot to log 10% of pages, or something like that...

Indeed, there was a bug. I've fixed it, and I still observe same performance gain.

Best regards, Andrey Borodin.

Вложения

v2-0001-nbtree-faster-logging.patch

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 сентября 2020 г., 12:47:16

On 24/02/2020 10:50, Andrey M. Borodin wrote:
>> On 24 февр. 2020 г., at 01:58, Thomas Munro <thomas.munro@gmail.com> wrote:
>>
>> On Thu, Feb 20, 2020 at 10:14 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> 1.  We expect floats to be in IEEE format, and the sort order of IEEE
>>> floats is mostly correlated to the binary sort order of the bits
>>> reinterpreted as an int.  It isn't in some special cases, but for this
>>> use case we don't really care about that, we're just trying to
>>> encourage locality.
>>
>> I suppose there is a big jump in integer value (whether signed or
>> unsigned) as you cross from positive to negative floats, and then the
>> sort order is reversed.  I have no idea if either of those things is a
>> problem worth fixing.  That made me wonder if there might also be an
>> endianness problem.  It seems from some quick googling that all
>> current architectures have integers and floats of the same endianness.
>> Apparently this wasn't always the case, and some ARMs have a weird
>> half-flipped arrangement for 64 bit floats, but not 32 bit floats as
>> you are using here.
> 
> Yes, this leap is a problem for point as generic data type. And I do not know
> how to fix it. It can cause inefficient Index Scans when searching near (0,0) and query
> window touches simultaneously all quadrants (4x slower).

I took a stab at fixing this, see attached patch (applies on top of your 
patch v14).

To evaluate this, I used the other attached patch to expose the zorder 
function to SQL, and plotted points around zero with gnuplot. See the 
attached two images, one with patch v14, and the other one with this patch.

I'll continue looking at these patches in whole tomorrow. I think it's 
getting close to a committable state.

> But everything will be just fine when all data is in 2nd quadrant.

Simon Riggs and friends would agree :-)

- Heikki

Вложения

Re: Yet another fast GiST build

От

Pavel Borisov

Дата:

07 сентября 2020 г., 13:59:30

>> I suppose there is a big jump in integer value (whether signed or
>> unsigned) as you cross from positive to negative floats, and then the
>> sort order is reversed. I have no idea if either of those things is a
>> problem worth fixing. That made me wonder if there might also be an

I took a stab at fixing this, see attached patch (applies on top of your
patch v14).

To evaluate this, I used the other attached patch to expose the zorder
function to SQL, and plotted points around zero with gnuplot. See the
attached two images, one with patch v14, and the other one with this patch.

I'd made testing of sorted SpGist build in cases of points distributed only in 2d quadrant and points in all 4 quadrants and it appears that this abnormality doesn't affect as much as Andrey supposed. But Heikki's patch is really nice way to avoid what can be avoided and I'd like it is included together with Andrey's patch.

Pavel.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 сентября 2020 г., 17:10:08

On 07/09/2020 13:59, Pavel Borisov wrote:
>>>> I suppose there is a big jump in integer value (whether signed or
>>>> unsigned) as you cross from positive to negative floats, and then the
>>>> sort order is reversed.  I have no idea if either of those things is a
>>>> problem worth fixing.  That made me wonder if there might also be an
>>
>> I took a stab at fixing this, see attached patch (applies on top of your
>> patch v14).
>>
>> To evaluate this, I used the other attached patch to expose the zorder
>> function to SQL, and plotted points around zero with gnuplot. See the
>> attached two images, one with patch v14, and the other one with this patch.
> 
> I'd made testing of sorted SpGist build in cases of points distributed only
> in 2d quadrant and points in all 4 quadrants and it appears that this
> abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
> is really nice way to avoid what can be avoided and I'd like it is included
> together with Andrey's patch.

Thanks! Did you measure the quality of the built index somehow? The 
ordering shouldn't make any difference to the build speed, but it 
affects the shape of the resulting index and the speed of queries 
against it.

I played with some simple queries like this:

explain (analyze, buffers) select count(*) from points_good where p <@ 
box(point(50, 50), point(75, 75));

and looking at the "Buffers" line for how many pages were accessed. 
There doesn't seem to be any consistent difference between v14 and my 
fix. So I concur it doesn't seem to matter much.

I played some more with plotting the curve. I wrote a little python 
program to make an animation of it, and also simulated how the points 
would be divided into pages, assuming that each GiST page can hold 200 
tuples (I think the real number is around 150 with default page size). 
In the animation, the leaf pages appear as rectangles as it walks 
through the Z-order curve. This is just a simulation by splitting all 
the points into batches of 200 and drawing a bounding box around each 
batch. I haven't checked the actual pages as the GiST creates, but I 
think this gives a good idea of how it works.

The animation shows that there's quite a lot of overlap between the 
pages. It's not necessarily this patch's business to try to improve 
that, and the non-sorting index build isn't perfect either. But it 
occurs to me that there's maybe one pretty simple trick we could do: 
instead of blindly filling the leaf pages in Z-order, collect tuples 
into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples, 
or something in that ballpark, or maybe go all the way up to work_mem. 
When the buffer fills up, call the picksplit code to divide the buffer 
into the actual pages, and flush them to disk. If you look at the 
animation and imagine that you would take a handful of pages in the 
order they're created, and re-divide the points with the split 
algorithm, there would be much less overlap.

- Heikki

Вложения

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

07 сентября 2020 г., 18:24:18


> 7 сент. 2020 г., в 19:10, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 07/09/2020 13:59, Pavel Borisov wrote:
>>>>> I suppose there is a big jump in integer value (whether signed or
>>>>> unsigned) as you cross from positive to negative floats, and then the
>>>>> sort order is reversed.  I have no idea if either of those things is a
>>>>> problem worth fixing.  That made me wonder if there might also be an
>>>
>>> I took a stab at fixing this, see attached patch (applies on top of your
>>> patch v14).
>>>
>>> To evaluate this, I used the other attached patch to expose the zorder
>>> function to SQL, and plotted points around zero with gnuplot. See the
>>> attached two images, one with patch v14, and the other one with this patch.
>>
>> I'd made testing of sorted SpGist build in cases of points distributed only
>> in 2d quadrant and points in all 4 quadrants and it appears that this
>> abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
>> is really nice way to avoid what can be avoided and I'd like it is included
>> together with Andrey's patch.
>
> Thanks! Did you measure the quality of the built index somehow? The
> ordering shouldn't make any difference to the build speed, but it
> affects the shape of the resulting index and the speed of queries
> against it.
I've tried to benchmark the difference between build time v14 and v15. v15 seems to be slightly slower, but with
negligibledifference. 

> I played with some simple queries like this:
>
> explain (analyze, buffers) select count(*) from points_good where p <@
> box(point(50, 50), point(75, 75));
To observe IndexScan difference query should touch 4 quadrants. i.e. search within ((-25,-25),point(25,25))

> and looking at the "Buffers" line for how many pages were accessed.
> There doesn't seem to be any consistent difference between v14 and my
> fix. So I concur it doesn't seem to matter much.
>
>
> I played some more with plotting the curve. I wrote a little python
> program to make an animation of it, and also simulated how the points
> would be divided into pages, assuming that each GiST page can hold 200
> tuples (I think the real number is around 150 with default page size).
> In the animation, the leaf pages appear as rectangles as it walks
> through the Z-order curve. This is just a simulation by splitting all
> the points into batches of 200 and drawing a bounding box around each
> batch. I haven't checked the actual pages as the GiST creates, but I
> think this gives a good idea of how it works.
> The animation shows that there's quite a lot of overlap between the
> pages. It's not necessarily this patch's business to try to improve
> that, and the non-sorting index build isn't perfect either. But it
> occurs to me that there's maybe one pretty simple trick we could do:
> instead of blindly filling the leaf pages in Z-order, collect tuples
> into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples,
> or something in that ballpark, or maybe go all the way up to work_mem.
> When the buffer fills up, call the picksplit code to divide the buffer
> into the actual pages, and flush them to disk. If you look at the
> animation and imagine that you would take a handful of pages in the
> order they're created, and re-divide the points with the split
> algorithm, there would be much less overlap.

Animation looks cool! It really pins the inefficiency of resulting MBRs.
But in R*-tree one of Beckman's points was that overlap optimisation worth doing on higher levels, not lower.
But we can do this for splits on each level, I think. We do not know tree depth in advance to divide maintenance
workmemamong level.. But, probably we don't need to, let's allocate half to first level, quarter to second, 1/8 to
thirdetc until it's one page. Should we take allocations inside picksplit() into account? 
The more I think about it the cooler idea seem to me.

BTW I've found one more bug in the patch: it writes WAL even for unlogged tables. I'm not sending a patch because
changesare trivial and currently we already have lengthy patchset in different messages. 
Also, to avoid critical section we can use log_new_page() instead of log_buffer().


Thanks!

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Pavel Borisov

Дата:

08 сентября 2020 г., 21:33:59

> Thanks! Did you measure the quality of the built index somehow? The
> ordering shouldn't make any difference to the build speed, but it
> affects the shape of the resulting index and the speed of queries
> against it.

Again I've tried random select tests near axes and haven't noticed any performance difference between ordinary gist build and z-ordered one. The same is for selects far from axes. Theoretically, there may be a possible slowdown for particular points inside the MBR which crosses the axis but I haven't tried to dig so deep and haven't tested performance as a function of coordinate.

So I feel this patch is not about select speed optimization.

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

08 сентября 2020 г., 22:05:37

On 08/09/2020 21:33, Pavel Borisov wrote:
>      > Thanks! Did you measure the quality of the built index somehow? The
>      > ordering shouldn't make any difference to the build speed, but it
>      > affects the shape of the resulting index and the speed of queries
>      > against it.
> 
> Again I've tried random select tests near axes and haven't noticed any 
> performance difference between ordinary gist build and z-ordered one. 
> The same is for selects far from axes. Theoretically, there may be a 
> possible slowdown for particular points inside the MBR which crosses the 
> axis but I haven't tried to dig so deep and haven't tested performance 
> as a function of coordinate.
> 
> So I feel this patch is not about select speed optimization.

Ok, thank for confirming.

I've been reviewing the patch today. The biggest changes I've made have 
been in restructuring the code in gistbuild.c for readability, but there 
are a bunch of smaller changes throughout. Attached is what I've got so 
far, squashed into one patch. I'm continuing to review it, but a couple 
of questions so far:

In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive 
== false'. That seems fishy, surely we need to index recently-dead 
tuples, too. The normal index build path isn't skipping them either.

How does the 'sortsupport' routine interact with 
'compress'/'decompress'? Which representation is passed to the 
comparator routine: the original value from the table, the compressed 
representation, or the decompressed representation? Do the 
comparetup_index_btree() and readtup_index() routines agree with that?

- Heikki

Вложения

v16-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

09 сентября 2020 г., 09:43:40


> 9 сент. 2020 г., в 00:05, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> I've been reviewing the patch today. The biggest changes I've made have been in restructuring the code in gistbuild.c
forreadability, but there are a bunch of smaller changes throughout. Attached is what I've got so far, squashed into
onepatch. 
Thanks!

> I'm continuing to review it, but a couple of questions so far:
>
> In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive == false'. That seems fishy, surely we need to
indexrecently-dead tuples, too. The normal index build path isn't skipping them either. 
That's an oversight.
>
> How does the 'sortsupport' routine interact with 'compress'/'decompress'? Which representation is passed to the
comparatorroutine: the original value from the table, the compressed representation, or the decompressed
representation?Do the comparetup_index_btree() and readtup_index() routines agree with that? 

Currently we pass compressed values, which seems not very good.
But there was a request from PostGIS maintainers to pass values before decompression.
Darafei, please, correct me if I'm wrong. Also can you please provide link on PostGIS B-tree sorting functions?

Thanks!

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Darafei "Komяpa" Praliaskouski

Дата:

09 сентября 2020 г., 10:05:04

Hi,

On Wed, Sep 9, 2020 at 9:43 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

> 9 сент. 2020 г., в 00:05, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> I've been reviewing the patch today. The biggest changes I've made have been in restructuring the code in gistbuild.c for readability, but there are a bunch of smaller changes throughout. Attached is what I've got so far, squashed into one patch.
Thanks!

> I'm continuing to review it, but a couple of questions so far:
>
> In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive == false'. That seems fishy, surely we need to index recently-dead tuples, too. The normal index build path isn't skipping them either.
That's an oversight.
>
> How does the 'sortsupport' routine interact with 'compress'/'decompress'? Which representation is passed to the comparator routine: the original value from the table, the compressed representation, or the decompressed representation? Do the comparetup_index_btree() and readtup_index() routines agree with that?

Currently we pass compressed values, which seems not very good.
But there was a request from PostGIS maintainers to pass values before decompression.
Darafei, please, correct me if I'm wrong. Also can you please provide link on PostGIS B-tree sorting functions?

We were expecting to reuse btree opclass for this thing. This way btree_gist extension will become a lot thinner. :)

Core routine for current sorting implementation is Hilbert curve, which is based on 2D center of a box - and used for abbreviated sort:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gbox.c#L893

All the btree functions are wrappers around gserialized_cmp which just adds a bunch of tiebreakers that don't matter in practice:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gserialized.c#L313

Base representation for index compressed datatype is GIDX, which is also a box. We can make it work on top of it instead of the original representation.
There is no such thing as "decompressed representation" unfortunately as compression is lossy.

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

09 сентября 2020 г., 13:28:43

Thanks Darafei!

> 9 сент. 2020 г., в 12:05, Darafei Komяpa Praliaskouski <me@komzpa.net> написал(а):
>
> > How does the 'sortsupport' routine interact with 'compress'/'decompress'? Which representation is passed to the
comparatorroutine: the original value from the table, the compressed representation, or the decompressed
representation?Do the comparetup_index_btree() and readtup_index() routines agree with that? 
>
> Currently we pass compressed values, which seems not very good.
> But there was a request from PostGIS maintainers to pass values before decompression.
> Darafei, please, correct me if I'm wrong. Also can you please provide link on PostGIS B-tree sorting functions?
>
> We were expecting to reuse btree opclass for this thing. This way btree_gist extension will become a lot thinner. :)

I think if we aim at reusing B-tree sort support functions we have to pass uncompressed values. They can be a lot
biggerand slower in case of PostGIS. We will be sorting actual geometries instead of MBRs. 

In my view it's better to implement GiST-specific sort support in btree_gist, rather than trying to reuse existing sort
supports.

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

09 сентября 2020 г., 15:09:16

On 09/09/2020 13:28, Andrey M. Borodin wrote:
> Thanks Darafei!
> 
>> 9 сент. 2020 г., в 12:05, Darafei Komяpa Praliaskouski
>> <me@komzpa.net> написал(а):
>> 
>>> How does the 'sortsupport' routine interact with
>>> 'compress'/'decompress'? Which representation is passed to the
>>> comparator routine: the original value from the table, the
>>> compressed representation, or the decompressed representation? Do
>>> the comparetup_index_btree() and readtup_index() routines agree
>>> with that?
>> 
>> Currently we pass compressed values, which seems not very good. But
>> there was a request from PostGIS maintainers to pass values before
>> decompression. Darafei, please, correct me if I'm wrong. Also can
>> you please provide link on PostGIS B-tree sorting functions?
>> 
>> We were expecting to reuse btree opclass for this thing. This way
>> btree_gist extension will become a lot thinner. :)
> 
> I think if we aim at reusing B-tree sort support functions we have to
> pass uncompressed values. They can be a lot bigger and slower in case
> of PostGIS. We will be sorting actual geometries instead of MBRs.
> 
> In my view it's better to implement GiST-specific sort support in
> btree_gist, rather than trying to reuse existing sort supports.

Yeah, I don't think reusing existing sortsupport functions directly is 
important. The comparison function should be short anyway for 
performance reasons, so it won't be a lot of code to copy-paste. And if 
there are some common subroutines, you can put them in a separate 
internal functions for reuse.

Using the 'compressed' format seems reasonable to me. It's natural to 
the gistbuild.c code, and the comparison routine can 'decompress' itself 
if it wishes. If the decompressions is somewhat expensive, it's 
unfortunate if you need to do it repeatedly in the comparator, but 
tuplesort.c would need pretty big changes to keep around a separate 
in-memory representation compare. However, you could use the sort 
"abbreviation" functionality to mitigate that.

Come to think of it, the point z-order comparator could benefit a lot 
from key abbreviation, too. You could do the point -> zorder conversion 
in the abbreviation routine.

- Heikki

Re: Yet another fast GiST build

От

Darafei "Komяpa" Praliaskouski

Дата:

09 сентября 2020 г., 15:20:43

On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 09/09/2020 13:28, Andrey M. Borodin wrote:
> Thanks Darafei!
>
>> 9 сент. 2020 г., в 12:05, Darafei Komяpa Praliaskouski
>> <me@komzpa.net> написал(а):
>>
>>> How does the 'sortsupport' routine interact with
>>> 'compress'/'decompress'? Which representation is passed to the
>>> comparator routine: the original value from the table, the
>>> compressed representation, or the decompressed representation? Do
>>> the comparetup_index_btree() and readtup_index() routines agree
>>> with that?
>>
....

Come to think of it, the point z-order comparator could benefit a lot
from key abbreviation, too. You could do the point -> zorder conversion
in the abbreviation routine.

That's how it works in PostGIS, only that we moved to more effecient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

09 сентября 2020 г., 18:39:21

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:
> On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> 
>> Come to think of it, the point z-order comparator could benefit a lot
>> from key abbreviation, too. You could do the point -> zorder conversion
>> in the abbreviation routine.
> 
> That's how it works in PostGIS, only that we moved to more
> effecient Hilbert curve:
> https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171

Thanks, that's interesting.

I implemented the abbreviated keys for the point opclass, too, and 
noticed that the patch as it was never used it. I reworked the patch so 
that tuplesort_begin_index_gist() is responsible for looking up the 
sortsupport function, like tuplesort_begin_index_btree() does, and uses 
abbreviation when possible.

I think this is pretty much ready for commit now. I'll do a bit more 
testing (do we have regression test coverage for this?), also on a 
SIZEOF_DATUM==4 system since the abbreviation works differently with 
that, and push if nothing new comes up. And clarify the documentation 
and/or comments that the sortsupport function sees "compressed" values.

I wonder if we could use sorting to also speed up building tsvector 
indexes? The values stored there are bit signatures, what would be a 
good sort order for those?

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

09 сентября 2020 г., 18:39:41

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:
> On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> 
>> Come to think of it, the point z-order comparator could benefit a lot
>> from key abbreviation, too. You could do the point -> zorder conversion
>> in the abbreviation routine.
> 
> That's how it works in PostGIS, only that we moved to more
> effecient Hilbert curve:
> https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171

Thanks, that's interesting.

I implemented the abbreviated keys for the point opclass, too, and 
noticed that the patch as it was never used it. I reworked the patch so 
that tuplesort_begin_index_gist() is responsible for looking up the 
sortsupport function, like tuplesort_begin_index_btree() does, and uses 
abbreviation when possible.

I think this is pretty much ready for commit now. I'll do a bit more 
testing (do we have regression test coverage for this?), also on a 
SIZEOF_DATUM==4 system since the abbreviation works differently with 
that, and push if nothing new comes up. And clarify the documentation 
and/or comments that the sortsupport function sees "compressed" values.

I wonder if we could use sorting to also speed up building tsvector 
indexes? The values stored there are bit signatures, what would be a 
good sort order for those?

- Heikki

Вложения

v17-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

09 сентября 2020 г., 19:50:44


> 9 сент. 2020 г., в 20:39, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:
>> On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Come to think of it, the point z-order comparator could benefit a lot
>>> from key abbreviation, too. You could do the point -> zorder conversion
>>> in the abbreviation routine.
>> That's how it works in PostGIS, only that we moved to more
>> effecient Hilbert curve:
>> https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171
>
> Thanks, that's interesting.
>
> I implemented the abbreviated keys for the point opclass, too, and noticed that the patch as it was never used it. I
reworkedthe patch so that tuplesort_begin_index_gist() is responsible for looking up the sortsupport function, like
tuplesort_begin_index_btree()does, and uses abbreviation when possible. 
Wow, abbreviated sort made gist for points construction even 1.5x faster!
btw there is small typo in arg names in gist_bbox_zorder_cmp_abbrev(); z1,z2 -> a,b

> do we have regression test coverage for this?
Yes, sorting build for points is tested in point.sql, but with small dataset. index_including_gist.sql seems to be
workingwith boxes, but triggers point paths too. 

> , also on a SIZEOF_DATUM==4 system since the abbreviation works differently with that, and push if nothing new comes
up.And clarify the documentation and/or comments that the sortsupport function sees "compressed" values. 
>
> I wonder if we could use sorting to also speed up building tsvector indexes? The values stored there are bit
signatures,what would be a good sort order for those? 
We need an order so that nearby values have a lot of bits in common.
What is the length of this signature?
For each 4 bytes we can compute number of 1s in it's binary representation. Then z-order these dwords as values 0-32.

This will be very inefficient grouping, but it will tend to keep empty and dense 4-byte regions apart.

Thanks for working on this!


Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Oleg Bartunov

Дата:

10 сентября 2020 г., 13:21:09

On Mon, Sep 7, 2020 at 7:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 07/09/2020 13:59, Pavel Borisov wrote:
> >>>> I suppose there is a big jump in integer value (whether signed or
> >>>> unsigned) as you cross from positive to negative floats, and then the
> >>>> sort order is reversed.  I have no idea if either of those things is a
> >>>> problem worth fixing.  That made me wonder if there might also be an
> >>
> >> I took a stab at fixing this, see attached patch (applies on top of your
> >> patch v14).
> >>
> >> To evaluate this, I used the other attached patch to expose the zorder
> >> function to SQL, and plotted points around zero with gnuplot. See the
> >> attached two images, one with patch v14, and the other one with this patch.
> >
> > I'd made testing of sorted SpGist build in cases of points distributed only
> > in 2d quadrant and points in all 4 quadrants and it appears that this
> > abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
> > is really nice way to avoid what can be avoided and I'd like it is included
> > together with Andrey's patch.
>
> Thanks! Did you measure the quality of the built index somehow? The
> ordering shouldn't make any difference to the build speed, but it
> affects the shape of the resulting index and the speed of queries
> against it.
>
> I played with some simple queries like this:
>
> explain (analyze, buffers) select count(*) from points_good where p <@
> box(point(50, 50), point(75, 75));
>
> and looking at the "Buffers" line for how many pages were accessed.
> There doesn't seem to be any consistent difference between v14 and my
> fix. So I concur it doesn't seem to matter much.
>
>
> I played some more with plotting the curve. I wrote a little python
> program to make an animation of it, and also simulated how the points
> would be divided into pages, assuming that each GiST page can hold 200
> tuples (I think the real number is around 150 with default page size).
> In the animation, the leaf pages appear as rectangles as it walks
> through the Z-order curve. This is just a simulation by splitting all
> the points into batches of 200 and drawing a bounding box around each
> batch. I haven't checked the actual pages as the GiST creates, but I
> think this gives a good idea of how it works.

Heikki, you may use our gevel extension to visualize index tree
http://www.sai.msu.su/~megera/wiki/Gevel
I used it to investigate rtree index
http://www.sai.msu.su/~megera/wiki/Rtree_Index


>
> The animation shows that there's quite a lot of overlap between the
> pages. It's not necessarily this patch's business to try to improve
> that, and the non-sorting index build isn't perfect either. But it
> occurs to me that there's maybe one pretty simple trick we could do:
> instead of blindly filling the leaf pages in Z-order, collect tuples
> into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples,
> or something in that ballpark, or maybe go all the way up to work_mem.
> When the buffer fills up, call the picksplit code to divide the buffer
> into the actual pages, and flush them to disk. If you look at the
> animation and imagine that you would take a handful of pages in the
> order they're created, and re-divide the points with the split
> algorithm, there would be much less overlap.

Interesting to see also the size of index, it should be several times less.

>
> - Heikki



-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Yet another fast GiST build

От

Pavel Borisov

Дата:

10 сентября 2020 г., 14:19:10

Interesting to see also the size of index, it should be several times less.

I've tested this above in CF thread and got ordered GiST index ~1.7 times smaller than non-ordered one for single column of real points.

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

10 сентября 2020 г., 15:43:11

On 09/09/2020 19:50, Andrey M. Borodin wrote:
>> 9 сент. 2020 г., в 20:39, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>>
>> On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:
>>> On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> Come to think of it, the point z-order comparator could benefit a lot
>>>> from key abbreviation, too. You could do the point -> zorder conversion
>>>> in the abbreviation routine.
>>> That's how it works in PostGIS, only that we moved to more
>>> effecient Hilbert curve:
>>> https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171
>>
>> Thanks, that's interesting.
>>
>> I implemented the abbreviated keys for the point opclass, too, and noticed that the patch as it was never used it. I
reworkedthe patch so that tuplesort_begin_index_gist() is responsible for looking up the sortsupport function, like
tuplesort_begin_index_btree()does, and uses abbreviation when possible.
 
> Wow, abbreviated sort made gist for points construction even 1.5x faster!
> btw there is small typo in arg names in gist_bbox_zorder_cmp_abbrev(); z1,z2 -> a,b

One more patch version attached. I fixed some memory leaks, and fixed 
the abbreviation on 32-bit systems, and a bunch of small comment changes 
and such.

- Heikki

Вложения

v18-0001-Add-support-for-building-GiST-index-by-sorting.patch

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

11 сентября 2020 г., 09:02:06


> 10 сент. 2020 г., в 17:43, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> One more patch version attached. I fixed some memory leaks, and fixed the abbreviation on 32-bit systems, and a bunch
ofsmall comment changes and such. 
>
> - Heikki
> <v18-0001-Add-support-for-building-GiST-index-by-sorting.patch>

The patch looks fine to me. On my machine GiST for points is builded 10x faster than before the patch.

Future action items:
1. Sort support for gist_btree data types
2. Better page borders with split and fillfactor
3. Consider sort build for tsvector

I'll certainly do 1 before next CF and most probably 2.
Item 1 is basically a lot of similar code for many many different types.
In Item 2 I plan to use Oleg's gevel to evaluation possibilities of MBR overlap reduction.

Item 3 seems tricky and need deeper evaluation: chances are sort build will decrease IndexScan performance in this
case.

Thanks, Heikki!

Best regards, Andrey Borodin,

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

15 сентября 2020 г., 14:36:15

On 11/09/2020 09:02, Andrey M. Borodin wrote:
>> 10 сент. 2020 г., в 17:43, Heikki Linnakangas <hlinnaka@iki.fi>
>> написал(а):
>> 
>> One more patch version attached. I fixed some memory leaks, and
>> fixed the abbreviation on 32-bit systems, and a bunch of small
>> comment changes and such.
>> 
>> <v18-0001-Add-support-for-building-GiST-index-by-sorting.patch>
> 
> The patch looks fine to me. On my machine GiST for points is builded
> 10x faster than before the patch.

Another patch version, fixed a few small bugs pointed out by assertion 
failures in the regression tests.

- Heikki

Вложения

v19-0001-Add-support-for-building-GiST-index-by-sorting.patch

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

15 сентября 2020 г., 19:46:11


> 15 сент. 2020 г., в 16:36, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> Another patch version, fixed a few small bugs pointed out by assertion failures in the regression tests.
>
> - Heikki
> <v19-0001-Add-support-for-building-GiST-index-by-sorting.patch>

These changes in create_index.out do not seem correct to me

 SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
         f1
 -------------------
- (0,0)
  (1e-300,-1e-300)
+ (0,0)

I did not figure out the root cause yet. We do not touch anything related to distance computation..

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

15 сентября 2020 г., 20:07:38

On 15/09/2020 19:46, Andrey M. Borodin wrote:
> 
> 
>> 15 сент. 2020 г., в 16:36, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>>
>> Another patch version, fixed a few small bugs pointed out by assertion failures in the regression tests.
>>
>> - Heikki
>> <v19-0001-Add-support-for-building-GiST-index-by-sorting.patch>
> 
> These changes in create_index.out do not seem correct to me
> 
>   SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
>           f1
>   -------------------
> - (0,0)
>    (1e-300,-1e-300)
> + (0,0)
> 
> I did not figure out the root cause yet. We do not touch anything related to distance computation..

Ah yeah, that's subtle. Those rows are considered to be equally distant 
from (0, 1), given the precision of the <-> operator:

regression=#  SELECT f1, f1 <-> '0,1' FROM point_tbl ORDER BY f1 <-> '0,1';
         f1         |     ?column?
-------------------+------------------
  (0,0)             |                1
  (1e-300,-1e-300)  |                1
  (-3,4)            | 4.24264068711929
  (-10,0)           | 10.0498756211209
  (10,10)           | 13.4536240470737
  (-5,-12)          | 13.9283882771841
  (5.1,34.5)        |  33.885985303662
  (1e+300,Infinity) |         Infinity
  (NaN,NaN)         |              NaN
                    |
(10 rows)

It is arbitrary which one you get first.

It's not very nice to have a not-well defined order of rows in the 
expected output, as it could change in the future if we change the index 
build algorithm again. But we have plenty of cases that depend on the 
physical row order, and it's not like this changes very often, so I 
think it's ok to just memorize the new order in the expected output.

- Heikki

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

16 сентября 2020 г., 10:27:09


> 15 сент. 2020 г., в 22:07, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> regression=#  SELECT f1, f1 <-> '0,1' FROM point_tbl ORDER BY f1 <-> '0,1';
>        f1         |     ?column?
> -------------------+------------------
> (0,0)             |                1
> (1e-300,-1e-300)  |                1
> (-3,4)            | 4.24264068711929
> (-10,0)           | 10.0498756211209
> (10,10)           | 13.4536240470737
> (-5,-12)          | 13.9283882771841
> (5.1,34.5)        |  33.885985303662
> (1e+300,Infinity) |         Infinity
> (NaN,NaN)         |              NaN
>                   |
> (10 rows)
>
> It is arbitrary which one you get first.
>
> It's not very nice to have a not-well defined order of rows in the expected output, as it could change in the future
ifwe change the index build algorithm again. But we have plenty of cases that depend on the physical row order, and
it'snot like this changes very often, so I think it's ok to just memorize the new order in the expected output. 


I think this is valid reasoning. GiST choose subtree algorithm is not deterministic, it calls random(), but not in
testedpaths.  
I was thinking that machine epsilon is near 1e-300, but I was wrong. It's actually near 1e-15.

Actually, I just want to understand what changes between v18 and v19 changed on-page order of items. I look into patch
diffand cannot figure it out. There are only logging changes. How this affects scan? 

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

16 сентября 2020 г., 11:20:48

On 16/09/2020 10:27, Andrey M. Borodin wrote:
> Actually, I just want to understand what changes between v18 and v19
> changed on-page order of items. I look into patch diff and cannot
> figure it out. There are only logging changes. How this affects
> scan?

The test was failing with v18 too.

- Heikki

Re: Yet another fast GiST build

От

Kyotaro Horiguchi

Дата:

16 сентября 2020 г., 12:09:03

At Wed, 16 Sep 2020 12:27:09 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in 
> I was thinking that machine epsilon is near 1e-300, but I was
> wrong. It's actually near 1e-15.

FWIW, the mantissa of double is effectively 52+1 bits, about 15.9
digits. so 1+(1e-16) is basically indistincitve from
1+(2e-16). Actually two double precisions 1+2e-16 and 1+3e-16 are
indistinctive from each other.

> Actually, I just want to understand what changes between v18 and v19 changed on-page order of items. I look into
patchdiff and cannot figure it out. There are only logging changes. How this affects scan?
 

FWIW, I saw the same symptom by my another patch after adding a value
to POINT_TBL. (But I didn't pursue the cause further..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

17 сентября 2020 г., 11:38:47

On 15/09/2020 14:36, Heikki Linnakangas wrote:
> Another patch version, fixed a few small bugs pointed out by assertion
> failures in the regression tests.

Pushed. Thanks everyone!

- Heikki

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

17 сентября 2020 г., 11:43:52


> 17 сент. 2020 г., в 13:38, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 15/09/2020 14:36, Heikki Linnakangas wrote:
>> Another patch version, fixed a few small bugs pointed out by assertion
>> failures in the regression tests.
>
> Pushed. Thanks everyone!

That's wonderful! Thank you, Heikki!

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Tom Lane

Дата:

20 сентября 2020 г., 23:41:12

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Pushed. Thanks everyone!

It appears that hyrax (CLOBBER_CACHE_ALWAYS) is not very happy
with this:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-09-19%2021%3A27%3A23

We have a recent pass from prion, showing that -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE doesn't cause a problem, so maybe hyrax's
result is just random cosmic rays or something.  But I doubt it.

            regards, tom lane

Re: Yet another fast GiST build

От

Tom Lane

Дата:

21 сентября 2020 г., 00:10:05

I wrote:
> It appears that hyrax (CLOBBER_CACHE_ALWAYS) is not very happy
> with this:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-09-19%2021%3A27%3A23

I reproduced that and traced it to a missing RelationOpenSmgr call.
Fixed now.

            regards, tom lane

Re: Yet another fast GiST build

От

Tom Lane

Дата:

21 сентября 2020 г., 02:06:09

Justin Pryzby <pryzby@telsasoft.com> writes:
> This also appears to break checksums.

I was wondering about that, because the typical pattern for use of
smgrextend for indexes seems to be

    RelationOpenSmgr(rel);
    PageSetChecksumInplace(page, lastblock);
    smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);

and gist_indexsortbuild wasn't doing either of the first two things.

gist_indexsortbuild_flush_ready_pages looks like it might be
a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
run hasn't gotten to anything except the pretty-trivial index
made in point.sql, so I don't have evidence about it.

Another interesting point is that all the other index AMs seem to WAL-log
the new page before the smgrextend call, whereas this code is doing it
in the other order.  I strongly doubt that both patterns are equally
correct.  Could be that the other AMs are in the wrong though.

            regards, tom lane

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

21 сентября 2020 г., 11:08:01

On 21/09/2020 02:06, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
>> This also appears to break checksums.

Thanks, I'll go fix it.

> I was wondering about that, because the typical pattern for use of
> smgrextend for indexes seems to be
> 
>     RelationOpenSmgr(rel);
>     PageSetChecksumInplace(page, lastblock);
>     smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);
> 
> and gist_indexsortbuild wasn't doing either of the first two things.
> 
> gist_indexsortbuild_flush_ready_pages looks like it might be
> a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
> run hasn't gotten to anything except the pretty-trivial index
> made in point.sql, so I don't have evidence about it.

I don't think a relcache invalidation can happen on the index we're 
building. Other similar callers call RelationOpenSmgr(rel) before every 
write though (e.g. _bt_blwritepage()), so perhaps it's better to copy 
that pattern here too.

> Another interesting point is that all the other index AMs seem to WAL-log
> the new page before the smgrextend call, whereas this code is doing it
> in the other order.  I strongly doubt that both patterns are equally
> correct.  Could be that the other AMs are in the wrong though.

My thinking was that it's better to call smgrextend() first, so that if 
you run out of disk space, you get the error before WAL-logging it. That 
reduces the chance that WAL replay will run out of disk space. A lot of 
things are different during WAL replay, so it's quite likely that WAL 
replay runs out of disk space anyway if you're living on the edge, but 
still.

I didn't notice that the other callers are doing it the other way round, 
though. I think they need to, so that they can stamp the page with the 
LSN of the WAL record. But GiST build is special in that regard, because 
it stamps all pages with GistBuildLSN.

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

21 сентября 2020 г., 11:45:59

On 21/09/2020 11:08, Heikki Linnakangas wrote:
> I think they need to, so that they can stamp the page with the LSN of
> the WAL record. But GiST build is special in that regard, because it
> stamps all pages with GistBuildLSN.

Actually, don't we have a problem with that, even before this patch? 
Even though we set the LSN to the magic GistBuildLSN value when we build 
the index, WAL replay will write the LSN of the record instead. That 
would mess with the LSN-NSN interlock. After WAL replay (or in a 
streaming replica), a scan on the GiST index might traverse right-links 
unnecessarily.

- Heikki

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

21 сентября 2020 г., 12:06:47


> 21 сент. 2020 г., в 13:45, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 21/09/2020 11:08, Heikki Linnakangas wrote:
>> I think they need to, so that they can stamp the page with the LSN of
>> the WAL record. But GiST build is special in that regard, because it
>> stamps all pages with GistBuildLSN.
>
> Actually, don't we have a problem with that, even before this patch? Even though we set the LSN to the magic
GistBuildLSNvalue when we build the index, WAL replay will write the LSN of the record instead. That would mess with
theLSN-NSN interlock. After WAL replay (or in a streaming replica), a scan on the GiST index might traverse right-links
unnecessarily.

I think we don't set rightlinks during index build.

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

21 сентября 2020 г., 14:56:44

On 21/09/2020 02:06, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
>> This also appears to break checksums.

Fixed, thanks for the report!

> I was wondering about that, because the typical pattern for use of
> smgrextend for indexes seems to be
> 
>     RelationOpenSmgr(rel);
>     PageSetChecksumInplace(page, lastblock);
>     smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false);
> 
> and gist_indexsortbuild wasn't doing either of the first two things.
> 
> gist_indexsortbuild_flush_ready_pages looks like it might be
> a few bricks shy of a load too.  But my local CLOBBER_CACHE_ALWAYS
> run hasn't gotten to anything except the pretty-trivial index
> made in point.sql, so I don't have evidence about it.

I added a RelationOpenSmgr() call there too, although it's not needed 
currently. It seems to be enough to do it before the first smgrextend() 
call. But if you removed or refactored the first call someohow, so it 
was not the first call anymore, it would be easy to miss that you'd 
still need the RelationOpenSmgr() call there. It's more consistent with 
the code in nbtsort.c now, too.

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

21 сентября 2020 г., 15:15:18

On 21/09/2020 12:06, Andrey M. Borodin wrote:
>> 21 сент. 2020 г., в 13:45, Heikki Linnakangas <hlinnaka@iki.fi>
>> написал(а):
>> 
>> Actually, don't we have a problem with that, even before this
>> patch? Even though we set the LSN to the magic GistBuildLSN value
>> when we build the index, WAL replay will write the LSN of the
>> record instead. That would mess with the LSN-NSN interlock. After
>> WAL replay (or in a streaming replica), a scan on the GiST index
>> might traverse right-links unnecessarily.
> 
> I think we don't set rightlinks during index build.

The new GiST sorting code does not, but the regular insert-based code does.

That's a bit questionable in the new code actually. Was that a conscious
decision? The right-links are only needed when there are concurrent page
splits, so I think it's OK, but the checks for InvalidBlockNumber in
gistScanPage() and gistFindPage() have comment "/* sanity check */".
Comment changes are needed, at least.

- Heikki

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

21 сентября 2020 г., 16:29:14


> 21 сент. 2020 г., в 17:15, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 21/09/2020 12:06, Andrey M. Borodin wrote
>>
>> I think we don't set rightlinks during index build.
>
> The new GiST sorting code does not, but the regular insert-based code does.
>
> That's a bit questionable in the new code actually. Was that a conscious
> decision? The right-links are only needed when there are concurrent page
> splits, so I think it's OK, but the checks for InvalidBlockNumber in
> gistScanPage() and gistFindPage() have comment "/* sanity check */".
> Comment changes are needed, at least.

It was a conscious decision with incorrect motivation. I was thinking that it will help to reduce number of "false
positive"inspecting right pages. But now I see that: 
1. There should be no such "false positives" that we can avoid
2. Valid rightlinks could help to do amcheck verification in future

But thing that bothers me now: when we vacuum leaf page, we bump it's NSN. But we do not bump internal page LSN. Does
thismeans we will follow rightlinks after vacuum? It seems superflous. And btw we did not adjust internal page tuples
aftervacuum... 

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

21 сентября 2020 г., 17:19:32


> 21 сент. 2020 г., в 18:29, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>
> It was a conscious decision with incorrect motivation. I was thinking that it will help to reduce number of "false
positive"inspecting right pages. But now I see that: 
> 1. There should be no such "false positives" that we can avoid
> 2. Valid rightlinks could help to do amcheck verification in future

Well, point number 2 here is invalid. There exist one leaf page p, so that if we start traversing rightlink from p we
willreach all leaf pages. But we practically have no means to find this page. This makes rightlinks not very helpful in
amcheckfor GiST. 

But for consistency I think it worth to install them.

Thanks!

Best regards, Andrey Borodin.

Вложения

0001-Install-rightlinks-on-GiST-pages-in-case-of-sorting-.patch

Re: Yet another fast GiST build

От

Tom Lane

Дата:

21 сентября 2020 г., 17:20:19

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 21/09/2020 02:06, Tom Lane wrote:
>> Another interesting point is that all the other index AMs seem to WAL-log
>> the new page before the smgrextend call, whereas this code is doing it
>> in the other order.  I strongly doubt that both patterns are equally
>> correct.  Could be that the other AMs are in the wrong though.

> My thinking was that it's better to call smgrextend() first, so that if 
> you run out of disk space, you get the error before WAL-logging it. That 
> reduces the chance that WAL replay will run out of disk space. A lot of 
> things are different during WAL replay, so it's quite likely that WAL 
> replay runs out of disk space anyway if you're living on the edge, but 
> still.

Yeah.  access/transam/README points out that such failures need to be
planned for, and explains what we do for heap pages;

    1. Adding a disk page to an existing table.

    This action isn't WAL-logged at all.  We extend a table by writing a page
    of zeroes at its end.  We must actually do this write so that we are sure
    the filesystem has allocated the space.  If the write fails we can just
    error out normally.  Once the space is known allocated, we can initialize
    and fill the page via one or more normal WAL-logged actions.  Because it's
    possible that we crash between extending the file and writing out the WAL
    entries, we have to treat discovery of an all-zeroes page in a table or
    index as being a non-error condition.  In such cases we can just reclaim
    the space for re-use.

So GIST seems to be acting according to that design.  (Someday we need
to update this para to acknowledge that not all filesystems behave as
it's assuming.)

> I didn't notice that the other callers are doing it the other way round, 
> though. I think they need to, so that they can stamp the page with the 
> LSN of the WAL record. But GiST build is special in that regard, because 
> it stamps all pages with GistBuildLSN.

Kind of unpleasant; that means they risk what the README points out:

    In all of these cases, if WAL replay fails to redo the original action
    we must panic and abort recovery.  The DBA will have to manually clean up
    (for instance, free up some disk space or fix directory permissions) and
    then restart recovery.  This is part of the reason for not writing a WAL
    entry until we've successfully done the original action.

I'm not sufficiently motivated to go and change it right now, though.

            regards, tom lane

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

28 сентября 2020 г., 09:53:45

On 21/09/2020 16:29, Andrey M. Borodin wrote:
> But thing that bothers me now: when we vacuum leaf page, we bump it's
> NSN. But we do not bump internal page LSN. Does this means we will
> follow rightlinks after vacuum? It seems superflous.

Sorry, I did not understand what you said above. Vacuum doesn't update 
any NSNs, only LSNs. Can you elaborate?

> And btw we did not adjust internal page tuples after vacuum...
What do you mean by that?

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

28 сентября 2020 г., 11:12:02

On 21/09/2020 17:19, Andrey M. Borodin wrote:
>> 21 сент. 2020 г., в 18:29, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>>
>> It was a conscious decision with incorrect motivation. I was thinking that it will help to reduce number of "false
positive"inspecting right pages. But now I see that:
 
>> 1. There should be no such "false positives" that we can avoid
>> 2. Valid rightlinks could help to do amcheck verification in future
> 
> Well, point number 2 here is invalid. There exist one leaf page p, so that if we start traversing rightlink from p we
willreach all leaf pages. But we practically have no means to find this page. This makes rightlinks not very helpful in
amcheckfor GiST.
 

Well, if you store all the right links in a hash table or something, you 
can "connect the dots" after you have scanned all the pages to see that 
the chain is unbroken. Probably would not be worth the trouble, since 
the rightlinks are not actually needed after concurrent scans have 
completed.

> But for consistency I think it worth to install them.

I agree. I did some testing with your patch. It seems that the 
rightlinks are still not always set. I didn't try to debug why.

I wrote a couple of 'pageinspect' function to inspect GiST pages for 
this. See attached. I then created a test table and index like this:

create table points (p point);
insert into points select point(x,y) from generate_series(-2000, 2000) 
x, generate_series(-2000, 2000) y;
create index points_idx on points using gist (p);

And this is what the root page looks like:

postgres=# select * from gist_page_items(get_raw_page('points_idx', 0));
  itemoffset |     ctid      | itemlen
------------+---------------+---------
           1 | (27891,65535) |      40
           2 | (55614,65535) |      40
           3 | (83337,65535) |      40
           4 | (97019,65535) |      40
(4 rows)

And the right links on the next level:

postgres=# select * from (VALUES (27891), (55614), (83337), (97019)) b 
(blkno), lateral gist_page_opaque_info(get_raw_page('points_idx', blkno));
  blkno | lsn | nsn | rightlink  | flags
-------+-----+-----+------------+-------
  27891 | 0/1 | 0/0 | 4294967295 | {}
  55614 | 0/1 | 0/0 | 4294967295 | {}
  83337 | 0/1 | 0/0 |      27891 | {}
  97019 | 0/1 | 0/0 |      55614 | {}
(4 rows)

I expected there to be only one page with invalid right link, but there 
are two.

- Heikki

Вложения

0001-Add-functions-to-pageinspect-to-inspect-GiST-indexes.patch

Re: Yet another fast GiST build

От

"Andrey M. Borodin"

Дата:

29 сентября 2020 г., 21:04:18


> 28 сент. 2020 г., в 13:12, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 21/09/2020 17:19, Andrey M. Borodin wrote:
>>> 21 сент. 2020 г., в 18:29, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>>>
>>> It was a conscious decision with incorrect motivation. I was thinking that it will help to reduce number of "false
positive"inspecting right pages. But now I see that: 
>>> 1. There should be no such "false positives" that we can avoid
>>> 2. Valid rightlinks could help to do amcheck verification in future
>> Well, point number 2 here is invalid. There exist one leaf page p, so that if we start traversing rightlink from p
wewill reach all leaf pages. But we practically have no means to find this page. This makes rightlinks not very helpful
inamcheck for GiST. 
>
> Well, if you store all the right links in a hash table or something, you can "connect the dots" after you have
scannedall the pages to see that the chain is unbroken. Probably would not be worth the trouble, since the rightlinks
arenot actually needed after concurrent scans have completed. 
>
>> But for consistency I think it worth to install them.
>
> I agree. I did some testing with your patch. It seems that the rightlinks are still not always set. I didn't try to
debugwhy. 
>
> I wrote a couple of 'pageinspect' function to inspect GiST pages for this. See attached. I then created a test table
andindex like this: 
>
> create table points (p point);
> insert into points select point(x,y) from generate_series(-2000, 2000) x, generate_series(-2000, 2000) y;
> create index points_idx on points using gist (p);
>
> And this is what the root page looks like:
>
> postgres=# select * from gist_page_items(get_raw_page('points_idx', 0));
> itemoffset |     ctid      | itemlen
> ------------+---------------+---------
>          1 | (27891,65535) |      40
>          2 | (55614,65535) |      40
>          3 | (83337,65535) |      40
>          4 | (97019,65535) |      40
> (4 rows)
>
> And the right links on the next level:
>
> postgres=# select * from (VALUES (27891), (55614), (83337), (97019)) b (blkno), lateral
gist_page_opaque_info(get_raw_page('points_idx',blkno)); 
> blkno | lsn | nsn | rightlink  | flags
> -------+-----+-----+------------+-------
> 27891 | 0/1 | 0/0 | 4294967295 | {}
> 55614 | 0/1 | 0/0 | 4294967295 | {}
> 83337 | 0/1 | 0/0 |      27891 | {}
> 97019 | 0/1 | 0/0 |      55614 | {}
> (4 rows)
>
> I expected there to be only one page with invalid right link, but there are two.

Yes, there is a bug. Now it seems to me so obvious, yet it took some time to understand that links were shifted by one
extrajump. PFA fixed rightlinks installation. 

BTW some one more small thing: we initialise page buffers with palloc() and palloc0(), while first one is sufficient
forgistinitpage(). 
Also I'm working on btree_gist opclasses and found out that new sortsupport function is not mentioned in
gistadjustmembers().I think this can be fixed with gist_btree patch. 

Your pageinspect patch seems very useful. How do you think, should we provide a way to find invalid tuples in GiST
withingist_page_items()? At some point we will have to ask user to reindex GiSTs with invalid tuples. 


> 28 сент. 2020 г., в 11:53, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 21/09/2020 16:29, Andrey M. Borodin wrote:
>> But thing that bothers me now: when we vacuum leaf page, we bump it's
>> NSN. But we do not bump internal page LSN. Does this means we will
>> follow rightlinks after vacuum? It seems superflous.
>
> Sorry, I did not understand what you said above. Vacuum doesn't update any NSNs, only LSNs. Can you elaborate?
I've misunderstood difference between NSN and LSN. Seems like everything is fine.

>
>> And btw we did not adjust internal page tuples after vacuum...
> What do you mean by that?

When we delete rows from table internal tuples in GiST stay wide.


Thanks!

Best regards, Andrey Borodin.

Вложения

install_rightlinks_v2.diff

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

29 сентября 2020 г., 23:31:07

On 29/09/2020 21:04, Andrey M. Borodin wrote:
> 28 сент. 2020 г., в 13:12, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>> I did some testing with your patch. It seems that the rightlinks
>> are still not always set. I didn't try to debug why.
> 
> Yes, there is a bug. Now it seems to me so obvious, yet it took some 
> time to understand that links were shifted by one extra jump. PFA 
> fixed rightlinks installation.
Ah, that was simple. I propose adding a comment on it:

--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -540,6 +540,19 @@ gist_indexsortbuild_pagestate_flush(GISTBuildState 
*state,
      /* Re-initialize the page buffer for next page on this level. */
      pagestate->page = palloc(BLCKSZ);
      gistinitpage(pagestate->page, isleaf ? F_LEAF : 0);
+
+    /*
+     * Set the right link to point to the previous page. This is just for
+     * debugging purposes: GiST only follows the right link if a page is split
+     * concurrently to a scan, and that cannot happen during index build.
+     *
+     * It's a bit counterintuitive that we set the right link on the new page
+     * to point to the previous page, and not the other way round. But GiST
+     * pages are not ordered like B-tree pages are, so as long as the
+     * right-links form a chain through all the pages in the same level, the
+     * order doesn't matter.
+     */
+    GistPageGetOpaque(pagestate->page)->rightlink = blkno;
  }

> BTW some one more small thing: we initialise page buffers with
> palloc() and palloc0(), while first one is sufficient for
> gistinitpage().
Hmm. Only the first one, in gist_indexsortbuild(), but that needs to be 
palloc0, because it's used to write an all-zeros placeholder for the 
root page.

> Also I'm working on btree_gist opclasses and found out that new
> sortsupport function is not mentioned in gistadjustmembers(). I think
> this can be fixed with gist_btree patch.

Thanks!

> Your pageinspect patch seems very useful. How do you think, should we
> provide a way to find invalid tuples in GiST within
> gist_page_items()? At some point we will have to ask user to reindex
> GiSTs with invalid tuples.
You mean invalid tuples created by crash on PostgreSQL version 9.0 or 
below, and pg_upgraded? I doubt there are any of those still around in 
the wild. We have to keep the code to detect them, though.

It would be nice to improve gist_page_items() to display more 
information about the items, although I wouldn't worry much about 
invalid tuples. The 'gevel' extension that Oleg mentioned upthread does 
more, it would be nice to incorporate that into pageinspect somehow.

- Heikki

Re: Yet another fast GiST build

От

Pavel Borisov

Дата:

06 октября 2020 г., 14:05:42

I've been making tests with memory sanitizer and got one another error in regression test create_index:

CREATE INDEX gpointind ON point_tbl USING gist (f1);

server closed the connection unexpectedly

with logfile:

gistproc.c:1714:28: runtime error: 1e+300 is outside the range of representable values of type 'float'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior gistproc.c:1714:28

gistproc.c:

1714 z = point_zorder_internal(p->x, p->y);

Consider this a minor issue but unrelated to the other issues discussed. It is reproduced on the last master commit 0a3c864c32751fd29d021929cf70af421fd27370 after all changes into Gist committed.

cflags="-DUSE_VALGRIND -Og -O0 -fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -fno-sanitize=alignment -fstack-protector"

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

06 октября 2020 г., 16:19:31

On 06/10/2020 14:05, Pavel Borisov wrote:
> I've been making tests with memory sanitizer

Thanks!

> and got one another error 
> in regression test create_index:
> CREATE INDEX gpointind ON point_tbl USING gist (f1);
> server closed the connection unexpectedly
> 
> with logfile:
> gistproc.c:1714:28: runtime error: 1e+300 is outside the range of 
> representable values of type 'float'
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior gistproc.c:1714:28
> 
> gistproc.c:
> 1714     z = point_zorder_internal(p->x, p->y);
> 
> Consider this a minor issue but unrelated to the other issues discussed. 
> It is reproduced on the last master 
> commit 0a3c864c32751fd29d021929cf70af421fd27370 after all changes into 
> Gist committed.
> 
> cflags="-DUSE_VALGRIND -Og -O0 -fsanitize=address -fsanitize=undefined 
> -fno-sanitize-recover=all -fno-sanitize=alignment -fstack-protector"

You get the same error with:

select (float8 '1e+300')::float4;

float.c:1204:11: runtime error: 1e+300 is outside the range of 
representable values of type 'float'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior float.c:1204:11 in

It boils down to casting a C double to float, when the value doesn't fit 
in float. I'm surprised that's undefined behavior, but I'm no C99 
lawyer. The code in dtof() expects it to yield Inf.

I'm inclined to shrug this off and say that the sanitizer is being 
over-zealous. Is there some compiler flag we should be setting, to tell 
it that we require specific behavior? Any other ideas?

- Heikki

Re: Yet another fast GiST build

От

Tom Lane

Дата:

06 октября 2020 г., 17:29:42

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> You get the same error with:
> select (float8 '1e+300')::float4;
> float.c:1204:11: runtime error: 1e+300 is outside the range of 
> representable values of type 'float'
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior float.c:1204:11 in

> It boils down to casting a C double to float, when the value doesn't fit 
> in float. I'm surprised that's undefined behavior, but I'm no C99 
> lawyer. The code in dtof() expects it to yield Inf.

I think UBSan read C99 6.3.1.5:

       [#2]  When  a double is demoted to float or a long double to
       double or float, if the value being converted is outside the
       range  of  values  that  can be represented, the behavior is
       undefined.

and stopped reading at that point, which they should not have.
If you go on to read the portions around, particularly, <fenv.h>,
you get a different picture of affairs.  If we're relying on IEEE
float semantics in other places, which we are, we're perfectly
entitled to assume that the cast will yield Inf (and a floating
point exception flag, which we ignore).  I think the "undefined"
here is just meant to say that there's no single behavior promised
across all possible C implementations.  They'd have been better to
write "implementation-defined", though.

> I'm inclined to shrug this off and say that the sanitizer is being 
> over-zealous. Is there some compiler flag we should be setting, to tell 
> it that we require specific behavior? Any other ideas?

If UBSan doesn't have a flag to tell it to assume IEEE math,
I'd say that makes it next door to worthless for our purposes.

            regards, tom lane

Re: Yet another fast GiST build

От

Pavel Borisov

Дата:

06 октября 2020 г., 18:29:29

It became normal with

-fsanitize=signed-integer-overflow,null,alignment

instead of

-fsanitize=undefined

(which is strictly a 'default' list of needed and unnecessary things to check, can be overridden anyway but needed some reading for it)

Thanks!

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

07 октября 2020 г., 15:27:21


> 29 сент. 2020 г., в 23:04, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
>
>
> Also I'm working on btree_gist opclasses and found out that new sortsupport function is not mentioned in
gistadjustmembers().I think this can be fixed with gist_btree patch. 

Here's draft patch with implementation of sortsupport for ints and floats.

Thanks!

Best regards, Andrey Borodin.

Вложения

0001-Sortsupport-for-sorting-GiST-build-for-ints-and-floa.patch

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 октября 2020 г., 15:38:20

On 07/10/2020 15:27, Andrey Borodin wrote:
> Here's draft patch with implementation of sortsupport for ints and floats.

> +static int
> +gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
> +{
> +    int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
> +    int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
> +
> +    if (ia->lower == ib->lower)
> +    {
> +        if (ia->upper == ib->upper)
> +            return 0;
> +
> +        return (ia->upper > ib->upper) ? 1 : -1;
> +    }
> +
> +    return (ia->lower > ib->lower) ? 1 : -1;
> +}

We're only dealing with leaf items during index build, so the 'upper' 
and 'lower' should always be equal here, right? Maybe add a comment and 
an assertion on that.

(It's pretty sad that the on-disk representation is identical for leaf 
and internal items, because that wastes a lot of disk space, but that's 
way out of scope for this patch.)

- Heikki

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

21 октября 2020 г., 20:13:34


> 7 окт. 2020 г., в 17:38, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> On 07/10/2020 15:27, Andrey Borodin wrote:
>> Here's draft patch with implementation of sortsupport for ints and floats.
>
>> +static int
>> +gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
>> +{
>> +    int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
>> +    int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
>> +
>> +    if (ia->lower == ib->lower)
>> +    {
>> +        if (ia->upper == ib->upper)
>> +            return 0;
>> +
>> +        return (ia->upper > ib->upper) ? 1 : -1;
>> +    }
>> +
>> +    return (ia->lower > ib->lower) ? 1 : -1;
>> +}
>
> We're only dealing with leaf items during index build, so the 'upper' and 'lower' should always be equal here, right?
Maybeadd a comment and an assertion on that. 
>
> (It's pretty sad that the on-disk representation is identical for leaf and internal items, because that wastes a lot
ofdisk space, but that's way out of scope for this patch.) 

Thanks, I've added assert() where is was easy to test equalty.

PFA patch with all types.

I had a plan to implement and test one type each day. I did not quite understood how rich our type system is.

Thanks!

Best regards, Andrey Borodin.

Вложения

0001-Sortsupport-for-sorting-GiST-build-for-gist_btree-ty.patch

Re: Yet another fast GiST build

От

Pavel Borisov

Дата:

21 октября 2020 г., 20:35:27

>> + return (ia->lower > ib->lower) ? 1 : -1;
>> +}
>
> We're only dealing with leaf items during index build, so the 'upper' and 'lower' should always be equal here, right? Maybe add a comment and an assertion on that.
>
> (It's pretty sad that the on-disk representation is identical for leaf and internal items, because that wastes a lot of disk space, but that's way out of scope for this patch.)

Thanks, I've added assert() where is was easy to test equalty.

PFA patch with all types.

I had a plan to implement and test one type each day. I did not quite understood how rich our type system is.

Cool, thanks!

BTW there is a somewhat parallel discussion on this gist patch in pgsql-bugs which you may miss

https://www.postgresql.org/message-id/flat/3dda6945-c147-5afc-7720-38ec6848dafa%40gmail.com#9be5b8a75c4921498748514cb77c6e68

The main point is that buffering build is better to be enforced with buffering=on as otherwise buffering build becomes hard to test in the presence of sort support.

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

27 октября 2020 г., 14:43:48

On 21/10/2020 20:13, Andrey Borodin wrote:
>> 7 окт. 2020 г., в 17:38, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>>
>> On 07/10/2020 15:27, Andrey Borodin wrote:
>>> Here's draft patch with implementation of sortsupport for ints and floats.
>>
>>> +static int
>>> +gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
>>> +{
>>> +    int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
>>> +    int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
>>> +
>>> +    if (ia->lower == ib->lower)
>>> +    {
>>> +        if (ia->upper == ib->upper)
>>> +            return 0;
>>> +
>>> +        return (ia->upper > ib->upper) ? 1 : -1;
>>> +    }
>>> +
>>> +    return (ia->lower > ib->lower) ? 1 : -1;
>>> +}
>>
>> We're only dealing with leaf items during index build, so the 'upper' and 'lower' should always be equal here,
right?Maybe add a comment and an assertion on that.
 
>>
>> (It's pretty sad that the on-disk representation is identical for leaf and internal items, because that wastes a lot
ofdisk space, but that's way out of scope for this patch.)
 
> 
> Thanks, I've added assert() where is was easy to test equalty.
> 
> PFA patch with all types.

gbt_ts_cmp(), gbt_time_cmp_sort() and gbt_date_cmp_sort() still have the 
above issue, they still compare "upper" for no good reason.

> +static Datum
> +gbt_bit_abbrev_convert(Datum original, SortSupport ssup)
> +{
> +       return (Datum) 0;
> +}
> +
> +static int
> +gbt_bit_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup)
> +{
> +       return 0;
> +}

If an abbreviated key is not useful, just don't define abbrev functions 
and don't set SortSupport->abbrev_converter in the first place.

> static bool
> gbt_inet_abbrev_abort(int memtupcount, SortSupport ssup)
> {
> #if SIZEOF_DATUM == 8
>     return false;
> #else
>     return true;
> #endif
> }

Better to not set the 'abbrev_converter' function in the first place. Or 
would it be better to cast the float8 to float4 if SIZEOF_DATUM == 4?

> I had a plan to implement and test one type each day. I did not quite understood how rich our type system is.

:-)

- Heikki

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

30 октября 2020 г., 21:20:58

Thanks!

> 27 окт. 2020 г., в 16:43, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> gbt_ts_cmp(), gbt_time_cmp_sort() and gbt_date_cmp_sort() still have the above issue, they still compare "upper" for
nogood reason. 
Fixed.

>> +static Datum
>> +gbt_bit_abbrev_convert(Datum original, SortSupport ssup)
>> +{
>> +       return (Datum) 0;
>> +}
>> +
>> +static int
>> +gbt_bit_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup)
>> +{
>> +       return 0;
>> +}
>
> If an abbreviated key is not useful, just don't define abbrev functions and don't set SortSupport->abbrev_converter
inthe first place. 
Fixed.
>
>> static bool
>> gbt_inet_abbrev_abort(int memtupcount, SortSupport ssup)
>> {
>> #if SIZEOF_DATUM == 8
>>     return false;
>> #else
>>     return true;
>> #endif
>> }
>
> Better to not set the 'abbrev_converter' function in the first place. Or would it be better to cast the float8 to
float4if SIZEOF_DATUM == 4? 
Ok, now for 4 bytes Datum we do return (Datum) Float4GetDatum((float) z);

How do you think, should this patch and patch with pageinspect GiST functions be registered on commitfest?

Thanks!

Best regards, Andrey Borodin.

Вложения

v3-0001-Sortsupport-for-sorting-GiST-build-for-gist_btree.patch

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

02 ноября 2020 г., 17:45:56

On 30/10/2020 20:20, Andrey Borodin wrote:
>> 27 окт. 2020 г., в 16:43, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>>> static bool
>>> gbt_inet_abbrev_abort(int memtupcount, SortSupport ssup)
>>> {
>>> #if SIZEOF_DATUM == 8
>>>     return false;
>>> #else
>>>     return true;
>>> #endif
>>> }
>>
>> Better to not set the 'abbrev_converter' function in the first place. Or would it be better to cast the float8 to
float4if SIZEOF_DATUM == 4?

> Ok, now for 4 bytes Datum we do return (Datum) Float4GetDatum((float) z);

A few quick comments:

* Currently with float8, you immediately abort abbreviation if 
SIZEOF_DATUM == 4. Like in the 'inet' above, you could convert the 
float8 to float4 instead.

* Some of the ALTER OPERATOR FAMILY commands in btree_gist--1.6--1.7.sql 
are copy-pasted wrong. Here's one example:

ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
    FUNCTION    11    (text, text) gbt_text_sortsupport;

* It's easy to confuse the new comparison functions with the existing 
comparisons used by the picksplit functions. Some comments and/or naming 
changes would be good to clarify how they're different.

* It would be straightforward to have abbreviated functions for macaddr 
and macaddr8 too.

> How do you think, should this patch and patch with pageinspect GiST functions be registered on commitfest?

Yeah, that'd be good.

- Heikki

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

05 ноября 2020 г., 20:11:52


> 2 нояб. 2020 г., в 19:45, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
>> How do you think, should this patch and patch with pageinspect GiST functions be registered on commitfest?
>
> Yeah, that'd be good.
I've registered both patches on January CF.
pageinspect patch's code looks goot to me (besides TODO item there), but it lacks docs and tests. I can add some info
andcalls in future reviews. 

>
> On 30/10/2020 20:20, Andrey Borodin wrote:
> A few quick comments:
>
> * Currently with float8, you immediately abort abbreviation if SIZEOF_DATUM == 4. Like in the 'inet' above, you could
convertthe float8 to float4 instead. 
>
> * Some of the ALTER OPERATOR FAMILY commands in btree_gist--1.6--1.7.sql are copy-pasted wrong. Here's one example:
>
> ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
>     FUNCTION    11    (text, text) gbt_text_sortsupport;
>
> * It's easy to confuse the new comparison functions with the existing comparisons used by the picksplit functions.
Somecomments and/or naming changes would be good to clarify how they're different. 
>
> * It would be straightforward to have abbreviated functions for macaddr and macaddr8 too.

I'll fix these issues soon. But things like this bother me a lot:
> ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
>     FUNCTION    11    (text, text) gbt_text_sortsupport;

To test that functions are actually called for sorting build we should support directive sorting build like "CREATE
INDEXON A USING GIST(B) WITH(sorting=surely,and fail if not)". 
If we have unconditional sorting build and unconditional buffered build we can check opclasses code better.
The problem is current reloption is called "buffering". It would be strange if we gave this enum possible value like
"notbuffering, but very much like buffering, just another way". 
How do you think, is it ok to add reloption "buffering=sorting" to enhance tests?

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

07 ноября 2020 г., 18:42:31


> 5 нояб. 2020 г., в 22:20, Justin Pryzby <pryzby@telsasoft.com> написал(а):
>
> On Thu, Nov 05, 2020 at 10:11:52PM +0500, Andrey Borodin wrote:
>> To test that functions are actually called for sorting build we should support directive sorting build like "CREATE
INDEXON A USING GIST(B) WITH(sorting=surely,and fail if not)". 
>
> Maybe you could add a DEBUG1 message for that, and include that in regression
> tests, which would then fail if sorting wasn't used.

That's a good idea. Thanks!
>
> Maybe you'd need to make it consistent by setting GUCs like work_mem /
> max_parallel_maintenance_workers / ??
>
> Similar to this
>
> postgres=# SET client_min_messages =debug;
> postgres=# CREATE INDEX ON A USING GIST(i) WITH(buffering=off);
> DEBUG:  building index "a_i_idx2" on table "a" serially
> CREATE INDEX

Currently, only B-tree uses parallel build, so no need to tweak GUCs except client_min_messages.
Before these tests, actually, ~20% of opclasses were not working as expected. Despite I've checked each one by hand. I
have 

PFA patch with fixed comments from Heikki.

Thanks!

Best regards, Andrey Borodin.

Вложения

v4-0001-Sortsupport-for-sorting-GiST-build-for-gist_btree.patch

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

07 декабря 2020 г., 13:05:19


> 28 сент. 2020 г., в 13:12, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> I wrote a couple of 'pageinspect' function to inspect GiST pages for this. See attached.
> <0001-Add-functions-to-pageinspect-to-inspect-GiST-indexes.patch>

Here's version with tests and docs. I still have no idea how to print some useful information about tuples keys.

Thanks!

Best regards, Andrey Borodin.

Вложения

v2-0001-Add-functions-to-pageinspect-to-inspect-GiST-inde.patch

Re: Yet another fast GiST build

От

Peter Geoghegan

Дата:

07 декабря 2020 г., 21:56:09

On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Here's version with tests and docs. I still have no idea how to print some useful information about tuples keys.

I suggest calling BuildIndexValueDescription() from your own custom
debug instrumentation code.

-- 
Peter Geoghegan

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

09 декабря 2020 г., 12:47:31


> 7 дек. 2020 г., в 23:56, Peter Geoghegan <pg@bowt.ie> написал(а):
>
> On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Here's version with tests and docs. I still have no idea how to print some useful information about tuples keys.
>
> I suggest calling BuildIndexValueDescription() from your own custom
> debug instrumentation code.
Thanks for the hint, Peter!
This function does exactly what I want to do. But I have no Relation inside gist_page_items(bytea) function...
probably,I'll add gist_page_items(relname, blockno) overload to fetch keys. 

Thanks!

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

10 декабря 2020 г., 13:16:04


> 9 дек. 2020 г., в 14:47, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
>
>
>> 7 дек. 2020 г., в 23:56, Peter Geoghegan <pg@bowt.ie> написал(а):
>>
>> On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>> Here's version with tests and docs. I still have no idea how to print some useful information about tuples keys.
>>
>> I suggest calling BuildIndexValueDescription() from your own custom
>> debug instrumentation code.
> Thanks for the hint, Peter!
> This function does exactly what I want to do. But I have no Relation inside gist_page_items(bytea) function...
probably,I'll add gist_page_items(relname, blockno) overload to fetch keys. 

PFA patch with implementation.

Best regards, Andrey Borodin.

Вложения

v3-0001-Add-functions-to-pageinspect-to-inspect-GiST-inde.patch

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

12 января 2021 г., 16:49:36

On 10/12/2020 12:16, Andrey Borodin wrote:
>> 9 дек. 2020 г., в 14:47, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>>> 7 дек. 2020 г., в 23:56, Peter Geoghegan <pg@bowt.ie> написал(а):
>>>
>>> On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>>> Here's version with tests and docs. I still have no idea how to print some useful information about tuples keys.
>>>
>>> I suggest calling BuildIndexValueDescription() from your own custom
>>> debug instrumentation code.
>> Thanks for the hint, Peter!
>> This function does exactly what I want to do. But I have no Relation inside gist_page_items(bytea) function...
probably,I'll add gist_page_items(relname, blockno) overload to fetch keys.

> 
> PFA patch with implementation.

I did a bit of cleanup on the function signature. The .sql script 
claimed that gist_page_items() took bytea as argument, but in reality it 
was a relation name, as text. I changed it so that it takes a page image 
as argument, instead of reading the block straight from the index. 
Mainly to make it consistent with brin_page_items(), if it wasn't for 
that precedence I might've gone either way on it.

Fixed the docs accordingly, and ran pgindent. New patch version attached.

- Heikki

Вложения

v4-0001-Add-functions-to-pageinspect-to-inspect-GiST-inde.patch

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

12 января 2021 г., 19:19:44


> 12 янв. 2021 г., в 18:49, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
>> PFA patch with implementation.
>
> I did a bit of cleanup on the function signature. The .sql script claimed that gist_page_items() took bytea as
argument,but in reality it was a relation name, as text. I changed it so that it takes a page image as argument,
insteadof reading the block straight from the index. Mainly to make it consistent with brin_page_items(), if it wasn't
forthat precedence I might've gone either way on it. 
bt_page_items() takes relation name and block number, that was a reason for doing so. But all others *_page_items()
(heap,brin, hash) are doing as in v4. So I think it's more common way. 

>
> Fixed the docs accordingly, and ran pgindent. New patch version attached.

Thanks! Looks good to me.

One more question: will bytea tests run correctly on 32bit\different-endian systems?

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

13 января 2021 г., 11:41:44

On 12/01/2021 18:19, Andrey Borodin wrote:
>> 12 янв. 2021 г., в 18:49, Heikki Linnakangas <hlinnaka@iki.fi>
>> написал(а):
>> 
>> Fixed the docs accordingly, and ran pgindent. New patch version
>> attached.
> 
> Thanks! Looks good to me.

Pushed, thanks!

> One more question: will bytea tests run correctly on
> 32bit\different-endian systems?

Good question. Somehow I thought we were printing esseantilly text 
values as bytea. But they are Points, which consists of float8's. Since 
I already pushed this, I'm going to just wait and see what the buildfarm 
says, and fix if needed. I think the fix is going to be to just remove 
the test for the bytea-variant, it doesn't seem that important to test.

- Heikki

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

13 января 2021 г., 12:46:51


> 13 янв. 2021 г., в 13:41, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
>> One more question: will bytea tests run correctly on
>> 32bit\different-endian systems?
>
> Good question. Somehow I thought we were printing esseantilly text values as bytea. But they are Points, which
consistsof float8's. Since I already pushed this, I'm going to just wait and see what the buildfarm says, and fix if
needed.I think the fix is going to be to just remove the test for the bytea-variant, it doesn't seem that important to
test.

Maybe we can just omit key_data from tests?

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

13 января 2021 г., 13:34:08

On 13/01/2021 11:46, Andrey Borodin wrote:
>> 13 янв. 2021 г., в 13:41, Heikki Linnakangas <hlinnaka@iki.fi>
>> написал(а):
>> 
>>> One more question: will bytea tests run correctly on 
>>> 32bit\different-endian systems?
>> 
>> Good question. Somehow I thought we were printing esseantilly text
>> values as bytea. But they are Points, which consists of float8's.
>> Since I already pushed this, I'm going to just wait and see what
>> the buildfarm says, and fix if needed. I think the fix is going to
>> be to just remove the test for the bytea-variant, it doesn't seem
>> that important to test.
> 
> Maybe we can just omit key_data from tests?

Make sense, fixed it that way. Thanks!

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

13 января 2021 г., 14:53:39

On 13/01/2021 12:34, Heikki Linnakangas wrote:
> On 13/01/2021 11:46, Andrey Borodin wrote:
>>> 13 янв. 2021 г., в 13:41, Heikki Linnakangas <hlinnaka@iki.fi>
>>> написал(а):
>>>
>>>> One more question: will bytea tests run correctly on
>>>> 32bit\different-endian systems?
>>>
>>> Good question. Somehow I thought we were printing esseantilly text
>>> values as bytea. But they are Points, which consists of float8's.
>>> Since I already pushed this, I'm going to just wait and see what
>>> the buildfarm says, and fix if needed. I think the fix is going to
>>> be to just remove the test for the bytea-variant, it doesn't seem
>>> that important to test.
>>
>> Maybe we can just omit key_data from tests?
> 
> Make sense, fixed it that way. Thanks!

Buildfarm animal thorntail is still not happy:

> --- /home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/contrib/pageinspect/expected/gist.out
2021-01-1313:38:09.721752365 +0300
 
> +++ /home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/contrib/pageinspect/results/gist.out    2021-01-13
14:12:21.540046507+0300
 
> @@ -3,21 +3,21 @@
>  CREATE INDEX test_gist_idx ON test_gist USING gist (p);
>  -- Page 0 is the root, the rest are leaf pages
>  SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
> - lsn | nsn | rightlink  | flags 
> ------+-----+------------+-------
> - 0/1 | 0/0 | 4294967295 | {}
> +    lsn     | nsn | rightlink  | flags 
> +------------+-----+------------+-------
> + 0/1B8357F8 | 0/0 | 4294967295 | {}
>  (1 row)

Looks like the LSN on the page is not set to GistBuildLSN as expected. 
Weird.

Thorntail is a sparc64 system, so little-endian, but the other 
little-endian buildfarm members are not reporting this error. Any idea 
what might be going on?

- Heikki

Re: Yet another fast GiST build

От

Tom Lane

Дата:

13 января 2021 г., 19:17:47

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 13/01/2021 11:46, Andrey Borodin wrote:
>> Maybe we can just omit key_data from tests?

> Make sense, fixed it that way. Thanks!

thorntail, at least, is still unhappy.

            regards, tom lane

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

13 января 2021 г., 21:04:10

On 13 January 2021 13:53:39 EET, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>Buildfarm animal thorntail is still not happy:
>
>> --- /home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/contrib/pageinspect/expected/gist.out
2021-01-1313:38:09.721752365 +0300 
>> +++ /home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/contrib/pageinspect/results/gist.out    2021-01-13
14:12:21.540046507+0300 
>> @@ -3,21 +3,21 @@
>>  CREATE INDEX test_gist_idx ON test_gist USING gist (p);
>>  -- Page 0 is the root, the rest are leaf pages
>>  SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
>> - lsn | nsn | rightlink  | flags
>> ------+-----+------------+-------
>> - 0/1 | 0/0 | 4294967295 | {}
>> +    lsn     | nsn | rightlink  | flags
>> +------------+-----+------------+-------
>> + 0/1B8357F8 | 0/0 | 4294967295 | {}
>>  (1 row)
>
>Looks like the LSN on the page is not set to GistBuildLSN as expected.
>Weird.
>
>Thorntail is a sparc64 system, so little-endian, but the other
>little-endian buildfarm members are not reporting this error. Any idea
>what might be going on?

Status update on this: I am building Postgres in a qemu sparc64 emulated virtual machine, hoping to be able to
reproducethis. It's very slow, so it will take hours still to complete. 

I don't think this is a problem with the test, or with the new pageinspect functions, but a genuine bug in the gist
building code. Or there is something special on that animal that causes the just-created index pages to be dirtied. It
doesseem to happen consistently on thorntail, but not on other animals. 

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

13 января 2021 г., 21:16:23

On 13 January 2021 20:04:10 EET, Heikki Linnakangas <hlinnaka@gmail.com> wrote:
>
>
>On 13 January 2021 13:53:39 EET, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>Looks like the LSN on the page is not set to GistBuildLSN as expected.
>>Weird.
>>
>>Thorntail is a sparc64 system, so little-endian, but the other
>>little-endian buildfarm members are not reporting this error. Any idea
>>what might be going on?
>
>Status update on this: I am building Postgres in a qemu sparc64 emulated virtual machine, hoping to be able to
reproducethis. It's very slow, so it will take hours still to complete. 
>
>I don't think this is a problem with the test, or with the new pageinspect functions, but a genuine bug in the gist
building code. Or there is something special on that animal that causes the just-created index pages to be dirtied. It
doesseem to happen consistently on thorntail, but not on other animals. 

Ah, silly me. Thorntail uses "wal_level=minimal". With that, I can readily reproduce this.

I'm still not sure why it happens, but now it should be straightforward to debug.

- Heikki

Re: Yet another fast GiST build

От

Peter Geoghegan

Дата:

14 января 2021 г., 02:47:52

On Tue, Jan 12, 2021 at 5:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I did a bit of cleanup on the function signature. The .sql script
> claimed that gist_page_items() took bytea as argument, but in reality it
> was a relation name, as text. I changed it so that it takes a page image
> as argument, instead of reading the block straight from the index.
> Mainly to make it consistent with brin_page_items(), if it wasn't for
> that precedence I might've gone either way on it.

BTW it would be nice if gist_page_items() had a "dead" boolean output
argument for the item's LP_DEAD bit, just like bt_page_items(). I plan
on adding some testing for GiST's opportunistic index deletion soon. I
may also add some of the same enhancements that nbtree got today
(following commit d168b666).

This feature was originally heavily based on the nbtree LP_DEAD
deletion mechanism (now called simple deletion), and I see no reason
(or at least no good reason) why it shouldn't be possible to keep it
in sync (except maybe with bottom-up deletion, where that it at least
isn't straightforward/mechanical).

-- 
Peter Geoghegan

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

14 января 2021 г., 12:52:12


> 14 янв. 2021 г., в 04:47, Peter Geoghegan <pg@bowt.ie> написал(а):
>
> On Tue, Jan 12, 2021 at 5:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I did a bit of cleanup on the function signature. The .sql script
>> claimed that gist_page_items() took bytea as argument, but in reality it
>> was a relation name, as text. I changed it so that it takes a page image
>> as argument, instead of reading the block straight from the index.
>> Mainly to make it consistent with brin_page_items(), if it wasn't for
>> that precedence I might've gone either way on it.
>
> BTW it would be nice if gist_page_items() had a "dead" boolean output
> argument for the item's LP_DEAD bit, just like bt_page_items().
+1. PFA patch.

> I plan
> on adding some testing for GiST's opportunistic index deletion soon. I
> may also add some of the same enhancements that nbtree got today
> (following commit d168b666).
>
> This feature was originally heavily based on the nbtree LP_DEAD
> deletion mechanism (now called simple deletion), and I see no reason
> (or at least no good reason) why it shouldn't be possible to keep it
> in sync (except maybe with bottom-up deletion, where that it at least
> isn't straightforward/mechanical).

Sound great!

Best regards, Andrey Borodin.

Вложения

0001-Add-bool-column-for-LP_DEAF-flag-to-GiST-pageinspect.patch

Re: Yet another fast GiST build

От

Peter Eisentraut

Дата:

15 января 2021 г., 08:24:42

On 2021-01-12 14:49, Heikki Linnakangas wrote:
>>>> I suggest calling BuildIndexValueDescription() from your own custom
>>>> debug instrumentation code.
>>> Thanks for the hint, Peter!
>>> This function does exactly what I want to do. But I have no Relation inside gist_page_items(bytea) function...
probably,I'll add gist_page_items(relname, blockno) overload to fetch keys.

>>
>> PFA patch with implementation.
> 
> I did a bit of cleanup on the function signature. The .sql script
> claimed that gist_page_items() took bytea as argument, but in reality it
> was a relation name, as text. I changed it so that it takes a page image
> as argument, instead of reading the block straight from the index.
> Mainly to make it consistent with brin_page_items(), if it wasn't for
> that precedence I might've gone either way on it.

I noticed this patch while working on another patch for pageinspect [0], 
and this one appears to introduce a problem similar to the one the other 
patch attempts to fix: The "itemlen" output parameters are declared to 
be of type smallint, but the underlying C data is of type uint16 
(OffsetNumber).  I don't know the details of gist enough to determine 
whether overflow is possible here.  If not, perhaps a check or at least 
a comment would be useful.  Otherwise, these parameters should be of 
type int in SQL.

[0]: 
https://www.postgresql.org/message-id/09e2dd82-4eb6-bba1-271a-d2b58bf6c71f@enterprisedb.com

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

15 января 2021 г., 20:05:49


> 15 янв. 2021 г., в 10:24, Peter Eisentraut <peter.eisentraut@enterprisedb.com> написал(а):
>
> I noticed this patch while working on another patch for pageinspect [0], and this one appears to introduce a problem
similarto the one the other patch attempts to fix: The "itemlen" output parameters are declared to be of type smallint,
butthe underlying C data is of type uint16 (OffsetNumber).  I don't know the details of gist enough to determine
whetheroverflow is possible here.  If not, perhaps a check or at least a comment would be useful.  Otherwise, these
parametersshould be of type int in SQL. 

Item offsets cannot exceed maximum block size of 32768. And even 32768/sizeof(ItemId). Thus overflow is impossible.
Interesting question is wether pageinspect should protect itself from corrupted input?
Generating description from bogus tuple, probably, can go wrong.

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Tom Lane

Дата:

17 января 2021 г., 23:50:31

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 12/01/2021 18:19, Andrey Borodin wrote:
>> Thanks! Looks good to me.

> Pushed, thanks!

I noticed that gist_page_items() thinks it can hold inter_call_data->rel
open across a series of calls.  That's completely unsafe: the executor
might not run the call series to completion (see LIMIT), resulting in
relcache leak complaints.  I suspect that it might have cache-flush
hazards even without that.  I think this code needs to be rewritten to do
all the interesting work in the first call.  Or maybe better, return the
results as a tuplestore so you don't have to do multiple calls at all.

            regards, tom lane

Re: Yet another fast GiST build

От

Peter Geoghegan

Дата:

18 января 2021 г., 01:35:18

On Sun, Jan 17, 2021 at 12:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I noticed that gist_page_items() thinks it can hold inter_call_data->rel
> open across a series of calls.  That's completely unsafe: the executor
> might not run the call series to completion (see LIMIT), resulting in
> relcache leak complaints.

It also has the potential to run into big problems should the user
input a raw page image with an regclass-argument-incompatible tuple
descriptor. Maybe that's okay (this is a tool for experts), but it
certainly is a consideration.

-- 
Peter Geoghegan

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

18 января 2021 г., 01:52:12

On 18/01/2021 00:35, Peter Geoghegan wrote:
> On Sun, Jan 17, 2021 at 12:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I noticed that gist_page_items() thinks it can hold inter_call_data->rel
>> open across a series of calls.  That's completely unsafe: the executor
>> might not run the call series to completion (see LIMIT), resulting in
>> relcache leak complaints.

Fixed, thanks! I changed it to return a tuplestore.

> It also has the potential to run into big problems should the user
> input a raw page image with an regclass-argument-incompatible tuple
> descriptor. Maybe that's okay (this is a tool for experts), but it
> certainly is a consideration.

I'm not sure I understand. It's true that the raw page image can contain 
data from a different index, or any garbage really. And the function 
will behave badly if you do that. That's an accepted risk with 
pageinspect functions, that's why they're superuser-only, although some 
of them are more tolerant of corrupt pages than others. The 
gist_page_items_bytea() variant doesn't try to parse the key data and is 
less likely to crash on bad input.

- Heikki

Re: Yet another fast GiST build

От

Peter Geoghegan

Дата:

18 января 2021 г., 02:04:36

On Sun, Jan 17, 2021 at 2:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm not sure I understand. It's true that the raw page image can contain
> data from a different index, or any garbage really. And the function
> will behave badly if you do that. That's an accepted risk with
> pageinspect functions, that's why they're superuser-only, although some
> of them are more tolerant of corrupt pages than others. The
> gist_page_items_bytea() variant doesn't try to parse the key data and is
> less likely to crash on bad input.

I personally agree with you - it's not like there aren't other ways
for superusers to crash the server (most of which seem very similar to
this gist_page_items() issue, in fact). I just think that it's worth
being clear about that being a trade-off that we've accepted.

-- 
Peter Geoghegan

Re: Yet another fast GiST build

От

Peter Geoghegan

Дата:

18 января 2021 г., 02:10:43

On Sun, Jan 17, 2021 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I personally agree with you - it's not like there aren't other ways
> for superusers to crash the server (most of which seem very similar to
> this gist_page_items() issue, in fact). I just think that it's worth
> being clear about that being a trade-off that we've accepted.

Can we rename gist_page_items_bytea() to gist_page_items(), and at the
same time rename the current gist_page_items() -- perhaps call it
gist_page_items_output()?

That way we could add a bt_page_items_output() function later, while
leaving everything consistent (actually not quite, since
bt_page_items() outputs text instead of bytea -- but that seems worth
fixing too). This also has the merit of making the unsafe "output"
variant into the special case.

-- 
Peter Geoghegan

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

18 января 2021 г., 15:40:02

On 18/01/2021 01:10, Peter Geoghegan wrote:
> On Sun, Jan 17, 2021 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I personally agree with you - it's not like there aren't other ways
>> for superusers to crash the server (most of which seem very similar to
>> this gist_page_items() issue, in fact). I just think that it's worth
>> being clear about that being a trade-off that we've accepted.
> 
> Can we rename gist_page_items_bytea() to gist_page_items(), and at the
> same time rename the current gist_page_items() -- perhaps call it
> gist_page_items_output()?
> 
> That way we could add a bt_page_items_output() function later, while
> leaving everything consistent (actually not quite, since
> bt_page_items() outputs text instead of bytea -- but that seems worth
> fixing too). This also has the merit of making the unsafe "output"
> variant into the special case.

bt_page_items() and bt_page_items_bytea() exist already. And 
brin_page_items() also calls the output functions (there's no bytea 
version of that). Perhaps it would've been better to make the 
bytea-variants the default, but I'm afraid that ship has already sailed.

We're not terribly consistent; heap_page_items(), hash_page_items() and 
gin_page_items() don't attempt to call the output functions.

Then again, I don't think we need to worry much about backwards 
compatibility in pageinspect, so I guess we could rename them all. It 
doesn't bother me enough to make me do it, but I won't object if you 
want to.

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 апреля 2021 г., 09:00:10

On 08/03/2021 19:06, Andrey Borodin wrote:
> There were numerous GiST-build-related patches in this thread. Yet uncommitted is a patch with sortsupport routines
forbtree_gist contrib module.

> Here's its version which needs review.

Reviewing this now again. One thing caught my eye:

> +static int
> +gbt_bit_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
> +{
> +    return DatumGetInt32(DirectFunctionCall2(byteacmp,
> +                                             PointerGetDatum(a),
> +                                             PointerGetDatum(b)));
> +}

That doesn't quite match the sort order used by the comparison 
functions, gbt_bitlt and such. The comparison functions compare the bits 
first, and use the length as a tie-breaker. Using byteacmp() will 
compare the "bit length" first. However, gbt_bitcmp() also uses 
byteacmp(), so I'm a bit confused. So, huh?

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 апреля 2021 г., 13:23:42

On 07/04/2021 09:00, Heikki Linnakangas wrote:
> On 08/03/2021 19:06, Andrey Borodin wrote:
>> There were numerous GiST-build-related patches in this thread. Yet uncommitted is a patch with sortsupport routines
forbtree_gist contrib module.

>> Here's its version which needs review.

Committed with small fixes. I changed the all functions to use 
*GetDatum() and DatumGet*() macros, instead of just comparing Datums 
with < and >. Datum is unsigned, while int2, int4, int8 and money are 
signed, so that changes the sort order around 0 for those types to be 
the same as the picksplit and picksplit functions use. Not a correctness 
issue, the sorting only affects the quality of the index, but let's be tidy.

This issue remains:

> Reviewing this now again. One thing caught my eye:
> 
>> +static int
>> +gbt_bit_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
>> +{
>> +    return DatumGetInt32(DirectFunctionCall2(byteacmp,
>> +                                             PointerGetDatum(a),
>> +                                             PointerGetDatum(b)));
>> +}
> 
> That doesn't quite match the sort order used by the comparison
> functions, gbt_bitlt and such. The comparison functions compare the bits
> first, and use the length as a tie-breaker. Using byteacmp() will
> compare the "bit length" first. However, gbt_bitcmp() also uses
> byteacmp(), so I'm a bit confused. So, huh?

Since we used byteacmp() previously for picksplit, too, this is 
consistent with the order you got previously. It still seems wrong to me 
and should be investigated, but it doesn't need to block this patch.

- Heikki

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 апреля 2021 г., 14:56:42

On 07/04/2021 09:00, Heikki Linnakangas wrote:
> On 08/03/2021 19:06, Andrey Borodin wrote:
>> There were numerous GiST-build-related patches in this thread. Yet uncommitted is a patch with sortsupport routines
forbtree_gist contrib module.

>> Here's its version which needs review.
> 
> Reviewing this now again. One thing caught my eye:
> 
>> +static int
>> +gbt_bit_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
>> +{
>> +    return DatumGetInt32(DirectFunctionCall2(byteacmp,
>> +                                             PointerGetDatum(a),
>> +                                             PointerGetDatum(b)));
>> +}
> 
> That doesn't quite match the sort order used by the comparison
> functions, gbt_bitlt and such. The comparison functions compare the bits
> first, and use the length as a tie-breaker. Using byteacmp() will
> compare the "bit length" first. However, gbt_bitcmp() also uses
> byteacmp(), so I'm a bit confused. So, huh?

Ok, I think I understand that now. In btree_gist, the *_cmp() function 
operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf 
values. For all other datatypes, the leaf and non-leaf representation is 
the same, but for bit/varbit, the non-leaf representation is different. 
The leaf representation is VarBit, and non-leaf is just the bits without 
the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to 
just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len' 
field separately. That's subtle, and 100% uncommented.

What that means for this patch is that gbt_bit_sort_build_cmp() should 
*not* call byteacmp(), but bitcmp(). Because it operates on the original 
datatype stored in the table.

- Heikki

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

07 апреля 2021 г., 15:12:21


> 7 апр. 2021 г., в 13:23, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> Committed with small fixes.

Thanks!

> 7 апр. 2021 г., в 14:56, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>
> Ok, I think I understand that now. In btree_gist, the *_cmp() function operates on non-leaf values, and *_lt(),
*_gt()et al operate on leaf values. For all other datatypes, the leaf and non-leaf representation is the same, but for
bit/varbit,the non-leaf representation is different. The leaf representation is VarBit, and non-leaf is just the bits
withoutthe 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to just use byteacmp(), whereas
gbt_bitlt()et al compares the 'bit_len' field separately. That's subtle, and 100% uncommented. 
>
> What that means for this patch is that gbt_bit_sort_build_cmp() should *not* call byteacmp(), but bitcmp(). Because
itoperates on the original datatype stored in the table. 

+1
Thanks for investigating this.
If I understand things right, adding test values with different lengths of bit sequences would not uncover the problem
anyway?

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 апреля 2021 г., 16:18:53

On 07/04/2021 15:12, Andrey Borodin wrote:
>> 7 апр. 2021 г., в 14:56, Heikki Linnakangas <hlinnaka@iki.fi>
>> написал(а):
>> 
>> Ok, I think I understand that now. In btree_gist, the *_cmp()
>> function operates on non-leaf values, and *_lt(), *_gt() et al
>> operate on leaf values. For all other datatypes, the leaf and
>> non-leaf representation is the same, but for bit/varbit, the
>> non-leaf representation is different. The leaf representation is
>> VarBit, and non-leaf is just the bits without the 'bit_len' field.
>> That's why it is indeed correct for gbt_bitcmp() to just use
>> byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len' field
>> separately. That's subtle, and 100% uncommented.
>> 
>> What that means for this patch is that gbt_bit_sort_build_cmp()
>> should *not* call byteacmp(), but bitcmp(). Because it operates on
>> the original datatype stored in the table.
> 
> +1 Thanks for investigating this. If I understand things right,
> adding test values with different lengths of bit sequences would not
> uncover the problem anyway?

That's right, the only consequence of a "wrong" sort order is that the 
quality of the tree suffers, and scans need to scan more pages 
unnecessarily.

I tried to investigate this by creating a varbit index with and without 
sorting, and compared them with pageinspect, but in quick testing, I 
wasn't able to find cases where the sorted version was badly ordered. I 
guess I didn't find the right data set yet.

- Heikki

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

07 апреля 2021 г., 22:41:29


> 7 апр. 2021 г., в 16:18, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):
>

I see there is a problem with "SET client_min_messages = DEBUG1;" on buildfarm. I think these checks were necessary to
makesure test paths are triggered. When we know that code paths are tested, maybe we can omit checks? 

Best regards, Andrey Borodin.

Re: Yet another fast GiST build

От

Heikki Linnakangas

Дата:

07 апреля 2021 г., 23:18:26

On 07/04/2021 22:41, Andrey Borodin wrote:
> I see there is a problem with "SET client_min_messages = DEBUG1;" on
> buildfarm. I think these checks were necessary to make sure test
> paths are triggered. When we know that code paths are tested, maybe
> we can omit checks?
Yeah. We don't have very reliable coverage of different GiST build 
methods, as noted earlier in this thread. But that's not this patch's fault.

I've been investigating the varbit issue, and realized that all the 
comparison functions in this patch for varlen datatypes are broken. The 
representation that the sortsupport function sees is the one that the 
'compress' function returns. The fixed-length versions got this right, 
but the varlen versions assume that the input is a Datum of the original 
datatype. It happens to not crash, because the representation returned 
by gbt_var_compress() is also a varlena, and all of the comparison 
functions tolerate the garbage inputs, but it's bogus. The correct 
pattern would be something like this (without the debugging NOTICE, of 
course):

> static int
> gbt_text_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
> {
>     GBT_VARKEY_R ra = gbt_var_key_readable((GBT_VARKEY *) PG_DETOAST_DATUM(a));
>     GBT_VARKEY_R rb = gbt_var_key_readable((GBT_VARKEY *) PG_DETOAST_DATUM(b));
> 
>     int x = DatumGetInt32(DirectFunctionCall2Coll(bttextcmp,
>                                                  ssup->ssup_collation,
>                                                  PointerGetDatum(a),
>                                                  PointerGetDatum(b)));
>     elog(NOTICE, "cmp: %s vs %s: %d",
>          TextDatumGetCString(ra.lower),
>          TextDatumGetCString(rb.lower),
>          x);
>     return x;
> }
- Heikki

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

07 мая 2021 г., 16:05:41

Thanks for the investigation, Heikki.

> 8 апр. 2021 г., в 01:18, Heikki Linnakangas <hlinnaka@iki.fi> написThe correct pattern would be something like this
(withoutthe debugging NOTICE, of course): 
>
>> static int
>> gbt_text_sort_build_cmp(Datum a, Datum b, SortSupport ssup)
>> {
>>     GBT_VARKEY_R ra = gbt_var_key_readable((GBT_VARKEY *) PG_DETOAST_DATUM(a));
>>     GBT_VARKEY_R rb = gbt_var_key_readable((GBT_VARKEY *) PG_DETOAST_DATUM(b));
>>     int x = DatumGetInt32(DirectFunctionCall2Coll(bttextcmp,
>>                                                  ssup->ssup_collation,
>>                                                  PointerGetDatum(a),
>>                                                  PointerGetDatum(b)));
>>     elog(NOTICE, "cmp: %s vs %s: %d",
>>          TextDatumGetCString(ra.lower),
>>          TextDatumGetCString(rb.lower),
>>          x);
>>     return x;
>> }
>

In this pattern I flipped PointerGetDatum(a) to PointerGetDatum(ra.lower), because it seems to me correct. I've
followedrule of thumb: every sort function must extract and user "lower" somehow. Though I suspect numeric a bit. Is it
regularvarlena? 
PFA patchset with v6 intact + two fixes of discovered issues.

Thanks!

Best regards, Andrey Borodin.

Вложения

Re: Yet another fast GiST build

От

Emre Hasegeli

Дата:

05 июля 2021 г., 09:27:50

I tried reviewing the remaining patches.  It seems to work correctly,
and passes the tests on my laptop.

> In this pattern I flipped PointerGetDatum(a) to PointerGetDatum(ra.lower), because it seems to me correct. I've
followedrule of thumb: every sort function must extract and use "lower" somehow. Though I suspect numeric a bit. Is it
regularvarlena?
 

As far as I understand, we cannot use the sortsupport functions from
the btree operator classes because the btree_gist extension handles
things differently.  This is unfortunate and a source of bugs [1], but
we cannot do anything about it.

Given that the lower and upper datums must be the same for the leaf
nodes, it makes sense to me to compare one of them.

Using numeric_cmp() for numeric in line with using bttextcmp() for text.

> +   /*
> +    * Numeric has abbreviation routines in numeric.c, but we don't try to use
> +    * them here. Maybe later.
> +    */

This is also true for text.  Perhaps we should also add a comment there.

> PFA patchset with v6 intact + two fixes of discovered issues.

> +   /* Use byteacmp(), like gbt_bitcmp() does */

We can improve this comment by incorporating Heikki's previous email:

> Ok, I think I understand that now. In btree_gist, the *_cmp() function
> operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf
> values. For all other datatypes, the leaf and non-leaf representation is
> the same, but for bit/varbit, the non-leaf representation is different.
> The leaf representation is VarBit, and non-leaf is just the bits without
> the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to
> just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len'
> field separately. That's subtle, and 100% uncommented.

I think patch number 3 should be squashed to patch number 1.

I couldn't understand patch number 2 "Remove DEBUG1 verification".  It
seems like something rather useful.

[1] https://www.postgresql.org/message-id/flat/201010112055.o9BKtZf7011251%40wwwmaster.postgresql.org

Re: Yet another fast GiST build

От

Daniel Gustafsson

Дата:

17 ноября 2021 г., 16:33:41

> On 5 Jul 2021, at 08:27, Emre Hasegeli <emre@hasegeli.com> wrote:

> ...
> 
> I couldn't understand patch number 2 "Remove DEBUG1 verification".  It
> seems like something rather useful.

These questions have gone unanswered since July, and the patch fails to apply
anymore.  Is there an updated version on the way?

--
Daniel Gustafsson        https://vmware.com/

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

25 ноября 2021 г., 16:38:01

> 17 нояб. 2021 г., в 16:33, Daniel Gustafsson <daniel@yesql.se> написал(а):
>
>> On 5 Jul 2021, at 08:27, Emre Hasegeli <emre@hasegeli.com> wrote:
>
>> ...
>>
>> I couldn't understand patch number 2 "Remove DEBUG1 verification".  It
>> seems like something rather useful.

Emre, thanks for the review! And sorry for this delay. Properly answering questions is still in my queue.

>
> These questions have gone unanswered since July, and the patch fails to apply
> anymore.  Is there an updated version on the way?

Yes. In future versions I also want to address IOS vs pinned buffers issue[0]. And, probably, sort items on leaf pages.
And,maybe, split pages more intelligently. 
I hope to get to this in December.

I'll post rebased version ASAP.

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5qxWQ@mail.gmail.com

Re: Yet another fast GiST build

От

Andrey Borodin

Дата:

16 декабря 2021 г., 12:18:52

Hi Emre!

Thank you for the review.

> I tried reviewing the remaining patches. It seems to work correctly,
> and passes the tests on my laptop.
> 
>> In this pattern I flipped PointerGetDatum(a) to PointerGetDatum(ra.lower), because it seems to me correct. I've
followedrule of thumb: every sort function must extract and use "lower" somehow. Though I suspect numeric a bit. Is it
regularvarlena?
 
> 
> As far as I understand, we cannot use the sortsupport functions from
> the btree operator classes because the btree_gist extension handles
> things differently. This is unfortunate and a source of bugs [1], but
> we cannot do anything about it.
> 
> Given that the lower and upper datums must be the same for the leaf
> nodes, it makes sense to me to compare one of them.
> 
> Using numeric_cmp() for numeric in line with using bttextcmp() for text.
OK.

>> + /*
>> + * Numeric has abbreviation routines in numeric.c, but we don't try to use
>> + * them here. Maybe later.
>> + */
> 
> This is also true for text. Perhaps we should also add a comment there.
Done.

> 
>> PFA patchset with v6 intact + two fixes of discovered issues.
> 
>> + /* Use byteacmp(), like gbt_bitcmp() does */
> 
> We can improve this comment by incorporating Heikki's previous email:
> 
>> Ok, I think I understand that now. In btree_gist, the *_cmp() function
>> operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf
>> values. For all other datatypes, the leaf and non-leaf representation is
>> the same, but for bit/varbit, the non-leaf representation is different.
>> The leaf representation is VarBit, and non-leaf is just the bits without
>> the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to
>> just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len'
>> field separately. That's subtle, and 100% uncommented.
> 
Done.

> I think patch number 3 should be squashed to patch number 1.
All patches in the patchset expected to be squashed into 1 during commit.

> 
> I couldn't understand patch number 2 "Remove DEBUG1 verification". It
> seems like something rather useful.
If failed on buildfarm on some nodes. There were somewhat extroneous error messages.
Currently Step 1 and 2 are separete to ensure that opclasses are used correctly.

Thanks!

Best regards, Andrey Borodin.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Yet another fast GiST build

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения