Thread: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]


GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Hi, hackers!

I think there is some room for improving GiST inserts. Below is a
description of what I believe is a performance problem.

The function gistplacetopage in src/backend/access/gist/gist.c is
responsible for updating or appending a new tuple on a GiST page.
Currently, after checking whether a page split is necessary due to
overflow, it essentially executes the following:

               if (OffsetNumberIsValid(oldoffnum))
                       PageIndexTupleDelete(page, oldoffnum);
               gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);

That is: remove the old tuple if it's there, then place the updated tuple at the end.

On average, half of the page data has to be shifted by the memmove
inside the PageIndexTupleDelete() call, and half of the line pointers
(linp's) have to be corrected.

If the updated tuple has the same size as the one already residing on
the page, we can just overwrite it in place. The attached patch
demonstrates that concept. The attached test.sql inserts a million rows
into a GiST index based on the cube extension. My machine is a Hyper-V
VM running Ubuntu on an i5-2500 CPU with SSD storage. Before the patch,
the insert part of the test took 159 seconds on average; after applying
the patch, it takes 77 seconds on average. That is almost twice as fast
(for CPU/memory-bound inserts; a disk-constrained test will show no
improvement). But it works only for fixed-size tuple updates.
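To make the cost difference concrete, here is a toy, self-contained model of the two update paths (plain C, not PostgreSQL code; page size, tuple size, and layout are made up for illustration). The pre-patch path memmoves the whole tuple area between pd_upper and the deleted tuple to close the hole; a same-size overwrite touches only the tuple itself.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of an 8 kB page holding 100 fixed-size tuples packed
 * downward from the end of the page.  Purely illustrative: a real
 * PostgreSQL page also has a header, a line-pointer array, and a
 * special space, but the memmove cost behaves the same way. */
enum { PAGE_SIZE = 8192, ITEM_SIZE = 64, NITEMS = 100 };

/* Pre-patch path: PageIndexTupleDelete() compacts the tuple area over
 * the hole (one memmove of everything between pd_upper and the deleted
 * tuple), then the updated tuple is appended at the end.  Returns the
 * number of bytes the memmove had to shift. */
static size_t update_by_delete_then_append(int slot)
{
	char page[PAGE_SIZE];
	size_t upper = PAGE_SIZE - (size_t) NITEMS * ITEM_SIZE;
	size_t off = PAGE_SIZE - (size_t) (slot + 1) * ITEM_SIZE;

	memset(page, 0, sizeof page);
	/* shift everything between pd_upper and the deleted tuple */
	memmove(page + upper + ITEM_SIZE, page + upper, off - upper);
	/* re-append the updated tuple at the freed-up end of the area */
	memset(page + upper, 1, ITEM_SIZE);
	return off - upper;
}

/* Patched path for a same-size tuple: overwrite it where it sits.
 * No other tuple moves, so zero bytes are shifted. */
static size_t update_by_overwrite(int slot)
{
	char page[PAGE_SIZE];
	size_t off = PAGE_SIZE - (size_t) (slot + 1) * ITEM_SIZE;

	memset(page, 0, sizeof page);
	memset(page + off, 1, ITEM_SIZE);	/* in-place overwrite */
	return 0;
}
```

For an update of a random slot, the old path moves half the tuple area on average, while the overwrite path moves nothing; that is the asymmetry the patch exploits.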

I know that the code in the patch is far from beautiful: it operates
at three different levels of abstraction within five lines of code: the
low-level memmove(), the system-wide PageAddItem(), and the
GiST-private gistfillbuffer().

By the way, PageAddItem() has an overwrite flag, but it only works
with unused ItemIds. Marking the old ItemId as unused before
PageAddItem() didn't work for me. Unfortunately, the bufpage.c routines
do not include one for updating (replacing with a new) tuple on a page.
This is important to me because I'm working on an advanced GiST page
layout (
https://www.postgresql.org/message-id/CAJEAwVE0rrr%2BOBT-P0gDCtXbVDkBBG_WcXwCBK%3DGHo4fewu3Yg%40mail.gmail.com
); the current approach is to use skip-tuples, which can skip many
following tuples with one key check. Obviously, this design cares about
tuple order, and an update in the fashion of "place the updated tuple
at the end" won't work for me.

So I think it would be better to implement PageReplaceItem()
functionality to clean up the code, make existing GiST inserts faster,
and enable new advanced page layouts in indexes.

Current work is here https://github.com/x4m/pggistopt/tree/simplefix

What do you think about the performance problem described and the
attached patch? Am I missing something?



Best regards, Andrey Borodin, Octonica & Ural Federal University.

Attachments

Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Alexander Korotkov
Date:
Hi!

On Sun, Jul 3, 2016 at 12:24 PM, Andrew Borodin <borodin@octonica.com> wrote:
I think there is some room for improving GiST inserts. Below is a
description of what I believe is a performance problem.

The function gistplacetopage in src/backend/access/gist/gist.c is
responsible for updating or appending a new tuple on a GiST page.
Currently, after checking whether a page split is necessary due to
overflow, it essentially executes the following:

               if (OffsetNumberIsValid(oldoffnum))
                       PageIndexTupleDelete(page, oldoffnum);
               gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);

That is: remove the old tuple if it's there, then place the updated tuple at the end.

On average, half of the page data has to be shifted by the memmove
inside the PageIndexTupleDelete() call, and half of the line pointers
(linp's) have to be corrected.

If the updated tuple has the same size as the one already residing on
the page, we can just overwrite it in place. The attached patch
demonstrates that concept. The attached test.sql inserts a million rows
into a GiST index based on the cube extension. My machine is a Hyper-V
VM running Ubuntu on an i5-2500 CPU with SSD storage. Before the patch,
the insert part of the test took 159 seconds on average; after applying
the patch, it takes 77 seconds on average. That is almost twice as fast
(for CPU/memory-bound inserts; a disk-constrained test will show no
improvement). But it works only for fixed-size tuple updates.
 
Very promising results!

I know that the code in the patch is far from beautiful: it operates
at three different levels of abstraction within five lines of code: the
low-level memmove(), the system-wide PageAddItem(), and the
GiST-private gistfillbuffer().

By the way, PageAddItem() has an overwrite flag, but it only works
with unused ItemIds. Marking the old ItemId as unused before
PageAddItem() didn't work for me. Unfortunately, the bufpage.c routines
do not include one for updating (replacing with a new) tuple on a page.
This is important to me because I'm working on an advanced GiST page
layout (
https://www.postgresql.org/message-id/CAJEAwVE0rrr%2BOBT-P0gDCtXbVDkBBG_WcXwCBK%3DGHo4fewu3Yg%40mail.gmail.com
); the current approach is to use skip-tuples, which can skip many
following tuples with one key check. Obviously, this design cares about
tuple order, and an update in the fashion of "place the updated tuple
at the end" won't work for me.

So I think it would be better to implement PageReplaceItem()
functionality to clean up the code, make existing GiST inserts faster,
and enable new advanced page layouts in indexes.

+1 for PageReplaceItem()
Even if the item is not the same size, we can move the tail of the page once instead of twice.
I think you should implement PageReplaceItem() version and add it to the commitfest.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Hi!
>I think you should implement PageReplaceItem() version and add it to the commitfest.
Here is the patch.
I've called the function PageIndexTupleOverwrite() because it's
suitable only for indexes. It works in my tests, and performance is the
same as in the proof of concept (despite some sanity checks copied from
PageIndexTupleDelete).
Next weekend I'll do more testing, and then add it to the commitfest.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-03 15:21 GMT+05:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:

Attachments

Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Here is a patch with corrections from Alexander Korotkov.
My tests, as well as check and check-world, show no problems on an Ubuntu LTS 14 x64 VM.


Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 10:41 GMT+05:00 Andrew Borodin <borodin@octonica.com>:

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Amit Kapila
Date:
On Mon, Jul 4, 2016 at 4:52 PM, Andrew Borodin <borodin@octonica.com> wrote:
> Here is a patch with corrections from Alexander Korotkov.
> My tests, check and check-world show no problems against Ubuntu LTS 14 x64 VM.
>
>

- 	PageIndexTupleDelete(page, oldoffnum);
- 	gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ 	{
+ 		/*if we have just one tuple to update we replace it on-place on page*/
+ 		if(ntup == 1)
+ 		{
+ 			PageIndexTupleOverwrite(page,oldoffnum,*itup);
+ 		}
+ 		else
+ 		{
+ 			/*this corner case is here to support mix calls case (see comment above)*/
+ 			PageIndexTupleDelete(page, oldoffnum);
+ 			gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ 		}


Here, you have changed the way the tuple is added on a page, but the
WAL record is still the same as before.  I think WAL replay should
perform the action in the same way as it was done when the original
operation was performed.  If not, it can change the organization of the
page, which might lead to failure in the replay of subsequent WAL
records.

+ /*if we have just one tuple to update we replace it on-place on page*/

In comments, we always leave a space both at the beginning and at the
end of a comment.  See the comments in nearby code.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Thank you, Amit.

Currently, even if WAL replay reconstructs the page in a different order, it won't be a problem.
How can I check that it works? Would it be sufficient if I kill -9 the backend and postmaster after a commit, and it starts normally executing select queries after restart?

I'll make a patch with the missing spaces later. I also found some more missing spaces in the PageIndexTupleOverwrite call and a typo in the comments ("on-place").

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 18:58 GMT+05:00 Amit Kapila <amit.kapila16@gmail.com>:

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Alvaro Herrera
Date:
Andrew Borodin wrote:
> Thank you, Amit.
> 
> Currently, even if WAL replay reconstructs the page in a different order, it won't be a problem.

We require that replay of WAL lead to bytes identical to the
original.  Among other things, this enables the use of a tool that
verifies that WAL is working correctly simply by comparing page images.
So please fix it rather than just verifying that this works for GiST.

By the way, BRIN indexes need this operation too.  The current
approach is to call PageIndexDeleteNoCompact followed by PageAddItem.
I suppose it would be beneficial to use your new routine there too.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Thank you, Alvaro.

Yes, now I see. I'll update gistRedoPageUpdateRecord() to work
in accordance with the patched gistplacetopage().
Also, I've noticed that PageAddItem uses the MAXALIGN() macro to
calculate the tuple size, so PageIndexTupleOverwrite should behave the same.

About PageIndexDeleteNoCompact() in BRIN: I do not see why
PageIndexDeleteNoCompact() is not a good solution compared to
PageIndexTupleOverwrite. Will it mark the tuple as unused until the
next PageAddItem uses its place?

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 22:16 GMT+05:00 Alvaro Herrera <alvherre@2ndquadrant.com>:



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Here is a patch with updated WAL behavior.

I'm not certain about MAXALIGN for the size of the appended tuple.
Currently, if anyone calls PageIndexTupleOverwrite() with an unaligned
tuple, it will ereport(ERROR).
I suspect that it should align the size instead.

Also, I suspect that PageIndexTupleOverwrite() should call
ItemIdSetNormal() so that it is a truly generic function and not just a
private part of GiST.
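For readers unfamiliar with the macro under discussion: MAXALIGN rounds a length up to the platform's maximum alignment. Below is a self-contained rendering of the idea (the real definitions live in PostgreSQL's src/include/c.h; MAXIMUM_ALIGNOF is normally set by configure and is hard-coded to 8 here, as on most 64-bit platforms):

```c
#include <stddef.h>
#include <stdint.h>

/* Self-contained rendering of PostgreSQL-style alignment macros;
 * MAXIMUM_ALIGNOF = 8 is an assumption for illustration. */
#define MAXIMUM_ALIGNOF 8
#define TYPEALIGN(ALIGNVAL, LEN) \
	(((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
#define MAXALIGN(LEN) TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))

/* A length is acceptable to the patch as posted only if MAXALIGN would
 * leave it unchanged; otherwise PageIndexTupleOverwrite() errors out
 * instead of rounding the size up. */
static int is_maxaligned(size_t len)
{
	return MAXALIGN(len) == len;
}
```

The open question in the message above is exactly whether the function should reject unaligned sizes (is_maxaligned() == 0) or round them up itself, the way PageAddItem does.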

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-04 22:59 GMT+05:00 Andrew Borodin <borodin@octonica.com>:

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Here is a new patch. Updates:
1. Uses MAXALIGN to allocate space on the page.
2. Uses ItemIdSetNormal in case the ItemId was not normal for some
reason before the call.
3. Improved comments and variable names.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-05 17:54 GMT+05:00 Andrew Borodin <borodin@octonica.com>:

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
One more thing came to my mind:
WAL records written by GiST before the patch will corrupt a GiST index
after the patch if replayed.
Is this a problem? Maybe it's prevented at some higher level?

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-06 22:11 GMT+05:00 Andrew Borodin <borodin@octonica.com>:



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Michael Paquier
Date:
On Fri, Jul 8, 2016 at 2:00 PM, Andrew Borodin <borodin@octonica.com> wrote:

(Please don't top-post on threads, this is annoying)

> One more thing came to my mind:
> WAL records done by the GiST before patch will corrupt GiST after
> patch if replayed.
> Is it a problem? May be it's prevented on some higher level?

If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is
bumped to prevent any problems.
-- 
Michael



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
>If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is bumped to prevent any problems.
I've attached a patch with a bump, but obviously it will be irrelevant
after any other commit that changes the WAL shape.

Here is a patch with an updated GiST README.
I haven't found an appropriate place to describe the
PageIndexTupleOverwrite function in the docs, since all of its siblings
are described only in the code.
The code comments describing this function are consistent with the
others (PageAddItem, PageIndexTupleDelete).

Best regards, Andrey Borodin, Octonica & Ural Federal University.

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Michael Paquier
Date:
On Sun, Jul 24, 2016 at 7:18 PM, Andrew Borodin <borodin@octonica.com> wrote:
>>If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is bumped to prevent any problems.
>
> I've attached a patch with a bump, but obviously it will be irrelevant
> after any other commit that changes the WAL shape.

Usually the committer in charge of reviewing such a patch would bump
it. There is no need for the patch submitter to do so. I should have
been more precise previously, sorry for my twisted words.
-- 
Michael



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sun, Jul 24, 2016 at 7:18 PM, Andrew Borodin <borodin@octonica.com> wrote:
>> I've attached a patch with a bump, but obviously it will be irrelevant
>> after any other commit that changes the WAL shape.

> Usually the committer in charge of reviewing such a patch would bump
> it. There is no need for the patch submitter to do so. I should have
> been more precise previously, sorry for my twisted words.

It's good to remind the committer that such a bump is needed, of course.
But yeah, casting the reminder in the form of a hunk of the patch is
more likely to cause trouble than be helpful.
        regards, tom lane



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Alvaro Herrera
Date:
Can you please patch BRIN to use this new function too?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
>Can you please patch BRIN to use this new function too?
AFAIK, Sergey Mirvoda was going to implement and test it.
It is easy to do this optimization for brin_samepage_update (see the
draft patch in the attachment; make check passes), but the WAL logging
of a regular BRIN update is not clear to me.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
>Can you please patch BRIN to use this new function too?

On my machine, replacing both BRIN update cases (see
https://github.com/x4m/pggistopt/commit/a6d301ff79104b977619339d53aebf748045418a
) showed no performance change on the following update query (the
update takes about 6 seconds on average):
create table dataTable(x int, y int);
insert into dataTable(x,y) select x,y from generate_series(1,1e3,1)
x,generate_series(1,1e3,1) y;
create index idx on dataTable using brin(x,y);
update datatable set x = random()*1024, y = random()*1024;

https://gist.github.com/x4m/7e69fd924b9ffd2fdc9c6100e741171d

Probably I was looking in the wrong place. I do not see other cases
where PageIndexTupleOverwrite can improve BRIN performance. Still, I'll
make PageIndexTupleOverwrite BRIN-compatible in the forthcoming patch
version: BRIN tuples have no length in the header, so it must be taken
as a parameter, just as PageAddItem does.


Best regards, Andrey Borodin, Octonica & Ural Federal University.



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From:
Andrew Borodin
Date:
Here is a BRIN-compatible version of the patch. The function
PageIndexTupleOverwrite now takes the tuple size as a parameter; this
behavior is consistent with PageAddItem.
Also, I had to remove the assertion that the item has storage in the
loop that adjusts the line pointers. It would be great if someone could
check that place (the commented-out Assert(ItemIdHasStorage(ii)); in
PageIndexTupleOverwrite). I suspect that there might be a case when
linp's should not be adjusted.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Anastasia Lubennikova
Date
30.07.2016 14:00, Andrew Borodin:
> Here is BRIN-compatible version of patch. Now function
> PageIndexTupleOverwrite consumes tuple size as a parameter, this
> behavior is coherent with PageAddItem.
> Also, I had to remove the assertion that an item has storage in the loop
> that adjusts line pointers. It would be great if someone could check
> that place (commented Assert(ItemIdHasStorage(ii)); in
> PageIndexTupleOverwrite). I suspect that there might be a case when
> linp's should not be adjusted.

Hi, I reviewed the code and I have a couple of questions about the
following code:
    /* may have negative size here if new tuple is larger */
    size_diff = oldsize - alignednewsize;
    offset = ItemIdGetOffset(tupid);

    if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
        offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("corrupted item offset: offset = %u, size = %u",
                        offset, (unsigned int) size_diff)));

First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
Although, I'm quite sure that it was already aligned somewhere before.

I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
I'd rather add the check: (offset + size_diff < pd_lower).

Besides that moment, the patch seems pretty good.

BTW, I'm very surprised that it improves performance so much.
Also, the index size is reduced significantly: 89MB against 289MB without the patch.
Could you explain in detail why this happens?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Andrew Borodin
Date
Anastasia, thank you for your attentive code review.

2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova <a.lubennikova@postgrespro.ru>:
> First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
> Although, I'm quite sure that it was already aligned somewhere before.
> I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
> I'd rather add  the check: (offset+size_diff < pd_lower).
Actually, that's a tricky question; there may be different code expectations.
1. If we expect that a non-maxaligned tuple may be placed by some other
routine, we should remove the check (size_diff != MAXALIGN(size_diff)),
since it will fail for a non-maxaligned tuple.
2. If we expect that no one should use PageIndexTupleOverwrite with
non-maxaligned tuples, then the current checks are OK: we will break
execution if we find non-maxaligned tuples, with an almost correct
message.
I suggest that this check could be a debug-only assertion: in production
code the routine will work with non-maxaligned tuples if they already
reside on the page, while in a dev build it will inform the developer that
something is going wrong. Is this behavior Postgres-style compliant?


I agree that the pd_lower check makes sense.

> BTW, I'm very surprised that it improves performance so much.
> And also size is reduced significantly. 89MB against 289MB without patch.
> Could you explain in details, why does it happen?
The size reduction is unexpected for me.
There are 4 plausible explanations; I'll list them in descending order of
probability:
1. Before this patch every update threw the recently updated tuple to the
end of a page. Thus, in somewhat ordered data, the recent best fit is
discovered last. This can cause increased MBB overlap in a spatial index
and a slightly higher tree. But a 3x size decrease is unlikely.
How did you obtain those results? Can I look at a test case?
2. A bug in PageIndexTupleDelete causing unused space to appear. I've
searched for it, unsuccessfully.
3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a
bug. Maybe we are failing to place something big but not very important on
a page?
4. Magic.
I do not see what one could do to the R-tree to change its size by a
factor of 3. The first three explanations are not much better than the
fourth, actually.
Those 89 MB do not include WAL, right?

Thank you for the review.


Best regards, Andrey Borodin, Octonica & Ural Federal University.



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Anastasia Lubennikova
Date
06.08.2016 19:51, Andrew Borodin:
> Anastasia , thank you for your attentive code examine.
>
> 2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova <a.lubennikova@postgrespro.ru>:
>> First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
>> Although, I'm quite sure that it was already aligned somewhere before.
>> I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
>> I'd rather add  the check: (offset+size_diff < pd_lower).
> Actually, that's a tricky question. There may be different code expectations.
> 1. If we expect that not-maxaligned tuple may be placed by any other
> routine, we should remove check (size_diff != MAXALIGN(size_diff)),
> since it will fail for not-maxaligned tuple.
> 2. If we expect that no one should use PageIndexTupleOverwrite with
> not-maxaligned tuples, then the current checks are OK: we will break
> execution if we find non-maxaligned tuples. With an almost correct
> message.
> I suggest that this check may be debug-only assertion: in a production
> code routine will work with not-maxaligned tuples if they already
> reside on the page, in a dev build it will inform dev that something
> is going wrong. Is this behavior Postgres-style compliant?
>
>
> I agree that pd_lower check makes sense.

In pointing out this comparison I was worried about a possible call of
MAXALIGN on a negative size_diff value, although I'm not sure whether it
can cause any problem.

Tuples on a page are always maxaligned.
But, as far as I recall, itemid->lp_len can contain a non-maxaligned value.

So, I'd suggest performing MAXALIGN(oldsize) before computing size_diff:
size_diff = oldsize - alignednewsize;

Since both arguments are then maxaligned, the check of size_diff is not needed.

>> BTW, I'm very surprised that it improves performance so much.
>> And also size is reduced significantly. 89MB against 289MB without patch.
>> Could you explain in details, why does it happen?
> Size reduction is unexpected for me.
> There might be 4 plausible explanations. I'll list them in descending
> order of probability:
> 1. Before this patch every update was throwing recently updated tuple
> to the end of a page. Thus, in some-how ordered data, recent best-fit
> will be discovered last. This can cause increase of MBB's overlap in
> spatial index and slightly higher tree. But 3 times size decrease is
> unlikely.
> How did you obtain those results? Can I look at a test case?

Nothing special, I just ran the test you provided in this thread
and checked the index size via select pg_size_pretty(pg_relation_size('idx'));

> 2. A bug in PageIndexTupleDelete causing unused space to appear. I've
> searched for it, unsuccessfully.
> 3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a
> bug. Maybe we are failing to place something big but not very important on
> a page?
> 4. Magic.
> I do not see what one could do to the R-tree to change its size by a
> factor of 3. The first three explanations are not much better than the
> fourth, actually.
> Those 89 MB do not include WAL, right?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Andrew Borodin
Date
Here is a new version of the patch; it now includes the recommendations
from Anastasia Lubennikova.


I've investigated the anomalous index size decrease. The most probable
explanation turned out to be true.
The cube extension, like some others, uses Guttman's polynomial-time split
algorithm. It is known to generate "needle-like" MBBs (MBRs) for
sorted data due to imbalanced splits (splitting 100 tuples as 98 to 2).
Due to the previous throw-to-the-end behavior of GiST this imbalance
was further amplified (most inserts were going to the bigger part
after a split). So GiST inserts were extremely slow for sorted data.
There is no need for exact sorting to trigger this behavior.
This behavior can be defused by implementing the small-m restriction in
the split (AFAIR this is described in the original R-tree paper from '84),
which prescribes doing a split into parts no smaller than m, where m
is around 20% of a page's capacity (in number of tuples). After application
of this patch, performance for sorted data is roughly the same as
performance for randomized data.

If the data is randomized, this effect of the Guttman poly-time split does
not apply; the test from the start of the thread shows no statistically
confident improvement from the patch.
To prove the performance increase in the randomized case I've composed a
modified test: https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1
Five runs of this test show the following:
Before patch
Insert Time AVG 78 seconds STD 9.5
After patch
Insert Time AVG 68 seconds STD 7.7
This is a marginal but statistically significant performance improvement.

For sorted data performance is improved by a factor of 3.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Anastasia Lubennikova
Date
09.08.2016 19:45, Andrew Borodin:
> Here is new version of the patch, now it includes recommendations from
> Anastasia Lubennikova.
>
>
> I've investigated anomalous index size decrease. Most probable version
> appeared to be true.
> Cube extension, as some others, use Guttman's polynomial time split
> algorithm. It is known to generate "needle-like" MBBs (MBRs) for
> sorted data due to imbalanced splits (splitting 100 tuples as 98 to
> 2). Due to previous throw-to-the-end behavior of GiST this imbalance
> was further amplified (most of inserts were going to bigger part
> after split). So GiST inserts were extremely slow for sorted data.
> There is no need to do exact sorting to trigger this behavior.
> This behavior can be defused by implementation of small-m restriction in
> split (AFAIR this is described in original R-tree paper from 84),
> which prescribes to do a split in a parts no smaller than m, where m
> is around 20% of a page capacity (in tuples number). After application
> of this patch performance for sorted data is roughly the same as
> performance for randomized data.

Thank you for the explanation; now it's clear to me. I did some more testing
and found nothing special. The declared feature is implemented correctly.
It passes all regression tests and improves performance.

I still have a few minor nitpicks about the patch style.
You can find a style guide on 
https://www.postgresql.org/docs/9.6/static/source.html

1) remove extra whitespace in README

2) add whitespace: if(ntup == 1)

3) fix comments in accordance with coding conventions

/* In case of single tuple update GiST calls overwrite
 * In all other cases function gistplacetopage deletes
 * old tuples and place updated at the end
 */


+            /* Normally here was following assertion
+             * Assert(ItemIdHasStorage(ii));
+             * This assertion was introduced in PageIndexTupleDelete
+             * But if this function will be used from BRIN index
+             * this assertion will fail. Thus, here we do not check that
+             * item has the storage.
+             * */

4) remove unrelated changes
-            data += sizeof(OffsetNumber) * xldata->ntodelete;
+            data += sizeof(OffsetNumber) *xldata->ntodelete;

5) If the comment is correct, maxalignment is not necessary.
+    /* tuples on a page are always maxaligned */
+    oldsize = MAXALIGN(oldsize);

6) I'd rather use alignednewsize here.
+    ItemIdSetNormal(tupid, offset + size_diff, newsize);


After the cleanup you can change status to "Ready for Committer"
without waiting for the response.

> If data is randomized this effect of Guttman poly-time split is not in
> effect; test from the start of the thread will show no statistically
> confident improvement by the patch.
> To prove performance increase in randomized case I've composed
> modified test https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1
> This test with five runs shows following:
> Before patch
> Insert Time AVG 78 seconds STD 9.5
> After patch
> Insert Time AVG 68 seconds STD 7.7
> This is marginal but statistically viable performance improvement.
>
> For sorted data performance is improved by a factor of 3.
> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Andrew Borodin
Date
Thank you for your corrections.
Here is the patch with the suggestions taken into account, except the 6th:
>6) I'd rather use alignednewsize here.
> +    ItemIdSetNormal(tupid, offset + size_diff, newsize);
This behavior is according to the ubiquitous PageAddItem.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-08-16 20:23 GMT+05:00 Anastasia Lubennikova <a.lubennikova@postgrespro.ru>:
> 09.08.2016 19:45, Andrew Borodin:
>>
>> Here is new version of the patch, now it includes recommendations from
>> Anastasia Lubennikova.
>>
>>
>> I've investigated anomalous index size decrease. Most probable version
>> appeared to be true.
>> Cube extension, as some others, use Guttman's polynomial time split
>> algorithm. It is known to generate "needle-like" MBBs (MBRs) for
>> sorted data due to imbalanced splits (splitting 100 tuples as 98 to
>> 2). Due to previous throw-to-the-end behavior of GiST this imbalance
>> was further amplificated (most of inserts were going to bigger part
>> after split). So GiST inserts were extremely slow for sorted data.
>> There is no need to do exact sorting to trigger this behavior.
>> This behavior can be defused by implementation of small-m restriction in
>> split (AFAIR this is described in original R-tree paper from 84),
>> which prescribes to do a split in a parts no smaller than m, where m
>> is around 20% of a page capacity (in tuples number). After application
>> of this patch performance for sorted data is roughly the same as
>> performance for randomized data.
>
>
> Thank you for explanation. Now it's clear to me. I did some more testing and
> found nothing special. The declared feature is implemented correctly.
> It passes all regression tests and improves performance.
>
> I still have a few minor nitpicks about the patch style.
> You can find a style guide on
> https://www.postgresql.org/docs/9.6/static/source.html
>
> 1) remove extra whitespace in README
>
> 2) add whitespace: if(ntup == 1)
>
> 3) fix comments in accordance with coding conventions
>
> /* In case of single tuple update GiST calls overwrite
>  * In all other cases function gistplacetopage deletes
>  * old tuples and place updated at the end
>  *  */
>
>
> +            /* Normally here was following assertion
> +             * Assert(ItemIdHasStorage(ii));
> +             * This assertion was introduced in PageIndexTupleDelete
> +             * But if this function will be used from BRIN index
> +             * this assertion will fail. Thus, here we do not check that
> +             * item has the storage.
> +             * */
>
> 4) remove unrelated changes
> -            data += sizeof(OffsetNumber) * xldata->ntodelete;
> +            data += sizeof(OffsetNumber) *xldata->ntodelete;
>
> 5) If the comment is correct, maxalignment is not necessary.
> +    /* tuples on a page are always maxaligned */
> +    oldsize = MAXALIGN(oldsize);
>
> 6) I'd rather use alignednewsize here.
>  +    ItemIdSetNormal(tupid, offset + size_diff, newsize);
>
>
> After the cleanup you can change status to "Ready for Committer"
> without waiting for the response.
>
>> If data is randomized this effect of Guttman poly-time split is not in
>> effect; test from the start of the thread will show no statistically
>> confident improvement by the patch.
>> To prove performance increase in randomized case I've composed
>> modified test https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1
>> This test with five runs shows following:
>> Before patch
>> Insert Time AVG 78 seconds STD 9.5
>> After patch
>> Insert Time AVG 68 seconds STD 7.7
>> This is marginal but statistically viable performance improvement.
>>
>> For sorted data performance is improved by a factor of 3.
>> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>>
>
> --
> Anastasia Lubennikova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>

Attachments

Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Tom Lane
Date
Andrew Borodin <borodin@octonica.com> writes:
> Thank you for your corrections.
> Here is the patch with suggestions taken into account, except 6th.

I'll pick this up, unless some other committer is already on it.

I think that we should buy back the addition of PageIndexTupleOverwrite
to bufpage.c by getting rid of PageIndexDeleteNoCompact, as suggested
earlier by Alvaro.  The latter seems a lot messier in concept, and it's
more general than BRIN needs.  It also kinda looks like we could get
rid of PAI_ALLOW_FAR_OFFSET, which is a messy kluge in itself.
        regards, tom lane



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Alvaro Herrera
Date
Tom Lane wrote:
> Andrew Borodin <borodin@octonica.com> writes:
> > Thank you for your corrections.
> > Here is the patch with suggestions taken into account, except 6th.
> 
> I'll pick this up, unless some other committer is already on it.
> 
> I think that we should buy back the addition of PageIndexTupleOverwrite
> to bufpage.c by getting rid of PageIndexDeleteNoCompact, as suggested
> earlier by Alvaro.  The latter seems a lot messier in concept, and it's
> more general than BRIN needs.

Yeah, BRIN only wants to remove one item nowadays; that function was
written when the BRIN code wanted to remove multiple items in one go.

> It also kinda looks like we could get rid of PAI_ALLOW_FAR_OFFSET,
> which is a messy kluge in itself.

That'd be nice.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Tom Lane
Date
Hey Alvaro, can you comment on this bit in the proposed patch?

+        for (i = FirstOffsetNumber; i <= itemcount; i++)
+        {
+            ItemId        ii = PageGetItemId(phdr, i);
+
+            /* Normally here was following assertion
+             * Assert(ItemIdHasStorage(ii));
+             * This assertion was introduced in PageIndexTupleDelete
+             * But if this function will be used from BRIN index
+             * this assertion will fail. Thus, here we do not check that
+             * item has the storage.
+             */
+            if (ItemIdGetOffset(ii) <= offset)
+                ii->lp_off += size_diff;
+        }
+    }

That comment seems utterly wrong to me, because both PageIndexTupleDelete
and PageIndexMultiDelete certainly contain assertions that every item on
the page has storage.  Are you expecting that any BRIN index items
wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
have storage sensible?
        regards, tom lane



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Alvaro Herrera
Date
Tom Lane wrote:
> Hey Alvaro, can you comment on this bit in the proposed patch?
> 
> +        for (i = FirstOffsetNumber; i <= itemcount; i++)
> +        {
> +            ItemId        ii = PageGetItemId(phdr, i);
> +
> +            /* Normally here was following assertion
> +             * Assert(ItemIdHasStorage(ii));
> +             * This assertion was introduced in PageIndexTupleDelete
> +             * But if this function will be used from BRIN index
> +             * this assertion will fail. Thus, here we do not check that
> +             * item has the storage.
> +             */
> +            if (ItemIdGetOffset(ii) <= offset)
> +                ii->lp_off += size_diff;
> +        }
> +    }
> 
> That comment seems utterly wrong to me, because both PageIndexTupleDelete
> and PageIndexMultiDelete certainly contain assertions that every item on
> the page has storage.  Are you expecting that any BRIN index items
> wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
> have storage sensible?

It is possible in BRIN to have empty intermediate tuples; for example it
is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.

This is a tricky case to reproduce, but I think this should do it:
consider a table with pages_per_range=1.  Page 1 is summarized by index
tuple 1, page 2 is summarized by itup 2, page 3 is summarized by index
tuple 3.  On heap page 2 a new tuple is inserted and the summary data is
now much larger, such that the summary tuple no longer fits in the index
page, so it is moved to a new index page.  Then index tuple 2 in the
first index page becomes unused.  Revmap still points to lp 3, so it
can't be moved to lp 2.  The lp_offs can be adjusted, as long as the lp
itself is not moved from its original position.

Now if this loop is concerned only with live lps and does not move lps,
then it should be fine to add the assertion.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Tom Lane
Date
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> That comment seems utterly wrong to me, because both PageIndexTupleDelete
>> and PageIndexMultiDelete certainly contain assertions that every item on
>> the page has storage.  Are you expecting that any BRIN index items
>> wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
>> have storage sensible?

> It is possible in BRIN to have empty intermediate tuples; for example it
> is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.

Hm.  So apparently, the only reason this stuff works at all is that
BRIN isn't using either PageIndexTupleDelete or PageIndexMultiDelete.

> Now if this loop is concerned only with live lps and does not move lps,
> then it should be fine to add the assertion.

No, it iterates over all lps on the page.  I'm inclined to think it
should be written like
    if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
        ii->lp_off += size_diff;

because futzing with the lp_off field in an item that isn't really
pointing at storage feels wrong.  We might need to do that to
PageIndexTupleDelete and/or PageIndexMultiDelete someday, too.

I notice that PageIndexDeleteNoCompact, which seems to express what
BRIN is expecting in a rather underdocumented way, forces any
items without storage into "unused" state.  I don't really think
it's bufpage.c's place to do that, though.  Do you think that code
is actually supposed to fire, or is it just there for lack of a
better idea?
        regards, tom lane



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Alvaro Herrera
Date
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> That comment seems utterly wrong to me, because both PageIndexTupleDelete
> >> and PageIndexMultiDelete certainly contain assertions that every item on
> >> the page has storage.  Are you expecting that any BRIN index items
> >> wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
> >> have storage sensible?
> 
> > It is possible in BRIN to have empty intermediate tuples; for example it
> > is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.
> 
> Hm.  So apparently, the only reason this stuff works at all is that
> BRIN isn't using either PageIndexTupleDelete or PageIndexMultiDelete.

Yes, this is why the NoCompact variant was introduced in the first
place.

> > Now if this loop is concerned only with live lps and does not move lps,
> > then it should be fine to add the assertion.
> 
> No, it iterates over all lps on the page.  I'm inclined to think it
> should be written like
> 
>     if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
>         ii->lp_off += size_diff;
> 
> because futzing with the lp_off field in an item that isn't really
> pointing at storage feels wrong.  We might need to do that to
> PageIndexTupleDelete and/or PageIndexMultiDelete someday, too.

I suppose it is conceivable that we start using lp_off for other
purposes in the future, so I don't disagree.  I don't think index pages
currently do any funny business with it.

> I notice that PageIndexDeleteNoCompact, which seems to express what
> BRIN is expecting in a rather underdocumented way, forces any
> items without storage into "unused" state.  I don't really think
> it's bufpage.c's place to do that, though.  Do you think that code
> is actually supposed to fire, or is it just there for lack of a
> better idea?

I just put it there only because I didn't see any reason not to, really.
I don't think BRIN relies on it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Andrey Borodin
Date
That storage assertion fired during a usual "update table set x=random()"
without conditions. Also, make check fails without it (for BRIN usage;
GiST is OK with it).
I cannot type the quotation properly; I'm on a phone for a long time.

Regards, Andrey Borodin.


Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Tom Lane
Date
Andrey Borodin <borodin@octonica.com> writes:
> That storage assertion fired during usual update table set x=random()
> without conditions. Also make check fails without it (for brin usage,
> gist is ok with it).

I'm confused by that assertion, because the patch-as-submitted didn't
touch BRIN.  Based on Alvaro's comments, yes it would fail if we'd
modified BRIN to use this function ... but we hadn't yet.
        regards, tom lane



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Tom Lane
Date
Here's a draft patch that replaces BRIN's use of PageIndexDeleteNoCompact
and an immediately following PageAddItemExtended with
PageIndexTupleOverwrite.  It turns out that there are two use-cases in
BRIN for PageIndexDeleteNoCompact, and the other one can't be replaced
because while there is a PageAddItem call beside it, that one is for a
different page.  But this is still enough to let us get rid of
PAI_ALLOW_FAR_OFFSET, which seems like a good thing.

Actually, with this patch, PageAddItemExtended isn't directly referenced
anywhere, so we could revert that API and fold it back into PageAddItem,
saving a nanosecond or two per call.  I'm inclined to leave that alone,
though, since I agree with the underlying thought that any future
extensions to PageAddItem's functionality would be better handled by more
bits in a flags bitmask.  Maybe a better idea is to get rid of the call
overhead by turning PageAddItem into a macro.  In any case, doing
something about that could be a separate patch.

It's also kind of tempting to rewrite PageIndexDeleteNoCompact into a
function that only deletes one tuple, and presumably is simpler and faster
than what's there.  But maybe there's something coming down the pike that
needs the extra generality?  That should also be a separate patch, anyway.

This passes make check-world but I'm not sure how heavily that stresses
BRIN ... are there any other tests you'd want to see run before
committing?

            regards, tom lane

diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 6ebfedd..24ae38b 100644
*** a/src/backend/access/brin/brin_pageops.c
--- b/src/backend/access/brin/brin_pageops.c
*************** brin_doupdate(Relation idxrel, BlockNumb
*** 178,187 ****
          }

          START_CRIT_SECTION();
!         PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
!         if (PageAddItemExtended(oldpage, (Item) newtup, newsz, oldoff,
!                 PAI_OVERWRITE | PAI_ALLOW_FAR_OFFSET) == InvalidOffsetNumber)
!             elog(ERROR, "failed to add BRIN tuple");
          MarkBufferDirty(oldbuf);

          /* XLOG stuff */
--- 178,185 ----
          }

          START_CRIT_SECTION();
!         if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) newtup, newsz))
!             elog(ERROR, "failed to replace BRIN tuple");
          MarkBufferDirty(oldbuf);

          /* XLOG stuff */
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 27ba0a9..ed14729 100644
*** a/src/backend/access/brin/brin_xlog.c
--- b/src/backend/access/brin/brin_xlog.c
*************** brin_xlog_samepage_update(XLogReaderStat
*** 189,202 ****
          page = (Page) BufferGetPage(buffer);

          offnum = xlrec->offnum;
-         if (PageGetMaxOffsetNumber(page) + 1 < offnum)
-             elog(PANIC, "brin_xlog_samepage_update: invalid max offset number");

!         PageIndexDeleteNoCompact(page, &offnum, 1);
!         offnum = PageAddItemExtended(page, (Item) brintuple, tuplen, offnum,
!                                      PAI_OVERWRITE | PAI_ALLOW_FAR_OFFSET);
!         if (offnum == InvalidOffsetNumber)
!             elog(PANIC, "brin_xlog_samepage_update: failed to add tuple");

          PageSetLSN(page, lsn);
          MarkBufferDirty(buffer);
--- 189,197 ----
          page = (Page) BufferGetPage(buffer);

          offnum = xlrec->offnum;

!         if (!PageIndexTupleOverwrite(page, offnum, (Item) brintuple, tuplen))
!             elog(PANIC, "brin_xlog_samepage_update: failed to replace tuple");

          PageSetLSN(page, lsn);
          MarkBufferDirty(buffer);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index bce0d53..08e222e 100644
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
*************** PageIsVerified(Page page, BlockNumber bl
*** 166,186 ****
   *    inserted, or InvalidOffsetNumber if the item is not inserted for any
   *    reason.  A WARNING is issued indicating the reason for the refusal.
   *
!  *    If flag PAI_OVERWRITE is set, we just store the item at the specified
!  *    offsetNumber (which must be either a currently-unused item pointer,
!  *    or one past the last existing item).  Otherwise,
!  *    if offsetNumber is valid and <= current max offset in the page,
!  *    insert item into the array at that position by shuffling ItemId's
!  *    down to make room.
!  *    If offsetNumber is not valid, then assign one by finding the first
   *    one that is both unused and deallocated.
   *
   *    If flag PAI_IS_HEAP is set, we enforce that there can't be more than
   *    MaxHeapTuplesPerPage line pointers on the page.
   *
-  *    If flag PAI_ALLOW_FAR_OFFSET is not set, we disallow placing items
-  *    beyond one past the last existing item.
-  *
   *    !!! EREPORT(ERROR) IS DISALLOWED HERE !!!
   */
  OffsetNumber
--- 166,189 ----
   *    inserted, or InvalidOffsetNumber if the item is not inserted for any
   *    reason.  A WARNING is issued indicating the reason for the refusal.
   *
!  *    offsetNumber must be either InvalidOffsetNumber to specify finding a
!  *    free item pointer, or a value between FirstOffsetNumber and one past
!  *    the last existing item, to specify using that particular item pointer.
!  *
!  *    If offsetNumber is valid and flag PAI_OVERWRITE is set, we just store
!  *    the item at the specified offsetNumber, which must be either a
!  *    currently-unused item pointer, or one past the last existing item.
!  *
!  *    If offsetNumber is valid and flag PAI_OVERWRITE is not set, insert
!  *    the item at the specified offsetNumber, moving existing items later
!  *    in the array to make room.
!  *
!  *    If offsetNumber is not valid, then assign a slot by finding the first
   *    one that is both unused and deallocated.
   *
   *    If flag PAI_IS_HEAP is set, we enforce that there can't be more than
   *    MaxHeapTuplesPerPage line pointers on the page.
   *
   *    !!! EREPORT(ERROR) IS DISALLOWED HERE !!!
   */
  OffsetNumber
*************** PageAddItemExtended(Page page,
*** 267,277 ****
          }
      }

!     /*
!      * Reject placing items beyond the first unused line pointer, unless
!      * caller asked for that behavior specifically.
!      */
!     if ((flags & PAI_ALLOW_FAR_OFFSET) == 0 && offsetNumber > limit)
      {
          elog(WARNING, "specified item offset is too large");
          return InvalidOffsetNumber;
--- 270,277 ----
          }
      }

!     /* Reject placing items beyond the first unused line pointer */
!     if (offsetNumber > limit)
      {
          elog(WARNING, "specified item offset is too large");
          return InvalidOffsetNumber;
*************** PageAddItemExtended(Page page,
*** 290,299 ****
       * Note: do arithmetic as signed ints, to avoid mistakes if, say,
       * alignedSize > pd_upper.
       */
!     if ((flags & PAI_ALLOW_FAR_OFFSET) != 0)
!         lower = Max(phdr->pd_lower,
!                     SizeOfPageHeaderData + sizeof(ItemIdData) * offsetNumber);
!     else if (offsetNumber == limit || needshuffle)
          lower = phdr->pd_lower + sizeof(ItemIdData);
      else
          lower = phdr->pd_lower;
--- 290,296 ----
       * Note: do arithmetic as signed ints, to avoid mistakes if, say,
       * alignedSize > pd_upper.
       */
!     if (offsetNumber == limit || needshuffle)
          lower = phdr->pd_lower + sizeof(ItemIdData);
      else
          lower = phdr->pd_lower;
*************** PageIndexDeleteNoCompact(Page page, Offs
*** 1093,1098 ****
--- 1090,1202 ----
      }
  }

+
+ /*
+  * PageIndexTupleOverwrite
+  *
+  * Replace a specified tuple on an index page.
+  *
+  * The new tuple is placed exactly where the old one had been, shifting
+  * other tuples' data up or down as needed to keep the page compacted.
+  * This is better than deleting and reinserting the tuple, because it
+  * avoids any data shifting when the tuple size doesn't change; and
+  * even when it does, we avoid moving the item pointers around.
+  * Conceivably this could also be of use to an index AM that cares about
+  * the physical order of tuples as well as their ItemId order.
+  *
+  * If there's insufficient space for the new tuple, return false.  Other
+  * errors represent data-corruption problems, so we just elog.
+  */
+ bool
+ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
+                         Item newtup, Size newsize)
+ {
+     PageHeader    phdr = (PageHeader) page;
+     ItemId        tupid;
+     int            oldsize;
+     unsigned    offset;
+     Size        alignednewsize;
+     int            size_diff;
+     int            itemcount;
+
+     /*
+      * As with PageRepairFragmentation, paranoia seems justified.
+      */
+     if (phdr->pd_lower < SizeOfPageHeaderData ||
+         phdr->pd_lower > phdr->pd_upper ||
+         phdr->pd_upper > phdr->pd_special ||
+         phdr->pd_special > BLCKSZ ||
+         phdr->pd_special != MAXALIGN(phdr->pd_special))
+         ereport(ERROR,
+                 (errcode(ERRCODE_DATA_CORRUPTED),
+                  errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+                         phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+     itemcount = PageGetMaxOffsetNumber(page);
+     if ((int) offnum <= 0 || (int) offnum > itemcount)
+         elog(ERROR, "invalid index offnum: %u", offnum);
+
+     tupid = PageGetItemId(page, offnum);
+     Assert(ItemIdHasStorage(tupid));
+     oldsize = ItemIdGetLength(tupid);
+     offset = ItemIdGetOffset(tupid);
+
+     if (offset < phdr->pd_upper || (offset + oldsize) > phdr->pd_special ||
+         offset != MAXALIGN(offset))
+         ereport(ERROR,
+                 (errcode(ERRCODE_DATA_CORRUPTED),
+                  errmsg("corrupted item pointer: offset = %u, size = %u",
+                         offset, (unsigned int) oldsize)));
+
+     /*
+      * Determine actual change in space requirement, check for page overflow.
+      */
+     oldsize = MAXALIGN(oldsize);
+     alignednewsize = MAXALIGN(newsize);
+     if (alignednewsize > oldsize + (phdr->pd_upper - phdr->pd_lower))
+         return false;
+
+     /*
+      * Relocate existing data and update line pointers, unless the new tuple
+      * is the same size as the old (after alignment), in which case there's
+      * nothing to do.  Notice that what we have to relocate is data before the
+      * target tuple, not data after, so it's convenient to express size_diff
+      * as the amount by which the tuple's size is decreasing, making it the
+      * delta to add to pd_upper and affected line pointers.
+      */
+     size_diff = oldsize - (int) alignednewsize;
+     if (size_diff != 0)
+     {
+         char       *addr = (char *) page + phdr->pd_upper;
+         int            i;
+
+         /* relocate all tuple data before the target tuple */
+         memmove(addr + size_diff, addr, offset - phdr->pd_upper);
+
+         /* adjust free space boundary pointer */
+         phdr->pd_upper += size_diff;
+
+         /* adjust affected line pointers too */
+         for (i = FirstOffsetNumber; i <= itemcount; i++)
+         {
+             ItemId        ii = PageGetItemId(phdr, i);
+
+             /* Allow items without storage; currently only BRIN needs that */
+             if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
+                 ii->lp_off += size_diff;
+         }
+     }
+
+     /* Update the item's tuple length (other fields shouldn't change) */
+     ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+     /* Copy new tuple data onto page */
+     memcpy(PageGetItem(page, tupid), newtup, newsize);
+
+     return true;
+ }
+
+
  /*
   * Set checksum for a page in shared buffers.
   *
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..0ea47f5 100644
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
*************** do { \
*** 409,415 ****
   */
  #define PAI_OVERWRITE            (1 << 0)
  #define PAI_IS_HEAP                (1 << 1)
- #define PAI_ALLOW_FAR_OFFSET    (1 << 2)

  extern void PageInit(Page page, Size pageSize, Size specialSize);
  extern bool PageIsVerified(Page page, BlockNumber blkno);
--- 409,414 ----
*************** extern void PageIndexTupleDelete(Page pa
*** 429,434 ****
--- 428,435 ----
  extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
  extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
                           int nitems);
+ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
+                         Item newtup, Size newsize);
  extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
  extern void PageSetChecksumInplace(Page page, BlockNumber blkno);


Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Tom Lane
Date:
... BTW, a quick grep for other calls of PageIndexTupleDelete suggests
that SPGiST could make very good use of PageIndexTupleOverwrite, and
there may be use for it in GIN too.

I see one place in btree where there's a PageIndexTupleDelete and then
an insert, but it's unlikely to be worth touching because the page is
expected to be empty after the PageIndexTupleDelete; so there's no
data movement we could avoid in that case.

I don't plan to investigate these cases myself, but somebody should.
        regards, tom lane



Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

From
Tom Lane
Date:
I wrote:
> ... BTW, a quick grep for other calls of PageIndexTupleDelete suggests
> that SPGiST could make very good use of PageIndexTupleOverwrite, and
> there may be use for it in GIN too.

I've pushed this patch with some editorializing and some followup
work.  The above remains as possible material for a future patch,
but I'm going to mark this commitfest entry as closed.
        regards, tom lane