Discussion: GiST VACUUM


GiST VACUUM

From
Andrey Borodin
Date:
Hi, hackers!

Here's a new version of the GiST VACUUM patch set, aimed at CF 2018-09.
The original thread can be found at [0].

==Purpose==
Currently GiST never reuses pages, even if they are completely empty. This can lead to bloat if a lot of index tuples
are deleted from a part of the key space that does not receive newly inserted tuples.
The first patch fixes that issue: empty pages are collected and reused for new page splits.

The second patch improves the scanning algorithm to read GiST blocks in physical order. This can dramatically increase
the speed of scanning, especially on HDD.
This scan uses a relatively big chunk of memory to build a map of the whole GiST graph. If there is not enough
maintenance memory, the patch falls back to the old GiST VACUUM (from the first patch).

==How logical scan works==
GiST VACUUM scans the graph with a DFS to find removed tuples on leaf pages. It remembers the internal pages that
reference completely empty leaf pages.
Then, in an additional step, those internal pages are rescanned to delete the references and mark the leaf pages as free.
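
For illustration, here is a minimal sketch of the two phases in C. The struct resembles the patch's GistBDItem, but the
helper functions are hypothetical:

    /* Sketch only: helper names are made up; the real patch differs in details. */
    typedef struct GistBDItem
    {
        BlockNumber blkno;              /* internal page referencing empty leaves */
        struct GistBDItem *next;
    } GistBDItem;

    /* Phase 1: the DFS removes dead tuples from leaf pages and records
     * internal pages that reference completely empty leaves. */
    GistBDItem *rescanList = gistCollectEmptyLeafParents(rel, callback, callback_state);

    /* Phase 2: revisit each recorded internal page, delete the downlinks
     * and mark the referenced leaf pages as free. */
    for (GistBDItem *item = rescanList; item != NULL; item = item->next)
        gistDeleteEmptyChildren(rel, item->blkno);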

==How physical scan works==
The scan builds an array of GistPSItem (physical scan items). A GistPSItem contains a list of offsets pointing to
potentially empty leaf pages, plus the information necessary to collect that list in one sequential file read.
The array of GistPSItems has one item for each GiST block.
When we encounter a leaf page, if it is empty, we mark it empty and jump to its parent (if known) to update the parent's
list.
When we encounter an internal page, we check the GistPSItem of every child block, to see whether it is an empty leaf and
to mark the parent pointer there.
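
To make that concrete, here is a hypothetical layout of the per-block state (field names are illustrative, not
necessarily the patch's):

    typedef struct GistPSItem
    {
        BlockNumber parent;             /* internal page pointing here, if known */
        List       *emptyLeafOffsets;   /* downlink offsets to empty leaf pages */
        bool        isEmptyLeaf;        /* block seen as a completely empty leaf */
    } GistPSItem;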

==Limitations==
At least one reference on each internal page is left undeleted, to preserve the balancing of the tree.
Pages that have the FOLLOW-RIGHT flag set are also not deleted, even if empty.


Thank you for your attention, any thoughts are welcome.

Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/1147341441925550%40web17j.yandex.ru#1147341441925550@web17j.yandex.ru

Attachments

Re: GiST VACUUM

From
Thomas Munro
Date:
On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Here's new version of GiST VACUUM patch set aimed at CF 2018-09.

Hi Andrey,

FYI Windows doesn't like this patch[1].

+ uint16_t     flags;

I think that needs to be uint16?

[1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.16

-- 
Thomas Munro
http://www.enterprisedb.com


Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi, Thomas!

> On 17 May 2018, at 9:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Here's new version of GiST VACUUM patch set aimed at CF 2018-09.
>
> Hi Andrey,
>
> FYI Windows doesn't like this patch[1].
>
> + uint16_t     flags;
>
> I think that needs to be uint16?


Thanks! Fixed.

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
I'm now looking at the first patch in this series, to allow completely 
empty GiST pages to be recycled. I've got some questions:

> --- a/src/backend/access/gist/gist.c
> +++ b/src/backend/access/gist/gist.c
> @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
>              GISTInsertStack *item;
>              OffsetNumber downlinkoffnum;
>  
> +            if(GistPageIsDeleted(stack->page))
> +            {
> +                UnlockReleaseBuffer(stack->buffer);
> +                xlocked = false;
> +                state.stack = stack = stack->parent;
> +                continue;
> +            }
>              downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
>              iid = PageGetItemId(stack->page, downlinkoffnum);
>              idxtuple = (IndexTuple) PageGetItem(stack->page, iid);

This seems misplaced. This code deals with internal pages, and as far as 
I can see, this patch never marks internal pages as deleted, only leaf 
pages. However, we should have something like this in the leaf-page 
branch, to deal with the case that an insertion lands on a page that was 
concurrently deleted. Did you have any tests, where an insertion runs 
concurrently with vacuum, that would exercise this?

The code in gistbulkdelete() seems pretty expensive. In the first phase, 
it records the parent of every empty leaf page it encounters. In the 
second phase, it scans every leaf page of that parent, not only those 
leaves that were seen as empty.

I'm a bit wary of using pd_prune_xid for the checks to determine if a 
deleted page can be recycled yet. In heap pages, pd_prune_xid is just a 
hint, but here it's used for a critical check. This seems to be the same 
mechanism we use in B-trees, but in B-trees, we store the XID in 
BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use 
ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See 
comments in _bt_unlink_halfdead_page() for explanation. This patch is 
missing any comments to explain how this works in GiST.
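
For comparison, the B-tree recycling rule boils down to a check like the following (modeled on _bt_page_recyclable();
"deleteXid" here stands for whatever XID GiST would stamp on the page at deletion):

    /* A deleted page may be recycled only once no running snapshot could
     * still hold a link to it. */
    if (TransactionIdPrecedes(deleteXid, RecentGlobalXmin))
    {
        /* safe to report the page to the FSM for reuse */
    }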

If you crash in the middle of gistbulkdelete(), after it has removed the 
downlink from the parent, but before it has marked the leaf page as 
deleted, the leaf page is "leaked". I think that's acceptable, but a 
comment at least would be good.

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi, Heikki!

Thanks for looking into the patch!

> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I'm now looking at the first patch in this series, to allow completely empty GiST pages to be recycled. I've got some
questions:
>
>> --- a/src/backend/access/gist/gist.c
>> +++ b/src/backend/access/gist/gist.c
>> @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
>>             GISTInsertStack *item;
>>             OffsetNumber downlinkoffnum;
>> +            if(GistPageIsDeleted(stack->page))
>> +            {
>> +                UnlockReleaseBuffer(stack->buffer);
>> +                xlocked = false;
>> +                state.stack = stack = stack->parent;
>> +                continue;
>> +            }
>>             downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
>>             iid = PageGetItemId(stack->page, downlinkoffnum);
>>             idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
>
> This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal
pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with
the case that an insertion lands on a page that was concurrently deleted.
That's a bug. Will fix this.

> Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this?
Yes, I've tried to test this, but, obviously, not enough. I'll think more about how to deal with it.

>
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf
page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen
as empty.
Yes, in the first patch there is the simplest gistbulkdelete(); the second patch will remember the line pointers of
downlinks to empty leaves.

>
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap
pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we
use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use
ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for
explanation. This patch is missing any comments to explain how this works in GiST.
Will look into this. I remember it was OK half a year ago, but now it is clear to me that I should have documented that
part when I understood it...

>
> If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has
marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be
good.
I was considering doing reverse-split (page merge) concurrency like in Lanin and Shasha's paper, but it is just too
complex for so little purpose. Will add comments on possible orphan pages.

Many thanks! I hope to post updated patch series this week.


Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

PFA v5 of the patch series.

> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal
pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with
the case that an insertion lands on a page that was concurrently deleted. Did you have any tests, where an insertion
runs concurrently with vacuum, that would exercise this?

That bug could manifest only in case of a crash between removing downlinks and marking pages deleted. I do not know how
to test this reliably.
Internal pages are locked before leaves and the locks are coupled. No concurrent backend can see downlinks to pages
being deleted, unless a crash happens.

I've moved the code covering this situation into the leaf code path and added a comment.

>
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf
page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen
as empty.

It is fixed in the second patch of the series.

>
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap
pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we
use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use
ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for
explanation. This patch is missing any comments to explain how this works in GiST.

I've replaced the usage of GetCurrentTransactionId() with ReadNewTransactionId() and added an explanation of what is
going on. Also, I've added comments about that pd_prune_xid usage. There is no other use in GiST for this field, and
there is no other room to place this xid on a page without a pg_upgrade.

>
> If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has
marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be
good.

Added an explanatory comment between WAL-logging the downlink removal and marking pages deleted.


Thank you for reviewing the patch!

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 12/07/18 19:06, Andrey Borodin wrote:
>> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>> 
>> This seems misplaced. This code deals with internal pages, and as
>> far as I can see, this patch never marks internal pages as deleted,
>> only leaf pages. However, we should have something like this in the
>> leaf-page branch, to deal with the case that an insertion lands on
>> a page that was concurrently deleted. Did you have any tests, where
>> an insertion runs concurrently with vacuum, that would exercise
>> this?
> 
> That bug could manifest only in case of crash between removing
> downlinks and marking pages deleted.

Hmm. The downlink is removed first, so I don't think you can see that 
situation after a crash. After a crash, you might have some empty, 
orphaned, pages that have already been unlinked from the parent, but a 
search/insert should never encounter them.

Actually, now that I think about it more, I'm not happy with leaving 
orphaned pages like that behind. Let's WAL-log the removal of the 
downlink, and marking the leaf pages as deleted, in one WAL record, to 
avoid that.

But the situation in gistdoinsert(), where you encounter a deleted leaf 
page, could happen during normal operation, if vacuum runs concurrently 
with an insert. Insertion locks only one page at a time, as it descends 
the tree, so after it has released the lock on the parent, but before it 
has locked the child, vacuum might have deleted the page. In the latest 
patch, you're checking for that just before swapping the shared lock for 
an exclusive one, but I think that's wrong; you need to check for that 
after swapping the lock, because otherwise vacuum might delete the page 
while you're not holding the lock.
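
To illustrate the ordering, here is a sketch of the re-check done only after the lock swap (simplified; a fragment of
the gistdoinsert() loop, not the exact patch code):

    LockBuffer(stack->buffer, GIST_UNLOCK);     /* release the shared lock */
    LockBuffer(stack->buffer, GIST_EXCLUSIVE);  /* ...and take the exclusive one */
    stack->page = (Page) BufferGetPage(stack->buffer);
    if (GistPageIsDeleted(stack->page))
    {
        /* Vacuum deleted this leaf while we held no lock: retry from the parent. */
        UnlockReleaseBuffer(stack->buffer);
        state.stack = stack = stack->parent;
        continue;
    }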

> I do not know how to test this
> reliably. Internal pages are locked before leaves and locks are
> coupled. No concurrent backend can see downlinks to pages being
> deleted, unless crash happens.

Are you sure? At a quick glance, I don't think the locks are coupled.

We do need some way of testing this..

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 12/07/18 19:06, Andrey Borodin wrote:
>>> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi>
>>> wrote:
>>> This seems misplaced. This code deals with internal pages, and as
>>> far as I can see, this patch never marks internal pages as deleted,
>>> only leaf pages. However, we should have something like this in the
>>> leaf-page branch, to deal with the case that an insertion lands on
>>> a page that was concurrently deleted. Did you have any tests, where
>>> an insertion runs concurrently with vacuum, that would exercise
>>> this?
>> That bug could manifest only in case of crash between removing
>> downlinks and marking pages deleted.
>
> Hmm. The downlink is removed first, so I don't think you can see that situation after a crash. After a crash, you
might have some empty, orphaned, pages that have already been unlinked from the parent, but a search/insert should
never encounter them.
>
> Actually, now that I think about it more, I'm not happy with leaving orphaned pages like that behind. Let's WAL-log
the removal of the downlink, and marking the leaf pages as deleted, in one WAL record, to avoid that.

OK, will do this. But this will complicate WAL replay seriously, and I do not know a proper way to test that (BTW there
is GiST amcheck in progress, but I decided to leave it for a while).

>
> But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation,
if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after
it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the
latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's
wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're
not holding the lock.
Looks like a valid concern, I'll move that code again.
>
>> I do not know how to test this
>> reliably. Internal pages are locked before leaves and locks are
>> coupled. No concurrent backend can see downlinks to pages being
>> deleted, unless crash happens.
>
> Are you sure? At a quick glance, I don't think the locks are coupled.


Sorry for overquoting

+    /* rescan inner pages that had empty child pages */
+    foreach(cell,rescanList)
+    {
+        Buffer        buffer;
+        Page        page;
+        OffsetNumber i,
+                    maxoff;
+        IndexTuple    idxtuple;
+        ItemId        iid;
+        OffsetNumber todelete[MaxOffsetNumber];
+        Buffer        buftodelete[MaxOffsetNumber];
+        int            ntodelete = 0;
+
+        buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+                                    RBM_NORMAL, info->strategy);
+        LockBuffer(buffer, GIST_EXCLUSIVE);
Here's the first lock
+        gistcheckpage(rel, buffer);
+        page = (Page) BufferGetPage(buffer);
+
+        Assert(!GistPageIsLeaf(page));
+
+        maxoff = PageGetMaxOffsetNumber(page);
+
+        for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+        {
+            Buffer        leafBuffer;
+            Page        leafPage;
+
+            iid = PageGetItemId(page, i);
+            idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+            leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+                                RBM_NORMAL, info->strategy);
+            LockBuffer(leafBuffer, GIST_EXCLUSIVE);
now locks are coupled in a top-down descent


>
> We do need some way of testing this..

Can we test replication of concurrent VACUUM and inserts in the existing test suite? I just do not know.
I can do these tests manually if that is enough.


Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 13/07/18 16:41, Andrey Borodin wrote:
>> On 12 July 2018, at 21:07, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Actually, now that I think about it more, I'm not happy with leaving orphaned 
>>> pages like that behind. Let's WAL-log the removal of the downlink, and 
>>> marking the leaf pages as deleted, in one WAL record, to avoid that.
>>
>> OK, will do this. But this will complicate WAL replay seriously, and I do not 
>> know a proper way to test that (BTW there is GiST amcheck in progress, but I 
>> decided to leave it for a while).
> Done. Now the WAL record for a deleted page also removes the downlink from the internal page.
> I had to use PageIndexTupleDelete() instead of PageIndexMultiDelete(), but
> I do not think it will have any impact on performance.

Yeah, I think that's fine; this isn't that performance-critical.

>>> But the situation in gistdoinsert(), where you encounter a deleted leaf page, 
>>> could happen during normal operation, if vacuum runs concurrently with an 
>>> insert. Insertion locks only one page at a time, as it descends the tree, so 
>>> after it has released the lock on the parent, but before it has locked the 
>>> child, vacuum might have deleted the page. In the latest patch, you're 
>>> checking for that just before swapping the shared lock for an exclusive one, 
>>> but I think that's wrong; you need to check for that after swapping the lock, 
>>> because otherwise vacuum might delete the page while you're not holding the lock.
>> Looks like a valid concern, I'll move that code again.
> Done.

Ok, the comment now says:

> +            /*
> +             * Leaf pages can be left deleted but still referenced in case of
> +             * crash during VACUUM's gistbulkdelete()
> +             */

But that's not accurate, right? You should never see deleted pages after 
a crash, because the parent is updated in the same WAL record as the 
child page, right?

I'm still a bit scared about using pd_prune_xid to store the XID that 
prevents recycling the page too early. Can we use some field in 
GISTPageOpaqueData for that, similar to how the B-tree stores it in 
BTPageOpaqueData?

- Heikki


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
Looking at the second patch, to scan the GiST index in physical order, 
that seems totally unsafe, if there are any concurrent page splits. In 
the logical scan, pushStackIfSplited() deals with that, by comparing the 
page's NSN with the parent's LSN. But I don't see anything like that in 
the physical scan code.

I think we can do something similar in the physical scan: remember the 
current LSN position at the beginning of the vacuum, and compare with 
that. The B-tree code uses the "cycle ID" for similar purposes.

Do we still need the separate gistvacuumcleanup() pass, if we scan the 
index in physical order in the bulkdelete pass already?

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 13 July 2018, at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>>>> But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal
operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the
tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted
the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but
I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the
page while you're not holding the lock.
>>> Looks like a valid concern, I'll move that code again.
>> Done.
>
> Ok, the comment now says:
>
>> +            /*
>> +             * Leaf pages can be left deleted but still referenced in case of
>> +             * crash during VACUUM's gistbulkdelete()
>> +             */
>
> But that's not accurate, right? You should never see deleted pages after a crash, because the parent is updated in
the same WAL record as the child page, right?
Fixed the comment.
>
> I'm still a bit scared about using pd_prune_xid to store the XID that prevents recycling the page too early. Can we
use some field in GISTPageOpaqueData for that, similar to how the B-tree stores it in BTPageOpaqueData?
There is no room in the opaque data but, technically, the whole page is just a tombstone until reused, so we can pick
an arbitrary place. PFA v7, where the xid after delete is stored in the opaque data, but we can use other places, like
the line pointer array or opaque-1.

> On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Looking at the second patch, to scan the GiST index in physical order, that seems totally unsafe, if there are any
concurrent page splits. In the logical scan, pushStackIfSplited() deals with that, by comparing the page's NSN with
the parent's LSN. But I don't see anything like that in the physical scan code.

A leaf page can be pointed to by an internal page and a rightlink simultaneously. The purpose of NSN is to visit such a
page exactly once, by following only one of the two links in a scan. This is achieved naturally if we read everything
from the beginning to the end. (That is how I understand it; I can be wrong.)

> I think we can do something similar in the physical scan: remember the current LSN position at the beginning of the
vacuum, and compare with that. The B-tree code uses the "cycle ID" for similar purposes.
>
> Do we still need the separate gistvacuumcleanup() pass, if we scan the index in physical order in the bulkdelete pass
already?

We do not need to gather stats there, but we are doing RecordFreeIndexPage() and IndexFreeSpaceMapVacuum(). Is it
correct to move these to the first scan?

Best regards, Andrey Borodin.


Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 13/07/18 21:28, Andrey Borodin wrote:
>> On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>> 
>> Looking at the second patch, to scan the GiST index in physical
>> order, that seems totally unsafe, if there are any concurrent page
>> splits. In the logical scan, pushStackIfSplited() deals with that,
>> by comparing the page's NSN with the parent's LSN. But I don't see
>> anything like that in the physical scan code.
> 
> A leaf page can be pointed to by an internal page and a rightlink
> simultaneously. The purpose of NSN is to visit this page exactly once,
> by following only one of the two links in a scan. This is achieved
> naturally if we read everything from the beginning to the end. (That
> is how I understand it; I can be wrong.)

The scenario where this fails goes like this:

1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on 
page 15, but the new right half goes to page 5
3. Vacuum scans pages 11-20

Now, if there were any dead tuples on the right half of the split, moved 
to page 5, the vacuum would miss them.

The way this is handled in B-tree is that when a page is split, the page 
is stamped with current "vacuum cycle id". When the vacuum scan 
encounters a page with the current cycle id, whose right-link points to 
a lower-numbered page, it immediately follows the right link, and 
re-scans it. I.e. in the above example, if it was a B-tree, in step 3 
when vacuum scans page 15, it would see that it was concurrently split. 
It would immediately vacuum page 5 again, before continuing the scan in 
physical order.

We'll need to do something similar in GiST.
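
Translated to GiST terms, the revisit condition could look roughly like this (a sketch, assuming the scan remembered
startNSN = GetInsertRecPtr() when it began; the real code must also handle FOLLOW_RIGHT pages):

    GISTPageOpaque opaque = GistPageGetOpaque(page);

    if (GistPageGetNSN(page) > startNSN &&      /* split after the scan started */
        opaque->rightlink != InvalidBlockNumber &&
        opaque->rightlink < blkno)              /* right half landed behind us */
    {
        /* re-scan opaque->rightlink before continuing in physical order */
    }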

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:
> On 14 July 2018, at 0:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 13/07/18 21:28, Andrey Borodin wrote:
>>> On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi>
>>> wrote:
>>> Looking at the second patch, to scan the GiST index in physical
>>> order, that seems totally unsafe, if there are any concurrent page
>>> splits. In the logical scan, pushStackIfSplited() deals with that,
>>> by comparing the page's NSN with the parent's LSN. But I don't see
>>> anything like that in the physical scan code.
>> A leaf page can be pointed to by an internal page and a rightlink
>> simultaneously. The purpose of NSN is to visit this page exactly once,
>> by following only one of the two links in a scan. This is achieved
>> naturally if we read everything from the beginning to the end. (That
>> is how I understand it; I can be wrong.)
>
> The scenario where this fails goes like this:
>
> 1. Vacuum scans physical pages 1-10
> 2. A concurrent insertion splits page 15. The new left half stays on page 15, but the new right half goes to page 5
> 3. Vacuum scans pages 11-20
>
> Now, if there were any dead tuples on the right half of the split, moved to page 5, the vacuum would miss them.
>
> The way this is handled in B-tree is that when a page is split, the page is stamped with the current "vacuum cycle
id". When the vacuum scan encounters a page with the current cycle id, whose right-link points to a lower-numbered
page, it immediately follows the right link, and re-scans it. I.e. in the above example, if it was a B-tree, in step 3
when vacuum scans page 15, it would see that it was concurrently split. It would immediately vacuum page 5 again,
before continuing the scan in physical order.
>
> We'll need to do something similar in GiST.

OK, I will do this.

This is a tradeoff between a complex concurrency feature and the possibility of a few dead tuples being left after
VACUUM. I want to understand: is there something dangerous about these dead tuples?
There is one more serious race condition: the result of the first scan is just a hint about where to look for downlinks
to empty pages. If an internal page splits between the scan and the cleanup, the offsets of the downlinks will have
changed; the cleanup will lock the pages, see non-empty pages, and will not delete them (though there are no dead
tuples, just empty leaves that were not deleted).


Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 14/07/18 10:26, Andrey Borodin wrote:
> This is a tradeoff between a complex concurrency feature and the possibility
> of a few dead tuples left after VACUUM. I want to understand: is there
> something dangerous about these dead tuples?
Yeah, that's bad. When a new heap tuple is inserted in the same 
location, the old index tuple will point to the new, unrelated, tuple, 
and you will get incorrect query results.

> There is one more serious race condition: the result of the first scan is
> just a hint about where to look for downlinks to empty pages. If an internal
> page splits between the scan and the cleanup, the offsets of the downlinks
> will have changed; the cleanup will lock the pages, see non-empty pages, and
> will not delete them (though there are no dead tuples, just empty leaves
> that were not deleted).

That's fine. Leaving behind a few empty pages is harmless, the next 
vacuum will pick them up.

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 14 July 2018, at 14:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 14/07/18 10:26, Andrey Borodin wrote:
>> This is a tradeoff between a complex concurrency feature and the possibility
>> of a few dead tuples left after VACUUM. I want to understand: is there
>> something dangerous about these dead tuples?
> Yeah, that's bad. When a new heap tuple is inserted in the same location, the old index tuple will point to the new,
unrelated, tuple, and you will get incorrect query results.

PFA v8: at the beginning of the physical scan I grab GetInsertRecPtr(), and if a leaf page has a rightlink pointing to
a lower block number in the file and an NSN higher than the LSN at scan start, we go back to scan that page one more
time. This is recursive.


I'm still not very comfortable with storing the deletion xid in the opaque data: we reuse the rightlink field under
very specific conditions instead of using the totally unused pd_prune_xid. We have a remark in bufpage.h:
 * pd_prune_xid is a hint field that helps determine whether pruning will be
 * useful.  It is currently unused in index pages.
Or maybe we should use some other place on those empty 8Kb pages.

A deleted page does not have a rightlink now, but in B-trees the rightlink on deleted pages is actively used.
We also cannot reuse the NSN: the rightlink is useless without the NSN anyway. We cannot reuse the flags: the page is
marked deleted in the flags, after all. And the uint16 gist_page_id is just not enough.
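
For reference, this is the special-space struct we are talking about, from src/include/access/gist.h:

    typedef struct GISTPageOpaqueData
    {
        GistNSN     nsn;            /* this value must change on page split */
        BlockNumber rightlink;      /* next page if any */
        uint16      flags;          /* F_LEAF, F_DELETED, F_FOLLOW_RIGHT, ... */
        uint16      gist_page_id;   /* for identification of GiST indexes */
    } GISTPageOpaqueData;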

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Robert Haas
Date:
On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm still a bit scared about using pd_prune_xid to store the XID that
> prevents recycling the page too early. Can we use some field in
> GISTPageOpaqueData for that, similar to how the B-tree stores it in
> BTPageOpaqueData?

What's your reason for being scared?  It seems to me that the
alternatives being proposed (altering the size of the special space,
or sometimes repurposing a field for some other purpose) have their
own associated scariness.

If I had my druthers, I'd nuke pd_prune_xid from orbit -- it's a piece
of seemingly heap-specific data that is kept in the generic page
header rather than in the heap's special space.  Other AMs like btree
or zheap may have different needs; one size does not fit all.  That
said, since getting rid of pd_prune_xid seems impractical, maybe it
makes more sense to reuse it than to insist on leaving it idle and
consuming space elsewhere in the page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 16 July 2018, at 18:58, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I'm still a bit scared about using pd_prune_xid to store the XID that
>> prevents recycling the page too early. Can we use some field in
>> GISTPageOpaqueData for that, similar to how the B-tree stores it in
>> BTPageOpaqueData?
>
> What's your reason for being scared?  It seems to me that the
> alternatives being proposed (altering the size of the special space,
> or sometimes repurposing a field for some other purpose) have their
> own associated scariness.

Thanks, that's exactly what I've been thinking about regarding where to store this xid.

Here's v9 of the patch. It uses pd_prune_xid, but access is abstracted behind GistPageGetDeleteXid() /
GistPageSetDeleteXid(), so that we can easily change the way we store it.
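
A plausible shape for those accessors, just to show the idea (the actual patch may define them differently):

    #define GistPageGetDeleteXid(page) \
        (((PageHeader) (page))->pd_prune_xid)
    #define GistPageSetDeleteXid(page, xid) \
        (((PageHeader) (page))->pd_prune_xid = (xid))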

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 16 July 2018, at 21:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>

I was checking WAL replay of the new scheme to log page deletes and found a bug there (an incorrect value of the
deleted downlink in the WAL record). Here's fixed patch v10.

Also I've added WAL identification support for the new record, and made some improvements to comments and naming in
the data structures.

Thanks!

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 17/07/18 21:41, Andrey Borodin wrote:
> I was checking WAL replay of the new scheme to log page deletes and found
> a bug there (an incorrect value of the deleted downlink in the WAL record).
> Here's fixed patch v10.
> 
> Also I've added WAL identification support for the new record, and made
> some improvements to comments and naming in the data structures.

Thanks!

> +        /*
> +         * If this page was splitted after start of the VACUUM we have to
> +         * revisit rightlink, if it points to block we already scanned.
> +         * This is recursive revisit, should not be deep, but we check
> +         * the possibility of stack overflow anyway.
> +         */
> +        if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
> +            (opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
> +            {
> +                gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, opaque->rightlink, startNSN, graph);
> +            }

In the corresponding B-tree code, we don't do actual recursion, but 
a hand-optimized "tail recursion", to avoid stack overflow if there are 
a lot of splits. I think we need to do something like that here, too. I 
don't think it's safe to assume that we have enough stack space for the 
recursion. You have a check_stack_depth() in the function, so you'll get 
an ERROR, but it sucks if VACUUM errors out because of that.
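
For reference, the shape of that hand-optimized tail recursion, as btvacuumpage() does it (a sketch with illustrative
names, not actual patch code):

    BlockNumber orig_blkno = blkno;

restart:
    /* ... read, lock and vacuum the page at blkno ... */

    if (page_was_split_after_scan_start && opaque->rightlink < orig_blkno)
    {
        /* The right sibling lies behind the scan position: handle it now,
         * iteratively, instead of recursing. */
        blkno = opaque->rightlink;
        goto restart;
    }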


It's not cool to use up to 'maintenance_work_mem' of memory for holding 
the in-memory graph. VACUUM already uses up to that much memory to hold 
the list of dead TIDs, so we would use up to 2x maintenance_work_mem in 
total.

The only reason we still need the logical scan is to support page 
deletion, when there is not enough memory available. Can we get rid of 
that altogether? I think I'd prefer some other scheme to deal with that 
situation. For example, we could make note, in memory, of the empty 
pages during the physical scan, and perform a second physical scan of 
the index to find their downlinks. Two full-index scans is not nice, but 
it's actually not that bad if it's done in physical order. And you could 
have some heuristics, e.g. only do it if more than 10% of the pages were 
deletable.
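
In outline, the suggestion is something like this (function names are hypothetical):

    Bitmapset  *emptyLeaves;

    /* Pass 1: physical-order scan; delete tuples, remember empty leaf blocks. */
    emptyLeaves = gistvacuum_physical_scan(rel, info, stats, callback, callback_state);

    /* Pass 2, only if worthwhile: re-scan the internal pages to find and
     * remove downlinks to the empty leaves, then mark those leaves deleted. */
    if (bms_num_members(emptyLeaves) > num_pages / 10)
        gistvacuum_delete_empty_pages(rel, info, emptyLeaves);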

Sorry to ask you to rewrite this again, but I think it would be better 
to split this into two patches as follows:

1st patch: Scan the index in physical rather than logical order. No 
attempt at deleting empty pages yet.

2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first patch, 
which just switches to doing physical-order scan, first.

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 18 July 2018, at 16:02, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> In the corresponding B-tree code, we don't do actual recursion, but a hand-optimized "tail recursion", to avoid
stack overflow if there are a lot of splits. I think we need to do something like that here, too. I don't think it's
safe to assume that we have enough stack space for the recursion. You have a check_stack_depth() in the function, so
you'll get an ERROR, but it sucks if VACUUM errors out because of that.
Ok, will do that.

>
>
> It's not cool to use up to 'maintenance_work_mem' of memory for holding the in-memory graph. VACUUM already uses up
to that much memory to hold the list of dead TIDs, so we would use up to 2x maintenance_work_mem in total.
>
> The only reason we still need the logical scan is to support page deletion, when there is not enough memory
available. Can we get rid of that altogether? I think I'd prefer some other scheme to deal with that situation. For
example, we could make note, in memory, of the empty pages during the physical scan, and perform a second physical
scan of the index to find their downlinks. Two full-index scans is not nice, but it's actually not that bad if it's
done in physical order.
I think this is a good idea. I will implement this.

> And you could have some heuristics, e.g. only do it if more than 10% of the pages were deletable.
>
> Sorry to ask you to rewrite this again
Piece of cake :)

> , but I think it would be better to split this into two patches as follows:
>
> 1st patch: Scan the index in physical rather than logical order. No attempt at deleting empty pages yet.
>
> 2nd patch: Add support for deleting empty pages.
>
> I would be more comfortable reviewing and committing that first patch, which just switches to doing physical-order
scan, first.

This seems a very unproportional division of complexity. The first patch (PFA) is very simple. All work is done in one
cycle, without memorizing anything. Actually, you do not even need to rescan rightlinks: there can be no splits to the
left when no pages are deleted.
If you think it is the proper way to go - OK, I'll prepare a better version of the attached diff (by omitting the tail
recursion and adding more comments).


Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 18/07/18 21:27, Andrey Borodin wrote:
> Hi!
> 
>> On 18 July 2018, at 16:02, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>> 
>> , but I think it would be better to split this into two patches as
>> follows:
>> 
>> 1st patch: Scan the index in physical rather than logical order. No
>> attempt at deleting empty pages yet.
>> 
>> 2nd patch: Add support for deleting empty pages.
>> 
>> I would be more comfortable reviewing and committing that first
>> patch, which just switches to doing physical-order scan, first.
> 
> This seems a very unproportional division of complexity. The first patch
> (PFA) is very simple. All work is done in one cycle, without
> memorizing anything. Actually, you do not even need to rescan
> rightlinks: there can be no splits to the left when no pages are
> deleted.

Heh, good point.

I googled around and bumped into an older patch to do this: 
https://www.postgresql.org/message-id/1135121410099068%40web30j.yandex.ru. 
Unfortunately, Kostya never got around to updating the patch, and it was 
forgotten. But the idea seemed sound even back then.

As noted in that thread, there might be deleted pages in the index in 
some rare circumstances, even though we don't recycle empty pages: if 
the index was upgraded from a very old version, as VACUUM FULL used to 
recycle empty pages, or if you crash just when extending the index, and 
end up with newly-initialized but unused pages that way. So we do need 
to handle the concurrent split scenario, even without empty page recycling.

> If you think it is the proper way to go - OK, I'll prepare a
> better version of the attached diff (by omitting the tail recursion and
> adding more comments).

Yeah, please, I think this is the way to go.

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Yeah, please, I think this is the way to go.

Here's v11 divided into proposed steps.

In the second step I still use palloc'ed memory, but only to store two bitmaps: a bitmap of internal pages and a bitmap
of empty leaves. The second physical scan only reads internal pages. I can omit that bitmap if I scan everything.
Also, I can replace the emptyLeafs bitmap with an array or list, but I do not really think it will be big.

Anyway, I propose focusing on the first step.

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 19/07/18 13:52, Andrey Borodin wrote:
> Hi!
> 
>> On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>> 
>> Yeah, please, I think this is the way to go.
> 
> Here's v11 divided into proposed steps.

Thanks, one quick question:

>             /* We should not unlock buffer if we are going to jump left */
>             if (needScan)
>             {
>                 GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
>                 ptr->buffer = buffer;
>                 ptr->next = bufferStack;
>                 bufferStack = ptr;
>             }
>             else
>                 UnlockReleaseBuffer(buffer);

Why? I don't see any need to keep the page locked, when we "jump left".

- Heikki


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 19/07/18 14:42, Andrey Borodin wrote:
> 
> On 19.07.2018, 15:20, "Heikki Linnakangas" <hlinnaka@iki.fi> wrote:
>>
>> On 19/07/18 13:52, Andrey Borodin wrote:
>>
>>      Hi!
>>
>>          On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka@iki.fi>
>>          wrote:
>>
>>          Yeah, please, I think this is the way to go.
>>
>>
>>      Here's v11 divided into proposed steps.
>>
>>
>> Thanks, one quick question:
>>
>>                              /* We should not unlock buffer if we are going to
>>     jump left */
>>                              if (needScan)
>>                              {
>>                                      GistBDItem *ptr = (GistBDItem *)
>>     palloc(sizeof(GistBDItem));
>>                                      ptr->buffer = buffer;
>>                                      ptr->next = bufferStack;
>>                                      bufferStack = ptr;
>>                              }
>>                              else
>>                                      UnlockReleaseBuffer(buffer);
>>
>>
>> Why? I don't see any need to keep the page locked, when we "jump left".
>>
> Because it can split to the left again, given that we release the lock.

Hmm. So, while we are scanning the right sibling, which was moved to 
lower-numbered block because of a concurrent split, the original page is 
split again? That's OK, we've already scanned all the tuples on the 
original page, before we recurse to deal with the right sibling. (The 
corresponding B-tree code also releases the lock on the original page 
when recursing)

I did some refactoring, to bring this closer to the B-tree code, for the 
sake of consistency. See attached patch. This also eliminates the 2nd 
pass by gistvacuumcleanup(), in case we did that in the bulkdelete-phase 
already.

There was one crucial thing missing: in the outer loop, we must ensure 
that we scan all pages, even those that were added after the vacuum 
started. There's a comment explaining that in btvacuumscan(). This 
version fixes that.
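
For reference, the btvacuumscan() pattern re-checks the relation length on every outer iteration, so pages added
during the scan are visited too; a simplified sketch of the same loop for GiST (gistvacuumpage() here is a hypothetical
per-page helper):

    BlockNumber blkno = GIST_ROOT_BLKNO;

    for (;;)
    {
        /* Re-read the length: the index may have grown while we scanned. */
        BlockNumber num_pages = RelationGetNumberOfBlocks(rel);

        if (blkno >= num_pages)
            break;
        for (; blkno < num_pages; blkno++)
            gistvacuumpage(&vstate, blkno, blkno);
    }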

I haven't done any testing on this. Do you have any test scripts you 
could share? I think we need some repeatable tests for the concurrent 
split cases. Even if it involves gdb or some other hacks that we can't 
include in the regression test suite, we need something now, while we're 
hacking on this.

One subtle point, that I think is OK, but gave me a pause, and probably 
deserves comment somewhere: A concurrent root split can turn a leaf page 
into one internal (root) page, and two new leaf pages. The new root page 
is placed in the same block as the old page, while both new leaf pages 
go to freshly allocated blocks. If that happens while vacuum is running, 
might we miss the new leaf pages? As the code stands, we don't do the 
"follow-right" dance on internal pages, so we would not recurse into the 
new leaf pages. At first, I thought that's a problem, but I think we can 
get away with it. The only scenario where a root split happens on a leaf 
page, is when the index has exactly one page, a single leaf page. Any 
subsequent root splits will split an internal page rather than a leaf 
page, and we're not bothered by those. In the case that a root split 
happens on a single-page index, we're OK, because we will always scan 
that page either before, or after the split. If we scan the single page 
before the split, we see all the leaf tuples on that page. If we scan 
the single page after the split, it means that we start the scan after 
the split, and we will see both leaf pages as we continue the scan.

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 19 July 2018, at 16:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. So, while we are scanning the right sibling, which was moved to lower-numbered block because of a concurrent
split, the original page is split again? That's OK, we've already scanned all the tuples on the original page, before
we recurse to deal with the right sibling. (The corresponding B-tree code also releases the lock on the original page
when recursing)
Seems right.

>
> I did some refactoring, to bring this closer to the B-tree code, for the sake of consistency. See attached patch.
This also eliminates the 2nd pass by gistvacuumcleanup(), in case we did that in the bulkdelete-phase already.
Thanks!

>
> There was one crucial thing missing: in the outer loop, we must ensure that we scan all pages, even those that were
added after the vacuum started.
Correct. Quite a neat logic behind the order of acquiring npages, comparing, and vacuuming a page. The notes in FIXME
look correct, except the function names.

> There's a comment explaining that in btvacuumscan(). This version fixes that.
>
> I haven't done any testing on this. Do you have any test scripts you could share?
I use just simple tests that set up replication and do random inserts and vacuums. Not rocket science, just a mutated
script:
for i in $(seq 1 12); do
size=$((100 * 2**$i))
./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "drop table x;"
./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "insert into x select cube(random()) c from generate_series(1,$size) y;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "VACUUM FULL x;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "drop table x;"
done

> I think we need some repeatable tests for the concurrent split cases.
It is hard to trigger left splits until we delete pages. I'll try to hack gistNewBuffer() to cause something similar.

> Even if it involves gdb or some other hacks that we can't include in the regression test suite, we need something
now, while we're hacking on this.
>
> One subtle point, that I think is OK, but gave me a pause, and probably deserves comment somewhere: A concurrent
root split can turn a leaf page into one internal (root) page, and two new leaf pages. The new root page is placed in
the same block as the old page, while both new leaf pages go to freshly allocated blocks. If that happens while vacuum
is running, might we miss the new leaf pages? As the code stands, we don't do the "follow-right" dance on internal
pages, so we would not recurse into the new leaf pages. At first, I thought that's a problem, but I think we can get
away with it. The only scenario where a root split happens on a leaf page, is when the index has exactly one page, a
single leaf page. Any subsequent root splits will split an internal page rather than a leaf page, and we're not
bothered by those. In the case that a root split happens on a single-page index, we're OK, because we will always scan
that page either before, or after the split. If we scan the single page before the split, we see all the leaf tuples
on that page. If we scan the single page after the split, it means that we start the scan after the split, and we will
see both leaf pages as we continue the scan.
Yes, only page 0 may change type, and page 0 cannot split to the left.


I'm working on triggering a left split during vacuum. Will get back when done. Thanks!

Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 19 July 2018, at 23:26, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I'm working on triggering a left split during vacuum. Will get back when done. Thanks!

Here's a patch including some messy hacks to trigger NSN and FollowRight jumps during VACUUM.

To trigger FollowRight, GiST sometimes "forgets" to clear the follow-right marker, simulating a crash of an insert.
This fills the logs with "fixing incomplete split" messages. Search for "REMOVE THIS" to disable these ill-behavior
triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.

gistvacuumcleanup() was constantly generating left jumps because 0 was used instead of the real start NSN; I moved the
NSN acquisition to gistvacuumscan(). Also fixed some comments.

gistvacuumcleanup() will have the same effect as gistbulkdelete(); is that OK?

To reproduce left-jumps, run ./rescantest.sh.
The script contains variables for my local paths.


Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 21 July 2018, at 17:11, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>

Just in case, here's the second part of the patch series, with actual page deletion.

I was considering further decreasing the memory footprint by using bloom filters instead of bitmaps, but that would
create seriously more work for the CPU to compute hashes.

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Thomas Munro
Date:
On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On 21 July 2018, at 17:11, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>
> Just in case, here's second part of patch series with actual page deletion.

Hi Andrey,

Cfbot says:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146

That's because you declared a new variable after some other
statements.  You can't do that in old school C89.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951

That segfaulted here:

#0 0x00000000004ab620 in setinternal (state=<optimized out>,
state=<optimized out>, blkno=368) at gistvacuum.c:44
44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);

internalPagesMap was NULL, or pointed to memory that was too small and
happened to be near an unmapped region (unlikely?), or was some other
corrupted address?

--
Thomas Munro
http://www.enterprisedb.com


Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

Thank you!

> On 29 July 2018, at 14:04, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>> On 21 July 2018, at 17:11, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>>
>> Just in case, here's second part of patch series with actual page deletion.
>
> Hi Andrey,
>
> Cfbot says:
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146
>
> That's because you declared a new variable after some other
> statements.  You can't do that in old school C89.
Yep, I mismerged the patch steps and created a misplaced declaration.

> https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951
>
> That segfaulted here:
>
> #0 0x00000000004ab620 in setinternal (state=<optimized out>,
> state=<optimized out>, blkno=368) at gistvacuum.c:44
> 44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
>
> internalPagesMap was NULL, or pointed to memory that was too small and
> happened to be near an unmapped region (unlikely?), or was some other
> corrupted address?
Yes, there was a conditionally uninitialized variable, mapNumPages. I do not know why it didn't trigger a failure last
time, but now it was reproduced on my machine reliably.

Fixed both problems. PFA v14.

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 29/07/18 14:47, Andrey Borodin wrote:
> Fixed both problems. PFA v14.

Thanks, took a really quick look at this.

The text being added to README is outdated for these latest changes.

> In the second step I still use palloc'ed memory, but only to store two
> bitmaps: a bitmap of internal pages and a bitmap of empty leaves. The second
> physical scan only reads internal pages. I can omit that bitmap if
> I scan everything. Also, I can replace the emptyLeafs bitmap with
> an array or list, but I do not really think it will be big.

On a typical GiST index, what's the ratio of leaf vs. internal pages? 
Perhaps an array would indeed be better. If you have a really large 
index, the bitmaps can take a fair amount of memory, on top of the 
memory used for tracking the dead TIDs. I.e. that memory will be in 
addition to maintenance_work_mem. That's not nice, but I think it's OK 
in practice, and not worth spending too much effort to eliminate. For a 
1 TB index with default 8k block size, the two bitmaps will take 32 MB 
of memory in total. If you're dealing with a database of that size, you 
ought to have some memory to spare. But if an array would use less 
memory, that'd be better.

If you go with bitmaps, please use the existing Bitmapset instead of 
rolling your own. Saves some code, and it provides more optimized 
routines for iterating through all the set bits, too 
(bms_next_member()). Another possibility would be to use Tidbitmap, in 
the "lossy" mode, i.e. add the pages with tbm_add_page(). That might 
save some memory, compared to Bitmapset, if the bitmap is very sparse. 
Not sure how it compares with a plain array.
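
For concreteness, the Bitmapset calls in question (bms_add_member() and bms_next_member() are the existing API from
nodes/bitmapset.h):

    Bitmapset  *internalPages = NULL;
    int         member;

    /* While scanning: remember each internal page. */
    internalPages = bms_add_member(internalPages, (int) blkno);

    /* Afterwards: visit all remembered pages in ascending block order. */
    member = -1;
    while ((member = bms_next_member(internalPages, member)) >= 0)
    {
        BlockNumber blk = (BlockNumber) member;

        /* ... process internal page blk ... */
    }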

A straightforward little optimization would be to skip scanning the 
internal pages, when the first scan didn't find any empty pages. And 
stop the scan of the internal pages as soon as all the empty pages have 
been recycled.

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi! Thanks for looking into the patch!

> On 30 July 2018, at 18:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 29/07/18 14:47, Andrey Borodin wrote:
>> Fixed both problems. PFA v14.
>
> Thanks, took a really quick look at this.
>
> The text being added to README is outdated for these latest changes.
Fixed.
>
>> In the second step I still use palloc'ed memory, but only to store two
>> bitmaps: a bitmap of internal pages and a bitmap of empty leaves. The second
>> physical scan only reads internal pages. I can omit that bitmap if
>> I scan everything. Also, I can replace the emptyLeafs bitmap with
>> an array or list, but I do not really think it will be big.
>
> On a typical GiST index, what's the ratio of leaf vs. internal pages? Perhaps an array would indeed be better.
A typical GiST has around 200 tuples per internal page. I've switched to a List since it's more efficient than a bitmap.
> If you have a really large index, the bitmaps can take a fair amount of memory, on top of the memory used for
tracking the dead TIDs. I.e. that memory will be in addition to maintenance_work_mem. That's not nice, but I think
it's OK in practice, and not worth spending too much effort to eliminate. For a 1 TB index with default 8k block size,
the two bitmaps will take 32 MB of memory in total. If you're dealing with a database of that size, you ought to have
some memory to spare. But if an array would use less memory, that'd be better.

>
> If you go with bitmaps, please use the existing Bitmapset instead of rolling your own. Saves some code, and it
provides more optimized routines for iterating through all the set bits, too (bms_next_member()). Another possibility
would be to use Tidbitmap, in the "lossy" mode, i.e. add the pages with tbm_add_page(). That might save some memory,
compared to Bitmapset, if the bitmap is very sparse. Not sure how it compares with a plain array.
Yeah, I've stopped reinventing that bicycle. But I have to note that the default growth strategy of a Bitmapset is not
good: we will be repallocing byte by byte.

>
> A straightforward little optimization would be to skip scanning the internal pages, when the first scan didn't find
any empty pages. And stop the scan of the internal pages as soon as all the empty pages have been recycled.
Done.

PFA v15.

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 31/07/18 23:06, Andrey Borodin wrote:
>> On a typical GiST index, what's the ratio of leaf vs. internal
>> pages? Perhaps an array would indeed be better.
 >
> Typical GiST has around 200 tuples per internal page. I've switched
> to List since it's more efficient than bitmap.

Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
bytes per internal page in total, while a bitmap consumes one bit per 
page, leaf or internal. If I'm doing
my math right, assuming the ratio of leaf pages vs internal pages is
1:200, a List actually consumes more memory than a bitmap; 256 bits per
internal page, vs. 200 bits. An array, with 4 bytes per block number,
would be the winner here.

> But I have to note that the default growth strategy of a Bitmapset is not 
> good: we will be repallocing byte by byte.

True, that repallocing seems bad. You could force it to be pre-allocated
by setting the last bit. Or add a function to explicitly enlarge the bitmap.
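
For example (a sketch; max_blkno stands in for however the caller knows the index size in advance):

/* Touch the highest possible member first, so the set is allocated at
 * full size in one go; later bms_add_member() calls then never repalloc.
 * The sentinel bit must be ignored (or cleared) by readers. */
Bitmapset *pages = bms_add_member(NULL, (int) max_blkno);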

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 5 Aug 2018, at 16:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 31/07/18 23:06, Andrey Borodin wrote:
>>> On a typical GiST index, what's the ratio of leaf vs. internal
>>> pages? Perhaps an array would indeed be better.
> >
>> Typical GiST has around 200 tuples per internal page. I've switched
>> to List since it's more efficient than bitmap.
>
> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
> bytes per internal page in total, while a bitmap consumes one bit per page, leaf or internal. If I'm doing
> my math right, assuming the ratio of leaf pages vs internal pages is
> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
> internal page, vs. 200 bits. An array, with 4 bytes per block number,
> would be the winner here.
We do not know the size of this array beforehand. I can implement a normal ArrayList though (with a repallocing array) or
a linked list of chunks. Or anything from the data structures zoo.
Or just stick with the bitmap (my preferred way).
>
>> But I have to note that default growth strategy of Bitmap is not good: we will be repallocing byte by byte.
>
> True, that repallocing seems bad. You could force it to be pre-allocated
> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
OK, I'll think of a proper resize function (ensure capacity, to be precise).

Best regards, Andrey Borodin.



Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

PFA v16.

> On 5 Aug 2018, at 21:45, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On 5 Aug 2018, at 16:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
>> bytes per internal page in total, while a bitmap consumes one bit per page, leaf or internal. If I'm doing
>> my math right, assuming the ratio of internal to leaf pages is
>> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
>> internal page, vs. 200 bits. An array, with 4 bytes per block number,
>> would be the winner here.
> We do not know the size of this array beforehand. I can implement a normal ArrayList though (with a repallocing array) or
> a linked list of chunks. Or anything from the data structures zoo.
> Or just stick with the bitmap (my preferred way).
Done.
>>
>>> But I have to note that default growth strategy of Bitmap is not good: we will be repallocing byte by byte.
>>
>> True, that repallocing seems bad. You could force it to be pre-allocated
>> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
> OK, I'll think of proper resize function (ensure capacity, to be precise).
Done. Added function bms_make_empty(int size)

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Michael Paquier
Date:
On Mon, Aug 06, 2018 at 11:12:00PM +0500, Andrey Borodin wrote:
> Done. Added function bms_make_empty(int size)

Andrey, your latest patch does not apply.  I am moving this to the next
CF, waiting for your input.
--
Michael

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi everyone!

> On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
> Andrey, your latest patch does not apply.  I am moving this to the next
> CF, waiting for your input.

I'm doing preps for CF.
Here's rebased version.

Best regards, Andrey Borodin.
 


Attachments

Re: GiST VACUUM

From
Dmitry Dolgov
Date:
> On Sun, Oct 28, 2018 at 6:32 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi everyone!
>
> > On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
> > Andrey, your latest patch does not apply.  I am moving this to the next
> > CF, waiting for your input.
>
> I'm doing preps for CF.
> Here's rebased version.

Looks like this patch has been waiting for input since August. Could any of the
reviewers (Heikki?) please take a look at the latest version? In the meantime
I'm moving it to the next CF.


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 28/10/2018 19:32, Andrey Borodin wrote:
> Hi everyone!
> 
>> On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
>> Andrey, your latest patch does not apply.  I am moving this to the next
>> CF, waiting for your input.
> 
> I'm doing preps for CF.
> Here's rebased version.

Thanks, I had another look at these.

In patch #1, to do the vacuum scan in physical order:

* The starting NSN was not acquired correctly for unlogged and temp 
relations. They don't use WAL, so their NSN values are based on the 
'unloggedLSN' counter, rather than current WAL insert pointer. So must 
use gistGetFakeLSN() rather than GetInsertRecPtr() for them. Fixed that.

* Adjusted comments a little bit, mostly by copy-pasting the 
better-worded comments from the corresponding nbtree code, and ran pgindent.

I think this is ready to be committed, except that I didn't do any 
testing. We discussed the need for testing earlier. Did you write some 
test scripts for this, or how have you been testing?


Patch #2:

* Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. 
So this will fail with an index larger than 2^31 blocks. That's perhaps 
academical, I don't think anyone will try to create a 16 TB GiST index 
any time soon. But it feels wrong to introduce an arbitrary limitation 
like that.

* I was surprised that the bms_make_empty() function doesn't set the 
'nwords' field to match the allocated size. Perhaps that was 
intentional, so that you don't need to scan the empty region at the end, 
when you scan through all matching bits? Still, seems surprising, and 
needs a comment at least.

* We're now scanning all internal pages in the 2nd phase. Not only those 
internal pages that contained downlinks to empty leaf pages. That's 
probably OK, but the comments need updating on that.

* In general, comments need to be fixed and added in several places. For 
example, there's no clear explanation on what the "delete XID", stored 
in pd_prune_xid, means. (I know what it is, because I'm familiar with 
the same mechanism in B-tree, but it's not clear from the code itself.)

These can be fixed, they're not show-stoppers, but patch #2 isn't quite 
ready yet.

- Heikki


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 02/01/2019 17:35, Heikki Linnakangas wrote:
> On 28/10/2018 19:32, Andrey Borodin wrote:
>> Hi everyone!
>>
>>> On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
>>> Andrey, your latest patch does not apply.  I am moving this to the next
>>> CF, waiting for your input.
>>
>> I'm doing preps for CF.
>> Here's rebased version.
> 
> Thanks, I had another look at these.

Here are new patch versions, with the fixes I mentioned. Forgot to 
attach these earlier.

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
Cool, thanks!

> On 2 Jan 2019, at 20:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> In patch #1, to do the vacuum scan in physical order:
> ...
> I think this is ready to be committed, except that I didn't do any testing. We discussed the need for testing
> earlier. Did you write some test scripts for this, or how have you been testing?
Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch

To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, simulating a crash of an insert. This fills
logs with "fixing incomplete split" messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger the NSN jump, GiST allocates an empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
WARNING:  RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).

>
> Patch #2:

>
> * Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So this will fail with an index larger than
> 2^31 blocks. That's perhaps academical, I don't think anyone will try to create a 16 TB GiST index any time soon. But it
> feels wrong to introduce an arbitrary limitation like that.
Looks like Bitmapset is unsuitable for collecting block numbers due to the interface. Let's just create a custom
container for this?
Or there is one other option: for each block number throw away the sign bit and consider the page potentially internal and
potentially an empty leaf if (blkno & 0x7FFFFFFF) is in the bitmapset.
And then we will have to create at least one 17 TB GiST to check it actually works.
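
A hedged sketch of that sign-bit folding; since two block numbers can collide after masking, a hit only means "potentially an empty leaf" and the page itself must be re-checked:

static bool
blkno_possibly_in_set(const Bitmapset *bs, BlockNumber blkno)
{
    /* fold blkno into the non-negative range a Bitmapset can store */
    return bms_is_member((int) (blkno & 0x7FFFFFFF), bs);
}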

> * I was surprised that the bms_make_empty() function doesn't set the 'nwords' field to match the allocated size.
> Perhaps that was intentional, so that you don't need to scan the empty region at the end, when you scan through all
> matching bits? Still, seems surprising, and needs a comment at least.
Explicitly set nwords to zero. Looking at the code now, I do not see this nwords==0 as a very good idea. Probably it's
effective, but it's hacky and creates implicit expectations in the code.

> * We're now scanning all internal pages in the 2nd phase. Not only those internal pages that contained downlinks to
> empty leaf pages. That's probably OK, but the comments need updating on that.
Adjusted comments. That 'if' before the loop
> if (vstate.emptyPages > 0)
seems superfluous. But I kept it until we solve the problem with the 31-bit Bitmapset.
> * In general, comments need to be fixed and added in several places. For example, there's no clear explanation on
whatthe "delete XID", stored in pd_prune_xid, means. (I know what it is, because I'm familiar with the same mechanism
inB-tree, but it's not clear from the code itself.) 
I've added a comment to GistPageSetDeleteXid().

* In this check
> if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
I've switched from RecentGlobalDataXmin to RecentGlobalXmin, because we have done so in similar mechanics for GIN (for
uniformity with B-tree).


Thanks for working on this!


Best regards, Andrey Borodin.



Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
On 3 Jan 2019, at 23:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>> * Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So this will fail with an index larger than
>> 2^31 blocks. That's perhaps academical, I don't think anyone will try to create a 16 TB GiST index any time soon. But it
>> feels wrong to introduce an arbitrary limitation like that.
> Looks like Bitmapset is unsuitable for collecting block numbers due to the interface. Let's just create a custom
> container for this?
> Or there is one other option: for each block number throw away the sign bit and consider the page potentially internal and
> potentially an empty leaf if (blkno & 0x7FFFFFFF) is in the bitmapset.
> And then we will have to create at least one 17 TB GiST to check it actually works.

Heikki, what do you think: is implementing our own radix tree for this a viable solution?
I've written a working implementation with a 4-level statically typed tree. If we follow this route, there should probably
be tests for this data structure.

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Michael Paquier
Date:
On Fri, Jan 04, 2019 at 06:26:18PM +0500, Andrey Borodin wrote:
> Heikki, what do you think: is implementing our own radix tree for
> this a viable solution?
> I've written working implementation with 4-level statically typed
> tree. If we follow this route, probably, there must be tests for
> this data structure.

(Note that the latest patch does not apply.)
Heikki, are you planning to answer the questions and/or review the
patch?  I have moved it to the next CF for now.
--
Michael

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 04/01/2019 21:26, Andrey Borodin wrote:
> On 3 Jan 2019, at 23:47, Andrey Borodin <x4mmm@yandex-team.ru>
> wrote:
>> 
>>> * Bitmapset stores 32 bit signed integers, but BlockNumber is
>>> unsigned. So this will fail with an index larger than 2^31
>>> blocks. That's perhaps academical, I don't think anyone will try
>>> to create a 16 TB GiST index any time soon. But it feels wrong to
>>> introduce an arbitrary limitation like that.
>> 
>> Looks like bitmapset is unsuitable for collecting block numbers due
>> to the interface. Let's just create custom container for this? Or
>> there is one other option: for each block number throw away sign
>> bit and consider page potentially internal and potentially empty
>> leaf if (blkno & 0x7FFFFFFF) is in bitmapset. And then we will have
>> to create at least one 17Tb GiST to check it actually works.
> 
> Heikki, what do you think: is implementing our own radix tree for this
> a viable solution? I've written a working implementation with a 4-level
> statically typed tree. If we follow this route, probably, there must
> be tests for this data structure.

Yeah, seems reasonable.

- Heikki


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 04/01/2019 02:47, Andrey Borodin wrote:
>> On 2 Jan 2019, at 20:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> In patch #1, to do the vacuum scan in physical order:
>> ...
>> I think this is ready to be committed, except that I didn't do any testing. We discussed the need for testing
>> earlier. Did you write some test scripts for this, or how have you been testing?
 
> Please see test I used to check left jumps for v18:
> 0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
> 0002-Test-left-jumps-v18.patch
> 
> To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, simulating a crash of an insert. This fills
> logs with "fixing incomplete split" messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
 
> To trigger the NSN jump, GiST allocates an empty page after every real allocation.
> 
> In log output I see
> 2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
> WARNING:  RESCAN TRIGGERED BY NSN
> 2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
> This means that code paths were really executed (for v18).

Thanks! As I noted at 
https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi, 
the test patch left the index corrupt. I fixed it so that it leaves 
behind incompletely split pages, without the corruption, see attached. 
It's similar to yours, but let me recap what it does:

* Hacks gistbuild() to create 100 empty pages immediately after the root
page. They are leaked, so they won't be reused until a VACUUM sees
them and puts them in the FSM

* Hacks gistinserttuples() to leave the split incomplete with 50%
probability

* In vacuum, print a line to the log whenever it needs to "jump left"

I used this patch, with the attached test script that's similar to 
yours, but it also tries to verify that the index returns correct 
results. It prints a result set like this:

    sum
---------
  -364450
   364450
(2 rows)

If the two numbers add up to 0, the index seems valid. And you should 
see "RESCAN" lines in the log, to show that jumping left happened. 
Because the behavior is random and racy, you may need to run the script 
many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED 
BY FollowRight" cases. Especially the "FollowRight" case happens less 
frequently than the NSN case, you might need to run the script > 10 
times to see it.

I also tried your amcheck tool with this. It did not report any errors.

Attached is also latest version of the patch itself. It is the same as 
your latest patch v19, except for some tiny comment kibitzing. I'll mark 
this as Ready for Committer in the commitfest app, and will try to 
commit it in the next couple of days.

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

Thanks for fixing gist amcheck! The idea of using these two patches together seems so obvious now, but it never actually
crossed my mind before.

> On 4 Mar 2019, at 18:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Thanks! As I noted at https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi, the test
> patch left the index corrupt. I fixed it so that it leaves behind incompletely split pages, without the corruption, see
> attached. It's similar to yours, but let me recap what it does:
>
> * Hacks gistbuild() to create 100 empty pages immediately after the root page. They are leaked, so they won't be
> reused until a VACUUM sees them and puts them in the FSM
>
> * Hacks gistinserttuples() to leave the split incomplete with 50% probability
>
> * In vacuum, print a line to the log whenever it needs to "jump left"
>
> I used this patch, with the attached test script that's similar to yours, but it also tries to verify that the index
> returns correct results. It prints a result set like this:
>
>   sum
> ---------
> -364450
>  364450
> (2 rows)
>
> If the two numbers add up to 0, the index seems valid. And you should see "RESCAN" lines in the log, to show that
> jumping left happened. Because the behavior is random and racy, you may need to run the script many times to see both
> "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED BY FollowRight" cases. Especially since the "FollowRight" case happens less
> frequently than the NSN case, you might need to run the script > 10 times to see it.
Great! I've repeated your tests on my machine, and I observe similar frequencies of the causes of rescan left jumps.

> I also tried your amcheck tool with this. It did not report any errors.
>
> Attached is also the latest version of the patch itself. It is the same as your latest patch v19, except for some tiny
> comment kibitzing. I'll mark this as Ready for Committer in the commitfest app, and will try to commit it in the next
> couple of days.

That's cool! I'll work on the 2nd step of this patch set to make the blockset data structure prettier and less hacky.

Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 05/03/2019 02:26, Andrey Borodin wrote:
>> I also tried your amcheck tool with this. It did not report any 
>> errors.
>> 
>> Attached is also latest version of the patch itself. It is the
>> same as your latest patch v19, except for some tiny comment
>> kibitzing. I'll mark this as Ready for Committer in the commitfest
>> app, and will try to commit it in the next couple of days.
> 
> That's cool! I'll work on 2nd step of these patchset to make
> blockset data structure prettier and less hacky.

Committed the first patch. Thanks for the patch!

I did some last minute copy-editing on the comments, and fixed one 
little thing that we missed earlier: the check for "invalid tuples" that 
might be left over in pg_upgraded pre-9.1 indexes, was lost at some 
point. I added that check back. (It would be nice if GiST indexes had a 
metadata page with a version number, so we could avoid that work in the 
99% of cases where that check is not needed, but that's a different story.)

I'll change the status of this patch to "Waiting on Author", to reflect 
the state of the 2nd patch, since you're working on the radix tree 
blockset stuff.

- Heikki


Re: GiST VACUUM

From
Andrey Borodin
Date:
> On 5 Mar 2019, at 18:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 05/03/2019 02:26, Andrey Borodin wrote:
>>
>> That's cool! I'll work on 2nd step of these patchset to make
>> blockset data structure prettier and less hacky.
>
> Committed the first patch. Thanks for the patch!

That's cool! Thanks!

> I'll change the status of this patch to "Waiting on Author", to reflect the state of the 2nd patch, since you're
> working on the radix tree blockset stuff.

Here's a new version of the patch for actual page deletion.
Changes:
1. Fixed possible concurrency issue:
We were locking the child page while holding a lock on the internal page. Notes in the GiST README recommend locking the child before the
parent.
Thus we now unlock the internal page before locking children for deletion, then lock it again, check that everything is
still in its place, and only then delete.
One thing still bothers me. Let's assume that we have an internal page with 2 deletable leaves. We lock these leaves in
the order of items on the internal page. Is it possible that the 2nd page has a follow-right link to the 1st, and someone will lock the 2nd
page, try to lock the 1st, and deadlock with VACUUM?
2. Added radix-tree-based block set to lib, with tests.

Best regards, Andrey Borodin.

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 10/03/2019 18:40, Andrey Borodin wrote:
> Here's new version of the patch for actual page deletion.
> Changes:
> 
> 1. Fixed possible concurrency issue:
> 
> We were locking child page while holding lock on internal page. Notes
> in GiST README recommend locking child before parent. Thus now we
> unlock internal page before locking children for deletion, and lock
> it again, check that everything is still in its place and delete
> only then.

Good catch. The implementation is a bit broken, though. This segfaults:

create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);

insert into gist_point_tbl (id, p)
   select g,  point((1+g) % 100, (1+g) % 100) from generate_series(1, 
10000) g;
insert into gist_point_tbl (id, p)
   select -g, point(1000, 1000) from generate_series(1, 10000) g;

delete from gist_point_tbl where id >= 0;
vacuum verbose gist_point_tbl;

BTW, we don't seem to have test coverage for this in the regression 
tests. The tests that delete some rows, where you updated the comment, 
don't actually seem to produce any empty pages that could be deleted.

> One thing still bothers me. Let's assume that we have internal page
> with 2 deletable leaves. We lock these leaves in order of items on
> internal page. Is it possible that 2nd page have follow-right link on
> 1st and someone will lock 2nd page, try to lock 1st and deadlock with
> VACUUM?

Hmm. If the follow-right flag is set on a page, it means that its right 
sibling doesn't have a downlink in the parent yet. Nevertheless, I think 
I'd sleep better, if we acquired the locks in left-to-right order, to be 
safe.

An easy cop-out would be to use LWLockConditionalAcquire, and bail out 
if we can't get the lock. But if it's not too complicated, it'd be nice 
to acquire the locks in the correct order to begin with.
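
A sketch of that cop-out, using the buffer-level conditional-lock wrapper (the surrounding variable names are assumptions):

buffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafblkno, RBM_NORMAL,
                            info->strategy);
if (!ConditionalLockBuffer(buffer))
{
    /* someone else holds the lock; leave this page for a later VACUUM */
    ReleaseBuffer(buffer);
    return false;
}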

I did a round of cleanup and moving things around, before I bumped into 
the above issue. I'm including them here as separate patches, for easier 
review, but they should of course be squashed into yours. For 
convenience, I'm including your patches here as well, so that all the 
patches you need are in the same place, but they are identical to what 
you posted.

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 11 Mar 2019, at 20:03, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 10/03/2019 18:40, Andrey Borodin wrote:
>> Here's new version of the patch for actual page deletion.
>> Changes:
>> 1. Fixed possible concurrency issue:
>> We were locking child page while holding lock on internal page. Notes
>> in GiST README recommend locking child before parent. Thus now we
>> unlock internal page before locking children for deletion, and lock
>> it again, check that everything is still on it's place and delete
>> only then.
>
> Good catch. The implementation is a bit broken, though. This segfaults:
Sorry for the noise. A few lines of old code leaked into the patch between testing and mailing.

> BTW, we don't seem to have test coverage for this in the regression tests. The tests that delete some rows, where you
> updated the comment, don't actually seem to produce any empty pages that could be deleted.
I've updated the test to delete more rows. But it did not trigger the previous bug anyway; we had to delete everything for this
case.

>
>> One thing still bothers me. Let's assume that we have internal page
>> with 2 deletable leaves. We lock these leaves in order of items on
>> internal page. Is it possible that 2nd page have follow-right link on
>> 1st and someone will lock 2nd page, try to lock 1st and deadlock with
>> VACUUM?
>
> Hmm. If the follow-right flag is set on a page, it means that its right sibling doesn't have a downlink in the parent
> yet. Nevertheless, I think I'd sleep better, if we acquired the locks in left-to-right order, to be safe.
Actually, I did not find lock coupling in the GiST code. But I decided to lock just two pages at once (leaf, then parent,
for every pair). PFA v22 with this concurrency logic.

>
> An easy cop-out would be to use LWLockConditionalAcquire, and bail out if we can't get the lock. But if it's not too
> complicated, it'd be nice to acquire the locks in the correct order to begin with.
>
> I did a round of cleanup and moving things around, before I bumped into the above issue. I'm including them here as
> separate patches, for easier review, but they should of course be squashed into yours. For convenience, I'm including
> your patches here as well, so that all the patches you need are in the same place, but they are identical to what you
> posted.
I've rebased all these patches so that VACUUM should work on every commit. Let's just squash them for the next
iteration; it was quite a messy rebase.
BTW, you deleted numEmptyPages. This makes the code cleaner, but this variable could stop gistvacuum_recycle_pages() when
everything is recycled already.


Thanks!

Best regards, Andrey Borodin.


Attachments

Re: GiST VACUUM

From
Jeff Janes
Date:
On Tue, Mar 5, 2019 at 8:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 05/03/2019 02:26, Andrey Borodin wrote:
>> I also tried your amcheck tool with this. It did not report any
>> errors.
>>
>> Attached is also latest version of the patch itself. It is the
>> same as your latest patch v19, except for some tiny comment
>> kibitzing. I'll mark this as Ready for Committer in the commitfest
>> app, and will try to commit it in the next couple of days.
>
> That's cool! I'll work on 2nd step of these patchset to make
> blockset data structure prettier and less hacky.

Committed the first patch. Thanks for the patch!

Thank you.  This is a transformational change; it will allow GiST indexes larger than RAM to be used in some cases where they were simply not feasible to use before.  On a HDD, it resulted in a 50 fold improvement in vacuum time, and the machine went from unusably unresponsive to merely sluggish during the vacuum.  On a SSD (albeit a very cheap laptop one, and exposed from Windows host to Ubuntu over VM Virtual Box) it is still a 30 fold improvement, from a far faster baseline.  Even on an AWS instance with a "GP2" SSD volume, which normally shows little benefit from sequential reads, I get a 3 fold speed up.

I also ran this through a lot of crash-recovery testing using simulated torn-page writes using my traditional testing harness with high concurrency (AWS c4.4xlarge and a1.4xlarge using 32 concurrent update processes) and did not encounter any problems.  I tested both with btree_gist on a scalar int, and on tsvector with each tsvector having 101 tokens.

I did notice that the space freed up in the index by vacuum doesn't seem to get re-used very efficiently, but that is an ancestral problem independent of this change.

Cheers,

Jeff

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 15/03/2019 20:25, Andrey Borodin wrote:
>> On 11 Mar 2019, at 20:03, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>> 
>> On 10/03/2019 18:40, Andrey Borodin wrote:
>>> One thing still bothers me. Let's assume that we have internal
>>> page with 2 deletable leaves. We lock these leaves in order of
>>> items on internal page. Is it possible that 2nd page have
>>> follow-right link on 1st and someone will lock 2nd page, try to
>>> lock 1st and deadlock with VACUUM?
>> 
>> Hmm. If the follow-right flag is set on a page, it means that its
>> right sibling doesn't have a downlink in the parent yet.
>> Nevertheless, I think I'd sleep better, if we acquired the locks in
>> left-to-right order, to be safe.
 >
> Actually, I did not found lock coupling in GiST code. But I decided
> to lock just two pages at once (leaf, then parent, for every pair).
> PFA v22 with this concurrency logic.

Good. I just noticed that the README actually does say explicitly that
the child must be locked before the parent.

I rebased this over the new IntegerSet implementation, from the other 
thread, and did another round of refactoring, cleanups, etc. Attached is 
a new version of this patch. I'm also including the IntegerSet patch 
here, for convenience, but it's the same patch I posted at [1].

It's in pretty good shape, but one remaining issue needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record, when a deleted
page is recycled, to prevent the deletion from being replayed too early 
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
think we need to do something similar in GiST.

I'll try fixing that tomorrow, unless you beat me to it. Making the 
changes is pretty straightforward, but it's a bit cumbersome to test.

[1] 
https://www.postgresql.org/message-id/1035d8e6-cfd1-0c27-8902-40d8d45eb6e8@iki.fi

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 21 Mar 2019, at 2:30, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> one remaining issue that needs to be fixed:
>
> During Hot Standby, the B-tree code writes a WAL reord, when a deleted
> page is recycled, to prevent the deletion from being replayed too early
> in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I
> think we need to do something similar in GiST.
>
> I'll try fixing that tomorrow, unless you beat me to it. Making the
> changes is pretty straightforward, but it's a bit cumbersome to test.

I've tried to deal with it and got stuck... I think we should make the B-tree WAL record for this shared: remove BlockNumber
and other unused stuff, leaving just the xid and db oid.
And reuse this record for B-tree, GiST and GIN (yeah, GIN is not checking for that conflict).

Though, I'm not sure it is important for GIN. The scariest thing that can happen: it will return the same tid twice. But it is
doing a bitmap scan; you cannot return the same bit twice...

Eventually, hash, spgist and others will have the same thing too.

Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 21/03/2019 18:06, Andrey Borodin wrote:
>> On 21 Mar 2019, at 2:30, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote: one remaining issue that needs to be fixed:
>> 
>> During Hot Standby, the B-tree code writes a WAL reord, when a
>> deleted page is recycled, to prevent the deletion from being
>> replayed too early in the hot standby. See _bt_getbuf() and
>> btree_xlog_reuse_page(). I think we need to do something similar in
>> GiST.
>> 
>> I'll try fixing that tomorrow, unless you beat me to it. Making
>> the changes is pretty straightforward, but it's a bit cumbersome to
>> test.
> 
> I've tried to deal with it and stuck...

So, I came up with the attached. I basically copy-pasted the page-reuse 
WAL-logging stuff from nbtree.

When I started testing this, I quickly noticed that empty pages were not 
being deleted nearly as much as I expected. I tracked it to this check 
in gistdeletepage:

> +       if (GistFollowRight(leafPage)
> +               || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
> +       {
> +               /* Don't mess with a concurrent page split. */
> +               return false;
> +       }

That NSN test was bogus. It prevented the leaf page from being reused, 
if the parent page was *ever* split after the leaf page was created. I 
don't see any reason to check the NSN here. The NSN is normally used to 
detect if a (leaf) page has been concurrently split, when you descend 
the tree. We don't need to care about that here; as long as the 
FOLLOW_RIGHT flag is not set, the page has a downlink, and if we can 
find the downlink and the page is empty, we can delete it.
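
In other words, the guard can be reduced to just the split check (a sketch of the corrected form):

if (GistFollowRight(leafPage))
{
    /* Don't mess with a concurrent page split: the page has no
     * downlink in the parent yet, so it cannot be unlinked. */
    return false;
}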

After removing that bogus NSN check, page reuse became much more
effective. I've been testing this by running this test script repeatedly:

----------
/*
create sequence gist_test_seq;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
*/

insert into gist_point_tbl (id, p)
    select nextval('gist_test_seq'), point(nextval('gist_test_seq'), 
1000 + g) from generate_series(1, 10000) g;

delete from gist_point_tbl where id < currval('gist_test_seq') - 20000;
vacuum gist_point_tbl;

select pg_table_size('gist_point_tbl'), pg_indexes_size('gist_point_tbl');
----------

It inserts a bunch of rows, deletes a bunch of older rows, and vacuums. 
The interesting thing here is that the key values keep "moving", so that 
new tuples are added to different places than where old ones are 
removed. That's the case where page reuse is needed.

Before this patch, the index bloated very quickly. With the patch, it 
still bloats, because we still don't delete internal pages. But it's a 
small fraction of the bloat you got before.

Attached is the latest patch version, to be applied on top of the 
IntegerSet patch.

> I think we should make B-tree WAL record for this shared, remove
> BlockNumber and other unused stuff, leaving just xid and db oid. And
> reuse this record for B-tree, GiST and GIN (yeah, it is not checking
> for that conflict).
Good point. For now, I didn't try to generalize this, but perhaps we 
should.

> Though, I'm not sure it is important for GIN. Scariest thing that can
> happen: it will return same tid twice. But it is doing bitmap scan,
> you cannot return same bit twice...

Hmm. Could it return a completely unrelated tuple? We don't always 
recheck the original index quals in a bitmap index scan, IIRC. Also, a 
search might get confused if it's descending down a posting tree, and 
lands on a different kind of a page, altogether.

Alexander, you added the mechanism to GIN recently, to prevent pages 
from being reused too early (commit 52ac6cd2d0). Do we need something 
like B-tree's REUSE_PAGE records in GIN, too, to prevent the same bug 
from happening in hot standby?


PS. for Gist, we could almost use the LSN / NSN mechanism to detect the 
case that a deleted page is reused: Add a new field to the GiST page 
header, to store a new "deleteNSN" field. When a page is deleted, the 
deleted page's deleteNSN is set to the LSN of the deletion record. When 
the page is reused, the deleteNSN field is kept unchanged. When you 
follow a downlink during search, if you see that the page's deleteNSN > 
parent's LSN, you know that it was concurrently deleted and recycled, 
and should be ignored. That would allow reusing deleted pages 
immediately. Unfortunately that would require adding a new field to the 
gist page header/footer, which requires upgrade work :-(. Maybe one day, 
we'll bite the bullet. Something to keep in mind, if we have to change 
the page format anyway, for some reason.
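
For illustration, the descent-time check might look like this; GistPageGetDeleteNSN() is hypothetical, since no such header field exists today:

if (GistPageGetDeleteNSN(childpage) > PageGetLSN(parentpage))
{
    /* the child was deleted and recycled after we read the downlink;
     * ignore it and retry from the parent */
}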

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
> On 22 Mar 2019, at 1:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> ...
> When I started testing this, I quickly noticed that empty pages were not being deleted nearly as much as I expected.
> I tracked it to this check in gistdeletepage:
>
>> +       if (GistFollowRight(leafPage)
>> +               || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
>> +       {
>> +               /* Don't mess with a concurrent page split. */
>> +               return false;
>> +       }
>
> That NSN test was bogus. It prevented the leaf page from being reused, if the parent page was *ever* split after the
> leaf page was created. I don't see any reason to check the NSN here.
That's true. This check made sense only when the parent page was not locked (and it seems like the comparison should be the opposite).
When both pages are locked, this test makes no sense at all. The check was incorrectly "fixed" by me when transitioning from
single-scan delete to two-scan delete during summer 2018. Just wondering how hard it is to simply delete a page...

>> Though, I'm not sure it is important for GIN. Scariest thing that can
>> happen: it will return same tid twice. But it is doing bitmap scan,
>> you cannot return same bit twice...
>
> Hmm. Could it return a completely unrelated tuple?
No, I do not think so, it will do comparisons on posting tree tuples.

> We don't always recheck the original index quals in a bitmap index scan, IIRC. Also, a search might get confused if
> it's descending down a posting tree, and lands on a different kind of a page, altogether.
Yes, the search will return an error, the user will reissue the query, and everything will be almost fine.

> PS. for Gist, we could almost use the LSN / NSN mechanism to detect the case that a deleted page is reused: Add a new
> field to the GiST page header, to store a new "deleteNSN" field. When a page is deleted, the deleted page's deleteNSN is
> set to the LSN of the deletion record. When the page is reused, the deleteNSN field is kept unchanged. When you follow a
> downlink during search, if you see that the page's deleteNSN > parent's LSN, you know that it was concurrently deleted
> and recycled, and should be ignored. That would allow reusing deleted pages immediately. Unfortunately that would
> require adding a new field to the gist page header/footer, which requires upgrade work :-(. Maybe one day, we'll bite
> the bullet. Something to keep in mind, if we have to change the page format anyway, for some reason.
Yeah, the same day we will get rid of invalid tuples. I can make a patch for v13. Actually, I have a lot of patches
that I want in GiST in v13. Or v14.

Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 22/03/2019 10:00, Andrey Borodin wrote:
>> On 22 Mar 2019, at 1:04, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>>
>> PS. for Gist, we could almost use the LSN / NSN mechanism to detect
>> the case that a deleted page is reused: Add a new field to the GiST
>> page header, to store a new "deleteNSN" field. When a page is
>> deleted, the deleted page's deleteNSN is set to the LSN of the
>> deletion record. When the page is reused, the deleteNSN field is
>> kept unchanged. When you follow a downlink during search, if you
>> see that the page's deleteNSN > parent's LSN, you know that it was
>> concurrently deleted and recycled, and should be ignored. That
>> would allow reusing deleted pages immediately. Unfortunately that
>> would require adding a new field to the gist page header/footer,
>> which requires upgrade work :-(. Maybe one day, we'll bite the
>> bullet. Something to keep in mind, if we have to change the page
>> format anyway, for some reason.
>
> Yeah, the same day we will get rid of invalid tuples. I can make a
> patch for v13. Actually, I have a lot of patches that I want in GiST
> in v13. Or v14.

Cool! Here's my wishlist:

* That deleteNSN thing
* Add a metapage to blk #0.
* Add a "level"-field to page header.
* Currently, a search needs to scan all items on a page. If the keys are 
small, that can be pretty slow. Divide each page further into e.g. 4 
sub-pages, with a "bounding box" key for each sub-page, to speed up search.

- Heikki


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 21/03/2019 19:04, Heikki Linnakangas wrote:
> Attached is the latest patch version, to be applied on top of the
> IntegerSet patch.

And committed. Thanks Andrey!

- Heikki


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 22/03/2019 13:37, Heikki Linnakangas wrote:
> On 21/03/2019 19:04, Heikki Linnakangas wrote:
>> Attached is the latest patch version, to be applied on top of the
>> IntegerSet patch.
> 
> And committed. Thanks Andrey!

This caused the buildfarm to go pink... I was able to reproduce it, by 
running the regression tests in one terminal, and repeatedly running 
"VACUUM;" in another terminal. Strange that it seems to happen all the 
time on the buildfarm, but never happened to me locally.

Anyway, I'm investigating.

- Heikki



Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 22 Mar 2019, at 19:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 21/03/2019 19:04, Heikki Linnakangas wrote:
>> Attached is the latest patch version, to be applied on top of the
>> IntegerSet patch.
>
> And committed. Thanks Andrey!
>
> - Heikki

Cool! Thank you very much! At the beginning I could not imagine how many caveats there are.

> On 22 Mar 2019, at 18:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> * Currently, a search needs to scan all items on a page. If the keys are small, that can be pretty slow. Divide each
> page further into e.g. 4 sub-pages, with a "bounding box" key for each sub-page, to speed up search.
BTW, I already have an intra-page indexing patch. But now it obviously needs a rebase :) Along with the gist amcheck patch.

Thanks!

Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 22 Mar 2019, at 17:03, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>

I was working on a new version of the gist check in amcheck and understood one more thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
    return PageIsNew(page) ||
        (GistPageIsDeleted(page) &&
         TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half of the xid cycle. Can we prevent it by
resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)

Best regards, Andrey Borodin.

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 24/03/2019 18:50, Andrey Borodin wrote:
> I was working on new version of gist check in amcheck and understand one more thing:
> 
> /* Can this page be recycled yet? */
> bool
> gistPageRecyclable(Page page)
> {
>      return PageIsNew(page) ||
>          (GistPageIsDeleted(page) &&
>           TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
> }
> 
> Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half of the xid cycle. Can we prevent it by
> resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
 
> (Seems like same applies to GIN...)

True, and B-tree has the same issue. I thought I saw a comment somewhere 
in the B-tree code about that earlier, but now I can't find it. I 
must've imagined it.

We could reset it, but that would require dirtying the page. That would 
be just extra I/O overhead, if the page gets reused before XID 
wraparound. We could avoid that if we stored the full XID+epoch, not 
just XID. I think we should do that in GiST, at least, where this is 
new. In the B-tree, it would require some extra code to deal with 
backwards-compatibility, but maybe it would be worth it even there.
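
A sketch of how the recyclability test could then look; both the 64-bit getter and the 64-bit horizon (recent_global_full_xmin) are assumptions at this point in the thread:

return PageIsNew(page) ||
    (GistPageIsDeleted(page) &&
     FullTransactionIdPrecedes(GistPageGetDeleteXid(page),
                               recent_global_full_xmin));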

- Heikki


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 25/03/2019 15:20, Heikki Linnakangas wrote:
> On 24/03/2019 18:50, Andrey Borodin wrote:
>> I was working on new version of gist check in amcheck and understand one more thing:
>>
>> /* Can this page be recycled yet? */
>> bool
>> gistPageRecyclable(Page page)
>> {
>>       return PageIsNew(page) ||
>>           (GistPageIsDeleted(page) &&
>>            TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
>> }
>>
>> Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half of the xid cycle. Can we prevent it by
>> resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
 
>> (Seems like same applies to GIN...)
> 
> True, and B-tree has the same issue. I thought I saw a comment somewhere
> in the B-tree code about that earlier, but now I can't find it. I
> must've imagined it.
> 
> We could reset it, but that would require dirtying the page. That would
> be just extra I/O overhead, if the page gets reused before XID
> wraparound. We could avoid that if we stored the full XID+epoch, not
> just XID. I think we should do that in GiST, at least, where this is
> new. In the B-tree, it would require some extra code to deal with
> backwards-compatibility, but maybe it would be worth it even there.

I suggest that we do the attached. It fixes this for GiST. The patch
expands the "deletion XID" to 64 bits, and changes where it's
stored. Instead of storing it in pd_prune_xid, it's stored in the page
contents. Luckily, a deleted page has no real content.

I think we should fix this in a similar manner in B-tree, too, but that 
can be done separately. For B-tree, we need to worry about 
backwards-compatibility, but that seems simple enough; we just need to 
continue to understand old deleted pages, where the deletion XID is 
stored in the page opaque field.

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

> On 4 Apr 2019, at 20:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 25/03/2019 15:20, Heikki Linnakangas wrote:
>> On 24/03/2019 18:50, Andrey Borodin wrote:
>>> I was working on new version of gist check in amcheck and understand one more thing:
>>>
>>> /* Can this page be recycled yet? */
>>> bool
>>> gistPageRecyclable(Page page)
>>> {
>>>      return PageIsNew(page) ||
>>>          (GistPageIsDeleted(page) &&
>>>           TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
>>> }
>>>
>>> Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half of the xid cycle. Can we prevent it by
>>> resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
>>> (Seems like same applies to GIN...)
>> True, and B-tree has the same issue. I thought I saw a comment somewhere
>> in the B-tree code about that earlier, but now I can't find it. I
>> must've imagined it.
>> We could reset it, but that would require dirtying the page. That would
>> be just extra I/O overhead, if the page gets reused before XID
>> wraparound. We could avoid that if we stored the full XID+epoch, not
>> just XID. I think we should do that in GiST, at least, where this is
>> new. In the B-tree, it would require some extra code to deal with
>> backwards-compatibility, but maybe it would be worth it even there.
>
> I suggest that we do the attached. It fixes this for GiST. The patch expands the "deletion XID" to 64 bits,
> and changes where it's stored. Instead of storing it in pd_prune_xid, it's stored in the page contents. Luckily, a deleted
> page has no real content.

So, we store the full xid right after the page header?
+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+    Assert(PageIsEmpty(page));
+    ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
+
+    *((FullTransactionId *) PageGetContents(page)) = deletexid;
+}

Usually we leave one ItemId (located at an invalid offset number) untouched. I do not know whether it is done for a reason or
not...


Also, I did not understand this optimization:
+    /*
+     * We can skip this if the page was deleted so long ago, that no scan can possibly
+     * still see it, even in a standby. One measure might be anything older than the
+     * table's frozen-xid, but we don't have that at hand here. But anything older than
+     * 2 billion, from the next XID, is surely old enough, because you would hit XID
+     * wraparound at that point.
+     */
+    nextxid = ReadNextFullTransactionId();
+    diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+    if (diff < 0x7fffffff)
+        return;

A standby can be lagging months behind the primary and, theoretically, close the gap in one sudden WAL leap... Also, I think
that the comparison sign should be >, not <.


Best regards, Andrey Borodin.


Re: GiST VACUUM

From
Michael Paquier
Date:
Heikki,

On Thu, Apr 04, 2019 at 06:15:21PM +0300, Heikki Linnakangas wrote:
> I think we should fix this in a similar manner in B-tree, too, but that can
> be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is stored
> in the page opaque field.

This has been an open item for a couple of weeks already.  Are you
planning to tackle that soon?
--
Michael

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
(Thanks for the reminder on this, Michael!)

On 05/04/2019 08:39, Andrey Borodin wrote:
>> On 4 Apr 2019, at 20:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I suggest that we do the attached. It fixes this for GiST. The
>> patch expands the "deletion XID" to 64 bits, and changes
>> where it's stored. Instead of storing it in pd_prune_xid, it's stored
>> in the page contents. Luckily, a deleted page has no real content.
> 
> So, we store full xid right after page header?

Yep.

> +static inline void
> +GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
> +{
> +    Assert(PageIsEmpty(page));
> +    ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
> +
> +    *((FullTransactionId *) PageGetContents(page)) = deletexid;
> +}
> 
> Usually we leave one ItemId (located at an invalid offset number)
> untouched. I do not know whether it is done for a reason or not...
No. Take a look at PageGetItemId() macro, it subtracts one from the 
offset number. But in any case, that's not really relevant here, because 
this patch stores the transaction id directly as the page content. There 
are no itemids at all on a deleted page.
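
Presumably the matching getter is just the mirror image of the setter quoted above (a sketch):

static inline FullTransactionId
GistPageGetDeleteXid(Page page)
{
    Assert(GistPageIsDeleted(page));
    return *((FullTransactionId *) PageGetContents(page));
}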

> Also, I did not understand this optimization:
> +    /*
> +     * We can skip this if the page was deleted so long ago, that no scan can possibly
> +     * still see it, even in a standby. One measure might be anything older than the
> +     * table's frozen-xid, but we don't have that at hand here. But anything older than
> +     * 2 billion, from the next XID, is surely old enough, because you would hit XID
> +     * wraparound at that point.
> +     */
> +    nextxid = ReadNextFullTransactionId();
> +    diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
> +    if (diff < 0x7fffffff)
> +        return;
> 
> Standby can be lagging months from primary, and, theoretically, close
> the gap in one sudden WAL leap...
It would still process the WAL one WAL record at a time, even if it's 
lagging months behind. It can't just jump over 2 billion XIDs.

> Also, I think, that comparison sign should be >, not <.

Ah, good catch! And it shows that this needs more testing..

- Heikki



Re: GiST VACUUM

From
Andrey Borodin
Date:
Hi!

Thanks for the clarification; now I understand these patches better.

> On 25 Jun 2019, at 13:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> Also, I did not understand this optimization:
>> +    /*
>> +     * We can skip this if the page was deleted so long ago, that no scan can possibly
>> +     * still see it, even in a standby. One measure might be anything older than the
>> +     * table's frozen-xid, but we don't have that at hand here. But anything older than
>> +     * 2 billion, from the next XID, is surely old enough, because you would hit XID
>> +     * wraparound at that point.
>> +     */
>> +    nextxid = ReadNextFullTransactionId();
>> +    diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
>> +    if (diff < 0x7fffffff)
>> +        return;
>> Standby can be lagging months from primary, and, theoretically, close
>> the gap in one sudden WAL leap...
> It would still process the WAL one WAL record at a time, even if it's lagging months behind. It can't just jump over
> 2 billion XIDs.
I feel a little uncomfortable with the number 0x7fffffff right in the code.

Thanks!

Best regards, Andrey Borodin.


Re: GiST VACUUM

From
Michael Paquier
Date:
On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> I feel a little uncomfortable with number 0x7fffffff right in code.

PG_INT32_MAX...
--
Michael

Attachments

Re: GiST VACUUM

From
Thomas Munro
Date:
On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> > I feel a little uncomfortable with number 0x7fffffff right in code.
>
> PG_INT32_MAX...

MaxTransactionId / 2?

-- 
Thomas Munro
https://enterprisedb.com



Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 26/06/2019 06:07, Thomas Munro wrote:
> On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier <michael@paquier.xyz> wrote:
>> On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
>>> I feel a little uncomfortable with number 0x7fffffff right in code.
>>
>> PG_INT32_MAX...
> 
> MaxTransactionId / 2?

Yeah, that makes sense.

Here's a new version of the patches. Changes:

* I changed the reuse-logging so that we always write a reuse WAL 
record, even if the deleteXid is very old. I tried to avoid that with 
the check for MaxTransactionId / 2 or 0x7fffffff, but it had some 
problems. In the previous patch version, it wasn't just an optimization. 
Without the check, we would write 32-bit XIDs to the log that had 
already wrapped around, and that caused the standby to unnecessarily 
wait for or kill backends. But checking for MaxTransactionId / 2 isn't 
quite enough: there was a small chance that the next XID counter 
advanced just after checking for that, so that we still logged an XID 
that had just wrapped around. A more robust way to deal with this is to 
log a full 64-bit XID, and check for wraparound at redo in the standby. 
And if we do that, trying to optimize this in the master doesn't seem 
that important anymore. So in this patch version, we always log the 
64-bit XID, and check against MaxTransactionId / 2 when replaying the WAL 
record instead.

* I moved the logic to extend a 32-bit XID to 64 bits to a new helper
function in varsup.c (a sketch of that widening follows below).

* Instead of storing just a naked FullTransactionId in the "page 
contents" of a deleted page, I created a new struct for that. The effect 
is the same, but I think the struct clarifies what's happening, and it's 
a good location to attach a comment explaining it.

* Fixed the mixup between < and >
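
A sketch of that 32-to-64-bit widening; the helper's actual name and signature in varsup.c are not given here, so these are assumptions:

static FullTransactionId
widen_to_full_xid(TransactionId xid, FullTransactionId nextFullXid)
{
    uint32 epoch = EpochFromFullTransactionId(nextFullXid);

    /* An XID numerically above the low 32 bits of the next full XID
     * cannot belong to the current epoch; it is from the previous one. */
    if (xid > XidFromFullTransactionId(nextFullXid))
        epoch--;

    return FullTransactionIdFromEpochAndXid(epoch, xid);
}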

I haven't done any testing on this yet. Andrey, would you happen to have 
an environment ready to test this?

- Heikki

Attachments

Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 27 Jun 2019, at 16:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I haven't done any testing on this yet. Andrey, would you happen to have an environment ready to test this?

Thanks!

I will do some testing this evening (UTC+5). But I'm not sure I can reliably test wraparound of xids...
I will try to break the code with the usual setup we used to stress vacuum with deletes and inserts.

Best regards, Andrey Borodin.


Re: GiST VACUUM

From
Andrey Borodin
Date:

> On 27 Jun 2019, at 16:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I haven't done any testing on this yet. Andrey, would you happen to have an environment ready to test this?

The patches do not deadlock, and they delete pages on the "rescan test" - the setup that we used to detect "left jumps"
during development of the physical vacuum. check-world is happy on my machine.
I really like that now there is GISTDeletedPageContents and we do not just cast *(FullTransactionId *)
PageGetContents(page).

But I have a stupid question again, about this code:


https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417

    nextFullXid = ReadNextFullTransactionId();
    diff = U64FromFullTransactionId(nextFullXid) -
        U64FromFullTransactionId(latestRemovedFullXid);
    if (diff < MaxTransactionId / 2)
    {
        TransactionId latestRemovedXid;

        // sleep(100500 hours); latestRemovedXid becomes xid from future

        latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
        ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
                                            xlrec->node);
    }

Do we have a race condition here? Can latestRemovedXid wrap around and become an xid from the future?
I understand that it is purely hypothetical, but latestRemovedXid is from the ancient past already.

Best regards, Andrey Borodin.


Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 27/06/2019 20:15, Andrey Borodin wrote:
> But I have a stupid question again, about this code:
> 
> https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417
> 
>     nextFullXid = ReadNextFullTransactionId();
>     diff = U64FromFullTransactionId(nextFullXid) -
>         U64FromFullTransactionId(latestRemovedFullXid);
>     if (diff < MaxTransactionId / 2)
>     {
>         TransactionId latestRemovedXid;
>
>         // sleep(100500 hours); latestRemovedXid becomes xid from future
>
>         latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
>         ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
>                                             xlrec->node);
>     }
> 
> Do we have a race condition here? Can latestRemovedXid wrap around and become an xid from the future?
> I understand that it is purely hypothetical, but latestRemovedXid is from the ancient past already.

Good question. No, that can't happen, because this code is in the WAL 
redo function. In a standby, the next XID counter only moves forward 
when a WAL record is replayed that advances it, and all WAL records are 
replayed serially, so that can't happen when we're in the middle of 
replaying this record. A comment on that would be good, though.

When I originally had the check like above in the code that created the 
WAL record, it had exactly that problem, because in the master the next 
XID counter can advance concurrently.
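
For example, something like this (just a sketch of the wording):

    /*
     * We can safely truncate latestRemovedFullXid to 32 bits here,
     * because we are in WAL redo: records are replayed serially, so the
     * next-XID counter cannot advance while we are inside this function.
     * The diff check above therefore reliably filters out XIDs too old
     * to be represented as 32-bit XIDs.
     */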

- Heikki



Re: GiST VACUUM

От
Thomas Munro
Дата:
On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> function in varsup.c.

I'm a bit uneasy about making this into a generally available function
for reuse.  I think we should instead come up with a very small number
of sources of fxids that known to be free of races because of some
well explained interlocking.

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids.  The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.
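
For reference, the conversion itself is short; roughly (based on the
zheap commit at [1], so treat the exact body as an assumption):

    FullTransactionId
    XLogRecGetFullXid(XLogReaderState *record)
    {
        TransactionId xid = XLogRecGetXid(record);
        FullTransactionId next_fxid = ReadNextFullTransactionId();
        TransactionId next_xid = XidFromFullTransactionId(next_fxid);
        uint32      epoch = EpochFromFullTransactionId(next_fxid);

        /*
         * Safe only in recovery: records are replayed serially, so the
         * next-fxid counter cannot advance under us.  If the record's
         * xid compares numerically greater than the next xid, it must
         * belong to the previous epoch.
         */
        if (xid > next_xid)
            epoch--;

        return FullTransactionIdFromEpochAndXid(epoch, xid);
    }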

I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation.  I was planning to think harder about that for 13.
Obviously that doesn't help you now.  So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?

[1] https://github.com/EnterpriseDB/zheap/commit/1203c2fa49f5f872f42ea4a02ccba2b191c45f32

-- 
Thomas Munro
https://enterprisedb.com



Re: GiST VACUUM

From
Peter Geoghegan
Date:
On Thu, Apr 4, 2019 at 8:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I think we should fix this in a similar manner in B-tree, too, but that
> can be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is
> stored in the page opaque field.

What Postgres versions will the B-Tree fix end up targeting? Sounds
like you plan to backpatch all the way?

-- 
Peter Geoghegan



Re: GiST VACUUM

From
Amit Kapila
Date:
On Fri, Jun 28, 2019 at 3:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> > function in varsup.c.
>
> I'm a bit uneasy about making this into a generally available function
> for reuse.  I think we should instead come up with a very small number
> of sources of fxids that are known to be free of races because of some
> well-explained interlocking.
>

I have two more cases in the undo patch series where the same function
is needed and it is safe to use it there.  The first place is
twophase.c, for rolling back prepared transactions, where we know that
we don't support aborted xacts that are older than 2 billion, so we can
rely on such a function.  We also need it in undodiscard.c to compute
the value of oldestFullXidHavingUnappliedUndo.  See the usage of
GetEpochForXid in the undo-processing patches.  Now, we might find a way
to avoid using it in one of these places by doing some more work, but
I'm not sure we can avoid it in all three places (one proposed by this
patch and two by the undo patchset).

> For example, I believe it is safe to convert an xid obtained from a
> WAL record during recovery, because (for now) recovery is a single
> thread of execution and the next fxid can't advance underneath you.
> So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
> to propose in another thread (though I've forgotten who wrote it,
> maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
> would have written) is a safe blessed source of fxids.  The rationale
> for writing this function instead of an earlier code that looked much
> like your proposed helper function, during EDB-internal review of
> zheap stuff, was this: if we provide a general purpose xid->fxid
> facility, it's virtually guaranteed that someone will eventually use
> it to shoot footwards, by acquiring an xid from somewhere, and then
> some arbitrary time later extending it to a fxid when no interlocking
> guarantees that the later conversion has the correct epoch.
>
> I'd like to figure out how to maintain full versions of the
> procarray.c-managed xid horizons, without widening the shared memory
> representation.  I was planning to think harder about that for 13.
> Obviously that doesn't help you now.  So I'm wondering if you should
> just open-code this for now, and put in a comment about why it's safe
> and a note that there'll hopefully be a fxid horizon available in a
> later release?
>

Do you suggest open-coding it in all three places for now?  I am not
against open-coding the logic for now, but I'm not sure we can eliminate
the need for it, because even if we can do what you are saying with the
procarray.c-managed xid horizons, I think we need to do something about
clog.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 28/06/2019 01:02, Thomas Munro wrote:
> On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
>> function in varsup.c.
> 
> I'm a bit uneasy about making this into a generally available function
> for reuse.  I think we should instead come up with a very small number
> of sources of fxids that are known to be free of races because of some
> well-explained interlocking.
> 
> For example, I believe it is safe to convert an xid obtained from a
> WAL record during recovery, because (for now) recovery is a single
> thread of execution and the next fxid can't advance underneath you.
> So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
> to propose in another thread (though I've forgotten who wrote it,
> maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
> would have written) is a safe blessed source of fxids.  The rationale
> for writing this function instead of an earlier code that looked much
> like your proposed helper function, during EDB-internal review of
> zheap stuff, was this: if we provide a general purpose xid->fxid
> facility, it's virtually guaranteed that someone will eventually use
> it to shoot footwards, by acquiring an xid from somewhere, and then
> some arbitrary time later extending it to a fxid when no interlocking
> guarantees that the later conversion has the correct epoch.

Fair point.

> I'd like to figure out how to maintain full versions of the
> procarray.c-managed xid horizons, without widening the shared memory
> representation.  I was planning to think harder about that for 13.
> Obviously that doesn't help you now.  So I'm wondering if you should
> just open-code this for now, and put in a comment about why it's safe
> and a note that there'll hopefully be a fxid horizon available in a
> later release?

I came up with the attached. Instead of having a general-purpose "widen" 
function, it adds GetFullRecentGlobalXmin(), to return a 64-bit version 
of RecentGlobalXmin. That's enough for this GiST vacuum patch.
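
Roughly like this (my reconstruction from the description above; the
committed version may differ in detail):

    FullTransactionId
    GetFullRecentGlobalXmin(void)
    {
        FullTransactionId next_fxid = ReadNextFullTransactionId();
        TransactionId next_xid = XidFromFullTransactionId(next_fxid);
        uint32      epoch = EpochFromFullTransactionId(next_fxid);

        Assert(TransactionIdIsNormal(RecentGlobalXmin));

        /*
         * RecentGlobalXmin lags the next XID by less than 2^31, so if it
         * compares numerically greater than the next xid, it belongs to
         * the previous epoch.
         */
        if (RecentGlobalXmin > next_xid)
            epoch--;

        return FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin);
    }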

In addition to that change, I modified the GistPageSetDeleted(), 
GistPageSetDeleteXid(), GistPageGetDeleteXid() inline functions a bit. I 
merged GistPageSetDeleted() and GistPageSetDeleteXid() into a single 
function, to make sure that the delete-XID is always set when a page is 
marked as deleted. And I modified GistPageGetDeleteXid() to return 
NormalTransactionId (or a FullTransactionId version of it, to be 
precise), for Gist pages that were deleted with older PostgreSQL v12 
beta versions. The previous patch tripped an assertion in that case, 
which is not nice for people binary-upgrading from earlier beta versions.
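
In other words, roughly the following (a sketch of the merged helpers,
reusing the GISTDeletedPageContents struct from upthread; not the
verbatim committed code):

    static inline void
    GistPageSetDeleted(Page page, FullTransactionId deletexid)
    {
        Assert(PageIsEmpty(page));

        /* Mark deleted and always record the delete-XID along with it. */
        GistPageGetOpaque(page)->flags |= F_DELETED;
        ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) +
            sizeof(GISTDeletedPageContents);
        ((GISTDeletedPageContents *) PageGetContents(page))->deleteXid =
            deletexid;
    }

    static inline FullTransactionId
    GistPageGetDeleteXid(Page page)
    {
        Assert(GistPageIsDeleted(page));

        /*
         * Pages deleted by earlier v12 beta versions did not store the
         * XID; report FirstNormalTransactionId for them instead of
         * tripping an assertion.
         */
        if (((PageHeader) page)->pd_lower >=
            MAXALIGN(SizeOfPageHeaderData) + sizeof(GISTDeletedPageContents))
            return ((GISTDeletedPageContents *) PageGetContents(page))->deleteXid;
        else
            return FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
    }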

I did some testing of this with XID wraparound, and it seems to work. I 
used the attached bash script for the testing, with a helper contrib 
module to consume XIDs faster. It's not very well commented and probably 
not too useful for anyone, but I'm attaching it here mainly as a note to 
future-self, if we need to revisit this.

Unless something comes up, I'll commit this tomorrow.

- Heikki

Attachments

Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 22/07/2019 16:09, Heikki Linnakangas wrote:
> Unless something comes up, I'll commit this tomorrow.

Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any 
volunteers to write a similar fix for B-trees?

- Heikki



Re: GiST VACUUM

From
Peter Geoghegan
Date:
On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Pushed this now, to master and REL_12_STABLE.
>
> Now, B-tree indexes still have the same problem, in all versions. Any
> volunteers to write a similar fix for B-trees?

I was hoping that you'd work on it.   :-)

Any reason to think that it's much different to what you've done with GiST?

-- 
Peter Geoghegan



Re: GiST VACUUM

From
Heikki Linnakangas
Date:
On 24/07/2019 21:02, Peter Geoghegan wrote:
> On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Pushed this now, to master and REL_12_STABLE.
>>
>> Now, B-tree indexes still have the same problem, in all versions. Any
>> volunteers to write a similar fix for B-trees?
> 
> I was hoping that you'd work on it.   :-)

That's probably how it's going to go, but hey, doesn't hurt to ask :-).

> Any reason to think that it's much different to what you've done with GiST?

No, it should be very similar.

- Heikki



Re: GiST VACUUM

From
Peter Geoghegan
Date:
On Wed, Jul 24, 2019 at 11:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> That's probably how it's going to go, but hey, doesn't hurt to ask :-).

I think that it would be fine to be conservative with nbtree, and only
target the master branch. The problem is annoying, certainly, but it's
not likely to make a huge difference for most real-world workloads.
OTOH, perhaps the risk is so low that we might as well target the
back branches.

How do you feel about it?

-- 
Peter Geoghegan