Thread: GiST insert algorithm rewrite

GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
Following the discussions here
(http://archives.postgresql.org/message-id/12375.1289429390@sss.pgh.pa.us),
here's a draft version of a rewrite of the GiST insertion algorithm and
WAL-logging. The goal is to get rid of the tracking of incomplete splits
and inserts in WAL replay, and the fixup activities at the end of recovery.

There are two key changes to the insert algorithm:

1. When we walk down the tree searching for a suitable leaf page to
insert the new tuple into, we update the nodes on the way down so that
they are consistent with the new key we're inserting. The old GiST
algorithm adjusted all the parent pages only after inserting the tuple
on the leaf. Updating them on the way down ensures that the tree is
self-consistent at all times, even if we crash just after inserting the
new tuple on the leaf page.
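
The difference can be sketched with 1-D ranges standing in for GiST bounding boxes (all names here are illustrative, not taken from the patch): on the descent, every downlink on the path is widened to cover the new key before the leaf is touched, so a crash right after the leaf insert still leaves a consistent tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a GiST key: a 1-D range instead of a bounding box. */
typedef struct { int lo, hi; } Key;

/* Enlarge *parent so it covers k; return true if it had to change.
 * On the way down, this is applied to every downlink on the path, so
 * each level already covers the new key before the leaf is modified. */
static bool adjust_key(Key *parent, Key k)
{
    bool changed = false;
    if (k.lo < parent->lo) { parent->lo = k.lo; changed = true; }
    if (k.hi > parent->hi) { parent->hi = k.hi; changed = true; }
    return changed;
}
```

A downlink that already covers the new key is left alone; only the levels that actually need widening are touched.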

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again
the purpose is to ensure that the tree is self-consistent at all times.
If we crash just after a page split, before the downlink is inserted,
scans will find the tuples on the right page by following the rightlink.
It's slightly less performant, but correct.
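
As a rough sketch of how such a flag behaves (the names and single-bit state are mine, not the patch's): the split sets the flag on the left page, and a scan that lands there keeps moving right while the flag is set, because the parent has no downlink to the new sibling yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: a flag, a rightlink, and a key count standing in for tuples. */
typedef struct Page {
    bool follow_right;       /* set: right sibling has no downlink yet */
    struct Page *rightlink;
    int nkeys;
} Page;

/* Split 'left': move half its keys to 'right', link it in, and flag
 * 'left' until the downlink for 'right' is inserted in the parent. */
static void split_page(Page *left, Page *right)
{
    right->nkeys = left->nkeys / 2;
    left->nkeys -= right->nkeys;
    right->rightlink = left->rightlink;
    right->follow_right = false;
    left->rightlink = right;
    left->follow_right = true;
}

/* A scan landing on 'page' must also visit right siblings while the
 * flag is set, since the parent has no downlink to them yet. */
static int scan_reachable_keys(Page *page)
{
    int total = page->nkeys;
    while (page->follow_right && page->rightlink != NULL) {
        page = page->rightlink;
        total += page->nkeys;
    }
    return total;
}
```

Once the downlink is inserted in the parent and the flag cleared, scans go back to visiting only the page the parent pointed them to.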

The WAL replay code has been simplified accordingly: we no longer need
any fixup code to finish incomplete splits or inserts, and we no longer
need the "invalid" pointers in the tree.

There are no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there are
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand them. So you should run vacuum to
ensure that there are no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you do anyway.

I haven't done much testing yet, although it passes the regression
tests. I'll do more testing, and crash testing, and I'll have to
re-review the patch myself before committing.

Comments?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachments

Re: GiST insert algorithm rewrite

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> There are two key changes to the insert algorithm:

> 1. When we walk down the tree searching for a suitable leaf page to 
> insert the new tuple to, we update the nodes on the way down so that 
> they are consistent with the new key we're inserting. The old GiST 
> algorithm adjusted all the parent pages only after inserting the tuple 
> on the leaf. Updating them on the way down ensures that the tree is 
> self-consistent at all times, even if we crash just after inserting the 
> new tuple on the leaf page.

> 2. When a page is split, we mark the new left page with a flag to 
> indicate that the downlink for the page to the right hasn't been 
> inserted yet. When the downlink is inserted, the flag is cleared. Again 
> the purpose is to ensure that the tree is self-consistent at all times. 
> If we crash just after a page split, before the downlink is inserted, 
> scans will find the tuples on the right page by following the rightlink. 
> It's slightly less performant, but correct.

The one thought that comes to mind is how does the flag business work
after multiple splittings?  That is, assume we have a page that has the
flag set because of a previous crash.  If we have to split either that
page or its right sibling, what do we do with the flags?  I think that
it can be made to work, so long as searches keep moving right as long
as the flag is set.  But this needs to be thought through, and
documented in the README file.  I'm particularly worried whether the
after-the-fact fixup that you mention in README might fail in such
scenarios.
        regards, tom lane


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 16.11.2010 20:01, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> 2. When a page is split, we mark the new left page with a flag to
>> indicate that the downlink for the page to the right hasn't been
>> inserted yet. When the downlink is inserted, the flag is cleared. Again
>> the purpose is to ensure that the tree is self-consistent at all times.
>> If we crash just after a page split, before the downlink is inserted,
>> scans will find the tuples on the right page by following the rightlink.
>> It's slightly less performant, but correct.
>
> The one thought that comes to mind is how does the flag business work
> after multiple splittings?  That is, assume we have a page that has the
> flag set because of a previous crash.  If we have to split either that
> page or its right sibling, what do we do with the flags?

As I mentioned in the README, the insertion algorithm finishes any 
incomplete splits it sees before proceeding. AFAICS that should ensure 
that the situation never arises where you try to split a page that 
already has the flag set. Or its right sibling; such a page can only be 
reached via the rightlink so you would see the page with the flag set first.
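
The invariant could be sketched like this (made-up names and one-bit state; the real code operates on buffers and parent pages): an inserter that wants to split a page first finishes any earlier incomplete split it finds there, so a flagged page is never split again.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state: the split-incomplete flag and a one-bit stand-in for
 * "the parent has a downlink for my right sibling". */
typedef struct {
    bool follow_right;      /* set: split incomplete, downlink missing */
    bool downlink_in_parent;
} Page;

static void finish_incomplete_split(Page *page)
{
    page->downlink_in_parent = true;  /* insert the missing downlink */
    page->follow_right = false;       /* then clear the flag */
}

/* Called before splitting 'page'; returns with the invariant restored. */
static void prepare_to_split(Page *page)
{
    if (page->follow_right)
        finish_incomplete_split(page);
    assert(!page->follow_right);      /* never split a flagged page */
}
```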

Hmm, there is one corner-case that I didn't think of before: One 
backend splits a leaf page, and another backend concurrently splits the 
parent of that leaf page. If for some reason the backend operating on 
the parent page dies, releasing the locks, the other backend will see 
the incomplete split when it walks up to insert the downlink. Although 
it should be possible to handle that, I think we can simply give up on 
inserting the downlink in that case, and leave that split incomplete as 
well. It's still correct, and the next insert that comes along will 
complete the splits, from top to bottom.

BTW, I don't try to fix incomplete splits during vacuum in the patch. 
That's perhaps a bit surprising, and probably would be easy to add, but 
I left it out for now as it's not strictly necessary.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> BTW, I don't try to fix incomplete splits during vacuum in the patch. That's
> perhaps a bit surprising, and probably would be easy to add, but I left it
> out for now as it's not strictly necessary.

Seems like it would be good to have this; otherwise, the split might
stay incomplete indefinitely?  Would that be bad?

If we start to enlarge the bounding boxes on the higher levels of the
tree and then crash before inserting the key, is there any mechanism
for getting them back down to the minimal size?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 16.11.2010 20:46, Robert Haas wrote:
> On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> BTW, I don't try to fix incomplete splits during vacuum in the patch. That's
>> perhaps a bit surprising, and probably would be easy to add, but I left it
>> out for now as it's not strictly necessary.
>
> Seems like it would be good to have this; otherwise, the split might
> stay incompletely indefinitely?  Would that be bad?

Nothing bad should happen. Scans that need to traverse the incompletely 
split page would just be marginally slower.

> If we start to enlarge the bounding boxes on the higher levels of the
> tree and then crash before inserting the key, is there any mechanism
> for getting them back down to the minimal size?

No. There's also no mechanism for trimming the bounding boxes if a tuple 
is deleted.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Tue, Nov 16, 2010 at 1:50 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> If we start to enlarge the bounding boxes on the higher levels of the
>> tree and then crash before inserting the key, is there any mechanism
>> for getting them back down to the minimal size?
>
> No. There's also no mechanism for trimming the bounding boxes if a tuple is
> deleted.

Oh.  So do the indexes just degrade over time until they eventually
need to be REINDEX'd?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST insert algorithm rewrite

From
Teodor Sigaev
Date:
> they are consistent with the new key we're inserting. The old GiST
> algorithm adjusted all the parent pages only after inserting the tuple
> on the leaf. Updating them on the way down ensures that the tree is
Hm, performance? While you traverse down to the leaf page, on each inner page 
you will need to call the unionFn/equalFn methods to decide whether or not to 
update the key on that inner page. Currently we stop doing that after the first 
positive result of equalFn while walking up. Next, when a child page splits, you 
need to update the parent twice: first while walking down, and again while 
walking up.

As I see it, you're trying to implement an algorithm very close to the 
original one, but it was rather slow:
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3




> self-consistent at all times, even if we crash just after inserting the
> new tuple on the leaf page.
>
> 2. When a page is split, we mark the new left page with a flag to
> indicate that the downlink for the page to the right hasn't been
> inserted yet. When the downlink is inserted, the flag is cleared. Again
Again, the new children are written twice (and there could be several of 
them, depending on the user-defined picksplit method).



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: GiST insert algorithm rewrite

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Oh.  So do the indexes just degrade over time until they eventually
> need to be REINDEX'd?

At some point you might reach a state where a reindex would be helpful.
But the same is true of btrees.  I don't think this is a serious
objection, at least not unless backed by evidence that the tree often
degrades rapidly.  Anyway fixing it would be material for a different
patch.
        regards, tom lane


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 16.11.2010 21:20, Teodor Sigaev wrote:
>> they are consistent with the new key we're inserting. The old GiST
>> algorithm adjusted all the parent pages only after inserting the tuple
>> on the leaf. Updating them on the way down ensures that the tree is
> Hm, performance? while you traverse to leaf page, on each inner page you
> will need to call unionFn/equalFn methods to decide update or not key on
> inner page. Now we stops do that after first positive result of equalFn
> while walking up. Next, when child page splits then you need to update
> parent twice - first time while walk down, and second while walk up.

Hmm, will have to do some benchmarking on that. I'm using the Consistent 
function when walking down to check if the downlink needs to be updated, 
and assumed that it would be insignificant compared to the cost of 
calling Penalty on all the keys on the page.

> As I see, you try to implement algorithm very close to original, but it
> was rather slow.
> http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3

Thanks for that, I'll have to read that through as well.

>> self-consistent at all times, even if we crash just after inserting the
>> new tuple on the leaf page.
>>
>> 2. When a page is split, we mark the new left page with a flag to
>> indicate that the downlink for the page to the right hasn't been
>> inserted yet. When the downlink is inserted, the flag is cleared. Again
> Again, twice write of new children (it could be several because of
> implementation of user-defined picksplit method).

There should be no difference in performance here AFAICS. The children 
need to be updated a second time to clear the flag, but we don't release 
the locks on them in the middle, and we're only talking about setting a 
single flag, so it should make no difference.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Tue, Nov 16, 2010 at 3:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Oh.  So do the indexes just degrade over time until they eventually
>> need to be REINDEX'd?
>
> At some point you might reach a state where a reindex would be helpful.
> But the same is true of btrees.  I don't think this is a serious
> objection, at least not unless backed by evidence that the tree often
> degrades rapidly.  Anyway fixing it would be material for a different
> patch.

Oh, I agree it's not for this patch to fix it, if it's already that
way.  I was just curious.  I think index maintenance is going to be a
problem we have to devote some cycles to down the road, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST insert algorithm rewrite

From
Teodor Sigaev
Date:
> Hmm, will have to do some benchmarking on that. I'm using the Consistent
> function when walking down to check if the downlink needs to be updated,
> and assumed that it would be insignificant compared to the cost of
> calling Penalty on all the keys on the page.
Why Consistent?! That's impossible: you don't know the right strategy number, 
and an index whose storage type differs from the type it's built over might not 
accept the same type as the query. An index over tsvector is an example.

> There should be no difference in performance here AFAICS. The children
> need to be updated a second time to clear the flag, but we don't release
> the locks on them in the middle, and we're only talking about setting a
> single flag, so it should make no difference.

Agree with that
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: GiST insert algorithm rewrite

From
Teodor Sigaev
Date:
Sorry, I missed the beginning of the discussion on GiST, so I read it in the 
web mail archive.

You wrote:
http://archives.postgresql.org/pgsql-hackers/2010-11/msg00939.php
[skip]
0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and new 
buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original downlink is 
updated to reflect only the keys that stayed on the left page). While keeping 
the child pages locked, the NSN field on the children are updated with the new 
LSN of the parent page.
...
The scan checks that by comparing the LSN it saw on the parent page with the NSN 
on the child page. If parent LSN < NSN, we saw the parent before the downlink 
was inserted.
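
The NSN test in that last paragraph amounts to something like this (a sketch; in PostgreSQL both values are XLogRecPtr fields read from page headers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* uint64_t stands in for PostgreSQL's XLogRecPtr. */
typedef uint64_t Lsn;

/* The scan remembered 'parent_lsn' when it read the downlink.  If the
 * child's NSN is newer, the child was split after the parent was read,
 * so the scan must also follow the child's rightlink to find tuples
 * that moved to the right half. */
static bool must_follow_rightlink(Lsn parent_lsn, Lsn child_nsn)
{
    return parent_lsn < child_nsn;
}
```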

Now, the problem with crash recovery is that the above algorithm depends on the 
split to keep the parent and child locked until the downlink is inserted in the 
parent. If you crash between steps 2 and 3, the locks are gone. If a later 
insert then updates the parent page, because of a split on some unrelated child 
page, that will bump the LSN of the parent above the NSN on the child. Scans 
will see that the parent LSN > child NSN, and will no longer follow the rightlink.
[skip]


I disagree with that opinion: if we crash between 2 and 3, why would somebody 
update the parent before WAL replay? In this case the WAL replay process should 
complete the child split by inserting an "invalid" pointer, and the tree becomes 
correct again, although the "invalid" pointers still need repairing. The same 
situation exists with b-tree: WAL replay repairs incomplete splits before any 
other processing.

Or do I miss something important?



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 17.11.2010 19:46, Teodor Sigaev wrote:
> I disagree with that opinion: if we crash between 2 and 3 then why will
> somebody update parent before WAL replay? WAL replay process in this
> case should complete child split by inserting "invalid" pointer and tree
> become correct again, although it needs to repair "invalid" pointers.
> The same situation with b-tree: WAL replay repairs incomplete split
> before any other processing.
>
> Or do I miss something important?

Yeah, see the thread that started this:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg00052.php
http://archives.postgresql.org/message-id/12375.1289429390@sss.pgh.pa.us

The code currently relies on the end-of-recovery processing to finish 
the incomplete splits, but I'm trying to get rid of that end-of-recovery 
processing altogether.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 17.11.2010 19:36, Teodor Sigaev wrote:
>> Hmm, will have to do some benchmarking on that. I'm using the Consistent
>> function when walking down to check if the downlink needs to be updated,
>> and assumed that it would be insignificant compared to the cost of
>> calling Penalty on all the keys on the page.
> Why consistent?! It's impossible - you don't know right strategy number,
> index with storage type/over type could do not accept the same type as
> query. Index over tsvector is an example.

Sorry, I was confused. I'm calling the gistgetadjusted() function, which
uses the Union function. I.e. I'm doing the same thing we did before when
propagating the changes up the tree; I'm just doing it on the way down
instead.
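
With 1-D ranges standing in for opclass keys, the gistgetadjusted() idea is roughly the following (my simplification, not the actual signature): union the existing downlink key with the new key, and update the downlink only if the union differs.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy opclass key: a 1-D range instead of a real GiST key. */
typedef struct { int lo, hi; } Key;

static Key key_union(Key a, Key b)
{
    Key u = { a.lo < b.lo ? a.lo : b.lo,
              a.hi > b.hi ? a.hi : b.hi };
    return u;
}

/* Return true (and set *adjusted) if the downlink needs replacing;
 * false plays the role of gistgetadjusted() returning NULL, i.e. the
 * existing downlink already covers the new key. */
static bool get_adjusted(Key oldkey, Key addkey, Key *adjusted)
{
    *adjusted = key_union(oldkey, addkey);
    return adjusted->lo != oldkey.lo || adjusted->hi != oldkey.hi;
}
```

This is why the on-the-way-down check can be cheap: most downlinks already cover the new key and are left untouched.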

I ran some quick performance tests on my laptop, and couldn't see any
measurable difference with the patch. So I think we're good on
performance. I used the attached scripts, with \timing.

Have you had a chance to look at the patch yet? I'm hesitant to commit
before you take a look at it, though I still have to proofread it myself
as well.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachments

Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 18.11.2010 14:58, Heikki Linnakangas wrote:
> On 17.11.2010 19:36, Teodor Sigaev wrote:
>>> Hmm, will have to do some benchmarking on that. I'm using the Consistent
>>> function when walking down to check if the downlink needs to be updated,
>>> and assumed that it would be insignificant compared to the cost of
>>> calling Penalty on all the keys on the page.
>> Why consistent?! It's impossible - you don't know right strategy number,
>> index with storage type/over type could do not accept the same type as
>> query. Index over tsvector is an example.
>
> Sorry, I was confused. I'm calling the gistgetadjusted() function, which
> uses the Union function. Ie. I'm doing the same we did before when
> propagating the changes up the tree. I'm just doing it on the way down
> instead.
>
> I ran some quick performance tests on my laptop, and couldn't see any
> measurable difference with the patch. So I think we're good on
> performance. I used the attached scripts, with \timing.
>
> Have you had a chance to look at the patch yet? I'm hesitant to commit
> before you take a look at it, though I still have to proofread it myself
> as well.

Here's an updated version with some minor fixes. I'd appreciate review,
as well as pointers to good test cases for this.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachments

Re: GiST insert algorithm rewrite

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> There's no on-disk format changes, except for the additional flag in the 
> page headers, so this does not affect pg_upgrade. However, if there's 
> any "invalid" keys in the old index because of an incomplete insertion, 
> the new code will not understand that. So you should run vacuum to 
> ensure that there's no such invalid keys in the index before upgrading. 
> Vacuum will print a message in the log if it finds any, and you will 
> have to reindex. But that's what it suggests you to do anyway.

OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4.  Is this something we would do for
9.0 to 9.1?

You are saying it would have to be run before the upgrade.  Can it not
be run after?

I can output a script to VACUUM all such indexes, and tell users to
manually REINDEX any index that generates a warning message.  I don't
have any way to automate an optional REINDEX step.


--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + It's impossible for everything to be true. +


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 27.11.2010 21:31, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> There's no on-disk format changes, except for the additional flag in the
>> page headers, so this does not affect pg_upgrade. However, if there's
>> any "invalid" keys in the old index because of an incomplete insertion,
>> the new code will not understand that. So you should run vacuum to
>> ensure that there's no such invalid keys in the index before upgrading.
>> Vacuum will print a message in the log if it finds any, and you will
>> have to reindex. But that's what it suggests you to do anyway.
>
> OK, pg_upgrade has code to report invalid gin and hash indexes because
> of changes between PG 8.3 and 8.4.  Is this something we would do for
> 9.0 to 9.1?

9.1. The problem that started this whole thing is there in older 
versions, but given the lack of real-life reports and the scale of the 
changes required it doesn't seem wise to backport.

> You are saying it would have to be run before the upgrade.  Can it not
> be run after?

After would work too.

> I can output a script to VACUUM all such indexes, and tell users to
> manually REINDEX any index that generates a warning messasge.  I don't
> have any way to automate an optional REINDEX step.

That seems good enough.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 30.11.2010 11:55, Heikki Linnakangas wrote:
> On 27.11.2010 21:31, Bruce Momjian wrote:
>> Heikki Linnakangas wrote:
>>> There's no on-disk format changes, except for the additional flag in the
>>> page headers, so this does not affect pg_upgrade. However, if there's
>>> any "invalid" keys in the old index because of an incomplete insertion,
>>> the new code will not understand that. So you should run vacuum to
>>> ensure that there's no such invalid keys in the index before upgrading.
>>> Vacuum will print a message in the log if it finds any, and you will
>>> have to reindex. But that's what it suggests you to do anyway.
>>
>> OK, pg_upgrade has code to report invalid gin and hash indexes because
>> of changes between PG 8.3 and 8.4. Is this something we would do for
>> 9.0 to 9.1?
>
> 9.1. The problem that started this whole thing is there in older
> versions, but given the lack of real-life reports and the scale of the
> changes required it doesn't seem wise to backport.

Oh sorry, I read your question as "9.0 *or* 9.1".

Only GiST indexes that have any "invalid" tuples in them, as a result 
of a crash, need to be reindexed. That's very rare in practice, so we 
shouldn't invalidate all GiST indexes. I don't think there's any simple 
way to check whether reindex is required, so I think we have to just 
document this.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 30.11.2010 11:55, Heikki Linnakangas wrote:
>>
>> On 27.11.2010 21:31, Bruce Momjian wrote:
>>>
>>> Heikki Linnakangas wrote:
>>>>
>>>> There's no on-disk format changes, except for the additional flag in the
>>>> page headers, so this does not affect pg_upgrade. However, if there's
>>>> any "invalid" keys in the old index because of an incomplete insertion,
>>>> the new code will not understand that. So you should run vacuum to
>>>> ensure that there's no such invalid keys in the index before upgrading.
>>>> Vacuum will print a message in the log if it finds any, and you will
>>>> have to reindex. But that's what it suggests you to do anyway.
>>>
>>> OK, pg_upgrade has code to report invalid gin and hash indexes because
>>> of changes between PG 8.3 and 8.4. Is this something we would do for
>>> 9.0 to 9.1?
>>
>> 9.1. The problem that started this whole thing is there in older
>> versions, but given the lack of real-life reports and the scale of the
>> changes required it doesn't seem wise to backport.
>
> Oh sorry, I read your question as "9.0 *or* 9.1".
>
> Only GiST indexes that have any "invalid" tuples in them n, as a result of a
> crash, need to be reindexed. That's very rare in practice, so we shouldn't
> invalidate all GiST indexes. I don't think there's any simple way to check
> whether reindex is required, so I think we have to just document this.

It seems odd to say, the indexes are corrupted, but they're probably
not, so let's not worry about it.

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 30.11.2010 16:23, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 30.11.2010 11:55, Heikki Linnakangas wrote:
>>>
>>> On 27.11.2010 21:31, Bruce Momjian wrote:
>>>>
>>>> Heikki Linnakangas wrote:
>>>>>
>>>>> There's no on-disk format changes, except for the additional flag in the
>>>>> page headers, so this does not affect pg_upgrade. However, if there's
>>>>> any "invalid" keys in the old index because of an incomplete insertion,
>>>>> the new code will not understand that. So you should run vacuum to
>>>>> ensure that there's no such invalid keys in the index before upgrading.
>>>>> Vacuum will print a message in the log if it finds any, and you will
>>>>> have to reindex. But that's what it suggests you to do anyway.
>>>>
>>>> OK, pg_upgrade has code to report invalid gin and hash indexes because
>>>> of changes between PG 8.3 and 8.4. Is this something we would do for
>>>> 9.0 to 9.1?
>>>
>>> 9.1. The problem that started this whole thing is there in older
>>> versions, but given the lack of real-life reports and the scale of the
>>> changes required it doesn't seem wise to backport.
>>
>> Oh sorry, I read your question as "9.0 *or* 9.1".
>>
>> Only GiST indexes that have any "invalid" tuples in them n, as a result of a
>> crash, need to be reindexed. That's very rare in practice, so we shouldn't
>> invalidate all GiST indexes. I don't think there's any simple way to check
>> whether reindex is required, so I think we have to just document this.
>
> It seems odd to say, the indexes are corrupted, but they're probably
> not, so let's not worry about it.
>
> I assume there's no way to make the new code cope with any
> pre-existing corruption?
>
> Does the current code cope with the corruption?

It's not corruption, but "intended degradation". Yes, the current code 
copes with it, that's how GiST survives a crash. However, even with the 
current code, VACUUM will nag if it finds any invalid tuples with this 
message:

ereport(NOTICE,
        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery",

That's harmless, in the sense that all scans and inserts work fine, but 
scans might need to do more work than if the invalid tuple wasn't there.

I don't think we need to go out of our way to support such degraded 
indexes in 9.1. If you see such notices in your logs, you should REINDEX 
anyway, before or after pg_upgrade. Let's just make sure that you get a 
reasonable error message in 9.1 if a scan or insert encounters such a tuple.

There is a section on this in the docs, BTW: 
http://www.postgresql.org/docs/9.0/static/gist-recovery.html

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> > Does the current code cope with the corruption?
> 
> It's not corruption, but "intended degradation". Yes, the current code 
> copes with it, that's how GiST survives a crash. However, even with the 
> current code, VACUUM will nag if it finds any invalid tuples with this 
> message:
> 
> ereport(NOTICE,
>     (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash 
> recovery",
> 
> That's harmless, in the sense that all scans and inserts work fine, but 
> scans might need to do more work than if the invalid tuple wasn't there.
> 
> I don't think we need to go out of our way to support such degraded 
> indexes in 9.1. If you see such notices in your logs, you should REINDEX 
> anyway, before of after pg_upgrade. Let's just make sure that you get a 
> reasonable error message in 9.1 if a scan or insert encounters such a tuple.
> 
> There is a section on this in the docs, BTW: 
> http://www.postgresql.org/docs/9.0/static/gist-recovery.html

OK, administrators will be prompted during normal operation --- seems
there is nothing extra for pg_upgrade to do here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + It's impossible for everything to be true. +


Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Does the current code cope with the corruption?
>
> It's not corruption, but "intended degradation". Yes, the current code copes
> with it, that's how GiST survives a crash. However, even with the current
> code, VACUUM will nag if it finds any invalid tuples with this message:
>
> ereport(NOTICE,
>        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
> recovery",
>
> That's harmless, in the sense that all scans and inserts work fine, but
> scans might need to do more work than if the invalid tuple wasn't there.
>
> I don't think we need to go out of our way to support such degraded indexes
> in 9.1. If you see such notices in your logs, you should REINDEX anyway,
> before of after pg_upgrade. Let's just make sure that you get a reasonable
> error message in 9.1 if a scan or insert encounters such a tuple.

I just don't want to take a risk of giving people unexpected wrong
answers.  It's not clear to me whether that's a risk here or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 01.12.2010 04:10, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>>> Does the current code cope with the corruption?
>>
>> It's not corruption, but "intended degradation". Yes, the current code copes
>> with it, that's how GiST survives a crash. However, even with the current
>> code, VACUUM will nag if it finds any invalid tuples with this message:
>>
>> ereport(NOTICE,
>>         (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
>> recovery",
>>
>> That's harmless, in the sense that all scans and inserts work fine, but
>> scans might need to do more work than if the invalid tuple wasn't there.
>>
>> I don't think we need to go out of our way to support such degraded indexes
>> in 9.1. If you see such notices in your logs, you should REINDEX anyway,
>> before of after pg_upgrade. Let's just make sure that you get a reasonable
>> error message in 9.1 if a scan or insert encounters such a tuple.
>
> I just don't want to take a risk of giving people unexpected wrong
> answers.  It's not clear to me whether that's a risk here or not.

You'll get an error if a scan encounters an invalid tuple.

In the patch I posted, I just ripped out everything related to invalid 
tuples altogether. But we should add a check and ereport for that before 
commit.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Wed, Dec 1, 2010 at 4:00 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 01.12.2010 04:10, Robert Haas wrote:
>>
>> On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>>
>>>> Does the current code cope with the corruption?
>>>
>>> It's not corruption, but "intended degradation". Yes, the current code
>>> copes
>>> with it, that's how GiST survives a crash. However, even with the current
>>> code, VACUUM will nag if it finds any invalid tuples with this message:
>>>
>>> ereport(NOTICE,
>>>        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
>>> recovery",
>>>
>>> That's harmless, in the sense that all scans and inserts work fine, but
>>> scans might need to do more work than if the invalid tuple wasn't there.
>>>
>>> I don't think we need to go out of our way to support such degraded
>>> indexes
>>> in 9.1. If you see such notices in your logs, you should REINDEX anyway,
>>> before or after pg_upgrade. Let's just make sure that you get a
>>> reasonable
>>> error message in 9.1 if a scan or insert encounters such a tuple.
>>
>> I just don't want to take a risk of giving people unexpected wrong
>> answers.  It's not clear to me whether that's a risk here or not.
>
> You'll get an error if a scan encounters an invalid tuple.
>
> In the patch I posted, I just ripped out everything related to invalid
> tuples altogether. But we should add a check and ereport for that before
> commit.

All right, that seems like a reasonable backstop, if we're fairly sure
this won't be a common scenario.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 01.12.2010 11:00, Heikki Linnakangas wrote:
> On 01.12.2010 04:10, Robert Haas wrote:
>> On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>>> Does the current code cope with the corruption?
>>>
>>> It's not corruption, but "intended degradation". Yes, the current
>>> code copes
>>> with it, that's how GiST survives a crash. However, even with the
>>> current
>>> code, VACUUM will nag if it finds any invalid tuples with this message:
>>>
>>> ereport(NOTICE,
>>> (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
>>> recovery",
>>>
>>> That's harmless, in the sense that all scans and inserts work fine, but
>>> scans might need to do more work than if the invalid tuple wasn't there.
>>>
>>> I don't think we need to go out of our way to support such degraded
>>> indexes
>>> in 9.1. If you see such notices in your logs, you should REINDEX anyway,
>>> before or after pg_upgrade. Let's just make sure that you get a
>>> reasonable
>>> error message in 9.1 if a scan or insert encounters such a tuple.
>>
>> I just don't want to take a risk of giving people unexpected wrong
>> answers. It's not clear to me whether that's a risk here or not.
>
> You'll get an error if a scan encounters an invalid tuple.
>
> In the patch I posted, I just ripped out everything related to invalid
> tuples altogether. But we should add a check and ereport for that before
> commit.

Here's an updated patch.

On closer look, supporting the invalid tuples in scans was trivial, so I
kept that after all. So you can still query an index with invalid
tuples. If an insert encounters one, you get an error, and VACUUM emits
a LOG message on any such tuples.

There's one bug remaining that I found during testing. If you crash,
leaving an incomplete split behind, and then vacuum the table removing
all the aborted tuples from the pages, it's possible that you end up
with a completely empty page that has no downlink yet. The code to
complete incomplete splits doesn't cope with that at the moment - it
doesn't know how to construct a parent key for a child that has no tuples.

The nicest way to handle that would be to recycle the empty page instead
of trying to finish the page split, but I think there might be a race
condition there if the page gets quickly reused while a scan is just
about to visit it through the rightlink. GiST doesn't seem to ever reuse
pages in normal operation, which conveniently avoids that problem.
Simply abandoning the page forever is certainly one way to handle it; it
shouldn't happen that often.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachments

Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Dec 3, 2010, at 4:54 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Here's an updated patch.

How carefully have you perf-tested this?

> On closer look, supporting the invalid tuples in scans was trivial, so I
> kept that after all. So you can still query an index with invalid tuples.
> If an insert encounters one, you get an error, and VACUUM emits a LOG
> message on any such tuples.

Cool.

> There's one bug remaining that I found during testing. If you crash,
> leaving an incomplete split behind, and then vacuum the table removing
> all the aborted tuples from the pages, it's possible that you end up
> with a completely empty page that has no downlink yet. The code to
> complete incomplete splits doesn't cope with that at the moment - it
> doesn't know how to construct a parent key for a child that has no tuples.

I think we can live with this.


...Robert

Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 03.12.2010 23:54, Heikki Linnakangas wrote:
> There's one bug remaining that I found during testing. If you crash,
> leaving an incomplete split behind, and then vacuum the table removing
> all the aborted tuples from the pages, it's possible that you end up
> with a completely empty page that has no downlink yet. The code to
> complete incomplete splits doesn't cope with that at the moment - it
> doesn't know how to construct a parent key for a child that has no tuples.
>
> The nicest way to handle that would be to recycle the empty page instead
> of trying to finish the page split, but I think there might be a race
> condition there if the page gets quickly reused while a scan is just
> about to visit it through the rightlink. GiST doesn't seem to ever reuse
> pages in normal operation, which conveniently avoids that problem.
> Simply abandoning the page forever is certainly one way to handle it, it
> shouldn't happen that often.

I fixed that by simply abandoning pages. That seems acceptable, given
that you shouldn't crash with incomplete splits that often. GiST never
tries to reuse empty pages anyway, so some leakage at a crash seems like
the least of our worries on that front.

I also added a check for the F_FOLLOW_RIGHT flag in the GiST scan code.
Tom Lane pointed out offlist that it was missing earlier.

I realized that the way I was setting the NSN and clearing the flag on
child pages, when the downlink is inserted to the parent, was not safe.
We need to update the LSN and take a full-page image per the usual rules.

But that creates a new problem: There's a maximum of three backup blocks
per WAL record, but a GiST page can be split into any number of child
pages as one operation. You might run out of backup block slots.

Attached is an updated patch, but that issue with limited number of
backup blocks needs to be resolved. The straightforward way would be to
change the WAL format to increase the limit. Another option is to
refactor the GiST insertion code some more, to insert the downlink
pointers to the parent one-by-one, instead of as one big operation, when
a page is split into more than two halves.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachments

Re: GiST insert algorithm rewrite

From
Robert Haas
Date:
On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Attached is an updated patch, but that issue with limited number of backup
> blocks needs to be resolved. The straightforward way would be to change the
> WAL format to increase the limit.

Eh, is that going to bloat WAL?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 13.12.2010 15:04, Robert Haas wrote:
> On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> Attached is an updated patch, but that issue with limited number of backup
>> blocks needs to be resolved. The straightforward way would be to change the
>> WAL format to increase the limit.
>
> Eh, is that going to bloat WAL?

Nah. The way a split now works is:

1. Split the page. Write a WAL record with the contents of the page halves.

2. Insert the downlink pointers in the parent, and set the NSN and clear 
F_FOLLOW_RIGHT flags on the child pages. A 2nd WAL record is written for 
this.

In this new patch version, at step 2, the 2nd WAL record updates the 
LSNs and takes full-page-images of the child pages if necessary. 
Previous patches overlooked that. Usually a full page image won't be 
necessary, because we just wrote the page-split WAL record at step 1 for 
those pages. It's only needed if a checkpoint started in the small 
window between steps 1 and 2.

So this should have no effect on performance, but it is necessary for 
correctness.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Increasing max # of backup blocks (was Re: GiST insert algorithm rewrite)

From
Heikki Linnakangas
Date:
On 13.12.2010 15:20, Heikki Linnakangas wrote:
> On 13.12.2010 15:04, Robert Haas wrote:
>> On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> Attached is an updated patch, but that issue with limited number of
>>> backup
>>> blocks needs to be resolved. The straightforward way would be to
>>> change the
>>> WAL format to increase the limit.
>>
>> Eh, is that going to bloat WAL?
>
> Nah.

Or did you mean, if changing the WAL format to raise the limit would 
bloat WAL? Depends on how it's done, of course. Assuming MAXALIGN=4, 
there's 2 wasted bytes in XLogRecord at the moment that we could take 
into use without making the header bigger.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> But that creates a new problem: There's a maximum of three backup blocks 
> per WAL record, but a GiST page can be split into any number of child 
> pages as one operation. You might run out of backup block slots.

> Attached is an updated patch, but that issue with limited number of 
> backup blocks needs to be resolved. The straightforward way would be to 
> change the WAL format to increase the limit.

I don't think you can fix it that way.  If a page can be split into any
number of child pages, then no fixed number of pages in a WAL record
will be enough.  Even if you were willing to suppose that ~16 would be
enough, it would be a bad idea because of the extra overhead added into
XLogInsert, which'd be paid by *every* WAL insert operation.

I think you need to refactor the operation so that there's one WAL
record per child page, or something along that line.  I concede this
might be difficult :-(
        regards, tom lane


Re: GiST insert algorithm rewrite

From
Greg Stark
Date:
On Mon, Dec 13, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you need to refactor the operation so that there's one WAL
> record per child page, or something along that line.  I concede this
> might be difficult :-(

If it's only the backup blocks that matter, couldn't you generate no-op
WAL records with just the full page images in them? Once all those are
generated, generate the actual split operation; since all the pages have
been written to WAL since the last checkpoint, they won't need any
backup block slots.

This would require suppressing any checkpoints between writing the
first backup block and the final operation record. That might be
pretty hard to do cleanly.


--
greg


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 13.12.2010 19:19, Greg Stark wrote:
> On Mon, Dec 13, 2010 at 3:14 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> I think you need to refactor the operation so that there's one WAL
>> record per child page, or something along that line.  I concede this
>> might be difficult :-(
>
> If it's only the backup blocks that matter couldn't you generate noop
> WAL records with just the full page image in them. Once all those are
> generated then generate the actual split operation and since all the
> pages have been written to wal since the last checkpoint they won't
> need any backup block slots.
>
> This would require suppressing any checkpoints between writing the
> first backup block and the final operation record. That might be
> pretty hard to do cleanly.

That would work, but it brings us back to square one 
(http://archives.postgresql.org/message-id/4CCFEE61.2090702@enterprisedb.com). 
It's not necessarily a bad idea; a capability to hold off checkpoints 
might be the easiest way to do this, and other things in the future.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 13.12.2010 19:19, Greg Stark wrote:
>> If it's only the backup blocks that matter couldn't you generate noop
>> WAL records with just the full page image in them. Once all those are
>> generated then generate the actual split operation and since all the
>> pages have been written to wal since the last checkpoint they won't
>> need any backup block slots.
>> 
>> This would require suppressing any checkpoints between writing the
>> first backup block and the final operation record. That might be
>> pretty hard to do cleanly.

> That would work, but it brings us back to square one 

Yeah.  Wouldn't the original page-split record have been carrying full
page images already?  (And if so, why didn't we have this problem in the
previous implementation?)
        regards, tom lane


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 13.12.2010 19:48, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 13.12.2010 19:19, Greg Stark wrote:
>>> If it's only the backup blocks that matter couldn't you generate noop
>>> WAL records with just the full page image in them. Once all those are
>>> generated then generate the actual split operation and since all the
>>> pages have been written to wal since the last checkpoint they won't
>>> need any backup block slots.
>>>
>>> This would require suppressing any checkpoints between writing the
>>> first backup block and the final operation record. That might be
>>> pretty hard to do cleanly.
>
>> That would work, but it brings us back to square one
>
> Yeah.  Wouldn't the original page-split record have been carrying full
> page images already?

Yes.

BTW, the original split record doesn't run into the limit because it 
doesn't use the backup-block mechanism; it contains all the tuples for 
all the pages in the main payload.

>  (And if so, why didn't we have this problem in the
> previous implementation?)

In the previous implementation, the NSN was updated immediately in the 
page split record, and there was no follow-right flag to clear. So the 
child pages didn't need to be updated when the downlinks are inserted.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 13.12.2010 19:48, Tom Lane wrote:
>> Yeah.  Wouldn't the original page-split record have been carrying full
>> page images already?

> Yes.

> BTW, the original split record doesn't run into the limit because it 
> doesn't use the backup-block mechanism, it contains all the tuples for 
> all the pages in the main payload.

I see.

>> (And if so, why didn't we have this problem in the
>> previous implementation?)

> In the previous implementation, the NSN was updated immediately in the 
> page split record, and there was no follow-right flag to clear. So the 
> child pages didn't need to be updated when the downlinks are inserted.

Can we fix it so that each child page is updated, and its downlink
inserted, as a separate atomic action?  That'd require each intermediate
state to be consistent and crash-safe, but I think you really need the
intermediate states to be consistent anyway because of concurrent scans.
        regards, tom lane


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 13.12.2010 20:30, Tom Lane wrote:
> Can we fix it so that each child page is updated, and its downlink
> inserted, as a separate atomic action?  That'd require each intermediate
> state to be consistent and crash-safe, but I think you really need the
> intermediate states to be consistent anyway because of concurrent scans.

Yes, all the intermediate states are consistent. I'm looking at that 
approach as we speak. The logic to track what we've done and what needs 
to be done as the changes are propagated gets quite hairy, but in 
principle it should work.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 13.12.2010 20:30, Tom Lane wrote:
> Can we fix it so that each child page is updated, and its downlink
> inserted, as a separate atomic action?  That'd require each intermediate
> state to be consistent and crash-safe, but I think you really need the
> intermediate states to be consistent anyway because of concurrent scans.

Here's an updated patch, using that idea. If a page splits into more than
two pages, the downlinks for the pages are inserted to the parent
one-by-one, right-to-left, until there's only two remaining. Finally the
downlink for the last remaining right page is inserted and the downlink
for the original page is updated as one atomic operation.

It was a pretty big rewrite again, but seems to work now. I tested
splits that span more than two pages by rigging the btree_gist picksplit
function to choose very bad split points, and the fix-split logic by
adding elog(ERROR) in strategic places to sometimes leave splits incomplete.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachments

Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 16.12.2010 15:52, Heikki Linnakangas wrote:
> On 13.12.2010 20:30, Tom Lane wrote:
>> Can we fix it so that each child page is updated, and its downlink
>> inserted, as a separate atomic action? That'd require each intermediate
>> state to be consistent and crash-safe, but I think you really need the
>> intermediate states to be consistent anyway because of concurrent scans.
>
> Here's an updated patch, using that idea. If a page splits into more than
> two pages, the downlinks for the pages are inserted to the parent
> one-by-one, right-to-left, until there's only two remaining. Finally the
> downlink for the last remaining right page is inserted and the downlink
> for the original page is updated as one atomic operation.
>
> It was a pretty big rewrite again, but seems to work now. I tested
> splits that span more than two pages by rigging the btree_gist picksplit
> function to choose very bad split points, and the fix-split logic by
> adding elog(ERROR) in strategic places to sometimes leave splits
> incomplete.

One final version, with a bug fix wrt. root page split and some cleanup.
I'm planning to commit this before Christmas. It's a big patch, so
review would be much appreciated.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachments

Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 21.12.2010 20:00, Heikki Linnakangas wrote:
> One final version, with a bug fix wrt. root page split and some cleanup.
> I'm planning to commit this before Christmas. It's a big patch, so
> review would be much appreciated.

Committed. Phew! Review & testing is of course still appreciated, given 
how big and complicated this was.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: GiST insert algorithm rewrite

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 21.12.2010 20:00, Heikki Linnakangas wrote:
>> One final version, with a bug fix wrt. root page split and some cleanup.
>> I'm planning to commit this before Christmas. It's a big patch, so
>> review would be much appreciated.

> Committed. Phew! Review & testing is of course still appreciated, given 
> how big and complicated this was.

I just found out that the "benchmark" test script in
contrib/intarray/bench/ crashes HEAD in gistdoinsert() --- it looks like
it's trying to pop to a stack entry that isn't there.

Run it per the instructions in the intarray documentation:

$ createdb TEST
$ psql TEST < ../_int.sql
...
$ ./create_test.pl | psql TEST
CREATE TABLE
CREATE TABLE
CREATE INDEX
CREATE INDEX
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

The script generates randomized data, so possibly it won't fail every
time, but it failed three out of three times for me.  The changes I'm
about to commit in intarray don't seem to make any difference.
        regards, tom lane


Re: GiST insert algorithm rewrite

From
Heikki Linnakangas
Date:
On 09.01.2011 07:05, Tom Lane wrote:
> I just found out that the "benchmark" test script in
> contrib/intarray/bench/ crashes HEAD in gistdoinsert() --- it looks like
> it's trying to pop to a stack entry that isn't there.
>
> Run it per the instructions in the intarray documentation:
>
> $ createdb TEST
> $ psql TEST<  ../_int.sql
> ...
> $ ./create_test.pl | psql TEST
> CREATE TABLE
> CREATE TABLE
> CREATE INDEX
> CREATE INDEX
> server closed the connection unexpectedly
>          This probably means the server terminated abnormally
>          before or while processing the request.
> connection to server was lost
>
> The script generates randomized data, so possibly it won't fail every
> time, but it failed three out of three times for me.  The changes I'm
> about to commit in intarray don't seem to make any difference.

Thanks, fixed. Apparently my testing never hit the case where an update, 
rather than an insert, into the root page causes it to split.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com