Thread: 9.4 btree index corruption

9.4 btree index corruption

From: Jeff Janes
Date:
Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30) with foreign keys, I found some btree index corruption.

28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR:  right sibling 4044 of block 460 is not next child 23513 of block 1264 in index "foo_p_id_idx"
28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT:  VACUUM;

It took ~8 hours on 8 cores to encounter this problem.  This is a single occurrence; it has not yet been reproduced.

I don't know whether the partial-write, crash-recovery, or foreign-key parts of this test are important--it could be a more generic problem that only happened to be observed here.  Nor do I know yet whether it occurs in 9_3_STABLE.

Below are the testing harness and the data directory (massively bloated at 3.7GB once uncompressed).  The cluster is currently in wrap-around shutdown, but that is the effect of the persistent vacuum failures, not their cause.  You can restart the data directory and it will repeat the above sibling error once autovacuum kicks in.  I don't know whether the bloat is due to the vacuum failures or was already in progress before they started; I've cranked up the logging on that front for future efforts.

I'm using some fast-forward code on the xid consumption so that freezing occurs more often.  Some people have expressed reservations that that code might be imperfect, and I can't rule it out as the cause (though I've never traced any other problems back to it).  But it did make it through 4 complete wraps before this problem was encountered, so if that code is the problem, it must be probabilistic rather than deterministic.

https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

Cheers,

Jeff

Re: 9.4 btree index corruption

From: Heikki Linnakangas
Date:
On 05/21/2014 10:22 PM, Jeff Janes wrote:
> Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
> with foreign keys, I found some btree index corruption.
>
> 28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR:  right sibling 4044 of
> block 460 is not next child 23513 of block 1264 in index "foo_p_id_idx"
> 28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT:  VACUUM;
>
> It took ~8 hours on 8 cores to encounter this problem.  This is a single
> occurrence, it has not yet been reproduced.
>
> I don't know that the partial-writes, or the crash recovery, or the foreign
> key, parts of this test are important--it could be a more generic problem
> that only happened to be observed here.  Nor do I know yet if it occurs in
> 9_3_STABLE.
>
> Below is the testing harness and the data directory (massively bloated at
> 3.7GB once uncompressed).  It is currently in wrap-around shutdown, but
> that is the effect of persistent vacuum failures, not the cause of them.
>   You can restart the data directory and it will repeat the above sibling
> error once autovac kicks in.  I don't know if the bloat is due to the
> vacuum failure or if it was already in process before the failures started.
>   I've cranked up the logging on that front future efforts.
>
> I'm using some fast-foward code on the xid consumption so that freezing
> occurs more often, and some people have expressed reservations that the
> code might be imperfect, and I can't rule that out as the cause (but I've
> never traced any other problems back to that code).  But it did make it
> through 4 complete wraps before this problem was encountered, so if that is
> the problem it must be probabilistic rather than deterministic.
>
> https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

I downloaded the data directory and investigated. I got this message
when I started it up:

20392  2014-05-25 05:51:37.835 PDT:ERROR:  right sibling 4044 of block
460 is not next child 23513 of block 86458 in index "foo_p_id_idx"
20392  2014-05-25 05:51:37.835 PDT:CONTEXT:  automatic vacuum of table
"jjanes.public.foo"

Interestingly, it's complaining about parent page 86458, while yours
claimed it was 1264. I don't know why; perhaps a huge number of
insertions happened after that error, causing the parent level pages to
be split, moving the downlinks it complains about to the right. Did you
continue running the test after that error occurred?

This is what the tree looks like around those pages:

Level 1:
+-------------+     +-------------+     +-------------+
| Blk 1264    |     | Blk 160180  |     | Blk 86458   |
|             |     |             |     |             |
| Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
| ...         |     | ...         |     |     1269    |
|             |     |             |     |      460    |
|             |     |             |     |    23513    |
+-------------+     +-------------+     +-------------+


Leaf level:
+-------------+     +---------+     +-----------+     +-------------+
| Blk 1269    |     | Blk 460 |     | Blk 4044  |     | Blk 23513   |
|             | <-> |         | <-> | HALF_DEAD | <-> |             |
| (314 items) |     | (empty) |     |           |     | (212 items) |
+-------------+     +---------+     +-----------+     +-------------+

Leaf block 4044 is marked as half-dead, and there is no downlink to it
on the parent block, 86458. That is normal when a vacuum is interrupted
during page deletion (in 9.4); the next vacuum ought to continue it.
However, because block 460 is empty, the next vacuum that comes along
will try to start the page deletion process on that block. Page deletion
does a cross-check at the parent level, checking that the next downlink
on the parent page points to the right sibling of the page we're about
to delete. However, because the right sibling is already marked as
half-dead, and the downlink pointing to it has already been removed, the
cross-check fails.
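
A state like this is easy to check for yourself with the pageinspect
extension, if it's installed. Roughly like this, using the block numbers
from the error message; the flag value is from nbtree.h, quoted from
memory, so double-check it:

CREATE EXTENSION pageinspect;

-- Leaf block 460: should show zero live items, i.e. the empty page
-- that VACUUM wants to delete.
SELECT * FROM bt_page_stats('foo_p_id_idx', 460);

-- Leaf block 4044: btpo_flags should have the BTP_HALF_DEAD bit (0x10)
-- set, i.e. the interrupted deletion.  The btpo_prev/btpo_next columns
-- are the sibling links, which is how you walk the leaf level
-- 1269 <-> 460 <-> 4044 <-> 23513.
SELECT * FROM bt_page_stats('foo_p_id_idx', 4044);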

So, the sanity check is a bit overzealous. It's normal for the right
sibling to have no downlink if it's marked as half-dead. The trivial
fix is to just remove the cross-check, but I'd rather not do that
because, generally speaking, more sanity checks are good.

I think we could fetch the right sibling page to see if it's marked as
half-dead, and if it is, just give up on deleting the current page. This
vacuum will eventually get to the right sibling page and finish the
deletion, and the next vacuum will then be able to delete the empty
page. Or we could finish the deletion of the right page immediately, but
I'd like to keep the rarely-occurring corner-case codepaths to a minimum.


With the attached patch, I was able to get past that error, but when
VACUUM reaches the end, I got this:

jjanes=# vacuum foo;
ERROR:  database is not accepting commands to avoid wraparound data loss
in database "jjanes"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

That seems like an unrelated issue; I'll start a new thread about it.

Many thanks for the testing, Jeff!

- Heikki


Attachments

Re: 9.4 btree index corruption

From: Jeff Janes
Date:
On Sun, May 25, 2014 at 7:16 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 05/21/2014 10:22 PM, Jeff Janes wrote:
>> Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
>> with foreign keys, I found some btree index corruption.
>>
>> ...
>
> I downloaded the data directory and investigated. I got this message when I started it up:
>
> 20392  2014-05-25 05:51:37.835 PDT:ERROR:  right sibling 4044 of block 460 is not next child 23513 of block 86458 in index "foo_p_id_idx"
> 20392  2014-05-25 05:51:37.835 PDT:CONTEXT:  automatic vacuum of table "jjanes.public.foo"
>
> Interestingly, it's complaining about parent page 86458, while yours claimed it was 1264. I don't know why; perhaps a huge number of insertions happened after that error, causing the parent-level pages to be split, moving the downlinks it complains about to the right. Did you continue running the test after that error occurred?

Yes, I didn't set it up to abort upon vacuum errors, so it continued to run until it reached the 1,000,000 wrap around limit.

Between when the vacuums started failing and when it reached the wraparound limit, it seemed to run normally (other than the increasing bloat and the non-advancement of frozenxid).
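
(By non-advancement I mean that age(datfrozenxid) for the database just kept climbing; a couple of trivial catalog queries like the ones below are enough to watch it.)

-- How close each database is to the wraparound limit.
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;

-- Whether vacuum has managed to advance relfrozenxid on the test table.
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relname = 'foo';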

I only noticed the ERROR when I was looking through the log file during the postmortem.  Looking back, I see 10 errors with 1264, and then it switched to 86458.  I didn't notice the change and reported only the first ERROR message.

I'll apply your patch and see what happens, but 90 further hours of testing gave no more occurrences of the error, so I'm afraid that not seeing errors is not much evidence that it was fixed.  I'll have to rely on your knowledge of the code.



> This is what the tree looks like around those pages:
>
> Level 1:
> +-------------+     +-------------+     +-------------+
> | Blk 1264    |     | Blk 160180  |     | Blk 86458   |
> |             |     |             |     |             |
> | Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
> | ...         |     | ...         |     |     1269    |
> |             |     |             |     |      460    |
> |             |     |             |     |    23513    |
> +-------------+     +-------------+     +-------------+



Is this done with the pageinspect extension?  


> With the attached patch, I was able to get past that error, but when VACUUM reaches the end, I got this:
>
> jjanes=# vacuum foo;
> ERROR:  database is not accepting commands to avoid wraparound data loss in database "jjanes"
> HINT:  Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions.
>
> That seems like an unrelated issue, I'll start a new thread about that.

Yeah, that's expected.  Since vacuum failures don't abort the test, it ran on until something else stopped it.



> Many thanks for the testing, Jeff!

You're welcome.  If only I was as good at fixing things as at breaking them. 

Cheers,

Jeff

Re: 9.4 btree index corruption

From: Heikki Linnakangas
Date:
On 05/25/2014 05:45 PM, Jeff Janes wrote:
> On Sun, May 25, 2014 at 7:16 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>
>> On 05/21/2014 10:22 PM, Jeff Janes wrote:
>>
>>> Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
>>> with foreign keys, I found some btree index corruption.
>>>
>>> ...
>
>>
>>> https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&
>>> usp=sharing
>>>
>>
>> I downloaded the data directory and investigated. I got this message when
>> I started it up:
>>
>> 20392  2014-05-25 05:51:37.835 PDT:ERROR:  right sibling 4044 of block 460
>> is not next child 23513 of block 86458 in index "foo_p_id_idx"
>> 20392  2014-05-25 05:51:37.835 PDT:CONTEXT:  automatic vacuum of table
>> "jjanes.public.foo"
>>
>> Interestingly, it's complaining about parent page 86458, while yours
>> claimed it was 1264. I don't know why; perhaps a huge number of insertions
>> happened after that error, causing the parent level pages to be split,
>> moving the downlinks it complains about to the right. Did you continue
>> running the test after that error occurred?
>>
>
> Yes, I didn't set it up to abort upon vacuum errors, so it continued to run
> until it reached the 1,000,000 wrap around limit.
>
> Between when the vacuums started failing and when it reached wrap around
> limit, it seemed to run normally (other than the increasing bloat and
> non-advancement of frozenxid).
>
> I only noticed the ERROR when I was looking through the log file in
> postmortem.  Looking back, I see 10 errors with 1264, then it switched to
> 86458.  I didn't notice the change and reported only the first ERROR
> message.

Ok, that explains the difference in the error message.

> I'll apply your patch and see what happens, but 90 further hours of testing
> gave no more occurrences of the error so I'm afraid that not seeing errors
> is not much evidence that the error was fixed.  So I'll have to rely on
> your knowledge of the code.

I've pushed the fix.

>> This is what the tree looks like around those pages:
>>
>> Level 1:
>> +-------------+     +-------------+     +-------------+
>> | Blk 1264    |     | Blk 160180  |     | Blk 86458   |
>> |             |     |             |     |             |
>> | Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
>> | ...         |     | ...         |     |     1269    |
>> |             |     |             |     |      460    |
>> |             |     |             |     |    23513    |
>> +-------------+     +-------------+     +-------------+
>>
>>
>
> Is this done with the pageinspect extension?

Yeah, with some more tooling on top of it. The elegant diagram I drew by 
hand.
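
The raw data is just what the pageinspect functions return; for example,
something along these lines (for the data items on an internal page, the
block number in the ctid column is the downlink to the child page):

-- Overall shape of the index: root block and number of levels.
SELECT * FROM bt_metap('foo_p_id_idx');

-- Items on the level-1 page 86458; their ctid block numbers are the
-- downlinks (1269, 460, 23513) shown in the diagram above.
SELECT itemoffset, ctid, data
FROM bt_page_items('foo_p_id_idx', 86458);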

Thanks again! BTW, it would be useful to keep the WAL logs for as long 
as you have disk space. I'm not sure if that would've helped in this 
case, but it can be really useful to see all the actions that were done 
to the page before something goes wrong.
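
One low-tech way to do that, if you have the disk space, is to turn on
WAL archiving and have archive_command simply copy the segments
somewhere. Something like the below; the target directory is made up, so
adjust to taste:

ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /some/big/disk/wal_archive/%f';
-- archive_mode only takes effect after a postmaster restart;
-- archive_command changes are picked up on reload.
SELECT pg_reload_conf();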

- Heikki



Re: 9.4 btree index corruption

From: Andres Freund
Date:
Hi,

On 2014-05-25 14:45:38 -0700, Jeff Janes wrote:
> You're welcome.  If only I was as good at fixing things as at breaking
> them.

At the moment there seems to be a bigger shortage of folks who are good at
breaking stuff - before the release! - than of people fixing the
resulting breakage...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services