Re: amcheck verification for GiST

From: Andrey Borodin
Subject: Re: amcheck verification for GiST
Date:
Msg-id: 6D0F90B2-4515-477C-8BF3-31E9FE64C2E1@yandex-team.ru
In reply to: Re: amcheck verification for GiST  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: amcheck verification for GiST
List: pgsql-hackers
Thanks for looking into this!

> On 27 Mar 2019, at 22:29, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 27/03/2019 11:51, Andrey Borodin wrote:
>> Hi!
>> Here's new version of GiST amcheck, which takes into account recently committed GiST VACUUM.
>> It tests that deleted pages do not contain any data.
>
> Thanks! I had a look, and refactored it quite a bit.
Cool! The new scan logic is much easier to read.

> I found the way the recursion worked confusing. On each internal page, it
> looped through all the child nodes, checking the consistency of the
> downlinks. And then it looped through the children again, to recurse. This
> isn't performance-critical, but visiting every page twice still seems strange.
>
> In gist_check_page_keys(), if we get into the code to deal with a concurrent
> update, we set 'parent' to point to a tuple on a parent page, then unlock it,
> and continue to look at remaining tuples, using the pointer that points to an
> unlocked buffer.
Uh, that was a tricky bug.

> I came up with the attached, which fixes the above-mentioned things. I also
> replaced the check that each node has only internal or leaf children, with a
> different check that the tree has the same height in all branches. That
> catches more potential problems, and was easier to implement after the
> refactoring. This still needs at least a round of fixing typos and tidying
> up comments, but it's more straightforward now, IMHO.
>
> What have you been using to test this?
Please see the attached patch with the line
//if (false) // THIS LINE IS INTENTIONALLY BROKEN
which breaks GiST consistency. Uncomment it to create a logically broken GiST.

Also, I've fixed the buffer release and a small typo (childblkno -> parentblkno).

To test that it does not deadlock with inserts and vacuums, I use pgbench the same way as it is used here [0], but also with
SELECT gist_index_parent_check('gist_check_idx');

Also, I've added a NOTICE for when the parent is not refound. To test this, I removed the adjustment check here:
if (stack->parenttup && gistgetadjusted(rel, stack->parenttup, idxtuple, state))

> XXX: shouldn't we rather use gist_consistent?
No, a consistency test would have to use a scan strategy, which can be absent from the opclass.
Parent tuples must be adjusted with every child. We check this simple invariant; it should cover all corner cases.
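For illustration, a minimal sketch of that invariant as it is checked for each child tuple (variable names follow the snippet above; gistgetadjusted() is the existing GiST support routine):

/*
 * gistgetadjusted() returns NULL when the parent downlink already
 * covers the child tuple; a non-NULL result means the parent key
 * would have to be enlarged to cover it, i.e. either corruption or
 * a concurrent update that we must recheck against a refetched parent.
 */
IndexTuple adjusted = gistgetadjusted(rel, stack->parenttup, idxtuple, state);

if (adjusted != NULL)
{
    /* discrepancy: recheck the parent page before reporting corruption */
}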


> On 28 Mar 2019, at 04:57, Peter Geoghegan <pg@bowt.ie> wrote:
>
>
>> In gist_check_page_keys(), if we get into the code to deal with a
>> concurrent update, we set 'parent' to point to a tuple on a parent page,
>> then unlock it, and continue to look at remaining tuples, using the
>> pointer that points to an unlocked buffer.
>
> Why not just copy the page into a local buffer? See my remarks on this below.
It already was copied; the bug was in reusing a copy from a released buffer, in the case when we suspected an error that turned out not to be one.

>
> You probably didn't mean to leave this "boom" error behind:
>
>> +   if (GistPageIsDeleted(page))
>> +   {
>> +       elog(ERROR,"boom");
Oops. Sorry.

> I see that you have this check for deleted pages:
>
>> +       if (PageGetMaxOffsetNumber(page) > InvalidOffsetNumber)
>> +           ereport(ERROR,
>> +                   (errcode(ERRCODE_INDEX_CORRUPTED),
>> +                   errmsg("index \"%s\" has deleted page with tuples",
>> +                           RelationGetRelationName(rel))));
>> +   }
>
> Why not have a similar check for non-deleted pages, whose maxoffset
> must be <= MaxIndexTuplesPerPage?
Done.
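Roughly like this, mirroring the deleted-page case (a sketch; the exact message wording in the patch may differ):

if (PageGetMaxOffsetNumber(page) > MaxIndexTuplesPerPage)
    ereport(ERROR,
            (errcode(ERRCODE_INDEX_CORRUPTED),
             errmsg("index \"%s\" has page with invalid max offset number",
                    RelationGetRelationName(rel))));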

>
> I see various errors like this one:
>
>> +           if (MAXALIGN(ItemIdGetLength(iid)) != MAXALIGN(IndexTupleSize(idxtuple)))
>> +               ereport(ERROR,
>> +                       (errcode(ERRCODE_INDEX_CORRUPTED),
>> +                        errmsg("index \"%s\" has inconsistent tuple sizes",
>> +                               RelationGetRelationName(rel))));
>
> Can't we say which TID is involved here, so we can find the offending
> corrupt tuple afterwards? Or at least the block number? And maybe even
> the LSN of the page? I think that that kind of stuff could be added in
> a number of places.
I've added the block number and offset whenever known. I do not understand the point of the LSN here...
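For example, the tuple-size check now reads along these lines (a sketch; blkno and offset stand for whatever location info is known at the call site):

if (MAXALIGN(ItemIdGetLength(iid)) != MAXALIGN(IndexTupleSize(idxtuple)))
    ereport(ERROR,
            (errcode(ERRCODE_INDEX_CORRUPTED),
             errmsg("index \"%s\" has inconsistent tuple sizes, block %u, offset %u",
                    RelationGetRelationName(rel), blkno, offset)));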

>
> I see this stuff that's related to concurrent processes:
>
>> +       /*
>> +        * It's possible that the page was split since we looked at the parent,
>> +        * so that we missed the downlink of the right sibling when we
>> +        * scanned the parent. If so, add the right sibling to the stack now.
>> +        */
>
>> +               /*
>> +                * There was a discrepancy between parent and child tuples.
>> +                * We need to verify that it is not a result of a concurrent
>> +                * call of gistplacetopage(). So, lock the parent and try to
>> +                * find the downlink for the current page. It may be missing
>> +                * due to a concurrent page split; this is OK.
>> +                */
>
> Is this really needed? Isn't the ShareLock on the index sufficient? If so, why?
There may be concurrent inserts; in GiST they can reorder items on a page.

>
>> +               stack->parenttup = gist_refind_parent(rel, stack->parentblk, stack->blkno, strategy);
>
> If the gistplacetopage() stuff is truly necessary, then is it okay to
> call gist_refind_parent() with the original buffer lock still held
> like this?
When we call gist_refind_parent() we hold the lock on the child and also lock the parent.
We exclude concurrent VACUUM, thus the parent cannot become a child of the current child, because the page would have to be recycled for such a coincidence.
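A simplified sketch of the locking protocol (the buffer variable is illustrative; stack->parentblk, stack->blkno and strategy are from the snippet above):

/*
 * The child buffer stays share-locked while the parent is re-read, so
 * the child page cannot change under us; amcheck's ShareLock on the
 * index excludes VACUUM, so the parent block cannot be recycled into
 * a descendant of the child while we hold these locks.
 */
Buffer parentbuf = ReadBufferExtended(rel, MAIN_FORKNUM, stack->parentblk,
                                      RBM_NORMAL, strategy);
LockBuffer(parentbuf, GIST_SHARE);
/* ... scan the parent page for a downlink pointing to stack->blkno ... */
UnlockReleaseBuffer(parentbuf);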

>
> I still suspect that we should have something like palloc_btree_page()
> for GiST, so that we're always operating on a copy of the page in
> palloc()'d memory.
That actually is what we are doing now. GistScanItem->parenttup is always a copy. But we make many small copies of each individual tuple, because we sometimes have to refresh these copies.

>  Maybe it's worthwhile to do something clever with
> concurrently holding buffer locks, but if that's what we're doing here
> then I would also expect us to have something weaker than ShareLock as
> our relation-level heavyweight lock. And, there should be a prominent
> explanation of the theory behind it somewhere.
We definitely can run under ShareLock. And I will try to write up all the things that prevent us from using weaker levels in the next message...

>
>> What have you been using to test this?
>
> pg_hexedit has full support for GiST.     ;-)
For me it is easier to break GiST in its code for tests :)

Thanks!

Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/96ec7ebd-42b9-4df5-18a4-42181c8a5a41%40iki.fi#be331694fd0c8f2a575b3e51a4306e72

