Re: FSM corruption leading to errors

From Pavan Deolasee
Subject Re: FSM corruption leading to errors
Date
Msg-id CABOikdM5rw=25qQc+wZoYN5yym2r09Q9X0Ria4_P48CGeCRU_g@mail.gmail.com
In reply to Re: FSM corruption leading to errors  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: FSM corruption leading to errors  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers


On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:


+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.

No, because the code above updates the FSM entries for those out-of-range blocks. But now that I look at it again, maybe this is not correct: it may get into an endless loop if the relation is repeatedly extended concurrently.
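
To make the endless-loop concern concrete, here is a minimal standalone sketch of one way such a loop could be bounded. It is a toy model, not PostgreSQL code and not the patch under discussion: the array `fsm_avail`, the helpers `toy_get_nblocks` and `toy_record_and_get`, the function `find_usable_block`, and the cap `MAX_FSM_RETRIES` are all hypothetical names invented for illustration. The idea is to re-read the relation size on every pass and to give up after a fixed number of retries.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

#define FSM_SLOTS        8   /* hypothetical: size of the toy FSM */
#define MAX_FSM_RETRIES  4   /* hypothetical: cap on loop iterations */

/* Toy relation: 3 real blocks, but the FSM holds stale entries for 5 and 6. */
static BlockNumber rel_nblocks = 3;
static unsigned fsm_avail[FSM_SLOTS] = {0, 0, 100, 0, 0, 200, 300, 0};

/* Stand-in for RelationGetNumberOfBlocks(); may change between calls. */
static BlockNumber
toy_get_nblocks(void)
{
    return rel_nblocks;
}

/*
 * Stand-in for RecordAndGetPageWithFreeSpace(): record the new free space
 * for old_block, then return some block with at least 'needed' bytes free.
 * The toy searches from the highest slot down so stale entries surface first.
 */
static BlockNumber
toy_record_and_get(BlockNumber old_block, unsigned old_avail, unsigned needed)
{
    assert(old_block < FSM_SLOTS);
    fsm_avail[old_block] = old_avail;

    for (BlockNumber b = FSM_SLOTS; b-- > 0;)
    {
        if (fsm_avail[b] >= needed)
            return b;
    }
    return InvalidBlockNumber;
}

/*
 * Skip FSM results that lie beyond the end of the relation, zeroing each
 * stale entry on the way. Unlike the loop quoted above, the relation size
 * is re-read on every pass and the number of retries is capped.
 */
static BlockNumber
find_usable_block(BlockNumber target, unsigned needed)
{
    int retries = 0;

    while (target != InvalidBlockNumber)
    {
        BlockNumber nblocks = toy_get_nblocks();

        if (target < nblocks)
            break;                      /* block really exists: use it */
        if (++retries > MAX_FSM_RETRIES)
            return InvalidBlockNumber;  /* give up; caller would extend */
        target = toy_record_and_get(target, 0, needed);
    }
    return target;
}
```

In this toy, calling `find_usable_block(6, 50)` clears the stale entries for blocks 6 and 5 and settles on block 2. Whether a retry cap is acceptable in the real code path, given the overhead concern raised in this thread, is a separate question.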
 
And isn't that going to break once the
relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I added the above code mostly to deal with already-corrupt FSMs. Maybe we can just document this and leave it to the user to run some correctness checks (see below), especially given that the code is not cheap and adds overhead for everybody, irrespective of whether they have or will ever have a corrupt FSM.
 
I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.


Right, that's how I tested for broken FSMs. A challenge with any such query is that if the shared-buffer copy of the FSM page is intact, the query won't report the problematic FSMs. Of course, if the fix is applied to the standby and the standby is restarted, then corrupt FSMs can be detected.
 

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..

Thanks for doing that.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
