Thread: [GENERAL] Would like to below scenario is possible for getting page/block corruption

[GENERAL] Would like to below scenario is possible for getting page/block corruption

From
Sreekanth Palluru
Date:
Hi,
I am working on a page corruption issue and want to know whether the scenario below is possible.

1) The client issues an INSERT; as I understand it, heap_insert() in heapam.c is called.
2) Say the table is full, so the relation is extended and a new block is added.
3) The tuple is placed in the new page for that block by RelationPutHeapTuple() (hio.c).
4) The WAL record is then inserted through recptr = XLogInsert(RM_HEAP_ID, info);
5) The backend then updates the page header with the WAL LSN: PageSetLSN(page, recptr);

If my server crashes after step 4), is it possible that after PostgreSQL restarts I get the error below when I access the relation, when vacuum runs on it, or when I take a backup with pg_dump?
ERROR:  invalid page header in block 204 of relation base/16413/16900

Or can Postgres recover the page automatically without throwing any error?

I would appreciate your response on this.

--
Regards
Sreekanth

Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

From
Michael Paquier
Date:
On Fri, Dec 9, 2016 at 9:46 AM, Sreekanth Palluru <sree4pg@gmail.com> wrote:
> Hi,
> I am working on a page corruption issue and want to know whether the
> scenario below is possible.
>
> 1) The client issues an INSERT; as I understand it, heap_insert() in
> heapam.c is called.
> 2) Say the table is full, so the relation is extended and a new block is added.
> 3) The tuple is placed in the new page for that block by
> RelationPutHeapTuple() (hio.c).
> 4) The WAL record is then inserted through
> recptr = XLogInsert(RM_HEAP_ID, info);
> 5) The backend then updates the page header with the WAL LSN:
> PageSetLSN(page, recptr);
>
> If my server crashes after step 4), is it possible that after PostgreSQL
> restarts I get the error below when I access the relation, when vacuum
> runs on it, or when I take a backup with pg_dump?
> ERROR:  invalid page header in block 204 of relation base/16413/16900

So the block is corrupted. You may want to move to another server.

> Or can Postgres recover the page automatically without throwing any error?

At crash recovery, Postgres would redo things from a point where
everything was consistent on disk. If this corrupted page made it to
disk, there is not much that can be done except restoring from a
backup. You could also set zero_damaged_pages to help here; you
would lose the data on this page, but you would still be able to run
pg_dump and get back as much data as you can. At the same time,
corruption can spread if this is a hardware problem, so you
may just be seeing the beginning of a series of problems.
--
Michael
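
For readers who want to see what the "invalid page header" error is testing, here is a minimal Python sketch loosely modeled on the page-header sanity check in PostgreSQL's bufpage.c. The 24-byte little-endian header layout and the constants are assumptions matching a default 8 kB build of the 9.x series, the real checks differ slightly between versions, and the names `header_looks_valid` and `scan_relation_file` are made up for this illustration.

```python
import struct

BLCKSZ = 8192                 # assumed default block size
SIZE_OF_PAGE_HEADER = 24      # assumed size of the fixed page header
PD_VALID_FLAG_BITS = 0x0007   # assumed set of defined pd_flags bits
MAXALIGN = 8                  # assumed maximum alignment on the platform

# Assumed layout: pd_lsn (2 x uint32), pd_tli, pd_flags, pd_lower,
# pd_upper, pd_special, pd_pagesize_version (uint16 each), pd_prune_xid.
HEADER = struct.Struct("<IIHHHHHHI")


def header_looks_valid(page: bytes) -> bool:
    """Return True if the page is all zeros or its header fields are sane."""
    if len(page) != BLCKSZ:
        return False
    if page == bytes(BLCKSZ):
        return True  # a never-initialized (all-zero) page is accepted
    (_, _, _, flags, lower, upper, special, _, _) = HEADER.unpack_from(page)
    return (
        (flags & ~PD_VALID_FLAG_BITS) == 0
        and SIZE_OF_PAGE_HEADER <= lower <= upper <= special <= BLCKSZ
        and special % MAXALIGN == 0
    )


def scan_relation_file(path: str) -> list:
    """Return the block numbers in a relation segment file that fail the check."""
    bad = []
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if not page:
                break
            if not header_looks_valid(page):
                bad.append(blkno)
            blkno += 1
    return bad
```

Run against a copy of the relation file taken with the server stopped (for example the file base/16413/16900 named in the error), `scan_relation_file` would list the blocks worth inspecting before deciding whether zero_damaged_pages is acceptable.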


Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

From
Sreekanth Palluru
Date:
Michael,
Can I generalize that if, after step 4), the page (new or existing) is written from the buffer to disk and a crash happens between steps 4) and 5), we always get block corruption in Postgres, which can only be repaired by setting zero_damaged_pages if all we have are pg_dump backups and we are OK with losing the data in the affected blocks?

I am also looking for ways to reproduce the issue and would appreciate your advice on that.


--
Regards
Sreekanth

Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

From
Michael Paquier
Date:
(Please don't top-post, that's annoying)

On Fri, Dec 9, 2016 at 10:28 AM, Sreekanth Palluru <sree4pg@gmail.com> wrote:
> Can I generalize that if, after step 4), the page (new or existing) is
> written from the buffer to disk and a crash happens between steps 4) and 5),
> we always get block corruption in Postgres, which can only be repaired by
> setting zero_damaged_pages if all we have are pg_dump backups and we are OK
> with losing the data in the affected blocks?
>
> I am also looking for ways to reproduce the issue and would appreciate your
> advice on that.

Postgres is designed to avoid such corruption problems if
full_page_writes and fsync are enabled; that's a cornerstone of its
reliability. If you can create a self-contained scenario that
reproduces a failure, that could be treated as a Postgres bug, but you
are giving no evidence that this is the case.
--
Michael
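
To illustrate why a crash between steps 4) and 5) is not expected to leave a broken page behind when fsync and full_page_writes are on, here is a toy Python model of redo. It is a sketch under simplifying assumptions, not PostgreSQL code: `WalRecord`, `redo` and the dict-based "disk" are hypothetical names, LSNs are plain integers, and a page is just an (LSN, payload) pair. It shows the two rules that matter: a record carrying a full-page image overwrites whatever is on disk, even a torn page, and a record without one is applied only if the page's LSN is older than the record's.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Page = Tuple[int, str]  # (page LSN, payload); a toy stand-in for an 8 kB block


@dataclass
class WalRecord:
    lsn: int
    block: int
    append: str = ""                   # change to apply to the payload
    full_page: Optional[str] = None    # full-page image, if the record has one


def redo(disk: Dict[int, Page], wal: List[WalRecord]) -> Dict[int, Page]:
    """Replay WAL records in order over the on-disk pages and return the result."""
    pages = dict(disk)
    for rec in wal:
        if rec.full_page is not None:
            # The image replaces the on-disk page regardless of its state.
            pages[rec.block] = (rec.lsn, rec.full_page)
            continue
        page_lsn, payload = pages.get(rec.block, (0, ""))
        if page_lsn >= rec.lsn:
            continue  # the change already reached disk; skip it
        pages[rec.block] = (rec.lsn, payload + rec.append)
    return pages


# The first change to block 204 after a checkpoint carries a full-page image;
# the crash left a torn copy of that block on disk.
wal = [
    WalRecord(lsn=10, block=204, full_page="row1;"),
    WalRecord(lsn=20, block=204, append="row2;"),
]
torn_disk = {204: (0, "r\x00w?")}
recovered = redo(torn_disk, wal)
print(recovered[204])
```

The model only holds if the WAL itself reached stable storage before the data page did, which is exactly what a lying write cache or a controller without battery backup can break.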


Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

From
Sreekanth Palluru
Date:
Michael,
Thanks for your prompt reply.

Both parameters are enabled in my environment. Here is a brief description of the PG database environment:
Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems whose software image contains PG. Around 5% of the customers are facing corruption issues.
We are in the process of reproducing the issue. Many variables are involved: Dell hardware, software image versions, application versions, RAID/disk write-cache settings, RAID controllers with no backup, power failures, etc. So I am trying to understand whether PG can end up with corrupted blocks after a system crash.

1) As I understand it, fsync writes the block from memory to disk, so the block as it stands just after step 4) would have been written to disk, assuming the disk cache did not lie.
2) Assume that full_page_writes=on has dumped the whole 8k block into WAL before the block is updated, i.e. after step 2) and before step 3).
3) If a crash happens after step 4), there is no page header data, so after the system restarts PG will complain about a corrupted block or an invalid header.

Please correct me if my understanding of the roles of fsync and full_page_writes is wrong. If it is right, I see a possibility of corruption whenever PG extends a relation and a crash happens just after step 4).

I am not sure whether the same applies to an existing page (not a new page), and how it is handled if a page header is available as part of the full-page write. Can the same corruption happen, or can PG recover the database? I am not sure whether the recovery process can update the page header from the WAL record written in step 4) (recptr).


-Sreekanth



--
Regards
Sreekanth

Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

From
Sreekanth Palluru
Date:
Correcting typos
Michael,
Thanks for your prompt reply.

Both parameters are enabled in my environment. Here is a brief description of the PG database environment:
Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems whose software image contains PG. Around 5% of the customers are facing corruption issues.
We are in the process of reproducing the issue. Many variables are involved: Dell hardware, software image versions, application versions, RAID/disk write-cache settings, RAID controllers with no battery backup, power failures, etc. So I am trying to understand whether PG can end up with corrupted blocks after a system crash even though we set these parameters.

a) As I understand it, fsync writes the block from memory to disk, so the block as it stands just after step 4) would have been written to disk, assuming the disk cache did not lie.
b) Assume that full_page_writes=on has dumped the whole 8k block into WAL before the block is updated, i.e. after step 2) and before step 3).
c) If a crash happens after step 4), there is no page header data, so after the system restarts PG will complain about a corrupted block or an invalid header.

Please correct me if my understanding of the roles of fsync and full_page_writes is wrong. If it is right, I see a possibility of corruption whenever PG extends a relation and a crash happens just after step 4).

I am not sure whether the same applies to an existing page (not a new page), and how it is handled if a page header is available as part of the full-page write. Can the same corruption happen, or can PG recover the database? I am not sure whether the recovery process can update the page header from the WAL record written in step 4) (recptr).

-Sreekanth


--
Regards
Sreekanth

[GENERAL] Re: [ADMIN] Would like to below scenario is possible for getting page/block corruption

From
Shreeyansh Dba
Date:
Hi Sreekanth,

I doubt that automatic recovery of the page is possible, since the page header is corrupted and no longer valid, and it is not clear whether the corruption occurred in a data block or an index block.

We have seen occurrences like this before, which were rectified by running REINDEX and VACUUM FULL on the index or the entire table.

If the corrupted block is a data block and reindexing didn't help, then depending on your current backup strategy, a logical backup (pg_dump/restore) or PITR may help in recovering from the corruption, provided you have intact, valid backups taken before you hit this error.

Hope this helps you find the required solution.

Please feel free to reach us if you have any queries.


[GENERAL] Re: [ADMIN] Would like to below scenario is possible for getting page/block corruption

From
Sreekanth Palluru
Date:
Shreeyansh,
We had the issue with a relation and fixed it by setting zero_damaged_pages and then running VACUUM FULL on the relation.

I am looking at the possibility of PG introducing corruption if a relation is extended and a crash happens before PG updates the new page's page header in memory.

Is this possible? Does PG update the page header when a relation is extended?
If so, what details does it write? Or will it be null?


Re: [GENERAL] Re: [ADMIN] Would like to below scenario is possible for getting page/block corruption

From
Michael Paquier
Date:
On Sun, Dec 11, 2016 at 12:00 PM, Sreekanth Palluru <sree4pg@gmail.com> wrote:
> I am looking at the possibility of PG introducing corruption if a relation
> is extended and a crash happens before PG updates the new page's page
> header in memory.
>
> Is this possible?

No.

> Does PG update the page header when a relation is extended?

You need to look at smgrextend(), which is called when extending an
on-disk relation file. The page is written in a correct shape.

> If so, what details does it write? Or will it be null?
--
Michael
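
On the last question, what a freshly extended page contains, here is a small Python sketch. It assumes, based on a reading of the 9.x sources rather than anything stated in this thread, that the new block is first written to the file as all zeros and that the header is then filled in memory by PageInit(). The 24-byte layout, the constants and the helper `page_init` are assumptions for illustration only.

```python
import struct

BLCKSZ = 8192                # assumed default block size
SIZE_OF_PAGE_HEADER = 24     # assumed size of the fixed page header
PG_PAGE_LAYOUT_VERSION = 4   # assumed layout version for 8.3 and later

# Assumed layout: pd_lsn (2 x uint32), pd_tli, pd_flags, pd_lower,
# pd_upper, pd_special, pd_pagesize_version (uint16 each), pd_prune_xid.
HEADER = struct.Struct("<IIHHHHHHI")


def page_init(special_size: int = 0) -> bytes:
    """Build an empty page the way PageInit() is assumed to lay it out."""
    page = bytearray(BLCKSZ)              # start from an all-zero block
    upper = BLCKSZ - special_size         # free space ends where special starts
    HEADER.pack_into(
        page, 0,
        0, 0,                             # pd_lsn: not set until WAL is written
        0,                                # pd_tli
        0,                                # pd_flags
        SIZE_OF_PAGE_HEADER,              # pd_lower: no line pointers yet
        upper,                            # pd_upper
        upper,                            # pd_special
        BLCKSZ | PG_PAGE_LAYOUT_VERSION,  # pd_pagesize_version
        0,                                # pd_prune_xid
    )
    return bytes(page)


extended_block = bytes(BLCKSZ)   # what the file holds right after extension
fresh_page = page_init()         # what the buffer holds after initialization
fields = HEADER.unpack_from(fresh_page)
print(dict(zip(
    ("lsn_hi", "lsn_lo", "tli", "flags", "lower", "upper", "special",
     "pagesize_version", "prune_xid"),
    fields,
)))
```

Under these assumptions neither state is a "null" header that fails validation: an all-zero block is treated as an empty, not-yet-initialized page, and an initialized page carries consistent pd_lower, pd_upper and pd_special values even before PageSetLSN() runs.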