Обсуждение: FSM corruption and standby servers

Поиск
Список
Период
Сортировка

FSM corruption and standby servers

От
Tim Goodaire
Дата:
Hello,

I have a question regarding the FSM corruption bug that is fixed in postgresql 9.5.5 (https://wiki.postgresql.org/wiki/Free_Space_Map_Problems). If I don't find any corruption on a master database, is it still possible that there is corruption on the standbys? 

--
Tim Goodaire
Database Engineer
Dyn, Inc
tgoodaire@dyn.com
M: 603-264-6642

Re: FSM corruption and standby servers

От
"Hunley, Douglas"
Дата:

On Mon, Oct 31, 2016 at 10:38 AM, Tim Goodaire <tgoodaire@dyn.com> wrote:
I have a question regarding the FSM corruption bug that is fixed in postgresql 9.5.5 (https://wiki.postgresql.org/wiki/Free_Space_Map_Problems). If I don't find any corruption on a master database, is it still possible that there is corruption on the standbys? 

It shouldn't be, iirc. FSMs are only ever created/updated by vacuum, which doesn't run on a slave until it is promoted to a master.


--
{
  "name" : "douglas j hunley",
  "email" : "douglas.hunley@openscg.com",
  "title" : "database engineer",
  "phone" : "+1 732 339 3419 x163"
}

Re: FSM corruption and standby servers

От
Tom Lane
Дата:
"Hunley, Douglas" <douglas.hunley@openscg.com> writes:
> On Mon, Oct 31, 2016 at 10:38 AM, Tim Goodaire <tgoodaire@dyn.com> wrote:
>> I have a question regarding the FSM corruption bug that is fixed in
>> postgresql 9.5.5 (https://wiki.postgresql.org/wiki/Free_Space_Map_Problems).
>> If I don't find any corruption on a master database, is it still possible
>> that there is corruption on the standbys?

> It shouldn't be, iirc. FSMs are only ever created/updated by vacuum, which
> doesn't run on a slave until it is promoted to a master.

The problem is that the WAL data can be wrong in these cases, and since
the standbys only know what they were told in the WAL stream, their images
will be wrong even if the master is valid.

I would have thought that the referenced page is clear enough about
needing to check the standbys; do you think it isn't?

            regards, tom lane


Re: FSM corruption and standby servers

От
Tim Goodaire
Дата:


On Mon, Oct 31, 2016 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Hunley, Douglas" <douglas.hunley@openscg.com> writes:
> On Mon, Oct 31, 2016 at 10:38 AM, Tim Goodaire <tgoodaire@dyn.com> wrote:
>> I have a question regarding the FSM corruption bug that is fixed in
>> postgresql 9.5.5 (https://wiki.postgresql.org/wiki/Free_Space_Map_Problems).
>> If I don't find any corruption on a master database, is it still possible
>> that there is corruption on the standbys?

> It shouldn't be, iirc. FSMs are only ever created/updated by vacuum, which
> doesn't run on a slave until it is promoted to a master.

The problem is that the WAL data can be wrong in these cases, and since
the standbys only know what they were told in the WAL stream, their images
will be wrong even if the master is valid.

I would have thought that the referenced page is clear enough about
needing to check the standbys; do you think it isn't?

The page does clearly say that you need to check standby databases as well. The bit that I was unsure of is whether the absence of corruption on the master was evidence that this problem has not affected the standbys.


--
Tim Goodaire
Database Engineer
Dyn, Inc
tgoodaire@dyn.com
M: 603-264-6642

Re: FSM corruption and standby servers

От
"David G. Johnston"
Дата:
On Mon, Oct 31, 2016 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Hunley, Douglas" <douglas.hunley@openscg.com> writes:
> On Mon, Oct 31, 2016 at 10:38 AM, Tim Goodaire <tgoodaire@dyn.com> wrote:
>> I have a question regarding the FSM corruption bug that is fixed in
>> postgresql 9.5.5 (https://wiki.postgresql.org/wiki/Free_Space_Map_Problems).
>> If I don't find any corruption on a master database, is it still possible
>> that there is corruption on the standbys?

> It shouldn't be, iirc. FSMs are only ever created/updated by vacuum, which
> doesn't run on a slave until it is promoted to a master.

The problem is that the WAL data can be wrong in these cases, and since
the standbys only know what they were told in the WAL stream, their images
will be wrong even if the master is valid.

I would have thought that the referenced page is clear enough about
needing to check the standbys; do you think it isn't?

​I can ​see how the following is a bit loose for someone not super-familiar with WAL.

"A database crash-and-restart shortly after such an event can lead to corrupted FSMs. Also, standby servers will receive incorrect WAL data causing them to create corrupted FSMs locally."

I believe the "shortly" here is present because the crash must occur before the next checkpoint in order for the problem to appear on the master.  Given this constraint the secondary emphasis that standby servers receive seems mis-placed.  The most probable scenario - given the bug has manifested and one is running a standby - is a broken standby and a functioning master.​

"Standby servers are directly impacted by this bug and must be checked for corruption even if their master appears clean.  The master will only exhibit a problem if there is a crash-and-restart cycle shortly after (up until a checkpoint) the problem statement that causes the master to replay the just generated WAL."

It is not clear to what extent traditional backups (in the realm of using pg_basebackup) are affected...

David J.


Re: FSM corruption and standby servers

От
Tom Lane
Дата:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, Oct 31, 2016 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I would have thought that the referenced page is clear enough about
>> needing to check the standbys; do you think it isn't?

> ​I can ​see how the following is a bit loose for someone not super-familiar
> with WAL.

Yeah.  On the other hand, I don't want people who aren't running
replication to stop reading as soon as they see something about standby
servers.  I tweaked the existing wording a bit to emphasize that standbys
can be corrupt even with a valid master; hopefully that's enough.

            regards, tom lane


Re: FSM corruption and standby servers

От
Poul Kristensen
Дата:
I have some comments to the documentation. 

I worked with PostgreSQL for some time ago( 5-8 years)  setting up a few Postgres databases using version 8.4 and RHEL 5:* in production and still running.
A LOT have happened since then and I am going to setup Postgres 9+ with everything needed to secure data, backup(PITR) 
, cluster with zero dataloss. OS is RHEL 7.2.
I my opinion the documentation is not very straightforward with examples omn exactly how to make such a complete installation using
2+ servers to make data available 24/7. One has to pick up a little here and a litte there. And one always have to do a litte there and here to make things switch
in order to make data available 24/7. 
Creating of tablespace can't be scriptet, Restore(I know this another list(pg_rman) can't handle symbolic links. 
Sorry about my belching as vocabulary and my bad english!

/Poul
     

 

2016-10-31 19:19 GMT+01:00 Tom Lane <tgl@sss.pgh.pa.us>:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, Oct 31, 2016 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I would have thought that the referenced page is clear enough about
>> needing to check the standbys; do you think it isn't?

> ​I can ​see how the following is a bit loose for someone not super-familiar
> with WAL.

Yeah.  On the other hand, I don't want people who aren't running
replication to stop reading as soon as they see something about standby
servers.  I tweaked the existing wording a bit to emphasize that standbys
can be corrupt even with a valid master; hopefully that's enough.

                        regards, tom lane


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin



--
Med venlig hilsen / Best regards
Poul Kristensen
Linux-OS/Virtualizationexpert and Oracle DBA

Re: FSM corruption and standby servers

От
Poul Kristensen
Дата:
If or when I get a successfull installation containing the abovementioned components in 24/7 operation I
will try to contribute to your documentation. Before that I would appreciate if one of you kind people
would review the solution. The cluster(OS) design I intend to get rewieved by Redhat to be sure of zero faulttolerant.

Thanks.

Poul

2016-10-31 22:46 GMT+01:00 Poul Kristensen <bcc5226@gmail.com>:
I have some comments to the documentation. 

I worked with PostgreSQL for some time ago( 5-8 years)  setting up a few Postgres databases using version 8.4 and RHEL 5:* in production and still running.
A LOT have happened since then and I am going to setup Postgres 9+ with everything needed to secure data, backup(PITR) 
, cluster with zero dataloss. OS is RHEL 7.2.
I my opinion the documentation is not very straightforward with examples omn exactly how to make such a complete installation using
2+ servers to make data available 24/7. One has to pick up a little here and a litte there. And one always have to do a litte there and here to make things switch
in order to make data available 24/7. 
Creating of tablespace can't be scriptet, Restore(I know this another list(pg_rman) can't handle symbolic links. 
Sorry about my belching as vocabulary and my bad english!

/Poul
     

 

2016-10-31 19:19 GMT+01:00 Tom Lane <tgl@sss.pgh.pa.us>:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, Oct 31, 2016 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I would have thought that the referenced page is clear enough about
>> needing to check the standbys; do you think it isn't?

> ​I can ​see how the following is a bit loose for someone not super-familiar
> with WAL.

Yeah.  On the other hand, I don't want people who aren't running
replication to stop reading as soon as they see something about standby
servers.  I tweaked the existing wording a bit to emphasize that standbys
can be corrupt even with a valid master; hopefully that's enough.

                        regards, tom lane


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin



--
Med venlig hilsen / Best regards
Poul Kristensen
Linux-OS/Virtualizationexpert and Oracle DBA



--
Med venlig hilsen / Best regards
Poul Kristensen
Linux-OS/Virtualizationexpert and Oracle DBA