Thread: [GENERAL] requested timeline doesn't contain minimum recovery point

[GENERAL] requested timeline doesn't contain minimum recovery point

From
Tom DalPozzo
Date:
Hi,
something is happening in my replication setup that is not clear to me. I think I'm missing something.
I have two servers, red and blue.
red is the primary and blue is the standby, with asynchronous replication.
Now:
1 cleanly stop red
2 promote blue
3 insert tuples in blue
4 from the red site, run pg_rewind from blue to the red data directory
5 start red as standby -> OK
6 wait a long time and then cleanly stop blue
7 promote red
8 insert tuples in red
9 from the blue site, run pg_rewind from red to the blue data directory
10 start blue as standby -> I get "requested timeline 3 doesn't contain minimum recovery point 1/... on timeline 1"

Sometimes this "switching game" works up to timeline 4 or 5; it does not always fail at 3.

Regards
Pupillo






Re: [GENERAL] requested timeline doesn't contain minimum recovery point

From
Michael Paquier
Date:
On Fri, Jan 6, 2017 at 1:01 AM, Tom DalPozzo <t.dalpozzo@gmail.com> wrote:
> Hi,
> there is something happening in my replication that is not clear to me. I
> think I'm missing something.
> I've two server, red and blue.
> red is primary blue is standby, async repl.
> Now:
> 1 cleanly stop red
> 2 promote blue
> 3 insert tuples in blue
> 4 from red site, pg_rewind from blue to red dir.
> 5 start red as standby-> OK
> 6 wait a long time and then cleanly stop blue
> 7 promote red
> 8 insert tuples in red
> 9 from blue site, pg_rewind from red to blue dir
> 10 start blue as standby -> I get "requested timeline 3 doesn't contain
> minimum recovery point 1/... on timeline 1
>
> Sometimes this "switching game"  works up to timeline 4 or 5, not always 3

Could you give more details? What does pg_rewind tell you at each
phase? Is that on Postgres 9.5 or 9.6? I use pg_rewind quite
extensively on 9.5 but I have no problems of this kind with multiple
timeline jumps when juggling between two nodes. Another thing that
comes to mind: you are using pg_rewind with a source node that is
running. You should issue a checkpoint manually after promoting the
node to be sure that its control file gets the new timeline number.
--
Michael
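
The advice above (promote, force a checkpoint, then rewind) can be sketched as a command sequence. This is a minimal Python sketch and was not posted in the thread: the helper names `failover_commands` and `checkpoint_timeline`, the data directory paths, and the connection string are hypothetical. It only builds the command lines and parses `pg_controldata` output; it does not run anything.

```python
import re


def failover_commands(new_primary_dir, old_primary_dir, source_conninfo):
    """Return the command lines for one role switch, in order.

    All three arguments are placeholders chosen for this sketch.
    """
    return [
        # Promote the standby so that it becomes the new primary.
        ["pg_ctl", "promote", "-D", new_primary_dir],
        # Force a checkpoint on the promoted node so that its control file
        # records the new timeline before pg_rewind reads it.
        ["psql", "-d", source_conninfo, "-c", "CHECKPOINT;"],
        # Rewind the cleanly stopped old primary against the running source.
        ["pg_rewind",
         "--target-pgdata=" + old_primary_dir,
         "--source-server=" + source_conninfo],
    ]


def checkpoint_timeline(controldata_output):
    """Extract the latest checkpoint's timeline from pg_controldata output."""
    match = re.search(r"^Latest checkpoint's TimeLineID:\s*(\d+)\s*$",
                      controldata_output, re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for cmd in failover_commands("/data/blue", "/data/red",
                                 "host=blue port=5432 dbname=postgres"):
        print(" ".join(cmd))
```

Running `pg_controldata` on the promoted node and passing its output to `checkpoint_timeline` is one way to confirm that the control file already shows the new timeline before pg_rewind is started.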


Re: [GENERAL] requested timeline doesn't contain minimum recovery point

From
Tom DalPozzo
Date:


2017-01-06 13:09 GMT+01:00 Michael Paquier <michael.paquier@gmail.com>:
> Could you give more details? What does pg_rewind tell you at each
> phase? Is that on Postgres 9.5 or 9.6? I use pg_rewind quite
> extensively on 9.5 but I have no problems of this kind with multiple
> timeline jumps when juggling between two nodes. Another thing that
> comes to mind: you are using pg_rewind with a source node that is
> running. You should issue a checkpoint manually after promoting the
> node to be sure that its control file gets the new timeline number.
> --
> Michael
Hi,

sometimes pg_rewind says that nothing needs to be done; sometimes it says it is rewinding and, at the end, that it is done.
I'm using 9.6. I moved there from 9.5 because I'm also using replication slots, and in 9.6 a second parameter was added. I seem to remember that it behaved the same way in 9.5 too, but I'm not really sure.
I checked that the server, at promotion, logged the message about the new timeline.
I will run some more tests.
Regards
Pupillo

Re: [GENERAL] requested timeline doesn't contain minimum recovery point

From
Tom DalPozzo
Date:
Hi!
I redid the tests following your suggestion to issue a checkpoint manually. IT WORKS!
Just a question: when the standby server starts, I see error messages in the log (e.g. "invalid record length...") when the end of the WAL is reached. I know that this is normal.
But I'm wondering whether the system detects the end of the WAL only by checking the validity of the records in it.
I mean, could random bytes appear as a valid record (very unlikely, but possible)?
Thanks
Pupillo







Re: [GENERAL] requested timeline doesn't contain minimum recovery point

From
Michael Paquier
Date:
On Tue, Jan 10, 2017 at 10:35 PM, Tom DalPozzo <t.dalpozzo@gmail.com> wrote:
> I redid the tests following your suggestion to issue a checkpoint manually.
> IT WORKS!
> Just a question: when the standby server starts, I see the log error
> messages (ex.: "invalid record length...")  when WAL end is reached. I know
> that it's normal.
> But I'm wondering if the system, in order to detect the end of the WAL,
> controls only the validity of the records in the WAL.

You may want to look at xlogreader.c and track report_invalid_record()
to see which error checks are done. There are no full checks specific
to each record type, but there are some checks for the backup blocks,
the record size, etc.

> I mean, could random bytes appear as a valid record (very unlikely, but
> possible)?

Yes, that could be possible if some memory or disk is broken. That's
why, while it is important to take backups, it is more important to
make sure that they are able to restore correctly before deploying
them.
--
Michael
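
The kind of check discussed above can be illustrated with a toy model. This is a minimal sketch, not PostgreSQL's actual record format or the logic in xlogreader.c: the 8-byte header layout and the names `make_record`, `read_record` and `replay` are invented for illustration, and `zlib.crc32` stands in for the CRC-32C checksum that PostgreSQL uses. It shows replay stopping at the first position where the length or the checksum does not validate.

```python
import struct
import zlib

# Toy header: total record length, then a checksum of the payload.
HEADER = struct.Struct("<II")


def make_record(payload):
    """Serialize one toy record: header followed by payload."""
    return HEADER.pack(HEADER.size + len(payload), zlib.crc32(payload)) + payload


def read_record(buf, offset=0):
    """Return (payload, next_offset), or None if validation fails."""
    if len(buf) - offset < HEADER.size:
        return None
    total_len, crc = HEADER.unpack_from(buf, offset)
    # Length sanity check, comparable in spirit to "invalid record length".
    if total_len < HEADER.size or offset + total_len > len(buf):
        return None
    payload = bytes(buf[offset + HEADER.size:offset + total_len])
    # Checksum check: stale or random bytes are very unlikely to match.
    if zlib.crc32(payload) != crc:
        return None
    return payload, offset + total_len


def replay(buf):
    """Read records until the first one that fails validation."""
    records, offset = [], 0
    while True:
        result = read_record(buf, offset)
        if result is None:
            return records
        payload, offset = result
        records.append(payload)


if __name__ == "__main__":
    wal = make_record(b"insert 1") + make_record(b"insert 2") + bytes(16)
    print(replay(wal))
```

With a 32-bit checksum, uniformly random bytes that happen to pass the length check match the checksum with a probability of about 1 in 2^32, which is why a false positive is very unlikely but not strictly impossible.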


Re: [GENERAL] requested timeline doesn't contain minimum recovery point

From
Tom DalPozzo
Date:

> > I mean, could random bytes appear as a valid record (very unlikely, but
> > possible)?
>
> Yes, that could be possible if some memory or disk is broken. That's
> why, while it is important to take backups, it is more important to
> make sure that they are able to restore correctly before deploying
> them.
> --
> Michael
Hi,
of course, nothing can be 100% safe against memory or disk corruption. But, excluding these cases, can there be situations in which the WAL reader gets confused?
I'm thinking of WAL segment recycling: when a WAL segment is recycled, it is not overwritten with anything (zeroes...), right?
If I'm right, then there are still old records in the segment. If they are aligned with the new offsets, I guess the system can tell that they are older (by looking at some ID) and therefore not valid; but if they are not aligned, there could be an unlucky, if unlikely, issue.

In other words, excluding hardware problems and possible bugs, I'd like to know whether the logic underlying WAL reading at startup is 100% safe.

Regards
Pupillo
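
The "looking at some ID" idea from the question above can be modelled with a toy sketch. It assumes that each WAL page header stores the WAL address the page was written for (PostgreSQL's page header has such a field, `xlp_pageaddr`), so pages left over in a recycled segment carry an address that no longer matches. The 8-byte header layout and the names `write_page` and `valid_pages` are invented for illustration; this is not the real on-disk format and it does not settle whether the logic is "100% safe".

```python
import struct

PAGE_SIZE = 8192
# Toy page header: only the WAL address the page was written for.
PAGE_HEADER = struct.Struct("<Q")


def write_page(segment, page_no, segment_start, body=b""):
    """Write one page into the segment, stamping it with its WAL address."""
    addr = segment_start + page_no * PAGE_SIZE
    page = (PAGE_HEADER.pack(addr) + body).ljust(PAGE_SIZE, b"\0")
    segment[page_no * PAGE_SIZE:(page_no + 1) * PAGE_SIZE] = page


def valid_pages(segment, segment_start):
    """Count leading pages whose stored address matches the expected one."""
    count = 0
    for page_no in range(len(segment) // PAGE_SIZE):
        (addr,) = PAGE_HEADER.unpack_from(segment, page_no * PAGE_SIZE)
        if addr != segment_start + page_no * PAGE_SIZE:
            break  # stale page left over from the segment's previous life
        count += 1
    return count


if __name__ == "__main__":
    segment = bytearray(4 * PAGE_SIZE)
    for page_no in range(4):
        write_page(segment, page_no, 0x1000000)   # old contents
    # "Recycle" the segment for a later WAL position without zeroing it,
    # then write only the first two pages.
    write_page(segment, 0, 0x5000000, b"new data")
    write_page(segment, 1, 0x5000000)
    print(valid_pages(segment, 0x5000000))
```

In this model the reader stops at the first stale page because its stored address belongs to the old WAL position. Old data that sits unaligned inside a page would instead have to get past per-record checks such as the length and checksum checks mentioned earlier in the thread.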