Thread: .history': No such file or directory - a symptom of ?


.history': No such file or directory - a symptom of ?

From: lejeczek
Date:
Hi guys.

What is that - as per subject - a symptom of exactly?
I get that there was an issue, but with more details explained. Perhaps there are docs which explain that?

When it happens I take slave down and do _pg_basebackup_ off the master - but is there a more "civilized" way to "push" the slave back in sync, maybe without taking slave off-line?

many thanks, L.

Re: .history': No such file or directory - a symptom of ?

From: Stephen Frost
Date:
Greetings,

* lejeczek (peljasz@yahoo.co.uk) wrote:
> What is that - as per subject - a symptom of exactly?

Arguably, a poor restore command being used ...

PostgreSQL will request .history files when doing recovery and will keep
requesting them, by default, until it finds that one isn't there; it
will then target the last timeline that it found to perform replay to
(target timeline = latest).  We also look for .history files when
going through the promotion process and similarly will keep checking
for timeline files until we don't find one; that's the timeline
we will move to for the promotion, and we'll then immediately push a new
.history file into the archive to 'claim' that timeline.
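
The probing described above can be modelled in a few lines. This is only a simplified sketch of the behaviour, not PostgreSQL's actual code: the function names and the `archive_has` callback are made up for illustration, and the only fact relied on is that history files are named after the timeline ID as 8 hex digits plus ".history".

```python
from typing import Callable


def history_file_name(timeline: int) -> str:
    """Name of the history file for a timeline: 8 hex digits plus '.history'."""
    return f"{timeline:08X}.history"


def find_latest_timeline(current: int, archive_has: Callable[[str], bool]) -> int:
    """Probe current+1, current+2, ... until a history file is missing.

    The final, failed probe is expected; it is how the end of the chain is
    detected. `archive_has` is a hypothetical stand-in for restore_command.
    """
    latest = current
    while archive_has(history_file_name(latest + 1)):
        latest += 1
    return latest


# An archive with no history files at all: the only probe is for
# 00000002.history, it fails, and recovery stays on timeline 1.
print(find_latest_timeline(1, set().__contains__))  # 1

# An archive that has seen two promotions: probes 2 and 3 succeed, 4 fails.
archive = {"00000002.history", "00000003.history"}
print(find_latest_timeline(1, archive.__contains__))  # 3
```

In both cases the last request is for a file that does not exist, which is the request a plain 'cp' complains about.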

Basically, it's not an error; it's entirely intentional that it works
that way, and your restore command probably shouldn't be complaining
about it (that's where the actual 'No such file or directory'
bit is coming from, not from PG itself).
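
As a rough illustration of a restore command that does not complain, here is a minimal sketch. It assumes a plain directory archive like the one in this thread; `restore_file` and the paths in the demo are hypothetical, and this is not a replacement for a proper archiving tool. The one thing PostgreSQL requires is a nonzero exit status when the requested file is not in the archive, so the sketch reports a missing file through its return value alone and prints nothing.

```python
import shutil
import tempfile
from pathlib import Path


def restore_file(archive_dir: Path, file_name: str, destination: Path) -> int:
    """Copy file_name from the archive to destination.

    Returns 0 on success and 1 if the file is not archived. A missing file
    is a normal answer (for example when the next .history file is probed),
    so nothing is written to stderr in that case.
    """
    source = archive_dir / file_name
    if not source.is_file():
        return 1
    shutil.copyfile(source, destination)
    return 0


# Demo against a throwaway directory instead of a real archive.
with tempfile.TemporaryDirectory() as scratch:
    archive_dir = Path(scratch, "archive")
    archive_dir.mkdir()
    (archive_dir / "00000001000000000000004D").write_bytes(b"demo segment")

    # A file that is archived: copied, status 0.
    print(restore_file(archive_dir, "00000001000000000000004D", Path(scratch, "restored")))
    # A history file that does not exist: status 1 and no message.
    print(restore_file(archive_dir, "00000002.history", Path(scratch, "history")))
```

Wired up as a script, the return value would become the process exit status, with %f and %p from restore_command passed in as the file name and the destination.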

> I get that there was an issue, but with more details explained. Perhaps
> there are docs which explain that?

Much more likely that there wasn't actually any issue...

> When it happens I take slave down and do _pg_basebackup_ off the master -
> but is there a more "civilized" way to "push" the slave back in sync, maybe
> without taking slave off-line?

Not following this bit at all.  There being a message about PG not
finding a .history file during restore or promotion isn't actually an
indication of anything having gone wrong or that the replica is out of
sync.  In other words, I don't know that you needed to actually do
anything.  Is there some reason you think you did need to do something
besides that message being in the log..?

Thanks,

Stephen


Re: .history': No such file or directory - a symptom of ?

From: lejeczek
Date:

On 10/08/2023 17:41, Stephen Frost wrote:
> [snip]
Perhaps there is not an actual, real issue with synchronization, however
the logs make me - I'd imagine anybody who is a novice like me -
uncomfortable.
These logs, the errors never quiet down - I've been waiting a few days.

from master:
....
2023-08-11 10:12:18.908 CEST [776006] STATEMENT:  START_REPLICATION 0/4E000000 TIMELINE 1
2023-08-11 10:12:23.909 CEST [777443] ERROR:  requested WAL segment 00000001000000000000004E has already been removed
2023-08-11 10:12:23.909 CEST [777443] STATEMENT:  START_REPLICATION 0/4E000000 TIMELINE 1
2023-08-11 10:12:28.911 CEST [778491] ERROR:  requested WAL segment 00000001000000000000004E has already been removed
2023-08-11 10:12:28.911 CEST [778491] STATEMENT:  START_REPLICATION 0/4E000000 TIMELINE 1
...

from slave:
...
cp: cannot stat '/var/lib/pgsql/pg_archive/00000002.history': No such file or directory
2023-08-11 10:12:38.919 CEST [773947] LOG:  waiting for WAL to become available at 0/4E002000
cp: cannot stat '/var/lib/pgsql/pg_archive/00000001000000000000004E': No such file or directory
2023-08-11 10:12:43.916 CEST [1050527] LOG:  started streaming WAL from primary at 0/4E000000 on timeline 1
2023-08-11 10:12:43.916 CEST [1050527] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000004E has already been removed
cp: cannot stat '/var/lib/pgsql/pg_archive/00000002.history': No such file or directory
2023-08-11 10:12:43.918 CEST [773947] LOG:  waiting for WAL to become available at 0/4E002000
cp: cannot stat '/var/lib/pgsql/pg_archive/00000001000000000000004E': No such file or directory
2023-08-11 10:12:48.920 CEST [1051003] LOG:  started streaming WAL from primary at 0/4E000000 on timeline 1
2023-08-11 10:12:48.920 CEST [1051003] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000004E has already been removed
cp: cannot stat '/var/lib/pgsql/pg_archive/00000002.history': No such file or directory
2023-08-11 10:12:48.923 CEST [773947] LOG:  waiting for WAL to become available at 0/4E002000
...

So, seeing logs flooded that way.... I don't like it (even if I could
be sure everything is in sync), particularly since the master shows:

-> $ sudo -u postgres psql --port=5432 -x -c 'select client_addr,sync_state from pg_stat_replication;'
could not change directory to "/root": Permission denied
(0 rows)

so I do 'pg_basebackup' then.

many thanks, L.



Re: .history': No such file or directory - a symptom of ?

From: Ron
Date:
On 8/11/23 03:20, lejeczek wrote:
[snip]
>
> So, seeing logs flooded that way.... I don't like it (even if I could be 
> sure everything is in sync) particularly for master shows:
>
> -> $ sudo -u postgres psql --port=5432 -x -c 'select 
> client_addr,sync_state from pg_stat_replication;'
> could not change directory to "/root": Permission denied
> (0 rows)

Try this:

sudo su - postgres -c "psql ..."

-- 
Born in Arizona, moved to Babylonia.



Re: .history': No such file or directory - a symptom of ?

From: Stephen Frost
Date:
Greetings,

* lejeczek (peljasz@yahoo.co.uk) wrote:
> On 10/08/2023 17:41, Stephen Frost wrote:
> > [snip]

> Perhaps there is not an actual, real issue with synchronization, however the
> logs make me - I'd imagine anybody who is a novice like me - uncomfortable.

You *really* shouldn't be using simple 'cp' commands for your archive or
restore commands.

> These logs, the errors never quiet down - I've been waiting a few days.
>
> from master:
> ....
> 2023-08-11 10:12:18.908 CEST [776006] STATEMENT:  START_REPLICATION
> 0/4E000000 TIMELINE 1
> 2023-08-11 10:12:23.909 CEST [777443] ERROR:  requested WAL segment
> 00000001000000000000004E has already been removed
> 2023-08-11 10:12:23.909 CEST [777443] STATEMENT:  START_REPLICATION
> 0/4E000000 TIMELINE 1
> 2023-08-11 10:12:28.911 CEST [778491] ERROR:  requested WAL segment
> 00000001000000000000004E has already been removed
> 2023-08-11 10:12:28.911 CEST [778491] STATEMENT:  START_REPLICATION
> 0/4E000000 TIMELINE 1
> ...

This is saying that the replica is asking the primary for WAL segments
that the primary has already removed (normally after they have been
archived).  That's not a problem if you've got a functioning archive
repository where the replica can pull that WAL from.
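
To see that the START_REPLICATION position and the segment named in the error are the same thing, the file name can be derived from the timeline and the LSN. This is a sketch that assumes the default 16 MB WAL segment size (a cluster initialized with a different size would give different names); the helper names are made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(lsn: str) -> int:
    """Turn an LSN printed as 'high/low' (two hex numbers) into one integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segment_number = parse_lsn(lsn) // segment_size
    segments_per_id = 0x100000000 // segment_size
    return (
        f"{timeline:08X}"
        f"{segment_number // segments_per_id:08X}"
        f"{segment_number % segments_per_id:08X}"
    )


# The position the replica streams from, on timeline 1:
print(wal_file_name(1, "0/4E000000"))  # 00000001000000000000004E
# The position the replica is waiting for lives in the same segment:
print(wal_file_name(1, "0/4E002000"))  # 00000001000000000000004E
```

So the file the primary reports as removed is the same file the replica's restore command fails to find in the archive.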

> from slave:
> ...
> cp: cannot stat '/var/lib/pgsql/pg_archive/00000002.history': No such file
> or directory

As mentioned, this can happen without there being an issue.

> 2023-08-11 10:12:38.919 CEST [773947] LOG:  waiting for WAL to become
> available at 0/4E002000
> cp: cannot stat '/var/lib/pgsql/pg_archive/00000001000000000000004E': No
> such file or directory
> 2023-08-11 10:12:43.916 CEST [1050527] LOG:  started streaming WAL from
> primary at 0/4E000000 on timeline 1
> 2023-08-11 10:12:43.916 CEST [1050527] FATAL:  could not receive data from
> WAL stream: ERROR:  requested WAL segment 00000001000000000000004E has
> already been removed

This is a problem though: the primary doesn't have the WAL and neither
does the archive.  Without that WAL, the replica can't play forward and
therefore isn't able to ever catch up to where the primary is.  There's
clearly something going wrong if you're properly archiving the WAL on
your primary to some location and then the replica isn't able to fetch
that WAL.

> So, seeing logs flooded that way.... I don't like it (even if I could be
> sure everything is in sync) particularly for master shows:
>
> -> $ sudo -u postgres psql --port=5432 -x -c 'select client_addr,sync_state
> from pg_stat_replication;'
> could not change directory to "/root": Permission denied

This is just psql starting up and failing to change into the current
directory (/root), which the postgres user can't access, because you
ran sudo from there.  That's not actually an issue.

> so I do 'pg_basebackup' then.

Do you have an archive_command configured on your primary..?  I'd
strongly recommend that you set that up and, ideally, use a well-written
tool like pgbackrest to handle your backup, recovery, archiving, etc.
Without a WAL archive, you run the risk that the WAL which the
replica needs isn't available any more, or otherwise risk running the
primary out of disk space if the replica is offline for a long time.
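
To give an idea of what a plain 'cp' leaves out on the archiving side, here is a minimal sketch. It is purely illustrative (the function name and demo paths are hypothetical, and it is no substitute for a tool like pgbackrest) and assumes a local POSIX filesystem that supports hard links. It refuses to overwrite a file that is already archived, flushes the data to disk, and only makes the file visible under its final name once it is complete.

```python
import os
import shutil
import tempfile
from pathlib import Path


def archive_file(source: Path, archive_dir: Path, file_name: str) -> int:
    """Copy one WAL file into the archive; return 0 only if it is safely stored."""
    target = archive_dir / file_name
    if target.exists():
        return 1  # never overwrite something that is already archived
    fd, temp_name = tempfile.mkstemp(dir=archive_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as temp_file, open(source, "rb") as source_file:
            shutil.copyfileobj(source_file, temp_file)
            temp_file.flush()
            os.fsync(temp_file.fileno())  # data must be on disk before success
        os.link(temp_name, target)  # fails if the target appeared meanwhile
    except OSError:
        return 1
    finally:
        os.unlink(temp_name)  # the temporary name is no longer needed
    dir_fd = os.open(archive_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # persist the new directory entry as well
    finally:
        os.close(dir_fd)
    return 0


# Demo against a throwaway directory instead of a real archive.
with tempfile.TemporaryDirectory() as scratch:
    archive_dir = Path(scratch, "archive")
    archive_dir.mkdir()
    segment = Path(scratch, "00000001000000000000004E")
    segment.write_bytes(b"demo segment")
    print(archive_file(segment, archive_dir, segment.name))  # 0: archived
    print(archive_file(segment, archive_dir, segment.name))  # 1: already there
```

A zero status tells PostgreSQL the segment is safe and may be removed or recycled on the primary, which is why the sketch returns zero only after the data and the directory entry have been flushed.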

Thanks,

Stephen
