Обсуждение: [GENERAL] Recovery Assistance

Поиск
Список
Период
Сортировка

[GENERAL] Recovery Assistance

От
Brian Mills
Дата:
Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I haven't much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped working.
When the admin realised this he increased the disk size (its in a cloud, so that was easy to do) however I think the way it shutdown left Postgres in an inconsistent state for some reason.
Now when I start it up I get this message in the log over and over again:
"FATAL:  the database system is starting up"

I have a backup, which I have successfully recovered, but it is 24 hours old, the next backup was the cause of the disk filling. So I'm using this as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file level backup with the service not running then tried 2 things. I haven't found much else on what to do though. 

Attempt 1 - Reset Log

It sounded like this shouldn't be my first option (it wasn't) but it did sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb > $now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR:  unexpected chunk size 104 (expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
pg_dump: The command was: COPY public.bodycontent (bodycontentid, body, contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an inconsistent state.


Attempt 2 -  startup manually and let it try recovery

I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the output (I'd done it as a service startup a number of times to nearly the same effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D /etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT LOG:  database system was interrupted while in recovery at 2017-01-27 20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG:  database system was not properly shut down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG:  redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG:  redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

I've left it that way for 12 hours, and its still not allowing connections. 
I assume its doing some kind of consistency check?

Does anyone have any suggestions on what I should be doing to try and restore this database?

- The amount of change is minimal in the DB after 6pm it should be basically no change overnight. 
- The log above seems to suggest it has completed a redo ok, is that right?
- The last completed transaction time 2017-01-24 02:08:00.023064+11 the log mentions would be fine to use, so loosing even a few hours before that would be more than adequate.

I'm just not clear what the database is doing now, or how I should be trying to recover it. 

Any help anyone can suggest would be great! I've given myself this weekend to attempt to recover more than the last backup, after that I need to restore the service for my team to use and suck up the lost last day of updates. 

Thanks,
Brian

Re: [GENERAL] Recovery Assistance

От
Adrian Klaver
Дата:
On 01/27/2017 01:31 PM, Brian Mills wrote:
> Hi,
>
> I have a Atlassian Confluence Wiki that depends on postgres, but I
> haven't much experience with postgres other than for this purpose.
>
> A few days ago, the hard disk filled, so all services stopped working.
> When the admin realised this he increased the disk size (its in a cloud,
> so that was easy to do) however I think the way it shutdown left
> Postgres in an inconsistent state for some reason.
> Now when I start it up I get this message in the log over and over again:
> "FATAL:  the database system is starting up"
>
> I have a backup, which I have successfully recovered, but it is 24 hours
> old, the next backup was the cause of the disk filling. So I'm using
> this as exercise in learning a bit more about postgres.
>
> I did some research and found a number of options. I took a file level
> backup with the service not running then tried 2 things. I haven't found
> much else on what to do though.
>
> *Attempt 1 - Reset Log *
>
> It sounded like this shouldn't be my first option (it wasn't) but it did
> sound like what I needed to do.
> I ran this command
> ./pg_resetxlog /var/lib/postgresql/9.3/main -f
> It worked a treat, the database did startup ok.
> However when I tried to dump the DB:
> root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
> $now-confluencedb.sql
> pg_dump: Dumping the contents of table "bodycontent" failed:
> PQgetResult() failed.
> pg_dump: Error message from server: ERROR:  unexpected chunk size 104
> (expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
> pg_dump: The command was: COPY public.bodycontent (bodycontentid, body,
> contentid, bodytypeid) TO stdout;
> The dump failed, so I assume this did leave my database in an
> inconsistent state.
>
>
> *Attempt 2 -  startup manually and let it try recovery*
>
> I restored my file level backup and started again.
> This time I tried to startup manually on the command line to see the
> output (I'd done it as a service startup a number of times to nearly the
> same effect too)
>
> postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
> /etc/postgresql/9.3/main start
> server starting
> postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT
> LOG:  database system was interrupted while in recovery at 2017-01-27
> 20:13:26 AEDT
> 2017-01-27 20:36:08 AEDT HINT:  This probably means that some data is
> corrupted and you will have to use the last backup for recovery.
> 2017-01-27 20:36:08 AEDT LOG:  database system was not properly shut
> down; automatic recovery in progress
> 2017-01-27 20:36:08 AEDT LOG:  redo starts at 5/528B4558
> 2017-01-27 20:36:18 AEDT LOG:  redo done at 5/A3FFF9E8
> 2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
> time 2017-01-24 02:08:00.023064+11
> 2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
> 2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

What does ps ax | grep post show?

Is the cluster set up to log to a file, if so what does it show?

Does the system log show anything relevant?

>
> I've left it that way for 12 hours, and its still not allowing connections.
> I assume its doing some kind of consistency check?

What does it say when you attempt a connection?

>
> Does anyone have any suggestions on what I should be doing to try and
> restore this database?
>
> - The amount of change is minimal in the DB after 6pm it should be
> basically no change overnight.
> - The log above seems to suggest it has completed a redo ok, is that right?
> - The last completed transaction time 2017-01-24 02:08:00.023064+11 the
> log mentions would be fine to use, so loosing even a few hours before
> that would be more than adequate.
>
> I'm just not clear what the database is doing now, or how I should be
> trying to recover it.
>
> Any help anyone can suggest would be great! I've given myself this
> weekend to attempt to recover more than the last backup, after that I
> need to restore the service for my team to use and suck up the lost last
> day of updates.
>
> Thanks,
> Brian


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Recovery Assistance

От
Brian Mills
Дата:
First of all, Thank you for your time to assist me learning. I really appreciate it. 

root# ps ax | grep post
 1364 ?        Ss     0:00 /usr/lib/postfix/master
 5198 pts/3    S      0:00 su postgres
 5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D /etc/postgresql/9.3/main
 5222 ?        Ss     0:10 postgres: startup process   recovering 0000000100000005000000A3 
 7930 pts/4    S+     0:00 grep --color=auto post

Its a single machine postgres database server, so I'm assuming there is no cluster log. If there is one, where would I look for it? 
The only log in /var/log/postgres is postgresql-9.3-main.log
which shows (tail):

2017-01-27 20:27:01 AEDT LOG:  database system was shut down at 2017-01-27 20:26:29 AEDT
2017-01-27 20:27:01 AEDT LOG:  MultiXact member wraparound protections are now enabled
2017-01-27 20:27:01 AEDT LOG:  autovacuum launcher started
2017-01-27 20:27:01 AEDT LOG:  database system is ready to accept connections
2017-01-27 20:27:02 AEDT LOG:  incomplete startup packet
2017-01-27 20:28:54 AEDT ERROR:  unexpected chunk size 104 (expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
2017-01-27 20:28:54 AEDT STATEMENT:  COPY public.bodycontent (bodycontentid, body, contentid, bodytypeid) TO stdout;
2017-01-27 20:30:13 AEDT LOG:  received fast shutdown request
2017-01-27 20:30:13 AEDT LOG:  aborting any active transactions
2017-01-27 20:30:13 AEDT LOG:  autovacuum launcher shutting down
2017-01-27 20:30:13 AEDT LOG:  shutting down
2017-01-27 20:30:13 AEDT LOG:  database system is shut down

I ran the screen utility so I could leave the db started using the pg_ctl command. The later logs in that session have not progressed, its last entry is still
2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up 
Which is still later datetime than the /var/log/postgres... log. 

Connection attempt shows the same log
postgres@atlassian:/home/myuser$ psql
psql: FATAL:  the database system is starting up

Nothing in the syslog that seems connected. 

Brian Mills
CTO


Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

  

On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 01/27/2017 01:31 PM, Brian Mills wrote:
Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I
haven't much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped working.
When the admin realised this he increased the disk size (its in a cloud,
so that was easy to do) however I think the way it shutdown left
Postgres in an inconsistent state for some reason.
Now when I start it up I get this message in the log over and over again:
"FATAL:  the database system is starting up"

I have a backup, which I have successfully recovered, but it is 24 hours
old, the next backup was the cause of the disk filling. So I'm using
this as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file level
backup with the service not running then tried 2 things. I haven't found
much else on what to do though.

*Attempt 1 - Reset Log *

It sounded like this shouldn't be my first option (it wasn't) but it did
sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
$now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed:
PQgetResult() failed.
pg_dump: Error message from server: ERROR:  unexpected chunk size 104
(expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
pg_dump: The command was: COPY public.bodycontent (bodycontentid, body,
contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an
inconsistent state.


*Attempt 2 -  startup manually and let it try recovery*

I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the
output (I'd done it as a service startup a number of times to nearly the
same effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT
LOG:  database system was interrupted while in recovery at 2017-01-27
20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT:  This probably means that some data is
corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG:  database system was not properly shut
down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG:  redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG:  redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

What does ps ax | grep post show?

Is the cluster set up to log to a file, if so what does it show?

Does the system log show anything relevant?


I've left it that way for 12 hours, and its still not allowing connections.
I assume its doing some kind of consistency check?

What does it say when you attempt a connection?



Does anyone have any suggestions on what I should be doing to try and
restore this database?

- The amount of change is minimal in the DB after 6pm it should be
basically no change overnight.
- The log above seems to suggest it has completed a redo ok, is that right?
- The last completed transaction time 2017-01-24 02:08:00.023064+11 the
log mentions would be fine to use, so loosing even a few hours before
that would be more than adequate.

I'm just not clear what the database is doing now, or how I should be
trying to recover it.

Any help anyone can suggest would be great! I've given myself this
weekend to attempt to recover more than the last backup, after that I
need to restore the service for my team to use and suck up the lost last
day of updates.

Thanks,
Brian


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: [GENERAL] Recovery Assistance

От
Adrian Klaver
Дата:
On 01/27/2017 05:40 PM, Brian Mills wrote:
> First of all, Thank you for your time to assist me learning. I really
> appreciate it.
>
> root# ps ax | grep post
>  1364 ?        Ss     0:00 /usr/lib/postfix/master
>  5198 pts/3    S      0:00 su postgres
>  5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
> /etc/postgresql/9.3/main
>  5222 ?        Ss     0:10 postgres: startup process   recovering
> 0000000100000005000000A3
>  7930 pts/4    S+     0:00 grep --color=auto post

So if you check back does the recovering XXXXXXXX part change?

If so Postgres is walking through the WAL files as it should.

>
> Its a single machine postgres database server, so I'm assuming there is
> no cluster log. If there is one, where would I look for it?
> The only log in /var/log/postgres is postgresql-9.3-main.log

That would be it. A single Postgres instance has multiple databases in
it, by default it starts with template0, template1 and postgres
databases. Then add whatever databases you created and you have a
cluster of databases.

> which shows (tail):
>
> 2017-01-27 20:27:01 AEDT LOG:  database system was shut down at
> 2017-01-27 20:26:29 AEDT
> 2017-01-27 20:27:01 AEDT LOG:  MultiXact member wraparound protections
> are now enabled
> 2017-01-27 20:27:01 AEDT LOG:  autovacuum launcher started
> 2017-01-27 20:27:01 AEDT LOG:  database system is ready to accept
> connections
> 2017-01-27 20:27:02 AEDT LOG:  incomplete startup packet
> 2017-01-27 20:28:54 AEDT ERROR:  unexpected chunk size 104 (expected
> 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
> 2017-01-27 20:28:54 AEDT STATEMENT:  COPY public.bodycontent
> (bodycontentid, body, contentid, bodytypeid) TO stdout;
> 2017-01-27 20:30:13 AEDT LOG:  received fast shutdown request
> 2017-01-27 20:30:13 AEDT LOG:  aborting any active transactions
> 2017-01-27 20:30:13 AEDT LOG:  autovacuum launcher shutting down
> 2017-01-27 20:30:13 AEDT LOG:  shutting down
> 2017-01-27 20:30:13 AEDT LOG:  database system is shut down

That would be your Attempt 1 log.
>
> I ran the screen utility so I could leave the db started using
> the pg_ctl command. The later logs in that session have not progressed,
> its last entry is still
> 2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
> Which is still later datetime than the /var/log/postgres... log.

So it is just logging to stdout and not to the log file.

>
> Connection attempt shows the same log
> postgres@atlassian:/home/myuser$ psql
> psql: FATAL:  the database system is starting up
>
> Nothing in the syslog that seems connected.
>
> *Brian Mills*
> CTO
>
>
> *Mob: *0410660003 <tel:0410660003>
> *Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
> <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
> <tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>
> Level 1 *|*  600 Chapel Street *|* South
> Yarra*|*  VIC *|*  3141 *|*  Australia
>
> <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking>
<https://www.linkedin.com/company/trybooking-com>
>
> On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     On 01/27/2017 01:31 PM, Brian Mills wrote:
>
>         Hi,
>
>         I have a Atlassian Confluence Wiki that depends on postgres, but I
>         haven't much experience with postgres other than for this purpose.
>
>         A few days ago, the hard disk filled, so all services stopped
>         working.
>         When the admin realised this he increased the disk size (its in
>         a cloud,
>         so that was easy to do) however I think the way it shutdown left
>         Postgres in an inconsistent state for some reason.
>         Now when I start it up I get this message in the log over and
>         over again:
>         "FATAL:  the database system is starting up"
>
>         I have a backup, which I have successfully recovered, but it is
>         24 hours
>         old, the next backup was the cause of the disk filling. So I'm using
>         this as exercise in learning a bit more about postgres.
>
>         I did some research and found a number of options. I took a file
>         level
>         backup with the service not running then tried 2 things. I
>         haven't found
>         much else on what to do though.
>
>         *Attempt 1 - Reset Log *
>
>         It sounded like this shouldn't be my first option (it wasn't)
>         but it did
>         sound like what I needed to do.
>         I ran this command
>         ./pg_resetxlog /var/lib/postgresql/9.3/main -f
>         It worked a treat, the database did startup ok.
>         However when I tried to dump the DB:
>         root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
>         $now-confluencedb.sql
>         pg_dump: Dumping the contents of table "bodycontent" failed:
>         PQgetResult() failed.
>         pg_dump: Error message from server: ERROR:  unexpected chunk
>         size 104
>         (expected 1996) in chunk 3 of 22 for toast value 48862 in
>         pg_toast_20028
>         pg_dump: The command was: COPY public.bodycontent
>         (bodycontentid, body,
>         contentid, bodytypeid) TO stdout;
>         The dump failed, so I assume this did leave my database in an
>         inconsistent state.
>
>
>         *Attempt 2 -  startup manually and let it try recovery*
>
>         I restored my file level backup and started again.
>         This time I tried to startup manually on the command line to see the
>         output (I'd done it as a service startup a number of times to
>         nearly the
>         same effect too)
>
>         postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
>         /etc/postgresql/9.3/main start
>         server starting
>         postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27
>         20:36:08 AEDT
>         LOG:  database system was interrupted while in recovery at
>         2017-01-27
>         20:13:26 AEDT
>         2017-01-27 20:36:08 AEDT HINT:  This probably means that some
>         data is
>         corrupted and you will have to use the last backup for recovery.
>         2017-01-27 20:36:08 AEDT LOG:  database system was not properly shut
>         down; automatic recovery in progress
>         2017-01-27 20:36:08 AEDT LOG:  redo starts at 5/528B4558
>         2017-01-27 20:36:18 AEDT LOG:  redo done at 5/A3FFF9E8
>         2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
>         time 2017-01-24 02:08:00.023064+11
>         2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
>         2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
>
>
>     What does ps ax | grep post show?
>
>     Is the cluster set up to log to a file, if so what does it show?
>
>     Does the system log show anything relevant?
>
>
>         I've left it that way for 12 hours, and its still not allowing
>         connections.
>         I assume its doing some kind of consistency check?
>
>
>     What does it say when you attempt a connection?
>
>
>
>         Does anyone have any suggestions on what I should be doing to
>         try and
>         restore this database?
>
>         - The amount of change is minimal in the DB after 6pm it should be
>         basically no change overnight.
>         - The log above seems to suggest it has completed a redo ok, is
>         that right?
>         - The last completed transaction time 2017-01-24
>         02:08:00.023064+11 the
>         log mentions would be fine to use, so loosing even a few hours
>         before
>         that would be more than adequate.
>
>         I'm just not clear what the database is doing now, or how I
>         should be
>         trying to recover it.
>
>         Any help anyone can suggest would be great! I've given myself this
>         weekend to attempt to recover more than the last backup, after
>         that I
>         need to restore the service for my team to use and suck up the
>         lost last
>         day of updates.
>
>         Thanks,
>         Brian
>
>
>
>     --
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Recovery Assistance

От
Brian Mills
Дата:
Hi,

No, it hasn't changed since the first time I looked at it. 

root@atlassian:/home/tbadmin# ps ax | grep post
 1364 ?        Ss     0:00 /usr/lib/postfix/master
 5198 pts/3    S      0:00 su postgres
 5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D /etc/postgresql/9.3/main
 5222 ?        Ss     0:10 postgres: startup process   recovering 0000000100000005000000A3
11161 pts/4    S+     0:00 grep --color=auto post

The /var/log/postgres logs haven't got anything new in them. 
Also the screen session hasn't got anything new in them, except it adds a line every time I attempt to connect to it using psql. The last attempt below was my connection attempt for this email, a bit over 8 hours after my last connection attempt. 

Screen session: 
2017-01-28 12:38:35 AEDT FATAL:  the database system is starting up
2017-01-28 23:00:01 AEDT FATAL:  the database system is starting up
2017-01-28 23:00:01 AEDT FATAL:  the database system is starting up
2017-01-29 07:14:00 AEDT FATAL:  the database system is starting up

I also still can't connect. 
postgres@atlassian:/home/tbadmin$ psql
psql: FATAL:  the database system is starting up


Brian Mills
CTO


Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

  

On 28 January 2017 at 16:08, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 01/27/2017 05:40 PM, Brian Mills wrote:
First of all, Thank you for your time to assist me learning. I really
appreciate it.

root# ps ax | grep post
 1364 ?        Ss     0:00 /usr/lib/postfix/master
 5198 pts/3    S      0:00 su postgres
 5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
 5222 ?        Ss     0:10 postgres: startup process   recovering
0000000100000005000000A3
 7930 pts/4    S+     0:00 grep --color=auto post

So if you check back does the recovering XXXXXXXX part change?

If so Postgres is walking through the WAL files as it should.


Its a single machine postgres database server, so I'm assuming there is
no cluster log. If there is one, where would I look for it?
The only log in /var/log/postgres is postgresql-9.3-main.log

That would be it. A single Postgres instance has multiple databases in it, by default it starts with template0, template1 and postgres databases. Then add whatever databases you created and you have a cluster of databases.

which shows (tail):

2017-01-27 20:27:01 AEDT LOG:  database system was shut down at
2017-01-27 20:26:29 AEDT
2017-01-27 20:27:01 AEDT LOG:  MultiXact member wraparound protections
are now enabled
2017-01-27 20:27:01 AEDT LOG:  autovacuum launcher started
2017-01-27 20:27:01 AEDT LOG:  database system is ready to accept
connections
2017-01-27 20:27:02 AEDT LOG:  incomplete startup packet
2017-01-27 20:28:54 AEDT ERROR:  unexpected chunk size 104 (expected
1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
2017-01-27 20:28:54 AEDT STATEMENT:  COPY public.bodycontent
(bodycontentid, body, contentid, bodytypeid) TO stdout;
2017-01-27 20:30:13 AEDT LOG:  received fast shutdown request
2017-01-27 20:30:13 AEDT LOG:  aborting any active transactions
2017-01-27 20:30:13 AEDT LOG:  autovacuum launcher shutting down
2017-01-27 20:30:13 AEDT LOG:  shutting down
2017-01-27 20:30:13 AEDT LOG:  database system is shut down

That would be your Attempt 1 log.

I ran the screen utility so I could leave the db started using
the pg_ctl command. The later logs in that session have not progressed,
its last entry is still
2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
Which is still later datetime than the /var/log/postgres... log.

So it is just logging to stdout and not to the log file.


Connection attempt shows the same log
postgres@atlassian:/home/myuser$ psql
psql: FATAL:  the database system is starting up

Nothing in the syslog that seems connected.

*Brian Mills*
CTO


*Mob: *0410660003 <tel:0410660003>
*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
<tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>
Level 1 *|*  600 Chapel Street *|* South
Yarra*|*  VIC *|*  3141 *|*  Australia

<https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking> <https://www.linkedin.com/company/trybooking-com>

On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

    On 01/27/2017 01:31 PM, Brian Mills wrote:

        Hi,

        I have a Atlassian Confluence Wiki that depends on postgres, but I
        haven't much experience with postgres other than for this purpose.

        A few days ago, the hard disk filled, so all services stopped
        working.
        When the admin realised this he increased the disk size (its in
        a cloud,
        so that was easy to do) however I think the way it shutdown left
        Postgres in an inconsistent state for some reason.
        Now when I start it up I get this message in the log over and
        over again:
        "FATAL:  the database system is starting up"

        I have a backup, which I have successfully recovered, but it is
        24 hours
        old, the next backup was the cause of the disk filling. So I'm using
        this as exercise in learning a bit more about postgres.

        I did some research and found a number of options. I took a file
        level
        backup with the service not running then tried 2 things. I
        haven't found
        much else on what to do though.

        *Attempt 1 - Reset Log *

        It sounded like this shouldn't be my first option (it wasn't)
        but it did
        sound like what I needed to do.
        I ran this command
        ./pg_resetxlog /var/lib/postgresql/9.3/main -f
        It worked a treat, the database did startup ok.
        However when I tried to dump the DB:
        root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
        $now-confluencedb.sql
        pg_dump: Dumping the contents of table "bodycontent" failed:
        PQgetResult() failed.
        pg_dump: Error message from server: ERROR:  unexpected chunk
        size 104
        (expected 1996) in chunk 3 of 22 for toast value 48862 in
        pg_toast_20028
        pg_dump: The command was: COPY public.bodycontent
        (bodycontentid, body,
        contentid, bodytypeid) TO stdout;
        The dump failed, so I assume this did leave my database in an
        inconsistent state.


        *Attempt 2 -  startup manually and let it try recovery*

        I restored my file level backup and started again.
        This time I tried to startup manually on the command line to see the
        output (I'd done it as a service startup a number of times to
        nearly the
        same effect too)

        postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
        /etc/postgresql/9.3/main start
        server starting
        postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27
        20:36:08 AEDT
        LOG:  database system was interrupted while in recovery at
        2017-01-27
        20:13:26 AEDT
        2017-01-27 20:36:08 AEDT HINT:  This probably means that some
        data is
        corrupted and you will have to use the last backup for recovery.
        2017-01-27 20:36:08 AEDT LOG:  database system was not properly shut
        down; automatic recovery in progress
        2017-01-27 20:36:08 AEDT LOG:  redo starts at 5/528B4558
        2017-01-27 20:36:18 AEDT LOG:  redo done at 5/A3FFF9E8
        2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
        time 2017-01-24 02:08:00.023064+11
        2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up
        2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up


    What does ps ax | grep post show?

    Is the cluster set up to log to a file, if so what does it show?

    Does the system log show anything relevant?


        I've left it that way for 12 hours, and its still not allowing
        connections.
        I assume its doing some kind of consistency check?


    What does it say when you attempt a connection?



        Does anyone have any suggestions on what I should be doing to
        try and
        restore this database?

        - The amount of change is minimal in the DB after 6pm it should be
        basically no change overnight.
        - The log above seems to suggest it has completed a redo ok, is
        that right?
        - The last completed transaction time 2017-01-24
        02:08:00.023064+11 the
        log mentions would be fine to use, so loosing even a few hours
        before
        that would be more than adequate.

        I'm just not clear what the database is doing now, or how I
        should be
        trying to recover it.

        Any help anyone can suggest would be great! I've given myself this
        weekend to attempt to recover more than the last backup, after
        that I
        need to restore the service for my team to use and suck up the
        lost last
        day of updates.

        Thanks,
        Brian



    --
    Adrian Klaver
    adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>




--
Adrian Klaver
adrian.klaver@aklaver.com

Re: [GENERAL] Recovery Assistance

От
rob stone
Дата:
Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:
> Hi,
>
> No, it hasn't changed since the first time I looked at it. 
>
> root@atlassian:/home/tbadmin# ps ax | grep post
>  1364 ?        Ss     0:00 /usr/lib/postfix/master
>  5198 pts/3    S      0:00 su postgres
>  5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
> /etc/postgresql/9.3/main
>  5222 ?        Ss     0:10 postgres: startup process   recovering
> 0000000100000005000000A3
> 11161 pts/4    S+     0:00 grep --color=auto post
>


Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob


Re: [GENERAL] Recovery Assistance

От
Brian Mills
Дата:
Yes, its the last one in the directory, pg_xlog directory 

...more files...
-rw-------  1 postgres postgres 16777216 Jan 21 10:05 0000000100000005000000A1
-rw-------  1 postgres postgres 16777216 Jan 22 21:29 0000000100000005000000A2
-rw-------  1 postgres postgres 16777216 Jan 24 02:08 0000000100000005000000A3


Brian Mills
CTO


Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

  

On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com> wrote:
Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:
> Hi,
>
> No, it hasn't changed since the first time I looked at it. 
>
> root@atlassian:/home/tbadmin# ps ax | grep post
>  1364 ?        Ss     0:00 /usr/lib/postfix/master
>  5198 pts/3    S      0:00 su postgres
>  5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
> /etc/postgresql/9.3/main
>  5222 ?        Ss     0:10 postgres: startup process   recovering
> 0000000100000005000000A3
> 11161 pts/4    S+     0:00 grep --color=auto post
>


Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob

Re: [GENERAL] Recovery Assistance

От
Adrian Klaver
Дата:
On 01/28/2017 01:55 PM, Brian Mills wrote:
> Yes, its the last one in the directory, pg_xlog directory
>
> ...more files...
> -rw-------  1 postgres postgres 16777216 Jan 21 10:05
> 0000000100000005000000A1
> -rw-------  1 postgres postgres 16777216 Jan 22 21:29
> 0000000100000005000000A2
> -rw-------  1 postgres postgres 16777216 Jan 24 02:08
> 0000000100000005000000A3

Best guess is the last WAL is not complete.

 From your original post:
"Attempt 2 -  startup manually and let it try recovery

I restored my file level backup and started again. "

How was the file level backup done?

>
>
> *Brian Mills*
> CTO
>
>
> *Mob: *0410660003 <tel:0410660003>
> *Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
> <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
> <tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>
> Level 1 *|*  600 Chapel Street *|* South
> Yarra*|*  VIC *|*  3141 *|*  Australia
>
> <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking>
<https://www.linkedin.com/company/trybooking-com>
>
> On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com
> <mailto:floriparob@gmail.com>> wrote:
>
>     Hello Brian,
>     On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:
>     > Hi,
>     >
>     > No, it hasn't changed since the first time I looked at it.
>     >
>     > root@atlassian:/home/tbadmin# ps ax | grep post
>     >  1364 ?        Ss     0:00 /usr/lib/postfix/master
>     >  5198 pts/3    S      0:00 su postgres
>     >  5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
>     > /etc/postgresql/9.3/main
>     >  5222 ?        Ss     0:10 postgres: startup process   recovering
>     > 0000000100000005000000A3
>     > 11161 pts/4    S+     0:00 grep --color=auto post
>     >
>
>
>     Does this WAL file exist "0000000100000005000000A3"?
>
>     Cheers,
>     Rob
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Recovery Assistance

От
Brian Mills
Дата:
I have a consistent sql dump from 24 hour previous. 

The file level backup was done with rsync -a of full data directory after the issue occurred so could reset as I learned. 

Brian


On Sun, 29 Jan 2017 at 9:18 am, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 01/28/2017 01:55 PM, Brian Mills wrote:
> Yes, its the last one in the directory, pg_xlog directory
>
> ...more files...
> -rw-------  1 postgres postgres 16777216 Jan 21 10:05
> 0000000100000005000000A1
> -rw-------  1 postgres postgres 16777216 Jan 22 21:29
> 0000000100000005000000A2
> -rw-------  1 postgres postgres 16777216 Jan 24 02:08
> 0000000100000005000000A3

Best guess is the last WAL is not complete.

 From your original post:
"Attempt 2 -  startup manually and let it try recovery

I restored my file level backup and started again. "

How was the file level backup done?

>
>
> *Brian Mills*
> CTO
>
>
> *Mob: *0410660003 <tel:0410660003>
> *Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
> <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
> <tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>
> Level 1 *|*  600 Chapel Street *|* South
> Yarra*|*  VIC *|*  3141 *|*  Australia
>
> <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking> <https://www.linkedin.com/company/trybooking-com>
>
> On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com
> <mailto:floriparob@gmail.com>> wrote:
>
>     Hello Brian,
>     On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:
>     > Hi,
>     >
>     > No, it hasn't changed since the first time I looked at it.
>     >
>     > root@atlassian:/home/tbadmin# ps ax | grep post
>     >  1364 ?        Ss     0:00 /usr/lib/postfix/master
>     >  5198 pts/3    S      0:00 su postgres
>     >  5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
>     > /etc/postgresql/9.3/main
>     >  5222 ?        Ss     0:10 postgres: startup process   recovering
>     > 0000000100000005000000A3
>     > 11161 pts/4    S+     0:00 grep --color=auto post
>     >
>
>
>     Does this WAL file exist "0000000100000005000000A3"?
>
>     Cheers,
>     Rob
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com
--
Brian Mills
CTO


Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

  

Re: [GENERAL] Recovery Assistance

От
Brian Mills
Дата:
I presume this is a binary log file for the database. 

Am I able to recover to a point in time using this log file?

What I would do in SQL Server would be recover to a point in time, say a bit before the last completed transaction time the log mentions, then take a backup. Is that possible in postgres?

The log mentions this: 
2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log time 2017-01-24 02:08:00.023064+11

(which is moments before, or possibly as the disk filled up doing a db backup dump)

Brian Mills
CTO


Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

  

On 29 January 2017 at 12:58, Brian Mills <brian@trybooking.com> wrote:
I have a consistent sql dump from 24 hour previous. 

The file level backup was done with rsync -a of full data directory after the issue occurred so could reset as I learned. 

Brian


On Sun, 29 Jan 2017 at 9:18 am, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 01/28/2017 01:55 PM, Brian Mills wrote:
> Yes, its the last one in the directory, pg_xlog directory
>
> ...more files...
> -rw-------  1 postgres postgres 16777216 Jan 21 10:05
> 0000000100000005000000A1
> -rw-------  1 postgres postgres 16777216 Jan 22 21:29
> 0000000100000005000000A2
> -rw-------  1 postgres postgres 16777216 Jan 24 02:08
> 0000000100000005000000A3

Best guess is the last WAL is not complete.

 From your original post:
"Attempt 2 -  startup manually and let it try recovery

I restored my file level backup and started again. "

How was the file level backup done?

>
>
> *Brian Mills*
> CTO
>
>
> *Mob: *0410660003 <tel:0410660003>
> *Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
> <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
> <tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>
> Level 1 *|*  600 Chapel Street *|* South
> Yarra*|*  VIC *|*  3141 *|*  Australia
>
> <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking> <https://www.linkedin.com/company/trybooking-com>
>
> On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com
> <mailto:floriparob@gmail.com>> wrote:
>
>     Hello Brian,
>     On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:
>     > Hi,
>     >
>     > No, it hasn't changed since the first time I looked at it.
>     >
>     > root@atlassian:/home/tbadmin# ps ax | grep post
>     >  1364 ?        Ss     0:00 /usr/lib/postfix/master
>     >  5198 pts/3    S      0:00 su postgres
>     >  5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
>     > /etc/postgresql/9.3/main
>     >  5222 ?        Ss     0:10 postgres: startup process   recovering
>     > 0000000100000005000000A3
>     > 11161 pts/4    S+     0:00 grep --color=auto post
>     >
>
>
>     Does this WAL file exist "0000000100000005000000A3"?
>
>     Cheers,
>     Rob
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com
--
Brian Mills
CTO


Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

  

Re: [GENERAL] Recovery Assistance

От
Adrian Klaver
Дата:
On 01/28/2017 11:23 PM, Brian Mills wrote:
> I presume this is a binary log file for the database.
>
> Am I able to recover to a point in time using this log file?
>
> What I would do in SQL Server would be recover to a point in time, say a
> bit before the last completed transaction time the log mentions, then
> take a backup. Is that possible in postgres?

Yes, though I am not sure you have the setup to do it. I would suggest
reading the below to see how much of it applies:

https://www.postgresql.org/docs/9.3/static/continuous-archiving.html

In particular:
24.3.4. Recovering Using a Continuous Archive Backup

https://www.postgresql.org/docs/9.3/static/recovery-target-settings.html

>
> The log mentions this:
> 2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
> time 2017-01-24 02:08:00.023064+11
>
> (which is moments before, or possibly as the disk filled up doing a db
> backup dump)
>
> *Brian Mills*
> CTO
>
>
> *Mob: *0410660003 <tel:0410660003>
> *Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
> <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
> <tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>
> Level 1 *|*  600 Chapel Street *|* South
> Yarra*|*  VIC *|*  3141 *|*  Australia
>
> <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking>
<https://www.linkedin.com/company/trybooking-com>
>
> On 29 January 2017 at 12:58, Brian Mills <brian@trybooking.com
> <mailto:brian@trybooking.com>> wrote:
>
>     I have a consistent sql dump from 24 hour previous.
>
>     The file level backup was done with rsync -a of full data directory
>     after the issue occurred so could reset as I learned.
>
>     Brian
>
>
>     On Sun, 29 Jan 2017 at 9:18 am, Adrian Klaver
>     <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>
>         On 01/28/2017 01:55 PM, Brian Mills wrote:
>         > Yes, its the last one in the directory, pg_xlog directory
>         >
>         > ...more files...
>         > -rw-------  1 postgres postgres 16777216 Jan 21 10:05
>         > 0000000100000005000000A1
>         > -rw-------  1 postgres postgres 16777216 Jan 22 21:29
>         > 0000000100000005000000A2
>         > -rw-------  1 postgres postgres 16777216 Jan 24 02:08
>         > 0000000100000005000000A3
>
>         Best guess is the last WAL is not complete.
>
>          From your original post:
>         "Attempt 2 -  startup manually and let it try recovery
>
>         I restored my file level backup and started again. "
>
>         How was the file level backup done?
>
>         >
>         >
>         > *Brian Mills*
>         > CTO
>         >
>         >
>         > *Mob: *0410660003 <tel:0410%20660%20003> <tel:0410660003
>         <tel:0410%20660%20003>>
>         > *Melbourne* 03 9012 3460 <tel:(03)%209012%203460>
>         <tel:03%209012%203460> or 03 8376 6327 <tel:(03)%208376%206327>
>         > <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
>         <tel:(02)%208064%203600>
>         > <tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570
>         <tel:(07)%203173%201570> <tel:07%203173%201570>
>         > Level 1 *|*  600 Chapel Street *|* South
>         > Yarra*|*  VIC *|*  3141 *|*  Australia
>         >
>         > <https://www.facebook.com/TryBooking/
>         <https://www.facebook.com/TryBooking/>>
>         <https://twitter.com/trybooking
>         <https://twitter.com/trybooking>>
>         <https://www.linkedin.com/company/trybooking-com
>         <https://www.linkedin.com/company/trybooking-com>>
>         >
>         > On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com
>         <mailto:floriparob@gmail.com>
>         > <mailto:floriparob@gmail.com <mailto:floriparob@gmail.com>>>
>         wrote:
>         >
>         >     Hello Brian,
>         >     On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:
>         >     > Hi,
>         >     >
>         >     > No, it hasn't changed since the first time I looked at it.
>         >     >
>         >     > root@atlassian:/home/tbadmin# ps ax | grep post
>         >     >  1364 ?        Ss     0:00 /usr/lib/postfix/master
>         >     >  5198 pts/3    S      0:00 su postgres
>         >     >  5221 pts/3    S      0:00
>         /usr/lib/postgresql/9.3/bin/postgres -D
>         >     > /etc/postgresql/9.3/main
>         >     >  5222 ?        Ss     0:10 postgres: startup process
>          recovering
>         >     > 0000000100000005000000A3
>         >     > 11161 pts/4    S+     0:00 grep --color=auto post
>         >     >
>         >
>         >
>         >     Does this WAL file exist "0000000100000005000000A3"?
>         >
>         >     Cheers,
>         >     Rob
>         >
>         >
>
>
>         --
>         Adrian Klaver
>         adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>
>     --
>     *Brian Mills*
>     CTO
>
>
>     *Mob: *0410660003
>     *Melbourne* 03 9012 3460 or 03 8376 6327 *|* * **Sydney* 02 8064
>     3600 *|*  *Brisbane* 07 3173 1570
>     Level 1 *|*  600 Chapel Street *|* South
>     Yarra*|*  VIC *|*  3141 *|*  Australia
>
>     <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking>
<https://www.linkedin.com/company/trybooking-com>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Recovery Assistance

От
Adrian Klaver
Дата:
On 01/28/2017 11:23 PM, Brian Mills wrote:
> I presume this is a binary log file for the database.
>
> Am I able to recover to a point in time using this log file?
>
> What I would do in SQL Server would be recover to a point in time, say a
> bit before the last completed transaction time the log mentions, then
> take a backup. Is that possible in postgres?

Had another thought. If I remember correctly you are using this as an
exercise in Postgres recovery. If that is indeed the case you might try:

1) Stop the Postgres instance you have running now.

2) Move the WAL file that Postgres is currently stalled on,
0000000100000005000000A3, out of pg_xlog.

3) Restart the Postgres instance.

My guess it it will not bring it back to the exact point you want, but
close. You can get a general idea by running(before and after removing
the WAL), as the postgres user:

pg_controldata -D /etc/postgresql/9.3/main

>
> The log mentions this:
> 2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
> time 2017-01-24 02:08:00.023064+11
>
> (which is moments before, or possibly as the disk filled up doing a db
> backup dump)
>
> *Brian Mills*
> CTO
>



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Recovery Assistance

От
Brian Mills
Дата:
OK. I think I'm on to something here, I first reset back to my file level backup. 
I created a recovery.conf file in the root of the data directory like this: 
---------------------
restore_command = 'cp /mnt/archive/%f %p'
recovery_target_time = '2017-01-24 02:08:00.023064+11'
recovery_target_inclusive = 'true'
pause_at_recovery_target = 'false'
---------------------
Note, the archive directory had no files in it, I left the WAL files in the pg_xlog directory. 

Then I started up the database again: 
postgres@atlassian:~/9.3/main$ /usr/lib/postgresql/9.3/bin/pg_ctl -D /etc/postgresql/9.3/main start
server starting
postgres@atlassian:~/9.3/main$ 2017-01-30 10:07:28 AEDT LOG:  database system was interrupted while in recovery at 2017-01-27 20:13:26 AEDT
2017-01-30 10:07:28 AEDT HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2017-01-30 10:07:28 AEDT LOG:  starting point-in-time recovery to 2017-01-24 02:08:00.023064+11
2017-01-30 10:07:28 AEDT LOG:  database system was not properly shut down; automatic recovery in progress
2017-01-30 10:07:28 AEDT WARNING:  WAL was generated with wal_level=minimal, data may be missing
2017-01-30 10:07:28 AEDT HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2017-01-30 10:07:28 AEDT LOG:  redo starts at 5/528B4558
2017-01-30 10:07:40 AEDT LOG:  consistent recovery state reached at 5/A3FFFA30
cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or directory
cp: cannot stat ‘/mnt/archive/0000000100000005000000A4’: No such file or directory
2017-01-30 10:07:40 AEDT LOG:  redo done at 5/A3FFF9E8
2017-01-30 10:07:40 AEDT LOG:  last completed transaction was at log time 2017-01-24 02:08:00.023064+11
cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or directory
cp: cannot stat ‘/mnt/archive/00000002.history’: No such file or directory
2017-01-30 10:07:40 AEDT LOG:  selected new timeline ID: 2
cp: cannot stat ‘/mnt/archive/00000001.history’: No such file or directory
2017-01-30 10:07:40 AEDT LOG:  archive recovery complete
2017-01-30 10:08:55 AEDT FATAL:  the database system is starting up
2017-01-30 10:08:57 AEDT FATAL:  the database system is starting up

This time it looks like it has actually finished the startup and recovery. However I think I might have something wrong about the process. 
Any thoughts on the above log?


Brian Mills
CTO


Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

  

On 30 January 2017 at 04:49, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 01/28/2017 11:23 PM, Brian Mills wrote:
I presume this is a binary log file for the database.

Am I able to recover to a point in time using this log file?

What I would do in SQL Server would be recover to a point in time, say a
bit before the last completed transaction time the log mentions, then
take a backup. Is that possible in postgres?

Had another thought. If I remember correctly you are using this as an exercise in Postgres recovery. If that is indeed the case you might try:

1) Stop the Postgres instance you have running now.

2) Move the WAL file that Postgres is currently stalled on, 0000000100000005000000A3, out of pg_xlog.

3) Restart the Postgres instance.

My guess it it will not bring it back to the exact point you want, but close. You can get a general idea by running(before and after removing the WAL), as the postgres user:

pg_controldata -D /etc/postgresql/9.3/main


The log mentions this:
2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
time 2017-01-24 02:08:00.023064+11

(which is moments before, or possibly as the disk filled up doing a db
backup dump)

*Brian Mills*
CTO




--
Adrian Klaver
adrian.klaver@aklaver.com

Re: [GENERAL] Recovery Assistance

От
Adrian Klaver
Дата:
On 01/29/2017 03:12 PM, Brian Mills wrote:
> OK. I think I'm on to something here, I first reset back to my file
> level backup.
> I created a recovery.conf file in the root of the data directory like this:
> ---------------------
> restore_command = 'cp /mnt/archive/%f %p'
> recovery_target_time = '2017-01-24 02:08:00.023064+11'
> recovery_target_inclusive = 'true'
> pause_at_recovery_target = 'false'
> ---------------------
> Note, the archive directory had no files in it, I left the WAL files in
> the pg_xlog directory.
>
> Then I started up the database again:
> postgres@atlassian:~/9.3/main$ /usr/lib/postgresql/9.3/bin/pg_ctl -D
> /etc/postgresql/9.3/main start
> server starting
> postgres@atlassian:~/9.3/main$ 2017-01-30 10:07:28 AEDT LOG:  database
> system was interrupted while in recovery at 2017-01-27 20:13:26 AEDT
> 2017-01-30 10:07:28 AEDT HINT:  This probably means that some data is
> corrupted and you will have to use the last backup for recovery.
> 2017-01-30 10:07:28 AEDT LOG:  starting point-in-time recovery to
> 2017-01-24 02:08:00.023064+11
> 2017-01-30 10:07:28 AEDT LOG:  database system was not properly shut
> down; automatic recovery in progress
> 2017-01-30 10:07:28 AEDT WARNING:  WAL was generated with
> wal_level=minimal, data may be missing

This would be a problem:

https://www.postgresql.org/docs/9.3/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

"But minimal WAL does not contain enough information to reconstruct the
data from a base backup and the WAL logs, so either archive or
hot_standby level must be used to enable WAL archiving (archive_mode)
and streaming replication. "

> 2017-01-30 10:07:28 AEDT HINT:  This happens if you temporarily set
> wal_level=minimal without taking a new base backup.
> 2017-01-30 10:07:28 AEDT LOG:  redo starts at 5/528B4558
> 2017-01-30 10:07:40 AEDT LOG:  consistent recovery state reached at
> 5/A3FFFA30
> cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or
> directory
> cp: cannot stat ‘/mnt/archive/0000000100000005000000A4’: No such file or
> directory
> 2017-01-30 10:07:40 AEDT LOG:  redo done at 5/A3FFF9E8
> 2017-01-30 10:07:40 AEDT LOG:  last completed transaction was at log
> time 2017-01-24 02:08:00.023064+11
> cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or
> directory
> cp: cannot stat ‘/mnt/archive/00000002.history’: No such file or directory
> 2017-01-30 10:07:40 AEDT LOG:  selected new timeline ID: 2
> cp: cannot stat ‘/mnt/archive/00000001.history’: No such file or directory
> 2017-01-30 10:07:40 AEDT LOG:  archive recovery complete
> 2017-01-30 10:08:55 AEDT FATAL:  the database system is starting up
> 2017-01-30 10:08:57 AEDT FATAL:  the database system is starting up
>
> This time it looks like it has actually finished the startup and
> recovery. However I think I might have something wrong about the process.
> Any thoughts on the above log?

See above.

>
>
> *Brian Mills*
> CTO
>
>
> *Mob: *0410660003 <tel:0410660003>
> *Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
> <tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
> <tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>
> Level 1 *|*  600 Chapel Street *|* South
> Yarra*|*  VIC *|*  3141 *|*  Australia
>
> <https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking>
<https://www.linkedin.com/company/trybooking-com>
>
> On 30 January 2017 at 04:49, Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
>     On 01/28/2017 11:23 PM, Brian Mills wrote:
>
>         I presume this is a binary log file for the database.
>
>         Am I able to recover to a point in time using this log file?
>
>         What I would do in SQL Server would be recover to a point in
>         time, say a
>         bit before the last completed transaction time the log mentions,
>         then
>         take a backup. Is that possible in postgres?
>
>
>     Had another thought. If I remember correctly you are using this as
>     an exercise in Postgres recovery. If that is indeed the case you
>     might try:
>
>     1) Stop the Postgres instance you have running now.
>
>     2) Move the WAL file that Postgres is currently stalled on,
>     0000000100000005000000A3, out of pg_xlog.
>
>     3) Restart the Postgres instance.
>
>     My guess it it will not bring it back to the exact point you want,
>     but close. You can get a general idea by running(before and after
>     removing the WAL), as the postgres user:
>
>     pg_controldata -D /etc/postgresql/9.3/main
>
>
>         The log mentions this:
>         2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log
>         time 2017-01-24 02:08:00.023064+11
>
>         (which is moments before, or possibly as the disk filled up
>         doing a db
>         backup dump)
>
>         *Brian Mills*
>         CTO
>
>
>
>
>     --
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com