Thread: Server Down problem

Server Down problem

From
Volker
Date:
Hello all,

currently we have a server down problem. The server is an IBM xSeries
306, 1GB RAM, HD 36GB RAID1. Operating system is Debian Sarge 3.1.
PostgreSQL version is 7.4.7.

The reason for this is not known. The postmaster won't come up any
more. Here is the relevant content of the postgres.log file:

2006-08-21 15:22:45 [2889] LOG:  database system shutdown was interrupted at
2006-08-21 14:15:54 CEST
2006-08-21 15:22:45 [2889] LOG:  could not open file
"/ha/db/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2006-08-21 15:22:45 [2889] LOG:  invalid primary checkpoint record
2006-08-21 15:22:45 [2889] LOG:  could not open file
"/ha/db/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2006-08-21 15:22:45 [2889] LOG:  invalid secondary checkpoint record
2006-08-21 15:22:45 [2889] PANIC:  could not locate a valid
checkpoint record
2006-08-21 15:22:45 [2883] LOG:  startup process (PID 2889) was terminated by
signal 6
2006-08-21 15:22:45 [2883] LOG:  aborting startup due to startup process
failure
2006-08-21 15:23:53 [2958] LOG:  database system shutdown was interrupted at
2006-08-21 14:15:54 CEST
2006-08-21 15:23:53 [2958] LOG:  could not open file
"/ha/db/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2006-08-21 15:23:53 [2958] LOG:  invalid primary checkpoint record
2006-08-21 15:23:53 [2958] LOG:  could not open file
"/ha/db/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2006-08-21 15:23:53 [2958] LOG:  invalid secondary checkpoint record
2006-08-21 15:23:53 [2958] PANIC:  could not locate a valid
checkpoint record
2006-08-21 15:23:53 [2955] LOG:  startup process (PID 2958) was terminated by
signal 6
2006-08-21 15:23:53 [2955] LOG:  aborting startup due to startup process
failure
2006-08-21 17:41:17 [3184] LOG:  database system shutdown was interrupted at
2006-08-21 14:15:54 CEST
2006-08-21 17:41:17 [3184] LOG:  could not open file
"/ha/db/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2006-08-21 17:41:17 [3184] LOG:  invalid primary checkpoint record
2006-08-21 17:41:17 [3184] LOG:  could not open file
"/ha/db/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2006-08-21 17:41:17 [3184] LOG:  invalid secondary checkpoint record
2006-08-21 17:41:17 [3184] PANIC:  could not locate a valid
checkpoint record
2006-08-21 17:41:17 [3178] LOG:  startup process (PID 3184) was terminated by
signal 6
2006-08-21 17:41:17 [3178] LOG:  aborting startup due to startup process
failure

We assume that we have to use pg_resetxlog to get us started again,
but we are not sure what parameters to use. We need assistance in this
case.

We can execute pg_controldata without problem. Here is the output:

pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               shutting down
pg_control last modified:             Mon 21 Aug 2006 14:15:54 CEST
Current log file ID:                  0
Next log file segment:                1
Latest checkpoint location:           0/9B1118
Prior checkpoint location:            0/9B10D8
Latest checkpoint's REDO location:    0/9B1118
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        15
Latest checkpoint's NextXID:          536
Latest checkpoint's NextOID:          17142
Time of latest checkpoint:            Wed 16 Aug 2006 15:29:07 CEST
Database block size:                  8192
Blocks per segment:                   131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           de_DE@euro
LC_CTYPE:                             de_DE@euro

In addition, here is the directory listing of our data, pg_clog and
pg_xlog directories:

/ha/db/data:
drwx------  6 postgres postgres 4096 2006-03-22 11:37 base
drwx------  2 postgres postgres 4096 2006-08-21 17:41 global
drwx------  2 postgres postgres 4096 2006-06-01 15:02 pg_clog
lrwxrwxrwx  1 root     postgres   27 2005-11-23 13:13 pg_hba.conf -> /etc/postgresql/pg_hba.conf
lrwxrwxrwx  1 root     postgres   29 2005-11-23 13:13 pg_ident.conf -> /etc/postgresql/pg_ident.conf
-rw-------  1 postgres postgres    4 2005-11-23 13:13 PG_VERSION
drwx------  2 postgres postgres 4096 2006-08-07 05:06 pg_xlog
lrwxrwxrwx  1 root     postgres   31 2005-11-23 13:13 postgresql.conf -> /etc/postgresql/postgresql.conf
-rw-------  1 postgres postgres   54 2006-08-21 17:41 postmaster.opts

/ha/db/data/pg_clog:
-rw-------  1 postgres postgres 262144 2006-08-07 10:06 0005

/ha/db/data/pg_xlog:
-rw-------  1 postgres postgres 16777216 2006-08-07 14:16 000000010000008B
-rw-------  1 postgres postgres 16777216 2006-08-01 20:02 000000010000008C
-rw-------  1 postgres postgres 16777216 2006-08-02 10:02 000000010000008D
-rw-------  1 postgres postgres 16777216 2006-08-03 05:02 000000010000008E
-rw-------  1 postgres postgres 16777216 2006-08-04 00:02 000000010000008F
-rw-------  1 postgres postgres 16777216 2006-08-04 15:02 0000000100000090
-rw-------  1 postgres postgres 16777216 2006-08-05 10:02 0000000100000091
-rw-------  1 postgres postgres 16777216 2006-08-07 00:02 0000000100000092

Any help is appreciated.

Volker

Re: Server Down problem

From
Tom Lane
Date:
Volker <arendt@wiwi.uni-wuppertal.de> writes:
> currently we have a server down problem. The server is an IBM xSeries
> 306, 1GB RAM, HD 36GB RAID1. Operating system is Debian Sarge 3.1.
> PostgreSQL version is 7.4.7.
> The reason for this is not known. The postmaster won't come up any
> more. Here is the relevant content of the postgres.log file:

It looks to me like pg_control is quite out of sync with the files in
pg_xlog.  pg_control claims to have checkpointed as recently as 16-Aug
but there is no file newer than 7-Aug in pg_xlog.  The other thing that
is strange is that the filename numbers in pg_xlog correspond to WAL
locations much higher than what pg_control claims is the end of WAL.
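
To make that mismatch concrete, here is a minimal Python sketch that maps a WAL location, as printed by pg_controldata, to the segment file that should contain it. It assumes the 7.4-era naming scheme (eight hex digits of log file ID followed by eight hex digits of segment number) and 16 MB segments, which matches the 16777216-byte files in the listing. The function name is made up for illustration.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: 16 MB segments, as in the listing


def wal_file_for_location(location: str) -> str:
    """Map a location such as '0/9B1118' to a 16-hex-digit segment file name."""
    log_id_hex, offset_hex = location.split("/")
    log_id = int(log_id_hex, 16)
    # The byte offset within the log file selects the segment.
    segment = int(offset_hex, 16) // SEGMENT_SIZE
    return f"{log_id:08X}{segment:08X}"


# Checkpoint location reported by pg_controldata
print(wal_file_for_location("0/9B1118"))
# First byte of the lowest-numbered file actually present in pg_xlog
print(wal_file_for_location("1/8B000000"))
```

The first call yields 0000000000000000, the file the startup process fails to open, while the files actually on disk begin at log file 1, segment 8B.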

Is $PGDATA kept on a dismountable volume (ie, not the root disk)?
If so it might be a good idea to unmount the volume and look to see
if there's anything under the mount point.  I recall having seen
corruption that came from trying to start Postgres before the $PGDATA
disk had been mounted --- the initscript happily initdb'd a new database
under the mount-point directory, and then when the main disk did come up
things were badly hosed because the server was working with a pg_control
in memory that was completely out of sync with everything else.  This
looks a bit like that might have happened here.

As far as running pg_resetxlog goes, there is advice about setting the
parameters in recent releases' documentation; try
http://developer.postgresql.org/docs/postgres/app-pgresetxlog.html
for the latest.  (Some of the parameters mentioned don't exist in
7.4; just ignore 'em.)  I'm afraid though that you may have actual
database corruption.  If so pg_resetxlog won't fix it.
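
As a rough illustration of the rule of thumb in that documentation (pick a WAL start address larger than any file in pg_xlog, and a next transaction ID of the largest pg_clog file number plus one, times 1048576), here is a Python sketch applied to the directory listing from the first message. The function name is hypothetical, the two-part `-l fileid,seg` form is assumed from the 16-digit file names, and segment rollover at the end of a log file is not handled. Treat the output as a starting point to check against the documentation, not as verified values for this cluster.

```python
def suggest_reset_values(xlog_files, clog_files):
    """Return candidate (-l, -x) arguments derived from file names alone."""
    # File names are hexadecimal; compare them numerically.
    largest_xlog = max(xlog_files, key=lambda name: int(name, 16))
    log_id = int(largest_xlog[:8], 16)
    segment = int(largest_xlog[8:], 16) + 1  # assumption: no rollover needed
    largest_clog = max(int(name, 16) for name in clog_files)
    next_xid = (largest_clog + 1) * 1048576
    return f"0x{log_id:X},0x{segment:X}", f"0x{next_xid:X}"


xlog_files = [
    "000000010000008B", "000000010000008C", "000000010000008D",
    "000000010000008E", "000000010000008F", "0000000100000090",
    "0000000100000091", "0000000100000092",
]
clog_files = ["0005"]

wal_start, next_xid = suggest_reset_values(xlog_files, clog_files)
print("-l", wal_start)
print("-x", next_xid)
```

There is no comparable shortcut for the next OID, and none of this repairs data files that are already damaged.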

            regards, tom lane

Re: Server Down problem

From
Volker
Date:
Hi Tom,

the database is on a mountable volume. So that could be our problem.
Reading your comments, I would like to know if an automatic initdb on
postmaster startup can be disabled.
Is this possible?

We would receive a severe postmaster error, but this would be expected
behaviour if the mountable volume is temporarily unavailable.

Volker

Re: Server Down problem

From
Tom Lane
Date:
Volker <arendt@wiwi.uni-wuppertal.de> writes:
> the database is on a mountable volume. So that could be our problem.
> Reading your comments, I would like to know if an automatic initdb on
> postmaster startup can be disabled.

Just change the initscript.

There's been periodic discussion about whether the auto-initdb isn't
too unsafe to have in there, but it's mighty handy for newbies.
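
One way such a change could look, sketched in Python rather than as an actual Debian initscript: refuse to start (and never initdb) unless the volume is mounted and already holds an initialized cluster. The helper name is hypothetical, and which directory is the mount point is an assumption, since the thread only shows the data directory /ha/db/data.

```python
import os


def data_dir_is_ready(mount_point: str, data_dir: str) -> bool:
    """True only if the volume is mounted and data_dir looks initialized."""
    # If the volume is not mounted, data_dir is just an empty directory
    # on the root disk, which is exactly where a stray initdb would land.
    if not os.path.ismount(mount_point):
        return False
    # An initialized cluster has a PG_VERSION file at its top level.
    return os.path.isfile(os.path.join(data_dir, "PG_VERSION"))


# Assumed paths for illustration only: the real mount point may differ.
print(data_dir_is_ready("/ha/db", "/ha/db/data"))
```

A startup script would exit with an error when this returns False instead of falling through to initdb, which gives the severe but expected failure Volker asks for.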

            regards, tom lane