Обсуждение: Postgres error: could not open relation base/xxxxx/yyyyy

Поиск
Список
Период
Сортировка

Postgres error: could not open relation base/xxxxx/yyyyy

От
Pablo Delgado Díaz-Pache
Дата:
Hi all,

We've been using postgres for 9 years without a problem until now! Two problems in a very short time! 
This is another one (not related I think) ...

Postgres server is usually working fine. All of a sudden we start getting these errors ...

2010-11-09 11:49:15.320 CET|2|database1|10.19.0.51(18895)|20929|SELECT|4cd926fd.51c1|2010-11-09 11:48:29 CET|10/417796|1390150|postgres| LOG:  duration: 1518.422 ms  execute <unnamed>: SELECT id_token_fk,xxxxxxxxx ORDER BY avadate
2010-11-09 11:52:25.364 CET|1|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09 11:51:59 CET|7/430041|0|postgres| ERROR:  could not open relation base/273198960/273198979: No such file or directory
2010-11-09 11:52:25.364 CET|2|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09 11:51:59 CET|7/430041|0|postgres| STATEMENT:  SELECT id_token_fkxxxxxxxxxxx  ORDER BY avadate
2010-11-09 11:52:29.981 CET|3|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09 11:51:59 CET|7/430049|0|postgres| ERROR:  could not open relation base/273198960/273199235: No such file or directory
2010-11-09 11:52:30.988 CET|6|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09 11:51:59 CET|7/430050|0|postgres| STATEMENT:  SELECT max(avadate) xxxxxxxx 32036)
2010-11-09 11:53:36.346 CET|16|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09 11:53:02 CET|94/516004|0|postgres| STATEMENT:  SELECT * FROM "photos" xxxxxxxxxxxxxx LIMIT 1
2010-11-09 11:53:37.956 CET|17|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09 11:53:02 CET|94/516025|0|postgres| ERROR:  could not open relation base/271253899/271254075: No such file or directory
................
................
2010-11-09 11:53:55.560 CET|111|||26090||4cc6e970.65ea|2010-10-26 16:45:04 CEST||0|| ERROR:  could not open relation base/273198960/273199235: No such file or directory
2010-11-09 11:53:55.560 CET|112|||26090||4cc6e970.65ea|2010-10-26 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation base/273198960/273199235
2010-11-09 11:53:55.560 CET|113|||26090||4cc6e970.65ea|2010-10-26 16:45:04 CEST||0|| WARNING:  could not write block 8866 of base/273198960/273199235
2010-11-09 11:53:55.560 CET|114|||26090||4cc6e970.65ea|2010-10-26 16:45:04 CEST||0|| DETAIL:  Multiple failures --- write error might be permanent.
2010-11-09 11:53:56.590 CET|115|||26090||4cc6e970.65ea|2010-10-26 16:45:04 CEST||0|| ERROR:  could not open relation base/273198960/273199235: No such file or directory
2010-11-09 11:53:56.590 CET|116|||26090||4cc6e970.65ea|2010-10-26 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation base/273198960/273199235

See that there are 2 different databases involved. (database1 and database2).

Looking for distinct errors (among the many we have in the log) I find there are only 4 involved ...

base/271253899/271254075
base/273198960/273198979
base/273198960/273199235
base/273198960/273199253

and those files are not in the postgres base directory.

To fix it we have no option but to restart postgres (which restarts fine with a /etc/init.d/postgresql stop & start)
However, once we restarted postgres some data was corrupted. Tables that used to have 4,5 million rows had only 60 rows. As a consequence we had to restore from file system backup.
Once we did that, it worked fine for a few days until it happened again. We're worried it can happen again!

Could this error be a hardware problem?
We recently increased the memory from 8GB to 28GB, although it was working fine for more than 3 weeks.
We also recently upgraded from postgres 8.3.6 to 8.4.5, althought it also worked fine for a few months.
Upgrading to postgres 9 is easy for us. however, not sure that would help.

Some info of our server:

OS: Centos 5.5
Kernel: 2.6.18-194.1.el5
Postgres version: 8.4.5 (installation out-of-the-box using yum)
Server memory: 28GB

Any help would be appreciated

Pablo

Re: Postgres error: could not open relation base/xxxxx/yyyyy

От
Achilleas Mantzios
Дата:
Stop using the system immediately, since many things inserted to the DB might simply be garbage.
Inspect your memory with memtest86.

I would even suggest moving to a new HW if available, and start working into two parallel directions:

a) try to bring your DB into a sane state
b) try to fix your HW server, which apparently has problems.

Files are not "disappearing" like that, in normal situations.
You cannot trust your current HW to run postgresql. Its a guarantee method to lose your data, if you
go on by this start/stop/restore/pray circle.

IMPORTANT: Handle your old (pre-memory upgrade) backups like expensive chinese vases.
Start from those good and trustworthy backups to do any database "replay" (or similar) work which
would lead to a viable situation.

Στις Monday 15 November 2010 11:55:38 ο/η Pablo Delgado Díaz-Pache έγραψε:
> Hi all,
>
> We've been using postgres for 9 years without a problem until now! Two
> problems in a very short time!
> The first one is described in
>
http://postgresql.1045698.n5.nabble.com/Autovacuum-seems-to-block-database-WARNING-worker-took-too-long-to-start-td3264261.html
> This is another one (not related I think) ...
>
> Postgres server is usually working fine. All of a sudden we start getting
> these errors ...
>
> * **2010-11-09 11:49:15.320
> CET|2|database1|10.19.0.51(18895)|20929|SELECT|4cd926fd.51c1|2010-11-09
> 11:48:29 CET|10/417796|1390150|postgres| LOG:  duration: 1518.422 ms
>  execute <unnamed>: SELECT id_token_fk,xxxxxxxxx ORDER BY avadate*
> * **2010-11-09 11:52:25.364
> CET|1|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430041|0|postgres| ERROR:  could not open relation
> base/273198960/273198979: No such file or directory*
> * **2010-11-09 11:52:25.364
> CET|2|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430041|0|postgres| STATEMENT:  SELECT id_token_fkxxxxxxxxxxx
>  ORDER BY avadate*
> * **2010-11-09 11:52:29.981
> CET|3|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430049|0|postgres| ERROR:  could not open relation
> base/273198960/273199235: No such file or directory*
> * **2010-11-09 11:52:30.988
> CET|6|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430050|0|postgres| STATEMENT:  SELECT max(avadate) xxxxxxxx
> 32036)*
> * **2010-11-09 11:53:36.346
> CET|16|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> 11:53:02 CET|94/516004|0|postgres| STATEMENT:  SELECT * FROM "photos"
> xxxxxxxxxxxxxx LIMIT 1*
> * **2010-11-09 11:53:37.956
> CET|17|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> 11:53:02 CET|94/516025|0|postgres| ERROR:  could not open relation
> base/271253899/271254075: No such file or directory*
> * **................*
> * **................*
> * **2010-11-09 11:53:55.560 CET|111|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| ERROR:  could not open relation base/273198960/273199235:
> No such file or directory*
> * **2010-11-09 11:53:55.560 CET|112|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> base/273198960/273199235*
> * **2010-11-09 11:53:55.560 CET|113|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| WARNING:  could not write block 8866 of
> base/273198960/273199235*
> * **2010-11-09 11:53:55.560 CET|114|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| DETAIL:  Multiple failures --- write error might be
> permanent.*
> * **2010-11-09 11:53:56.590 CET|115|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| ERROR:  could not open relation base/273198960/273199235:
> No such file or directory*
> * **2010-11-09 11:53:56.590 CET|116|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> base/273198960/273199235*
>
> See that there are 2 different databases involved. (database1 and
> database2).
>
> Looking for distinct errors (among the many we have in the log) I find there
> are only 4 involved ...
>
> * **base/271253899/271254075*
> * **base/273198960/273198979*
> * **base/273198960/273199235*
> * **base/273198960/273199253*
>
> and those files are not in the postgres base directory.
>
> To fix it we have no option but to restart postgres (which restarts fine
> with a /etc/init.d/postgresql stop & start)
> However, once we restarted postgres some data was corrupted. Tables that
> used to have 4,5 million rows had only 60 rows. As a consequence we had to
> restore from file system backup.
> Once we did that, it worked fine for a few days until it happened again.
> We're worried it can happen again!
>
> Could this error be a hardware problem?
> We recently increased the memory from 8GB to 28GB, although it was working
> fine for more than 3 weeks.
> We also recently upgraded from postgres 8.3.6 to 8.4.5, althought it also
> worked fine for a few months.
> Upgrading to postgres 9 is easy for us. however, not sure that would help.
>
> Some info of our server:
>
> OS: Centos 5.5
> Kernel: 2.6.18-194.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)
> Server memory: 28GB
>
>
> Any help would be appreciated
>
> Pablo
>



--
Achilleas Mantzios

Re: Postgres error: could not open relation base/xxxxx/yyyyy

От
Pablo Delgado Díaz-Pache
Дата:
Hi Achilleas,

Thanks for your quick response.
We do have backups and a couple of spare servers running in parallel so we're safe in that sense. Thanks for your advice anyway.

To run memtest86 we have to go to the datacenter and that will take us a few days. Is there anything we can do remotely?

Could it be anything different than memory HW problem?
Btw, we are not using that server at the moment :)

Thanks

On Mon, Nov 15, 2010 at 12:03 PM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
Stop using the system immediately, since many things inserted to the DB might simply be garbage.
Inspect your memory with memtest86.

I would even suggest moving to a new HW if available, and start working into two parallel directions:

a) try to bring your DB into a sane state
b) try to fix your HW server, which apparently has problems.

Files are not "disappearing" like that, in normal situations.
You cannot trust your current HW to run postgresql. Its a guarantee method to lose your data, if you
go on by this start/stop/restore/pray circle.

IMPORTANT: Handle your old (pre-memory upgrade) backups like expensive chinese vases.
Start from those good and trustworthy backups to do any database "replay" (or similar) work which
would lead to a viable situation.

Στις Monday 15 November 2010 11:55:38 ο/η Pablo Delgado Díaz-Pache έγραψε:
> Hi all,
>
> We've been using postgres for 9 years without a problem until now! Two
> problems in a very short time!
> The first one is described in
> http://postgresql.1045698.n5.nabble.com/Autovacuum-seems-to-block-database-WARNING-worker-took-too-long-to-start-td3264261.html
> This is another one (not related I think) ...
>
> Postgres server is usually working fine. All of a sudden we start getting
> these errors ...
>
> * **2010-11-09 11:49:15.320
> CET|2|database1|10.19.0.51(18895)|20929|SELECT|4cd926fd.51c1|2010-11-09
> 11:48:29 CET|10/417796|1390150|postgres| LOG:  duration: 1518.422 ms
>  execute <unnamed>: SELECT id_token_fk,xxxxxxxxx ORDER BY avadate*
> * **2010-11-09 11:52:25.364
> CET|1|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430041|0|postgres| ERROR:  could not open relation
> base/273198960/273198979: No such file or directory*
> * **2010-11-09 11:52:25.364
> CET|2|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430041|0|postgres| STATEMENT:  SELECT id_token_fkxxxxxxxxxxx
>  ORDER BY avadate*
> * **2010-11-09 11:52:29.981
> CET|3|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430049|0|postgres| ERROR:  could not open relation
> base/273198960/273199235: No such file or directory*
> * **2010-11-09 11:52:30.988
> CET|6|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430050|0|postgres| STATEMENT:  SELECT max(avadate) xxxxxxxx
> 32036)*
> * **2010-11-09 11:53:36.346
> CET|16|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> 11:53:02 CET|94/516004|0|postgres| STATEMENT:  SELECT * FROM "photos"
> xxxxxxxxxxxxxx LIMIT 1*
> * **2010-11-09 11:53:37.956
> CET|17|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> 11:53:02 CET|94/516025|0|postgres| ERROR:  could not open relation
> base/271253899/271254075: No such file or directory*
> * **................*
> * **................*
> * **2010-11-09 11:53:55.560 CET|111|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| ERROR:  could not open relation base/273198960/273199235:
> No such file or directory*
> * **2010-11-09 11:53:55.560 CET|112|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> base/273198960/273199235*
> * **2010-11-09 11:53:55.560 CET|113|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| WARNING:  could not write block 8866 of
> base/273198960/273199235*
> * **2010-11-09 11:53:55.560 CET|114|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| DETAIL:  Multiple failures --- write error might be
> permanent.*
> * **2010-11-09 11:53:56.590 CET|115|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| ERROR:  could not open relation base/273198960/273199235:
> No such file or directory*
> * **2010-11-09 11:53:56.590 CET|116|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> base/273198960/273199235*
>
> See that there are 2 different databases involved. (database1 and
> database2).
>
> Looking for distinct errors (among the many we have in the log) I find there
> are only 4 involved ...
>
> * **base/271253899/271254075*
> * **base/273198960/273198979*
> * **base/273198960/273199235*
> * **base/273198960/273199253*
>
> and those files are not in the postgres base directory.
>
> To fix it we have no option but to restart postgres (which restarts fine
> with a /etc/init.d/postgresql stop & start)
> However, once we restarted postgres some data was corrupted. Tables that
> used to have 4,5 million rows had only 60 rows. As a consequence we had to
> restore from file system backup.
> Once we did that, it worked fine for a few days until it happened again.
> We're worried it can happen again!
>
> Could this error be a hardware problem?
> We recently increased the memory from 8GB to 28GB, although it was working
> fine for more than 3 weeks.
> We also recently upgraded from postgres 8.3.6 to 8.4.5, althought it also
> worked fine for a few months.
> Upgrading to postgres 9 is easy for us. however, not sure that would help.
>
> Some info of our server:
>
> OS: Centos 5.5
> Kernel: 2.6.18-194.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)
> Server memory: 28GB
>
>
> Any help would be appreciated
>
> Pablo
>



--
Achilleas Mantzios

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: Postgres error: could not open relation base/xxxxx/yyyyy

От
Achilleas Mantzios
Дата:
Whoever did the memory installation, should also have taken care of checking it's working ok.
A good idea is to use ECC memory.
Does fsck on the remote system run with reporting problems as well?? (it might report problems such as incorrect block
counts,missing inodes etc...) 

Also check for disk, controller problems.

Στις Monday 15 November 2010 14:42:32 ο/η Pablo Delgado Díaz-Pache έγραψε:
> Hi Achilleas,
>
> Thanks for your quick response.
> We do have backups and a couple of spare servers running in parallel so
> we're safe in that sense. Thanks for your advice anyway.
>
> To run *memtest86 *we have to go to the datacenter and that will take us a
> few days. Is there anything we can do remotely?
>
> Could it be anything different than memory HW problem?
> Btw, we are not using that server at the moment :)
>
> Thanks
>
> On Mon, Nov 15, 2010 at 12:03 PM, Achilleas Mantzios <
> achill@matrix.gatewaynet.com> wrote:
>
> > Stop using the system immediately, since many things inserted to the DB
> > might simply be garbage.
> > Inspect your memory with memtest86.
> >
> > I would even suggest moving to a new HW if available, and start working
> > into two parallel directions:
> >
> > a) try to bring your DB into a sane state
> > b) try to fix your HW server, which apparently has problems.
> >
> > Files are not "disappearing" like that, in normal situations.
> > You cannot trust your current HW to run postgresql. Its a guarantee method
> > to lose your data, if you
> > go on by this start/stop/restore/pray circle.
> >
> > IMPORTANT: Handle your old (pre-memory upgrade) backups like expensive
> > chinese vases.
> > Start from those good and trustworthy backups to do any database "replay"
> > (or similar) work which
> > would lead to a viable situation.
> >
> > Στις Monday 15 November 2010 11:55:38 ο/η Pablo Delgado Díaz-Pache έγραψε:
> > > Hi all,
> > >
> > > We've been using postgres for 9 years without a problem until now! Two
> > > problems in a very short time!
> > > The first one is described in
> > >
> >
http://postgresql.1045698.n5.nabble.com/Autovacuum-seems-to-block-database-WARNING-worker-took-too-long-to-start-td3264261.html
> > > This is another one (not related I think) ...
> > >
> > > Postgres server is usually working fine. All of a sudden we start getting
> > > these errors ...
> > >
> > > * **2010-11-09 11:49:15.320
> > > CET|2|database1|10.19.0.51(18895)|20929|SELECT|4cd926fd.51c1|2010-11-09
> > > 11:48:29 CET|10/417796|1390150|postgres| LOG:  duration: 1518.422 ms
> > >  execute <unnamed>: SELECT id_token_fk,xxxxxxxxx ORDER BY avadate*
> > > * **2010-11-09 11:52:25.364
> > > CET|1|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> > > 11:51:59 CET|7/430041|0|postgres| ERROR:  could not open relation
> > > base/273198960/273198979: No such file or directory*
> > > * **2010-11-09 11:52:25.364
> > > CET|2|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> > > 11:51:59 CET|7/430041|0|postgres| STATEMENT:  SELECT
> > id_token_fkxxxxxxxxxxx
> > >  ORDER BY avadate*
> > > * **2010-11-09 11:52:29.981
> > > CET|3|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> > > 11:51:59 CET|7/430049|0|postgres| ERROR:  could not open relation
> > > base/273198960/273199235: No such file or directory*
> > > * **2010-11-09 11:52:30.988
> > > CET|6|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> > > 11:51:59 CET|7/430050|0|postgres| STATEMENT:  SELECT max(avadate)
> > xxxxxxxx
> > > 32036)*
> > > * **2010-11-09 11:53:36.346
> > > CET|16|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> > > 11:53:02 CET|94/516004|0|postgres| STATEMENT:  SELECT * FROM "photos"
> > > xxxxxxxxxxxxxx LIMIT 1*
> > > * **2010-11-09 11:53:37.956
> > > CET|17|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> > > 11:53:02 CET|94/516025|0|postgres| ERROR:  could not open relation
> > > base/271253899/271254075: No such file or directory*
> > > * **................*
> > > * **................*
> > > * **2010-11-09 11:53:55.560 CET|111|||26090||4cc6e970.65ea|2010-10-26
> > > 16:45:04 CEST||0|| ERROR:  could not open relation
> > base/273198960/273199235:
> > > No such file or directory*
> > > * **2010-11-09 11:53:55.560 CET|112|||26090||4cc6e970.65ea|2010-10-26
> > > 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> > > base/273198960/273199235*
> > > * **2010-11-09 11:53:55.560 CET|113|||26090||4cc6e970.65ea|2010-10-26
> > > 16:45:04 CEST||0|| WARNING:  could not write block 8866 of
> > > base/273198960/273199235*
> > > * **2010-11-09 11:53:55.560 CET|114|||26090||4cc6e970.65ea|2010-10-26
> > > 16:45:04 CEST||0|| DETAIL:  Multiple failures --- write error might be
> > > permanent.*
> > > * **2010-11-09 11:53:56.590 CET|115|||26090||4cc6e970.65ea|2010-10-26
> > > 16:45:04 CEST||0|| ERROR:  could not open relation
> > base/273198960/273199235:
> > > No such file or directory*
> > > * **2010-11-09 11:53:56.590 CET|116|||26090||4cc6e970.65ea|2010-10-26
> > > 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> > > base/273198960/273199235*
> > >
> > > See that there are 2 different databases involved. (database1 and
> > > database2).
> > >
> > > Looking for distinct errors (among the many we have in the log) I find
> > there
> > > are only 4 involved ...
> > >
> > > * **base/271253899/271254075*
> > > * **base/273198960/273198979*
> > > * **base/273198960/273199235*
> > > * **base/273198960/273199253*
> > >
> > > and those files are not in the postgres base directory.
> > >
> > > To fix it we have no option but to restart postgres (which restarts fine
> > > with a /etc/init.d/postgresql stop & start)
> > > However, once we restarted postgres some data was corrupted. Tables that
> > > used to have 4,5 million rows had only 60 rows. As a consequence we had
> > to
> > > restore from file system backup.
> > > Once we did that, it worked fine for a few days until it happened again.
> > > We're worried it can happen again!
> > >
> > > Could this error be a hardware problem?
> > > We recently increased the memory from 8GB to 28GB, although it was
> > working
> > > fine for more than 3 weeks.
> > > We also recently upgraded from postgres 8.3.6 to 8.4.5, althought it also
> > > worked fine for a few months.
> > > Upgrading to postgres 9 is easy for us. however, not sure that would
> > help.
> > >
> > > Some info of our server:
> > >
> > > OS: Centos 5.5
> > > Kernel: 2.6.18-194.1.el5
> > > Postgres version: 8.4.5 (installation out-of-the-box using yum)
> > > Server memory: 28GB
> > >
> > >
> > > Any help would be appreciated
> > >
> > > Pablo
> > >
> >
> >
> >
> > --
> > Achilleas Mantzios
> >
> > --
> > Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-admin
> >
>



--
Achilleas Mantzios

Re: Postgres error: could not open relation base/xxxxx/yyyyy

От
Achilleas Mantzios
Дата:
Also you should check for any CPU malfunction, overheating, etc..
and for motherboard problems as well, power supply, etc...

Anyway, your problem is not postgresql (postgresql does not delete its own files).
Your problem is below your OS and everything that comes underneath.
What do the linux logs look like? Any hangups? Any crashes at OS level?

Στις Monday 15 November 2010 21:04:40 γράψατε:
> I found the memory is indeed ECC
> http://accessories.us.dell.com/sna/productdetail.aspx?c=us&l=en&s=biz&cs=05&sku=a2257179
>
> Type: DRAM
> Technology: DDR2 SDRAM
> Form Factor: FB-DIMM 240-pin
> Memory Speed: 667 MHz ( PC2-5300 )
> Data Integrity Check: ECC
> RAM Features: Fully buffered
>
> 2010/11/15 Pablo Delgado DΓ­az-Pache <delgadop@gmail.com>
>
> > Thanks again. We will go to the datacenter and run the both the memory and
> > fs check.
> >
> > Could you think something besides the memory hw problem could be causing
> > this postgres problem? or you think it's a 99% hardware problem?
> >
> > In the meantime and in case it helps here it goes the memory info of the
> > server:
> >
> > # cat /proc/meminfo
> > MemTotal:     28818256 kB
> > MemFree:      10277776 kB
> > Buffers:        554956 kB
> > Cached:       17388880 kB
> > SwapCached:          0 kB
> > Active:       14245932 kB
> > Inactive:      3791952 kB
> > HighTotal:           0 kB
> > HighFree:            0 kB
> > LowTotal:     28818256 kB
> > LowFree:      10277776 kB
> > SwapTotal:     4192924 kB
> > SwapFree:      4192924 kB
> > Dirty:             348 kB
> > Writeback:         124 kB
> > AnonPages:       93824 kB
> > Mapped:        5699404 kB
> > Slab:           448360 kB
> > PageTables:      16808 kB
> > NFS_Unstable:        0 kB
> > Bounce:              0 kB
> > CommitLimit:  18602052 kB
> > Committed_AS: 15344816 kB
> > VmallocTotal: 34359738367 kB
> > VmallocUsed:    265116 kB
> > VmallocChunk: 34359473063 kB
> > HugePages_Total:     0
> > HugePages_Free:      0
> > HugePages_Rsvd:      0
> > Hugepagesize:     2048 kB
> >
> > and more information ...
> >
> > # dmidecode --type 17
> > # dmidecode 2.10
> > SMBIOS 2.4 present.
> >
> > Handle 0x1100, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 2048 MB
> >         Form Factor: FB-DIMM
> >         Set: 1
> >         Locator: DIMM1
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 80AD808980AD
> >         Serial Number: 0394080B
> >         Asset Tag: 010802
> >         Part Number: HYMP525F72CP4N3-Y5
> >         Rank: 2
> >
> > Handle 0x1101, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 2048 MB
> >         Form Factor: FB-DIMM
> >         Set: 1
> >         Locator: DIMM2
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 80AD808980AD
> >         Serial Number: 0394040C
> >         Asset Tag: 010802
> >         Part Number: HYMP525F72CP4N3-Y5
> >         Rank: 2
> >
> > Handle 0x1102, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 4096 MB
> >         Form Factor: FB-DIMM
> >         Set: 2
> >         Locator: DIMM3
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 830B7FB3830B
> >         Serial Number: 151A2A5F
> >         Asset Tag: 0D0938
> >         Part Number: NT4GT72U4ND2BD-3C
> >         Rank: 2
> >
> > Handle 0x1103, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 4096 MB
> >         Form Factor: FB-DIMM
> >         Set: 2
> >         Locator: DIMM4
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 830B7FB3830B
> >         Serial Number: 6D1D2A5F
> >         Asset Tag: 0D0938
> >         Part Number: NT4GT72U4ND2BD-3C
> >         Rank: 2
> >
> > Handle 0x1104, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 4096 MB
> >         Form Factor: FB-DIMM
> >         Set: 3
> >         Locator: DIMM5
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 830B7FB3830B
> >         Serial Number: C43D2B5C
> >         Asset Tag: 0D0934
> >         Part Number: NT4GT72U4ND2BD-3C
> >         Rank: 2
> >
> > Handle 0x1105, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 4096 MB
> >         Form Factor: FB-DIMM
> >         Set: 3
> >         Locator: DIMM6
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 830B7FB3830B
> >         Serial Number: 9F472B5F
> >         Asset Tag: 0D0934
> >         Part Number: NT4GT72U4ND2BD-3C
> >         Rank: 2
> >
> > Handle 0x1106, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 4096 MB
> >         Form Factor: FB-DIMM
> >         Set: 4
> >         Locator: DIMM7
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 830B7FB3830B
> >         Serial Number: C51E2A5F
> >         Asset Tag: 0D0938
> >         Part Number: NT4GT72U4ND2BD-3C
> >         Rank: 2
> >
> > Handle 0x1107, DMI type 17, 28 bytes
> > Memory Device
> >         Array Handle: 0x1000
> >         Error Information Handle: Not Provided
> >         Total Width: 72 bits
> >         Data Width: 64 bits
> >         Size: 4096 MB
> >         Form Factor: FB-DIMM
> >         Set: 4
> >         Locator: DIMM8
> >         Bank Locator: Not Specified
> >         Type: DDR2 FB-DIMM
> >         Type Detail: Synchronous
> >         Speed: 667 MHz
> >         Manufacturer: 830B7FB3830B
> >         Serial Number: 1E1E2A5B
> >         Asset Tag: 0D0938
> >         Part Number: NT4GT72U4ND2BD-3C
> >         Rank: 2
> >
> >
> > On Mon, Nov 15, 2010 at 2:05 PM, Achilleas Mantzios <
> > achill@matrix.gatewaynet.com> wrote:
> >
> >> Whoever did the memory installation, should also have taken care of
> >> checking it's working ok.
> >> A good idea is to use ECC memory.
> >> Does fsck on the remote system run with reporting problems as well?? (it
> >> might report problems such as incorrect block counts, missing inodes etc...)
> >>
> >> Also check for disk, controller problems.
> >>
> >> Στις Monday 15 November 2010 14:42:32 ΞΏ/Ξ· Pablo Delgado DΓ­az-Pache έγραψΡ:
> >> > Hi Achilleas,
> >> >
> >> > Thanks for your quick response.
> >> > We do have backups and a couple of spare servers running in parallel so
> >> > we're safe in that sense. Thanks for your advice anyway.
> >> >
> >> > To run *memtest86 *we have to go to the datacenter and that will take us
> >> a
> >> > few days. Is there anything we can do remotely?
> >> >
> >> > Could it be anything different than memory HW problem?
> >> > Btw, we are not using that server at the moment :)
> >> >
> >> > Thanks
> >> >
> >> > On Mon, Nov 15, 2010 at 12:03 PM, Achilleas Mantzios <
> >> > achill@matrix.gatewaynet.com> wrote:
> >> >
> >> > > Stop using the system immediately, since many things inserted to the
> >> DB
> >> > > might simply be garbage.
> >> > > Inspect your memory with memtest86.
> >> > >
> >> > > I would even suggest moving to a new HW if available, and start
> >> working
> >> > > into two parallel directions:
> >> > >
> >> > > a) try to bring your DB into a sane state
> >> > > b) try to fix your HW server, which apparently has problems.
> >> > >
> >> > > Files are not "disappearing" like that, in normal situations.
> >> > > You cannot trust your current HW to run postgresql. Its a guarantee
> >> method
> >> > > to lose your data, if you
> >> > > go on by this start/stop/restore/pray circle.
> >> > >
> >> > > IMPORTANT: Handle your old (pre-memory upgrade) backups like expensive
> >> > > chinese vases.
> >> > > Start from those good and trustworthy backups to do any database
> >> "replay"
> >> > > (or similar) work which
> >> > > would lead to a viable situation.
> >> > >
> >> > > Στις Monday 15 November 2010 11:55:38 ΞΏ/Ξ· Pablo Delgado DΓ­az-Pache
> >> έγραψΡ:
> >> > > > Hi all,
> >> > > >
> >> > > > We've been using postgres for 9 years without a problem until now!
> >> Two
> >> > > > problems in a very short time!
> >> > > > The first one is described in
> >> > > >
> >> > >
> >>
http://postgresql.1045698.n5.nabble.com/Autovacuum-seems-to-block-database-WARNING-worker-took-too-long-to-start-td3264261.html
> >> > > > This is another one (not related I think) ...
> >> > > >
> >> > > > Postgres server is usually working fine. All of a sudden we start
> >> getting
> >> > > > these errors ...
> >> > > >
> >> > > > * **2010-11-09 11:49:15.320
> >> > > >
> >> CET|2|database1|10.19.0.51(18895)|20929|SELECT|4cd926fd.51c1|2010-11-09
> >> > > > 11:48:29 CET|10/417796|1390150|postgres| LOG:  duration: 1518.422 ms
> >> > > >  execute <unnamed>: SELECT id_token_fk,xxxxxxxxx ORDER BY avadate*
> >> > > > * **2010-11-09 11:52:25.364
> >> > > >
> >> CET|1|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> >> > > > 11:51:59 CET|7/430041|0|postgres| ERROR:  could not open relation
> >> > > > base/273198960/273198979: No such file or directory*
> >> > > > * **2010-11-09 11:52:25.364
> >> > > >
> >> CET|2|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> >> > > > 11:51:59 CET|7/430041|0|postgres| STATEMENT:  SELECT
> >> > > id_token_fkxxxxxxxxxxx
> >> > > >  ORDER BY avadate*
> >> > > > * **2010-11-09 11:52:29.981
> >> > > >
> >> CET|3|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> >> > > > 11:51:59 CET|7/430049|0|postgres| ERROR:  could not open relation
> >> > > > base/273198960/273199235: No such file or directory*
> >> > > > * **2010-11-09 11:52:30.988
> >> > > >
> >> CET|6|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> >> > > > 11:51:59 CET|7/430050|0|postgres| STATEMENT:  SELECT max(avadate)
> >> > > xxxxxxxx
> >> > > > 32036)*
> >> > > > * **2010-11-09 11:53:36.346
> >> > > >
> >> CET|16|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> >> > > > 11:53:02 CET|94/516004|0|postgres| STATEMENT:  SELECT * FROM
> >> "photos"
> >> > > > xxxxxxxxxxxxxx LIMIT 1*
> >> > > > * **2010-11-09 11:53:37.956
> >> > > >
> >> CET|17|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> >> > > > 11:53:02 CET|94/516025|0|postgres| ERROR:  could not open relation
> >> > > > base/271253899/271254075: No such file or directory*
> >> > > > * **................*
> >> > > > * **................*
> >> > > > * **2010-11-09 11:53:55.560
> >> CET|111|||26090||4cc6e970.65ea|2010-10-26
> >> > > > 16:45:04 CEST||0|| ERROR:  could not open relation
> >> > > base/273198960/273199235:
> >> > > > No such file or directory*
> >> > > > * **2010-11-09 11:53:55.560
> >> CET|112|||26090||4cc6e970.65ea|2010-10-26
> >> > > > 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> >> > > > base/273198960/273199235*
> >> > > > * **2010-11-09 11:53:55.560
> >> CET|113|||26090||4cc6e970.65ea|2010-10-26
> >> > > > 16:45:04 CEST||0|| WARNING:  could not write block 8866 of
> >> > > > base/273198960/273199235*
> >> > > > * **2010-11-09 11:53:55.560
> >> CET|114|||26090||4cc6e970.65ea|2010-10-26
> >> > > > 16:45:04 CEST||0|| DETAIL:  Multiple failures --- write error might
> >> be
> >> > > > permanent.*
> >> > > > * **2010-11-09 11:53:56.590
> >> CET|115|||26090||4cc6e970.65ea|2010-10-26
> >> > > > 16:45:04 CEST||0|| ERROR:  could not open relation
> >> > > base/273198960/273199235:
> >> > > > No such file or directory*
> >> > > > * **2010-11-09 11:53:56.590
> >> CET|116|||26090||4cc6e970.65ea|2010-10-26
> >> > > > 16:45:04 CEST||0|| CONTEXT:  writing block 8866 of relation
> >> > > > base/273198960/273199235*
> >> > > >
> >> > > > See that there are 2 different databases involved. (database1 and
> >> > > > database2).
> >> > > >
> >> > > > Looking for distinct errors (among the many we have in the log) I
> >> find
> >> > > there
> >> > > > are only 4 involved ...
> >> > > >
> >> > > > * **base/271253899/271254075*
> >> > > > * **base/273198960/273198979*
> >> > > > * **base/273198960/273199235*
> >> > > > * **base/273198960/273199253*
> >> > > >
> >> > > > and those files are not in the postgres base directory.
> >> > > >
> >> > > > To fix it we have no option but to restart postgres (which restarts
> >> fine
> >> > > > with a /etc/init.d/postgresql stop & start)
> >> > > > However, once we restarted postgres some data was corrupted. Tables
> >> that
> >> > > > used to have 4,5 million rows had only 60 rows. As a consequence we
> >> had
> >> > > to
> >> > > > restore from file system backup.
> >> > > > Once we did that, it worked fine for a few days until it happened
> >> again.
> >> > > > We're worried it can happen again!
> >> > > >
> >> > > > Could this error be a hardware problem?
> >> > > > We recently increased the memory from 8GB to 28GB, although it was
> >> > > working
> >> > > > fine for more than 3 weeks.
> >> > > > We also recently upgraded from postgres 8.3.6 to 8.4.5, althought it
> >> also
> >> > > > worked fine for a few months.
> >> > > > Upgrading to postgres 9 is easy for us. however, not sure that would
> >> > > help.
> >> > > >
> >> > > > Some info of our server:
> >> > > >
> >> > > > OS: Centos 5.5
> >> > > > Kernel: 2.6.18-194.1.el5
> >> > > > Postgres version: 8.4.5 (installation out-of-the-box using yum)
> >> > > > Server memory: 28GB
> >> > > >
> >> > > >
> >> > > > Any help would be appreciated
> >> > > >
> >> > > > Pablo
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Achilleas Mantzios
> >> > >
> >> > > --
> >> > > Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> >> > > To make changes to your subscription:
> >> > > http://www.postgresql.org/mailpref/pgsql-admin
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Achilleas Mantzios
> >>
> >> --
> >> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-admin
> >>
> >
> >
>



--
Achilleas Mantzios