Re: URGENT: Database keeps crashing - suspect damaged RAM

From: Markus Wollny
Subject: Re: URGENT: Database keeps crashing - suspect damaged RAM
Msg-id: 2266D0630E43BB4290742247C891057501B1321C@dozer.computec.de
In reply to: URGENT: Database keeps crashing - suspect damaged RAM ("Markus Wollny" <Markus.Wollny@computec.de>)
List: pgsql-general

Oh - and I forgot to mention: The crashes only occur when there is load
on the machine. No load - no crashes. But then, that wouldn't be any
surprise, as it wouldn't make use of a lot of RAM without any load...

Regards,

    Markus
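The follow-up notes that the crashes only appear under load, i.e. when much of the 2 GB is actually in use. Dedicated tools such as memtest86 (booted instead of the OS) or memtester (run from userspace) exercise memory deliberately with write/read-back patterns instead of waiting for load to hit a bad cell. The sketch below is only a toy illustration of that idea, assuming Python 3: the hypothetical `pattern_test` fills one buffer with fixed byte patterns and reads them back. It can only touch the pages the OS hands this one process, so a zero result says little about the new modules as a whole.

```python
def pattern_test(size_mib=64, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Toy write/read-back memory check; returns the number of bad bytes seen.

    Hypothetical helper for illustration only: it exercises a single buffer
    in whatever physical pages the OS assigns, not specific RAM modules.
    0x55 and 0xAA are alternating-bit (01010101 / 10101010) patterns.
    """
    size = size_mib * 1024 * 1024
    buf = bytearray(size)
    bad = 0
    for pattern in patterns:
        expected = bytes([pattern]) * size
        buf[:] = expected  # write the pattern over the whole buffer
        if buf != expected:  # read it back and compare
            # Only walk byte-by-byte when a mismatch was detected.
            bad += sum(1 for got, want in zip(buf, expected) if got != want)
    return bad


if __name__ == "__main__":
    print("mismatched bytes:", pattern_test(size_mib=16))
```

On healthy hardware this should report 0 mismatched bytes; a nonzero count would point at memory, but a clean run does not rule it out.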

> -----Original Message-----
> From: Markus Wollny
> Sent: Tuesday, 6 August 2002 18:38
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] URGENT: Database keeps crashing - suspect damaged RAM
>
>
> Hello!
>
> I just installed PostgreSQL 7.2.1 on SuSE 7.3 (4x PIII Xeon 550 MHz,
> 2 GB RAM, 5x18 GB SCSI RAID). The OS was freshly installed; after that,
> I compiled and installed PostgreSQL from source:
>
>     ./configure --prefix=/opt/pgsql/ --with-perl --enable-odbc --enable-locale --enable-syslog
>
> I copied the settings in postgresql.conf etc. from an identical machine
> running the identical platform. Then I imported a database into the new
> installation. The import seems to have been successful; I didn't get any
> errors during the import. A subsequent vacuum analyze finished without
> anything out of the ordinary.
>
> Just a few minutes after this vacuum analyze, the database crashed for
> the first time. Since then it has kept crashing every now and then,
> every one or two minutes.
>
> What puzzles me is that this very same machine was running Oracle 8i
> on Win2k more or less flawlessly until just a few hours before, "more
> or less" meaning that we never really noticed anything much out of the
> ordinary. There might have been some minor issues after a RAM upgrade
> from 1 GB to 2 GB just a week ago, but looking back it's hard to say
> whether they were due to bad RAM or just to some bad code which we've
> sorted out (or disposed of) by now. As the machine is already running
> Linux and PostgreSQL, it's practically impossible to confirm my
> suspicion by going back to Oracle and taking a closer look.
>
> What I'd like to know is whether I need to look any further than the
> RAM. Shall I just chuck the new modules out of the machine? Or is there
> some other issue that could cause this behaviour? I am quite sure that
> I didn't do anything wrong during installation, configuration and
> import, and the same application code is running without errors on a
> different machine at this very moment. I don't like the "record with
> zero length" and "Cannot allocate memory" bits in the logfile at all,
> let alone the "was terminated by signal 9" thingy.
>
> So: Is it bad RAM? How can I make sure? What else could it be?
>
> Here's a small excerpt from the logfile:
>
> 2002-08-06 17:31:38 [17063]  DEBUG:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
>         Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
> 2002-08-06 17:36:23 [17296]  DEBUG:  _mdfd_blind_getseg: couldn't open /var/lib/pgsql/data/base/base/16596/16671: Cannot allocate memory
> 2002-08-06 17:36:24 [17296]  FATAL 2:  cannot write block 13387 of 16596/16671 blind: Cannot allocate memory
> 2002-08-06 17:36:24 [16530]  DEBUG:  server process (pid 17296) exited with exit code 2
> 2002-08-06 17:36:24 [16530]  DEBUG:  terminating any other active server processes
> 2002-08-06 17:36:24 [17081]  NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
> [...]
> 2002-08-06 17:36:24 [16530]  DEBUG:  all server processes terminated; reinitializing shared memory and semaphores
> 2002-08-06 17:36:24 [17298]  DEBUG:  database system was interrupted at 2002-08-06 17:31:21 CEST
> 2002-08-06 17:36:24 [17298]  DEBUG:  checkpoint record is at 0/325D7C78
> 2002-08-06 17:36:24 [17298]  DEBUG:  redo record is at 0/325D7C78; undo record is at 0/0; shutdown FALSE
> 2002-08-06 17:36:24 [17298]  DEBUG:  next transaction id: 2270; next oid: 901292
> 2002-08-06 17:36:24 [17298]  DEBUG:  database system was not properly shut down; automatic recovery in progress
> 2002-08-06 17:36:24 [17298]  DEBUG:  redo starts at 0/325D7CB8
> 2002-08-06 17:36:25 [17298]  DEBUG:  ReadRecord: record with zero length at 0/326E16C4
> 2002-08-06 17:36:25 [17298]  DEBUG:  redo done at 0/326E16A0
> 2002-08-06 17:36:30 [17298]  DEBUG:  database system is ready
> 2002-08-06 17:40:53 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
> 2002-08-06 17:52:50 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
> 2002-08-06 17:52:54 [16530]  DEBUG:  server process (pid 18237) was terminated by signal 9
> 2002-08-06 17:52:54 [16530]  DEBUG:  terminating any other active server processes
> 2002-08-06 17:52:54 [18234]  NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
> [...]
> 2002-08-06 17:52:57 [18253]  FATAL 1:  The database system is in recovery mode
> 2002-08-06 17:52:57 [18255]  FATAL 1:  The database system is in recovery mode
> 2002-08-06 17:52:57 [18254]  FATAL 1:  The database system is in recovery mode
> 2002-08-06 17:52:57 [18235]  NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
> 2002-08-06 17:52:57 [18256]  FATAL 1:  The database system is in recovery mode
> 2002-08-06 17:52:57 [18257]  FATAL 1:  The database system is in recovery mode
> 2002-08-06 17:52:57 [18258]  FATAL 1:  The database system is in recovery mode
> 2002-08-06 17:52:57 [16530]  DEBUG:  all server processes terminated; reinitializing shared memory and semaphores
> 2002-08-06 17:52:57 [18260]  FATAL 1:  The database system is starting up
> 2002-08-06 17:52:57 [18259]  DEBUG:  database system was interrupted at 2002-08-06 17:51:38 CEST
> 2002-08-06 17:52:57 [18259]  DEBUG:  checkpoint record is at 0/32991848
> 2002-08-06 17:52:57 [18259]  DEBUG:  redo record is at 0/3297F4D8; undo record is at 0/0; shutdown FALSE
> 2002-08-06 17:52:57 [18259]  DEBUG:  next transaction id: 3704; next oid: 909484
> 2002-08-06 17:52:57 [18259]  DEBUG:  database system was not properly shut down; automatic recovery in progress
> 2002-08-06 17:52:57 [18259]  DEBUG:  redo starts at 0/3297F4D8
> 2002-08-06 17:52:57 [18261]  FATAL 1:  The database system is starting up
> 2002-08-06 17:52:58 [18259]  DEBUG:  ReadRecord: record with zero length at 0/32BF0278
> 2002-08-06 17:52:58 [18259]  DEBUG:  redo done at 0/32BF0254
> 2002-08-06 17:52:59 [18262]  FATAL 1:  The database system is starting up
> 2002-08-06 17:53:00 [18259]  DEBUG:  database system is ready
> 2002-08-06 17:54:24 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
> 2002-08-06 17:54:31 [16530]  DEBUG:  server process (pid 18283) was terminated by signal 9
> 2002-08-06 17:54:31 [16530]  DEBUG:  terminating any other active server processes
> 2002-08-06 17:54:31 [18275]  NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
> [...]
> 2002-08-06 17:54:32 [16530]  DEBUG:  all server processes terminated; reinitializing shared memory and semaphores
> 2002-08-06 17:54:32 [18296]  DEBUG:  database system was interrupted at 2002-08-06 17:53:00 CEST
> 2002-08-06 17:54:32 [18296]  DEBUG:  checkpoint record is at 0/32BF0278
> 2002-08-06 17:54:32 [18296]  DEBUG:  redo record is at 0/32BF0278; undo record is at 0/0; shutdown TRUE
> 2002-08-06 17:54:32 [18296]  DEBUG:  next transaction id: 4456; next oid: 909484
> 2002-08-06 17:54:32 [18296]  DEBUG:  database system was not properly shut down; automatic recovery in progress
> 2002-08-06 17:54:32 [18296]  DEBUG:  redo starts at 0/32BF02B8
> 2002-08-06 17:54:32 [18296]  DEBUG:  ReadRecord: record with zero length at 0/32F0B3C0
> 2002-08-06 17:54:32 [18296]  DEBUG:  redo done at 0/32F0B39C
> 2002-08-06 17:54:34 [18297]  FATAL 1:  The database system is starting up
> 2002-08-06 17:54:34 [18298]  FATAL 1:  The database system is starting up
> 2002-08-06 17:54:34 [18299]  FATAL 1:  The database system is starting up
> 2002-08-06 17:54:34 [18300]  FATAL 1:  The database system is starting up
> 2002-08-06 17:54:34 [18296]  DEBUG:  database system is ready
> 2002-08-06 17:57:35 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
> 2002-08-06 17:57:54 [16530]  DEBUG:  server process (pid 18366) was terminated by signal 9
> 2002-08-06 17:57:54 [16530]  DEBUG:  terminating any other active server processes
> 2002-08-06 17:57:54 [18368]  NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
> 2002-08-06 17:57:56 [18409]  DEBUG:  ReadRecord: record with zero length at 0/3338749C
> 2002-08-06 17:57:58 [18425]  FATAL 1:  The database system is starting up
> 2002-08-06 17:57:58 [18409]  DEBUG:  database system is ready
> 2002-08-06 17:58:53 [18432]  NOTICE:  RelationBuildDesc: can't open idx_bm_user_id: Cannot allocate memory
> 2002-08-06 17:59:00 [18443]  FATAL 1:  cannot open pg_attribute: Cannot allocate memory
> 2002-08-06 17:59:01 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
> 2002-08-06 17:59:01 [16530]  DEBUG:  server process (pid 18436) was terminated by signal 9
> 2002-08-06 17:59:01 [16530]  DEBUG:  terminating any other active server processes
> 2002-08-06 17:59:03 [18510]  DEBUG:  ReadRecord: record with zero length at 0/336E9970
> 2002-08-06 18:00:15 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory
> 2002-08-06 18:00:17 [18589]  DEBUG:  ReadRecord: record with zero length at 0/33A7C194
>
> Thank you for your kind assistance!
>
> Regards,
>
>     Markus Wollny
>
>
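To put numbers on "every one or two minutes" and on how often memory allocation failed, a log like the one quoted can be summarized mechanically. This is a minimal sketch assuming Python 3 on a POSIX system and one log entry per line in the "YYYY-MM-DD HH:MM:SS [pid]  LEVEL:  message" form shown in the excerpt; `summarize`, `SAMPLE` and the regular expressions are hypothetical names for illustration, not part of PostgreSQL. It treats the postmaster's "terminating any other active server processes" line as one crash-and-restart cycle (a reading of this excerpt, not a documented guarantee) and decodes "terminated by signal N" with Python's `signal` module; signal 9 is SIGKILL, which a process cannot catch or ignore.

```python
import re
import signal
from datetime import datetime

# Assumed entry format, taken from the excerpt: "YYYY-MM-DD HH:MM:SS [pid]  LEVEL:  message"
ENTRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<pid>\d+)\]\s+"
    r"(?P<level>[A-Z]+(?: \d)?):\s+(?P<msg>.*)$"
)
SIGNAL_RE = re.compile(r"terminated by signal (\d+)")
# In the excerpt the postmaster logs this once per crash-and-restart cycle.
CRASH_MARKER = "terminating any other active server processes"


def summarize(lines):
    """Count ENOMEM errors, decode kill signals and time the crash cycles."""
    enomem = 0
    signals = []
    crashes = []
    for line in lines:
        m = ENTRY_RE.match(line)
        if m is None:
            continue  # indented NOTICE bodies, "[...]" elisions, blank lines
        msg = m.group("msg")
        if "Cannot allocate memory" in msg:
            enomem += 1
        s = SIGNAL_RE.search(msg)
        if s:
            # On POSIX systems signal 9 decodes to "SIGKILL".
            signals.append(signal.Signals(int(s.group(1))).name)
        if msg.startswith(CRASH_MARKER):
            crashes.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"))
    # Seconds between consecutive crash cycles.
    gaps = [(b - a).total_seconds() for a, b in zip(crashes, crashes[1:])]
    return {"enomem": enomem, "signals": signals, "crashes": crashes, "gaps": gaps}


# A few entries copied from the excerpt.
SAMPLE = [
    "2002-08-06 17:36:24 [17296]  FATAL 2:  cannot write block 13387 of 16596/16671 blind: Cannot allocate memory",
    "2002-08-06 17:36:24 [16530]  DEBUG:  terminating any other active server processes",
    "2002-08-06 17:52:50 [16530]  DEBUG:  connection startup failed (fork failure): Cannot allocate memory",
    "2002-08-06 17:52:54 [16530]  DEBUG:  server process (pid 18237) was terminated by signal 9",
    "2002-08-06 17:52:54 [16530]  DEBUG:  terminating any other active server processes",
    "2002-08-06 17:54:31 [16530]  DEBUG:  terminating any other active server processes",
]

if __name__ == "__main__":
    report = summarize(SAMPLE)
    print(report["enomem"], report["signals"], report["gaps"])
```

Keying on the postmaster's single "terminating" line, rather than on the per-client NOTICE messages, keeps one crash from being counted several times.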
