Обсуждение: How to repair Pg 9.1?

Поиск
Список
Период
Сортировка

How to repair Pg 9.1?

От
"Dmitry E. Oboukhov"
Дата:
Recently I started a hard query (ALTER TABLE) and then Postgresql was
killed by OOM killer.

Since that clients lose their connection and Postgresql says in log:

    2012-04-15 16:46:02 MSK LOG:  server process (PID 825) was terminated by signal 11: Segmentation fault
    2012-04-15 16:46:02 MSK LOG:  terminating any other active server processes
    2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
    2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 
    2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your command.
    2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
    2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 
    2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your command.
    2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
    2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 
    2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your command.
    2012-04-15 16:46:02 MSK LOG:  all server processes terminated; reinitializing
    2012-04-15 16:46:02 MSK LOG:  database system was interrupted; last known up at 2012-04-15 16:45:17 MSK
    2012-04-15 16:46:02 MSK LOG:  database system was not properly shut down; automatic recovery in progress
    2012-04-15 16:46:02 MSK LOG:  consistent recovery state reached at 58/21C44D18
    2012-04-15 16:46:02 MSK LOG:  redo starts at 58/21C44D18
    2012-04-15 16:46:02 MSK LOG:  record with zero length at 58/21CA7E20
    2012-04-15 16:46:02 MSK LOG:  redo done at 58/21CA7DE0
    2012-04-15 16:46:02 MSK LOG:  last completed transaction was at log time 2012-04-15 16:46:02.620188+04
    2012-04-15 16:46:02 MSK LOG:  autovacuum launcher started
    2012-04-15 16:46:02 MSK LOG:  database system is ready to accept connections


What can I do in the situation?

Postgresql 9.1/Debian wheezy
--

. ''`.                               Dmitry E. Oboukhov
: :’  :   email: unera@debian.org jabber://UNera@uvw.ru
`. `~’              GPGKey: 1024D / F8E26537 2006-11-21
  `- 1B23 D4F8 8EC0 D902 0555  E438 AB8C 00CF F8E2 6537

Вложения

Re: How to repair Pg 9.1?

От
Tomas Vondra
Дата:
On 15.4.2012 14:52, Dmitry E. Oboukhov wrote:
> Recently I started a hard query (ALTER TABLE) and then Postgresql was
> killed by OOM killer.
>
> Since that clients lose their connection and Postgresql says in log:
>
>     2012-04-15 16:46:02 MSK LOG:  server process (PID 825) was terminated by signal 11: Segmentation fault
>     2012-04-15 16:46:02 MSK LOG:  terminating any other active server processes
>     2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
>     2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 
>     2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your
command.
>     2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
>     2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 
>     2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your
command.
>     2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
>     2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 
>     2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your
command.
>     2012-04-15 16:46:02 MSK LOG:  all server processes terminated; reinitializing
>     2012-04-15 16:46:02 MSK LOG:  database system was interrupted; last known up at 2012-04-15 16:45:17 MSK
>     2012-04-15 16:46:02 MSK LOG:  database system was not properly shut down; automatic recovery in progress
>     2012-04-15 16:46:02 MSK LOG:  consistent recovery state reached at 58/21C44D18
>     2012-04-15 16:46:02 MSK LOG:  redo starts at 58/21C44D18
>     2012-04-15 16:46:02 MSK LOG:  record with zero length at 58/21CA7E20
>     2012-04-15 16:46:02 MSK LOG:  redo done at 58/21CA7DE0
>     2012-04-15 16:46:02 MSK LOG:  last completed transaction was at log time 2012-04-15 16:46:02.620188+04
>     2012-04-15 16:46:02 MSK LOG:  autovacuum launcher started
>     2012-04-15 16:46:02 MSK LOG:  database system is ready to accept connections
>
>
> What can I do in the situation?

Well, what happened is quite clear. One of the processes was terminated
by the OOM killer. The database noticed this and responded properly by
restarting the database. That caused a proper recovery - you don't need
to "repair" the database.

What you need to do is change the machine configuration so that the OOM
errors do not happen. I don't know what's running on the machine, but
this usually means tweaking the vm.* parameters at the kernel level and
shared_buffers/work_mem/maintenance_work_mem at the PostgreSQL level.

In this particular situation (ALTER TABLE) I'd use a low value for the
maintenance_work_mem parameter - it'll run longer but won't need that
much memory.

Tomas

Re: How to repair Pg 9.1?

От
"Dmitry E. Oboukhov"
Дата:
>>     2012-04-15 16:46:02 MSK LOG:  server process (PID 825) was terminated by signal 11: Segmentation fault

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> What can I do in the situation?

> Well, what happened is quite clear. One of the processes was terminated
> by the OOM killer. The database noticed this and responded properly by
> restarting the database. That caused a proper recovery - you don't need
> to "repair" the database.

> What you need to do is change the machine configuration so that the OOM
> errors do not happen. I don't know what's running on the machine, but
> this usually means tweaking the vm.* parameters at the kernel level and
> shared_buffers/work_mem/maintenance_work_mem at the PostgreSQL level.

> In this particular situation (ALTER TABLE) I'd use a low value for the
> maintenance_work_mem parameter - it'll run longer but won't need that
> much memory.

> Tomas

I've just upgraded my postgresql upto current wheezy version
(9.1.3-2/Debian): I wanted to get backtrace so it was upgraded when I
installed postgresql-dbg package. After upgrading it discontinued
segfault and began work fine.
:)
--

. ''`.                               Dmitry E. Oboukhov
: :’  :   email: unera@debian.org jabber://UNera@uvw.ru
`. `~’              GPGKey: 1024D / F8E26537 2006-11-21
  `- 1B23 D4F8 8EC0 D902 0555  E438 AB8C 00CF F8E2 6537

Вложения

Re: How to repair Pg 9.1?

От
Tom Lane
Дата:
"Dmitry E. Oboukhov" <unera@debian.org> writes:
> Recently I started a hard query (ALTER TABLE) and then Postgresql was
> killed by OOM killer.
> ...
>     2012-04-15 16:46:02 MSK LOG:  server process (PID 825) was terminated by signal 11: Segmentation fault

You sure it was an OOM kill?  The OOM killer does its dirty work with
signal 9 (SIGKILL), not signal 11 (SIGSEGV).

            regards, tom lane

Re: How to repair Pg 9.1?

От
"Dmitry E. Oboukhov"
Дата:
>> Recently I started a hard query (ALTER TABLE) and then Postgresql was
>> killed by OOM killer.
>> ...
>>     2012-04-15 16:46:02 MSK LOG:  server process (PID 825) was terminated by signal 11: Segmentation fault

> You sure it was an OOM kill?  The OOM killer does its dirty work with
> signal 9 (SIGKILL), not signal 11 (SIGSEGV).

OOM killer was when I did 'ALTER TABLE' (in the first time),
the log was since the time for each client's connecting.

Sorry, my bad English, I could not explain cleanly :)
--

. ''`.                               Dmitry E. Oboukhov
: :’  :   email: unera@debian.org jabber://UNera@uvw.ru
`. `~’              GPGKey: 1024D / F8E26537 2006-11-21
  `- 1B23 D4F8 8EC0 D902 0555  E438 AB8C 00CF F8E2 6537

Вложения

Re: How to repair Pg 9.1?

От
Tomas Vondra
Дата:
On 15.4.2012 16:16, Dmitry E. Oboukhov wrote:
>>>     2012-04-15 16:46:02 MSK LOG:  server process (PID 825) was terminated by signal 11: Segmentation fault
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> What can I do in the situation?
>
>> Well, what happened is quite clear. One of the processes was terminated
>> by the OOM killer. The database noticed this and responded properly by
>> restarting the database. That caused a proper recovery - you don't need
>> to "repair" the database.
>
>> What you need to do is change the machine configuration so that the OOM
>> errors do not happen. I don't know what's running on the machine, but
>> this usually means tweaking the vm.* parameters at the kernel level and
>> shared_buffers/work_mem/maintenance_work_mem at the PostgreSQL level.
>
>> In this particular situation (ALTER TABLE) I'd use a low value for the
>> maintenance_work_mem parameter - it'll run longer but won't need that
>> much memory.
>
>> Tomas
>
> I've just upgraded my postgresql upto current wheezy version
> (9.1.3-2/Debian): I wanted to get backtrace so it was upgraded when I
> installed postgresql-dbg package. After upgrading it discontinued
> segfault and began work fine.
> :)

I haven't noticed it was a segfault - I kinda believed you it actually
was an OOM kill, but that does not cause segfaults. That's a whole
different story ...

It you were getting that whenever you run the ALTER TABLE and it stopped
after upgrading to 9.1.3, then there was something wrong in the version
you've been using before and it got fixed in 9.1.3. Or maybe it still
exists but it does not happen so often.

It'd be nice to know what version were you using and a detailed
description of what you've done - table structure, size, exact ALTER
TABLE command etc.

Tomas