Thread: Shared memory corrupted?

Shared memory corrupted?

From: Jeff Boes
Date:
We are experiencing the following error, usually during our nightly
delete-and-vacuum cycle (when there are very few other connections to
the database):

2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was
terminated by signal 14
2003-10-30 01:36:59 [25392]  LOG:  terminating any other active server
processes
2003-10-30 01:37:01 [1977]   FATAL:  The database system is in recovery mode
2003-10-30 01:37:08 [25392]  LOG:  all server processes terminated;
reinitializing shared memory and semaphores
2003-10-30 01:37:09 [2856]   FATAL:  The database system is starting up
2003-10-30 01:37:09 [2855]   LOG:  database system was interrupted at
2003-10-30 01:26:13 EST

The only clues we have are that the server processes interrupted by
"signal 14" *seem* to be backends connected to Apache processes (on
another server). But even that isn't certain, because of the difficulty
in tracking down which process was doing what at the time.

--
Jeff Boes                                      vox 269.226.9550 ext 24
Database Engineer                                     fax 269.349.9076
Nexcerpt, Inc.                                 http://www.nexcerpt.com
            ...Nexcerpt... Extend your Expertise


Re: Shared memory corrupted?

From: Jeff Boes
Date:
Tom Lane wrote:

>Jeff Boes <jboes@nexcerpt.com> writes:
>
>>We are experiencing the following error, usually during our nightly
>>delete-and-vacuum cycle (when there are very few other connections to
>>the database):
>>
>>2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was
>>terminated by signal 14
>
>What's signal 14 on your machine?  (Look in /usr/include/signal.h to
>be sure.)  Also, what PG version is this?
>
>            regards, tom lane

signal.h doesn't have any definitions for signal numbers in it; "kill
-l" lists 14 as:

 14) SIGALRM
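
As a cross-check on the "kill -l" output, the number-to-name mapping can also be read from a scripting language's signal table. The following is a minimal sketch in Python, not something from the original exchange; `signal_name` is a hypothetical helper, and signal numbering is platform-specific, so it should be run on the machine in question.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # 14 is the number reported in the server log.
    print(14, signal_name(14))
```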

This is Pg 7.3.4, running on Linux 7.3 (Kernel 2.4.18-18.7.xsmp on a
2-processor i686).

The system has 4 GB of RAM. Shared memory parameters out of
/etc/sysctl.conf follow:

kernel.shmall = 1352914698
kernel.shmmax = 1352914698

And here's what I guess are the pertinent data from the postgresql.conf
file:

sort_mem = 65536
vacuum_mem = 262144
effective_cache_size = 196608
shared_buffers = 131072
max_fsm_relations = 200
max_fsm_pages = 350000
wal_buffers = 32


We've seen the problem with vacuum_mem = 65536 also.
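
For scale, the settings above can be converted to bytes. This is a rough sketch that assumes the PostgreSQL 7.3 units (shared_buffers and effective_cache_size counted in 8 kB pages, sort_mem and vacuum_mem in kB) and the default 8 kB block size; the variable names are illustrative only.

```python
BLOCK_SIZE = 8192  # assumed default PostgreSQL block size, in bytes
KB = 1024

# Values copied from the postgresql.conf excerpt, converted to bytes.
settings_bytes = {
    "shared_buffers": 131072 * BLOCK_SIZE,
    "effective_cache_size": 196608 * BLOCK_SIZE,
    "sort_mem": 65536 * KB,
    "vacuum_mem": 262144 * KB,
}

SHMMAX = 1352914698  # kernel.shmmax from /etc/sysctl.conf, in bytes

if __name__ == "__main__":
    for name, size in settings_bytes.items():
        print(f"{name:22s} {size / 2**20:8.0f} MiB")
    print(f"{'kernel.shmmax':22s} {SHMMAX / 2**20:8.0f} MiB")
```

Under these assumptions the buffer pool alone is 1 GiB, which is below the configured shmmax. sort_mem and vacuum_mem are allocated in each backend's private memory rather than in the shared segment.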




Re: Shared memory corrupted?

From: Tom Lane
Date:
Jeff Boes <jboes@nexcerpt.com> writes:
> We are experiencing the following error, usually during our nightly
> delete-and-vacuum cycle (when there are very few other connections to
> the database):

> 2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was
> terminated by signal 14

What's signal 14 on your machine?  (Look in /usr/include/signal.h to
be sure.)  Also, what PG version is this?

            regards, tom lane

Re: Shared memory corrupted?

From: Tom Lane
Date:
Jeff Boes <jboes@nexcerpt.com> writes:
> Tom Lane wrote:
>> What's signal 14 on your machine?  (Look in /usr/include/signal.h to
>> be sure.)  Also, what PG version is this?

>  14) SIGALRM

> This is Pg 7.3.4, running on Linux 7.3 (Kernel 2.4.18-18.7.xsmp on a
> 2-processor i686).

Hm.  That doesn't make any sense at all, because SIGALRM is either
caught by a handler or ignored everywhere in the Postgres backend.
There is no situation where it could legitimately cause process
termination.  Is it possible you are dealing with a kernel bug?

[ thinks... ]  Another possibility is that you are running some
non-Postgres code that resets SIGALRM handling to default.  I have
heard rumors that Perl will do that in some cases, for example.
Are you using plperl?

            regards, tom lane
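
The failure mode described above (SIGALRM arriving while its disposition is the default) can be reproduced in isolation. The sketch below is an illustration only and does not involve PostgreSQL or plperl: a hypothetical child process resets SIGALRM to the default action, arms a timer, and the parent observes that the child was terminated by the signal.

```python
import signal
import subprocess
import sys

# Source of the child process: default SIGALRM action plus a short timer.
CHILD = """
import signal, time
signal.signal(signal.SIGALRM, signal.SIG_DFL)  # default action: terminate
signal.setitimer(signal.ITIMER_REAL, 0.1)      # deliver SIGALRM after 0.1 s
time.sleep(5)                                  # killed before this returns
"""


def run_child() -> int:
    """Run the child and return its status as reported by subprocess."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode


if __name__ == "__main__":
    status = run_child()
    # subprocess reports death by signal N as a return code of -N.
    print("child return code:", status)
```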

Re: Shared memory corrupted?

From: Jeff Boes
Date:
Tom Lane wrote:

>[ thinks... ]  Another possibility is that you are running some
>non-Postgres code that resets SIGALRM handling to default.  I have
>heard rumors that Perl will do that in some cases, for example.
>Are you using plperl?

Yes, we are. I know there are some places in the code where SIGALRM is
used, so I'll start looking there. But if you or anyone else thinks of
anything, let me know ...

[How would a plperl function that changes the local behavior of SIGALRM
affect the backend?]




Re: Shared memory corrupted?

From: Tom Lane
Date:
Jeff Boes <jboes@nexcerpt.com> writes:
> [How would a plperl function that changes the local behavior of SIGALRM
> affect the backend?]

IIRC, SIGALRM is used for two things: one, to trigger a deadlock check
cycle if we wait too long for a lock (see deadlock_timeout), and two,
to implement statement_timeout.  If you are using statement_timeout then
I think it would be dangerous to mess with SIGALRM at all.  If you are
not, then I think it would be all right to modify the SIGALRM handler
setting locally, as long as you restore it to its original setting when
you are done.  Don't try to run any database access operations while you
have a nonstandard setting of the SIGALRM handler, though, or you risk
problems with deadlock checking.

            regards, tom lane
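
The "modify locally, then restore" pattern described above can be sketched as follows. This is written in Python purely to illustrate the idea; the code in the thread is Perl inside plperl, and `temporary_alarm` is a hypothetical helper, not an API of PostgreSQL or of any library. It assumes it runs in the main thread and that no other timer is pending.

```python
import signal
from contextlib import contextmanager


@contextmanager
def temporary_alarm(handler, seconds: float):
    """Install `handler` for SIGALRM and arm a timer; undo both on exit."""
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel our timer before touching the handler, so a late alarm
        # cannot reach the restored handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        if old_handler is not None:
            # None means the old handler was not installed from Python
            # and cannot be reinstalled through this interface.
            signal.signal(signal.SIGALRM, old_handler)


def on_timeout(signum, frame):
    raise TimeoutError("operation took too long")


if __name__ == "__main__":
    try:
        with temporary_alarm(on_timeout, 0.1):
            signal.pause()  # stand-in for work that must not touch the database
    except TimeoutError:
        print("timed out; original SIGALRM handler restored")
```

In line with the advice above, nothing inside the `with` block should perform database access, since the backend's own SIGALRM handler is not in place during that window.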

Re: Shared memory corrupted?

From: Dmitry Morozovsky
Date:
On Thu, 30 Oct 2003, Jeff Boes wrote:

JB> We are experiencing the following error, usually during our nightly
JB> delete-and-vacuum cycle (when there are very few other connections to
JB> the database):
JB>
JB> 2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was
JB> terminated by signal 14
JB> 2003-10-30 01:36:59 [25392]  LOG:  terminating any other active server
JB> processes
JB> 2003-10-30 01:37:01 [1977]   FATAL:  The database system is in recovery mode
JB> 2003-10-30 01:37:08 [25392]  LOG:  all server processes terminated;
JB> reinitializing shared memory and semaphores
JB> 2003-10-30 01:37:09 [2856]   FATAL:  The database system is starting up
JB> 2003-10-30 01:37:09 [2855]   LOG:  database system was interrupted at
JB> 2003-10-30 01:26:13 EST
JB>
JB> The only clues we have are that the server processes interrupted by
JB> "signal 14" *seem* to be backends connected to Apache processes (on
JB> another server). But even that isn't certain, because of the difficulty
JB> in tracking down which process was doing what at the time.

Signal 14 is SIGALRM. Some kind of badly-behaving watchdog?

Sincerely,
D.Marck                                     [DM5020, MCK-RIPE, DM3-RIPN]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------