Thread: BUG #2371: database crashes with semctl failed error

BUG #2371: database crashes with semctl failed error

From
"Brock Peabody"
Date:
The following bug has been logged online:

Bug reference:      2371
Logged by:          Brock Peabody
Email address:      brock.peabody@npcinternational.com
PostgreSQL version: 8.1
Operating system:   Windows Server 2003
Description:        database crashes with semctl failed error
Details:

The full text of the error message is:

FATAL:  semctl(167894456, 4, SETVAL, 0) failed: A non-blocking socket
operation could not be completed immediately.

I have a program that inserts/updates/deletes a large number of records to a
large database (~100 GB).  The program works, but after it runs for a few
hours it starts getting this error message.  Once this happens the database
is unusable to all clients until it is restarted.

I found this documented bug:

http://archives.postgresql.org/pgsql-bugs/2006-02/msg00185.php

but I'm not sure if it is related to my problem or not.

Re: BUG #2371: database crashes with semctl failed error

From
"Qingqing Zhou"
Date:
""Brock Peabody"" <brock.peabody@npcinternational.com> wrote
>
> FATAL:  semctl(167894456, 4, SETVAL, 0) failed: A non-blocking socket
> operation could not be completed immediately.
>

Can you reliably reproduce the problem? If so, we may be able to come up
with a test patch for it. We have encountered similar problems before, but
they are hard to reproduce.

Magnus? As Bruce suggested, we can plug in a check-EINTR-loop here in
semctl():

   /* Quickly lock/unlock the semaphore (if we can) */
   if (semop(semId, &sops, 1) < 0)
       return -1;
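
A check-EINTR loop of the kind mentioned above might look roughly like the following. This is only a sketch, not the proposed patch: retry_on_eintr, sem_call_fn and fake_semop are made-up names, and the callback stands in for the real semop() call so that the example compiles on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for a call such as semop(). */
typedef int (*sem_call_fn) (void *arg);

/*
 * Call fn(arg) until it either succeeds or fails with something other
 * than EINTR.  Returns fn's last return value; errno is left as fn set it.
 */
static int
retry_on_eintr(sem_call_fn fn, void *arg)
{
    int rc;

    assert(fn != NULL);
    do
    {
        errno = 0;
        rc = fn(arg);
    } while (rc < 0 && errno == EINTR);

    return rc;
}

/* Stand-in for the real call: fails with EINTR a few times, then finishes. */
struct fake_call
{
    int eintr_left;     /* how many EINTR failures to simulate first */
    int calls;          /* how many times fake_semop() has been invoked */
    int final_errno;    /* 0 = succeed at the end, else fail with this errno */
};

static int
fake_semop(void *arg)
{
    struct fake_call *f = arg;

    f->calls++;
    if (f->eintr_left > 0)
    {
        f->eintr_left--;
        errno = EINTR;
        return -1;
    }
    if (f->final_errno != 0)
    {
        errno = f->final_errno;
        return -1;
    }
    return 0;
}
```

Only EINTR is retried here; any other failure, EAGAIN included, goes straight back to the caller after a single attempt.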

Regards,
Qingqing

Re: BUG #2371: database crashes with semctl failed error

From
"Brock Peabody"
Date:
> -----Original Message-----
> From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-
> owner@postgresql.org] On Behalf Of Qingqing Zhou
> Sent: Wednesday, April 05, 2006 6:33 AM
> To: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #2371: database crashes with semctl failed
> error

> Can you reliably reproduce the problem?

I can here :).  I'm trying to figure out a way for someone to repeat it
outside my environment but I'm afraid it's got something to do with
timing.  I have 50 threads that are collecting data.  If I give each one
its own connection to the database the problem happens quickly.  If they
all share one connection the problem does not happen.

> If so, we may be able to come up with a test patch for it.
> We have encountered similar problems before, but they are
> hard to reproduce.

Do you think this is a Windows-only problem?

Thanks,
Brock

Re: BUG #2371: database crashes with semctl failed error

From
Tom Lane
Date:
"Brock Peabody" <brock.peabody@npcinternational.com> writes:
>> Can you reliably reproduce the problem?

> I can here :).  I'm trying to figure out a way for someone to repeat it
> outside my environment but I'm afraid it's got something to do with
> timing.  I have 50 threads that are collecting data.  If I give each one
> its own connection to the database the problem happens quickly.  If they
> all share one connection the problem does not happen.

Perhaps you could whittle down your app into a testbed that just sends
dummy data with about the same timing as the real app?

            regards, tom lane

Re: BUG #2371: database crashes with semctl failed error

From
"Brock Peabody"
Date:
> On Behalf Of Tom Lane

> Perhaps you could whittle down your app into a testbed that just sends
> dummy data with about the same timing as the real app?

I think I'm starting to get a better understanding of the problem.  It looks
like one of the threads is trying to insert a pathological number of records
(~1,800,000) in one transaction into a table while the other threads
are also reading from and writing to that table.  I'll try to get
something simple to reproduce this behavior.

Re: BUG #2371: database crashes with semctl failed error

From
"Qingqing Zhou"
Date:
""Brock Peabody"" <brock.peabody@npcinternational.com> wrote
>
> Do you think this is a Windows only problem?
>

I am afraid so. We have received 3 reports of this (or a quite similar)
problem, all on 8.1/Windows. I just noticed that yours is actually an EAGAIN
error, so I guess the loop patch in semctl() would not help :-(. If you can
find an easier way to reproduce it outside your environment, that would be
very helpful.

Regards,
Qingqing

Re: BUG #2371: database crashes with semctl failed error

From
"Peter Brant"
Date:
Hi all,

We were bitten by this same bug over the weekend (PG 8.1.3 / Windows
Server 2003).  The exact error was:

FATAL:  semctl(170688872, 6, SETVAL, 0) failed: A non-blocking socket
operation could not be completed immediately.

The start of the errors corresponded to a nightly "vacuum analyze" run
(both Saturday and Sunday).  Things appeared to clear up after the
"vacuum analyze" completed.

One thing of note is that the semctl parameters were identical across
both nights (and a smaller incident Monday morning).  The Monday morning
occurrence was also somewhat odd in that not much should have been going
on then.  Also, three other servers which faced identical update/insert
transaction streams did not have any trouble.  The select load might
have been higher on the server that failed though.

One question I had:

In src/backend/port/win32/sema.c, semctl() is implemented in terms of a
call to semop().  However, the man page for semctl() doesn't list EAGAIN
and EINTR as possible error returns, whereas for semop() it does.  Is
that just a mistake in the man page or a problem with the Win32
emulation call?
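
On the man-page question, one general point can be shown in a few lines: a function layered on top of another one inherits that function's failure modes, whatever its own documentation lists. The toy below is not the code in sema.c; toy_sem, toy_trywait and toy_lower_by are invented names, and the counter merely stands in for a real semaphore.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy semaphore: a plain counter standing in for a real kernel object. */
struct toy_sem
{
    int count;
};

/*
 * Toy non-blocking decrement, loosely modelled on semop() with IPC_NOWAIT:
 * when the count is already zero it fails with EAGAIN instead of waiting.
 */
static int
toy_trywait(struct toy_sem *s)
{
    assert(s != NULL);
    if (s->count == 0)
    {
        errno = EAGAIN;
        return -1;
    }
    s->count--;
    return 0;
}

/*
 * Toy control call built on toy_trywait(): lower the count by n.
 * Nothing here sets EAGAIN directly, yet the caller can still see it,
 * because a failure of the inner call is passed straight through.
 */
static int
toy_lower_by(struct toy_sem *s, int n)
{
    while (n-- > 0)
    {
        if (toy_trywait(s) < 0)
            return -1;          /* errno left as toy_trywait() set it */
    }
    return 0;
}
```

Under that assumption, an EAGAIN coming out of the emulated semctl() would not necessarily contradict the man page; it would simply be the inner call's error showing through.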

(See also
http://archives.postgresql.org/pgsql-bugs/2006-02/msg00233.php )

I'm afraid we're in the same category as everyone else with no good way
to reproduce the bug, but is there anything else we could do if this
happens again?

Pete

Re: BUG #2371: database crashes with semctl failed error

From
"Qingqing Zhou"
Date:
""Peter Brant"" <Peter.Brant@wicourts.gov> wrote
>
> I'm afraid we're in the same category as everyone else with no good way
> to reproduce the bug, but is there anything else we could do if this
> happens again?
>

There is a "Win32 semaphore patch" in the patch list, but we lack evidence
of its usefulness. If you could apply it to your *test* server (8.0.* and
8.1.* are both fine), it would be very helpful to see the result.

Regards,
Qingqing

Re: BUG #2371: database crashes with semctl failed

From
"Peter Brant"
Date:
Sure.

I should note that we're moving to Linux for our production servers so
our interest in the Windows port is waning (at least for the time
being).  In particular, the stuck WAL segment rename problem has
occasionally been rather a pain in the neck.

As long as we still have Windows test servers around though, it's easy
enough to apply a patch and load up the database to see if anything
interesting happens.

Pete

Re: BUG #2371: database crashes with semctl failed

From
"Peter Brant"
Date:
With the patch applied, I let an in-house stress test run for several
hours and it completed without incident.  I also ran two runs of pgbench
with 50 connections x 1000 transactions and one run of 50 connections x
5000 transactions.  All completed successfully.  (Test server is a dual
Xeon with HyperThreading enabled, Windows Server 2003, PG 8.1.3).

Pete