Discussion: BUG #2371: database crashes with semctl failed error
The following bug has been logged online:

Bug reference: 2371
Logged by: Brock Peabody
Email address: brock.peabody@npcinternational.com
PostgreSQL version: 8.1
Operating system: Windows Server 2003
Description: database crashes with semctl failed error
Details:

The full text of the error message is:

FATAL: semctl(167894456, 4, SETVAL, 0) failed: A non-blocking socket operation could not be completed immediately.

I have a program that inserts/updates/deletes a large number of records to a large database (~100 GB). The program works, but after it runs for a few hours it starts getting this error message. Once this happens the database is unusable to all clients until it is restarted.

I found this documented bug:

http://archives.postgresql.org/pgsql-bugs/2006-02/msg00185.php

but I'm not sure if it is related to my problem or not.
""Brock Peabody"" <brock.peabody@npcinternational.com> wrote > > FATAL: semctl(167894456, 4, SETVAL, 0) failed: A non-blocking socket > operation could not be completed immediately. > Can you reliablly reproduce the problem? If so, we may come up with a testing patch to it. We encounter similar problems before but it is hard to reproduce. Magnus? As Bruce suggested, we can plug in a check-EINTR-loop here in semctl(): /* Quickly lock/unlock the semaphore (if we can) */ if (semop(semId, &sops, 1) < 0) return -1; Regards, Qingqing
> -----Original Message-----
> From: pgsql-bugs-owner@postgresql.org On Behalf Of Qingqing Zhou
> Sent: Wednesday, April 05, 2006 6:33 AM
> To: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #2371: database crashes with semctl failed error

> Can you reliably reproduce the problem?

I can here :). I'm trying to figure out a way for someone to repeat it outside my environment, but I'm afraid it's got something to do with timing. I have 50 threads that are collecting data. If I give each one its own connection to the database, the problem happens quickly. If they all share one connection, the problem does not happen.

> If so, we may come up with a testing patch to it. We encounter similar
> problems before but it is hard to reproduce.

Do you think this is a Windows-only problem?

Thanks,
Brock
"Brock Peabody" <brock.peabody@npcinternational.com> writes: >> Can you reliablly reproduce the problem? > I can here :). I'm trying to figure out a way for someone to repeat it > outside my environment but I'm afraid it's got something to do with > timing. I have 50 threads that are collecting data. If I give each one > its own connection to the database the problem happens quickly. If they > all share one connection the problem does not happen. Perhaps you could whittle down your app into a testbed that just sends dummy data with about the same timing as the real app? regards, tom lane
> On Behalf Of Tom Lane

> Perhaps you could whittle down your app into a testbed that just sends
> dummy data with about the same timing as the real app?

I think I'm starting to get a better understanding of the problem. It looks like one of the threads is trying to insert a pathological number of records (~1,800,000) into a table in one transaction while the other threads are also reading from and writing to that table. I'll try to get something simple to reproduce this behavior.
""Brock Peabody"" <brock.peabody@npcinternational.com> wrote > > Do you think this is a Windows only problem? > I am afraid so. We have received 3 reports of this (or quite similar) problem, all in 8.1/windows. I just noticed that yours is actually an EAGAIN error, so the loop patch in semctl() doesn't work I guess :-(. If you can find an easier way to reproduce it outside your environment, that's very sweet. Regards, Qingqing
Hi all,

We were bitten by this same bug over the weekend (PG 8.1.3 / Windows Server 2003). The exact error was:

FATAL: semctl(170688872, 6, SETVAL, 0) failed: A non-blocking socket operation could not be completed immediately.

The start of the errors corresponded to a nightly "vacuum analyze" run (both Saturday and Sunday). Things appeared to clear up after the "vacuum analyze" completed.

One thing of note is that the semctl parameters were identical across both nights (and a smaller incident Monday morning). The Monday morning occurrence was also somewhat odd in that not much should have been going on then. Also, three other servers which faced identical update/insert transaction streams did not have any trouble. The select load might have been higher on the server that failed, though.

One question I had: In src/backend/port/win32/sema.c, semctl() is implemented in terms of a call to semop(). However, the man page for semctl() doesn't list EAGAIN and EINTR as possible error returns, whereas the one for semop() does. Is that just a mistake in the man page or a problem with the Win32 emulation call?

(See also http://archives.postgresql.org/pgsql-bugs/2006-02/msg00233.php )

I'm afraid we're in the same category as everyone else with no good way to reproduce the bug, but is there anything else we could do if this happens again?

Pete
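The question about error returns can be illustrated with a toy model. The sketch below is not the code in src/backend/port/win32/sema.c; `toy_sem`, `toy_semop_nowait`, `toy_semctl`, and `TOY_SETVAL` are hypothetical names, and the logic is an assumption made only to show the general point: when a semctl-style wrapper is built on a non-blocking semop-style call, the wrapper can surface the callee's EAGAIN even though a native semctl() is not documented to return it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical command code and semaphore type, for illustration only. */
enum { TOY_SETVAL = 1 };

typedef struct
{
    int count;                  /* current semaphore value */
} toy_sem;

/*
 * Non-blocking semop-like call: apply delta to the count, but fail with
 * EAGAIN instead of waiting if the count would drop below zero.
 */
int
toy_semop_nowait(toy_sem *s, int delta)
{
    assert(s != NULL);

    if (s->count + delta < 0)
    {
        errno = EAGAIN;
        return -1;
    }
    s->count += delta;
    return 0;
}

/*
 * semctl-like wrapper. For a SETVAL to a value below 1 it delegates to
 * the non-blocking decrement, so it inherits that call's error set.
 */
int
toy_semctl(toy_sem *s, int cmd, int val)
{
    if (cmd != TOY_SETVAL)
    {
        errno = EINVAL;
        return -1;
    }
    if (val < 1)
    {
        if (toy_semop_nowait(s, -1) < 0)
            return -1;          /* errno is still EAGAIN here */
    }
    return 0;
}
```

In this model, calling `toy_semctl(&s, TOY_SETVAL, 0)` on a semaphore whose count is already zero fails with EAGAIN, which is an error the caller of a plain semctl() would not normally expect to handle.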
""Peter Brant"" <Peter.Brant@wicourts.gov> wrote > > I'm afraid we're in the same category as everyone else with no good way > to reproduce the bug, but is there anything else we could do if this > happens again? > There is a "Win32 semaphore patch" in the patch list, but we are lack of evidence to prove its usefulness. If you can try to apply it to your *test* server (8.0.*, 8.1.* are all ok), that would be very helpful to see the result. Regards, Qingqing
Sure. I should note that we're moving to Linux for our production servers, so our interest in the Windows port is waning (at least for the time being). In particular, the stuck WAL segment rename problem has occasionally been rather a pain in the neck.

As long as we still have Windows test servers around though, it's easy enough to apply a patch and load up the database to see if anything interesting happens.

Pete

>>> "Qingqing Zhou" <zhouqq@cs.toronto.edu> 04/25/06 7:01 am >>>
There is a "Win32 semaphore patch" in the patch list, but we are lack of evidence to prove its usefulness. If you can try to apply it to your *test* server (8.0.*, 8.1.* are all ok), that would be very helpful to see the result.
With the patch applied, I let an in-house stress test run for several hours and it completed without incident. I also ran two runs of pgbench with 50 connections x 1000 transactions and one run of 50 connections x 5000 transactions. All completed successfully.

(Test server is a dual Xeon with HyperThreading enabled, Windows Server 2003, PG 8.1.3.)

Pete