Thread: pgAgent crashes on failed connection
I created 100 identical pgAgent jobs, each with a single step that simply
does "SELECT pg_sleep(10)". I then forced them all to run immediately, with
"UPDATE pgagent.pga_job SET jobnextrun=now();". pgAgent crashed.

What happened is that when all those jobs were launched at the same time,
the server ran into the max_connections limit, and pgAgent didn't handle
that too well. The JobThread::JobThread constructor does not check for a
NULL result from DBConn::Get(), and passes a NULL connection to Job::Job,
which tries to dereference it, leading to a segfault.

I propose the attached patch.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Attachments
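The patch itself is only in the attachment, which isn't reproduced here. As
a rough sketch of the fix being described, using the names from the report
(JobThread, DBConn::Get, Job) but with assumed member names (m_jobid, m_job,
m_runnable) and an assumed connection parameter (serviceDBname), the
constructor could guard against the NULL result like this:

// A sketch, not the attached patch: tolerate DBConn::Get() returning
// NULL instead of handing a NULL connection to Job::Job.
// Member names and connection parameters are assumptions.
JobThread::JobThread(const wxString &jid)
    : wxThread(wxTHREAD_DETACHED), m_jobid(jid), m_job(NULL), m_runnable(false)
{
    DBConn *threadConn = DBConn::Get(serviceDBname, m_jobid);

    if (threadConn == NULL)
    {
        // The connection attempt failed, e.g. because the server hit
        // max_connections. Log it and leave the thread marked as not
        // runnable; the caller checks the flag and deletes the thread
        // instead of letting Job::Job dereference a NULL pointer.
        wxLogError(wxT("Failed to connect to the database for job %s"),
                   m_jobid.c_str());
        return;
    }

    m_job = new Job(threadConn, m_jobid);
    m_runnable = true;
}

Leaving the thread non-runnable turns a transient max_connections failure
into a skipped run rather than a segfault.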
On Thu, Aug 4, 2011 at 2:19 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I created 100 identical pgAgent jobs, each with a single step that simply
> does "SELECT pg_sleep(10)". I then forced them all to run immediately,
> with "UPDATE pgagent.pga_job SET jobnextrun=now();". pgAgent crashed.
>
> What happened is that when all those jobs were launched at the same time,
> the server ran into the max_connections limit, and pgAgent didn't handle
> that too well. The JobThread::JobThread constructor does not check for a
> NULL result from DBConn::Get(), and passes a NULL connection to Job::Job,
> which tries to dereference it, leading to a segfault.
>
> I propose the attached patch.

Hm, in the event that happens, is it logged on the client somehow?
Wouldn't you want to throw an exception or something like that?

merlin
On Thu, Aug 4, 2011 at 8:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Aug 4, 2011 at 2:19 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> What happened is that when all those jobs were launched at the same time,
>> the server ran into the max_connections limit, and pgAgent didn't handle
>> that too well. The JobThread::JobThread constructor does not check for a
>> NULL result from DBConn::Get(), and passes a NULL connection to Job::Job,
>> which tries to dereference it, leading to a segfault.
>>
>> I propose the attached patch.
>
> Hm, in the event that happens, is it logged on the client somehow?
> Wouldn't you want to throw an exception or something like that?

I think the most straightforward way to handle this is to dump an
error into pgagent.pga_joblog when deleting the thread. It might be a
little ugly to pass the original error message back rather than a
generic one, though. Can you take a look, Heikki?

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
(Sorry for the late reply, this fell through the cracks..)

On 10.08.2011 11:48, Dave Page wrote:
> On Thu, Aug 4, 2011 at 8:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Hm, in the event that happens, is it logged on the client somehow?
>> Wouldn't you want to throw an exception or something like that?
>
> I think the most straightforward way to handle this is to dump an
> error into pgagent.pga_joblog when deleting the thread. It might be a
> little ugly to pass the original error message back rather than a
> generic one, though. Can you take a look, Heikki?

You mean something like the attached? Works for me, but inserting an
entry in joblog for each failed attempt might create a lot of entries
there if the problem persists.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Attachments
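Again, the actual change is only in the attachment. A sketch of what Dave
suggested and Heikki describes, writing a row to pgagent.pga_joblog on the
agent's service connection before the failed thread is deleted; serviceConn,
the helper name, and the use of the 'i' status code are assumptions here,
though Merlin's reply below suggests the patch does call conn->Execute:

// A sketch, not the applied patch: when the job thread couldn't get a
// connection, make the failure visible in the job history instead of
// silently dropping the run. serviceConn and the member names are
// assumptions for illustration.
void JobThread::LogConnectionFailure()
{
    // 'i' is taken here to mean an internal error, alongside pgAgent's
    // other single-letter statuses ('r' running, 's' success, 'f' failed).
    serviceConn->Execute(
        wxT("INSERT INTO pgagent.pga_joblog (jlgjobid, jlgstatus, jlgstart) ")
        wxT("VALUES (") + m_jobid + wxT(", 'i', now())"));

    wxLogError(wxT("Failed to create a connection for job %s"),
               m_jobid.c_str());
}

As Heikki notes, if the server stays pinned at max_connections, every retry
adds another joblog row, so the table can grow quickly while the condition
persists.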
On Wed, Sep 28, 2011 at 2:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> (Sorry for the late reply, this fell through the cracks..)
>
> On 10.08.2011 11:48, Dave Page wrote:
>> I think the most straightforward way to handle this is to dump an
>> error into pgagent.pga_joblog when deleting the thread. It might be a
>> little ugly to pass the original error message back rather than a
>> generic one, though. Can you take a look, Heikki?
>
> You mean something like the attached? Works for me, but inserting an
> entry in joblog for each failed attempt might create a lot of entries
> there if the problem persists.

Is it correct behavior to throw an unhandled SQL exception if the
logging query fails?

merlin
On 28.09.2011 20:24, Merlin Moncure wrote:
> Is it correct behavior to throw an unhandled SQL exception if the
> logging query fails?

Sorry, I didn't understand that. I don't see any exceptions being thrown.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Wed, Sep 28, 2011 at 3:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Sorry, I didn't understand that. I don't see any exceptions being thrown.

Whoops! Never mind -- I assumed conn->Execute could throw an exception.

merlin
On Wed, Sep 28, 2011 at 8:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> You mean something like the attached? Works for me, but inserting an
> entry in joblog for each failed attempt might create a lot of entries
> there if the problem persists.

Yes, like that :-). Thanks - patch applied.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company