Re: strange parallel query behavior after OOM crashes

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: strange parallel query behavior after OOM crashes
Дата
Msg-id 73b84052-5dfe-0c40-4212-cfa96a1c6027@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: strange parallel query behavior after OOM crashes  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
Ответы Re: strange parallel query behavior after OOM crashes
Re: strange parallel query behavior after OOM crashes
Список pgsql-hackers
On 04/04/2017 06:52 PM, Robert Haas wrote:
> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>> <kuntalghosh.2007@gmail.com> wrote:
>>>> 2. the server restarts automatically, initialize
>>>> BackgroundWorkerData->parallel_register_count and
>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>> parallel_terminate_count.
>>>
>>> Hmm.  So this seems like the root of the problem.  Presumably those
>>> things need to be reset AFTER forgetting any background workers from
>>> before the crash.
>>>
>> IMHO, the fix would be not to increase the terminated parallel worker
>> count whenever ForgetBackgroundWorker is called due to a bgworker
>> crash. I've attached a patch for the same. PFA.
>
> While I'm not opposed to that approach, I don't think this is a good
> way to implement it.  If you want to pass an explicit flag to
> ForgetBackgroundWorker telling it whether or not it should performing
> the increment, fine.  But with what you've got here, you're
> essentially relying on "spooky action at a distance".  It would be
> easy for future code changes to break this, not realizing that
> somebody's got a hard dependency on 0 having a specific meaning.
>

I'm probably missing something, but I don't quite understand how these 
values actually survive the crash. I mean, what I observed is OOM 
followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply 
reset the values back to 0? Or do we call ForgetBackgroundWorker() after 
the crash for some reason?


In any case, the comment right before BackgroundWorkerArray says this:
 * These counters can of course overflow, but it's not important here * since the subtraction will still give the right
number.

which means that this assert

+    Assert(BackgroundWorkerData->parallel_register_count >=
+        BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple 
comparisons of these two counters, no?

> BTW, if this isn't on the open items list, it should be.  It's
> presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.
>

Unrelated, but perhaps we should also add this to open items:

https://www.postgresql.org/message-id/flat/57bbc52c-5d40-bb80-2992-7e1027d0b67c%402ndquadrant.com

(although that's more a 9.6 material).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavan Deolasee
Дата:
Сообщение: Re: Patch: Write Amplification Reduction Method (WARM)
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Compiler warning in costsize.c