Discussion: Query running for very long time (server hanged) with parallel append


Query running for very long time (server hanged) with parallel append

From:
Rajkumar Raghuwanshi
Date:
Hi,

I am getting a server-hang kind of issue with the setup below. The issue may occur while running the query below a single time or multiple times (it is random), and I am not getting the terminal back even after cancelling the query.
The query and its EXPLAIN output are given below.

SET enable_hashjoin TO off;
SET enable_nestloop TO off;
SET enable_seqscan TO off;
SET parallel_setup_cost=0;
SET parallel_tuple_cost=0;
SET max_parallel_workers_per_gather=4;
SET enable_parallel_append = on;
SET enable_partition_wise_join TO true;

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0003', '0004', '0010');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0001', '0005', '0002', '0009');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN ('0006', '0007', '0008', '0011');
INSERT INTO plt1 SELECT i, i, to_char(i%12, 'FM0000') FROM generate_series(0, 599, 2) i;

CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0000', '0003', '0004', '0010');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0001', '0005', '0002', '0009');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0006', '0007', '0008', '0011');
INSERT INTO plt2 SELECT i, i, to_char(i%12, 'FM0000') FROM generate_series(0, 599, 3) i;

CREATE INDEX iplt1_p1_c on plt1_p1(c);
CREATE INDEX iplt1_p2_c on plt1_p2(c);
CREATE INDEX iplt1_p3_c on plt1_p3(c);
CREATE INDEX iplt2_p1_c on plt2_p1(c);
CREATE INDEX iplt2_p2_c on plt2_p2(c);
CREATE INDEX iplt2_p3_c on plt2_p3(c);

ANALYZE plt1;
ANALYZE plt2;

EXPLAIN (COSTS OFF) SELECT DISTINCT t1.c,count(*) FROM plt1 t1 LEFT JOIN LATERAL (SELECT t2.c AS t2c, t3.c AS t3c, least(t1.c,t2.c,t3.c) FROM plt1 t2 JOIN plt2 t3 ON (t2.c = t3.c)) ss ON t1.c = ss.t2c WHERE t1.a % 25 = 0 GROUP BY 1 ORDER BY 1,2;
                                                     QUERY PLAN                                                    
--------------------------------------------------------------------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: t1.c, (count(*))
         ->  Finalize GroupAggregate
               Group Key: t1.c
               ->  Sort
                     Sort Key: t1.c
                     ->  Gather
                           Workers Planned: 2
                           ->  Partial HashAggregate
                                 Group Key: t1.c
                                 ->  Parallel Append
                                       ->  Merge Right Join
                                             Merge Cond: (t2.c = t1.c)
                                             ->  Merge Join
                                                   Merge Cond: (t3.c = t2.c)
                                                   ->  Index Only Scan using iplt2_p1_c on plt2_p1 t3
                                                   ->  Materialize
                                                         ->  Index Only Scan using iplt1_p1_c on plt1_p1 t2
                                             ->  Materialize
                                                   ->  Index Scan using iplt1_p1_c on plt1_p1 t1
                                                         Filter: ((a % 25) = 0)
                                       ->  Merge Left Join
                                             Merge Cond: (t1_2.c = t2_2.c)
                                             ->  Parallel Index Scan using iplt1_p3_c on plt1_p3 t1_2
                                                   Filter: ((a % 25) = 0)
                                             ->  Materialize
                                                   ->  Merge Join
                                                         Merge Cond: (t3_2.c = t2_2.c)
                                                         ->  Index Only Scan using iplt2_p3_c on plt2_p3 t3_2
                                                         ->  Materialize
                                                               ->  Index Only Scan using iplt1_p3_c on plt1_p3 t2_2
                                       ->  Merge Left Join
                                             Merge Cond: (t1_1.c = t2_1.c)
                                             ->  Parallel Index Scan using iplt1_p2_c on plt1_p2 t1_1
                                                   Filter: ((a % 25) = 0)
                                             ->  Materialize
                                                   ->  Merge Join
                                                         Merge Cond: (t2_1.c = t3_1.c)
                                                         ->  Index Only Scan using iplt1_p2_c on plt1_p2 t2_1
                                                         ->  Index Only Scan using iplt2_p2_c on plt2_p2 t3_1
(41 rows)


SELECT DISTINCT t1.c,count(*) FROM plt1 t1 LEFT JOIN LATERAL (SELECT t2.c AS t2c, t3.c AS t3c, least(t1.c,t2.c,t3.c) FROM plt1 t2 JOIN plt2 t3 ON (t2.c = t3.c)) ss ON t1.c = ss.t2c WHERE t1.a % 25 = 0 GROUP BY 1 ORDER BY 1,2;
(no further output; the query "hanged")


Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: Query running for very long time (server hanged) with parallel append

From:
Amit Khandekar
Date:
Thanks Rajkumar for catching this. I will have a look ...

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Query running for very long time (server hanged) with parallel append

From:
Thomas Munro
Date:
On Thu, Feb 1, 2018 at 11:29 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> I am getting server hang kind of issue with the below postgres.conf setup.
> Issue may occur while running below query single/multiple times (random).
> Not getting terminal back even after cancelling query.
> explain output and query is given below.

Whatever logic bug might be causing the query to hang, it's not good
that we're unable to SIGINT/SIGTERM our way out of this state.  See
also this other bug report for a known problem (already fixed but not
yet released), but which came with an extra complaint, as yet
unexplained, that the query couldn't be interrupted:

https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Query running for very long time (server hanged) with parallel append

From:
Amit Khandekar
Date:
On 2 February 2018 at 03:50, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Whatever logic bug might be causing the query to hang, it's not good
> that we're unable to SIGINT/SIGTERM our way out of this state.  See
> also this other bug report for a known problem (already fixed but not
> yet released), but which came with an extra complaint, as yet
> unexplained, that the query couldn't be interrupted:
>
> https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org

Yeah, it is not good that there is no response to the SIGINT.

The query is actually hanging because one of the workers is in a small
loop where it iterates over the subplans searching for unfinished
plans, and it never comes out of the loop (it's a bug which I have yet
to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
each iteration; it's a small loop that does not pass control to any
other functions.

But I am not sure about this: while the workers are at it, why does
the backend that is waiting for the workers not come out of the wait
state on a SIGINT? I guess the same issue has been discussed in the
mail thread that you pointed to.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Query running for very long time (server hanged) with parallel append

From:
Robert Haas
Date:
On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 2 February 2018 at 03:50, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Whatever logic bug might be causing the query to hang, it's not good
>> that we're unable to SIGINT/SIGTERM our way out of this state.  See
>> also this other bug report for a known problem (already fixed but not
>> yet released), but which came with an extra complaint, as yet
>> unexplained, that the query couldn't be interrupted:
>>
>> https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org
>
> Yeah, it is not good that there is no response to the SIGINT.
>
> The query is actually hanging because one of the workers is in a small
> loop where it iterates over the subplans searching for unfinished
> plans, and it never comes out of the loop (it's a bug which I am yet
> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
> each iteration; it's a small loop that does not pass control to any
> other functions.

Uh, sounds like we'd better fix that bug.

> But I am not sure about this : while the workers are at it, why the
> backend that is waiting for the workers does not come out of the wait
> state with a SIGINT. I guess the same issue has been discussed in the
> mail thread that you pointed.

Is it getting stuck here?

    /*
     * We can't finish transaction commit or abort until all of the workers
     * have exited.  This means, in particular, that we can't respond to
     * interrupts at this stage.
     */
    HOLD_INTERRUPTS();
    WaitForParallelWorkersToExit(pcxt);
    RESUME_INTERRUPTS();

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Query running for very long time (server hanged) with parallel append

From:
David Kohn
Date:
Please forgive my inexperience with the codebase, but as the person who reported this bug: https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org#151724453314.1238.409882538067070269@wrigleys.postgresql.org, I thought I'd follow your hints, as it's causing some major issues for me. Here are some notes on what is happening for me, and some (possibly silly) thoughts on why:

On Fri, Feb 2, 2018 at 10:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
Is it getting stuck here?

    /*
     * We can't finish transaction commit or abort until all of the workers
     * have exited.  This means, in particular, that we can't respond to
     * interrupts at this stage.
     */
    HOLD_INTERRUPTS();
    WaitForParallelWorkersToExit(pcxt);
    RESUME_INTERRUPTS();
I am seeing unkillable queries with the client backend in the IPC / BgWorkerShutdown wait event. As far as I can tell, that can only happen in bgworker.c at WaitForBackgroundWorkerShutdown, which is called by parallel.c at WaitForParallelWorkersToExit inside DestroyParallelContext. That path should also be taken when there is a statement timeout (which I think is happening in at least some of my cases), so it would make sense that this is where the problem is.

My background workers are in the IPC / MessageQueuePutMessage wait event, which appears to be reachable only from pqmq.c at mq_putmessage. Directly following the WaitLatch there is a CHECK_FOR_INTERRUPTS(), so if a worker is waiting on that latch and never reaches the interrupt check, that would explain things. It also appears that the worker sends a signal to the leader process a few lines before starting to wait, which is supposed to tell the leader to come read messages off the queue. If the leader gets to WaitForParallelWorkersToExit at the wrong time and ends up waiting on that event, I could see how both would end up waiting for each other and never finishing.

The thing is that DestroyParallelContext seems to detach from the queues, but if the worker hit the wait step before the leader detached from the queue, does it have any way of knowing that?

Anyway, I'm entirely unsure of my analysis here, but I thought I'd offer something to help speed this along.

Best,
David Kohn

 

Re: Query running for very long time (server hanged) with parallel append

From:
Amit Khandekar
Date:
On 2 February 2018 at 20:46, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> The query is actually hanging because one of the workers is in a small
>> loop where it iterates over the subplans searching for unfinished
>> plans, and it never comes out of the loop (it's a bug which I am yet
>> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
>> each iteration; it's a small loop that does not pass control to any
>> other functions .
>
> Uh, sounds like we'd better fix that bug.


The scenario is this: one of the workers, w1, hasn't yet chosen the
first plan, and all the plans are already finished. So w1 has its
node->as_whichplan equal to -1, and the below condition in
choose_next_subplan_for_worker() never becomes true for this worker:

if (pstate->pa_next_plan == node->as_whichplan)
{
   /* We've tried everything! */
   pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
   LWLockRelease(&pstate->pa_lock);
   return false;
}

What I think is: we should remember which plan we started the search
with, before the loop starts. So we save the starting plan value like
this, before the loop starts:
   initial_plan = pstate->pa_next_plan;

And then break out of the loop when we come back to this initial plan:
if (initial_plan == pstate->pa_next_plan)
   break;

Now, suppose it so happens that initial_plan is a non-partial plan.
Then, when we wrap around to the first partial plan, the check
(initial_plan == pstate->pa_next_plan) will never become true, because
initial_plan is less than first_partial_plan.

So what I have done in the patch is: have a flag 'should_wrap_around'
indicating whether we should wrap around. This flag is false when
initial_plan is a non-partial plan, in which case we know that we will
have covered all plans by the time we reach the last plan. So, at the
wrap-around point, we break out of the loop if this flag is false, or
otherwise once we have come back to the initial plan.

Attached is a patch that fixes this issue on the above lines.

>
>> But I am not sure about this : while the workers are at it, why the
>> backend that is waiting for the workers does not come out of the wait
>> state with a SIGINT. I guess the same issue has been discussed in the
>> mail thread that you pointed.
>
> Is it getting stuck here?
>
>     /*
>      * We can't finish transaction commit or abort until all of the workers
>      * have exited.  This means, in particular, that we can't respond to
>      * interrupts at this stage.
>      */
>     HOLD_INTERRUPTS();
>     WaitForParallelWorkersToExit(pcxt);
>     RESUME_INTERRUPTS();

Actually the backend is getting stuck in
choose_next_subplan_for_leader(), in LWLockAcquire(), waiting for the
hanging worker to release pstate->pa_lock. I think there is nothing
wrong with this in itself, because an LWLock wait is assumed to be
very short; it is only because of this bug that the lwlock waits
forever.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments

Re: Query running for very long time (server hanged) with parallel append

From:
Kyotaro HORIGUCHI
Date:
At Mon, 5 Feb 2018 15:29:27 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in
<CAJ3gD9eFR8=kxjPYBEHe34uT9+RYET0VbhGEfSt79eZx3L9E1Q@mail.gmail.com>
> On 2 February 2018 at 20:46, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >> The query is actually hanging because one of the workers is in a small
> >> loop where it iterates over the subplans searching for unfinished
> >> plans, and it never comes out of the loop (it's a bug which I am yet
> >> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
> >> each iteration; it's a small loop that does not pass control to any
> >> other functions.
> >
> > Uh, sounds like we'd better fix that bug.
> 
> 
> The scenario is this : One of the workers w1 hasn't yet chosen the
> first plan, and all the plans are already finished. So w1 has its
> node->as_whichplan equal to -1. So the below condition in
> choose_next_subplan_for_worker() never becomes true for this worker :
> 
> if (pstate->pa_next_plan == node->as_whichplan)
> {
>    /* We've tried everything! */
>    pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
>    LWLockRelease(&pstate->pa_lock);
>    return false;
> }
> 
> What I think is : we should save the information about which plan we
> started the search with, before the loop starts. So, we save the
> starting plan value like this, before the loop starts:
>    initial_plan = pstate->pa_next_plan;
> 
> And then break out of the loop when we come back to this initial plan:
> if (initial_plan == pstate->pa_next_plan)
>    break;
> 
> Now, suppose it so happens that initial_plan is a non-partial plan.
> And then when we wrap around to the first partial plan, we check
> (initial_plan == pstate->pa_next_plan) which will never become true
> because initial_plan is less than first_partial_plan.
> 
> So what I have done in the patch is : have a flag 'should_wrap_around'
> indicating that we should not wrap around. This flag is true when
> initial_plan is a non-partial plan, in which case we know that we will
> have covered all plans by the time we reach the last plan. So break
> from the loop if this flag is false, or if we have reached the initial
> plan.
> 
> Attached is a patch that fixes this issue on the above lines.

The patch adds two new variables and always sets them, but wrap-around
can happen at most once per call, so I think it is enough to arrange
to stop at the wrap-around point. In addition, the patch overlooks the
case where there is no partial plan; the loop can overrun pa_finished
in that case.

I think something like the following would work.

@@ -485,6 +485,13 @@ choose_next_subplan_for_worker(AppendState *node)
     {
       /* Loop back to first partial plan. */
       pstate->pa_next_plan = append->first_partial_plan;
+
+      /*
+       * Arrange to bail out if we are trying to fetch the first partial
+       * plan
+       */
+      if (node->as_whichplan < append->first_partial_plan)
+        node->as_whichplan = append->first_partial_plan;
     }
     else

> >> But I am not sure about this : while the workers are at it, why the
> >> backend that is waiting for the workers does not come out of the wait
> >> state with a SIGINT. I guess the same issue has been discussed in the
> >> mail thread that you pointed.
> >
> > Is it getting stuck here?
> >
> >     /*
> >      * We can't finish transaction commit or abort until all of the workers
> >      * have exited.  This means, in particular, that we can't respond to
> >      * interrupts at this stage.
> >      */
> >     HOLD_INTERRUPTS();
> >     WaitForParallelWorkersToExit(pcxt);
> >     RESUME_INTERRUPTS();
> 
> Actually the backend is getting stuck in
> choose_next_subplan_for_leader(), in LWLockAcquire(), waiting for the
> hanging worker to release the pstate->pa_lock. I think there is
> nothing wrong in this, because it is assumed that LWLock wait is going
> to be for very short time, and because of this bug, the lwlock waits
> forever.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Query running for very long time (server hanged) with parallel append

From:
Kyotaro HORIGUCHI
Date:
At Tue, 06 Feb 2018 13:34:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180206.133419.02213593.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 5 Feb 2018 15:29:27 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in
<CAJ3gD9eFR8=kxjPYBEHe34uT9+RYET0VbhGEfSt79eZx3L9E1Q@mail.gmail.com>
> > On 2 February 2018 at 20:46, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > >> The query is actually hanging because one of the workers is in a small
> > >> loop where it iterates over the subplans searching for unfinished
> > >> plans, and it never comes out of the loop (it's a bug which I am yet
> > >> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
> > >> each iteration; it's a small loop that does not pass control to any
> > >> other functions.
> > >
> > > Uh, sounds like we'd better fix that bug.
> > 
> > 
> > The scenario is this : One of the workers w1 hasn't yet chosen the
> > first plan, and all the plans are already finished. So w1 has its
> > node->as_whichplan equal to -1. So the below condition in
> > choose_next_subplan_for_worker() never becomes true for this worker :
> > 
> > if (pstate->pa_next_plan == node->as_whichplan)
> > {
> >    /* We've tried everything! */
> >    pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
> >    LWLockRelease(&pstate->pa_lock);
> >    return false;
> > }
> > 
> > What I think is : we should save the information about which plan we
> > started the search with, before the loop starts. So, we save the
> > starting plan value like this, before the loop starts:
> >    initial_plan = pstate->pa_next_plan;
> > 
> > And then break out of the loop when we come back to this initial plan:
> > if (initial_plan == pstate->pa_next_plan)
> >    break;
> > 
> > Now, suppose it so happens that initial_plan is a non-partial plan.
> > And then when we wrap around to the first partial plan, we check
> > (initial_plan == pstate->pa_next_plan) which will never become true
> > because initial_plan is less than first_partial_plan.
> > 
> > So what I have done in the patch is : have a flag 'should_wrap_around'
> > indicating that we should not wrap around. This flag is true when
> > initial_plan is a non-partial plan, in which case we know that we will
> > have covered all plans by the time we reach the last plan. So break
> > from the loop if this flag is false, or if we have reached the initial
> > plan.
> > 
> > Attached is a patch that fixes this issue on the above lines.
> 
> The patch adds two new variables and always sets them but wrap
> around can happen at most once per call so I think it is enough
> to arrange to stop at the wrap around time. Addition to that the
> patch is forgetting the case of no partial plan. The loop can
> overrun on pa_finished in the case.

Sorry, the patch works fine. But the new variables still don't
seem needed.

> I think something like the following would work.
> 
> @@ -485,6 +485,13 @@ choose_next_subplan_for_worker(AppendState *node)
>      {
>        /* Loop back to first partial plan. */
>        pstate->pa_next_plan = append->first_partial_plan;
> +
> +      /*
> +       * Arrange to bail out if we are trying to fetch the first partial
> +       * plan
> +       */
> +      if (node->as_whichplan < append->first_partial_plan)
> +        node->as_whichplan = append->first_partial_plan;
>      }
>      else
> 
> > >> But I am not sure about this : while the workers are at it, why the
> > >> backend that is waiting for the workers does not come out of the wait
> > >> state with a SIGINT. I guess the same issue has been discussed in the
> > >> mail thread that you pointed.
> > >
> > > Is it getting stuck here?
> > >
> > >     /*
> > >      * We can't finish transaction commit or abort until all of the workers
> > >      * have exited.  This means, in particular, that we can't respond to
> > >      * interrupts at this stage.
> > >      */
> > >     HOLD_INTERRUPTS();
> > >     WaitForParallelWorkersToExit(pcxt);
> > >     RESUME_INTERRUPTS();
> > 
> > Actually the backend is getting stuck in
> > choose_next_subplan_for_leader(), in LWLockAcquire(), waiting for the
> > hanging worker to release the pstate->pa_lock. I think there is
> > nothing wrong in this, because it is assumed that LWLock wait is going
> > to be for very short time, and because of this bug, the lwlock waits
> > forever.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Query running for very long time (server hanged) with parallel append

From:
Amit Khandekar
Date:
On 6 February 2018 at 10:11, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Mon, 5 Feb 2018 15:29:27 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in
>> > Attached is a patch that fixes this issue on the above lines.
>>
>> The patch adds two new variables and always sets them but wrap
>> around can happen at most once per call so I think it is enough
>> to arrange to stop at the wrap around time. Addition to that the
>> patch is forgetting the case of no partial plan. The loop can
>> overrun on pa_finished in the case.
>
> Sorry, the patch works fine. But still the new variables don't
> seem needed.

True, I could have initially set as_whichplan to pa_next_plan, and
used as_whichplan instead of initial_plan. And I could have used the
condition initial_plan >= append->first_partial_plan instead of
should_wrap_around. The reason I used these two variables is that I
thought the logic would be more reader-friendly. (Also,
should_wrap_around is computed from static values, so using this
variable avoids evaluating the condition (initial_plan >=
append->first_partial_plan) at each iteration.)

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Query running for very long time (server hanged) with parallel append

From:
Rajkumar Raghuwanshi
Date:
On Mon, Feb 5, 2018 at 3:29 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a patch that fixes this issue on the above lines.

The patch applied cleanly and works fine for me. The mentioned issue is not reproducible now.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
 

Re: Query running for very long time (server hanged) with parallel append

From:
Kyotaro HORIGUCHI
Date:
At Tue, 6 Feb 2018 11:56:27 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in
<CAJ3gD9fJhKBHn-fx6UzYZeLa0N8WnnZb2BsUTm_5A9kYsJgnww@mail.gmail.com>
> On 6 February 2018 at 10:11, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> At Mon, 5 Feb 2018 15:29:27 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in
> >> > Attached is a patch that fixes this issue on the above lines.
> >>
> >> The patch adds two new variables and always sets them but wrap
> >> around can happen at most once per call so I think it is enough
> >> to arrange to stop at the wrap around time. Addition to that the
> >> patch is forgetting the case of no partial plan. The loop can
> >> overrun on pa_finished in the case.
> >
> > Sorry, the patch works fine. But still the new variables don't
> > seem needed.
> 
> True, I could have set initially as_whichplan to pa_next_plan, and
> used as_whichplan instead of initial_plan. And, I could have used the
> condition initial_plan >= append->first_partial_plan instead of
> should_wrap_around. The reason I used these two variables is because I
> thought  the logic would be more reader-friendly. (Also,
> should_wrap_around uses static values so using this variable avoids
> determining the condition (initial_plan >= append->first_partial_plan)
> at each iteration).

Thanks for the explanation. I'm fine with it. May I make some
comments on the patch?

- It seems to me that the if (!should_wrap_around) block and the else
  block had better be transposed, so as to bring the bail-out code
  closer to the exit-condition check, as in the previous code. (In
  other words, the second block should be if (should_wrap_around),
  not if (!should_wrap_around).)

- The following comment needs a "the" before "first", maybe:
 +            /* Loop back to first partial plan. */


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Query running for very long time (server hanged) with parallel append

From:
Amit Khandekar
Date:
On 6 February 2018 at 16:11, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Thanks for the explanation. I'm fine with it. Well may I make
> some comments on the patch?

Sure, always welcome.

>
> - It seems to me that the if (!should_wrap_around) block and the else
>   block had better be transposed, so as to bring the bail-out code
>   closer to the exit-condition check, as in the previous code. (In
>   other words, the second block should be if (should_wrap_around),
>   not if (!should_wrap_around).)

Yeah, I think it looks equally good that way, and like you said, the
current code does it that way. So in the attached patch, I have
swapped the two conditions.

>
> - The following comment needs a "the" before the "first", maybe.
>  +                      /* Loop back to first partial plan. */

Since we are not changing this line in the patch, I have left it
alone. But the committer can very well make this change, although I am
not sure myself about the grammar.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments

Re: Query running for very long time (server hanged) with parallel append

From:
Robert Haas
Date:
On Tue, Feb 6, 2018 at 11:32 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Yeah, I think it looks equally good that way, and like you said, the
> current code does it that way. So in the attached patch, I have
> swapped the two conditions.

I prefer to avoid introducing 2 new variables and instead just prevent
the looping directly in the case where we started with a non-partial
plan.

See attached.  Does this look OK?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

Re: Query running for very long time (server hanged) with parallel append

From:
Kyotaro HORIGUCHI
Date:
At Tue, 6 Feb 2018 13:50:28 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYqdC+9U8mLYkUgM=CaBt6Pzz4R_YNboqDbW-LvUaHO+g@mail.gmail.com>
> On Tue, Feb 6, 2018 at 11:32 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Yeah, I think it looks equally good that way, and like you said, the
> > current code does it that way. So in the attached patch, I have
> > swapped the two conditions.
> 
> I prefer to avoid introducing 2 new variables and instead just prevent
> the looping directly in the case where we started with a non-partial
> plan.
> 
> See attached.  Does this look OK?

Ah, we can bail out when starting from the first partial plan.
It makes me a bit uneasy that pa_next_plan can be -1, but it looks
perfect to me.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Query running for very long time (server hanged) with parallel append

From:
Amit Khandekar
Date:
On 7 February 2018 at 07:30, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 6 Feb 2018 13:50:28 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYqdC+9U8mLYkUgM=CaBt6Pzz4R_YNboqDbW-LvUaHO+g@mail.gmail.com>
>> On Tue, Feb 6, 2018 at 11:32 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> > Yeah, I think it looks equally good that way, and like you said, the
>> > current code does it that way. So in the attached patch, I have
>> > swapped the two conditions.
>>
>> I prefer to avoid introducing 2 new variables and instead just prevent
>> the looping directly in the case where we started with a non-partial
>> plan.
>>
>> See attached.  Does this look OK?
>
> Ah, we can bail out when starting from the first partial plan.


> It's a bit uneasy that pa_next_plan can be -1

I also feel that way. In the new condition (node->as_whichplan >
append->first_partial_plan), there is an implicit assumption that
INVALID_SUBPLAN_INDEX is equal to -1. So in our issue, if
node->as_whichplan is -1, this condition becomes false and the loop
breaks correctly. But like I said, it works correctly only because
INVALID_SUBPLAN_INDEX is defined to be less than zero. I guess this
assumption might be safe in the long term as well.

But another thing is that, node->as_whichplan can point to some
non-partial plan because it is the value of the earlier chosen plan.
And in that case (node->as_whichplan > append->first_partial_plan) is
false, indicating we should not wrap around. But there still might be
unfinished partial plans, and the pa_next_plan might be pointing to,
say, a plan at the tail end. Here, it still makes sense to wrap around
and see if there are unfinished partial plans. But the above condition
would not allow us to wrap around.

For this, we should check whether we started with a partial plan.
That is, use a condition like the one I used: (initial_plan >=
append->first_partial_plan). Or else, if we don't want to add a new
variable, set node->as_whichplan to pa_next_plan initially.

Also, the first condition is not necessary here :
-       else if (append->first_partial_plan < node->as_nplans)
+       else if (append->first_partial_plan < node->as_nplans &&
+                node->as_whichplan > append->first_partial_plan)
Just the fact that we are starting from a non-partial plan is enough
to determine that we should not wrap around, regardless of whether
there are any partial plans.
So we can just keep this single condition :
-       else if (append->first_partial_plan < node->as_nplans)
+       else if (node->as_whichplan > append->first_partial_plan)

Attached is the patch accordingly.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments

Re: Query running for very long time (server hanged) with parallel append

From:
Robert Haas
Date:
On Wed, Feb 7, 2018 at 2:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is the patch accordingly.

OK, I see.  That makes sense; committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company