Re: [sqlsmith] Failed assertions on parallel worker shutdown

From: Amit Kapila
Subject: Re: [sqlsmith] Failed assertions on parallel worker shutdown
Date:
Msg-id: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com
In response to: Re: [sqlsmith] Failed assertions on parallel worker shutdown  (Andreas Seltenreich <seltenreich@gmx.de>)
Responses: Re: [sqlsmith] Failed assertions on parallel worker shutdown  (Noah Misch <noah@leadboat.com>)
           Re: [sqlsmith] Failed assertions on parallel worker shutdown  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

> Each of the sent plans was collected when a worker dumped core due to
> the failed assertion.  More core dumps than plans were actually
> observed, since with this failed assertion, multiple workers usually
> trip on and dump core simultaneously.
>
> The following query corresponds to plan2:
>
> --8<---------------cut here---------------start------------->8---
> select
>   pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
>   subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
>   (select unique1 from public.bprime limit 1 offset 9) as c5,
>   subq_0.c2 as c6
> from
> (select ref_0.tablename as c0, ref_0.inherited as c1,
>         ref_0.histogram_bounds as c2, 100 as c3
>       from
>         pg_catalog.pg_stats as ref_0
>       where 49 is not NULL limit 55) as subq_0
> where true
> limit 58;
> --8<---------------cut here---------------end--------------->8---


With the above query, I am able to reproduce the assertion failure you reported upthread (it occurs once in every two to three runs, but always at the same place). It seems to me the issue is that the workers are still writing tuples into the tuple queues after the master backend has detached from them. The master backend detaches from the tuple queues because of the Limit clause: after processing the number of tuples the Limit clause specifies, it shuts down the plan nodes in the following part of the code:

ExecutePlan()
{
..
    if (TupIsNull(slot))
    {
        /* Allow nodes to release or shut down resources. */
        (void) ExecShutdownNode(planstate);
        break;
    }
..
}
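
For reference, that shutdown reaches the tuple queues roughly like this in the 9.6 sources (paraphrased from nodeGather.c and tqueue.c, not verbatim): ExecShutdownNode() recurses to the Gather node, and shutting down Gather destroys the tuple queue readers, which detaches the leader's end of each queue.

/* Paraphrased sketch of 9.6's Gather shutdown path; see nodeGather.c. */
static void
ExecShutdownGatherWorkers(GatherState *node)
{
    /* Destroying a reader calls shm_mq_detach() on the leader's end. */
    if (node->reader != NULL)
    {
        int     i;

        for (i = 0; i < node->nreaders; ++i)
            DestroyTupleQueueReader(node->reader[i]);
        node->reader = NULL;
    }

    /* Then wait for the workers to finish. */
    if (node->pei != NULL)
        ExecParallelFinish(node->pei);
}

After this point, any worker still running is writing into a queue whose receiving end is gone.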

Now, it is quite possible that a worker has written only part of its data when the queue gets detached.  The callers of shm_mq (tqueueReceiveSlot/shm_mq_send) don't handle SHM_MQ_DETACHED, so the worker keeps sending more data (the next tuple), which leads to the assertion in the code below:
shm_mq_sendv()
{
..
    /* Write the actual data bytes into the buffer. */
    Assert(mqh->mqh_partial_bytes <= nbytes);
..
}
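
To illustrate the missing handling, below is a minimal sketch of a detach-aware tqueueReceiveSlot (this is not the attached patch). The bool return value meaning "keep producing tuples" is an assumption for illustration; the receiver shape follows tqueue.c.

/* Hypothetical sketch: check shm_mq_send()'s result instead of ignoring it. */
static bool
tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
    TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
    HeapTuple   tuple = ExecMaterializeSlot(slot);
    shm_mq_result result;

    result = shm_mq_send(tqueue->queue, tuple->t_len, tuple->t_data, false);

    /* The leader has detached; tell the executor to stop producing tuples. */
    if (result == SHM_MQ_DETACHED)
        return false;

    return true;
}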

I think the workers should stop processing tuples once the tuple queues get detached.  This will not only handle the above situation gracefully, but will also speed up queries where a Limit clause sits on top of a Gather node.  A patch for the same is attached to this mail (this was part of the original parallel seq scan patch, but was not applied; the reason, as far as I remember, was that we thought such an optimization might not be required for the initial version).

Another approach to fix this issue could be to reset mqh_partial_bytes and mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED; currently they are reset only in case of success.
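
For concreteness, a minimal sketch of that alternative inside shm_mq_sendv() (variable names follow shm_mq.c; the exact placement after shm_mq_send_bytes() is an assumption):

    /*
     * If the receiver is gone, forget the partial write before returning,
     * so that a later call does not trip the Assert.
     */
    if (res == SHM_MQ_DETACHED)
    {
        mqh->mqh_partial_bytes = 0;
        mqh->mqh_length_word_complete = false;
        return res;
    }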

I have added this issue to the list of PostgreSQL 9.6 open items at https://wiki.postgresql.org/wiki/PostgreSQL_9.6_Open_Items

The steps to reproduce it on the regression database are:
1. We need enough rows in pg_statistic so that a parallel plan gets selected.
2. Use the procedure below to create sufficient rows in pg_statistic, then call it as shown after the function definition:
CREATE OR REPLACE FUNCTION populate_pg_stat() RETURNS int AS
$BODY$
DECLARE
    count int;
BEGIN
    FOR count IN 1 .. 50000
    LOOP
        EXECUTE 'CREATE TABLE t' || count || '(c1 int)';
        EXECUTE 'INSERT INTO t' || count || ' VALUES (generate_series(1,10))';
        EXECUTE 'ANALYZE t' || count;
    END LOOP;
    RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;
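
Then invoke it once (the invocation is implied by step 2 above):

select populate_pg_stat();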
3. Reduce the parallel costs so that a parallel plan is chosen, and then execute the query:
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
Explain Analyze select
  pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
  subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
  (select unique1 from public.bprime limit 1 offset 9) as c5,
  subq_0.c2 as c6
from
(select ref_0.tablename as c0, ref_0.inherited as c1,
        ref_0.histogram_bounds as c2, 100 as c3
      from
        pg_catalog.pg_stats as ref_0
      where 49 is not NULL limit 55) as subq_0
where true
limit 58;

 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com