Re: dsa_allocate() faliure

From Sand Stone
Subject Re: dsa_allocate() faliure
Date
Msg-id CADrk5qMF=AfbCHmdWMk0ZvPac+v4Ra7KO-XwT0+ib=o0mrm7gQ@mail.gmail.com
In response to Re: dsa_allocate() faliure  (Sand Stone <sand.m.stone@gmail.com>)
Responses Re: dsa_allocate() faliure  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-performance
I attached a query (and its query plan) that caused the crash "dsa_allocate could not find 13 free pages" on one of the worker nodes. I anonymised the query text a bit. Interestingly, this time only one node (the same one each time) is crashing. Since this is a production environment, I cannot get a stack trace. Once I turned off parallel execution for this node, the whole query finished just fine. So the parallel query plan is from one of the nodes that did not crash; hopefully the same plan would have been executed on the node that crashed. In theory, every worker node has the same bits and very similar data.
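
For reference, a minimal sketch of one way to turn off parallel execution on a single PostgreSQL 10 node. The thread does not say which parameter was actually changed, so the use of max_parallel_workers_per_gather here is an assumption; the statements would be run on the affected worker node itself.

```sql
-- Assumption: the thread does not name the setting that was changed.
-- Session-level: the planner will not use parallel workers under
-- Gather nodes for queries run in this session.
SET max_parallel_workers_per_gather = 0;

-- Node-wide alternative (requires superuser): persist the setting
-- in postgresql.auto.conf and reload the configuration.
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();

-- Confirm the value currently in effect.
SHOW max_parallel_workers_per_gather;
```

The node-wide form affects every new query on that node, which fits the case where query fragments arrive over connections opened by another node and a session-level SET cannot easily be issued.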

===
psql (10.4)
\dx
                       List of installed extensions
      Name      | Version |   Schema   |            Description            
----------------+---------+------------+-----------------------------------
 citus          | 7.4-3   | pg_catalog | Citus distributed database
 hll            | 2.10    | public     | type for storing hyperloglog data
 plpgsql        | 1.0     | pg_catalog | PL/pgSQL procedural language


On Sat, Aug 25, 2018 at 7:46 AM Sand Stone <sand.m.stone@gmail.com> wrote:
>Can you still see the problem with Citus 7.4?
Hi, Thomas. I actually went back to the cluster with Citus 7.4 and
PG 10.4, and modified the parallel param. So far, I haven't seen any
server crash.

The main difference between the runs that crashed and those that did
not is the set of Linux TCP timeout parameters (to release the ports
faster). Unfortunately, I cannot "undo" the Linux params and rerun the
stress tests, as this is a multi-million $ cluster and people are
doing more useful things on it. I will keep an eye out for any
parallel execution issues.


On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>
> On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone <sand.m.stone@gmail.com> wrote:
> > Just as a follow up. I tried the parallel execution again (in a stress
> > test environment). Now the crash seems gone. I will keep an eye on
> > this for the next few weeks.
>
> Thanks for the report.  That's great news, but it'd be good to
> understand why it was happening.
>
> > My theory is that the Citus cluster created and shut down a lot of TCP
> > connections between coordinator and workers. If running on untuned
> > Linux machines, the TCP ports might run out.
>
> I'm not sure how that's relevant, unless perhaps it causes executor
> nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
> fix?  I wonder if there could be something different about the control
> flow with custom scans, or something about the way Citus worker nodes
> invoke plan fragments, or some error path that I failed to consider...
> It's a clue that all of your worker nodes reliably crashed at the same
> time on the same/similar queries (presumably distributed query
> fragments for different shards), making it seem more like a
> common-or-garden bug rather than some kind of timing-based heisenbug.
> If you ever manage to reproduce it, an explain plan and a back trace
> would be very useful.
>
> > Of course, I am using "newer" PG10 bits and Citus7.5 this time.
>
> Hmm.  There weren't any relevant commits to REL_10_STABLE that I can
> think of.  And (with the proviso that I know next to nothing about
> Citus) I just cloned https://github.com/citusdata/citus.git and
> skimmed through "git diff origin/release-7.4..origin/release-7.5", and
> nothing is jumping out at me.  Can you still see the problem with
> Citus 7.4?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
