Thread: XPRS

XPRS

From:
Thomas Munro
Date:
Hello,

After rereading some old papers recently, I wanted to share some
thoughts about XPRS and modern PostgreSQL.  XPRS stood for "eXtended
Postgres on RAID and Sprite", and was a research project done nearly
three decades ago at Berkeley by the POSTGRES group working with
operating system researchers, on a shared memory machine with a lot of
CPUs for the time (12).

As far as I can tell (and if anyone knows how to find out, I'd love to
know), the parallel query parts of the XPRS system described in
various writings by Wei Hong and Michael Stonebraker are actually
present in the POSTGRES 4.2 tarball and were removed in Postgres95.
Evidence: 4.2's parallel code is all wrapped in #ifdef sequent, and we
know that XPRS ran on a Sequent Symmetry; the parallel hash join
algorithm matches the description given in Hong's doctoral thesis; the
page and range based parallel scans seen in various places also seem
to match.

Hong's thesis covers a lot of material and I certainly haven't
understood all of it, but basically it's about how to share CPU, IO
bandwidth and memory out fairly and dynamically at execution time so
you're using the whole system efficiently.  Facets of this problem
obviously keep coming up on this mailing list (see practically any
discussion of parallel degree, admission control, prefetch or
work_mem, all of which we punt on by deferring to user supplied GUCs
and scan size-based heuristics).

Here are three things I wanted to highlight from Hong's 1991 paper[1]
(later writings elaborate greatly but this one is short and much
easier to read than the thesis and sets the scene):

1.  "The overall performance goal of a multiprocessor database system
is to obtain increased throughput as well as reduced response time in
a multiuser environment.  The objective function that XPRS uses for
query optimization is a combination of resource consumption and
response time as follows:  cost = resource_consumption + w *
response_time, where w is a system-specific weighting factor."

2.  The "Buffer-Size-Independent Hypothesis" (here meaning work_mem):
"The choice of the best sequential plan is insensitive to the amount
of buffer space available as long as the buffer size is above the hash
join threshold" (with a caveat about localised special cases that can
be handled by choosing alternative subplans at runtime).

3.  The "Two-Phase Hypothesis": "The best parallel plan is a
parallelization of the best sequential plan."
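
To make the objective function in the first quote concrete (with numbers
I've just made up): suppose the best serial plan consumes 100 units of
resources and takes 100 time units, while a four-worker plan consumes 120
units because of startup and coordination overhead but finishes in 30.
With w close to zero the system only cares about total resource
consumption and prefers the serial plan (100 vs 120); with w = 1 the
parallel plan wins (120 + 30 = 150 vs 100 + 100 = 200).  The weighting
factor is what lets a loaded system lean towards throughput and an idle
one towards response time.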

I read all of that a while back while working on bits of parallel
query machinery (though I only realised after the fact that I'd
implemented parallel hash the same way as Hong did 27 years earlier,
that is, shared no-partition, which is now apparently back in vogue
due to the recent ubiquity of high core count shared memory systems,
so that every server looks a bit like a Sequent Symmetry; for example
Oracle is rumoured to have a "parallel shared hash join" like ours in
the pipeline).  I didn't understand the context or importance of XPRS,
though, until I read this bit of Hellerstein's "Looking Back at
Postgres"[2]:

"In principle, parallelism “blows up” the plan space for a query
optimizer by making it multiply the traditional choices made during
query optimization (data access, join algorithms, join orders) against
all possible ways of parallelizing each choice. The basic idea of what
Stonebraker called “The Wei Hong Optimizer” was to cut the problem in
two: run a traditional single-node query optimizer in the style of
System R, and then “parallelize” the resulting single-node query plan
by scheduling the degree of parallelism and placement of each operator
based on data layouts and system configuration. This approach is
heuristic, but it makes parallelism an additive cost to traditional
query optimization, rather than a multiplicative cost.

Although “The Wei Hong Optimizer” was designed in the context of
Postgres, it became the standard approach for many of the parallel
query optimizers in industry."

I don't know what to think about the buffer-size-independent
hypothesis, but the two-phase hypothesis and the claim that it is the
standard approach caught my attention.  Firstly, I don't think the
hypothesis holds on our system currently, because (for example) we
lack parallel merge joins and sorts, so you couldn't parallelise such
serial plans, and yet we'd already have thrown away a hash join based
plan that would be vastly better in parallel.  That might be just an
implementation completeness problem.  I wonder what fundamental
problems lurk here.  (Perhaps the non-availability of globally unique
partial paths?)  Anyway, AFAICS we do the exact thing Hong wanted to
avoid: we plan parallel queries as extra paths at planning time.  We
don't really suffer too much of a planning explosion though, because
we don't consider different parallel degrees.  If we did, because our
cost model doesn't include any penalty for resource usage, I suspect
we'd always go for the maximum number of workers because they're
'free', which creates a perverse incentive to burn resource (CPU +
copies of work_mem).  Those are all problems Hong solved with
execution time resource allocation, as part of a bigger picture.
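
To make that last point concrete, here's a toy calculation (not
PostgreSQL code; all constants invented) of Hong's objective function for
different worker counts.  With a pure response-time cost, more workers
always look better, but once resource consumption is charged for, a
finite degree of parallelism wins:

#include <stdio.h>

int
main(void)
{
    double total_work = 1000.0;          /* abstract units of useful work */
    double per_worker_overhead = 40.0;   /* startup + coordination, made up */
    double w = 1.0;                      /* Hong's system-specific weighting factor */

    for (int workers = 1; workers <= 16; workers++)
    {
        /* resource consumption grows with every worker we add */
        double resources = total_work + workers * per_worker_overhead;

        /* response time shrinks as the work is divided up */
        double response_time = total_work / workers + per_worker_overhead;

        /* cost = resource_consumption + w * response_time */
        double cost = resources + w * response_time;

        printf("workers=%2d  response_time=%6.1f  xprs_cost=%7.1f\n",
               workers, response_time, cost);
    }

    /*
     * response_time keeps falling all the way to 16 workers, but xprs_cost
     * bottoms out around 5 workers with these made-up numbers.
     */
    return 0;
}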

I have no idea what to do about any of this but thought that was an
interesting bit of our project's history worth sharing.  It's really
humbling to read these old papers.  I wonder if we're missing a chance
to stand on the shoulders of giants.

[1] http://db.cs.berkeley.edu/jmh/tmp/pdis91-xprs.pdf
[2] https://arxiv.org/pdf/1901.01973.pdf

--
Thomas Munro
https://enterprisedb.com



Re: XPRS

From:
Tomas Vondra
Date:
On Thu, Aug 22, 2019 at 11:41:45AM +1200, Thomas Munro wrote:
>Hello,
>
>After rereading some old papers recently, I wanted to share some
>thoughts about XPRS and modern PostgreSQL.  XPRS stood for "eXtended
>Postgres on RAID and Sprite", and was a research project done nearly
>three decades ago at Berkeley by the POSTGRES group working with
>operating system researchers, on a shared memory machine with a lot of
>CPUs for the time (12).
>
>As far as I can tell (and if anyone knows how to find out, I'd love to
>know), the parallel query parts of the XPRS system described in
>various writings by Wei Hong and Michael Stonebraker are actually
>present in the POSTGRES 4.2 tarball and were removed in Postgres95.
>Evidence: 4.2's parallel code is all wrapped in #ifdef sequent, and we
>know that XPRS ran on a Sequent Symmetry; the parallel hash join
>algorithm matches the description given in Hong's doctoral thesis; the
>page and range based parallel scans seen in various places also seem
>to match.
>
>Hong's thesis covers a lot of material and I certainly haven't
>understood all of it, but basically it's about how to share CPU, IO
>bandwidth and memory out fairly and dynamically at execution time so
>you're using the whole system efficiently.  Facets of this problem
>obviously keep coming up on this mailing list (see practically any
>discussion of parallel degree, admission control, prefetch or
>work_mem, all of which we punt on by deferring to user supplied GUCs
>and scan size-based heuristics).
>
>Here are three things I wanted to highlight from Hong's 1991 paper[1]
>(later writings elaborate greatly but this one is short and much
>easier to read than the thesis and sets the scene):
>
>1.  "The overall performance goal of a multiprocessor database system
>is to obtain increased throughput as well as reduced response time in
>a multiuser environment.  The objective function that XPRS uses for
>query optimization is a combination of resource consumption and
>response time as follows:  cost = resource_consumption + w *
>response_time, where w is a system-specific weighting factor."
>
>2.  The "Buffer-Size-Independent Hypothesis" (here meaning work_mem):
>"The choice of the best sequential plan is insensitive to the amount
>of buffer space available as long as the buffer size is above the hash
>join threshold" (with a caveat about localised special cases that can
>be handled by choosing alternative subplans at runtime).
>
>3.  The "Two-Phase Hypothesis": "The best parallel plan is a
>parallelization of the best sequential plan."
>
>I read all of that a while back while working on bits of parallel
>query machinery (though I only realised after the fact that I'd
>implemented parallel hash the same way as Hong did 27 years earlier,
>that is, shared no-partition, which is now apparently back in vogue
>due to the recent ubiquity of high core count shared memory systems,
>so that every server looks a bit like a Sequent Symmetry; for example
>Oracle is rumoured to have a "parallel shared hash join" like ours in
>the pipeline).  I didn't understand the context or importance of XPRS,
>though, until I read this bit of Hellerstein's "Looking Back at
>Postgres"[2]:
>
>"In principle, parallelism “blows up” the plan space for a query
>optimizer by making it multiply the traditional choices made during
>query optimization (data access, join algorithms, join orders) against
>all possible ways of parallelizing each choice. The basic idea of what
>Stonebraker called “The Wei Hong Optimizer” was to cut the problem in
>two: run a traditional single-node query optimizer in the style of
>System R, and then “parallelize” the resulting single-node query plan
>by scheduling the degree of parallelism and placement of each operator
>based on data layouts and system configuration. This approach is
>heuristic, but it makes parallelism an additive cost to traditional
>query optimization, rather than a multiplicative cost.
>

I think this relies on a huge assumption that all steps in any sequential
plan can be parallelized, which certainly is not true for PostgreSQL - as
you point out later with the join example. That means the optimal join
order may be different for a parallel plan, and so on.

The other thing is that "parallelizations" of different sequential plans
may have different requirements for resources (say, work_mem), so I'd
expect cases where the parallel version of a "worse" sequential plan ends
up being superior thanks to allowing a larger number of workers.
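
(A made-up illustration: a hash join plan with serial cost 90 might need a
work_mem copy per worker and so only afford two workers under a fixed
memory budget, while a merge join plan with serial cost 110 needs almost
no memory and can run with eight workers; the parallelization of the
"worse" serial plan can then easily be the faster one.)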

>Although “The Wei Hong Optimizer” was designed in the context of
>Postgres, it became the standard approach for many of the parallel
>query optimizers in industry."
>

I assume this quote is from 30 years ago. I wonder if the claim is still
true, on current hardware (including e.g. distributed databases).

>I don't know what to think about the buffer-size-independent
>hypothesis, but the two-phase hypothesis and the claim that it is the
>standard approach caught my attention.  Firstly, I don't think the
>hypothesis holds on our system currently, because (for example) we
>lack parallel merge joins and sorts, so you couldn't parallelise such
>serial plans, and yet we'd already have thrown away a hash join based
>plan that would be vastly better in parallel.  That might be just an
>implementation completeness problem.  I wonder what fundamental
>problems lurk here.  (Perhaps the non-availability of globally unique
>partial paths?)  Anyway, AFAICS we do the exact thing Hong wanted to
>avoid: we plan parallel queries as extra paths at planning time.  We
>don't really suffer too much of a planning explosion though, because
>we don't consider different parallel degrees.  If we did, because our
>cost model doesn't include any penalty for resource usage, I suspect
>we'd always go for the maximum number of workers because they're
>'free', which creates a perverse incentive to burn resource (CPU +
>copies of work_mem).  Those are all problems Hong solved with
>execution time resource allocation, as part of a bigger picture.
>

I don't know. I'd guess the hardware changed quite a bit, so maybe some of
the assumptions from the paper are too simplistic nowadays? Consider for
example the memory hierarchy - 30 years ago the amount of on-CPU cache was
minuscule, while now we have L1/L2/L3 caches that are tens of megabytes.
It's usually much faster to do sorts that fit into L3, for example.

FWIW I think we'll have to do something about resource acquisition, sooner
or later. It was always quite annoying that we don't really consider
memory consumption of the query as a whole during planning, and parallel
query made it a bit more painful.

>I have no idea what to do about any of this but thought that was an
>interesting bit of our project's history worth sharing.  It's really
>humbling to read these old papers.  I wonder if we're missing a chance
>to stand on the shoulders of giants.
>

Thanks, I think it's always useful / interesting to look at papers like
this. I don't know if we can use the stuff described in those papers
directly, but maybe we can build on those ideas and see which of the
assumptions are no longer true.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: XPRS

From:
Thomas Munro
Date:
On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> > Although “The Wei Hong Optimizer” was designed in the context of
> > Postgres, it became the standard approach for many of the parallel
> > query optimizers in industry."
>
> I assume this quote is from 30 years ago. I wonder if the claim is still
> true, on current hardware (including e.g. distributed databases).

The quote is from 2018, and appears in the article I linked (it's a
chapter from the book Making Databases Work: The Wisdom of Michael
Stonebraker), but I'm not sure which systems it's referring to.

Speculation:  Many other systems have what we call parallel-oblivious
operators only, and then insert various exchange operators to make a
parallel plan.  That is, they don't necessarily have a direct
equivalent of our "Parallel Sequential Scan", "Parallel Index Scan",
"Parallel Foreign Scan": they just use their regular scans, possibly
with the addition of some kind of "Parallel Scatter" node (that's my
made up name, it's the opposite of Gather, called various things like
"page supplier" or "block iterator") or "Parallel Repartition"
inserted in the right places.  Perhaps they create a serial plan
first, and then try to parallelise it by inserting various parallel
operators and then recomputing the costs?  Rather than considering the
separate paths in the first phase of the optimiser, as we do.  The
cases where Hong's best-parallel-plan hypothesis isn't true for us now
might go away if we had Parallel Repartition, so that each 'stream'
would be the complete set of tuples for some known partition.
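
Just to sketch the shape of that (hypothetical structs and made-up
numbers, nothing like the real planner data structures), phase two might
walk the finished serial plan and wrap each scan in a Scatter-like
exchange operator with divided row counts, with a Gather on top
(recosting the joins above is left out of the sketch):

#include <stdio.h>
#include <stdlib.h>

typedef enum { SEQSCAN, HASHJOIN, SCATTER, GATHER } NodeTag;

typedef struct Plan
{
    NodeTag     tag;
    struct Plan *left;          /* outer/only child */
    struct Plan *right;         /* inner child, or NULL */
    double      rows;           /* estimated rows produced per process */
    double      cost;           /* estimated cost per process */
} Plan;

static Plan *
make_plan(NodeTag tag, Plan *left, Plan *right, double rows, double cost)
{
    Plan *p = malloc(sizeof(Plan));

    p->tag = tag;
    p->left = left;
    p->right = right;
    p->rows = rows;
    p->cost = cost;
    return p;
}

/* Phase two: wrap every scan in a Scatter that hands out page ranges. */
static Plan *
insert_scatter(Plan *plan, int nworkers)
{
    if (plan == NULL)
        return NULL;

    if (plan->tag == SEQSCAN)
    {
        /* each worker now sees only its share of the pages */
        return make_plan(SCATTER, plan, NULL,
                         plan->rows / nworkers,
                         plan->cost / nworkers);
    }

    plan->left = insert_scatter(plan->left, nworkers);
    plan->right = insert_scatter(plan->right, nworkers);
    return plan;
}

int
main(void)
{
    /* best serial plan: HashJoin(SeqScan(big), SeqScan(small)), made-up numbers */
    Plan *join = make_plan(HASHJOIN,
                           make_plan(SEQSCAN, NULL, NULL, 1e6, 10000.0),
                           make_plan(SEQSCAN, NULL, NULL, 1e5, 1000.0),
                           1e6, 15000.0);

    /* phase two: scatter the scans, put a Gather on top (recosting skipped) */
    Plan *parallel = make_plan(GATHER, insert_scatter(join, 4), NULL,
                               1e6, 0.0);

    printf("Scatter over the big scan now estimates %.0f rows per worker\n",
           parallel->left->left->rows);
    return 0;
}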

To be clear, I'm not suggesting we do that necessarily, just pointing
out some interesting claims about ancient POSTGRES wisdom, in a highly
speculative water cooler thread.   Actually, this isn't the first time
it's occurred to me that elements of our design were falling out of
the order that we chose to implement things in.  Another example is
the "Shared Hash" that I had in an early version of my work on
Parallel Hash Join, where just one process would run a parallel-safe
but non-parallel-oblivious plan to build a shared hash table while
other workers twiddled their thumbs; I dropped it because our cost
model has no penalty for running N copies of the same plan rather than
just one so there was no way to pick that plan, and that's because we
don't have a cost model like Hong's that considers resource usage too.
Another more speculative observation: maybe no-partition/shared
Parallel Hash Join is only obvious if you already have the general
concept of parallel-aware executor nodes.  AFAIK Robert and Amit
invented those to be able to coordinate parallel scans between
processes, where thread-based systems might be able to share a single
scan state somehow under a Scatter-like operator.  If you had threads,
you might not need that concept that allows arbitrary executor nodes
to communicate with each other between workers, and then it might be
more obvious and natural to use repartitioning for parallelising hash
joins.

> FWIW I think we'll have to do something about resource acquisition, sooner
> or later. It was always quite annoying that we don't really consider
> memory consumption of the query as a whole during planning, and parallel
> query made it a bit more painful.

Agreed.

Here's an approach I have been wondering about to cap total executor
memory usage, which is a much more down-to-Earth idea than any of the
above space cadet planner problems.  Let's start at the other end of
the problem, by introducing admission control and memory quotas.  That
is, keep using work_mem with its current per-node-copy meaning at
planning time, for now, and then (a rough sketch of the bookkeeping
follows the numbered list):

1.  Compute the peak amount of memory each plan thinks it will need.
Initially that could be done by summing estimates from all nodes
and considering workers.  A later refinement could deal with nodes
that give back memory early, if we get around to doing that.  The
estimate could be shown by EXPLAIN.  (Some details to work out: worst
case vs expected etc.)

2.  Introduce a new GUC global_work_mem, which limits the total for all
plans that are allowed to run concurrently, according to their memory
estimates.  Introduce a shared memory counter of currently allocated
quota.

3.  Introduce a new GUC session_work_mem, which is the amount of quota
that every session tries to acquire when it connects or perhaps first
runs a query, and that it won't give back until the end of the
session.  Or perhaps they acquire less than that if they need less,
but that's the amount they never give back once they've got that much.
The idea is to allow queries with estimates under that limit, for
example high frequency OLTP queries, to avoid any extra communication
overhead from this scheme.

4.  To run queries that have estimates higher than the session's
current allocated quota, the session must acquire more quota for the
duration of the query.  If it can't be acquired right now without
exceeding global_work_mem, it has to join a queue and wait.  A
refinement could be that you are allowed to run with fewer workers
than planned to reduce the requirement.

5.  While executing, executor nodes could opportunistically ask for
more quota than was planned for, up to some limit, to avoid having to
spill to disk.  If the request is unsuccessful, that's OK, they can
deal with that.

6.  So long as we have nodes that have no escape mechanism in certain
edge cases (hash aggregates and joins with bad stats and extreme
skew), you could perhaps have the option of raising an error or
forcing the total to exceed global_work_mem temporarily with a warning
(which would at least prevent other large queries from running and
making it worse).

7.  Regular heap memory and DSM memory should be counted together,
since it makes no difference to the operating system, it's all memory
and we should count it against the same quota.  You'd probably want to
consider hidden allocator fragmentation too, as well as other hidden
overheads, to get decent accuracy.
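
Here's a toy model of the bookkeeping behind steps 2, 4 and 5
(hypothetical; plain pthreads standing in for shared memory, LWLocks and
condition variables, with wait-queue fairness, session_work_mem and the
error paths all ignored):

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t quota_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t quota_released = PTHREAD_COND_INITIALIZER;

static size_t global_work_mem;  /* the step-2 GUC, in bytes */
static size_t quota_in_use;     /* total currently granted to all sessions */

/* Step 4: wait until the query's estimated memory can be granted. */
static void
acquire_quota(size_t bytes)
{
    pthread_mutex_lock(&quota_lock);
    while (quota_in_use + bytes > global_work_mem)
        pthread_cond_wait(&quota_released, &quota_lock);
    quota_in_use += bytes;
    pthread_mutex_unlock(&quota_lock);
}

/* Step 5: opportunistic extra quota at execution time; may be refused. */
static int
try_acquire_quota(size_t bytes)
{
    int granted = 0;

    pthread_mutex_lock(&quota_lock);
    if (quota_in_use + bytes <= global_work_mem)
    {
        quota_in_use += bytes;
        granted = 1;
    }
    pthread_mutex_unlock(&quota_lock);
    return granted;
}

static void
release_quota(size_t bytes)
{
    pthread_mutex_lock(&quota_lock);
    quota_in_use -= bytes;
    pthread_cond_broadcast(&quota_released);
    pthread_mutex_unlock(&quota_lock);
}

int
main(void)
{
    global_work_mem = (size_t) 16 * 1024 * 1024 * 1024;     /* 16GB */

    acquire_quota((size_t) 256 * 1024 * 1024);  /* plan's estimate: 256MB */
    if (try_acquire_quota((size_t) 64 * 1024 * 1024))   /* hash table grew */
        release_quota((size_t) 64 * 1024 * 1024);
    release_quota((size_t) 256 * 1024 * 1024);
    return 0;
}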

This is sort of fudging together of ideas from conversations with
Kevin Grittner (who talked about admission control a few years back),
Peter Geoghegan (who mentioned opportunistically asking for more), and
things I've heard of on SQL Server ("memory grants").  I think it
would provide some relief from the problems we see today: it's hard to
set work_mem so that you never get OOM but you can still use a decent
amount of your precious memory, especially with mixed parallel and
non-parallel query workloads thanks to our current
work_mem-multiplying design.

--
Thomas Munro
https://enterprisedb.com



Re: XPRS

From:
Tomas Vondra
Date:
On Mon, Sep 02, 2019 at 02:19:15PM +1200, Thomas Munro wrote:
>On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> > Although “The Wei Hong Optimizer” was designed in the context of
>> > Postgres, it became the standard approach for many of the parallel
>> > query optimizers in industry."
>>
>> I assume this quote is from 30 years ago. I wonder if the claim is still
>> true, on current hardware (including e.g. distributed databases).
>
>The quote is from 2018, and appears in the article I linked (it's a
>chapter from the book Making Databases Work: The Wisdom of Michael
>Stonebraker), but I'm not sure which systems it's referring to.
>

Hmm, that's unfortunate - it'd be quite interesting to know which
databases it's referring to. I suspect no optimizer is ideal in this
regard, i.e. each database has some "gaps" where some nodes don't have a
straightforward parallel version.

>Speculation:  Many other systems have what we call parallel-oblivious
>operators only, and then insert various exchange operators to make a
>parallel plan.  That is, they don't necessarily have a direct
>equivalent of our "Parallel Sequential Scan", "Parallel Index Scan",
>"Parallel Foreign Scan": they just use their regular scans, possibly
>with the addition of some kind of "Parallel Scatter" node (that's my
>made up name, it's the opposite of Gather, called various things like
>"page supplier" or "block iterator") or "Parallel Repartition"
>inserted in the right places.  Perhaps they create a serial plan
>first, and then try to parallelise it by inserting various parallel
>operators and then recomputing the costs?  Rather than considering the
>separate paths in the first phase of the optimiser, as we do.  The
>cases where Hong's best-parallel-plan hypothesis isn't true for us now
>might go away if we had Parallel Repartition, so that each 'stream'
>would be the complete set of tuples for some known partition.
>

I don't know. It kinda reminds me of planning with distributed databases,
which also need to exchange data between nodes in various cases - say, when
joining two relations distributed in different ways. The redistribution is
however pretty costly (network I/O, bandwidth etc.) to the extent that
it's often much better to pick a very different join to reduce the amount
of data to exchange, or eliminate the redistribution altogether. For
parallelism the costs are much lower, of course, but I don't think we can
just ignore those.

FWIW it's not clear to me why the cost would need to be recomputed after
constructing the parallel version of the plan? My understanding is that
the idea is to do cost-based planning for the serial plan, and then just
"mechanically" construct a parallel plan. Although, maybe there could be
multiple parallel alternatives ...

>To be clear, I'm not suggesting we do that necessarily, just pointing
>out some interesting claims about ancient POSTGRES wisdom, in a highly
>speculative water cooler thread.   Actually, this isn't the first time
>it's occurred to me that elements of our design were falling out of
>the order that we chose to implement things in.  Another example is
>the "Shared Hash" that I had in an early version of my work on
>Parallel Hash Join, where just one process would run a parallel-safe
>but non-parallel-oblivious plan to build a shared hash table while
>other workers twiddled their thumbs; I dropped it because our cost
>model has no penalty for running N copies of the same plan rather than
>just one so there was no way to pick that plan, and that's because we
>don't have a cost model like Hong's that considers resource usage too.
>Another more speculative observation: maybe no-partition/shared
>Parallel Hash Join is only obvious if you already have the general
>concept of parallel-aware executor nodes.  AFAIK Robert and Amit
>invented those to be able to coordinate parallel scans between
>processes, where thread-based systems might be able to share a single
>scan state somehow under a Scatter-like operator.  If you had threads,
>you might not need that concept that allows arbitrary executor nodes
>to communicate with each other between workers, and then it might be
>more obvious and natural to use repartitioning for parallelising hash
>joins.
>
>> FWIW I think we'll have to do something about resource acquisition, sooner
>> or later. It was always quite annoying that we don't really consider
>> memory consumption of the query as a whole during planning, and parallel
>> query made it a bit more painful.
>
>Agreed.
>
>Here's an approach I have been wondering about to cap total executor
>memory usage, which is a much more down-to-Earth idea than any of the
>above space cadet planner problems.  Let's start at the other end of
>the problem, by introducing admission control and memory quotas.  That
>is, keep using work_mem with its current per-node-copy meaning at
>planning time, for now, and then:
>
>1.  Compute the peak amount of memory each plan thinks it will need.
>Initially that could be done by summing estimates from all nodes
>and considering workers.  A later refinement could deal with nodes
>that give back memory early, if we get around to doing that.  The
>estimate could be shown by EXPLAIN.  (Some details to work out: worst
>case vs expected etc.)
>
>2.  Introduce a new GUC global_work_mem, which limits the total for all
>plans that are allowed to run concurrently, according to their memory
>estimates.  Introduce a shared memory counter of currently allocated
>quota.
>
>3.  Introduce a new GUC session_work_mem, which is the amount of quota
>that every session tries to acquire when it connects or perhaps first
>runs a query, and that it won't give back until the end of the
>session.  Or perhaps they acquire less than that if they need less,
>but that's the amount they never give back once they've got that much.
>The idea is to allow queries with estimates under that limit, for
>example high frequency OLTP queries, to avoid any extra communication
>overhead from this scheme.
>
>4.  To run queries that have estimates higher than the session's
>current allocated quota, the session must acquire more quota for the
>duration of the query.  If it can't be acquired right now without
>exceeding global_work_mem, it has to join a queue and wait.  A
>refinement could be that you are allowed to run with fewer workers
>than planned to reduce the requirement.
>
>5.  While executing, executor nodes could opportunistically ask for
>more quota than was planned for, up to some limit, to avoid having to
>spill to disk.  If the request is unsuccessful, that's OK, they can
>deal with that.
>
>6.  So long as we have nodes that have no escape mechanism in certain
>edge cases (hash aggregates and joins with bad stats and extreme
>skew), you could perhaps have the option of raising an error or
>forcing the total to exceed global_work_mem temporarily with a warning
>(which would at least prevent other large queries from running and
>making it worse).
>
>7.  Regular heap memory and DSM memory should be counted together,
>since it makes no difference to the operating system, it's all memory
>and we should count it against the same quota.  You'd probably want to
>consider hidden allocator fragmentation too, as well as other hidden
>overheads, to get decent accuracy.
>
>This is sort of fudging together of ideas from conversations with
>Kevin Grittner (who talked about admission control a few years back),
>Peter Geoghegan (who mentioned opportunistically asking for more), and
>things I've heard of on SQL Server ("memory grants").  I think it
>would provide some relief from the problems we see today: it's hard to
>set work_mem so that you never get OOM but you can still use a decent
>amount of your precious memory, especially with mixed parallel and
>non-parallel query workloads thanks to our current
>work_mem-multiplying design.
>

I think this is probably the simplest and most realistic first step.

Whenever I was thinking about memory acquisition, I've assumed we'd
monitor how much memory the plan is expected to use while we're
constructing it. My main problem was what to do when we reach the
per-query limit - whether to (a) simply reject the plan, (b) go back and
see if we can replan with lower work_mem (but how much and for which
nodes?), or (c) just continue.

The proposed plan deals with this by not limiting the per-query (or rather
per-session) budget directly, and instead requesting additional
budget. Which is nice.

I suspect we should also keep an additional plan that is expected to meet
the session_work_mem limit, aside from the regular cheapest plan, and use
it if it's not much worse. Imagine you have a plan with cost 1000 that
needs (global_work_mem/2 + 1kB) memory, essentially serializing executions
of this query. And then there's an alternative plan with cost 1100 that
can run with session_work_mem. It seems better to just accept the second
plan, because it won't need to wait.

Another challenge with work_mem is that anyone can modify it arbitrarily,
i.e. a user can do

  SET work_mem = '1TB';

and use as much memory as they wish, or even crash the system. I wonder if
we could define the new GUCs (session_work_mem and global_work_mem) in a
way to prevent this. We probably don't want to make them PGC_POSTMASTER
(it seems useful to allow overriding them in ALTER USER/DATABASE), but I
don't think we have a good way to do that at the moment. Any ideas in this
direction?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: XPRS

From:
Thomas Munro
Date:
On Tue, Sep 3, 2019 at 5:20 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> FWIW it's not clear to me why the cost would need to be recomputed after
> constructing the parallel version of the plan? My understanding is that
> the idea is to do cost-based planning for the serial plan, and then just
> "mechanically" construct a parallel plan. Although, maybe there could be
> multiple parallel alternatives ...

Presumably you still need to choose between the serial and parallel
plans by comparing costs.  You lose some by adding exchange operators,
but you win some by dividing cardinality estimates.
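
(With made-up numbers: a serial scan costed at 1000 becomes roughly
1000/4 = 250 per worker with four workers, plus say 50 for the Gather, so
the parallel plan looks like ~300 on response time even though it burns
more total resources.)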

> >This is sort of fudging together of ideas from conversations with
> >Kevin Grittner (who talked about admission control a few years back),
> >Peter Geoghegan (who mentioned opportunistically asking for more), and
> >things I've heard of on SQL Server ("memory grants").  I think it
> >would provide some relief from the problems we see today: it's hard to
> >set work_mem so that you never get OOM but you can still use a decent
> >amount of your precious memory, especially with mixed parallel and
> >non-parallel query workloads thanks to our current
> >work_mem-multiplying design.
>
> I think this is probably the simplest and most realistic first step.
>
> Whenever I was thinking about memory acquisition, I've assumed we'd
> monitor how much memory the plan is expected to use while we're
> constructing it. My main problem was what to do when we reach the
> per-query limit - whether to (a) simply reject the plan, (b) go back and
> see if we can replan with lower work_mem (but how much and for which
> nodes?), or (c) just continue.

Yeah, it's all quite tricky and circular.  But I'm pretty sure that we
need caps at execution time, anyway, so I think it's OK to start at
that end of the problem and then later try to improve the way the
planner handles it.

> The proposed plan deals with this by not limiting the per-query (or rather
> per-session) budget directly, and instead requesting additional
> budget. Which is nice.
>
> I suspect we should also keep an additional plan that is expected to meet
> the session_work_mem limit, aside from the regular cheapest plan, and use
> it if it's not much worse. Imagine you have a plan with cost 1000 that
> needs (global_work_mem/2 + 1kB) memory, essentially serializing executions
> of this query. And then there's an alternative plan with cost 1100 that
> can run with session_work_mem. It seems better to just accept the second
> plan, because it won't need to wait.

Hmm.  I wonder if it's worth it.  You could also just replan as you
said, but I'm wondering if just rejecting the query would be OK.

> Another challenge with work_mem is that anyone can modify it arbitrarily,
> i.e. a user can do
>
>   SET work_mem = '1TB';
>
> and use as much memory as they wish, or even crash the system. I wonder if
> we could define the new GUCs (session_work_mem and global_work_mem) in a
> way to prevent this. We probably don't want to make them PGC_POSTMASTER
> (it seems useful to allow overriding them in ALTER USER/DATABASE), but I
> don't think we have a good way to do that at the moment. Any ideas in this
> direction?

How about something giving the superuser the following GUCs:

global_work_mem = 16GB
session_min_work_mem = 0.5%  -- the amount of quota sessions keep, for
fast small queries
session_max_work_mem = 20% -- the maximum quota any one session is allowed
session_extra_work_mem = 5% -- opportunistic execution-time boost

Users are free to plan queries with work_mem = 1TB, and if you do that
and it estimates that it wants 512GB, it will be rejected if you try
to execute it because it exceeds session_max_work_mem, with a hint
telling you to turn down work_mem.  Otherwise it either runs or joins
the queue if it can't get the quota it needs immediately.
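
(With those numbers, a session could keep at most 20% of 16GB = ~3.2GB of
quota, so a query estimated at 512GB is refused outright with the hint,
while one estimated at 2GB either starts immediately or queues until 2GB
of quota is free.)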

Eventually we could try to figure out how to set work_mem to automatic
(I don't want to propose a concrete rule, but maybe something based on
session_max_work_mem / njoins, with various fudge factors, and some
accounting for parallel workers; it's probably good to low-ball it and
rely on session_extra_work_mem).

Yeah, I think you'd want to be able to set session_XXX on databases
and roles so that you can say your regular users can't eat more than
10% of memory each, but a big reporting thing is allowed more.

-- 
Thomas Munro
https://enterprisedb.com



Re: XPRS

From:
Tomas Vondra
Date:
On Tue, Sep 03, 2019 at 11:04:43AM +1200, Thomas Munro wrote:
>On Tue, Sep 3, 2019 at 5:20 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> FWIW it's not clear to me why the cost would need to be recomputed after
>> constructing the parallel version of the plan? My understanding is that
>> the idea is to do cost-based planning for the serial plan, and then just
>> "mechanically" construct a parallel plan. Although, maybe there could be
>> multiple parallel alternatives ...
>
>Presumably you still need to choose between the serial and parallel
>plans by comparing costs.  You lose some by adding exchange operators,
>but you win some by dividing cardinality estimates.
>

Oh, right. Silly me.

>> >This is sort of fudging together of ideas from conversations with
>> >Kevin Grittner (who talked about admission control a few years back),
>> >Peter Geoghegan (who mentioned opportunistically asking for more), and
>> >things I've heard of on SQL Server ("memory grants").  I think it
>> >would provide some relief from the problems we see today: it's hard to
>> >set work_mem so that you never get OOM but you can still use a decent
>> >amount of your precious memory, especially with mixed parallel and
>> >non-parallel query workloads thanks to our current
>> >work_mem-multiplying design.
>>
>> I think this is probably the simplest and most realistic first step.
>>
>> Whenever I was thinking about memory acquisition, I've assumed we'd
>> monitor how much memory the plan is expected to use while we're
>> constructing it. My main problem was what to do when we reach the
>> per-query limit - whether to (a) simply reject the plan, (b) go back and
>> see if we can replan with lower work_mem (but how much and for which
>> nodes?), or (c) just continue.
>
>Yeah, it's all quite tricky and circular.  But I'm pretty sure that we
>need caps at execution time, anyway, so I think it's OK to start at
>that end of the problem and then later try to improve the way the
>planner handles it.
>

True.

>> The proposed plan deals with this by not limiting the per-query (or rather
>> per-session) budget directly, and instead requesting additional
>> budget. Which is nice.
>>
>> I suspect we should also keep an additional plan that is expected to meet
>> the session_work_mem limit, aside from the regular cheapest plan, and use
>> it if it's not much worse. Imagine you have a plan with cost 1000 that
>> needs (global_work_mem/2 + 1kB) memory, essentially serializing executions
>> of this query. And then there's an alternative plan with cost 1100 that
>> can run with session_work_mem. It seems better to just accept the second
>> plan, because it won't need to wait.
>
>Hmm.  I wonder if it's worth it.  You could also just replan as you
>said, but I'm wondering if just rejecting the query would be OK.
>

I think we should not reject queries unnecessarily, if there's a workable
execution plan. It's just another optimization criterion, and erroring out
right after planning is essentially saying "can't find a plan". But when
there is a plan that we could use, that seems like a bad idea.

>> Another challenge with work_mem is that anyone can modify it arbitrarily,
>> i.e. a user can do
>>
>>   SET work_mem = '1TB';
>>
>> and use as much memory as they wish, or even crash the system. I wonder if
>> we could define the new GUCs (session_work_mem and global_work_mem) in a
>> way to prevent this. We probably don't want to make them PGC_POSTMASTER
>> (it seems useful to allow overriding them in ALTER USER/DATABASE), but I
>> don't think we have a good way to do that at the moment. Any ideas in this
>> direction?
>
>How about something giving the superuser the following GUCs:
>
>global_work_mem = 16GB
>session_min_work_mem = 0.5%  -- the amount of quota sessions keep, for
>fast small queries
>session_max_work_mem = 20% -- the maximum quota any one session is allowed
>session_extra_work_mem = 5% -- opportunistic execution-time boost
>
>Users are free to plan queries with work_mem = 1TB, and if you do that
>and it estimates that it wants 512GB, it will be rejected if you try
>to execute it because it exceeds session_max_work_mem, with a hint
>telling you to turn down work_mem.  Otherwise it either runs or joins
>the queue if it can't get the quota it needs immediately.
>

Seems reasonable, certainly for v1. I'd keep it as simple as possible.

>Eventually we could try to figure out how to set work_mem to automatic
>(I don't want to propose a concrete rule, but maybe something based on
>session_max_work_mem / njoins, with various fudge factors, and some
>accounting for parallel workers; it's probably good to low-ball it and
>rely on session_extra_work_mem).
>

Hmm, so you'd tweak work_mem for individual queries? Not sure that's
something I'd do at this point - it may seem simple, but I think it's
actually way harder to get right.

For example, let's say you have two views that are planned nicely, then you
join them and suddenly the plan is much worse because the effective work_mem
got much lower. That's not great.

Of course, if it's just optional behavior, and the current behavior with an
explicit work_mem value remains the default, then this is not an issue.

Anyway, I'd focus on an MVP doing the bare minimum, simply enforcing a
session limit, and leave this for the future.

>Yeah, I think you'd want to be able to set session_XXX on databases
>and roles so that you can say your regular users can't eat more than
>10% of memory each, but a big reporting thing is allowed more.
>

Yeah, something like that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services