Re: Parallel Seq Scan

From: Stephen Frost
Subject: Re: Parallel Seq Scan
Date:
Msg-id: 20150109172420.GB3062@tamriel.snowman.net
In response to: Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Parallel Seq Scan  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Parallel Seq Scan  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:
> On Fri, Dec 19, 2014 at 7:57 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > There's certainly documentation available from the other RDBMS' which
> > already support parallel query, as one source.  Other academic papers
> > exist (and once you've linked into one, the references and prior work
> > helps bring in others).  Sadly, I don't currently have ACM access (might
> > have to change that..), but there are publicly available papers also,
>
> I have gone through a couple of papers and looked at what some other
> databases do for parallel sequential scan; here is a brief summary,
> and how I am planning to handle it in the patch:

Great, thanks!

> Costing:
> In one of the papers [1] suggested by you, below is a summary:
>   a. Startup costs are negligible if processes can be reused
>      rather than created afresh.
>   b. Communication cost consists of the CPU cost of sending
>      and receiving messages.
>   c. Communication costs can exceed the cost of operators such
>      as scanning, joining or grouping.
> These findings lead to the important conclusion that query
> optimization should be concerned with communication costs
> but not with startup costs.
>
> In our case, as we currently don't have a mechanism to reuse parallel
> workers, we need to account for that cost as well.  Based on that,
> I am planning to add three new parameters: cpu_tuple_comm_cost,
> parallel_setup_cost and parallel_startup_cost.
> *  cpu_tuple_comm_cost - Cost of CPU time to pass a tuple from worker
>    to master backend, with default value DEFAULT_CPU_TUPLE_COMM_COST
>    as 0.1; this will be multiplied by the number of tuples expected
>    to be selected.
> *  parallel_setup_cost - Cost of setting up shared memory for
>    parallelism, with default value as 100.0.
> *  parallel_startup_cost - Cost of starting up parallel workers, with
>    default value as 1000.0, multiplied by the number of workers
>    decided for the scan.
>
> I will do some experiments to finalise the default values, but in
> general I feel that developing the cost model around the above
> parameters is a good approach.
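
Just to make the arithmetic concrete, here is a rough sketch of how those
three knobs could combine into a cost for one parallel path.  To be clear,
this is only an illustration loosely in the style of costsize.c; the
function name, the even-division assumption and everything else here are
mine, not taken from your patch:

/* Proposed GUCs, with the defaults suggested above (illustrative only) */
double cpu_tuple_comm_cost   = 0.1;    /* per tuple shipped to the master */
double parallel_setup_cost   = 100.0;  /* one-time shared-memory setup */
double parallel_startup_cost = 1000.0; /* per worker started */

/*
 * Hypothetical helper: given the run cost of a plain seq scan and the
 * number of tuples expected to be returned, estimate the cost of the same
 * scan spread across nworkers workers, assuming the work divides evenly.
 */
double
parallel_seqscan_cost_sketch(double seqscan_run_cost,
                             double output_tuples,
                             int nworkers)
{
    double startup = parallel_setup_cost +
                     parallel_startup_cost * nworkers;
    double run = seqscan_run_cost / nworkers +
                 cpu_tuple_comm_cost * output_tuples;

    return startup + run;
}

With those defaults, a scan whose serial run cost is 10000 and which
returns 1000 tuples comes out around 7200 with two workers but over 16000
with sixteen, so which worker count we charge for matters a great deal.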

The parameters sound reasonable but I'm a bit worried about the way
you're describing the implementation.  Specifically this comment:

"Cost of starting up parallel workers with default value as 1000.0
multiplied by number of workers decided for scan."

That appears to imply that we'll decide on the number of workers, figure
out the cost, and then consider "parallel" as one path and
"not-parallel" as another.  I'm worried that if I end up setting the max
parallel workers to 32 for my big, beefy, mostly-single-user system then
I'll actually end up not getting parallel execution because we'll always
be including the full startup cost of 32 threads.  For huge queries,
it'll probably be fine, but there's a lot of room to parallelize things
at levels less than 32 which we won't even consider.

What I was advocating for up-thread was to consider multiple "parallel"
paths and to pick whichever ends up being the lowest overall cost.  The
flip-side to that is increased planning time.  Perhaps we can come up
with an efficient way of working out where the break-point is based on
the non-parallel cost and go at it from that direction instead of
building out whole paths for each increment of parallelism.

I'd really like to be able to set the 'max parallel' high and then have
the optimizer figure out how many workers should actually be spawned for
a given query.
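
As a toy example of what I mean, and assuming the kind of hypothetical
cost function sketched above (again, none of these names or numbers come
from your patch), the planner could simply try each degree of parallelism
and keep the cheapest:

#include <stdio.h>

/* Same illustrative defaults as in the earlier sketch */
static const double cpu_tuple_comm_cost   = 0.1;
static const double parallel_setup_cost   = 100.0;
static const double parallel_startup_cost = 1000.0;

/* Illustrative: cost of running the scan with a given number of workers */
static double
parallel_cost(double serial_run_cost, double output_tuples, int nworkers)
{
    return parallel_setup_cost +
           parallel_startup_cost * nworkers +
           serial_run_cost / nworkers +
           cpu_tuple_comm_cost * output_tuples;
}

/*
 * Pick the cheapest degree of parallelism up to max_workers, falling back
 * to 0 (a plain serial scan) when no parallel degree beats the serial run
 * cost.  A real implementation would want something smarter than trying
 * every degree, which is exactly the planning-time concern above.
 */
static int
choose_worker_count(double serial_run_cost, double output_tuples,
                    int max_workers)
{
    double best_cost = serial_run_cost;
    int    best_workers = 0;

    for (int w = 1; w <= max_workers; w++)
    {
        double cost = parallel_cost(serial_run_cost, output_tuples, w);

        if (cost < best_cost)
        {
            best_cost = cost;
            best_workers = w;
        }
    }
    return best_workers;
}

int
main(void)
{
    /* e.g. a scan with serial run cost 10000 returning 1000 tuples,
     * with 'max parallel' set to 32 */
    printf("workers chosen: %d\n",
           choose_worker_count(10000.0, 1000.0, 32));
    return 0;
}

Run against that example it settles on three workers even though 32 are
allowed, which is the sort of behaviour I'd like the optimizer to have.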

> Execution:
> Most other databases do a partition-level scan, with each parallel
> worker scanning a partition on a different disk.  However, it seems
> Amazon DynamoDB [2] also works on something similar to what I have
> used in the patch, which means fixed blocks.  I think this kind of
> strategy seems better than dividing the blocks at runtime, because
> dividing the blocks randomly among workers could lead to random
> access in what is meant to be a sequential scan.

Yeah, we also need to consider the i/o side of this, which will
definitely be tricky.  There are i/o systems out there which are faster
than a single CPU and ones where a single CPU can manage multiple i/o
channels.  There are also cases where the i/o system handles sequential
access nearly as fast as random and cases where sequential is much
faster than random.  One place we can get an idea of that distinction is
the seq_page_cost vs. random_page_cost ratio, since folks running on SSDs
tend to lower random_page_cost from the default to indicate it.
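
For what it's worth, keeping each worker's reads sequential is easy enough
if workers are handed fixed, contiguous block ranges.  Here is a toy sketch
of that division (not necessarily what your patch does; the function and
the rough BlockNumber typedef are just for illustration):

#include <stdio.h>

typedef unsigned int BlockNumber;   /* roughly as in PostgreSQL's block.h */

/*
 * Illustrative only: give worker "worker_id" (0-based) a contiguous range
 * [*start, *end) of the relation's nblocks, so that each worker's own
 * reads stay sequential even though the workers run in parallel.
 */
static void
worker_block_range(BlockNumber nblocks, int nworkers, int worker_id,
                   BlockNumber *start, BlockNumber *end)
{
    BlockNumber chunk = nblocks / nworkers;
    int         extra = nblocks % nworkers;   /* leftover blocks */

    /* spread the leftover blocks over the first "extra" workers */
    *start = worker_id * chunk + (worker_id < extra ? worker_id : extra);
    *end   = *start + chunk + (worker_id < extra ? 1 : 0);
}

int
main(void)
{
    BlockNumber start, end;

    /* e.g. a 1000-block relation split across 3 workers */
    for (int w = 0; w < 3; w++)
    {
        worker_block_range(1000, 3, w, &start, &end);
        printf("worker %d scans blocks [%u, %u)\n", w, start, end);
    }
    return 0;
}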

> Also, in whatever I have read (Oracle, DynamoDB), I find that most
> databases divide the work among workers and the master backend acts
> as coordinator, at least that's what I could understand.

Yeah, I agree that's more typical.  Robert's point that the master
backend should participate is interesting, but as I recall it was based
on the idea that the master could finish faster than the workers; if
that's the case then we've planned it out wrong from the beginning.

Thanks!
    Stephen
