Re: Parallel Seq Scan

From: Amit Kapila
Subject: Re: Parallel Seq Scan
Date:
Msg-id: CAA4eK1L0dk9D3hARoAb84v2pGvUw4B5YoS4x18ORQREwR+1VCg@mail.gmail.com
In response to: Re: Parallel Seq Scan  (Stephen Frost <sfrost@snowman.net>)
Responses: Re: Parallel Seq Scan  (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
On Fri, Dec 19, 2014 at 7:57 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
>
> There's certainly documentation available from the other RDBMS' which
> already support parallel query, as one source.  Other academic papers
> exist (and once you've linked into one, the references and prior work
> helps bring in others).  Sadly, I don't currently have ACM access (might
> have to change that..), but there are publicly available papers also,

I have gone through a couple of papers describing what some other
databases do for parallel sequential scans.  Here is a brief summary
of the same and how I am planning to handle it in the patch:

Costing:
Below is a summary of one of the papers [1] you suggested:
  a. Startup costs are negligible if processes can be reused
     rather than created afresh.
  b. Communication cost consists of the CPU cost of sending
     and receiving messages.
  c. Communication costs can exceed the cost of operators such
     as scanning, joining, or grouping.
These findings lead to the important conclusion that query
optimization should be concerned with communication costs
but not with startup costs.

In our case, as we currently don't have a mechanism to reuse parallel
workers, we need to account for that cost as well.  Based on that,
I am planning to add three new parameters: cpu_tuple_comm_cost,
parallel_setup_cost, and parallel_startup_cost.
*  cpu_tuple_comm_cost - Cost of CPU time to pass a tuple from a worker
   to the master backend, with default value DEFAULT_CPU_TUPLE_COMM_COST
   of 0.1; this will be multiplied by the number of tuples expected to
   be selected.
*  parallel_setup_cost - Cost of setting up shared memory for parallelism,
   with a default value of 100.0.
*  parallel_startup_cost - Cost of starting up parallel workers, with a
   default value of 1000.0, multiplied by the number of workers decided
   for the scan.

I will do some experiments to finalize the default values, but in general
I feel that developing the cost model on the above parameters is sound.
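
To make the above concrete, here is a rough sketch in C of how these
three parameters could combine into a plan cost.  This is purely
illustrative; the function name, the per-worker run cost argument, and
the exact formula are my own and not what the patch implements:

#include <stdio.h>

/* Sketch only: how the three proposed GUCs might combine in costing.
 * The defaults mirror the values proposed above; the formula itself is
 * an assumption about the intended shape, not the actual patch code. */
typedef double Cost;

static Cost cpu_tuple_comm_cost = 0.1;      /* per tuple sent to master */
static Cost parallel_setup_cost = 100.0;    /* shared-memory setup */
static Cost parallel_startup_cost = 1000.0; /* per worker started */

static Cost
parallel_seqscan_cost(Cost per_worker_run_cost, double tuples_returned,
                      int num_workers)
{
    /* startup: one-time shared-memory setup plus per-worker start cost */
    Cost startup = parallel_setup_cost +
                   parallel_startup_cost * num_workers;
    /* communication: every selected tuple is passed back to the master */
    Cost comm = cpu_tuple_comm_cost * tuples_returned;

    return startup + per_worker_run_cost + comm;
}

int
main(void)
{
    /* e.g. 8 workers, 1,000,000 tuples expected from the scan */
    printf("estimated cost: %.2f\n",
           parallel_seqscan_cost(50000.0, 1000000.0, 8));
    return 0;
}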

Execution:
Most other databases do a partition-level scan, with each individual
parallel worker scanning a partition on a different disk.  However,
it seems Amazon DynamoDB [2] works on something similar to what I
have used in the patch, i.e. on fixed blocks.  I think this kind of
strategy is better than dividing the blocks at runtime, because
handing out blocks randomly among workers could turn a parallel
sequential scan into a random scan.
Also, from whatever I have read (Oracle, DynamoDB), most databases
divide the work among workers and the master backend acts as a
coordinator, at least that's what I could understand.
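
As an illustration of the fixed-block idea, here is a small sketch in C
that splits a relation's blocks into contiguous chunks, one per worker.
Again this is just a sketch; the function and the contiguous-chunk
split are my own simplification, not the patch's actual block
allocation:

#include <stdio.h>

typedef unsigned int BlockNumber;

/* Assign worker `worker_id` the half-open block range [*start, *end),
 * dividing `nblocks` into contiguous chunks so each worker still reads
 * its blocks sequentially rather than at random. */
static void
assign_block_range(BlockNumber nblocks, int num_workers, int worker_id,
                   BlockNumber *start, BlockNumber *end)
{
    BlockNumber chunk = (nblocks + num_workers - 1) / num_workers;

    *start = chunk * worker_id;
    *end = (*start + chunk < nblocks) ? *start + chunk : nblocks;
}

int
main(void)
{
    BlockNumber start, end;
    int         nworkers = 4;

    for (int w = 0; w < nworkers; w++)
    {
        assign_block_range(1000, nworkers, w, &start, &end);
        printf("worker %d scans blocks [%u, %u)\n", w, start, end);
    }
    return 0;
}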

Let me know your opinion about the same.

I am planning to proceed with the above ideas to strengthen the patch,
in the absence of any objections or better ideas.
