Re: Parallel Sort

Поиск
Список
Период
Сортировка
От Claudio Freire
Тема Re: Parallel Sort
Дата
Msg-id CAGTBQpaXwDn6CX6mCrR2TfQ6vj+FyFwD2eWToE9axpwih7F6dg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Parallel Sort  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
On Wed, May 15, 2013 at 3:04 PM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, May 14, 2013 at 12:15:24PM -0300, Claudio Freire wrote:
>> You know what would be a low-hanging fruit that I've been thinking
>> would benefit many of my own queries?
>>
>> "Parallel" sequential scan nodes. Even if there's no real parallelism
>> involved, when a query has to scan the same table at multiple nodes,
>> if it's big, it would be worth parallelizing the scans to transform
>> them into synchro scans.
>>
>> I have absolutely no idea how this would work easily without forked
>> workers, because the scans might be buried in more complex execution
>> trees. But still, it's worth considering, that parallelizing may
>> benefit more than core usage.
>>
>> If execution nodes could be paused at arbitrary points, a "parallel
>> scan" node could pause one branch that has consumed the circular
>> buffer, letting another branches consume their part, and thus
>> "parallelizing" branch execution. But this would be perhaps more
>> complex than simply forking.
>
> Execution nodes do pause between every output tuple, at least nominally.
> Still, given the architecture of our executor and the planner work to
> implement such a thing, I would not classify it as low-hanging fruit.  It
> would primarily apply to a plan with independent sequential scans of the same
> large (relative to total memory) relation.  I'm sure that comes up, but it
> doesn't strike me as typical.


I found it rather typical of some of my workloads, but it could
probably not be the case globally.

It would be rather easier if it could pause without returning rows. I
think ATM, not returning any rows means the node is finished doing its
scan. The nodes that would have to be pausable like this wouldn't be
sequential scans, but sorts, hashes, and in general those that take a
long time to start returning rows.

So, a plan that goes like:
 Seq on A -> Sort -> Merge -> result Seq on A -> Sort --/

Would be turned into:
 Seq on A -> Step Sort -> Parallel Merge -> result Seq on A -> Step Sort --/

Or even maybe
 Seq on A -> Sort -> Tee X -> Parallel Merge                                    X --/

I think Tee and Parallel Merge should be doable with current
infrastructure, because they don't require pausing without returning
any tuples. Not sure how may meters above ground that is, or how many
gotchas might be involved. But it's been circling in my head for a
while.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Следующее
От: Tom Lane
Дата:
Сообщение: pg_dump versus defaults on foreign tables