Thread: CustomScan under the Gather node?


CustomScan under the Gather node?

From
Kouhei Kaigai
Date:
Hello,

What enhancements will be necessary to implement a feature similar to
partial seq-scan using the custom-scan interface?

It seems to me that callbacks at the three points below are needed.
* ExecParallelEstimate
* ExecParallelInitializeDSM
* ExecParallelInitializeWorker

Anything else?
Does ForeignScan also need equivalent enhancement?



The background of my motivation is in the slides below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(lightning-talk slides from the JPUG conference last December)

I'm investigating an SSD-to-GPU direct feature on top of the
custom-scan interface. It intends to load data blocks on an
NVMe-SSD to GPU RAM using peer-to-peer DMA, prior to any data
loading onto CPU/RAM. (Probably only all-visible blocks shall be
loaded, as in an index-only scan.)
Once we load the data blocks onto GPU RAM, we can filter out rows
there before they consume CPU RAM.
An expected major bottleneck is the CPU thread that issues the
peer-to-peer DMA requests to the device, rather than the GPU tasks.
So utilizing parallel execution is a natural thought.
However, a CustomScan node that takes an underlying PartialSeqScan
node is not sufficient, because that loads the data blocks onto
CPU RAM first, so P2P DMA does not make sense there.

The expected "GpuSsdScan" on CustomScan will reference a shared
block index incremented by multiple backends, then enqueue a P2P
DMA request (if the block is all-visible) to the device driver.
It then receives only the rows visible according to the scan
qualifiers. It is almost equivalent to SeqScan, but wants to bypass
the heap layer to utilize the SSD-to-GPU direct data transfer path.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: CustomScan under the Gather node?

From
Amit Kapila
Date:
On Tue, Jan 26, 2016 at 12:00 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>
> Hello,
>
> What enhancement will be necessary to implement similar feature of
> partial seq-scan using custom-scan interface?
>
> It seems to me callbacks on the three points below are needed.
> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker
>
> Anything else?

I don't think so.

> Does ForeignScan also need equivalent enhancement?

I think this depends on the way ForeignScan is supposed to be
parallelized; basically, if it needs to coordinate any information
with another set of workers, then it will require such an enhancement.




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: CustomScan under the Gather node?

From
Kouhei Kaigai
Date:
> -----Original Message-----
> From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> Sent: Wednesday, January 27, 2016 2:30 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] CustomScan under the Gather node?
> 
> On Tue, Jan 26, 2016 at 12:00 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >
> > Hello,
> >
> > What enhancement will be necessary to implement similar feature of
> > partial seq-scan using custom-scan interface?
> >
> > It seems to me callbacks on the three points below are needed.
> > * ExecParallelEstimate
> > * ExecParallelInitializeDSM
> > * ExecParallelInitializeWorker
> >
> > Anything else?
> 
> I don't think so.
> 
> > Does ForeignScan also need equivalent enhancement?
> 
> I think this depends on the way ForeignScan is supposed to be
> parallelized, basically if it needs to coordinate any information
> with other set of workers, then it will require such an enhancement.
>
After the post yesterday, I was reminded of a possible scenario around
FDWs that manage their own private storage, like cstore_fdw.

Probably, a ForeignScan node running on a columnar store (for example)
will need coordination information, just as partial seq-scan does.
That case is very similar to an implementation on local storage.

On the other hand, if we try to parallelize postgres_fdw (or others)
with background workers, I doubt whether we need this coordination
information on the local side. The remote query would instead carry an
additional qualifier to skip blocks already fetched.
At least, it does not need any special enhancement.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: CustomScan under the Gather node?

From
Robert Haas
Date:
On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> What enhancement will be necessary to implement similar feature of
> partial seq-scan using custom-scan interface?
>
> It seems to me callbacks on the three points below are needed.
> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker
>
> Anything else?
> Does ForeignScan also need equivalent enhancement?

For postgres_fdw, running the query from a parallel worker would
change the transaction semantics.  Suppose you begin a transaction,
UPDATE data on the foreign server, and then run a parallel query.  If
the leader performs the ForeignScan it will see the uncommitted
UPDATE, but a worker would have to make its own connection, which
would not be part of the same transaction and which would therefore
not see the update.  That's a problem.

Also, for postgres_fdw, and many other FDWs I suspect, the assumption
is that most of the work is being done on the remote side, so doing
the work in a parallel worker doesn't seem super interesting.  Instead
of incurring transfer costs to move the data from remote to local, we
incur two sets of transfer costs: first remote to local, then worker
to leader.  Ouch.  I think a more promising line of inquiry is to try
to provide asynchronous execution when we have something like:

Append
-> Foreign Scan
-> Foreign Scan

...so that we can return a row from whichever Foreign Scan receives
data back from the remote server first.

So it's not impossible that an FDW author could want this, but mostly
probably not.  I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: CustomScan under the Gather node?

From
Kouhei Kaigai
Date:
> On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > What enhancement will be necessary to implement similar feature of
> > partial seq-scan using custom-scan interface?
> >
> > It seems to me callbacks on the three points below are needed.
> > * ExecParallelEstimate
> > * ExecParallelInitializeDSM
> > * ExecParallelInitializeWorker
> >
> > Anything else?
> > Does ForeignScan also need equivalent enhancement?
> 
> For postgres_fdw, running the query from a parallel worker would
> change the transaction semantics.  Suppose you begin a transaction,
> UPDATE data on the foreign server, and then run a parallel query.  If
> the leader performs the ForeignScan it will see the uncommitted
> UPDATE, but a worker would have to make its own connection which not
> be part of the same transaction and which would therefore not see the
> update.  That's a problem.
>
Ah, yes. As long as the FDW driver ensures the remote session has no
uncommitted data, pg_export_snapshot() might provide us an opportunity;
however, once a session writes something, the FDW driver has to
prohibit it.

> Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> is that most of the work is being done on the remote side, so doing
> the work in a parallel worker doesn't seem super interesting.  Instead
> of incurring transfer costs to move the data from remote to local, we
> incur two sets of transfer costs: first remote to local, then worker
> to leader.  Ouch.  I think a more promising line of inquiry is to try
> to provide asynchronous execution when we have something like:
> 
> Append
> -> Foreign Scan
> -> Foreign Scan
> 
> ...so that we can return a row from whichever Foreign Scan receives
> data back from the remote server first.
> 
> So it's not impossible that an FDW author could want this, but mostly
> probably not.  I think.
>
Yes, I have the same opinion. Local parallelism is likely not
valuable for the class of FDWs that obtain data from a remote
server (e.g., postgres_fdw), except for the case where the packing
and unpacking cost over the network is the major bottleneck.

On the other hand, it will be valuable for the class of FDWs that
act as wrappers around local data structures, as partial seq-scan
currently does (e.g., file_fdw). Their data source is not under
transaction control, and the 'remote execution' of these FDWs is
eventually carried out on local computing resources.

If I make a proof-of-concept patch for the interface itself,
file_fdw seems a good candidate for this enhancement.
It is not an area for postgres_fdw.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: CustomScan under the Gather node?

From
Kouhei Kaigai
Date:
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
>
The attached patch is an enhancement of the FDW/CSP interface and a
PoC feature for file_fdw to scan its source file partially. It was a
smaller enhancement than I expected.

It works as follows. This query reads 20M rows from a CSV file,
using 3 background worker processes.

postgres=# set max_parallel_degree = 3;
SET
postgres=# explain analyze select * from test_csv where id % 20 = 6;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Gather  (cost=1000.00..194108.60 rows=94056 width=52)
         (actual time=0.570..19268.010 rows=2000000 loops=1)
   Number of Workers: 3
   ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52)
                                  (actual time=0.180..12744.655 rows=500000 loops=4)
         Filter: ((id % 20) = 6)
         Rows Removed by Filter: 9500000
         Foreign File: /tmp/testdata.csv
         Foreign File Size: 1504892535
 Planning time: 0.147 ms
 Execution time: 19330.201 ms
(9 rows)


I'm not 100% certain whether this implementation of file_fdw is
reasonable for partial reads; however, the callbacks located at the
following functions made it possible to implement parallel-aware
custom logic based on the coordination information.

> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平)
> Sent: Thursday, January 28, 2016 9:33 AM
> To: 'Robert Haas'
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
> 
> > On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > > What enhancement will be necessary to implement similar feature of
> > > partial seq-scan using custom-scan interface?
> > >
> > > It seems to me callbacks on the three points below are needed.
> > > * ExecParallelEstimate
> > > * ExecParallelInitializeDSM
> > > * ExecParallelInitializeWorker
> > >
> > > Anything else?
> > > Does ForeignScan also need equivalent enhancement?
> >
> > For postgres_fdw, running the query from a parallel worker would
> > change the transaction semantics.  Suppose you begin a transaction,
> > UPDATE data on the foreign server, and then run a parallel query.  If
> > the leader performs the ForeignScan it will see the uncommitted
> > UPDATE, but a worker would have to make its own connection which not
> > be part of the same transaction and which would therefore not see the
> > update.  That's a problem.
> >
> Ah, yes, as long as FDW driver ensure the remote session has no
> uncommitted data, pg_export_snapshot() might provide us an opportunity,
> however, once a session writes something, FDW driver has to prohibit it.
> 
> > Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> > is that most of the work is being done on the remote side, so doing
> > the work in a parallel worker doesn't seem super interesting.  Instead
> > of incurring transfer costs to move the data from remote to local, we
> > incur two sets of transfer costs: first remote to local, then worker
> > to leader.  Ouch.  I think a more promising line of inquiry is to try
> > to provide asynchronous execution when we have something like:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> >
> > ...so that we can return a row from whichever Foreign Scan receives
> > data back from the remote server first.
> >
> > So it's not impossible that an FDW author could want this, but mostly
> > probably not.  I think.
> >
> Yes, I also have same opinion. Likely, local parallelism is not
> valuable for the class of FDWs that obtains data from the remote
> server (e.g, postgres_fdw, ...), expect for the case when packing
> and unpacking cost over the network is major bottleneck.
> 
> On the other hands, it will be valuable for the class of FDW that
> performs as a wrapper to local data structure, as like current
> partial seq-scan doing. (e.g, file_fdw, ...)
> Its data source is not under the transaction control, and 'remote
> execution' of these FDWs are eventually executed on the local
> computing resources.
> 
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>


Attachments

Re: CustomScan under the Gather node?

From
Robert Haas
Date:
On Thu, Jan 28, 2016 at 10:50 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> If I would make a proof-of-concept patch with interface itself, it
>> seems to me file_fdw may be a good candidate for this enhancement.
>> It is not a field for postgres_fdw.
>>
> The attached patch is enhancement of FDW/CSP interface and PoC feature
> of file_fdw to scan source file partially. It was smaller enhancement
> than my expectations.
>
> It works as follows. This query tried to read 20M rows from a CSV file,
> using 3 background worker processes.
>
> postgres=# set max_parallel_degree = 3;
> SET
> postgres=# explain analyze select * from test_csv where id % 20 = 6;
>                                   QUERY PLAN
> --------------------------------------------------------------------------------
>  Gather  (cost=1000.00..194108.60 rows=94056 width=52)
>          (actual time=0.570..19268.010 rows=2000000 loops=1)
>    Number of Workers: 3
>    ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52)
>                                   (actual time=0.180..12744.655 rows=500000 loops=4)
>          Filter: ((id % 20) = 6)
>          Rows Removed by Filter: 9500000
>          Foreign File: /tmp/testdata.csv
>          Foreign File Size: 1504892535
>  Planning time: 0.147 ms
>  Execution time: 19330.201 ms
> (9 rows)

Could you try it not in parallel and then with 1, 2, 3, and 4 workers
and post the times for all?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: CustomScan under the Gather node?

From
Kouhei Kaigai
Date:
> On Thu, Jan 28, 2016 at 10:50 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >> If I would make a proof-of-concept patch with interface itself, it
> >> seems to me file_fdw may be a good candidate for this enhancement.
> >> It is not a field for postgres_fdw.
> >>
> > The attached patch is enhancement of FDW/CSP interface and PoC feature
> > of file_fdw to scan source file partially. It was smaller enhancement
> > than my expectations.
> >
> > It works as follows. This query tried to read 20M rows from a CSV file,
> > using 3 background worker processes.
> >
> > postgres=# set max_parallel_degree = 3;
> > SET
> > postgres=# explain analyze select * from test_csv where id % 20 = 6;
> >                                   QUERY PLAN
> > --------------------------------------------------------------------------------
> >  Gather  (cost=1000.00..194108.60 rows=94056 width=52)
> >          (actual time=0.570..19268.010 rows=2000000 loops=1)
> >    Number of Workers: 3
> >    ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52)
> >                                   (actual time=0.180..12744.655 rows=500000 loops=4)
> >          Filter: ((id % 20) = 6)
> >          Rows Removed by Filter: 9500000
> >          Foreign File: /tmp/testdata.csv
> >          Foreign File Size: 1504892535
> >  Planning time: 0.147 ms
> >  Execution time: 19330.201 ms
> > (9 rows)
> 
> Could you try it not in parallel and then with 1, 2, 3, and 4 workers
> and post the times for all?
>
The above query has 5% selectivity on the entire CSV file.
Its execution times (total, and ForeignScan only) are below:

             total         ForeignScan        diff
0 workers: 17584.319 ms   17555.904 ms      28.415 ms
1 workers: 18464.476 ms   18110.968 ms     353.508 ms
2 workers: 19042.755 ms   14580.335 ms    4462.420 ms
3 workers: 19318.254 ms   12668.912 ms    6649.342 ms
4 workers: 21732.910 ms   13596.788 ms    8136.122 ms
5 workers: 23486.846 ms   14533.409 ms    8953.437 ms

This workstation has 4 CPU cores, so it is natural that nworkers=3
records the peak performance on the ForeignScan portion. On the other
hand, nworkers>1 also recorded non-negligible time consumption
elsewhere (probably in the Gather node?).

An interesting observation was that lower selectivity (1% and 0%)
didn't change the result much. Something other than file_fdw is
consuming CPU time.

* selectivity 1%
               total       ForeignScan       diff
0 workers: 17573.572 ms   17566.875 ms      6.697 ms
1 workers: 18098.070 ms   18020.790 ms     77.280 ms
2 workers: 18676.078 ms   14600.749 ms   4075.329 ms
3 workers: 18830.597 ms   12731.459 ms   6099.138 ms
4 workers: 21015.842 ms   13590.657 ms   7425.185 ms
5 workers: 22865.496 ms   14634.342 ms   8231.154 ms

* selectivity 0% (...so Gather didn't work hard actually)
              total        ForeignScan       diff
0 workers: 17551.011 ms   17550.811 ms      0.200 ms
1 workers: 18055.185 ms   18048.975 ms      6.210 ms
2 workers: 18567.660 ms   14593.974 ms   3973.686 ms
3 workers: 18649.819 ms   12671.429 ms   5978.390 ms
4 workers: 20619.184 ms   13606.715 ms   7012.469 ms
5 workers: 22557.575 ms   14594.420 ms   7963.155 ms

Further investigation will be needed...

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Attachments

Re: CustomScan under the Gather node?

From
Kouhei Kaigai
Date:
>              total         ForeignScan        diff
> 0 workers: 17584.319 ms   17555.904 ms      28.415 ms
> 1 workers: 18464.476 ms   18110.968 ms     353.508 ms
> 2 workers: 19042.755 ms   14580.335 ms    4462.420 ms
> 3 workers: 19318.254 ms   12668.912 ms    6649.342 ms
> 4 workers: 21732.910 ms   13596.788 ms    8136.122 ms
> 5 workers: 23486.846 ms   14533.409 ms    8953.437 ms
> 
> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
> recorded unignorable time consumption (probably, by Gather node?)
  :
> Further investigation will need....
>
It was a bug in my file_fdw patch. The ForeignScan node in the master
process was also kicked by the Gather node; however, it didn't have the
coordination information, due to an oversight in the initialization at
the InitializeDSMForeignScan callback.
As a result, the local ForeignScan node was still executed after the
coordinated background worker processes completed, and returned twice
the number of rows.

With the revised patch, the results seem reasonable to me:
             total         ForeignScan      diff
0 workers: 17592.498 ms   17564.457 ms     28.041ms
1 workers: 12152.998 ms   11983.485 ms    169.513 ms
2 workers: 10647.858 ms   10502.100 ms    145.758 ms
3 workers:  9635.445 ms    9509.899 ms    125.546 ms
4 workers: 11175.456 ms   10863.293 ms    312.163 ms
5 workers: 12586.457 ms   12279.323 ms    307.134 ms

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Friday, January 29, 2016 8:51 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
> 
> > On Thu, Jan 28, 2016 at 10:50 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > >> If I would make a proof-of-concept patch with interface itself, it
> > >> seems to me file_fdw may be a good candidate for this enhancement.
> > >> It is not a field for postgres_fdw.
> > >>
> > > The attached patch is enhancement of FDW/CSP interface and PoC feature
> > > of file_fdw to scan source file partially. It was smaller enhancement
> > > than my expectations.
> > >
> > > It works as follows. This query tried to read 20M rows from a CSV file,
> > > using 3 background worker processes.
> > >
> > > postgres=# set max_parallel_degree = 3;
> > > SET
> > > postgres=# explain analyze select * from test_csv where id % 20 = 6;
> > >                                   QUERY PLAN
> > > --------------------------------------------------------------------------------
> > >  Gather  (cost=1000.00..194108.60 rows=94056 width=52)
> > >          (actual time=0.570..19268.010 rows=2000000 loops=1)
> > >    Number of Workers: 3
> > >    ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52)
> > >                                   (actual time=0.180..12744.655 rows=500000 loops=4)
> > >          Filter: ((id % 20) = 6)
> > >          Rows Removed by Filter: 9500000
> > >          Foreign File: /tmp/testdata.csv
> > >          Foreign File Size: 1504892535
> > >  Planning time: 0.147 ms
> > >  Execution time: 19330.201 ms
> > > (9 rows)
> >
> > Could you try it not in parallel and then with 1, 2, 3, and 4 workers
> > and post the times for all?
> >
> The above query has 5% selectivity on the entire CSV file.
> Its execution time (total, only ForeignScan) are below
> 
>              total         ForeignScan        diff
> 0 workers: 17584.319 ms   17555.904 ms      28.415 ms
> 1 workers: 18464.476 ms   18110.968 ms     353.508 ms
> 2 workers: 19042.755 ms   14580.335 ms    4462.420 ms
> 3 workers: 19318.254 ms   12668.912 ms    6649.342 ms
> 4 workers: 21732.910 ms   13596.788 ms    8136.122 ms
> 5 workers: 23486.846 ms   14533.409 ms    8953.437 ms
> 
> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
> recorded unignorable time consumption (probably, by Gather node?)
> 
> An interesting observation was, less selectivity (1% and 0%) didn't change the
> result so much. Something consumes CPU time other than file_fdw.
> 
> * selectivity 1%
>                total       ForeignScan       diff
> 0 workers: 17573.572 ms   17566.875 ms      6.697 ms
> 1 workers: 18098.070 ms   18020.790 ms     77.280 ms
> 2 workers: 18676.078 ms   14600.749 ms   4075.329 ms
> 3 workers: 18830.597 ms   12731.459 ms   6099.138 ms
> 4 workers: 21015.842 ms   13590.657 ms   7425.185 ms
> 5 workers: 22865.496 ms   14634.342 ms   8231.154 ms
> 
> * selectivity 0% (...so Gather didn't work hard actually)
>               total        ForeignScan       diff
> 0 workers: 17551.011 ms   17550.811 ms      0.200 ms
> 1 workers: 18055.185 ms   18048.975 ms      6.210 ms
> 2 workers: 18567.660 ms   14593.974 ms   3973.686 ms
> 3 workers: 18649.819 ms   12671.429 ms   5978.390 ms
> 4 workers: 20619.184 ms   13606.715 ms   7012.469 ms
> 5 workers: 22557.575 ms   14594.420 ms   7963.155 ms
> 
> Further investigation will need....
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>


Attachments

Re: CustomScan under the Gather node?

From
Robert Haas
Date:
On Thu, Jan 28, 2016 at 8:14 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>>              total         ForeignScan        diff
>> 0 workers: 17584.319 ms   17555.904 ms      28.415 ms
>> 1 workers: 18464.476 ms   18110.968 ms     353.508 ms
>> 2 workers: 19042.755 ms   14580.335 ms    4462.420 ms
>> 3 workers: 19318.254 ms   12668.912 ms    6649.342 ms
>> 4 workers: 21732.910 ms   13596.788 ms    8136.122 ms
>> 5 workers: 23486.846 ms   14533.409 ms    8953.437 ms
>>
>> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
>> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
>> recorded unignorable time consumption (probably, by Gather node?)
>   :
>> Further investigation will need....
>>
> It was a bug of my file_fdw patch. ForeignScan node in the master process was
> also kicked by the Gather node, however, it didn't have coordinate information
> due to oversight of the initialization at InitializeDSMForeignScan callback.
> In the result, local ForeignScan node is still executed after the completion
> of coordinated background worker processes, and returned twice amount of rows.
>
> In the revised patch, results seems to me reasonable.
>              total         ForeignScan      diff
> 0 workers: 17592.498 ms   17564.457 ms     28.041ms
> 1 workers: 12152.998 ms   11983.485 ms    169.513 ms
> 2 workers: 10647.858 ms   10502.100 ms    145.758 ms
> 3 workers:  9635.445 ms    9509.899 ms    125.546 ms
> 4 workers: 11175.456 ms   10863.293 ms    312.163 ms
> 5 workers: 12586.457 ms   12279.323 ms    307.134 ms

Hmm.  Is the file_fdw part of this just a demo, or do you want to try
to get that committed?  If so, maybe start a new thread with a more
appropriate subject line to just talk about that.  I haven't
scrutinized that part of the patch in any detail, but the general
infrastructure for FDWs and custom scans to use parallelism seems to
be in good shape, so I rewrote the documentation and committed that
part.

Do you have any idea why this isn't scaling beyond, uh, 1 worker?
That seems like a good thing to try to figure out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: CustomScan under the Gather node?

From
Kouhei Kaigai
Date:
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Thursday, February 04, 2016 2:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] CustomScan under the Gather node?
> 
> On Thu, Jan 28, 2016 at 8:14 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >>              total         ForeignScan        diff
> >> 0 workers: 17584.319 ms   17555.904 ms      28.415 ms
> >> 1 workers: 18464.476 ms   18110.968 ms     353.508 ms
> >> 2 workers: 19042.755 ms   14580.335 ms    4462.420 ms
> >> 3 workers: 19318.254 ms   12668.912 ms    6649.342 ms
> >> 4 workers: 21732.910 ms   13596.788 ms    8136.122 ms
> >> 5 workers: 23486.846 ms   14533.409 ms    8953.437 ms
> >>
> >> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> >> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
> >> recorded unignorable time consumption (probably, by Gather node?)
> >   :
> >> Further investigation will need....
> >>
> > It was a bug of my file_fdw patch. ForeignScan node in the master process was
> > also kicked by the Gather node, however, it didn't have coordinate information
> > due to oversight of the initialization at InitializeDSMForeignScan callback.
> > In the result, local ForeignScan node is still executed after the completion
> > of coordinated background worker processes, and returned twice amount of rows.
> >
> > In the revised patch, results seems to me reasonable.
> >              total         ForeignScan      diff
> > 0 workers: 17592.498 ms   17564.457 ms     28.041ms
> > 1 workers: 12152.998 ms   11983.485 ms    169.513 ms
> > 2 workers: 10647.858 ms   10502.100 ms    145.758 ms
> > 3 workers:  9635.445 ms    9509.899 ms    125.546 ms
> > 4 workers: 11175.456 ms   10863.293 ms    312.163 ms
> > 5 workers: 12586.457 ms   12279.323 ms    307.134 ms
> 
> Hmm.  Is the file_fdw part of this just a demo, or do you want to try
> to get that committed?  If so, maybe start a new thread with a more
> appropriate subject line to just talk about that.  I haven't
> scrutinized that part of the patch in any detail, but the general
> infrastructure for FDWs and custom scans to use parallelism seems to
> be in good shape, so I rewrote the documentation and committed that
> part.
>
Thanks. The file_fdw part is intended just as a demonstration.
Unlike GpuScan of PG-Strom, it does not require any special hardware
to reproduce this parallel execution.

> Do you have any idea why this isn't scaling beyond, uh, 1 worker?
> That seems like a good thing to try to figure out.
>
The hardware I ran the above query on has 4 CPU cores, so it is not
surprising that 3 workers (+ 1 master) recorded the peak performance.

In addition, the file_fdw enhancement is a corner-cutting piece of work.

Each worker picks up the next line number to be fetched from the
shared memory segment using pg_atomic_add_fetch_u32(), then reads the
input file until it reaches the target line; unrelated lines are
skipped. Each worker parses only the lines it is responsible for,
so parallel execution makes sense in that part. On the other hand,
the total amount of CPU cycles spent scanning the file increases,
because every worker at least has to read through all the lines.

If we simply split the time consumed in the 0-worker case into
(time to scan the file: TSF) + (time to parse lines: TPL),
then the total amount of work when we distribute file_fdw across
N workers is:
 N * (TSF) + (TPL)

Thus, each individual worker has to process:
 (TSF) + (TPL)/N

It is a typical instance of Amdahl's law when the sequential part is
not small. The above results suggest the TSF part is about 7.4s and
the TPL part about 10.1s.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>