Discussion: Parallel Seq Scan vs kernel read ahead

Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
Hello hackers,

Parallel sequential scan relies on the kernel detecting sequential
access, but we don't make the job easy.  The resulting striding
pattern works terribly on strict next-block systems like FreeBSD UFS,
and degrades rapidly when you add too many workers on sliding window
systems like Linux.

Demonstration using FreeBSD on UFS on a virtual machine, taking ball
park figures from iostat:

  create table t as select generate_series(1, 200000000)::int i;

  set max_parallel_workers_per_gather = 0;
  select count(*) from t;
  -> execution time 13.3s, average read size = ~128kB, ~500MB/s

  set max_parallel_workers_per_gather = 1;
  select count(*) from t;
  -> execution time 24.9s, average read size = ~32kB, ~250MB/s

Note the small read size, which means that there was no read
clustering happening at all: that's the logical block size of this
filesystem.

That explains some complaints I've heard about PostgreSQL performance
on that filesystem: parallel query destroys I/O performance.

As a quick experiment, I tried teaching the block allocator to
allocate ranges of up to 64 blocks at a time, ramping up incrementally,
and ramping down at the end, and I got:

  set max_parallel_workers_per_gather = 1;
  select count(*) from t;
  -> execution time 7.5s, average read size = ~128kB, ~920MB/s

  set max_parallel_workers_per_gather = 3;
  select count(*) from t;
  -> execution time 5.2s, average read size = ~128kB, ~1.2GB/s

I've attached the quick and dirty patch I used for that.
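
For readers who don't want to dig into the attached patch, here is a
minimal, self-contained sketch of the idea, not the actual patch: the
names (chunk_state, claim_chunk, MAX_CHUNK) are invented, C11 atomics
stand in for PostgreSQL's atomics, and the ramp-up/clamping policy is
only illustrative.  Each worker atomically claims a contiguous range of
blocks, the requested range grows towards a cap, and the final claim is
clamped to whatever remains.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_CHUNK 64                /* cap on blocks handed out per claim */

typedef struct chunk_state
{
    _Atomic uint64_t next_block;    /* shared: next unallocated block */
    uint64_t nblocks;               /* total blocks in the relation */
} chunk_state;

/*
 * Claim the next range of blocks for one worker.  Returns false when the
 * scan is finished; otherwise sets *first and *count.  The requested size
 * ramps up from 1 towards MAX_CHUNK (per worker), and the final claim is
 * clamped to the blocks that actually remain.
 */
static bool
claim_chunk(chunk_state *cs, uint64_t *ramp, uint64_t *first, uint64_t *count)
{
    uint64_t want = *ramp;
    uint64_t start = atomic_fetch_add(&cs->next_block, want);

    if (start >= cs->nblocks)
        return false;               /* no blocks left */

    *first = start;
    *count = (want < cs->nblocks - start) ? want : cs->nblocks - start;

    if (*ramp < MAX_CHUNK)
        (*ramp)++;                  /* incremental ramp-up */
    return true;
}

int
main(void)
{
    chunk_state cs = { .nblocks = 1000 };
    uint64_t ramp = 1, first, count, total = 0;

    atomic_init(&cs.next_block, 0);
    while (claim_chunk(&cs, &ramp, &first, &count))
        total += count;             /* a real worker would scan these blocks */
    printf("claimed %llu blocks in total\n", (unsigned long long) total);
    return 0;
}

The real patch instead works inside PostgreSQL's block allocator for
parallel scans (table_block_parallelscan_nextpage(), which comes up later
in the thread), with the shared counter kept in the parallel scan's
shared memory rather than a standalone structure like this.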

Attachments

Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Wed, May 20, 2020 at 7:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Hello hackers,
>
> Parallel sequential scan relies on the kernel detecting sequential
> access, but we don't make the job easy.  The resulting striding
> pattern works terribly on strict next-block systems like FreeBSD UFS,
> and degrades rapidly when you add too many workers on sliding window
> systems like Linux.
>
> Demonstration using FreeBSD on UFS on a virtual machine, taking ball
> park figures from iostat:
>
>   create table t as select generate_series(1, 200000000)::int i;
>
>   set max_parallel_workers_per_gather = 0;
>   select count(*) from t;
>   -> execution time 13.3s, average read size = ~128kB, ~500MB/s
>
>   set max_parallel_workers_per_gather = 1;
>   select count(*) from t;
>   -> execution time 24.9s, average read size = ~32kB, ~250MB/s
>
> Note the small read size, which means that there was no read
> clustering happening at all: that's the logical block size of this
> filesystem.
>
> That explains some complaints I've heard about PostgreSQL performance
> on that filesystem: parallel query destroys I/O performance.
>
> As a quick experiment, I tried teaching the block allocator to
> allocate ranges of up to 64 blocks at a time, ramping up incrementally,
> and ramping down at the end, and I got:
>

Good experiment.  IIRC, we have discussed a similar idea during the
development of this feature but we haven't seen any better results by
allocating in ranges on the systems we have tried.  So, we went with
the current approach which is more granular and seems to allow better
parallelism.  I feel we need to ensure that we don't regress
parallelism in existing cases, otherwise, the idea sounds promising to
me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Wed, May 20, 2020 at 2:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Good experiment.  IIRC, we have discussed a similar idea during the
> development of this feature but we haven't seen any better results by
> allocating in ranges on the systems we have tried.  So, we went with
> the current approach which is more granular and seems to allow better
> parallelism.  I feel we need to ensure that we don't regress
> parallelism in existing cases, otherwise, the idea sounds promising to
> me.

Yeah, Linux seems to do pretty well at least with smallish numbers of
workers, and when you use large numbers you can probably tune your way
out of the problem.  ZFS seems to do fine.  I wonder how well the
other OSes cope.



Re: Parallel Seq Scan vs kernel read ahead

From: Ranier Vilela
Date:
On Wed, May 20, 2020 at 00:09, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, May 20, 2020 at 2:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Good experiment.  IIRC, we have discussed a similar idea during the
>> development of this feature but we haven't seen any better results by
>> allocating in ranges on the systems we have tried.  So, we went with
>> the current approach which is more granular and seems to allow better
>> parallelism.  I feel we need to ensure that we don't regress
>> parallelism in existing cases, otherwise, the idea sounds promising to
>> me.
>
> Yeah, Linux seems to do pretty well at least with smallish numbers of
> workers, and when you use large numbers you can probably tune your way
> out of the problem.  ZFS seems to do fine.  I wonder how well the
> other OSes cope.

Windows 10 (64-bit, i5, 8GB, SSD)

postgres=# set max_parallel_workers_per_gather = 0;
SET
Time: 2,537 ms
postgres=#  select count(*) from t;
   count
-----------
 200000000
(1 row)


Time: 47767,916 ms (00:47,768)
postgres=# set max_parallel_workers_per_gather = 1;
SET
Time: 4,889 ms
postgres=#  select count(*) from t;
   count
-----------
 200000000
(1 row)


Time: 32645,448 ms (00:32,645)

How do you display " -> execution time 5.2s, average read size ="?

regards,
Ranier VIlela

Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Wed, May 20, 2020 at 11:03 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
> Time: 47767,916 ms (00:47,768)
> Time: 32645,448 ms (00:32,645)

Just to make sure kernel caching isn't helping here, maybe try making
the table 2x or 4x bigger?  My test was on a virtual machine with only
4GB RAM, so the table couldn't be entirely cached.

> How do you display " -> execution time 5.2s, average read size ="?

Execution time is what you showed, and average read size should be
inside the Windows performance window somewhere (not sure what it's
called).



Re: Parallel Seq Scan vs kernel read ahead

From: Ranier Vilela
Date:
On Wed, May 20, 2020 at 18:49, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, May 20, 2020 at 11:03 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
>> Time: 47767,916 ms (00:47,768)
>> Time: 32645,448 ms (00:32,645)
>
> Just to make sure kernel caching isn't helping here, maybe try making
> the table 2x or 4x bigger?  My test was on a virtual machine with only
> 4GB RAM, so the table couldn't be entirely cached.

4x bigger.
Postgres defaults settings.

postgres=# create table t as select generate_series(1, 800000000)::int i;
SELECT 800000000
postgres=# \timing
Timing is on.
postgres=# set max_parallel_workers_per_gather = 0;
SET
Time: 8,622 ms
postgres=# select count(*) from t;
   count
-----------
 800000000
(1 row)


Time: 227238,445 ms (03:47,238)
postgres=# set max_parallel_workers_per_gather = 1;
SET
Time: 20,975 ms
postgres=# select count(*) from t;
   count
-----------
 800000000
(1 row)


Time: 138027,351 ms (02:18,027)

regards,
Ranier Vilela

Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Thu, May 21, 2020 at 11:15 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> postgres=# set max_parallel_workers_per_gather = 0;
> Time: 227238,445 ms (03:47,238)
> postgres=# set max_parallel_workers_per_gather = 1;
> Time: 138027,351 ms (02:18,027)

Ok, so it looks like NT/NTFS isn't suffering from this problem.
Thanks for testing!



Re: Parallel Seq Scan vs kernel read ahead

From: Ranier Vilela
Date:
On Wed, May 20, 2020 at 20:48, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, May 21, 2020 at 11:15 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
>> postgres=# set max_parallel_workers_per_gather = 0;
>> Time: 227238,445 ms (03:47,238)
>> postgres=# set max_parallel_workers_per_gather = 1;
>> Time: 138027,351 ms (02:18,027)
>
> Ok, so it looks like NT/NTFS isn't suffering from this problem.
> Thanks for testing!

Maybe it wasn’t clear, the tests were done with your patch applied.

regards,
Ranier Vilela

Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Thu, May 21, 2020 at 11:51 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
> On Wed, May 20, 2020 at 20:48, Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Thu, May 21, 2020 at 11:15 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
>> > postgres=# set max_parallel_workers_per_gather = 0;
>> > Time: 227238,445 ms (03:47,238)
>> > postgres=# set max_parallel_workers_per_gather = 1;
>> > Time: 138027,351 ms (02:18,027)
>>
>> Ok, so it looks like NT/NTFS isn't suffering from this problem.
>> Thanks for testing!
>
> Maybe it wasn’t clear, the tests were done with your patch applied.

Oh!  And how do the times look without it?



Re: Parallel Seq Scan vs kernel read ahead

From: Ranier Vilela
Date:
On Wed, May 20, 2020 at 21:03, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, May 21, 2020 at 11:51 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
>> On Wed, May 20, 2020 at 20:48, Thomas Munro <thomas.munro@gmail.com> wrote:
>>> On Thu, May 21, 2020 at 11:15 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
>>>> postgres=# set max_parallel_workers_per_gather = 0;
>>>> Time: 227238,445 ms (03:47,238)
>>>> postgres=# set max_parallel_workers_per_gather = 1;
>>>> Time: 138027,351 ms (02:18,027)
>>>
>>> Ok, so it looks like NT/NTFS isn't suffering from this problem.
>>> Thanks for testing!
>>
>> Maybe it wasn’t clear, the tests were done with your patch applied.
>
> Oh!  And how do the times look without it?

Vanilla Postgres (latest)

create table t as select generate_series(1, 800000000)::int i;
 set max_parallel_workers_per_gather = 0;
Time: 210524,317 ms (03:30,524)
set max_parallel_workers_per_gather = 1;
Time: 146982,737 ms (02:26,983)

regards,
Ranier Vilela

Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Thu, May 21, 2020 at 1:38 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >> On Thu, May 21, 2020 at 11:15 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >> > postgres=# set max_parallel_workers_per_gather = 0;
>> >> > Time: 227238,445 ms (03:47,238)
>> >> > postgres=# set max_parallel_workers_per_gather = 1;
>> >> > Time: 138027,351 ms (02:18,027)

> Vanilla Postgres (latest)
>
> create table t as select generate_series(1, 800000000)::int i;
>  set max_parallel_workers_per_gather = 0;
> Time: 210524,317 ms (03:30,524)
> set max_parallel_workers_per_gather = 1;
> Time: 146982,737 ms (02:26,983)

Thanks.  So it seems like Linux, Windows and anything using ZFS are
OK, which probably explains why we hadn't heard complaints about it.



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Thu, 21 May 2020 at 14:32, Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks.  So it seems like Linux, Windows and anything using ZFS are
> OK, which probably explains why we hadn't heard complaints about it.

I tried out a different test on a Windows 8.1 machine I have here.  I
was concerned that the test that was used here ends up with tuples
that are too narrow and that the executor would spend quite a bit of
time going between nodes and performing the actual aggregation.  I
thought it might be good to add some padding so that there are far
fewer tuples on the page.

I ended up with:

create table t (a int, b text);
-- create a table of 100GB in size.
insert into t select x,md5(x::text) from
generate_series(1,1000000*1572.7381809)x; -- took 1 hr 18 mins
vacuum freeze t;

query = select count(*) from t;
Disk = Samsung SSD 850 EVO mSATA 1TB.

Master:
workers = 0 : Time: 269104.281 ms (04:29.104)  380MB/s
workers = 1 : Time: 741183.646 ms (12:21.184)  138MB/s
workers = 2 : Time: 656963.754 ms (10:56.964)  155MB/s

Patched:

workers = 0 : Should be the same as before as the code for this didn't change.
workers = 1 : Time: 300299.364 ms (05:00.299) 340MB/s
workers = 2 : Time: 270213.726 ms (04:30.214) 379MB/s

(A better query would likely have been just: SELECT * FROM t WHERE a =
1; but I'd run the test by the time I thought of that.)

So, this shows that Windows, at least 8.1, does suffer from this too.

For the patch. I know you just put it together quickly, but I don't
think you can do that ramp up the way you have. It looks like there's
a risk of torn reads and torn writes and I'm unsure how much that
could affect the test results here. It looks like there's a risk that
a worker gets some garbage number of pages to read rather than what
you think it will. I also don't quite understand the need for a
ramp-up in pages per serving. Shouldn't you instantly start at some
size and hold that, then only maybe ramp down at the end so that
workers all finish at close to the same time?  However, I did have
other ideas which I'll explain below.

From my previous work on that function to add the atomics, I did think
that it would be better to dish out more than 1 page at a time.
However, there is the risk that the workload is not evenly distributed
between the workers.  My thoughts were that we could divide the total
pages by the number of workers then again by 100 and dish out blocks
based on that. That way workers will get about 100th of their fair
share of pages at once, so assuming there's an even amount of work to
do per serving of pages, then the last worker should only run on at
most 1% longer.  Perhaps that 100 should be 1000, then the run on time
for the last worker is just 0.1%.  Perhaps the serving size can also
be capped at some maximum like 64. We'll certainly need to ensure it's
at least 1!   I imagine that will eliminate the need for any ramp down
of pages per serving near the end of the scan.
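
As a rough illustration of that arithmetic, here is a hedged sketch; the
function name is invented, and the divisor and cap are just the numbers
being discussed, not a concrete proposal:

#include <stdint.h>

/*
 * Serving-size idea sketched above: each serving is roughly 1/100th (or
 * 1/1000th) of a worker's fair share of the relation, capped at some
 * maximum such as 64 blocks and never less than 1.
 */
static uint64_t
serving_size(uint64_t total_blocks, int nworkers, int divisor, uint64_t cap)
{
    uint64_t serving = total_blocks / nworkers / divisor;

    if (serving > cap)
        serving = cap;
    if (serving < 1)
        serving = 1;            /* always hand out at least one block */
    return serving;
}

With divisor = 100 and cap = 64, the 100GB table used in the test above
(about 13.1 million 8kB blocks) always hits the cap, so every serving is
64 blocks; the divisor only starts to matter for much smaller relations
or much larger caps.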

David



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Thu, 21 May 2020 at 17:06, David Rowley <dgrowleyml@gmail.com> wrote:
> For the patch. I know you just put it together quickly, but I don't
> think you can do that ramp up the way you have. It looks like there's
> a risk of torn reads and torn writes and I'm unsure how much that
> could affect the test results here.

Oops. On closer inspection, I see that memory is per worker, not
global to the scan.



Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Fri, May 22, 2020 at 10:00 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 21 May 2020 at 17:06, David Rowley <dgrowleyml@gmail.com> wrote:
> > For the patch. I know you just put it together quickly, but I don't
> > think you can do that ramp up the way you have. It looks like there's
> > a risk of torn reads and torn writes and I'm unsure how much that
> > could affect the test results here.
>
> Oops. On closer inspection, I see that memory is per worker, not
> global to the scan.

Right, I think it's safe.  I think you were probably right that
ramp-up isn't actually useful though, it's only the end of the scan
that requires special treatment so we don't get unfair allocation as
the work runs out, due to coarse grain.  I suppose that even if you
have a scheme that falls back to fine grained allocation for the final
N pages, it's still possible that a highly distracted process (most
likely the leader given its double duties) can finish up sitting on a
large range of pages and eventually have to process them all at the
end after the other workers have already knocked off and gone for a
pint.
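
For what it's worth, the fallback scheme being described could be as
simple as the following sketch (the names and numbers are invented); the
point above is that even this doesn't help if a worker already holds a
large range and then stalls:

#include <stdint.h>

/*
 * Use a large chunk while plenty of the relation remains, then drop back
 * to one block at a time for the final TAIL_PAGES so the last pieces of
 * work are handed out fairly.
 */
#define BIG_CHUNK   64
#define TAIL_PAGES  1024

static uint64_t
chunk_size_for(uint64_t remaining_blocks)
{
    return remaining_blocks > TAIL_PAGES ? BIG_CHUNK : 1;
}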



Re: Parallel Seq Scan vs kernel read ahead

From: Soumyadeep Chakraborty
Date:
Hi Thomas,

Some more data points:

create table t_heap as select generate_series(1, 100000000) i;

Query: select count(*) from t_heap;
shared_buffers=32MB (so that I don't have to clear buffers, OS page
cache)
OS: FreeBSD 12.1 with UFS on GCP
4 vCPUs, 4GB RAM Intel Skylake
22G Google PersistentDisk
Time is measured with \timing on.

Without your patch:

max_parallel_workers_per_gather    Time(seconds)
                              0           33.88s
                              1           57.62s
                              2           62.01s
                              6          222.94s

With your patch:

max_parallel_workers_per_gather    Time(seconds)
                              0           29.04s
                              1           29.17s
                              2           28.78s
                              6          291.27s

I checked with explain analyze to ensure that the number of workers
planned = max_parallel_workers_per_gather

Apart from the last result (max_parallel_workers_per_gather=6), all
the other results seem favorable.
Could the last result be down to the fact that the number of workers
planned exceeded the number of vCPUs?

I also wanted to evaluate Zedstore with your patch.
I used the same setup as above.
No discernible difference though, maybe I'm missing something:

Without your patch:

max_parallel_workers_per_gather    Time(seconds)
                              0           25.86s
                              1           15.70s
                              2           12.60s
                              6           12.41s


With your patch:

max_parallel_workers_per_gather    Time(seconds)
                              0           26.96s
                              1           15.73s
                              2           12.46s
                              6           12.10s
--
Soumyadeep




Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Fri, May 22, 2020 at 1:14 PM Soumyadeep Chakraborty
<sochakraborty@pivotal.io> wrote:
> Some more data points:

Thanks!

> max_parallel_workers_per_gather    Time(seconds)
>                               0           29.04s
>                               1           29.17s
>                               2           28.78s
>                               6          291.27s
>
> I checked with explain analyze to ensure that the number of workers
> planned = max_parallel_workers_per_gather
>
> Apart from the last result (max_parallel_workers_per_gather=6), all
> the other results seem favorable.
> Could the last result be down to the fact that the number of workers
> planned exceeded the number of vCPUs?

Interesting.  I guess it has to do with patterns emerging from various
parameters like that magic number 64 I hard coded into the test patch,
and other unknowns in your storage stack.  I see a small drop off that
I can't explain yet, but not that.

> I also wanted to evaluate Zedstore with your patch.
> I used the same setup as above.
> No discernible difference though, maybe I'm missing something:

It doesn't look like it's using table_block_parallelscan_nextpage() as
a block allocator so it's not affected by the patch.  It has its own
thing zs_parallelscan_nextrange(), which does
pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids,
ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x100000.
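
The pattern being described, each participant claiming a fixed-size
range of the tid space with an atomic fetch-and-add, looks roughly like
the sketch below; C11 atomics stand in for pg_atomic_fetch_add_u64, and
the struct and function names are illustrative rather than Zedstore's
actual code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PARALLEL_CHUNK_SIZE UINT64_C(0x100000)  /* ~1M tids per claim */

typedef struct parallel_tid_scan
{
    _Atomic uint64_t allocated;     /* shared: next unclaimed tid */
    uint64_t last_tid;              /* highest tid in the relation */
} parallel_tid_scan;

/* Claim the next contiguous tid range; returns false when the scan is done. */
static bool
next_range(parallel_tid_scan *ps, uint64_t *start, uint64_t *end)
{
    uint64_t s = atomic_fetch_add(&ps->allocated, PARALLEL_CHUNK_SIZE);

    if (s > ps->last_tid)
        return false;
    *start = s;
    *end = (s + PARALLEL_CHUNK_SIZE - 1 < ps->last_tid)
            ? s + PARALLEL_CHUNK_SIZE - 1
            : ps->last_tid;
    return true;
}

Since the chunk is a fixed number of tids rather than heap blocks, the
block-allocator patch doesn't change this path.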



Re: Parallel Seq Scan vs kernel read ahead

From: Robert Haas
Date:
On Tue, May 19, 2020 at 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Good experiment.  IIRC, we have discussed a similar idea during the
> development of this feature but we haven't seen any better results by
> allocating in ranges on the systems we have tried.  So, we went with
> the current approach which is more granular and seems to allow better
> parallelism.  I feel we need to ensure that we don't regress
> parallelism in existing cases, otherwise, the idea sounds promising to
> me.

I think there's a significant difference. The idea I remember being
discussed at the time was to divide the relation into equal parts at
the very start and give one part to each worker. I think that carries
a lot of risk of some workers finishing much sooner than others. This
idea, AIUI, is to divide the relation into chunks that are small
compared to the size of the relation, but larger than 1 block. That
carries some risk of an unequal division of work, as has already been
noted, but it's much less, especially if we use smaller chunk sizes
once we get close to the end, as proposed here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

From: Robert Haas
Date:
On Thu, May 21, 2020 at 6:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Right, I think it's safe.  I think you were probably right that
> ramp-up isn't actually useful though, it's only the end of the scan
> that requires special treatment so we don't get unfair allocation as
> the work runs out, due to coarse grain.

The ramp-up seems like it might be useful if the query involves a LIMIT.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Sat, May 23, 2020 at 12:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Good experiment.  IIRC, we have discussed a similar idea during the
> > development of this feature but we haven't seen any better results by
> > allocating in ranges on the systems we have tried.  So, we went with
> > the current approach which is more granular and seems to allow better
> > parallelism.  I feel we need to ensure that we don't regress
> > parallelism in existing cases, otherwise, the idea sounds promising to
> > me.
>
> I think there's a significant difference. The idea I remember being
> discussed at the time was to divide the relation into equal parts at
> the very start and give one part to each worker.
>

I have checked the archives and found that we have done some testing
by allowing each worker to work on a block-by-block basis and by
having a fixed number of chunks for each worker.  See the results [1]
(the program used is attached in another email [2]).  The conclusion
was that we didn't find much difference with any of those approaches.
Now, the reason could be that we tested on a machine (I think it was
hydra (Power-7)) where the chunk-size doesn't matter, but I think it
can show some difference on the machines on which Thomas
and David are testing.  At that time there was also a discussion to
chunk on the basis of "each worker processes one 1GB-sized segment"
which Tom and Stephen seem to support [3].  I think an idea to divide
the relation into segments based on workers for a parallel scan has
been used by another database (DynamoDB) as well [4], so it is not
completely without merit.  I understand that larger sized chunks can
lead to unequal work distribution but they have their own advantages,
so we might want to get the best of both worlds, where in the
beginning we have larger sized chunks and then slowly reduce the
chunk-size towards the end of the scan.  I am not sure what is the
best thing to do here but maybe some experiments can shed light on
this mystery.


[1] - https://www.postgresql.org/message-id/CAA4eK1JHCmN2X1LjQ4bOmLApt%2BbtOuid5Vqqk5G6dDFV69iyHg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1JyVNEBE8KuxKd3bJhkG6tSbpBYX_%2BZtP34ZSTCSucA1A%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/30549.1422459647%40sss.pgh.pa.us
[4] - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Sat, 23 May 2020 at 06:31, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 21, 2020 at 6:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Right, I think it's safe.  I think you were probably right that
> > ramp-up isn't actually useful though, it's only the end of the scan
> > that requires special treatment so we don't get unfair allocation as
> > the work runs out, due to coarse grain.
>
> The ramp-up seems like it might be useful if the query involves a LIMIT.

That's true, but I think the intelligence there would need to go
beyond, "if there's a LIMIT clause, do ramp-up", as we might have
already fully ramped up well before the LIMIT is reached.

David



Re: Parallel Seq Scan vs kernel read ahead

From: Soumyadeep Chakraborty
Date:
> It doesn't look like it's using table_block_parallelscan_nextpage() as
> a block allocator so it's not affected by the patch.  It has its own
> thing zs_parallelscan_nextrange(), which does
> pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids,
> ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x100000.

My apologies, I was too hasty. Indeed, you are correct. Zedstore's
unit of work is chunks of the logical zstid space. There is a
correlation between the zstid and blocks: zstids near each other are
likely to lie in the same block or in neighboring blocks. It would be
interesting to try something like this patch for Zedstore.

Regards,
Soumyadeep



Re: Parallel Seq Scan vs kernel read ahead

From: Soumyadeep Chakraborty
Date:
On Sat, May 23, 2020 at 12:00 AM Robert Haas
<robertmhaas(at)gmail(dot)com> wrote:
> I think there's a significant difference. The idea I remember being
> discussed at the time was to divide the relation into equal parts at
> the very start and give one part to each worker. I think that carries
> a lot of risk of some workers finishing much sooner than others.

Was the idea of work-stealing considered? Here is what I have been
thinking about:

Each worker would be assigned a contiguous chunk of blocks at init time.
Then if a worker is finished with its work, it can inspect other
workers' remaining work and "steal" some of the blocks from the end of
the victim worker's allocation.

Considerations for such a scheme:

1. Victim selection: Who will be the victim worker? It can be selected at
random if nothing else.

2. How many blocks to steal? Stealing half of the victim's remaining
blocks seems to be fair.

3. Stealing threshold: We should disallow stealing if the amount of
remaining work is not enough in the victim worker.

4. Additional parallel state: Representing the chunk of "work". I guess
one variable for the current block and one for the last block in the
chunk allocated. The latter would have to be protected with atomic
fetches as it would be decremented by the stealing worker.

5. Point 4 implies that there might be more atomic fetch operations as
compared to this patch. Idk if that is a lesser evil than the workers
being idle..probably not? A way to salvage that a little would be to
forego atomic fetches when the amount of work remaining is less than the
threshold discussed in 3 as there is no possibility of work stealing then.
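
To make the shape of that a bit more concrete, here is a minimal sketch
of points 2-5 (victim selection and the outer scan loop are left out).
Everything here is hypothetical: the names are invented, C11 atomics are
used for illustration, and instead of plain atomic decrements on the last
block it packs both ends of a worker's range into a single word updated
with compare-and-swap, which is one way to keep the owner and a thief
from tearing each other's updates:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define STEAL_THRESHOLD 16      /* point 3: don't steal tiny remainders */

/* A worker's range: high 32 bits = next block to scan, low 32 bits = last. */
typedef struct worker_range
{
    _Atomic uint64_t range;
} worker_range;

#define PACK(next, last)  (((uint64_t) (next) << 32) | (uint32_t) (last))
#define NEXT(r)           ((uint32_t) ((r) >> 32))
#define LAST(r)           ((uint32_t) (r))

/* Owner: claim the next block from our own range; false when it is empty. */
static bool
take_next(worker_range *wr, uint32_t *block)
{
    uint64_t old = atomic_load(&wr->range);

    while (NEXT(old) <= LAST(old))
    {
        uint64_t new = PACK(NEXT(old) + 1, LAST(old));

        if (atomic_compare_exchange_weak(&wr->range, &old, new))
        {
            *block = NEXT(old);
            return true;
        }
        /* a failed CAS reloads 'old'; loop and retry */
    }
    return false;
}

/* Thief: steal the upper half of a victim's remaining range (point 2). */
static bool
steal_half(worker_range *victim, uint32_t *first, uint32_t *last)
{
    uint64_t old = atomic_load(&victim->range);

    for (;;)
    {
        uint32_t next = NEXT(old);
        uint32_t end = LAST(old);
        uint32_t mid;

        if (next > end || end - next + 1 < STEAL_THRESHOLD)
            return false;       /* not enough work left to be worth it */

        mid = next + (end - next) / 2;
        if (atomic_compare_exchange_weak(&victim->range, &old, PACK(next, mid)))
        {
            *first = mid + 1;   /* thief takes [mid + 1, end] */
            *last = end;
            return true;
        }
    }
}

A worker whose own range runs dry would pick a victim (point 1) and call
steal_half(); note that the owner now pays a compare-and-swap per block,
which is exactly the extra atomic traffic point 5 worries about.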


Regards,

Soumyadeep



Re: Parallel Seq Scan vs kernel read ahead

From: Soumyadeep Chakraborty
Date:
On Wed, Jun 3, 2020 at 3:18 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
> Idk if that is a lesser evil than the workers
> being idle..probably not?

Apologies, I meant that the extra atomic fetches are probably a lesser
evil than the workers being idle.

Soumyadeep



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Thu, 21 May 2020 at 17:06, David Rowley <dgrowleyml@gmail.com> wrote:
> create table t (a int, b text);
> -- create a table of 100GB in size.
> insert into t select x,md5(x::text) from
> generate_series(1,1000000*1572.7381809)x; -- took 1 hr 18 mins
> vacuum freeze t;
>
> query = select count(*) from t;
> Disk = Samsung SSD 850 EVO mSATA 1TB.
>
> Master:
> workers = 0 : Time: 269104.281 ms (04:29.104)  380MB/s
> workers = 1 : Time: 741183.646 ms (12:21.184)  138MB/s
> workers = 2 : Time: 656963.754 ms (10:56.964)  155MB/s
>
> Patched:
>
> workers = 0 : Should be the same as before as the code for this didn't change.
> workers = 1 : Time: 300299.364 ms (05:00.299) 340MB/s
> workers = 2 : Time: 270213.726 ms (04:30.214) 379MB/s
>
> (A better query would likely have been just: SELECT * FROM t WHERE a =
> 1; but I'd run the test by the time I thought of that.)
>
> So, this shows that Windows, at least 8.1, does suffer from this too.

I repeated this test on an up-to-date Windows 10 machine to see if the
later kernel is any better at the readahead.

Results for the same test are:

Master:

max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481)
(706.2MB/sec)
max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556)
(320.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056)
(318.6MB/sec)

Patched:

max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364)
(741.7MB/sec)
max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982)
(723.2MB/sec)
max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356)
(731.4MB/sec)

David



Re: Parallel Seq Scan vs kernel read ahead

From: Thomas Munro
Date:
On Wed, Jun 10, 2020 at 5:06 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I repeated this test on an up-to-date Windows 10 machine to see if the
> later kernel is any better at the readahead.
>
> Results for the same test are:
>
> Master:
>
> max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481)
> (706.2MB/sec)
> max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556)
> (320.1MB/sec)
> max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056)
> (318.6MB/sec)
>
> Patched:
>
> max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364)
> (741.7MB/sec)
> max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982)
> (723.2MB/sec)
> max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356)
> (731.4MB/sec)

Thanks!

I also heard from Andres that he likes this patch with his AIO
prototype, because of the way request merging works.  So it seems like
there are several reasons to want it.

But ... where should we get the maximum step size from?  A GUC?



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Wed, 10 Jun 2020 at 17:21, Thomas Munro <thomas.munro@gmail.com> wrote:
> I also heard from Andres that he likes this patch with his AIO
> prototype, because of the way request merging works.  So it seems like
> there are several reasons to want it.
>
> But ... where should we get the maximum step size from?  A GUC?

I guess we'd need to determine if other step sizes were better under
any conditions.  I guess one condition would be if there was a LIMIT
clause. I could check if setting it to 1024 makes any difference, but
I'm thinking it won't since I got fairly consistent results on all
worker settings with the patched version.

David



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Wed, 10 Jun 2020 at 17:39, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 10 Jun 2020 at 17:21, Thomas Munro <thomas.munro@gmail.com> wrote:
> > I also heard from Andres that he likes this patch with his AIO
> > prototype, because of the way request merging works.  So it seems like
> > there are several reasons to want it.
> >
> > But ... where should we get the maximum step size from?  A GUC?
>
> I guess we'd need to determine if other step sizes were better under
> any conditions.  I guess one condition would be if there was a LIMIT
> clause. I could check if setting it to 1024 makes any difference, but
> I'm thinking it won't since I got fairly consistent results on all
> worker settings with the patched version.

I did another round of testing on the same machine trying some step
sizes larger than 64 blocks. I can confirm that it does improve the
situation further going bigger than 64.

I got up as far as 16384, but made a couple of additional changes for
that run only. Instead of increasing the ramp-up 1 block at a time, I
initialised phsw_step_size to 1 and multiplied it by 2 until I reached
the chosen step size. With numbers that big, ramping up 1 block at a
time was slow enough that I'd never have reached the target step size.

Here are the results of the testing:

Master:

max_parallel_workers_per_gather = 0: Time: 148481.244 ms (02:28.481)
(706.2MB/sec)
max_parallel_workers_per_gather = 1: Time: 327556.121 ms (05:27.556)
(320.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 329055.530 ms (05:29.056)
(318.6MB/sec)

Patched stepsize = 64:

max_parallel_workers_per_gather = 0: Time: 141363.991 ms (02:21.364)
(741.7MB/sec)
max_parallel_workers_per_gather = 1: Time: 144982.202 ms (02:24.982)
(723.2MB/sec)
max_parallel_workers_per_gather = 2: Time: 143355.656 ms (02:23.356)
(731.4MB/sec)

Patched stepsize = 1024:

max_parallel_workers_per_gather = 0: Time: 152599.159 ms (02:32.599)
(687.1MB/sec)
max_parallel_workers_per_gather = 1: Time: 104227.232 ms (01:44.227)
(1006.04MB/sec)
max_parallel_workers_per_gather = 2: Time: 97149.343 ms (01:37.149)
(1079.3MB/sec)

Patched stepsize = 8192:

max_parallel_workers_per_gather = 0: Time: 143524.038 ms (02:23.524)
(730.59MB/sec)
max_parallel_workers_per_gather = 1: Time: 102899.288 ms (01:42.899)
(1019.0MB/sec)
max_parallel_workers_per_gather = 2: Time: 91148.340 ms (01:31.148)
(1150.4MB/sec)

Patched stepsize = 16384 (power 2 ramp-up)

max_parallel_workers_per_gather = 0: Time: 144598.502 ms (02:24.599)
(725.16MB/sec)
max_parallel_workers_per_gather = 1: Time: 97344.160 ms (01:37.344)
(1077.1MB/sec)
max_parallel_workers_per_gather = 2: Time: 88025.420 ms (01:28.025)
(1191.2MB/sec)

I thought about what you mentioned about a GUC, and I think it's a bad
idea to do that. I think it would be better to choose based on the
relation size. For smaller relations, we want to keep the step size
small. Someone may enable parallel query on such a small relation if
they're doing something like calling an expensive function on the
results, so we do need to avoid going large for small relations.

I considered something like:

create function nextpower2(a bigint) returns bigint as $$
declare
  n bigint := 1;
begin
  while n < a loop
    n := n * 2;
  end loop;
  return n;
end;
$$ language plpgsql;

select pg_size_pretty(power(2,p)::numeric * 8192) rel_size,
       nextpower2(power(2,p)::bigint / 1024) as stepsize
from generate_series(1,32) p;
 rel_size | stepsize
----------+----------
 16 kB    |        1
 32 kB    |        1
 64 kB    |        1
 128 kB   |        1
 256 kB   |        1
 512 kB   |        1
 1024 kB  |        1
 2048 kB  |        1
 4096 kB  |        1
 8192 kB  |        1
 16 MB    |        2
 32 MB    |        4
 64 MB    |        8
 128 MB   |       16
 256 MB   |       32
 512 MB   |       64
 1024 MB  |      128
 2048 MB  |      256
 4096 MB  |      512
 8192 MB  |     1024
 16 GB    |     2048
 32 GB    |     4096
 64 GB    |     8192
 128 GB   |    16384
 256 GB   |    32768
 512 GB   |    65536
 1024 GB  |   131072
 2048 GB  |   262144
 4096 GB  |   524288
 8192 GB  |  1048576
 16 TB    |  2097152
 32 TB    |  4194304

So with that algorithm, with this 100GB table that I've been using in
my test, we'd go with a step size of 16384. I think we'd want to avoid
going any more than that. The above code means we'll do between just
below 0.1% and 0.2% of the relation per step. If I divided the number
of blocks by say 128 instead of 1024, then that would be about 0.78%
and 1.56% of the relation each time. It's not unrealistic today that
someone might throw that many workers at a job, so I'd say dividing
by 1024 or even 2048 would likely be about right.
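
The same heuristic as a hedged C sketch; the helper mirrors the plpgsql
nextpower2() above (PostgreSQL has pg_nextpower2_32() for this), and the
1024 divisor is just the value under discussion:

#include <stdint.h>

/* Round v up to the next power of two; 0 and 1 both give 1. */
static uint32_t
nextpower2(uint32_t v)
{
    uint32_t n = 1;

    while (n < v)
        n <<= 1;
    return n;
}

/*
 * Step size proportional to relation size: roughly rel_nblocks / 1024,
 * rounded up to a power of two, and never less than one block.
 */
static uint32_t
parallel_chunk_size(uint32_t rel_nblocks)
{
    return nextpower2(rel_nblocks / 1024);
}

For the 100GB table (about 13.1 million 8kB blocks), rel_nblocks / 1024
is 12800, which rounds up to the 16384 step mentioned above; the
Min(1024, ...) form discussed later in the thread just adds a cap on top
of this.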

David



Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Wed, Jun 10, 2020 at 6:04 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 10 Jun 2020 at 17:39, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Wed, 10 Jun 2020 at 17:21, Thomas Munro <thomas.munro@gmail.com> wrote:
> > > I also heard from Andres that he likes this patch with his AIO
> > > prototype, because of the way request merging works.  So it seems like
> > > there are several reasons to want it.
> > >
> > > But ... where should we get the maximum step size from?  A GUC?
> >
> > I guess we'd need to determine if other step sizes were better under
> > any conditions.  I guess one condition would be if there was a LIMIT
> > clause. I could check if setting it to 1024 makes any difference, but
> > I'm thinking it won't since I got fairly consistent results on all
> > worker settings with the patched version.
>
> I did another round of testing on the same machine trying some step
> sizes larger than 64 blocks. I can confirm that it does improve the
> situation further going bigger than 64.
>

Can we try the same test with 4, 8, 16 workers as well?  I don't
foresee any problem with a higher number of workers but it might be
better to once check that if it is not too much additional work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Thu, 11 Jun 2020 at 01:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Can we try the same test with 4, 8, 16 workers as well?  I don't
> foresee any problem with a higher number of workers but it might be
> better to once check that if it is not too much additional work.

I ran the tests again with up to 7 workers. The CPU here only has 8
cores (a laptop), so I'm not sure if there's much sense in going
higher than that?

CPU = Intel i7-8565U. 16GB RAM.

Note that I did the power2 ramp-up with each of the patched tests this
time. Thomas' version ramps up 1 page at a time, which is ok when
only ramping up to 64 pages, but not for these higher numbers I'm
testing with. (Patch attached)

Results attached in a graph format, or in text below:

Master:

workers=0: Time: 141175.935 ms (02:21.176) (742.7MB/sec)
workers=1: Time: 316854.538 ms (05:16.855) (330.9MB/sec)
workers=2: Time: 323471.791 ms (05:23.472) (324.2MB/sec)
workers=3: Time: 321637.945 ms (05:21.638) (326MB/sec)
workers=4: Time: 308689.599 ms (05:08.690) (339.7MB/sec)
workers=5: Time: 289014.709 ms (04:49.015) (362.8MB/sec)
workers=6: Time: 267785.27 ms (04:27.785) (391.6MB/sec)
workers=7: Time: 248735.817 ms (04:08.736) (421.6MB/sec)

Patched 64: (power 2 ramp-up)

workers=0: Time: 152752.558 ms (02:32.753) (686.5MB/sec)
workers=1: Time: 149940.841 ms (02:29.941) (699.3MB/sec)
workers=2: Time: 136534.043 ms (02:16.534) (768MB/sec)
workers=3: Time: 119387.248 ms (01:59.387) (878.3MB/sec)
workers=4: Time: 114080.131 ms (01:54.080) (919.2MB/sec)
workers=5: Time: 111472.144 ms (01:51.472) (940.7MB/sec)
workers=6: Time: 108290.608 ms (01:48.291) (968.3MB/sec)
workers=7: Time: 104349.947 ms (01:44.350) (1004.9MB/sec)

Patched 1024: (power 2 ramp-up)

workers=0: Time: 146106.086 ms (02:26.106) (717.7MB/sec)
workers=1: Time: 109832.773 ms (01:49.833) (954.7MB/sec)
workers=2: Time: 98921.515 ms (01:38.922) (1060MB/sec)
workers=3: Time: 94259.243 ms (01:34.259) (1112.4MB/sec)
workers=4: Time: 93275.637 ms (01:33.276) (1124.2MB/sec)
workers=5: Time: 93921.452 ms (01:33.921) (1116.4MB/sec)
workers=6: Time: 93988.386 ms (01:33.988) (1115.6MB/sec)
workers=7: Time: 92096.414 ms (01:32.096) (1138.6MB/sec)

Patched 8192: (power 2 ramp-up)

workers=0: Time: 143367.057 ms (02:23.367) (731.4MB/sec)
workers=1: Time: 103138.918 ms (01:43.139) (1016.7MB/sec)
workers=2: Time: 93368.573 ms (01:33.369) (1123.1MB/sec)
workers=3: Time: 89464.529 ms (01:29.465) (1172.1MB/sec)
workers=4: Time: 89921.393 ms (01:29.921) (1166.1MB/sec)
workers=5: Time: 93575.401 ms (01:33.575) (1120.6MB/sec)
workers=6: Time: 93636.584 ms (01:33.637) (1119.8MB/sec)
workers=7: Time: 93682.21 ms (01:33.682) (1119.3MB/sec)

Patched 16384 (power 2 ramp-up)

workers=0: Time: 144598.502 ms (02:24.599) (725.2MB/sec)
workers=1: Time: 97344.16 ms (01:37.344) (1077.2MB/sec)
workers=2: Time: 88025.42 ms (01:28.025) (1191.2MB/sec)
workers=3: Time: 97711.521 ms (01:37.712) (1073.1MB/sec)
workers=4: Time: 88877.913 ms (01:28.878) (1179.8MB/sec)
workers=5: Time: 96985.978 ms (01:36.986) (1081.2MB/sec)
workers=6: Time: 92368.543 ms (01:32.369) (1135.2MB/sec)
workers=7: Time: 87498.156 ms (01:27.498) (1198.4MB/sec)

David

Attachments

Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Thu, Jun 11, 2020 at 7:18 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 11 Jun 2020 at 01:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Can we try the same test with 4, 8, 16 workers as well?  I don't
> > foresee any problem with a higher number of workers but it might be
> > better to once check that if it is not too much additional work.
>
> I ran the tests again with up to 7 workers. The CPU here only has 8
> cores (a laptop), so I'm not sure if there's much sense in going
> higher than that?
>

I think it proves your point that there is a value in going for step
size greater than 64.  However, I think the difference at higher sizes
is not significant.  For example, the difference between 8192 and
16384 doesn't seem much if we leave aside the higher worker counts,
where the data could be a bit misleading due to variation.  I could see
that there is a clear and significant difference up to 1024, but after
that the difference is not much.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Thu, 11 Jun 2020 at 14:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 11, 2020 at 7:18 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Thu, 11 Jun 2020 at 01:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Can we try the same test with 4, 8, 16 workers as well?  I don't
> > > foresee any problem with a higher number of workers but it might be
> > > better to once check that if it is not too much additional work.
> >
> > I ran the tests again with up to 7 workers. The CPU here only has 8
> > cores (a laptop), so I'm not sure if there's much sense in going
> > higher than that?
> >
>
> I think it proves your point that there is a value in going for step
> size greater than 64.  However, I think the difference at higher sizes
> is not significant.  For example, the difference between 8192 and
> 16384 doesn't seem much if we leave aside the higher worker counts,
> where the data could be a bit misleading due to variation.  I could see
> that there is a clear and significant difference up to 1024, but after
> that the difference is not much.

I guess the danger with going too big is that we have some Seqscan
filter that causes some workers to do very little to nothing with the
rows besides discarding them, while other workers are left with rows
that are not filtered and require some expensive processing.  Keeping
the number of blocks on the smaller side would reduce the chances of
someone being hit by that.   The algorithm I proposed above still can
be capped by doing something like nblocks = Min(1024,
pg_nextpower2_32(pbscan->phs_nblocks / 1024));  That way we'll end up
with:


 rel_size | stepsize
----------+----------
 16 kB    |        1
 32 kB    |        1
 64 kB    |        1
 128 kB   |        1
 256 kB   |        1
 512 kB   |        1
 1024 kB  |        1
 2048 kB  |        1
 4096 kB  |        1
 8192 kB  |        1
 16 MB    |        2
 32 MB    |        4
 64 MB    |        8
 128 MB   |       16
 256 MB   |       32
 512 MB   |       64
 1024 MB  |      128
 2048 MB  |      256
 4096 MB  |      512
 8192 MB  |     1024
 16 GB    |     1024
 32 GB    |     1024
 64 GB    |     1024
 128 GB   |     1024
 256 GB   |     1024
 512 GB   |     1024
 1024 GB  |     1024
 2048 GB  |     1024
 4096 GB  |     1024
 8192 GB  |     1024
 16 TB    |     1024
 32 TB    |     1024
(32 rows)

David



Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Thu, Jun 11, 2020 at 8:35 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 11 Jun 2020 at 14:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jun 11, 2020 at 7:18 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Thu, 11 Jun 2020 at 01:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > Can we try the same test with 4, 8, 16 workers as well?  I don't
> > > > foresee any problem with a higher number of workers but it might be
> > > > better to once check that if it is not too much additional work.
> > >
> > > I ran the tests again with up to 7 workers. The CPU here only has 8
> > > cores (a laptop), so I'm not sure if there's much sense in going
> > > higher than that?
> > >
> >
> > I think it proves your point that there is a value in going for step
> > size greater than 64.  However, I think the difference at higher sizes
> > is not significant.  For example, the difference between 8192 and
> > 16384 doesn't seem much if we leave aside the higher worker counts,
> > where the data could be a bit misleading due to variation.  I could see
> > that there is a clear and significant difference up to 1024, but after
> > that the difference is not much.
>
> I guess the danger with going too big is that we have some Seqscan
> filter that causes some workers to do very little to nothing with the
> rows besides discarding them, while other workers are left with rows
> that are not filtered and require some expensive processing.  Keeping
> the number of blocks on the smaller side would reduce the chances of
> someone being hit by that.
>

Right and good point.

>   The algorithm I proposed above still can
> be capped by doing something like nblocks = Min(1024,
> pg_nextpower2_32(pbscan->phs_nblocks / 1024));  That way we'll end up
> with:
>

I think something on these lines would be a good idea especially
keeping step-size proportional to relation size.  However, I am not
completely sure if doubling the step-size with equal increase in
relation size (ex. what is happening between 16MB~8192MB) is the best
idea.  Why not double the step-size when relation size increases by
four times?  Will some more tests help us to identify this?  I also
don't know what is the right answer here so just trying to brainstorm.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Thu, 11 Jun 2020 at 16:03, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think something on these lines would be a good idea especially
> keeping step-size proportional to relation size.  However, I am not
> completely sure if doubling the step-size with equal increase in
> relation size (ex. what is happening between 16MB~8192MB) is the best
> idea.  Why not double the step-size when relation size increases by
> four times?  Will some more tests help us to identify this?  I also
> don't know what is the right answer here so just trying to brainstorm.

Brainstorming sounds good. I'm by no means under any illusion that the
formula is correct.

But, why four times?  The way I did it tries to keep the number of
chunks roughly the same each time. I think the key is the number of
chunks more than the size of the chunks. Having fewer chunks increases
the chances of an imbalance of work between workers, and with what you
mention, the number of chunks will vary more than what I have proposed.

The code I showed above will produce something between 512-1024 chunks
for all cases until we reach 2^20 pages, then we start capping the chunk
size to 1024.  I could probably get onboard with making it depend on
the number of parallel workers, but perhaps it would be better just to
divide by, say, 16384 rather than 1024, as I proposed above. That way
we'll be more fine-grained, but we'll still read in larger than 1024
chunk sizes when the relation gets beyond 128GB.

David



Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Thu, Jun 11, 2020 at 10:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 11 Jun 2020 at 16:03, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think something on these lines would be a good idea especially
> > keeping step-size proportional to relation size.  However, I am not
> > completely sure if doubling the step-size with equal increase in
> > relation size (ex. what is happening between 16MB~8192MB) is the best
> > idea.  Why not double the step-size when relation size increases by
> > four times?  Will some more tests help us to identify this?  I also
> > don't know what is the right answer here so just trying to brainstorm.
>
> Brainstorming sounds good. I'm by no means under any illusion that the
> formula is correct.
>
> But, why four times?
>

Just trying to see if we can optimize such that we use bigger
step-size for bigger relations and smaller step-size for smaller
relations.

>  The way I did it tries to keep the number of
> chunks roughly the same each time. I think the key is the number of
> chunks more than the size of the chunks. Having fewer chunks increases
> the chances of an imbalance of work between workers, and with what you
> mention, the number of chunks will vary more than what I have proposed
>

But I think it will lead to a larger number of chunks for smaller relations.

> The code I showed above will produce something between 512-1024 chunks
> for all cases until we reach 2^20 pages, then we start capping the chunk
> size to 1024.  I could probably get onboard with making it depend on
> the number of parallel workers, but perhaps it would be better just to
> divide by, say, 16384 rather than 1024, as I proposed above. That way
> we'll be more fine-grained, but we'll still read in larger than 1024
> chunk sizes when the relation gets beyond 128GB.
>

I think increasing step-size might be okay for very large relations.

Another point I am thinking is that whatever formula we come up with here
might not be a good fit for every case.  For ex. as you mentioned
above that larger step-size can impact the performance based on
qualification, similarly there could be other things like having a
target list or qual having some function which takes more time for
certain tuples and lesser for others especially if function evaluation
is based on some column values.  So, can we think of providing a
rel_option for step-size?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley
Date:
On Thu, 11 Jun 2020 at 23:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Another point I am thinking is that whatever formula we come up with here
> might not be a good fit for every case.  For ex. as you mentioned
> above that larger step-size can impact the performance based on
> qualification, similarly there could be other things like having a
> target list or qual having some function which takes more time for
> certain tuples and lesser for others especially if function evaluation
> is based on some column values.  So, can we think of providing a
> rel_option for step-size?

I think someone at some point is not going to like the automatic
choice. So perhaps a reloption to allow users to overwrite it is a
good idea. -1 should most likely mean use the automatic choice based
on relation size.  I think for parallel seq scans that filter a large
portion of the records most likely need some sort of index, but there
are perhaps some genuine cases for not having one. e.g perhaps the
query is just not run often enough for an index to be worthwhile. In
that case, the performance is likely less critical, but at least the
reloption would allow users to get the old behaviour.

David



Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Fri, Jun 12, 2020 at 2:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 11 Jun 2020 at 23:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Another point I am thinking is that whatever formula we come up with here
> > might not be a good fit for every case.  For ex. as you mentioned
> > above that larger step-size can impact the performance based on
> > qualification, similarly there could be other things like having a
> > target list or qual having some function which takes more time for
> > certain tuples and lesser for others especially if function evaluation
> > is based on some column values.  So, can we think of providing a
> > rel_option for step-size?
>
> I think someone at some point is not going to like the automatic
> choice. So perhaps a reloption to allow users to overwrite it is a
> good idea. -1 should most likely mean use the automatic choice based
> on relation size.  I think for parallel seq scans that filter a large
> portion of the records most likely need some sort of index, but there
> are perhaps some genuine cases for not having one. e.g perhaps the
> query is just not run often enough for an index to be worthwhile. In
> that case, the performance is likely less critical, but at least the
> reloption would allow users to get the old behaviour.
>

makes sense to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

From: Robert Haas
Date:
On Thu, Jun 11, 2020 at 4:54 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I think someone at some point is not going to like the automatic
> choice. So perhaps a reloption to allow users to overwrite it is a
> good idea. -1 should most likely mean use the automatic choice based
> on relation size.  I think for parallel seq scans that filter a large
> portion of the records most likely need some sort of index, but there
> are perhaps some genuine cases for not having one. e.g perhaps the
> query is just not run often enough for an index to be worthwhile. In
> that case, the performance is likely less critical, but at least the
> reloption would allow users to get the old behaviour.

Let me play the devil's advocate here. I feel like if the step size is
limited by the relation size and there is ramp-up and ramp-down, or
maybe even if you don't have all 3 of those but perhaps say 2 of them,
the chances of there being a significant downside from using this seem
quite small. At that point I wonder whether you really need an option.
It's true that someone might not like it, but there are all sorts of
things that at least one person doesn't like and one can't cater to
all of them.

To put that another way, in what scenario do we suppose that a
reasonable person would wish to use this reloption? If there's none,
we don't need it. If there is one, can we develop a mitigation that
solves their problem automatically instead?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

From: Amit Kapila
Date:
On Fri, Jun 12, 2020 at 11:28 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 11, 2020 at 4:54 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > I think someone at some point is not going to like the automatic
> > choice. So perhaps a reloption to allow users to overwrite it is a
> > good idea. -1 should most likely mean use the automatic choice based
> > on relation size.  I think for parallel seq scans that filter a large
> > portion of the records most likely need some sort of index, but there
> > are perhaps some genuine cases for not having one. e.g perhaps the
> > query is just not run often enough for an index to be worthwhile. In
> > that case, the performance is likely less critical, but at least the
> > reloption would allow users to get the old behaviour.
>
> Let me play the devil's advocate here. I feel like if the step size is
> limited by the relation size and there is ramp-up and ramp-down, or
> maybe even if you don't have all 3 of those but perhaps say 2 of them,
> the chances of there being a significant downside from using this seem
> quite small. At that point I wonder whether you really need an option.
> It's true that someone might not like it, but there are all sorts of
> things that at least one person doesn't like and one can't cater to
> all of them.
>
> To put that another way, in what scenario do we suppose that a
> reasonable person would wish to use this reloption?
>

The performance can vary based on the qualification, where some workers
discard more rows than others.  With the current system, where the
step size is one, the probability of unequal work among workers is
quite low compared to larger step sizes.
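
To put a rough number on that, here is a small standalone simulation
(illustrative only, not from the patch; the 4 workers, the 1024-block
table and the 50x-expensive tail are arbitrary assumptions).  It hands
each chunk to whichever worker is currently least loaded, which roughly
mimics workers grabbing the next chunk as soon as they become free, and
compares the finishing time for a step size of 1 block with a step size
of 64 blocks when all the expensive rows sit at the end of the table:

#include <stdio.h>

#define NBLOCKS  1024
#define NWORKERS 4

/* cost model: the last 64 blocks are 50x as expensive as the rest */
static int
block_cost(int blkno)
{
    return (blkno >= NBLOCKS - 64) ? 50 : 1;
}

/*
 * Hand out chunks of "step" consecutive blocks, always to the least
 * loaded worker.  Returns the busiest worker's total cost, i.e.
 * roughly when the whole scan finishes.
 */
static int
simulate(int step)
{
    int         totals[NWORKERS] = {0};
    int         blkno = 0;
    int         makespan = 0;

    while (blkno < NBLOCKS)
    {
        int         w = 0;
        int         chunk_cost = 0;

        for (int i = 1; i < NWORKERS; i++)
            if (totals[i] < totals[w])
                w = i;

        for (int i = 0; i < step && blkno < NBLOCKS; i++, blkno++)
            chunk_cost += block_cost(blkno);

        totals[w] += chunk_cost;
    }

    for (int i = 0; i < NWORKERS; i++)
        if (totals[i] > makespan)
            makespan = totals[i];
    return makespan;
}

int
main(void)
{
    printf("step size 1:  finish time %d\n", simulate(1));
    printf("step size 64: finish time %d\n", simulate(64));
    return 0;
}

With these made-up numbers the step-size-1 run finishes at close to the
ideal time, while the 64-block run leaves one worker holding most of the
expensive tail on its own.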

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The performance can vary based on the qualification, where some workers
> discard more rows than others.  With the current system, where the
> step size is one, the probability of unequal work among workers is
> quite low compared to larger step sizes.

It seems like this would require incredibly bad luck, though. If the
step size is less than 1/1024 of the relation size, and we ramp down
for, say, the last 5% of the relation, then the worst case is that
chunk 972 of 1024 is super-slow compared to all the other chunks, so
that it takes longer to process chunk 972 alone than it does to process
chunks 973-1024 combined. It is not impossible, but that chunk has to
be like 50x worse than all the others, which doesn't seem like
something that is going to happen often enough to be worth worrying
about very much. I'm not saying it will never happen. I'm just
skeptical about the benefit of adding a GUC or reloption for a corner
case like this. I think people will fiddle with it when it isn't
really needed, and won't realize it exists in the scenarios where it
would have helped. And then, because we have the setting, we'll have
to keep it around forever, even as we improve the algorithm in other
ways, which could become a maintenance burden. I think it's better to
treat stuff like this as an implementation detail rather than
something we expect users to adjust.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Tue, 16 Jun 2020 at 03:29, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The performance can vary based on the qualification, where some workers
> > discard more rows than others.  With the current system, where the
> > step size is one, the probability of unequal work among workers is
> > quite low compared to larger step sizes.
>
> It seems like this would require incredibly bad luck, though. If the
> step size is less than 1/1024 of the relation size, and we ramp down
> for, say, the last 5% of the relation, then the worst case is that
> chunk 972 of 1024 is super-slow compared to all the other chunks, so
> that it takes longer to process chunk 972 alone than it does to process
> chunks 973-1024 combined. It is not impossible, but that chunk has to
> be like 50x worse than all the others, which doesn't seem like
> something that is going to happen often enough to be worth worrying
> about very much. I'm not saying it will never happen. I'm just
> skeptical about the benefit of adding a GUC or reloption for a corner
> case like this. I think people will fiddle with it when it isn't
> really needed, and won't realize it exists in the scenarios where it
> would have helped.

I'm trying to think of likely scenarios where "lots of work at the
end" is going to be common.  I can think of queue processing, but
everything I can think of there requires an UPDATE to the processed
flag, which won't be using parallel query anyway. There's then
processing something based on some period of time like "the last
hour", "today". For append-only tables the latest information is
likely to be at the end of the heap.  For that, anyone that's getting
a SeqScan on a large relation should likely have added an index. If a
btree is too costly, then BRIN is pretty perfect for that case.

FWIW, I'm not really keen on adding a reloption or a GUC.  I've also
voiced here that I'm not even keen on the ramp-up.

To summarise what's all been proposed so far:

1. Use a constant (e.g. 64) as the parallel step size.
2. Ramp up the step size over time.
3. Ramp down the step size towards the end of the scan.
4. Auto-determine a good step size based on the size of the relation.
5. Add a GUC to allow users to control or override the step size.
6. Add a reloption to allow users to control or override the step size.


Here are my thoughts on each of those:

#1 is a bad idea as there are legitimate use-cases for using parallel
query on small tables. e.g calling some expensive parallel safe
function. Small tables are more likely to be cached.
#2 I don't quite understand why this is useful
#3 I understand this is to try to make it so workers all complete
around about the same time.
#4 We really should be doing it this way.
#5 Having a global knob to control something that is very specific to
the size of a relation does not make much sense to me.
#6. I imagine someone will have some weird use-case that works better
when parallel workers get 1 page at a time. I'm not convinced that
they're not doing something else wrong.

So my vote is for 4 with possibly 3, if we can come up with something
smart enough * that works well in parallel.  I think there's less of a
need for this if we divided the relation into more chunks, e.g. 8192
or 16384.

* Perhaps when there are less than 2 full chunks remaining, workers
can just take half of what is left. Or more specifically
Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
out allocating an amount of pages proportional to the amount of beer
each mathematician receives in the "An infinite number of
mathematicians walk into a bar" joke, obviously with the exception
that we stop dividing when we get to 1. However, I'm not quite sure
how well that can be made to work with multiple bartenders working in
parallel.
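
For illustration, a rough standalone simulation of that halving rule
(not from any patch; next_power_of_2() below is just a stand-in for
pg_nextpower2_32(), and the 1000-block relation and 64-block normal
chunk size are arbitrary):

#include <stdio.h>

/* stand-in for pg_nextpower2_32() */
static unsigned
next_power_of_2(unsigned n)
{
    unsigned    p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

int
main(void)
{
    unsigned    remaining = 1000;   /* blocks still unallocated */
    unsigned    chunk_size = 64;    /* the normal allocation size */

    while (remaining > 0)
    {
        unsigned    alloc;

        if (remaining >= 2 * chunk_size)
            alloc = chunk_size;     /* business as usual */
        else
        {
            /* fewer than 2 full chunks left: take about half of what remains */
            alloc = next_power_of_2(remaining) / 2;
            if (alloc < 1)
                alloc = 1;
        }
        printf("remaining %4u -> allocate %u\n", remaining, alloc);
        remaining -= alloc;
    }
    return 0;
}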

David



Re: Parallel Seq Scan vs kernel read ahead

От
Amit Kapila
Дата:
On Mon, Jun 15, 2020 at 8:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The performance can vary based on the qualification, where some workers
> > discard more rows than others.  With the current system, where the
> > step size is one, the probability of unequal work among workers is
> > quite low compared to larger step sizes.
>
> It seems like this would require incredibly bad luck, though.
>

I agree that won't be a common scenario, but apart from that, I am also
not sure we can conclude that the proposed patch won't cause any
regressions.  See one of the tests [1] done by Soumyadeep where the
patch caused a regression in one of the cases; we can either try to
improve the patch and verify it doesn't cause any regressions, or
assume those are minority cases we don't care about.  Another point
is that this thread started with a theory that this idea can give
benefits on certain filesystems, and AFAICS we have tested it on only
one other type of system, so I am not sure that is sufficient.

Having said that, I just want to clarify that I am positive about this
work but just not very sure that it is a universal win based on the
testing done so far.

[1] - https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
On Mon, Jun 15, 2020 at 5:09 PM David Rowley <dgrowleyml@gmail.com> wrote:
> To summarise what's all been proposed so far:
>
> 1. Use a constant (e.g. 64) as the parallel step size.
> 2. Ramp up the step size over time.
> 3. Ramp down the step size towards the end of the scan.
> 4. Auto-determine a good step size based on the size of the relation.
> 5. Add a GUC to allow users to control or override the step size.
> 6. Add a reloption to allow users to control or override the step size.
>
> Here are my thoughts on each of those:
>
> #1 is a bad idea as there are legitimate use-cases for using parallel
> query on small tables. e.g calling some expensive parallel safe
> function. Small tables are more likely to be cached.

I agree.

> #2 I don't quite understand why this is useful

I was thinking that if the query had a small LIMIT, you'd want to
avoid handing out excessively large chunks, but actually that seems
like it might just be fuzzy thinking on my part. We're not committing
to scanning the entirety of the chunk just because we've assigned it
to a worker.

> #3 I understand this is to try to make it so workers all complete
> around about the same time.
> #4 We really should be doing it this way.
> #5 Having a global knob to control something that is very specific to
> the size of a relation does not make much sense to me.
> #6. I imagine someone will have some weird use-case that works better
> when parallel workers get 1 page at a time. I'm not convinced that
> they're not doing something else wrong.

Agree with all of that.

> So my vote is for 4 with possibly 3, if we can come up with something
> smart enough * that works well in parallel.  I think there's less of a
> need for this if we divided the relation into more chunks, e.g. 8192
> or 16384.

I agree with that too.

> * Perhaps when there are less than 2 full chunks remaining, workers
> can just take half of what is left. Or more specifically
> Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
> out allocating an amount of pages proportional to the amount of beer
> each mathematician receives in the "An infinite number of
> mathematicians walk into a bar" joke, obviously with the exception
> that we stop dividing when we get to 1. However, I'm not quite sure
> how well that can be made to work with multiple bartenders working in
> parallel.

That doesn't sound nearly aggressive enough to me. I mean, let's
suppose that we're concerned about the scenario where one chunk takes
50x as long as all the other chunks. Well, if we have 1024 chunks
total, and we hit the problem chunk near the beginning, there will be
no problem. In effect, there are 1073 units of work instead of 1024,
and we accidentally assigned one guy 50 units of work when we thought
we were assigning 1 unit of work. If there's enough work left that we
can assign each other worker 49 units more than what we would have
done had that chunk been the same cost as all the others, then there's
no problem. So for instance if there are 4 workers, we can still even
things out if we hit the problematic chunk more than ~150 chunks from
the end. If we're closer to the end than that, there's no way to avoid
the slow chunk delaying the overall completion time, and the problem
gets worse as the problem chunk gets closer to the end.

How can we improve? Well, if when we're less than 150 chunks from the
end, we reduce the chunk size by 2x, then instead of having 1 chunk
that is 50x as expensive, hopefully we'll have 2 smaller chunks that
are each 50x as expensive. They'll get assigned to 2 different
workers, and the remaining 2 workers now need enough extra work from
other chunks to even out the work distribution, which should still be
possible. It gets tough though if breaking the one expensive chunk in
half produces 1 regular-price half-chunk and one half-chunk that is
50x as expensive as all the others. Because we have <150 chunks left,
there's no way to keep everybody else busy until the sole expensive
half-chunk completes. In a sufficiently extreme scenario, assigning
even a single full block to a worker is too much, and you really want
to hand the tuples out individually.

Anyway, if we don't do anything special until we get down to the last
2 chunks, it's only going to make a difference when one of those last
2 chunks happens to be the expensive one. If say the third-to-last
chunk is the expensive one, subdividing the last 2 chunks lets all the
workers who didn't get the expensive chunk fight over the scraps, but
that's not an improvement. If anything it's worse, because there's
more communication overhead and you don't gain anything vs. just
assigning each chunk to a worker straight up.

In a real example we don't know that we have a single expensive chunk
-- each chunk just has its own cost, and they could all be the same,
or some could be much more expensive. When we have a lot of work left,
we can be fairly cavalier in handing out larger chunks of it with the
full confidence that even if some of those chunks turn out to be way
more expensive than others, we'll still be able to equalize the
finishing times by our choice of how to distribute the remaining work.
But as there's less and less work left, I think you need to hand out
the work in smaller increments to maximize the chances of obtaining an
equal work distribution.

So maybe one idea is to base the chunk size on the number of blocks
remaining to be scanned. Say that the chunk size is limited to 1/512
of the *remaining* blocks in the relation, probably with some upper
limit. I doubt going beyond 1GB makes any sense just because that's
how large the files are, for example, and that might be too big for
other reasons. But let's just use that as an example. Say you have a
20TB relation. You hand out 1GB segments until you get down to 512GB
remaining. Then you hand out 512MB segments until you get down to
256GB remaining, and then 256MB segments until you get down to 128GB
remaining, and so on. Once you get down to the last 4MB you're handing
out individual blocks, just as you would do from the beginning if the
whole relation size was 4MB.
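
To make that schedule concrete, here is a rough standalone sketch
(illustrative only, not from any patch; the 1GB cap, the power-of-two
rounding and the 8kB block size are assumptions taken from the
description above):

#include <stdio.h>

#define BLCKSZ       8192ULL
#define MAX_CHUNK    (1024ULL * 1024 * 1024 / BLCKSZ)   /* 1GB worth of blocks */

/* largest power of two <= n, for n > 0 */
static unsigned long long
floor_power_of_2(unsigned long long n)
{
    unsigned long long p = 1;

    while (p * 2 <= n)
        p *= 2;
    return p;
}

/* hand out roughly 1/512 of what is left, within [1, MAX_CHUNK] blocks */
static unsigned long long
choose_chunk(unsigned long long remaining)
{
    unsigned long long chunk = remaining / 512;

    if (chunk > MAX_CHUNK)
        chunk = MAX_CHUNK;
    if (chunk < 1)
        chunk = 1;
    return floor_power_of_2(chunk);
}

int
main(void)
{
    /* a 20TB relation, as in the example above */
    unsigned long long remaining = 20ULL * 1024 * 1024 * 1024 * 1024 / BLCKSZ;
    unsigned long long last = 0;

    while (remaining > 0)
    {
        unsigned long long chunk = choose_chunk(remaining);

        if (chunk != last)      /* report each time the chunk size steps down */
        {
            printf("remaining %12llu blocks -> chunk %7llu blocks (%llu kB)\n",
                   remaining, chunk, chunk * BLCKSZ / 1024);
            last = chunk;
        }
        remaining -= chunk;
    }
    return 0;
}

For the 20TB example it prints the 1GB, 512MB, 256MB, ... step-downs
described above, finishing with single blocks over the last few MB.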

This kind of thing is a bit overcomplicated and doesn't really help if
the first 1GB you hand out at the very beginning turns out to be the
1GB chunk of death, and it takes a bazillion times longer than
anything else, and it's just going to be the last worker to finish no
matter what you do about anything else. The increasing granularity
near the end is just fighting over scraps in that case. The only thing
you can do to avoid this kind of problem is use a lower maximum chunk
size from the beginning, and I think we might want to consider doing
that, because I suspect that the incremental benefits from 64MB chunks
to 1GB chunks are pretty small, for example.

But, in more normal cases where you have some somewhat-expensive
chunks mixed in with the regular-price chunks, I think this sort of
thing should work pretty well. If you never give out more than 1/512
of the remaining blocks, then you can still achieve an equal work
distribution as long as you don't hit a chunk whose cost relative to
others is more than 512/(# of processes you have - 1). So for example
with 6 processes, you need a single chunk that's more than 100x as
expensive as the others to break it. That can definitely happen,
because we can construct arbitrarily bad cases for this sort of thing,
but hopefully they wouldn't come up all that frequently...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
On Tue, Jun 16, 2020 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I agree that won't be a common scenario, but apart from that, I am also
> not sure we can conclude that the proposed patch won't cause any
> regressions.  See one of the tests [1] done by Soumyadeep where the
> patch caused a regression in one of the cases; we can either try to
> improve the patch and verify it doesn't cause any regressions, or
> assume those are minority cases we don't care about.  Another point
> is that this thread started with a theory that this idea can give
> benefits on certain filesystems, and AFAICS we have tested it on only
> one other type of system, so I am not sure that is sufficient.

Yeah, it seems like those cases might need some more investigation,
but they're also not necessarily an argument for a configuration
setting. It's not so much that I dislike the idea of being able to
configure something here; it's really that I don't want a reloption
that feels like magic. For example, we know that work_mem can be
really hard to configure because there may be no value that's high
enough to make your queries run fast during normal periods but low
enough to avoid running out of memory during busy periods. That kind
of thing sucks, and we should avoid creating more such cases.

One problem here is that the best value might depend not only on the
relation but on the individual query. A GUC could be changed
per-query, but different tables in the query might need different
values. Changing a reloption requires locking, and you wouldn't want
to have to keep changing it for each different query. Now if we figure
out that something is hardware-dependent -- like we come up with a
good formula that adjusts the value automatically most of the time,
but say it needs to be higher on SSDs than on spinning disks or the
other way around, well then that's a good candidate for some kind of
setting, maybe a tablespace option. But if it seems to depend on the
query, we need a better idea, not a user-configurable setting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Wed, 17 Jun 2020 at 03:20, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 5:09 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > * Perhaps when there are less than 2 full chunks remaining, workers
> > can just take half of what is left. Or more specifically
> > Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
> > out allocating an amount of pages proportional to the amount of beer
> > each mathematician receives in the "An infinite number of
> > mathematicians walk into a bar" joke, obviously with the exception
> > that we stop dividing when we get to 1. However, I'm not quite sure
> > how well that can be made to work with multiple bartenders working in
> > parallel.
>
> That doesn't sound nearly aggressive enough to me. I mean, let's
> suppose that we're concerned about the scenario where one chunk takes
> 50x as long as all the other chunks. Well, if we have 1024 chunks
> total, and we hit the problem chunk near the beginning, there will be
> no problem. In effect, there are 1073 units of work instead of 1024,
> and we accidentally assigned one guy 50 units of work when we thought
> we were assigning 1 unit of work. If there's enough work left that we
> can assign each other worker 49 units more than what we would have
> done had that chunk been the same cost as all the others, then there's
> no problem. So for instance if there are 4 workers, we can still even
> things out if we hit the problematic chunk more than ~150 chunks from
> the end. If we're closer to the end than that, there's no way to avoid
> the slow chunk delaying the overall completion time, and the problem
> gets worse as the problem chunk gets closer to the end.

I've got something like that in the attached.  Currently, I've set the
number of chunks to 2048 and I'm starting the ramp down when 64 chunks
remain, which means we'll start the ramp-down when there's about 3.1%
of the scan remaining. I didn't see the point of going with the larger
number of chunks and having ramp-down code.
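
For anyone who wants to eyeball the allocation pattern without building
the patch, here is a rough standalone simulation of the behaviour
described above (illustrative only; the constants and the halving-based
ramp-down are taken from the description, not from the patch itself):

#include <stdio.h>

#define NCHUNKS            2048
#define RAMPDOWN_CHUNKS    64

/* stand-in for pg_nextpower2_32() */
static unsigned
next_power_of_2(unsigned n)
{
    unsigned    p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

int
main(void)
{
    unsigned long long nblocks = 1000000;   /* matches the demo call below */
    unsigned    chunk_size = next_power_of_2((unsigned) (nblocks / NCHUNKS));
    unsigned long long allocated = 0;
    unsigned    last_reported = 0;

    while (allocated < nblocks)
    {
        unsigned long long alloc;

        /* ramp down: halve the chunk size over roughly the last 64 chunks */
        if (chunk_size > 1 &&
            allocated > nblocks - (unsigned long long) chunk_size * RAMPDOWN_CHUNKS)
            chunk_size >>= 1;

        alloc = chunk_size;
        if (alloc > nblocks - allocated)
            alloc = nblocks - allocated;

        if (chunk_size != last_reported)
        {
            printf("from block %llu: chunk size %u\n", allocated, chunk_size);
            last_reported = chunk_size;
        }
        allocated += alloc;
    }
    return 0;
}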

Attached is the patch and an .sql file with a function which can be
used to demonstrate what chunk sizes the patch will choose and demo
the ramp-down.

e.g.
# select show_parallel_scan_chunks(1000000, 2048, 64);

It would be really good if people could test this using the test case
mentioned in [1]. We really need to get a good idea of how this
behaves on various operating systems.

With a 32TB relation, the code will make the chunk size 16GB.  Perhaps
I should change the code to cap that at 1GB.

David

[1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn%2BSa7NTKEVWtg%40mail.gmail.com

Вложения

Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
On Thu, Jun 18, 2020 at 6:15 AM David Rowley <dgrowleyml@gmail.com> wrote:
> With a 32TB relation, the code will make the chunk size 16GB.  Perhaps
> I should change the code to cap that at 1GB.

It seems pretty hard to believe there's any significant advantage to a
chunk size >1GB, so I would be in favor of that change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Fri, 19 Jun 2020 at 03:26, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 6:15 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > With a 32TB relation, the code will make the chunk size 16GB.  Perhaps
> > I should change the code to cap that at 1GB.
>
> It seems pretty hard to believe there's any significant advantage to a
> chunk size >1GB, so I would be in favor of that change.

I could certainly make that change.  With the standard page size, 1GB
is 131072 pages and a power of 2. That would change for non-standard
page sizes, so we'd need to decide if we want to keep the chunk size a
power of 2, or just cap it exactly at whatever number of pages 1GB is.

I'm not sure how much of a difference it'll make, but I also just want
to note that synchronized scans can mean we'll start the scan anywhere
within the table, so capping to 1GB does not mean we read an entire
extent. It's more likely to span 2 extents.

David



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Fri, 19 Jun 2020 at 11:34, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 19 Jun 2020 at 03:26, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 6:15 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > > With a 32TB relation, the code will make the chunk size 16GB.  Perhaps
> > > I should change the code to cap that at 1GB.
> >
> > It seems pretty hard to believe there's any significant advantage to a
> > chunk size >1GB, so I would be in favor of that change.
>
> I could certainly make that change.  With the standard page size, 1GB
> is 131072 pages and a power of 2. That would change for non-standard
> page sizes, so we'd need to decide if we want to keep the chunk size a
> power of 2, or just cap it exactly at whatever number of pages 1GB is.
>
> I'm not sure how much of a difference it'll make, but I also just want
> to note that synchronized scans can mean we'll start the scan anywhere
> within the table, so capping to 1GB does not mean we read an entire
> extent. It's more likely to span 2 extents.

Here's a patch which caps the maximum chunk size to 131072.  If
someone doubles the page size then that'll be 2GB instead of 1GB. I'm
not personally worried about that.

I tested the performance on a Windows 10 laptop using the test case from [1]

Master:

workers=0: Time: 141175.935 ms (02:21.176)
workers=1: Time: 316854.538 ms (05:16.855)
workers=2: Time: 323471.791 ms (05:23.472)
workers=3: Time: 321637.945 ms (05:21.638)
workers=4: Time: 308689.599 ms (05:08.690)
workers=5: Time: 289014.709 ms (04:49.015)
workers=6: Time: 267785.270 ms (04:27.785)
workers=7: Time: 248735.817 ms (04:08.736)

Patched:

workers=0: Time: 155985.204 ms (02:35.985)
workers=1: Time: 112238.741 ms (01:52.239)
workers=2: Time: 105861.813 ms (01:45.862)
workers=3: Time: 91874.311 ms (01:31.874)
workers=4: Time: 92538.646 ms (01:32.539)
workers=5: Time: 93012.902 ms (01:33.013)
workers=6: Time: 94269.076 ms (01:34.269)
workers=7: Time: 90858.458 ms (01:30.858)

David

[1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn%2BSa7NTKEVWtg%40mail.gmail.com

Вложения

Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
On Thu, Jun 18, 2020 at 10:10 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Here's a patch which caps the maximum chunk size to 131072.  If
> someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> not personally worried about that.

Maybe use RELSEG_SIZE?

> I tested the performance on a Windows 10 laptop using the test case from [1]
>
> Master:
>
> workers=0: Time: 141175.935 ms (02:21.176)
> workers=1: Time: 316854.538 ms (05:16.855)
> workers=2: Time: 323471.791 ms (05:23.472)
> workers=3: Time: 321637.945 ms (05:21.638)
> workers=4: Time: 308689.599 ms (05:08.690)
> workers=5: Time: 289014.709 ms (04:49.015)
> workers=6: Time: 267785.270 ms (04:27.785)
> workers=7: Time: 248735.817 ms (04:08.736)
>
> Patched:
>
> workers=0: Time: 155985.204 ms (02:35.985)
> workers=1: Time: 112238.741 ms (01:52.239)
> workers=2: Time: 105861.813 ms (01:45.862)
> workers=3: Time: 91874.311 ms (01:31.874)
> workers=4: Time: 92538.646 ms (01:32.539)
> workers=5: Time: 93012.902 ms (01:33.013)
> workers=6: Time: 94269.076 ms (01:34.269)
> workers=7: Time: 90858.458 ms (01:30.858)

Nice results. I wonder if these stack with the gains Thomas was
discussing with his DSM-from-the-main-shmem-segment patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Sat, 20 Jun 2020 at 08:00, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 10:10 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > Here's a patch which caps the maximum chunk size to 131072.  If
> > someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> > not personally worried about that.
>
> Maybe use RELSEG_SIZE?

I was hoping to keep the guarantees that the chunk size is always a
power of 2.  If, for example, someone configured PostgreSQL
--with-segsize=3, then RELSEG_SIZE would be 393216 with the standard
BLCKSZ.

Not having it a power of 2 does mean the ramp-down is more uneven when
the sizes become very small:

postgres=# select 393216>>x from generate_Series(0,18)x;
 ?column?
----------
   393216
   196608
    98304
    49152
    24576
    12288
     6144
     3072
     1536
      768
      384
      192
       96
       48
       24
       12
        6
        3
        1
(19 rows)

Perhaps that's not a problem though, but then again, perhaps just
keeping it at 131072 regardless of RELSEG_SIZE and BLCKSZ is also ok.
The benchmarks I did on Windows [1] showed that the returns diminished
once we started making the step size some decent amount so my thoughts
are that I've set PARALLEL_SEQSCAN_MAX_CHUNK_SIZE to something large
enough that it'll make no difference to the performance anyway. So
there's probably not much point in giving it too much thought.

Perhaps pg_nextpower2_32(RELSEG_SIZE) would be okay though.

David

[1] https://www.postgresql.org/message-id/CAApHDvopPkA+q5y_k_6CUV4U6DPhmz771VeUMuzLs3D3mWYMOg@mail.gmail.com



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Fri, 19 Jun 2020 at 14:10, David Rowley <dgrowleyml@gmail.com> wrote:
> Here's a patch which caps the maximum chunk size to 131072.  If
> someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> not personally worried about that.
>
> I tested the performance on a Windows 10 laptop using the test case from [1]

I also tested this on an AMD machine running Ubuntu 20.04 on kernel
version 5.4.0-37.  I used the same 100GB table I mentioned in [1], but
with the query "select * from t where a < 0;", which saves having to
do any aggregate work.

There seems to be quite a big win with Linux too. See the attached
graphs.  Both graphs are based on the same results, just the MB/sec
one takes the query time in milliseconds and converts that into MB/sec
for the 100 GB table. i.e. 100*1024/(<milliseconds> /1000)

The machine is a 64core / 128 thread AMD machine (3990x) with a 1TB
Samsung 970 Pro evo plus SSD, 64GB RAM

> [1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn%2BSa7NTKEVWtg%40mail.gmail.com

Вложения

Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Mon, 22 Jun 2020 at 16:54, David Rowley <dgrowleyml@gmail.com> wrote:
> I also tested this on an AMD machine running Ubuntu 20.04 on kernel
> version 5.4.0-37.  I used the same 100GB table I mentioned in [1], but
> with the query "select * from t where a < 0;", which saves having to
> do any aggregate work.

I just wanted to add a note here that Thomas and I just discussed this
a bit offline. He recommended I try setting the kernel readahead a bit
higher.

It was set to 128kB, so I cranked it up to 2MB with:

sudo blockdev --setra 4096 /dev/nvme0n1p2

I didn't want to run the full test again as it took quite a long time,
so I just tried with 32 workers.

The first two results here are taken from the test results I just
posted 1 hour ago.

Master readahead=128kB = 89921.283 ms
v2 patch readahead=128kB = 36085.642 ms
Master readahead=2MB = 60984.905 ms
v2 patch readahead=2MB = 22611.264 ms

There must be a fairly large element of reading from the page cache
there since 22.6 seconds means 4528MB/sec over the 100GB table. The
maximum for a PCIe 3.0 x4 slot is 3940MB/s

David



Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
On Sun, Jun 21, 2020 at 6:52 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Perhaps that's not a problem though, but then again, perhaps just
> keeping it at 131072 regardless of RELSEG_SIZE and BLCKSZ is also ok.
> The benchmarks I did on Windows [1] showed that the returns diminished
> once we started making the step size some decent amount so my thoughts
> are that I've set PARALLEL_SEQSCAN_MAX_CHUNK_SIZE to something large
> enough that it'll make no difference to the performance anyway. So
> there's probably not much point in giving it too much thought.
>
> Perhaps pg_nextpower2_32(RELSEG_SIZE) would be okay though.

I guess I don't care that much; it was just a thought. Maybe tying it
to RELSEG_SIZE is a bad idea anyway. After all, what if we find cases
where 1GB is too much? Like, how much benefit do we get from making it
1GB rather than 64MB, say? I don't think we should be making this
value big just because we can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
Ranier Vilela
Дата:
On Mon, Jun 22, 2020 at 2:53 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Mon, 22 Jun 2020 at 16:54, David Rowley <dgrowleyml@gmail.com> wrote:
> > I also tested this on an AMD machine running Ubuntu 20.04 on kernel
> > version 5.4.0-37.  I used the same 100GB table I mentioned in [1], but
> > with the query "select * from t where a < 0;", which saves having to
> > do any aggregate work.
>
> I just wanted to add a note here that Thomas and I just discussed this
> a bit offline. He recommended I try setting the kernel readahead a bit
> higher.
>
> It was set to 128kB, so I cranked it up to 2MB with:
>
> sudo blockdev --setra 4096 /dev/nvme0n1p2
>
> I didn't want to run the full test again as it took quite a long time,
> so I just tried with 32 workers.
>
> The first two results here are taken from the test results I just
> posted 1 hour ago.
>
> Master readahead=128kB = 89921.283 ms
> v2 patch readahead=128kB = 36085.642 ms
> Master readahead=2MB = 60984.905 ms
> v2 patch readahead=2MB = 22611.264 ms

Hi, I redid the tests with v2 here.
Notebook with i5, 8GB RAM, 256GB SSD
Windows 10 64-bit (2004)
MSVC 2019 64-bit
PostgreSQL head (with the v2 patch)
Configuration: none
Connection: local IPv4 (not localhost)

create table t (a int, b text);
insert into t select x,md5(x::text) from
generate_series(1,1000000*1572.7381809)x;
vacuum freeze t;

set max_parallel_workers_per_gather = 0;
Time: 354211,826 ms (05:54,212)
set max_parallel_workers_per_gather = 1;
Time: 332805,773 ms (05:32,806)
set max_parallel_workers_per_gather = 2;
Time: 282566,711 ms (04:42,567)
set max_parallel_workers_per_gather = 3;
Time: 263383,945 ms (04:23,384)
set max_parallel_workers_per_gather = 4;
Time: 255728,259 ms (04:15,728)
set max_parallel_workers_per_gather = 5;
Time: 238288,720 ms (03:58,289)
set max_parallel_workers_per_gather = 6;
Time: 238647,792 ms (03:58,648)
set max_parallel_workers_per_gather = 7;
Time: 231295,763 ms (03:51,296)
set max_parallel_workers_per_gather = 8;
Time: 232502,828 ms (03:52,503)
set max_parallel_workers_per_gather = 9;
Time: 230970,604 ms (03:50,971)
set max_parallel_workers_per_gather = 10;
Time: 232104,182 ms (03:52,104)

set max_parallel_workers_per_gather = 8;
postgres=# explain select count(*) from t;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=15564556.43..15564556.44 rows=1 width=8)
   ->  Gather  (cost=15564555.60..15564556.41 rows=8 width=8)
         Workers Planned: 8
         ->  Partial Aggregate  (cost=15563555.60..15563555.61 rows=1 width=8)
               ->  Parallel Seq Scan on t  (cost=0.00..15072074.88 rows=196592288 width=0)
(5 rows)

Questions:
1. Why acquire and release lock in retry: loop.

Wouldn't this be better:

    /* Grab the spinlock. */
    SpinLockAcquire(&pbscan->phs_mutex);

retry:
    /*
     * If the scan's startblock has not yet been initialized, we must do so
     * now.  If this is not a synchronized scan, we just start at block 0, but
     * if it is a synchronized scan, we must get the starting position from
     * the synchronized scan machinery.  We can't hold the spinlock while
     * doing that, though, so release the spinlock, get the information we
     * need, and retry.  If nobody else has initialized the scan in the
     * meantime, we'll fill in the value we fetched on the second time
     * through.
     */
    if (pbscan->phs_startblock == InvalidBlockNumber)
    {
        if (!pbscan->base.phs_syncscan)
            pbscan->phs_startblock = 0;
        else if (sync_startpage != InvalidBlockNumber)
            pbscan->phs_startblock = sync_startpage;
        else
        {
            sync_startpage = ss_get_location(rel, pbscan->phs_nblocks);
            goto retry;
        }
    }
    SpinLockRelease(&pbscan->phs_mutex);
}

Acquire lock once, before retry?

2. Is there any configuration to improve performance?

regards,
Ranier Vilela

Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
Ranier,

This topic is largely unrelated to the current thread. Also...

On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
> Questions:
> 1. Why acquire and release lock in retry: loop.

This is a super-bad idea. Note the coding rule mentioned in spin.h.
There are many discussions on this mailing list about the importance of
keeping the critical section for a spinlock to a few instructions.
Calling another function that *itself acquires an LWLock* is
definitely not OK.
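
To spell the rule out, here is a simplified, self-contained sketch of
the pattern being discussed (paraphrased, not the actual
table_block_parallelscan_startblock_init() source; a pthread mutex and
lookup_start_block() stand in for the spinlock and for ss_get_location(),
which may itself take heavier locks).  The point is that the lock is
released before calling anything potentially slow, and the check is
simply retried afterwards:

#include <pthread.h>
#include <stdio.h>

#define InvalidBlockNumber  ((unsigned) 0xFFFFFFFF)

static pthread_mutex_t shared_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned shared_startblock = InvalidBlockNumber;

static unsigned
lookup_start_block(void)
{
    /* imagine this taking an LWLock or doing other slow work */
    return 42;
}

static unsigned
init_startblock(void)
{
    unsigned    sync_startpage = InvalidBlockNumber;
    unsigned    result;

retry:
    pthread_mutex_lock(&shared_mutex);
    if (shared_startblock == InvalidBlockNumber)
    {
        if (sync_startpage != InvalidBlockNumber)
            shared_startblock = sync_startpage;
        else
        {
            /*
             * Drop the lock before doing the slow lookup, then retry, so
             * the critical section stays down to a handful of instructions.
             */
            pthread_mutex_unlock(&shared_mutex);
            sync_startpage = lookup_start_block();
            goto retry;
        }
    }
    result = shared_startblock;
    pthread_mutex_unlock(&shared_mutex);
    return result;
}

int
main(void)
{
    printf("start block = %u\n", init_startblock());
    return 0;
}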

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
Ranier Vilela
Дата:
On Mon, Jun 22, 2020 at 4:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Ranier,
>
> This topic is largely unrelated to the current thread. Also...

Well, I was trying to improve the patch for the current thread.
Or perhaps you are referring to something else, which I may not have understood.

> On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
> > Questions:
> > 1. Why acquire and release lock in retry: loop.
>
> This is a super-bad idea. Note the coding rule mentioned in spin.h.
> There are many discussions on this mailing list about the importance of
> keeping the critical section for a spinlock to a few instructions.
> Calling another function that *itself acquires an LWLock* is
> definitely not OK.

Perhaps I was not clear and it is another misunderstanding.
I am not suggesting a function to acquire the lock.
By the way, I did the tests with this change and it worked perfectly.
But, as it is someone else's patch, I asked in order to learn.
By the way, my suggestion uses fewer instructions than the patch.
The only change I asked about is why the lock is acquired and released
repeatedly within the goto retry loop, when it is already held.
If I can acquire the lock before retry: and release it only at the end,
when I leave table_block_parallelscan_startblock_init, why not do it?
I will attach the suggested excerpt so that there is no doubt about what
I am asking.
 
regards,
Ranier Vilela
Вложения

Re: Parallel Seq Scan vs kernel read ahead

От
Thomas Munro
Дата:
On Fri, Jun 19, 2020 at 2:10 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Here's a patch which caps the maximum chunk size to 131072.  If
> someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> not personally worried about that.

I wonder how this interacts with the sync scan feature.  It has a
conflicting goal...



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Tue, 23 Jun 2020 at 09:52, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Jun 19, 2020 at 2:10 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > Here's a patch which caps the maximum chunk size to 131072.  If
> > someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> > not personally worried about that.
>
> I wonder how this interacts with the sync scan feature.  It has a
> conflicting goal...

Of course, syncscan relies on subsequent scanners finding buffers
cached, either in (ideally) shared buffers or the kernel cache. The
scans need to be roughly synchronised for that to work.  If we go and
make the chunk size too big, then that'll reduce the chances of useful
buffers being found by subsequent scans.  It sounds like a good reason
to try and find the smallest chunk size that allows readahead to work
well. The benchmarks I did on Windows certainly show that there are
diminishing returns when the chunk size gets larger, so capping it at
some number of megabytes would probably be a good idea.  It would just
take a bit of work to figure out how many megabytes that should be.
Likely it's going to depend on the size of shared buffers and how much
memory the machine has got, but also what other work is going on that
might be evicting buffers at the same time. Perhaps something in the
range of 2-16MB would be ok.  I can do some tests with that and see if
I can get the same performance as with the larger chunks.

David



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Tue, 23 Jun 2020 at 07:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
> > Questions:
> > 1. Why acquire and release lock in retry: loop.
>
> This is a super-bad idea. Note the coding rule mentioned in spin.h.
> There are many discussions on this mailing list about the importance of
> keeping the critical section for a spinlock to a few instructions.
> Calling another function that *itself acquires an LWLock* is
> definitely not OK.

Just a short history lesson for Ranier to help clear up any confusion:

Back before 3cda10f41 there was some merit in improving the
performance of these functions. Before that, we used to dish out pages
under a lock. With that old method, if given enough workers and a
simple enough query, we could start to see workers waiting on the lock
just to obtain the next block number they're to work on.  After the
atomics were added in that commit, we didn't really see that again.

What we're trying to fix here is the I/O pattern that these functions
induce and that's all we should be doing here.  Changing this is
tricky to get right as we need to consider so many operating systems
and how they deal with I/O readahead.

David



Re: Parallel Seq Scan vs kernel read ahead

От
Ranier Vilela
Дата:
On Mon, Jun 22, 2020 at 11:29 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 23 Jun 2020 at 07:33, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Jun 22, 2020 at 12:47 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
> > > Questions:
> > > 1. Why acquire and release lock in retry: loop.
> >
> > This is a super-bad idea. Note the coding rule mentioned in spin.h.
> > There are many discussions on this mailing list about the importance of
> > keeping the critical section for a spinlock to a few instructions.
> > Calling another function that *itself acquires an LWLock* is
> > definitely not OK.
>
> Just a short history lesson for Ranier to help clear up any confusion:
>
> Back before 3cda10f41 there was some merit in improving the
> performance of these functions. Before that, we used to dish out pages
> under a lock. With that old method, if given enough workers and a
> simple enough query, we could start to see workers waiting on the lock
> just to obtain the next block number they're to work on.  After the
> atomics were added in that commit, we didn't really see that again.

That is a good explanation. I had already imagined it could be to help
other processes, but I still wasn't sure.
However, I did a test with this modification (lock before retry), and it
worked.

> What we're trying to fix here is the I/O pattern that these functions
> induce and that's all we should be doing here.  Changing this is
> tricky to get right as we need to consider so many operating systems
> and how they deal with I/O readahead.

Yes, I understand that the focus here is I/O.

Sorry for the noise.

regards,
Ranier Vilela

Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Tue, 23 Jun 2020 at 10:50, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 23 Jun 2020 at 09:52, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Fri, Jun 19, 2020 at 2:10 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > > Here's a patch which caps the maximum chunk size to 131072.  If
> > > someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> > > not personally worried about that.
> >
> > I wonder how this interacts with the sync scan feature.  It has a
> > conflicting goal...
>
> Of course, syncscan relies on subsequent scanners finding buffers
> cached, either in (ideally) shared buffers or the kernel cache. The
> scans need to be roughly synchronised for that to work.  If we go and
> make the chunk size too big, then that'll reduce the chances of useful
> buffers being found by subsequent scans.  It sounds like a good reason
> to try and find the smallest chunk size that allows readahead to work
> well. The benchmarks I did on Windows certainly show that there are
> diminishing returns when the chunk size gets larger, so capping it at
> some number of megabytes would probably be a good idea.  It would just
> take a bit of work to figure out how many megabytes that should be.
> Likely it's going to depend on the size of shared buffers and how much
> memory the machine has got, but also what other work is going on that
> might be evicting buffers at the same time. Perhaps something in the
> range of 2-16MB would be ok.  I can do some tests with that and see if
> I can get the same performance as with the larger chunks.

I did some further benchmarking on both Windows 10 and on Linux with
the 5.4.0-37 kernel running on Ubuntu 20.04.  I started by reducing
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE down to 256 and ran the test multiple
times, each time doubling the PARALLEL_SEQSCAN_MAX_CHUNK_SIZE.  On the
Linux test, I used the standard kernel readahead of 128kB. Thomas and I
discovered earlier that increasing that increases the throughput all
round.

These tests were done with the PARALLEL_SEQSCAN_NCHUNKS as 2048, which
means with the 100GB table I used for testing, the uncapped chunk size
of 8192 blocks would be selected (aka 64MB).  The performance is quite
a bit slower when the chunk size is capped to 256 blocks and it does
increase again with larger maximum chunk sizes, but the returns do get
smaller and smaller with each doubling of
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE. Uncapped, or 8192 did give the best
performance on both Windows and Linux. I didn't test with anything
higher than that.
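
For reference, the arithmetic behind those settings (a standalone
sketch assuming 8kB blocks, 2048 target chunks and power-of-two
rounding; not code from the patch):

#include <stdio.h>

int
main(void)
{
    const unsigned long long blcksz = 8192;
    const unsigned long long nblocks = 100ULL * 1024 * 1024 * 1024 / blcksz; /* 100GB */
    unsigned long long raw = nblocks / 2048;
    unsigned long long chunk = 1;

    while (chunk < raw)
        chunk <<= 1;

    printf("raw chunk %llu blocks, rounded up to %llu blocks = %llu MB\n",
           raw, chunk, chunk * blcksz / (1024 * 1024));
    return 0;
}

With those inputs this works out to 6400 blocks, rounded up to 8192
blocks, i.e. 64MB per chunk for the 100GB table.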

So, based on these results, it seems 64MB is not a bad value to cap
the chunk size at. If that turns out to be true for other tests too,
then likely 64MB is not too unrealistic a value to cap the size of the
block chunks to.

Please see the attached v2_on_linux.png and v2_on_windows.png for the
results of that.

I also performed another test to see how the performance looks with
both synchronize_seqscans on and off.  To test this I decided that a
100GB table on a 64GB RAM machine was just not large enough, so I
increased the table size to 800GB. I set parallel_workers for the
relation to 10 and ran:

drowley@amd3990x:~$ cat bench.sql
select * from t where a < 0;
pgbench -n -f bench.sql -T 10800 -P 600 -c 6 -j 6 postgres

(This query returns 0 rows).

So each query had 11 backends (including the main process) and there
were 6 of those running concurrently. i.e. 66 backends busy working on
the problem in total.

The results of that were:

Auto chunk size selection without any cap (for an 800GB table that's
65536 blocks)

synchronize_seqscans = on: latency average = 372738.134 ms (2197.7 MB/s) <-- bad
synchronize_seqscans = off: latency average = 320204.028 ms (2558.3 MB/s)

So here it seems that synchronize_seqscans = on slows things down.

Trying again after capping the number of blocks per chunk to 8192:

synchronize_seqscans = on: latency average = 321969.172 ms (2544.3 MB/s)
synchronize_seqscans = off: latency average = 321389.523 ms (2548.9 MB/s)

So the performance there is about the same.

I was surprised to see that synchronize_seqscans = off didn't slow
down the performance by about 6x. So I tested to see what master does,
and:

synchronize_seqscans = on: latency average = 1070226.162 ms (765.4MB/s)
synchronize_seqscans = off: latency average = 1085846.859 ms (754.4MB/s)

It does pretty poorly in both cases.

The full results of that test are in the attached
800gb_table_synchronize_seqscans_test.txt file.

In summary, based on these tests, I don't think we're making anything
worse in regard to synchronize_seqscans if we cap the maximum number
of blocks to allocate to each worker at once to 8192. Perhaps there's
some argument for using something smaller than that for servers with
very little RAM, but I don't personally think so, as it still depends
on the table size and it's hard to imagine tables in the hundreds of
GBs on servers that struggle with chunk allocations of 64MB.  The
table needs to be at least ~70GB to get an 8192-block chunk size with
the current v2 patch settings.

David

Вложения

Re: Parallel Seq Scan vs kernel read ahead

От
Robert Haas
Дата:
On Tue, Jun 23, 2020 at 11:53 PM David Rowley <dgrowleyml@gmail.com> wrote:
> In summary, based on these tests, I don't think we're making anything
> worse in regard to synchronize_seqscans if we cap the maximum number
> of blocks to allocate to each worker at once to 8192. Perhaps there's
> some argument for using something smaller than that for servers with
> very little RAM, but I don't personally think so, as it still depends
> on the table size and it's hard to imagine tables in the hundreds of
> GBs on servers that struggle with chunk allocations of 64MB.  The
> table needs to be at least ~70GB to get an 8192-block chunk size with
> the current v2 patch settings.

Nice research. That makes me happy. I had a feeling the maximum useful
chunk size ought to be more in this range than the larger values we
were discussing before, but I didn't even think about the effect on
synchronized scans.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Seq Scan vs kernel read ahead

От
Thomas Munro
Дата:
On Fri, Jun 26, 2020 at 3:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 23, 2020 at 11:53 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > In summary, based on these tests, I don't think we're making anything
> > worse in regard to synchronize_seqscans if we cap the maximum number
> > of blocks to allocate to each worker at once to 8192. Perhaps there's
> > some argument for using something smaller than that for servers with
> > very little RAM, but I don't personally think so, as it still depends
> > on the table size and it's hard to imagine tables in the hundreds of
> > GBs on servers that struggle with chunk allocations of 64MB.  The
> > table needs to be at least ~70GB to get an 8192-block chunk size with
> > the current v2 patch settings.
>
> Nice research. That makes me happy. I had a feeling the maximum useful
> chunk size ought to be more in this range than the larger values we
> were discussing before, but I didn't even think about the effect on
> synchronized scans.

+1.  This seems about right to me.  We can always reopen the
discussion if someone shows up with evidence in favour of a tweak to
the formula, but this seems to address the basic problem pretty well,
and also fits nicely with future plans for AIO and DIO.



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Tue, 14 Jul 2020 at 19:13, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Jun 26, 2020 at 3:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jun 23, 2020 at 11:53 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > > In summary, based on these tests, I don't think we're making anything
> > > worse in regard to synchronize_seqscans if we cap the maximum number
> > > of blocks to allocate to each worker at once to 8192. Perhaps there's
> > > some argument for using something smaller than that for servers with
> > > very little RAM, but I don't personally think so, as it still depends
> > > on the table size and it's hard to imagine tables in the hundreds of
> > > GBs on servers that struggle with chunk allocations of 64MB.  The
> > > table needs to be at least ~70GB to get an 8192-block chunk size with
> > > the current v2 patch settings.
> >
> > Nice research. That makes me happy. I had a feeling the maximum useful
> > chunk size ought to be more in this range than the larger values we
> > were discussing before, but I didn't even think about the effect on
> > synchronized scans.
>
> +1.  This seems about right to me.  We can always reopen the
> discussion if someone shows up with evidence in favour of a tweak to
> the formula, but this seems to address the basic problem pretty well,
> and also fits nicely with future plans for AIO and DIO.

Thank you both of you for having a look at the results.

I'm now pretty happy with this too. I do understand that we've not
exactly exhaustively tested all our supported operating systems.
However, we've seen some great speedups with Windows 10 and Linux with
SSDs. Thomas saw great speedups with FreeBSD with the original patch
using chunk sizes of 64 blocks. (I wonder if it's worth verifying that
it increases further with the latest patch with the same test you did
in the original email on this thread?)

I'd like to propose that if anyone wants to do further testing on
other operating systems with SSDs or HDDs then it would be good if
that could be done within one week of this email. There are various
benchmarking ideas on this thread for inspiration.

If we've not seen any performance regressions within 1 week, then I
propose that we (pending final review) push this to allow wider
testing. It seems we're early enough in the PG14 cycle that there's a
large window of time for us to do something about any reported
performance regressions that come in.

I also have in mind that Amit was keen to see a GUC or reloption to
allow users to control this. My thoughts on that are still that it
would be possible to craft a case where we scan an entire heap to get
a very small number of rows that are all located in the same area in
the table and then call some expensive function on those rows. The
chunk size ramp down code will help reduce the chances of one worker
running on much longer than its co-workers, but not eliminate the
chances.  Even the code as it stands today could suffer from this to a
lesser extent if all the matching rows are on a single page. My
current thoughts are that this just seems unlikely and that the
granularity of 1 block for cases like this was never that great
anyway. I suppose a more ideal plan shape would "Distribute" matching
rows to allow another set of workers to pick these rows up one-by-one
and process them. Our to-date lack of such an operator probably counts
a little towards the fact that one parallel worker being tied up with
a large amount of work is not that common.  Based on those thoughts,
I'd like to avoid any GUC/reloption until we see evidence that it's
really needed.

Any objections to any of the above?

David



Re: Parallel Seq Scan vs kernel read ahead

От
Amit Kapila
Дата:
On Wed, Jul 15, 2020 at 5:55 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 14 Jul 2020 at 19:13, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Fri, Jun 26, 2020 at 3:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Tue, Jun 23, 2020 at 11:53 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > > > In summary, based on these tests, I don't think we're making anything
> > > > worse in regard to synchronize_seqscans if we cap the maximum number
> > > > of blocks to allocate to each worker at once to 8192. Perhaps there's
> > > > some argument for using something smaller than that for servers with
> > > > very little RAM, but I don't personally think so, as it still depends
> > > > on the table size and it's hard to imagine tables in the hundreds of
> > > > GBs on servers that struggle with chunk allocations of 64MB.  The
> > > > table needs to be at least ~70GB to get an 8192-block chunk size with
> > > > the current v2 patch settings.
> > >
> > > Nice research. That makes me happy. I had a feeling the maximum useful
> > > chunk size ought to be more in this range than the larger values we
> > > were discussing before, but I didn't even think about the effect on
> > > synchronized scans.
> >
> > +1.  This seems about right to me.  We can always reopen the
> > discussion if someone shows up with evidence in favour of a tweak to
> > the formula, but this seems to address the basic problem pretty well,
> > and also fits nicely with future plans for AIO and DIO.
>
> Thank you both of you for having a look at the results.
>
> I'm now pretty happy with this too. I do understand that we've not
> exactly exhaustively tested all our supported operating systems.
> However, we've seen some great speedups with Windows 10 and Linux with
> SSDs. Thomas saw great speedups with FreeBSD with the original patch
> using chunk sizes of 64 blocks. (I wonder if it's worth verifying that
> it increases further with the latest patch with the same test you did
> in the original email on this thread?)
>
> I'd like to propose that if anyone wants to do further testing on
> other operating systems with SSDs or HDDs then it would be good if
> that could be done within one week of this email. There are various
> benchmarking ideas on this thread for inspiration.
>

Yeah, I agree it would be good if we could do what you said.

> If we've not seen any performance regressions within 1 week, then I
> propose that we (pending final review) push this to allow wider
> testing.

I think Soumyadeep has reported a regression case [1] with the earlier
version of the patch.  I am not sure if we have verified that the
situation improves with the latest version of the patch.  I request
Soumyadeep to please try once with the latest patch.

[1] - https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Wed, 15 Jul 2020 at 14:51, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 5:55 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > If we've not seen any performance regressions within 1 week, then I
> > propose that we (pending final review) push this to allow wider
> > testing.
>
> I think Soumyadeep has reported a regression case [1] with the earlier
> version of the patch.  I am not sure if we have verified that the
> situation improves with the latest version of the patch.  I request
> Soumyadeep to please try once with the latest patch.

Yeah, it would be good to see some more data points on that test.
Jumping from 2 up to 6 workers just leaves us to guess where the
performance started to become bad. It would be good to know if the
regression is repeatable or if it was affected by some other process.

I see the disk type on that report was Google PersistentDisk. I don't
pretend to be any sort of expert on network filesystems, but I guess a
regression would be possible in that test case if say there was an
additional layer of caching of very limited size between the kernel
cache and the disks, maybe on a remote machine. If it were doing some
sort of prefetching to try to reduce latency and requests to the
actual disks then perhaps going up to 6 workers with 64 chunk size (as
Thomas' patch used at that time) caused more cache misses on that
cache due to the requests exceeding what had already been prefetched.
That's just a stab in the dark. Maybe someone with knowledge of these
network file systems can come up with a better theory.

It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
track_io_timing = on; for each value of max_parallel_workers.

David

> [1] - https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com



RE: Parallel Seq Scan vs kernel read ahead

От
"k.jamison@fujitsu.com"
Дата:
On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote:

>On Wed, 15 Jul 2020 at 14:51, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Jul 15, 2020 at 5:55 AM David Rowley <dgrowleyml@gmail.com> wrote:
>>> If we've not seen any performance regressions within 1 week, then I 
>>> propose that we (pending final review) push this to allow wider 
>>> testing.
>>
>> I think Soumyadeep has reported a regression case [1] with the earlier 
>> version of the patch.  I am not sure if we have verified that the 
>> situation improves with the latest version of the patch.  I request 
>> Soumyadeep to please try once with the latest patch.
>...
>Yeah, it would be good to see some more data points on that test.
>Jumping from 2 up to 6 workers just leaves us to guess where the performance
>started to become bad. >It would be good to know if the regression is
>repeatable or if it was affected by some other process.
>...
>It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET track_io_timing = on;
>for each value of >max_parallel_workers.

Hi,

If I'm following the thread correctly, Thomas and David's patch may bring
gains, but we need to test its effects on different filesystems. It's also
been clarified by David through benchmark tests that synchronize_seqscans
is not affected as long as the parallel scan's chunk size is capped at 8192
blocks.

I also agree that exposing a GUC to control this could be beneficial for
users; however, that can be discussed in another thread or left for future
development.

David Rowley wrote:
>I'd like to propose that if anyone wants to do further testing on
>other operating systems with SSDs or HDDs then it would be good if
>that could be done within a 1 week from this email. There are various
>benchmarking ideas on this thread for inspiration.

I'd like to join in testing it, in this case using an HDD; the results are
at the bottom. Due to my machine's limitations, I only tested 0~6 workers:
even if I increase max_parallel_workers_per_gather beyond that, the query
planner still caps the workers at 6.
I also set track_io_timing to on, as per David's recommendation.

Tested on:
XFS filesystem, HDD virtual machine
RHEL4, 64-bit,
4 CPUs, Intel Core Processor (Haswell, IBRS)
PostgreSQL 14devel on x86_64-pc-linux-gnu


----Test Case (Soumyadeep's) [1]

shared_buffers = 32MB (to use OS page cache)

create table t_heap as select generate_series(1, 100000000) i;   --about 3.4GB size

SET track_io_timing = on;

\timing

set max_parallel_workers_per_gather = 0;      --0 to 6

SELECT count(*) from t_heap;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;

[Summary]
I used the same query from the thread. However, the SQL query execution time
and the query planner execution time do not vary much between master and
patched.
OTOH, in terms of I/O stats, I observed a similar regression in both master
and patched as max_parallel_workers_per_gather increases.

It could also be that each benchmark run for a given
max_parallel_workers_per_gather setting is affected by the previous one; IOW,
later benchmark runs benefit from data cached at the OS level by earlier runs.
Any advice? Please refer to the tables below for the results.

(MASTER/UNPATCHED)
| Parallel Workers | SQLExecTime  |  PlannerExecTime |  Buffers                    | 
|------------------|--------------|------------------|-----------------------------| 
| 0                | 12942.606 ms | 37031.786 ms     | shared hit=32 read=442446   | 
| 1                |  4959.567 ms | 17601.813 ms     | shared hit=128 read=442350  | 
| 2                |  3273.610 ms | 11766.441 ms     | shared hit=288 read=442190  | 
| 3                |  2449.342 ms |  9057.236 ms     | shared hit=512 read=441966  | 
| 4                |  2482.404 ms |  8853.702 ms     | shared hit=800 read=441678  | 
| 5                |  2430.944 ms |  8777.630 ms     | shared hit=1152 read=441326 | 
| 6                |  2493.416 ms |  8798.200 ms     | shared hit=1568 read=440910 | 

(PATCHED V2)
| Parallel Workers | SQLExecTime |  PlannerExecTime |  Buffers                    | 
|------------------|-------------|------------------|-----------------------------| 
| 0                | 9283.193 ms | 34471.050 ms     | shared hit=2624 read=439854 | 
| 1                | 4872.728 ms | 17449.725 ms     | shared hit=2528 read=439950 | 
| 2                | 3240.301 ms | 11556.243 ms     | shared hit=2368 read=440110 | 
| 3                | 2419.512 ms |  8709.572 ms     | shared hit=2144 read=440334 | 
| 4                | 2746.820 ms |  8768.812 ms     | shared hit=1856 read=440622 | 
| 5                | 2424.687 ms |  8699.762 ms     | shared hit=1504 read=440974 | 
| 6                | 2581.999 ms |  8627.627 ms     | shared hit=1440 read=441038 | 

(I/O Read Stat)
| Parallel Workers | I/O (Master)  | I/O (Patched) | 
|------------------|---------------|---------------| 
| 0                | read=1850.233 | read=1071.209 | 
| 1                | read=1246.939 | read=1115.361 | 
| 2                | read=1079.837 | read=1090.425 | 
| 3                | read=1342.133 | read=1094.115 | 
| 4                | read=1478.821 | read=1355.966 | 
| 5                | read=1691.244 | read=1679.446 | 
| 6                | read=1952.384 | read=1881.733 | 

I hope this helps in a way.

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA@mail.gmail.com

Re: Parallel Seq Scan vs kernel read ahead

От
Amit Kapila
Дата:
On Fri, Jul 17, 2020 at 11:35 AM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote:
>
> >On Wed, 15 Jul 2020 at 14:51, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >>> If we've not seen any performance regressions within 1 week, then I
> >>> propose that we (pending final review) push this to allow wider
> >>> testing.
> >>
> >> I think Soumyadeep has reported a regression case [1] with the earlier
> >> version of the patch.  I am not sure if we have verified that the
> >> situation improves with the latest version of the patch.  I request
> >> Soumyadeep to please try once with the latest patch.
> >...
> >Yeah, it would be good to see some more data points on that test.
> >Jumping from 2 up to 6 workers just leaves us to guess where the performance
> >started to become bad. >It would be good to know if the regression is
> >repeatable or if it was affected by some other process.
> >...
> >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET track_io_timing = on;
> >for each value of >max_parallel_workers.
>
> Hi,
>
> If I'm following the thread correctly, we may have gains on this patch
> of Thomas and David, but we need to test its effects on different
> filesystems. It's also been clarified by David through benchmark tests
> that synchronize_seqscans is not affected as long as the set cap per
> chunk size of parallel scan is at 8192.
>
> I also agree that having a control on this through GUC can be
> beneficial for users, however, that can be discussed in another
> thread or development in the future.
>
> David Rowley wrote:
> >I'd like to propose that if anyone wants to do further testing on
> >other operating systems with SSDs or HDDs then it would be good if
> >that could be done within a 1 week from this email. There are various
> >benchmarking ideas on this thread for inspiration.
>
> I'd like to join on testing it, this one using HDD, and at the bottom
> are the results. Due to my machine limitations, I only tested
> 0~6 workers, that even if I increase max_parallel_workers_per_gather
> more than that, the query planner would still cap the workers at 6.
> I also set the track_io_timing to on as per David's recommendation.
>
> Tested on:
> XFS filesystem, HDD virtual machine
> RHEL4, 64-bit,
> 4 CPUs, Intel Core Processor (Haswell, IBRS)
> PostgreSQL 14devel on x86_64-pc-linux-gnu
>
>
> ----Test Case (Soumyadeep's) [1]
>
> shared_buffers = 32MB (to use OS page cache)
>
> create table t_heap as select generate_series(1, 100000000) i;   --about 3.4GB size
>
> SET track_io_timing = on;
>
> \timing
>
> set max_parallel_workers_per_gather = 0;      --0 to 6
>
> SELECT count(*) from t_heap;
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
>
> [Summary]
> I used the same query from the thread. However, the sql query execution time
> and query planner execution time results between the master and patched do
> not vary much.
> OTOH, in terms of I/O stats, I observed similar regression in both master
> and patched as we increase max_parallel_workers_per_gather.
>
> It could also be possible that each benchmark setting for max_parallel_workers_per_gather
> is affected by previous result . IOW, later benchmark runs benefit from the data cached by
> previous runs on OS level.
>

Yeah, I think to some extent that is visible in the results because, after
the patch, at 0 workers, the execution time is reduced significantly
whereas there is not much difference at other worker counts.  I think
for the non-parallel case (0 workers), there shouldn't be any difference.
Also, I am not sure whether there is any reason why the number of shared
hits is higher after the patch; probably it is due to caching effects?

> Any advice?

I think recreating the database and restarting the server after each
run might help in getting consistent results.  Also, you might want to
take median of three runs.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



RE: Parallel Seq Scan vs kernel read ahead

От
"k.jamison@fujitsu.com"
Дата:
On Friday, July 17, 2020 6:18 PM (GMT+9), Amit Kapila wrote:

> On Fri, Jul 17, 2020 at 11:35 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> wrote:
> >
> > On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote:
> >
> > >On Wed, 15 Jul 2020 at 14:51, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >>
> > >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley <dgrowleyml@gmail.com>
> wrote:
> > >>> If we've not seen any performance regressions within 1 week, then
> > >>> I propose that we (pending final review) push this to allow wider
> > >>> testing.
> > >>
> > >> I think Soumyadeep has reported a regression case [1] with the
> > >> earlier version of the patch.  I am not sure if we have verified
> > >> that the situation improves with the latest version of the patch.
> > >> I request Soumyadeep to please try once with the latest patch.
> > >...
> > >Yeah, it would be good to see some more data points on that test.
> > >Jumping from 2 up to 6 workers just leaves us to guess where the
> > >performance started to become bad. >It would be good to know if the
> > >regression is repeatable or if it was affected by some other process.
> > >...
> > >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> > >track_io_timing = on; for each value of >max_parallel_workers.
> >
> > Hi,
> >
> > If I'm following the thread correctly, we may have gains on this patch
> > of Thomas and David, but we need to test its effects on different
> > filesystems. It's also been clarified by David through benchmark tests
> > that synchronize_seqscans is not affected as long as the set cap per
> > chunk size of parallel scan is at 8192.
> >
> > I also agree that having a control on this through GUC can be
> > beneficial for users, however, that can be discussed in another thread
> > or development in the future.
> >
> > David Rowley wrote:
> > >I'd like to propose that if anyone wants to do further testing on
> > >other operating systems with SSDs or HDDs then it would be good if
> > >that could be done within a 1 week from this email. There are various
> > >benchmarking ideas on this thread for inspiration.
> >
> > I'd like to join on testing it, this one using HDD, and at the bottom
> > are the results. Due to my machine limitations, I only tested
> > 0~6 workers, that even if I increase max_parallel_workers_per_gather
> > more than that, the query planner would still cap the workers at 6.
> > I also set the track_io_timing to on as per David's recommendation.
> >
> > Tested on:
> > XFS filesystem, HDD virtual machine
> > RHEL4, 64-bit,
> > 4 CPUs, Intel Core Processor (Haswell, IBRS) PostgreSQL 14devel on
> > x86_64-pc-linux-gnu
> >
> >
> > ----Test Case (Soumyadeep's) [1]
> >
> > shared_buffers = 32MB (to use OS page cache)
> >
> > create table t_heap as select generate_series(1, 100000000) i;   --about
> 3.4GB size
> >
> > SET track_io_timing = on;
> >
> > \timing
> >
> > set max_parallel_workers_per_gather = 0;      --0 to 6
> >
> > SELECT count(*) from t_heap;
> > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
> >
> > [Summary]
> > I used the same query from the thread. However, the sql query
> > execution time and query planner execution time results between the
> > master and patched do not vary much.
> > OTOH, in terms of I/O stats, I observed similar regression in both
> > master and patched as we increase max_parallel_workers_per_gather.
> >
> > It could also be possible that each benchmark setting for
> > max_parallel_workers_per_gather is affected by previous result . IOW,
> > later benchmark runs benefit from the data cached by previous runs on OS level.
> >
> 
> Yeah, I think to some extent that is visible in results because, after patch, at 0
> workers, the execution time is reduced significantly whereas there is not much
> difference at other worker counts.  I think for non-parallel case (0 workers),
> there shouldn't be any difference.
> Also, I am not sure if there is any reason why after patch the number of shared hits
> is improved, probably due to caching effects?
> 
> > Any advice?
> 
> I think recreating the database and restarting the server after each run might help
> in getting consistent results.  Also, you might want to take median of three runs.

Thank you for the advice. I repeated the test as per your advice, taking the
average of 3 runs per planned worker count.
It still shows the following similar performance results between master and patch V2.
I wonder why there's no difference, though.

The test on my machine is roughly like this:

createdb test
psql -d test
create table t_heap as select generate_series(1, 100000000) i;
\q

pg_ctl restart
psql -d test
SET track_io_timing = on;
SET max_parallel_workers_per_gather = 0;
SHOW max_parallel_workers_per_gather;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
\timing
SELECT count(*) from t_heap;

drop table t_heap;
\q
dropdb test
pg_ctl restart

Below are the results. Again, there is almost no discernible difference between master and patch.
Also, the results when max_parallel_workers_per_gather is more than 4 could be inaccurate
due to my machine's limitation of having only 4 vCPUs. Even so, the query planner capped it at
6 workers.

Query Planner I/O Timings (track_io_timing = on) in ms:
| Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
|--------|-------------------|------------------|--------------------|-------------------|
| 0      | 1,130.777         | 1,250.821        | 1,698.051          | 1,733.439         |
| 1      | 1,603.016         | 1,660.767        | 2,312.248          | 2,291.661         |
| 2      | 2,036.269         | 2,107.066        | 2,698.216          | 2,796.893         |
| 3      | 2,298.811         | 2,307.254        | 5,695.991          | 5,894.183         |
| 4      | 2,098.642         | 2,135.960        | 23,837.088         | 26,537.158        |
| 5      | 1,956.536         | 1,997.464        | 45,891.851         | 48,049.338        |
| 6      | 2,201.816         | 2,219.001        | 61,937.828         | 67,809.486        |

Query Planner Execution Time (ms):
| Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
|--------|-----------------------|----------------------|
| 0      | 40,454.252            | 40,521.578           |
| 1      | 21,332.067            | 21,205.068           |
| 2      | 14,266.756            | 14,385.539           |
| 3      | 11,597.936            | 11,722.055           |
| 4      | 12,937.468            | 13,439.247           |
| 5      | 14,383.083            | 14,782.866           |
| 6      | 14,671.336            | 15,507.581           |

Based on the results above, the I/O latency increases as the number of workers
increases. Despite that, the query planner execution time is almost the same
when 2 or more workers are used (14~11s), with the same results between master and patch V2.

As for buffers, the same results are shown per worker count (both master and patch).
| Worker | Buffers                                          | 
|--------|--------------------------------------------------| 
| 0      | shared read=442478 dirtied=442478 written=442446 | 
| 1      | shared read=442478 dirtied=442478 written=442414 | 
| 2      | shared read=442478 dirtied=442478 written=442382 | 
| 3      | shared read=442478 dirtied=442478 written=442350 | 
| 4      | shared read=442478 dirtied=442478 written=442318 | 
| 5      | shared read=442478 dirtied=442478 written=442286 | 
| 6      | shared read=442478 dirtied=442478 written=442254 |


SQL Query Execution Time (ms):
| Worker | SQL (Master) | SQL (Patch) |
|--------|--------------|-------------|
| 0      | 10,418.606   | 10,377.377  |
| 1      | 5,427.460    | 5,402.727   |
| 2      | 3,662.998    | 3,650.277   |
| 3      | 2,718.837    | 2,692.871   |
| 4      | 2,759.802    | 2,693.370   |
| 5      | 2,761.834    | 2,682.590   |
| 6      | 2,711.434    | 2,726.332   |

The SQL query execution time definitely benefited from the previous query planner
run, so the results are faster. But again, master and patched have almost the same
results. Nonetheless, the execution time is fairly consistent when
max_parallel_workers_per_gather is 2 (default) and above.

I am definitely missing something. Perhaps I just don't understand why there's no
I/O difference between master and patched (V2). Or has this already been improved
even without the patch?

Kind regards,
Kirk Jamison

Re: Parallel Seq Scan vs kernel read ahead

От
Amit Kapila
Дата:
On Tue, Jul 21, 2020 at 8:06 AM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> Thank you for the advice. I repeated the test as per your advice and average of 3 runs
> per worker/s planned.
> It still shows the following similar performance results between Master and Patch V2.
> I wonder why there's no difference though.
>
> The test on my machine is roughly like this:
>
> createdb test
> psql -d test
> create table t_heap as select generate_series(1, 100000000) i;
> \q
>
> pg_ctl restart
> psql -d test
> SET track_io_timing = on;
> SET max_parallel_workers_per_gather = 0;
> SHOW max_parallel_workers_per_gather;
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
> \timing
> SELECT count(*) from t_heap;
>
> drop table t_heap;
> \q
> dropdb test
> pg_ctl restart
>
> Below are the results. Again, almost no discernible difference between the master and patch.
> Also, the results when max_parallel_workers_per_gather is more than 4 could be inaccurate
> due to my machine's limitation of only having v4 CPUs. Even so, query planner capped it at
> 6 workers.
>
> Query Planner I/O Timings (track_io_timing = on) in ms :
> | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> |--------|-------------------|------------------|--------------------|-------------------|
> | 0      | "1,130.777"       | "1,250.821"      | "01,698.051"       | "01,733.439"      |
> | 1      | "1,603.016"       | "1,660.767"      | "02,312.248"       | "02,291.661"      |
> | 2      | "2,036.269"       | "2,107.066"      | "02,698.216"       | "02,796.893"      |
> | 3      | "2,298.811"       | "2,307.254"      | "05,695.991"       | "05,894.183"      |
> | 4      | "2,098.642"       | "2,135.960"      | "23,837.088"       | "26,537.158"      |
> | 5      | "1,956.536"       | "1,997.464"      | "45,891.851"       | "48,049.338"      |
> | 6      | "2,201.816"       | "2,219.001"      | "61,937.828"       | "67,809.486"      |
>
> Query Planner Execution Time (ms):
> | Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
> |--------|-----------------------|----------------------|
> | 0.000  | "40,454.252"          | "40,521.578"         |
> | 1.000  | "21,332.067"          | "21,205.068"         |
> | 2.000  | "14,266.756"          | "14,385.539"         |
> | 3.000  | "11,597.936"          | "11,722.055"         |
> | 4.000  | "12,937.468"          | "13,439.247"         |
> | 5.000  | "14,383.083"          | "14,782.866"         |
> | 6.000  | "14,671.336"          | "15,507.581"         |
>
> Based from the results above, the I/O latency increases as number of workers
> also increase. Despite that, the query planner execution time is almost closely same
> when 2 or more workers are used (14~11s). Same results between Master and Patch V2.
>
> As for buffers, same results are shown per worker (both Master and Patch).
> | Worker | Buffers                                          |
> |--------|--------------------------------------------------|
> | 0      | shared read=442478 dirtied=442478 written=442446 |
> | 1      | shared read=442478 dirtied=442478 written=442414 |
> | 2      | shared read=442478 dirtied=442478 written=442382 |
> | 3      | shared read=442478 dirtied=442478 written=442350 |
> | 4      | shared read=442478 dirtied=442478 written=442318 |
> | 5      | shared read=442478 dirtied=442478 written=442286 |
> | 6      | shared read=442478 dirtied=442478 written=442254 |
>
>
> SQL Query Execution Time (ms) :
> | Worker | SQL (Master) | SQL (Patch)  |
> |--------|--------------|--------------|
> | 0      | "10,418.606" | "10,377.377" |
> | 1      | "05,427.460"  | "05,402.727" |
> | 2      | "03,662.998"  | "03,650.277" |
> | 3      | "02,718.837"  | "02,692.871" |
> | 4      | "02,759.802"  | "02,693.370" |
> | 5      | "02,761.834"  | "02,682.590" |
> | 6      | "02,711.434"  | "02,726.332" |
>
> The SQL query execution time definitely benefitted from previous run of query planner,
> so the results are faster. But again, both Master and Patched have almost the same results.
> Nonetheless, the execution time is almost consistent when
> max_parallel_workers_per_gather is 2 (default) and above.
>
> I am definitely missing something. Perhaps I think I could not understand why there's no
> I/O difference between the Master and Patched (V2). Or has it been already improved
> even without this patch?
>

I don't think it is strange that you are not seeing much difference
because, as per the initial email by Thomas, this patch is not supposed
to give benefits on all systems.  I think we wanted to check that the
patch does not regress performance in cases where it doesn't give
benefits.  I think it might be okay to run with a higher number of
workers than you have CPUs in the system, as we wanted to check whether
such cases regress, as shown by Soumyadeep above [1].  Can you try once with
8 and/or 10 workers as well?

[1] - https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



RE: Parallel Seq Scan vs kernel read ahead

От
"k.jamison@fujitsu.com"
Дата:
On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> On Tue, Jul 21, 2020 at 8:06 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> wrote:
> >
> > Thank you for the advice. I repeated the test as per your advice and
> > average of 3 runs per worker/s planned.
> > It still shows the following similar performance results between Master and
> Patch V2.
> > I wonder why there's no difference though.
> >
> > The test on my machine is roughly like this:
> >
> > createdb test
> > psql -d test
> > create table t_heap as select generate_series(1, 100000000) i; \q
> >
> > pg_ctl restart
> > psql -d test
> > SET track_io_timing = on;
> > SET max_parallel_workers_per_gather = 0; SHOW
> > max_parallel_workers_per_gather; EXPLAIN (ANALYZE, BUFFERS) SELECT
> > count(*) from t_heap; \timing SELECT count(*) from t_heap;
> >
> > drop table t_heap;
> > \q
> > dropdb test
> > pg_ctl restart
> >
> > Below are the results. Again, almost no discernible difference between the
> master and patch.
> > Also, the results when max_parallel_workers_per_gather is more than 4
> > could be inaccurate due to my machine's limitation of only having v4
> > CPUs. Even so, query planner capped it at
> > 6 workers.
> >
> > Query Planner I/O Timings (track_io_timing = on) in ms :
> > | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) |
> > | I/O WRITE (Patch) |
> >
> |--------|-------------------|------------------|--------------------|-------------
> ------|
> > | 0      | "1,130.777"       | "1,250.821"      | "01,698.051"       |
> "01,733.439"      |
> > | 1      | "1,603.016"       | "1,660.767"      | "02,312.248"       |
> "02,291.661"      |
> > | 2      | "2,036.269"       | "2,107.066"      | "02,698.216"       |
> "02,796.893"      |
> > | 3      | "2,298.811"       | "2,307.254"      | "05,695.991"       |
> "05,894.183"      |
> > | 4      | "2,098.642"       | "2,135.960"      | "23,837.088"       |
> "26,537.158"      |
> > | 5      | "1,956.536"       | "1,997.464"      | "45,891.851"       |
> "48,049.338"      |
> > | 6      | "2,201.816"       | "2,219.001"      | "61,937.828"       |
> "67,809.486"      |
> >
> > Query Planner Execution Time (ms):
> > | Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
> > |--------|-----------------------|----------------------|
> > | 0.000  | "40,454.252"          | "40,521.578"         |
> > | 1.000  | "21,332.067"          | "21,205.068"         |
> > | 2.000  | "14,266.756"          | "14,385.539"         |
> > | 3.000  | "11,597.936"          | "11,722.055"         |
> > | 4.000  | "12,937.468"          | "13,439.247"         |
> > | 5.000  | "14,383.083"          | "14,782.866"         |
> > | 6.000  | "14,671.336"          | "15,507.581"         |
> >
> > Based from the results above, the I/O latency increases as number of
> > workers also increase. Despite that, the query planner execution time
> > is almost closely same when 2 or more workers are used (14~11s). Same
> results between Master and Patch V2.
> >
> > As for buffers, same results are shown per worker (both Master and Patch).
> > | Worker | Buffers                                          |
> > |--------|--------------------------------------------------|
> > | 0      | shared read=442478 dirtied=442478 written=442446 |
> > | 1      | shared read=442478 dirtied=442478 written=442414 |
> > | 2      | shared read=442478 dirtied=442478 written=442382 |
> > | 3      | shared read=442478 dirtied=442478 written=442350 |
> > | 4      | shared read=442478 dirtied=442478 written=442318 |
> > | 5      | shared read=442478 dirtied=442478 written=442286 |
> > | 6      | shared read=442478 dirtied=442478 written=442254 |
> >
> >
> > SQL Query Execution Time (ms) :
> > | Worker | SQL (Master) | SQL (Patch)  |
> > |--------|--------------|--------------|
> > | 0      | "10,418.606" | "10,377.377" |
> > | 1      | "05,427.460"  | "05,402.727" |
> > | 2      | "03,662.998"  | "03,650.277" |
> > | 3      | "02,718.837"  | "02,692.871" |
> > | 4      | "02,759.802"  | "02,693.370" |
> > | 5      | "02,761.834"  | "02,682.590" |
> > | 6      | "02,711.434"  | "02,726.332" |
> >
> > The SQL query execution time definitely benefitted from previous run
> > of query planner, so the results are faster. But again, both Master and Patched
> have almost the same results.
> > Nonetheless, the execution time is almost consistent when
> > max_parallel_workers_per_gather is 2 (default) and above.
> >
> > I am definitely missing something. Perhaps I think I could not
> > understand why there's no I/O difference between the Master and
> > Patched (V2). Or has it been already improved even without this patch?
> >
> 
> I don't think it is strange that you are not seeing much difference because as per
> the initial email by Thomas this patch is not supposed to give benefits on all
> systems.  I think we wanted to check that the patch should not regress
> performance in cases where it doesn't give benefits.  I think it might be okay to
> run with a higher number of workers than you have CPUs in the system as we
> wanted to check if such cases regress as shown by Soumyadeep above [1].  Can
> you once try with
> 8 and or 10 workers as well?
> 
> [1] -
> https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCN
> wGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com

You are right; kindly excuse me on that part. It only means the patch may or
may not bring benefits on the filesystem I am using, whereas for other
filesystems we can see significant benefits from David's benchmarks.

Following your advice on the regression test case, I increased the number of workers,
but the query planner still capped them at 6, so the results from 6 workers onwards
are almost the same.
As per my test results below, I don't see a significant difference between
master and patched on my machine (just for reconfirmation).

Query Planner I/O Timings (ms):
| Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
|--------|-------------------|------------------|--------------------|-------------------|
| 0      | 1,130.78          | 1,250.82         | 1,698.05           | 1,733.44          |
| 1      | 1,603.02          | 1,660.77         | 2,312.25           | 2,291.66          |
| 2      | 2,036.27          | 2,107.07         | 2,698.22           | 2,796.89          |
| 3      | 2,298.81          | 2,307.25         | 5,695.99           | 5,894.18          |
| 4      | 2,098.64          | 2,135.96         | 23,837.09          | 26,537.16         |
| 5      | 1,956.54          | 1,997.46         | 45,891.85          | 48,049.34         |
| 6      | 2,201.82          | 2,219.00         | 61,937.83          | 67,809.49         |
| 8      | 2,117.80          | 2,169.67         | 60,671.22          | 68,676.36         |
| 16     | 2,052.73          | 2,134.86         | 60,635.17          | 66,462.82         |
| 32     | 2,036.00          | 2,200.98         | 60,833.92          | 67,702.49         |

Query Planner Execution Time (ms):
| Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
|--------|-----------------------|----------------------|
| 0      | 40,454.25             | 40,521.58            |
| 1      | 21,332.07             | 21,205.07            |
| 2      | 14,266.76             | 14,385.54            |
| 3      | 11,597.94             | 11,722.06            |
| 4      | 12,937.47             | 13,439.25            |
| 5      | 14,383.08             | 14,782.87            |
| 6      | 14,671.34             | 15,507.58            |
| 8      | 14,679.50             | 15,615.69            |
| 16     | 14,474.78             | 15,274.61            |
| 32     | 14,462.11             | 15,470.68            |

| Worker | Buffers                                          | 
|--------|--------------------------------------------------| 
| 0      | shared read=442478 dirtied=442478 written=442446 | 
| 1      | shared read=442478 dirtied=442478 written=442414 | 
| 2      | shared read=442478 dirtied=442478 written=442382 | 
| 3      | shared read=442478 dirtied=442478 written=442350 | 
| 4      | shared read=442478 dirtied=442478 written=442318 | 
| 5      | shared read=442478 dirtied=442478 written=442286 | 
| 6      | shared read=442478 dirtied=442478 written=442254 | 
| 8      | shared read=442478 dirtied=442478 written=442254 | 
| 16     | shared read=442478 dirtied=442478 written=442254 | 
| 32     | shared read=442478 dirtied=442478 written=442254 |

I also re-ran the query and measured the execution time (ms) with \timing
| Worker | SQL (Master) | SQL (Patch) | 
|--------|--------------|-------------| 
| 0      | 15476.458    | 15278.772   | 
| 1      | 8292.702     | 8426.435    | 
| 2      | 6256.673     | 6232.456    | 
| 3      | 6357.217     | 6340.013    | 
| 4      | 7591.311     | 7913.881    | 
| 5      | 8165.315     | 8070.592    | 
| 6      | 8065.578     | 8200.076    | 
| 8      | 7988.302     | 8609.138    | 
| 16     | 8025.170     | 8469.895    | 
| 32     | 8019.393     | 8645.150    |

Again tested on:
XFS filesystem, HDD virtual machine, 8GB RAM
RHEL4, 64-bit,
4 CPUs, Intel Core Processor (Haswell, IBRS)
PostgreSQL 14devel on x86_64-pc-linux-gnu

So I guess the patch does not affect the filesystem I am using, which I think is OK.

Kind regards,
Kirk Jamison


Re: Parallel Seq Scan vs kernel read ahead

От
Amit Kapila
Дата:
On Tue, Jul 21, 2020 at 3:08 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> > On Tue, Jul 21, 2020 at 8:06 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> > wrote:
> > >
> > > I am definitely missing something. Perhaps I think I could not
> > > understand why there's no I/O difference between the Master and
> > > Patched (V2). Or has it been already improved even without this patch?
> > >
> >
> > I don't think it is strange that you are not seeing much difference because as per
> > the initial email by Thomas this patch is not supposed to give benefits on all
> > systems.  I think we wanted to check that the patch should not regress
> > performance in cases where it doesn't give benefits.  I think it might be okay to
> > run with a higher number of workers than you have CPUs in the system as we
> > wanted to check if such cases regress as shown by Soumyadeep above [1].  Can
> > you once try with
> > 8 and or 10 workers as well?
> >
>
> You are right. Kindly excuse me on that part, which only means it may or may not have any
> benefits on the filesystem I am using. But for other fs, as we can see from David's benchmarks
> significant results/benefits.
>
> Following your advice on regression test case, I increased the number of workers,
> but the query planner still capped the workers at 6, so the results from 6 workers onwards
> are almost the same.
>

I am slightly confused: if the number of workers is capped at 6, then
what exactly does the data at the 32-worker count mean?  If you want the
query planner to choose more workers, then I think you either need
to increase the data size or use ALTER TABLE <tbl_name> SET
(parallel_workers = <num_workers_you_want>);

> I don't see significant difference between master and patched on my machine
> as per my test results below. (Just for reconfirmation)
>

I see a difference of about 7-8% at the higher (32) worker count.  Am I
missing something?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
Hi Kirk,

Thank you for doing some testing on this. It's very useful to get some
samples from other hardware / filesystem / OS combinations.

On Tue, 21 Jul 2020 at 21:38, k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
> Query Planner I/O Timings (ms):
> | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> |--------|-------------------|------------------|--------------------|-------------------|
> | 0      | "1,130.78"        | "1,250.82"       | "1,698.05"         | "1,733.44"        |


> | Worker | Buffers                                          |
> |--------|--------------------------------------------------|
> | 0      | shared read=442478 dirtied=442478 written=442446 |

I'm thinking the scale of this test might be a bit too small for the
machine you're using to test.  When you see "shared read" in the
EXPLAIN (ANALYZE, BUFFERS) output, it does not necessarily mean that
the page had to be read from disk.  We use buffered I/O, so the page
could just have been fetched from the kernel's cache.

If we do some maths here on the timing. It took 1130.78 milliseconds
to read 442478 pages, which, assuming the standard page size of 8192
bytes, that's 3457 MB in 1130.78 milliseconds, or 3057 MB/sec.  Is
that a realistic throughput for this machine in terms of I/O? Or do
you think that some of these pages might be coming from the Kernel's
cache?
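
For what it's worth, the same arithmetic can be reproduced directly in psql
from the figures quoted above; this is just a sanity check of the numbers,
not part of the benchmark itself:

  select round(442478::bigint * 8192 / 1024.0 / 1024.0) as mb_read,
         round(442478::bigint * 8192 / 1024.0 / 1024.0 / (1130.78 / 1000.0)) as mb_per_sec;
  -- mb_read ~ 3457, mb_per_sec ~ 3057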

I understand that Amit wrote:

On Fri, 17 Jul 2020 at 21:18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think recreating the database and restarting the server after each
> run might help in getting consistent results.  Also, you might want to
> take median of three runs.

Please also remember that if you're recreating the database after having
restarted the machine, creating the table will likely end up
caching some of the pages either in shared buffers or the kernel's
cache. It would be better to leave the database intact and just reboot
the machine.  I didn't really like that option for my tests, so I just
increased the size of the table beyond any size that my machines could
have cached.  With the 16GB RAM Windows laptop, I used a 100GB table,
and with the 64GB workstation, I used an 800GB table.  I think my test
using SELECT * FROM t WHERE a < 0; with a table that has a padding
column is likely going to be a more accurate test. Provided there are
no rows with a < 0 in the table, the executor will spend almost
all of the time in nodeSeqscan.c trying to find a row with a < 0.
There's no additional overhead of aggregation doing the count(*).
Having the additional padding column means that we read more data per
evaluation of the a < 0 expression.  Also, having a single-column
table is not that realistic.

I'm pretty keen to see this machine running something closer to the
test I mentioned in [1], but with the benchmark query I mentioned in [2]
and the "t" table being at least twice the size of RAM in the
machine. Larger would be better, though. With such a scaled test, I
don't think there's much need to reboot the machine in between. Just
run a single query first to warm up the cache before timing anything.
Having the table a few times larger than RAM will mean that we can be
certain that the disk was actually used during the test. The more data
we can be certain came from disk the more we can trust that the
results are meaningful.
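
As a rough sketch, that kind of scaled test could look something like the
following; the column names, padding width and row count here are only
illustrative, not the exact script from [1] or [2], and the row count should
be chosen so the table is comfortably larger than RAM:

  create table t (a int, padding text);
  insert into t select x, repeat('x', 100) from generate_series(1, 500000000) x;
  vacuum analyze t;

  -- one warm-up query first, then the timed runs
  select * from t where a < 0;
  \timing on
  select * from t where a < 0;   -- returns no rows, so nearly all of the
                                 -- time is spent scanning the table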

Thanks again for testing this.

David

[1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn+Sa7NTKEVWtg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAApHDvo+LEGKMcavOiPYK8NEbgP-LrXns2TJ1n_XNRJVE9X+Cw@mail.gmail.com



Re: Parallel Seq Scan vs kernel read ahead

От
Amit Kapila
Дата:
On Wed, Jul 22, 2020 at 5:25 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I understand that Amit wrote:
>
> On Fri, 17 Jul 2020 at 21:18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think recreating the database and restarting the server after each
> > run might help in getting consistent results.  Also, you might want to
> > take median of three runs.
>
> Please also remember, if you're recreating the database after having
> restarted the machine that creating the table will likely end up
> caching some of the pages either in shared buffers or the Kernel's
> cache.
>

Yeah, that is true, but before each test the same amount of data (if
any) should be present in shared buffers (or the OS cache), which
will help in getting consistent results.  However, it is fine to
reboot the machine as well if that is a convenient way.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

От
Thomas Munro
Дата:
On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, that is true but every time before the test the same amount of
> data should be present in shared buffers (or OS cache) if any which
> will help in getting consistent results.  However, it is fine to
> reboot the machine as well if that is a convenient way.

We really should have an extension (pg_prewarm?) that knows how to
kick stuff out of PostgreSQL's shared buffers and the page cache
(POSIX_FADV_DONTNEED).



RE: Parallel Seq Scan vs kernel read ahead

От
"k.jamison@fujitsu.com"
Дата:
On Tuesday, July 21, 2020 7:33 PM, Amit Kapila wrote:
> On Tue, Jul 21, 2020 at 3:08 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> wrote:
> >
> > On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> > > On Tue, Jul 21, 2020 at 8:06 AM k.jamison@fujitsu.com
> > > <k.jamison@fujitsu.com>
> > > wrote:
> > > >
> > > > I am definitely missing something. Perhaps I think I could not
> > > > understand why there's no I/O difference between the Master and
> > > > Patched (V2). Or has it been already improved even without this patch?
> > > >
> > >
> > > I don't think it is strange that you are not seeing much difference
> > > because as per the initial email by Thomas this patch is not
> > > supposed to give benefits on all systems.  I think we wanted to
> > > check that the patch should not regress performance in cases where
> > > it doesn't give benefits.  I think it might be okay to run with a
> > > higher number of workers than you have CPUs in the system as we
> > > wanted to check if such cases regress as shown by Soumyadeep above
> > > [1].  Can you once try with
> > > 8 and or 10 workers as well?
> > >
> >
> > You are right. Kindly excuse me on that part, which only means it may
> > or may not have any benefits on the filesystem I am using. But for
> > other fs, as we can see from David's benchmarks significant results/benefits.
> >
> > Following your advice on regression test case, I increased the number
> > of workers, but the query planner still capped the workers at 6, so
> > the results from 6 workers onwards are almost the same.
> >
> 
> I am slightly confused if the number of workers are capped at 6, then what exactly
> the data at 32 worker count means?  If you want query planner to choose more
> number of workers, then I think either you need to increase the data or use Alter
> Table <tbl_name> Set (parallel_workers = <num_workers_you_want>);

Oops, I'm sorry: the "workers" labelled in those tables actually means max_parallel_workers_per_gather
and not parallel_workers. I thought the _per_gather setting corresponds to, or controls,
the workers planned/launched values, so those are the numbers that I used in the tables.

I used the defaults for max_parallel_workers & max_worker_processes, which are both 8 in postgresql.conf.
IOW, I ran all those tests with a maximum of 8 processes set. But the query planner capped both
Workers Planned and Workers Launched at 6 for some reason when increasing the value of
max_parallel_workers_per_gather.

However, when I used ALTER TABLE SET (parallel_workers = N) based on your suggestion,
the query planner used that value only for "Workers Planned", not for "Workers Launched".
The behavior of the query planner is also different when I also set max_worker_processes
and max_parallel_workers to parallel_workers + 1.

For example (run on master):
1. Set the GUCs to the same value as parallel_workers; "Workers Launched" and "Workers Planned" do not match.
max_worker_processes = 8
max_parallel_workers = 8
ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE
SET max_parallel_workers_per_gather = 8;
SET
test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=619778.66..619778.67 rows=1 width=8) (actual time=16316.295..16316.295 rows=1 loops=1)
   Buffers: shared read=442478 dirtied=442478 written=442222
   ->  Gather  (cost=619777.83..619778.64 rows=8 width=8) (actual time=16315.528..16316.668 rows=8 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         Buffers: shared read=442478 dirtied=442478 written=442222
         ->  Partial Aggregate  (cost=618777.83..618777.84 rows=1 width=8) (actual time=16305.092..16305.092 rows=1 loops=8)
               Buffers: shared read=442478 dirtied=442478 written=442222
               ->  Parallel Seq Scan on t_heap  (cost=0.00..583517.86 rows=14103986 width=0) (actual time=0.725..14290.117 rows=12500000 loops=8)
                     Buffers: shared read=442478 dirtied=442478 written=442222
 Planning Time: 5.327 ms
   Buffers: shared hit=17 read=10
 Execution Time: 16316.915 ms
(13 rows)

2. Match the workers launched and workers planned values (parallel_workers + 1)
max_worker_processes = 9
max_parallel_workers = 9

ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE
SET max_parallel_workers_per_gather = 8;
SET

test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=619778.66..619778.67 rows=1 width=8) (actual time=16783.944..16783.944 rows=1 loops=1)
   Buffers: shared read=442478 dirtied=442478 written=442190
   ->  Gather  (cost=619777.83..619778.64 rows=8 width=8) (actual time=16783.796..16785.474 rows=9 loops=1)
         Workers Planned: 8
         Workers Launched: 8
         Buffers: shared read=442478 dirtied=442478 written=442190
         ->  Partial Aggregate  (cost=618777.83..618777.84 rows=1 width=8) (actual time=16770.218..16770.218 rows=1 loops=9)
               Buffers: shared read=442478 dirtied=442478 written=442190
               ->  Parallel Seq Scan on t_heap  (cost=0.00..583517.86 rows=14103986 width=0) (actual time=6.004..14967.329 rows=11111111 loops=9)
                     Buffers: shared read=442478 dirtied=442478 written=442190
 Planning Time: 4.755 ms
   Buffers: shared hit=17 read=10
 Execution Time: 16785.719 ms
(13 rows)



Kind regards,
Kirk Jamison


Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Wed, 22 Jul 2020 at 16:40, k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
> I used the default max_parallel_workers & max_worker_proceses which is 8 by default in postgresql.conf.
> IOW, I ran all those tests with maximum of 8 processes set. But my query planner capped both the
> Workers Planned and Launched at 6 for some reason when increasing the value for
> max_parallel_workers_per_gather.

max_parallel_workers_per_gather just imposes a limit on the planner as
to the maximum number of parallel workers it may choose for a given
parallel portion of a plan. The actual number of workers the planner
will decide is best to use is based on the size of the relation. More
pages = more workers. It sounds like in this case the planner didn't
think it was worth using more than 6 workers.

The parallel_workers reloption, when not set to -1, overrides the
planner's decision on how many workers to use. It'll just always try
to use "parallel_workers".

> However, when I used the ALTER TABLE SET (parallel_workers = N) based from your suggestion,
> the query planner acquired that set value only for "Workers Planned", but not for "Workers Launched".
> The behavior of query planner is also different when I also set the value of max_worker_processes
> and max_parallel_workers to parallel_workers + 1.

When it comes to execution, the executor is limited by how many
parallel worker processes are available to execute the plan. If all
workers happen to be busy with other tasks, then it may find itself
having to process the entire query by itself without any help from
workers.  Or there may be workers available, just not as many as the
planner picked to execute the query.

The number of available workers is configured with the
"max_parallel_workers". That's set in postgresql.conf.   PostgreSQL
won't complain if you try to set a relation's parallel_workers
reloption to a number higher than the "max_parallel_workers" GUC.
"max_parallel_workers" is further limited by "max_worker_processes".
Likely you'll want to set both those to at least 32 for this test,
then just adjust the relation's parallel_workers setting for each
test.
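
For example, a per-test setup along these lines should work (a minimal
sketch; the table name "t" and the worker counts are just placeholders):

  -- in postgresql.conf (changing max_worker_processes requires a restart):
  --   max_worker_processes = 32
  --   max_parallel_workers = 32

  alter table t set (parallel_workers = 8);  -- the worker count to test
  set max_parallel_workers_per_gather = 32;  -- high enough not to cap the plan
  explain (analyze, buffers) select count(*) from t;
  -- check that Workers Planned matches Workers Launched in the output
  alter table t reset (parallel_workers);    -- back to the planner's own choice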

David



RE: Parallel Seq Scan vs kernel read ahead

От
"k.jamison@fujitsu.com"
Дата:
On Wednesday, July 22, 2020 2:21 PM (GMT+9), David Rowley wrote:

> On Wed, 22 Jul 2020 at 16:40, k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> wrote:
> > I used the default max_parallel_workers & max_worker_proceses which is 8 by
> default in postgresql.conf.
> > IOW, I ran all those tests with maximum of 8 processes set. But my
> > query planner capped both the Workers Planned and Launched at 6 for
> > some reason when increasing the value for max_parallel_workers_per_gather.
> 
> max_parallel_workers_per_gather just imposes a limit on the planner as to the
> maximum number of parallel workers it may choose for a given parallel portion of
> a plan. The actual number of workers the planner will decide is best to use is
> based on the size of the relation. More pages = more workers. It sounds like in
> this case the planner didn't think it was worth using more than 6 workers.
> 
> The parallel_workers reloption, when not set to -1 overwrites the planner's
> decision on how many workers to use. It'll just always try to use
> "parallel_workers".
>
> > However, when I used the ALTER TABLE SET (parallel_workers = N) based
> > from your suggestion, the query planner acquired that set value only for
> "Workers Planned", but not for "Workers Launched".
> > The behavior of query planner is also different when I also set the
> > value of max_worker_processes and max_parallel_workers to parallel_workers
> + 1.
> 
> When it comes to execution, the executor is limited to how many parallel worker
> processes are available to execute the plan. If all workers happen to be busy with
> other tasks then it may find itself having to process the entire query in itself
> without any help from workers.  Or there may be workers available, just not as
> many as the planner picked to execute the query.

Even though I have read the documentation [1][2] on parallel query, I might not have
understood it clearly yet, so thank you very much for explaining more simply how the
relation size, GUCs, and reloption affect the query planner's behavior.
So in this test case, I shouldn't force workers planned and workers launched to have
the same values, correct? I should just let the optimizer make its own decision.

> The number of available workers is configured with the
> "max_parallel_workers". That's set in postgresql.conf.   PostgreSQL
> won't complain if you try to set a relation's parallel_workers reloption to a number
> higher than the "max_parallel_workers" GUC.
> "max_parallel_workers" is further limited by "max_worker_processes".
> Likely you'll want to set both those to at least 32 for this test, then just adjust the
> relation's parallel_workers setting for each test.
> 
Thank you for the advice. For the same test case [3], I will use the following configuration:
shared_buffers = 32MB
max_parallel_workers = 32
max_worker_processes = 32

Maybe the relation size is also small, as you mentioned, which is why the query optimizer
decided to use only 6 workers in my previous test. So let me first see whether the results
vary with the above configuration while testing different values for parallel_workers.

Kind regards,
Kirk Jamison

[1] https://www.postgresql.org/docs/13/how-parallel-query-works.html
[2] https://www.postgresql.org/docs/current/runtime-config-resource.html
[3] https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA@mail.gmail.com


Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Wed, 22 Jul 2020 at 18:17, k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
> Even though I read the documentation [1][2] on parallel query, I might not have
> understood it clearly yet. So thank you very much for explaining simpler how the
> relation size, GUCs, and reloption affect the query planner's behavior
> So in this test case, I shouldn't force the workers to have same values for workers
> planned and workers launched, is it correct? To just let the optimizer do its own decision.

What you want to do is force the planner's hand with each test as to
how many parallel workers it uses by setting the reloption
parallel_workers to the number of workers you want to test.  Just make
sure the executor has enough workers to launch by setting
max_parallel_workers and max_worker_processes to something high enough
to conduct the tests without the executor failing to launch any
workers.

> Maybe the relation size is also small as you mentioned, that the query optimizer decided
> to only use 6 workers in my previous test. So let me see first if the results would vary
> again with the above configuration and testing different values for parallel_workers.

The parallel_workers reloption will override the planner's choice of
how many parallel workers to use. Just make sure the executor has
enough to use. You'll be able to determine that from Workers
Planned matching Workers Launched.

David



Re: Parallel Seq Scan vs kernel read ahead

От
Soumyadeep Chakraborty
Дата:
Hi David,

Apologies for the delay; I had missed these emails.

On Tue, Jul 14, 2020 at 8:52 PM David Rowley <dgrowleyml@gmail.com> wrote:

> It would be good to know if the
> regression is repeatable or if it was affected by some other process.


These are the latest results on the same setup as [1].
(TL;DR: the FreeBSD VM with Google Persistent Disk is too unstable -
there is too much variability in performance to conclude that there is a
regression)


With 100000000 rows:

master (606c384598):

max_parallel_workers_per_gather    Time(seconds)
                              0           20.09s
                              1            9.77s
                              2            9.92s
                              6            9.55s


v2 patch (applied on 606c384598):

max_parallel_workers_per_gather    Time(seconds)
                              0           18.34s
                              1            9.68s
                              2            9.15s
                              6            9.11s


The above results were averaged across 3 runs with little or no
deviation between runs. The absolute values are very different from the
results reported in [1].

So, I tried to repro the regression as I had reported in [1] with
150000000 rows:

master (449e14a561)

max_parallel_workers_per_gather    Time(seconds)
                              0         42s, 42s
                              1       395s, 393s
                              2       404s, 403s
                              6       403s, 403s

Thomas' patch (applied on 449e14a561):

max_parallel_workers_per_gather    Time(seconds)
                              0          43s,43s
                              1        203s, 42s
                              2         42s, 42s
                              6         44s, 43s


v2 patch (applied on 449e14a561):

max_parallel_workers_per_gather    Time(seconds)
                              0       274s, 403s
                              1       419s, 419s
                              2       448s, 448s
                              6       137s, 419s


As you can see, I got wildly different results with 150000000 rows (even
between runs of the same experiment).
I don't think the environment is stable enough to tell whether there is
any regression.

What I can say is that there are no processes apart from Postgres
running on the system. Also, the environment is pretty constrained:
just 1GB of free hard drive space before the start of every run when I
have 150000000 rows, on top of the mere 32MB of shared_buffers and only
4GB of RAM.

I don't know much about Google Persistent Disk to be very honest.
Basically, I just provisioned one when I provisioned a GCP VM for testing on
FreeBSD, as Thomas had mentioned that FreeBSD UFS is a bad case for
parallel seq scan.

> It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> track_io_timing = on; for each value of max_parallel_workers.

As for EXPLAIN ANALYZE, running it on this system incurs a staggering
amount of timing overhead. This is the result of pg_test_timing on the
FreeBSD GCP VM:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 4329.80 ns
Histogram of timing durations:
  < us   % of total      count
     1      0.00000          0
     2      0.00000          0
     4      3.08896     356710
     8     95.97096   11082616
    16      0.37748      43591
    32      0.55502      64093
    64      0.00638        737
   128      0.00118        136
   256      0.00002          2

As a point of comparison, on my local Ubuntu workstation:

$ /usr/local/pgsql/bin/pg_test_timing -d 50
Testing timing overhead for 50 seconds.
Per loop time including overhead: 22.65 ns
Histogram of timing durations:
  < us   % of total      count
     1     97.73691 2157634382
     2      2.26246   49945854
     4      0.00039       8711
     8      0.00016       3492
    16      0.00008       1689
    32      0.00000         63
    64      0.00000          1

This is why I opted to simply use \timing on.

Regards,

Soumyadeep (VMware)

[1] https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA@mail.gmail.com



Re: Parallel Seq Scan vs kernel read ahead

От
Soumyadeep Chakraborty
Дата:
On Tue, Jul 21, 2020 at 9:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah, that is true but every time before the test the same amount of
> > data should be present in shared buffers (or OS cache) if any which
> > will help in getting consistent results.  However, it is fine to
> > reboot the machine as well if that is a convenient way.
>
> We really should have an extension (pg_prewarm?) that knows how to
> kick stuff out PostgreSQL's shared buffers and the page cache
> (POSIX_FADV_DONTNEED).
>
>
+1. Clearing the OS page cache on FreeBSD is non-trivial during testing.
You can't do this on FreeBSD:
sync; echo 3 > /proc/sys/vm/drop_caches

Also, it would be nice to evict only those pages from the OS page cache
that are Postgres pages instead of having to drop everything.
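
For contrast, the existing pg_prewarm extension only covers the loading
direction; a minimal sketch of using it on the test table t from this
thread (the 'prefetch' mode relies on posix_fadvise being available):

  CREATE EXTENSION IF NOT EXISTS pg_prewarm;
  -- 'prefetch' asks the kernel to pull the relation into the OS page
  -- cache; 'buffer' loads it into PostgreSQL's shared buffers.
  SELECT pg_prewarm('t', 'prefetch');
  SELECT pg_prewarm('t', 'buffer');
  -- There is currently no counterpart for evicting those pages again,
  -- which is the missing piece discussed above.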

Regards,
Soumyadeep (VMware)



Re: Parallel Seq Scan vs kernel read ahead

От
Amit Kapila
Дата:
On Wed, Jul 22, 2020 at 10:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah, that is true but every time before the test the same amount of
> > data should be present in shared buffers (or OS cache) if any which
> > will help in getting consistent results.  However, it is fine to
> > reboot the machine as well if that is a convenient way.
>
> We really should have an extension (pg_prewarm?) that knows how to
> kick stuff out PostgreSQL's shared buffers and the page cache
> (POSIX_FADV_DONTNEED).
>

+1.  Such an extension would be quite helpful for performance benchmarks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
Hi Soumyadeep,

Thanks for re-running the tests.

On Thu, 23 Jul 2020 at 06:01, Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
> On Tue, Jul 14, 2020 at 8:52 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> > track_io_timing = on; for each value of max_parallel_workers.
>
> As for running EXPLAIN ANALYZE, running that on this system incurs a
> non-trivial amount of overhead. The overhead is simply staggering.

I didn't really intend for that to be used to get an accurate overall
timing for the query. It was more to get an indication of whether the
reads are actually hitting the disk or not.

I mentioned to Kirk in [1] that his read speed might be a bit higher
than what his disk can actually cope with.  I'm not sure about the HDD
he mentions, but if it's a single HDD then reading at an average speed
of 3457 MB/sec seems quite a bit too fast. It seems more likely, in his
case, that those reads are mostly coming from the kernel's cache.
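
For what it's worth, the kind of check I had in mind looks roughly like
this, reusing the test table t from earlier in the thread:

  SET track_io_timing = on;
  EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t;
  -- "Buffers: shared read=N" counts blocks that were not in shared
  -- buffers and had to be requested from the OS.  If the accompanying
  -- "I/O Timings: read=..." figure is implausibly small for that volume,
  -- the blocks most likely came from the kernel's page cache rather than
  -- the physical disk.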

David

[1] https://www.postgresql.org/message-id/CAApHDvoDzAzXEp+Ay2CfT3U=ZcD5NLD7K9_Y936bnHjzs5jkHw@mail.gmail.com



Re: Parallel Seq Scan vs kernel read ahead

От
David Rowley
Дата:
On Wed, 15 Jul 2020 at 12:24, David Rowley <dgrowleyml@gmail.com> wrote:
> If we've not seen any performance regressions within 1 week, then I
> propose that we (pending final review) push this to allow wider
> testing. It seems we're early enough in the PG14 cycle that there's a
> large window of time for us to do something about any reported
> performance regressions that come in.

I did that final review, which resulted in quite a few cosmetic changes.

Functionality-wise, it's basically that of the v2 patch, with
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE set to 8192.

I mentioned that we might want to revisit giving users some influence
on the chunk size, but we'll only do so once we see some conclusive
evidence that it's worthwhile.

David