RE: Parallel Seq Scan vs kernel read ahead

From: k.jamison@fujitsu.com
Subject: RE: Parallel Seq Scan vs kernel read ahead
Msg-id: OSBPR01MB23411BBC0C950F406E774EE7EF780@OSBPR01MB2341.jpnprd01.prod.outlook.com
In reply to: Re: Parallel Seq Scan vs kernel read ahead  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers

On Friday, July 17, 2020 6:18 PM (GMT+9), Amit Kapila wrote:

> On Fri, Jul 17, 2020 at 11:35 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> wrote:
> >
> > On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote:
> >
> > >On Wed, 15 Jul 2020 at 14:51, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >>
> > >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley <dgrowleyml@gmail.com>
> wrote:
> > >>> If we've not seen any performance regressions within 1 week, then
> > >>> I propose that we (pending final review) push this to allow wider
> > >>> testing.
> > >>
> > >> I think Soumyadeep has reported a regression case [1] with the
> > >> earlier version of the patch.  I am not sure if we have verified
> > >> that the situation improves with the latest version of the patch.
> > >> I request Soumyadeep to please try once with the latest patch.
> > >...
> > >Yeah, it would be good to see some more data points on that test.
> > >Jumping from 2 up to 6 workers just leaves us to guess where the
> > >performance started to become bad. It would be good to know if the
> > >regression is repeatable or if it was affected by some other process.
> > >...
> > >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> > >track_io_timing = on; for each value of max_parallel_workers.
> >
> > Hi,
> >
> > If I'm following the thread correctly, we may have gains on this patch
> > of Thomas and David, but we need to test its effects on different
> > filesystems. It's also been clarified by David through benchmark tests
> > that synchronize_seqscans is not affected as long as the set cap per
> > chunk size of parallel scan is at 8192.
> >
> > I also agree that having a control on this through GUC can be
> > beneficial for users, however, that can be discussed in another thread
> > or development in the future.
> >
> > David Rowley wrote:
> > >I'd like to propose that if anyone wants to do further testing on
> > >other operating systems with SSDs or HDDs then it would be good if
> > >that could be done within a 1 week from this email. There are various
> > >benchmarking ideas on this thread for inspiration.
> >
> > I'd like to join on testing it, this one using HDD, and at the bottom
> > are the results. Due to my machine limitations, I only tested
> > 0~6 workers, that even if I increase max_parallel_workers_per_gather
> > more than that, the query planner would still cap the workers at 6.
> > I also set the track_io_timing to on as per David's recommendation.
> >
> > Tested on:
> > XFS filesystem, HDD virtual machine
> > RHEL4, 64-bit,
> > 4 CPUs, Intel Core Processor (Haswell, IBRS) PostgreSQL 14devel on
> > x86_64-pc-linux-gnu
> >
> >
> > ----Test Case (Soumyadeep's) [1]
> >
> > shared_buffers = 32MB (to use OS page cache)
> >
> > create table t_heap as select generate_series(1, 100000000) i;   --about 3.4GB size
> >
> > SET track_io_timing = on;
> >
> > \timing
> >
> > set max_parallel_workers_per_gather = 0;      --0 to 6
> >
> > SELECT count(*) from t_heap;
> > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
> >
> > [Summary]
> > I used the same query from the thread. However, the sql query
> > execution time and query planner execution time results between the
> > master and patched do not vary much.
> > OTOH, in terms of I/O stats, I observed similar regression in both
> > master and patched as we increase max_parallel_workers_per_gather.
> >
> > It could also be possible that each benchmark setting for
> > max_parallel_workers_per_gather is affected by previous result . IOW,
> > later benchmark runs benefit from the data cached by previous runs on OS level.
> >
> 
> Yeah, I think to some extent that is visible in results because, after patch, at 0
> workers, the execution time is reduced significantly whereas there is not much
> difference at other worker counts.  I think for non-parallel case (0 workers),
> there shouldn't be any difference.
> Also, I am not sure if there is any reason why after patch the number of shared hits
> is improved, probably due to caching effects?
> 
> > Any advice?
> 
> I think recreating the database and restarting the server after each run might help
> in getting consistent results.  Also, you might want to take median of three runs.

Thank you for the advice. I repeated the test as per your advice and took the average
of 3 runs per planned number of workers.
It still shows similar performance results between Master and Patch V2.
I wonder why there's no difference though.

The test on my machine is roughly like this:

createdb test
psql -d test
create table t_heap as select generate_series(1, 100000000) i;
\q

pg_ctl restart
psql -d test
SET track_io_timing = on;
SET max_parallel_workers_per_gather = 0;
SHOW max_parallel_workers_per_gather;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
\timing
SELECT count(*) from t_heap;

drop table t_heap;
\q
dropdb test
pg_ctl restart
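
For repeatability, the whole loop can be scripted. A rough sketch of what I run,
assuming PGDATA is exported, pg_ctl/psql/createdb are on PATH, and the database
name "test" is arbitrary:

#!/bin/sh
# Take 3 samples for each worker setting, recreating the database and
# restarting the server between runs as suggested upthread.
DB=test
for workers in 0 1 2 3 4 5 6; do
  for run in 1 2 3; do
    createdb "$DB"
    psql -d "$DB" -c "create table t_heap as select generate_series(1, 100000000) i;"
    pg_ctl restart -w
    psql -d "$DB" <<EOF
SET track_io_timing = on;
SET max_parallel_workers_per_gather = $workers;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
\timing
SELECT count(*) from t_heap;
EOF
    dropdb "$DB"
    pg_ctl restart -w
  done
done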

Below are the results. Again, there is almost no discernible difference between the master
and patch. Also, the results when max_parallel_workers_per_gather is more than 4 could be
inaccurate because my machine only has 4 CPUs. Even so, the query planner capped the
workers at 6.
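
If more workers ever need to be tested on a table of this size, the planner's cap can be
overridden; a sketch using standard settings (the value 8 is just an example, and the
workers actually launched are still limited by max_worker_processes):

-- Plan up to 8 workers for this table regardless of its size-based heuristic.
ALTER TABLE t_heap SET (parallel_workers = 8);
-- Raise the session limits so the plan is not capped lower.
SET max_parallel_workers = 8;
SET max_parallel_workers_per_gather = 8;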

Query Planner I/O Timings (track_io_timing = on), in ms:
| Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
|--------|-------------------|------------------|--------------------|-------------------|
| 0      | 1,130.777         | 1,250.821        | 1,698.051          | 1,733.439         |
| 1      | 1,603.016         | 1,660.767        | 2,312.248          | 2,291.661         |
| 2      | 2,036.269         | 2,107.066        | 2,698.216          | 2,796.893         |
| 3      | 2,298.811         | 2,307.254        | 5,695.991          | 5,894.183         |
| 4      | 2,098.642         | 2,135.960        | 23,837.088         | 26,537.158        |
| 5      | 1,956.536         | 1,997.464        | 45,891.851         | 48,049.338        |
| 6      | 2,201.816         | 2,219.001        | 61,937.828         | 67,809.486        |
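
(For reference, the read/write figures above are taken from the "I/O Timings" line that
track_io_timing adds to the EXPLAIN (ANALYZE, BUFFERS) output; for the 0-worker Master run
it looks roughly like this, with the exact layout depending on the server version:)

  Buffers: shared read=442478 dirtied=442478 written=442446
  I/O Timings: read=1130.777 write=1698.051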

Query Planner Execution Time (ms):
| Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
|--------|-----------------------|----------------------|
| 0      | 40,454.252            | 40,521.578           |
| 1      | 21,332.067            | 21,205.068           |
| 2      | 14,266.756            | 14,385.539           |
| 3      | 11,597.936            | 11,722.055           |
| 4      | 12,937.468            | 13,439.247           |
| 5      | 14,383.083            | 14,782.866           |
| 6      | 14,671.336            | 15,507.581           |

Based on the results above, the I/O latency increases as the number of workers increases.
Despite that, the query planner execution time stays roughly the same (11~14 s) once 2 or
more workers are used, and Master and Patch V2 show the same results.

As for buffers, Master and Patch show the same figures for each worker count.
| Worker | Buffers                                          | 
|--------|--------------------------------------------------| 
| 0      | shared read=442478 dirtied=442478 written=442446 | 
| 1      | shared read=442478 dirtied=442478 written=442414 | 
| 2      | shared read=442478 dirtied=442478 written=442382 | 
| 3      | shared read=442478 dirtied=442478 written=442350 | 
| 4      | shared read=442478 dirtied=442478 written=442318 | 
| 5      | shared read=442478 dirtied=442478 written=442286 | 
| 6      | shared read=442478 dirtied=442478 written=442254 |


SQL Query Execution Time (ms):
| Worker | SQL (Master) | SQL (Patch) |
|--------|--------------|-------------|
| 0      | 10,418.606   | 10,377.377  |
| 1      | 5,427.460    | 5,402.727   |
| 2      | 3,662.998    | 3,650.277   |
| 3      | 2,718.837    | 2,692.871   |
| 4      | 2,759.802    | 2,693.370   |
| 5      | 2,761.834    | 2,682.590   |
| 6      | 2,711.434    | 2,726.332   |

The SQL query execution times are faster because they clearly benefit from the
EXPLAIN (ANALYZE, BUFFERS) run that precedes them. But again, Master and Patched show
almost the same results. Nonetheless, the execution time is nearly constant once
max_parallel_workers_per_gather is 2 (the default) or above.
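
One thing I could still try is to eliminate the OS page cache effect between the two
statements; a sketch, assuming a Linux host and root access (the sync/drop_caches step
is standard Linux behaviour, not anything PostgreSQL-specific):

# Stop the server, flush dirty pages, drop the kernel page cache, restart.
pg_ctl stop -w
sync
echo 3 > /proc/sys/vm/drop_caches     # requires root
pg_ctl start -w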

I am definitely missing something. Perhaps I simply do not understand why there is no
I/O difference between Master and Patched (V2). Or has it already been improved
even without this patch?

Kind regards,
Kirk Jamison
