Thread: perf tuning for 28 cores and 252GB RAM


perf tuning for 28 cores and 252GB RAM

From
Michael Curry
Date:
I am using a Postgres instance in an HPC cluster, where they have generously given me an entire node. This means I have 28 cores and 252GB RAM. I have to assume that the very conservative default settings for things like buffers and max working memory are too small here.

We have about 20 billion rows in a single large table. The database is not intended to run an application but rather to allow a few individuals to do data analysis, so we can guarantee the number of concurrent queries will be small, and that nothing else will need to use the server. Creating multiple different indices on a few subsets of the columns will be needed to support the kinds of queries we want.

What settings should be changed to maximize performance?

--
Michael J. Curry

Re: perf tuning for 28 cores and 252GB RAM

From
Jeff Janes
Date:
On Mon, Jun 17, 2019 at 4:51 PM Michael Curry <curry@cs.umd.edu> wrote:
I am using a Postgres instance in an HPC cluster, where they have generously given me an entire node. This means I have 28 cores and 252GB RAM. I have to assume that the very conservative default settings for things like buffers and max working memory are too small here.

We have about 20 billion rows in a single large table.

What is that in bytes?  Do you only have that one table?
 
The database is not intended to run an application but rather to allow a few individuals to do data analysis, so we can guarantee the number of concurrent queries will be small, and that nothing else will need to use the server. Creating multiple different indices on a few subsets of the columns will be needed to support the kinds of queries we want.

What settings should be changed to maximize performance?

With 28 cores for only a few users, parallelization will probably be important.  That feature is fairly new to PostgreSQL and rapidly improving from version to version, so you will want to use the latest version you can (v11).  And then increase the values for max_worker_processes, max_parallel_maintenance_workers, max_parallel_workers_per_gather, and max_parallel_workers.  With the potential for so many parallel workers running at once, you wouldn't want to go overboard on work_mem; maybe 2GB.  If you don't think all allowed users will be running large queries at the same time (because they are mostly thinking about what query to run, or thinking about the results of the last one they ran, rather than actually running queries), then maybe higher than that.
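For concreteness, a starting sketch of those parallel settings on a 28-core box might look like the following (values are illustrative, not a recommendation from this thread; tune against your own workload):

```ini
# postgresql.conf -- illustrative starting point, assuming v11 and few concurrent users
max_worker_processes = 28              # overall ceiling for background workers
max_parallel_workers = 24              # of those, how many may serve parallel queries
max_parallel_workers_per_gather = 8    # parallelism available to a single Gather node
max_parallel_maintenance_workers = 8   # parallel index builds (v11+)
work_mem = 2GB                         # per sort/hash node, per process -- multiplies fast
```

Keep in mind that work_mem applies per sort or hash operation, including one per parallel worker, which is why the warning above about going overboard matters.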

If your entire database can comfortably fit in RAM, I would make shared_buffers large enough to hold the entire database.  If not, I would set the value small (say, 8GB) and let the OS do the heavy lifting of deciding what to keep in cache.  If you go with the first option, you probably want to use pg_prewarm after each restart to get the data into cache as fast as you can, rather than let it get loaded in naturally as you run queries.  Also, you would probably want to set random_page_cost and seq_page_cost quite low, like maybe 0.1 and 0.05.
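As a sketch of that first option, prewarming and lowering the page costs could look like this (the table and index names are placeholders):

```sql
-- one-time setup
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- after each restart: pull a (hypothetical) table and index into cache
SELECT pg_prewarm('big_table');
SELECT pg_prewarm('big_table_some_col_idx');

-- tell the planner that reads are effectively free once everything is cached
ALTER SYSTEM SET random_page_cost = 0.1;
ALTER SYSTEM SET seq_page_cost = 0.05;
SELECT pg_reload_conf();
```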

You haven't described what kind of IO capacity and setup you have, knowing that could suggest other changes to make.  Also, seeing the results of `explain (analyze, buffers)`, especially with track_io_timing turned on, for some actual queries could provide good insight for what else might need changing.
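For example (the query and table are placeholders; track_io_timing can also be enabled globally in postgresql.conf):

```sql
SET track_io_timing = on;  -- requires superuser when set per-session

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM big_table WHERE some_col = 42;
```

The `buffers` output shows shared-buffer hits versus reads, and with track_io_timing on, the time actually spent waiting on I/O.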

Cheers,

Jeff

Re: perf tuning for 28 cores and 252GB RAM

From
Michael Lewis
Date:
If your entire database can comfortably fit in RAM, I would make shared_buffers large enough to hold the entire database.  If not, I would set the value small (say, 8GB) and let the OS do the heavy lifting of deciding what to keep in cache.  If you go with the first option, you probably want to use pg_prewarm after each restart to get the data into cache as fast as you can, rather than let it get loaded in naturally as you run queries;  Also, you would probably want to set random_page_cost and seq_page_cost quite low, like maybe 0.1 and 0.05.

In all deference to your status as a contributor, what are these recommendations based on? Would you share the rationale? I'd just like to better understand. For instance, I have never heard a recommendation to set random_page_cost and seq_page_cost below 1 before.

If the entire database were say 1 or 1.5 TBs and ram was on the order of 96 or 128 GBs, but some of the data is (almost) never accessed, would the recommendation still be the same to rely more on the OS caching? Do you target a particular cache hit rate as reported by Postgres stats?

Re: perf tuning for 28 cores and 252GB RAM

From
Merlin Moncure
Date:
On Mon, Jun 17, 2019 at 6:46 PM Jeff Janes <jeff.janes@gmail.com> wrote:
>
> On Mon, Jun 17, 2019 at 4:51 PM Michael Curry <curry@cs.umd.edu> wrote:
>>
>> I am using a Postgres instance in an HPC cluster, where they have generously given me an entire node. This means I have 28 cores and 252GB RAM. I have to assume that the very conservative default settings for things like buffers and max working memory are too small here.
>>
>> We have about 20 billion rows in a single large table.
>
> What is that in bytes?  Do you only have that one table?
>
>> The database is not intended to run an application but rather to allow a few individuals to do data analysis, so we can guarantee the number of concurrent queries will be small, and that nothing else will need to use the server. Creating multiple different indices on a few subsets of the columns will be needed to support the kinds of queries we want.
>>
>> What settings should be changed to maximize performance?
>
> With 28 cores for only a few users, parallelization will probably be important.  That feature is fairly new to PostgreSQL and rapidly improving from version to version, so you will want to use the last version you can (v11).  And then increase the values for max_worker_processes, max_parallel_maintenance_workers, max_parallel_workers_per_gather, and max_parallel_workers.  With the potential for so many parallel workers running at once, you wouldn't want to go overboard on work_mem, maybe 2GB.  If you don't think all allowed users will be running large queries at the same time (because they are mostly thinking what query to run, or thinking about the results of the last one they ran, rather than actually running queries), then maybe higher than that.
>
> If your entire database can comfortably fit in RAM, I would make shared_buffers large enough to hold the entire database.  If not, I would set the value small (say, 8GB) and let the OS do the heavy lifting of deciding what to keep in cache.  If you go with the first option, you probably want to use pg_prewarm after each restart to get the data into cache as fast as you can, rather than let it get loaded in naturally as you run queries.  Also, you would probably want to set random_page_cost and seq_page_cost quite low, like maybe 0.1 and 0.05.
>
> You haven't described what kind of IO capacity and setup you have, knowing that could suggest other changes to make.  Also, seeing the results of `explain (analyze, buffers)`, especially with track_io_timing turned on, for some actual queries could provide good insight for what else might need changing.

This is all fantastic advice.  If all the data fits in memory (or at least, all the data that is typically read) and the cache is warm, then your database becomes an in-memory database with respect to read operations, and all the I/O concerns and buffer-management overhead go away.

If your database does not fit in memory and your storage is fast, one influential setting to look at besides the above is effective_io_concurrency; it gets you faster (in some cases much faster) bitmap heap scans. Also make sure to set effective_cache_size high, reflecting the large amount of memory you have; this will influence query plan choice.
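A sketch of those two settings (illustrative values; effective_cache_size should roughly estimate shared_buffers plus the OS page cache):

```ini
# postgresql.conf -- illustrative
effective_io_concurrency = 200   # suits SSD/NVMe; spinning disks want low single digits
effective_cache_size = 200GB     # planner hint only; allocates no memory itself
```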

merlin



Re: perf tuning for 28 cores and 252GB RAM

From
Fabio Ugo Venchiarutti
Date:
On 18/06/2019 00:45, Jeff Janes wrote:
> On Mon, Jun 17, 2019 at 4:51 PM Michael Curry <curry@cs.umd.edu
> <mailto:curry@cs.umd.edu>> wrote:
>
>     I am using a Postgres instance in an HPC cluster, where they have
>     generously given me an entire node. This means I have 28 cores and
>     252GB RAM. I have to assume that the very conservative default
>     settings for things like buffers and max working memory are too
>     small here.
>
>     We have about 20 billion rows in a single large table.
>
>
> What is that in bytes?  Do you only have that one table?
>
>     The database is not intended to run an application but rather to
>     allow a few individuals to do data analysis, so we can guarantee the
>     number of concurrent queries will be small, and that nothing else
>     will need to use the server. Creating multiple different indices on
>     a few subsets of the columns will be needed to support the kinds of
>     queries we want.
>
>     What settings should be changed to maximize performance?
>
>
> With 28 cores for only a few users, parallelization will probably be
> important.  That feature is fairly new to PostgreSQL and rapidly
> improving from version to version, so you will want to use the last
> version you can (v11).  And then increase the values for
> max_worker_processes, max_parallel_maintenance_workers,
> max_parallel_workers_per_gather, and max_parallel_workers.  With the
> potential for so many parallel workers running at once, you wouldn't
> want to go overboard on work_mem, maybe 2GB.  If you don't think all
> allowed users will be running large queries at the same time (because
> they are mostly thinking what query to run, or thinking about the
> results of the last one they ran, rather than actually running queries),
> then maybe higher than that.
>
> If your entire database can comfortably fit in RAM, I would make
> shared_buffers large enough to hold the entire database.  If not, I
> would set the value small (say, 8GB) and let the OS do the heavy lifting
> of deciding what to keep in cache.


Does the backend mmap() data files when that's possible?


I've heard the "use the page cache" suggestion before, from users and
hackers alike, but I never quite heard a solid argument dismissing
potential overhead-related ill effects of the seek() & read() syscalls
if they're needed, especially on many random page fetches.


Given that shmem-based shared_buffers are bound to be mapped into the
backend's address space anyway, why isn't that considered always
preferable/cheaper?



I'm aware that there are other benefits in counting on the page cache
(eg: staying hot in the face of a backend restart), however I'm
considering performance in steady state here.



TIA



> If you go with the first option, you
> probably want to use pg_prewarm after each restart to get the data into
> cache as fast as you can, rather than let it get loaded in naturally as
> you run queries;  Also, you would probably want to set random_page_cost
> and seq_page_cost quite low, like maybe 0.1 and 0.05.
>
> You haven't described what kind of IO capacity and setup you have,
> knowing that could suggest other changes to make.  Also, seeing the
> results of `explain (analyze, buffers)`, especially with track_io_timing
> turned on, for some actual queries could provide good insight for what
> else might need changing.
>
> Cheers,
>
> Jeff





--
Regards

Fabio Ugo Venchiarutti
OSPCFC Network Engineering Dpt.
Ocado Technology


Re: perf tuning for 28 cores and 252GB RAM

From
Andres Freund
Date:
Hi,

On 2019-06-17 19:45:41 -0400, Jeff Janes wrote:
> If not, I would set the value small (say, 8GB) and let the OS do the
> heavy lifting of deciding what to keep in cache.

FWIW, in my opinion this is not a good idea in most cases. E.g. Linux's
pagecache doesn't scale particularly gracefully to large amounts of
data, and its decisions about when to evict data aren't really better than
Postgres's. And there's a significant potential for additional
unnecessary disk writes (because the kernel will flush dirty pagecache
buffers, and then we'll just re-issue many of those writes again).

It's a bit hard to be specific without knowing the workload, but my
guidance would be that if the data has some expected form of locality
(e.g. index lookups etc, rather than just sequentially scanning the
whole database) then sizing s_b for at least the amount of data likely
to be repeatedly accessed can be quite beneficial.

If increasing s_b can achieve that most writes can be issued by
checkpointer rather than backends and bgwriter, the generated IO pattern
is *far* superior since 9.6 (as checkpointer writes are sorted, whereas
bgwriter/backend writes aren't to a meaningful degree).

The one big exception is if the workload frequently needs to
drop/truncate non-temporary tables. There we currently need to linearly
search shared_buffers, which, although the constants are fairly small,
obviously means that drops/truncations get noticeably slower with a larger
shared_buffers.

- Andres



Re: perf tuning for 28 cores and 252GB RAM

From
Alvaro Herrera
Date:
Hello

On 2019-Jun-18, Andres Freund wrote:

> On 2019-06-17 19:45:41 -0400, Jeff Janes wrote:
> > If not, I would set the value small (say, 8GB) and let the OS do the
> > heavy lifting of deciding what to keep in cache.
> 
> FWIW, in my opinion this is not a good idea in most cases. E.g. linux's
> pagecache doesn't scale particularly gracefully to large amounts of
> data, and it's decisions when to evict data aren't really better than
> postgres'. And there's a significant potential for additional
> unnecessary disk writes (because the kernel will flush dirty pagecache
> buffers, and then we'll just re-issue many of those writes again).

One additional tuning point in this area is that you need to tweak the
Linux flusher so that it starts writing earlier than what it does by
default (by setting dirty_background_bytes to some reasonable value --
as I recall it normally runs on the dirty_background_ratio setting,
which means it scales up when you add RAM, which I'm not sure is really
sensible since you really care about the disk I/O write rate rather than
anything about the memory).  If you don't do that, the kernel can
accumulate lots of memory to write before starting to write it, which is
troublesome once it does.
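On Linux that tweak might look like the following (values are illustrative only; size them to what your storage can absorb):

```ini
# /etc/sysctl.d/99-postgres.conf -- apply with `sysctl --system`
vm.dirty_background_bytes = 268435456   # start background writeback at 256MB
vm.dirty_bytes = 1073741824             # hard-throttle writers at 1GB of dirty pages
```

Setting dirty_background_bytes (rather than dirty_background_ratio) keeps the writeback threshold fixed regardless of how much RAM the box has, which is the point made above.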

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: perf tuning for 28 cores and 252GB RAM

From
Andres Freund
Date:
Hi,

On 2019-06-18 12:33:30 -0400, Alvaro Herrera wrote:
> On 2019-Jun-18, Andres Freund wrote:
> 
> > On 2019-06-17 19:45:41 -0400, Jeff Janes wrote:
> > > If not, I would set the value small (say, 8GB) and let the OS do the
> > > heavy lifting of deciding what to keep in cache.
> > 
> > FWIW, in my opinion this is not a good idea in most cases. E.g. linux's
> > pagecache doesn't scale particularly gracefully to large amounts of
> > data, and it's decisions when to evict data aren't really better than
> > postgres'. And there's a significant potential for additional
> > unnecessary disk writes (because the kernel will flush dirty pagecache
> > buffers, and then we'll just re-issue many of those writes again).
> 
> One additional tuning point in this area is that you need to tweak the
> Linux flusher so that it starts writing earlier than what it does by
> default (by setting dirty_background_bytes to some reasonable value --
> as I recall it normally runs on the dirty_background_ratio setting,
> which means it scales up when you add RAM, which I'm not sure is really
> sensible since you really care about the disk I/O write rate rather than
> anything about the memory).  If you don't do that, the kernel can
> accumulate lots of memory to write before starting to write it, which is
> troublesome once it does.

I think that's less needed these days - by default postgres triggers the
backend to flush data when writes occur from bgwriter
(bgwriter_flush_after) or checkpointer (checkpointer_flush_after). And
IMO one should turn on the flushing by backends in most cases too
(e.g. backend_flush_after=2MB), unless it's a really latency/jitter
insensitive application, or storage is *REALLY* slow.

There are a few things we don't flush that we maybe should (file-extension
writes, SLRUs), so it can still be sensible to tune
dirty_background_bytes. But that has the disadvantage of also affecting
temp file writes etc., which is usually not wanted.
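For reference, the flush-related GUCs mentioned here, with their Linux defaults (the backend_flush_after value shown is the suggestion above, not the default):

```ini
# postgresql.conf -- Linux defaults shown, except where noted
bgwriter_flush_after = 512kB     # default
checkpoint_flush_after = 256kB   # default
backend_flush_after = 2MB        # default is 0 (off); suggested above
```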

Greetings,

Andres Freund



Re: perf tuning for 28 cores and 252GB RAM

From
Michael Curry
Date:
Thanks to everyone for all the detailed responses so far.

The dataset we're working with, in its original form, is roughly a terabyte; I would guess in the database it will come out to about 2TB-ish. It will definitely not fit in RAM, unfortunately, and our access patterns will be quite random, at least at first (maybe as the analysis continues it will become clear which records are irrelevant, but not yet).

I would love to take advantage of parallelism, but because this is on an HPC cluster I don't have any personal control over the database version (9.6.6) or anything involving OS configuration. I will take a look at the current values of those configuration variables however; maybe we can get at least some speedup.

It seems I can be confident that shared_buffers and work_mem, along with effective_io_concurrency and effective_cache_size, ought to be much larger than their default values. How much larger shared_buffers should be depends on whether it is better to let Linux or Postgres handle the cache. I will try to get those changes made and then start benchmarking some simple queries.
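One caveat: some of the parallel GUCs mentioned upthread only exist in later releases (max_parallel_workers arrived in 10, max_parallel_maintenance_workers in 11), so on 9.6 the relevant knobs are max_worker_processes and max_parallel_workers_per_gather. Checking the current values could look like this:

```sql
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'effective_io_concurrency', 'max_worker_processes',
               'max_parallel_workers_per_gather');
```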

On Tue, Jun 18, 2019 at 12:48 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-06-18 12:33:30 -0400, Alvaro Herrera wrote:
> On 2019-Jun-18, Andres Freund wrote:
>
> > On 2019-06-17 19:45:41 -0400, Jeff Janes wrote:
> > > If not, I would set the value small (say, 8GB) and let the OS do the
> > > heavy lifting of deciding what to keep in cache.
> >
> > FWIW, in my opinion this is not a good idea in most cases. E.g. linux's
> > pagecache doesn't scale particularly gracefully to large amounts of
> > data, and it's decisions when to evict data aren't really better than
> > postgres'. And there's a significant potential for additional
> > unnecessary disk writes (because the kernel will flush dirty pagecache
> > buffers, and then we'll just re-issue many of those writes again).
>
> One additional tuning point in this area is that you need to tweak the
> Linux flusher so that it starts writing earlier than what it does by
> default (by setting dirty_background_bytes to some reasonable value --
> as I recall it normally runs on the dirty_background_ratio setting,
> which means it scales up when you add RAM, which I'm not sure is really
> sensible since you really care about the disk I/O write rate rather than
> anything about the memory).  If you don't do that, the kernel can
> accumulate lots of memory to write before starting to write it, which is
> troublesome once it does.

I think that's less needed these days - by default postgres triggers the
backend to flush data when writes occur from bgwriter
(bgwriter_flush_after) or checkpointer (checkpointer_flush_after). And
IMO one should turn on the flushing by backends in most cases too
(e.g. backend_flush_after=2MB), unless it's a really latency/jitter
insensitive application, or storage is *REALLY* slow.

There's a few things we don't flush that we maybe should (file extension
writes, SLRUs), so it can still be sensible to tune
dirty_background_bytes. But that has the disadvantage of also affecting
temp file writes etc, which is usually not wanted.

Greetings,

Andres Freund


--
Michael J. Curry

Re: perf tuning for 28 cores and 252GB RAM

From
Andres Freund
Date:
Hi,

On 2019-06-18 17:13:20 +0100, Fabio Ugo Venchiarutti wrote:
> Does the backend mmap() data files when that's possible?

No. That doesn't allow us to control when data is written back to disk,
which is crucial for durability/consistency.


> I've heard the "use the page cache" suggestion before, from users and
> hackers alike, but I never quite heard a solid argument dismissing potential
> overhead-related ill effects of the seek() & read() syscalls if they're
> needed, especially on many random page fetches.

We don't issue a seek() for reads anymore in 12; instead we do a pread()
(though it's not a particularly meaningful performance improvement). The
read obviously has a cost, especially with syscalls getting more and more
expensive due to the mitigations for Intel CPU vulnerabilities.

I'd say that a bigger factor than the overhead of the read itself is
that for many workloads we'll e.g. incur additional writes when s_b is
smaller, that the kernel has less information about when to discard
data, that the kernel pagecache has some scalability issues (partially
due to its generality), and that there is double buffering.


> Given that shmem-based shared_buffers are bound to be mapped into the
> backend's address space anyway, why isn't that considered always
> preferable/cheaper?

See e.g. my point in my previous email in this thread about
drop/truncate.


> I'm aware that there are other benefits in counting on the page cache (eg:
> staying hot in the face of a backend restart), however I'm considering
> performance in steady state here.

There's also the issue that using a large shared_buffers setting means
that each process's page table gets bigger, unless you configure
huge_pages. Which one definitely should, but that's an additional
configuration step that requires superuser access on most operating
systems.
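For completeness, a minimal sketch of that configuration step on Linux (the page count is illustrative and must be sized to your shared_buffers):

```ini
# /etc/sysctl.d/99-hugepages.conf -- reserve 2MB huge pages (needs root)
vm.nr_hugepages = 4500           # ~9GB of huge pages, e.g. for an 8GB shared_buffers

# postgresql.conf
huge_pages = on                  # default is 'try'; 'on' fails startup if unavailable
```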

Greetings,

Andres Freund